CN102622443A - Customized screening system and method for microblog - Google Patents

Info

Publication number
CN102622443A
CN102622443A CN2012100656789A CN201210065678A CN102622443A CN 102622443 A CN102622443 A CN 102622443A CN 2012100656789 A CN2012100656789 A CN 2012100656789A CN 201210065678 A CN201210065678 A CN 201210065678A CN 102622443 A CN102622443 A CN 102622443A
Authority
CN
China
Prior art keywords
module
data
microblogging
index
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012100656789A
Other languages
Chinese (zh)
Inventor
闫丹凤
田瑞
刘佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a customized screening system and method for microblogs. The system comprises a background module and an interactive module. The background module acquires data, analyzes the data, stores it locally, builds an index, and provides a search function; the interactive module exchanges information with the background module and provides a web interface for interacting with it. The system addresses the problem of information overload: it screens the categories of information a user cares about out of a huge volume of microblog data, filters out large amounts of information the user does not care about, and stores the data locally so that the user can use it over the long term. A verification mechanism enhances the security of the system. The working logic of the whole system is clear and smooth, coupling between the modules is reduced, and each module is composed of several submodules, which makes it easier to extend each module's functions.

Description

A customized screening system and method for microblogs
Technical Field
The present invention relates to a screening system and method, in particular to a customized screening system and method for microblogs, and belongs to the field of network information technology.
Background
A microblog is a platform for sharing, spreading, and obtaining information based on relationships between users. Users can set up personal communities through the web, WAP, and various clients, post updates of about 140 characters, and share them instantly. The China Internet Network Information Center (CNNIC) recently released the 28th Statistical Report on Internet Development in China. The report shows that in the first half of 2011 the number of microblog users in China grew from 63.31 million to 195 million, an increase of about two times. Such a huge user base also brings a huge volume of information to microblogs.
The instant communication features of microblog sites are now very powerful. Users can post directly through QQ and MSN, and even where there is no network connection, anyone with a mobile phone can update their content immediately, even from the scene of an incident. For major accidents or events that attract worldwide attention, if a microblogger is on the scene and posts by whatever means are available, the timeliness, on-the-spot presence, and speed of microblogs can surpass those of all other media.
Although microblogs can update information quickly and in real time, frequent updates also bring a large amount of useless information. Combined with the huge microblog user base and the many quick and easy ways of posting, this has led to information overload, which makes effective information inconvenient to use.
Summary of the Invention
The object of the invention is to address the shortcoming of the prior art that information overload makes effective information difficult to use, by providing a customized screening system for microblogs that can screen effective microblog data out of a large volume of microblog information.
The technical solution by which the invention solves the above technical problem is as follows: a customized screening system for microblogs comprises a background module and an interactive module. The background module is used to acquire data, analyze data, store data locally, build an index, and provide a search function.
The interactive module exchanges information with the background module and provides a web interface for interacting with the background module.
The background module comprises an acquisition module, an analysis module, an index module, and a retrieval module, which exchange information in that order. The acquisition module collects raw microblog data.
The analysis module extracts, deduplicates, and filters the data sent by the acquisition module to obtain valid data, and classifies and stores the valid data. The filtering covers spam, advertisements, and pornographic or reactionary content.
The index module performs Chinese and English word segmentation on the data sent by the analysis module, builds an inverted index and an incremental index from the segmentation results, and periodically deletes index entries according to the microblog status file.
The retrieval module receives the search keyword sent by the interactive module, performs error correction, synonym conversion, word segmentation, and optimization on it, screens and sorts the search results, and returns the sorted results to the interactive module.
The beneficial effects of the invention are as follows. The system is a solution to information overload: it screens the categories of information a user cares about out of a huge volume of microblog data, filters out large amounts of information the user does not care about, and stores the data locally so that the user can use it over the long term. A verification mechanism enhances the security of the system itself. The working logic of the whole system is clear and smooth, coupling between the modules is reduced, and each module is internally composed of several submodules, which makes it easier to extend each module's functions.
On the basis of the above technical solution, the invention can be further improved as follows.
Further, the retrieval module comprises a Query (search keyword) processing module and a Query optimization module. The Query processing module receives the Query sent by the interactive module, processes it, and sends the processed Query to the Query optimization module.
The Query optimization module performs omission and classification on the Query sent by the Query processing module, sends the Query and its class to the index module, and receives the results returned by the index module.
The Query optimization module comprises a Query omission module and a Query classification module. The Query omission module receives the data sent by the Query processing module, performs regular-expression matching on the data, and omits the unmatched Query. The Query classification module classifies the data from the Query omission module by topic and sends the classified data to the index module.
The Query omission module processes the incoming data with mined rules to find unimportant segmented words, and builds regular-expression rules against which subsequently entered data is matched.
Further, the interactive module comprises a permission control module, a query module, a screening module, a stored-data management module, and a special management module. The permission control module controls the different operation permissions that different users have on the system.
The query module lets users view microblog information through ranking queries, tag search, and advanced search.
The screening module screens data, adds user-defined topics to it, and stores it in the database.
The stored-data management module displays the data that the screening module has stored in the database.
The special management module manages celebrity and organization names, celebrity and organization classes, and URLs.
Further, the acquisition module comprises a web crawler module and a microblog API (application programming interface) module. The web crawler module crawls the specified URLs, sends requests to the selected URLs to obtain the site's original HTML (HyperText Markup Language) pages, and sends them to the analysis module. The microblog API module uses the microblog APIs provided by existing microblog platforms to obtain data in JSON, a lightweight data interchange format, and sends it to the analysis module.
The analysis module comprises a data extraction module, a data filter module, a text classification module, and a data storage module. The data extraction module receives the HTML pages collected by the web crawler module in the acquisition module, formats them as JSON data, and sends the JSON-formatted data to the data filter module. To do so, the data extraction module converts the original HTML pages obtained by the web crawler module into standard XML (Extensible Markup Language), locates the data nodes, adds the corresponding tags to the data, and maps it to JSON-formatted data. The data filter module receives the JSON data output by the microblog API module in the acquisition module and the JSON data sent by the data extraction module, deduplicates and filters it to obtain valid data, and sends the valid data to the text classification module and the data storage module. The text classification module classifies the valid data sent by the data filter module and sends the classification results to the data storage module. The data storage module writes the data and classification results sent by the data filter module and the text classification module to files, stores the file data separately, and at the same time extracts the attribute information of the valid data and writes it to a file.
The data storage module comprises a database and text files. The database stores the complete data and sends data to the interactive module on user instruction. The text files store the id, content, and class of each data item, and data is sent to the interactive module when called by the index module.
The index module comprises a text segmentation module and an index building module. The text segmentation module uses the Paoding segmenter together with a dictionary to segment the file content stored in the data storage module, producing the raw data for building the index. The index building module builds an inverted index and an incremental index from the data sent by the text segmentation module to obtain the index data.
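The inverted index with incremental updates described for the index module can be illustrated with a minimal sketch. The class and method names below are hypothetical and not taken from the patent, and an in-memory dictionary stands in for whatever index storage the system actually uses.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index mapping each term to the ids of microblogs containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of microblog ids
        self.doc_terms = {}               # microblog id -> set of its terms

    def add(self, doc_id, terms):
        """Incrementally index one microblog without rebuilding the whole index."""
        self.remove(doc_id)  # re-adding an id replaces its previous entry
        self.doc_terms[doc_id] = set(terms)
        for term in self.doc_terms[doc_id]:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Delete a microblog from the index, e.g. when its status marks it as removed."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, terms):
        """Return the ids of microblogs that contain every query term."""
        sets = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()
```

In this sketch, `add` plays the role of the incremental index and `remove` the role of the periodic deletion driven by the microblog status file; the terms passed in are assumed to come from the text segmentation module.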
The microblog API module obtains, through the API, the microblog messages of the celebrities and organizations being followed. This includes obtaining the microblog messages of the celebrities and organizations listed in a specified table. To update the comments on microblogs, the module takes from the microblog table the primary IDs of microblogs posted within the 24 hours before the current system time, uses the microblog API to fetch again each microblog's comment count, forward count, and comment list, removes duplicate comments, and updates the microblog table and the comment table. To update follower counts, the module traverses the celebrity and organization names in the celebrity/organization file and uses the API to check whether each one is already followed. If it is, the follower count is updated; otherwise the module checks whether the follower count has reached the follow threshold and, if so, uses the API to follow the account and writes its information into the celebrity/organization table.
The data extraction module converts the original HTML pages obtained by the web crawler into standard XML, locates the data nodes, adds the corresponding tags to the data, and maps it to JSON-formatted microblog data; each data update requires the results to be merged. The text classification module segments the microblog content using a segmentation method based on the forward maximum matching algorithm, matches the segmentation results against the classification dictionaries, and derives the class of a microblog according to the relevant rules. The text segmentation module uses the Paoding segmenter, which is highly extensible, supports Chinese word segmentation, and accepts user-defined dictionaries, to segment the microblog content written to files by the data storage module, producing the raw data for building the index.
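As an illustration of segmentation by forward maximum matching, the following is a minimal sketch. The function name and the tiny dictionary in the usage note are assumptions made for the example; the patent does not give the dictionary contents or the handling of unknown characters, which are emitted here as single-character tokens.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by repeatedly taking the longest dictionary word at the cursor."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: a character not covered by any word becomes its own token.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For example, with the hypothetical dictionary `{"平安", "平安保险", "保险", "社会"}`, the text `"平安保险社会"` is split into `["平安保险", "社会"]`, because the four-character word is tried before its two-character prefix.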
The search keyword optimization module collects the search keyword logs over a period of time, analyzes the characteristics of the search keywords, and identifies the redundant words and sentence patterns that cause too few results to be recalled. These are configured as rules; each search keyword is matched against the rules with regular expressions, and the words in the search keyword that are covered by the rules are omitted.
The query module provides three query modes: ranking query, tag search, and advanced search. The ranking query covers latest information, hot topics, popular microblogs, popular celebrities, and popular organizations. Latest information includes all microblog messages of the current day. Hot topics shows the top N topics with the most microblog messages. Popular microblogs shows the top N microblogs with the most forwards and the top N microblogs with the most comments. Popular celebrities shows the top N celebrities with the most followers, and popular organizations shows the top N organizations with the most followers, and recommends them to the user. Tag search covers microblog source and microblog class tags, and also provides functions for adding and deleting tags. Selecting a microblog source displays all microblog messages from that microblog site; selecting a microblog class tag displays all microblog messages belonging to that class. The conditions for advanced search include the time range in which the microblog message was posted, the microblog source, the microblog class tag, and the query keyword. The query module displays the results that satisfy the conditions; the displayed content includes the microblog content, author, posting time, source, and tags, together with the comments on the microblog.
In the screening module, the user can select microblog messages of interest, add a user-defined topic to the resulting data set, and store it in the database. The topic list of the data set is invisible to other users but visible to the administrator. The administrator can select microblog messages individually or in batches and delete their stored copies in the source data and the index data.
The special management module covers the management of celebrity and organization names, celebrity and organization classes, and URLs. It provides add and edit operations on celebrity and organization information, allows such information to be deleted individually or in batches, and allows matching records to be looked up by query conditions such as celebrity or organization name and class. It is also used to configure which celebrities' and organizations' microblog messages the microblog API module in the acquisition module needs to collect: new data is added by entering a celebrity or organization name and specifying the microblog source. If the addition succeeds and the name is not yet in the database, the microblog API module in the acquisition module collects the microblog messages posted by that celebrity or organization; if the name already exists in the database, its follower count is updated. Otherwise the addition fails, and the microblog API module in the acquisition module takes no action.
JSON (JavaScript Object Notation) is a lightweight data interchange format.
XML is a general-purpose data interchange format. Note: Extensible Markup Language (XML) is a markup language used to mark up electronic documents so that they are structured. It can be used to mark up data and define data types, and it is a source language that allows users to define their own markup languages.
Another object of the invention is to address the shortcoming of the prior art that information overload makes effective information difficult to use, by providing a customized screening method for microblogs that can screen effective microblog data out of a large volume of microblog information.
The technical solution by which the invention solves the above technical problem is as follows: a customized screening method for microblogs, comprising the following steps:
Step 1: the acquisition module collects data from websites;
Step 2: the analysis module filters the data from the acquisition module to obtain valid data;
Step 3: the index module builds an index over the data sent by the analysis module;
Step 4: the user enters a query request, and the relevant data in the analysis module is obtained through the retrieval module.
Further, step 1 obtains data by two methods:
Through the microblog API module of the acquisition module, the system obtains microblog data in the lightweight JSON data interchange format from the open API provided by the website;
Through the web crawler module of the acquisition module, the system crawls specific microblog sites and obtains the original semi-structured HTML pages.
Further, step 2 comprises the following steps:
Step 2.1: determine whether the data was sent by the web crawler module or by the microblog API module. If it is a semi-structured HTML page sent by the web crawler module, go to step 2.2; if it is data sent by the microblog API module, go to step 2.4;
Step 2.2: locate the data nodes. The original, non-standard HTML page is converted into XML; using the attributes of microblog messages, specific regions are located in the XML tree, the relevant microblog data is extracted from them, and the reference points, or anchors, are marked;
Step 2.3: use an XSL file to identify the anchors, specify the microblog attribute data to be retrieved from each anchor, and construct an output file in the corresponding JSON format;
Step 2.4: deduplicate the microblog data. The MD5 message-digest algorithm is used to compute a signature of the string "microblog data + posting time + author" (where "+" denotes string concatenation), and the signature is stored in the database. For each newly crawled microblog, its MD5 signature is computed and looked up in the database. If the signature already exists, the microblog is a duplicate and is discarded; if it does not exist, the signature is stored in the database;
Step 2.5: filter the microblog data. The words to be filtered are configured in a word list, and multi-pattern matching is run over the microblog data to check whether any word from the list appears in it. The classification dictionaries are stored in separate text files, each named after its class. The microblog text is segmented using a segmentation method based on the forward maximum matching algorithm;
Step 2.6: the text classification module in the analysis module classifies the filtered data and stores the results in the database and in text files: the complete microblog data goes into the database, and the data id, content, and class go into the text files.
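The deduplication of step 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions: the field names `content`, `time`, and `author` are hypothetical, and an in-memory set stands in for the signature table that the method keeps in the database.

```python
import hashlib


def signature(content, publish_time, author):
    """MD5 signature of the concatenated string: content + posting time + author."""
    raw = content + publish_time + author
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def deduplicate(posts, seen=None):
    """Return only the posts whose signature has not been seen before."""
    # Assumption: 'seen' is an in-memory stand-in for the database of signatures.
    seen = set() if seen is None else seen
    unique = []
    for post in posts:
        sig = signature(post["content"], post["time"], post["author"])
        if sig in seen:
            continue  # signature already stored: the post is a duplicate
        seen.add(sig)
        unique.append(post)
    return unique
```

Passing the same `seen` set across successive crawls keeps the signatures between runs, which mirrors the lookup against stored signatures for each newly crawled microblog.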
The classification rule is as follows. The basic rule is that a text belongs to the class with the highest score. The score of a class reflects the length, frequency, and weight of the words under that class, and is given by $C = \sum_{i=1}^{n} \mathrm{Len}(\mathrm{Word}_i) \cdot \mathrm{Weight}(\mathrm{Word}_i)$, where $n$ is the total number of words belonging to the class, $\mathrm{Word}_i$ is each individual word belonging to the class, $\mathrm{Len}(\mathrm{Word}_i)$ is the length of the word, and $\mathrm{Weight}(\mathrm{Word}_i)$ is the weight of the word. The weight rule is: 1.5 for length > 4, 1.25 for length = 4, 1 for length = 3, and 0.5 for length = 2. For example, suppose segmentation yields five words: we | can | perhaps | Pingan Insurance | society (the arithmetic below takes the fourth word as four characters long and each of the others as two). The first three words belong to class A, 'Pingan Insurance' belongs to class B, and 'society' belongs to class C. The score of class A is 2*0.5 + 2*0.5 + 2*0.5 = 3, the score of class B is 4*1.25 = 5, and the score of class C is 2*0.5 = 1, so the text belongs to class B, which has the highest score.
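The scoring rule and its worked example can be checked with a short sketch. The function names are hypothetical, the Chinese words are an assumed back-translation chosen only so that the word lengths match the example (2, 2, 2, 4, 2), and the weight of 0 for single-character words is an assumption because the source gives no value for that case.

```python
def weight(word):
    """Weight of a word by its length, following the stated weight rule."""
    n = len(word)
    if n > 4:
        return 1.5
    if n == 4:
        return 1.25
    if n == 3:
        return 1.0
    if n == 2:
        return 0.5
    return 0.0  # assumption: the source specifies no weight for length 1


def classify(tokens, class_dict):
    """Score each class as the sum of len(word) * weight(word); return the best class."""
    scores = {}
    for tok in tokens:
        label = class_dict.get(tok)
        if label is None:
            continue  # word is not in any classification dictionary
        scores[label] = scores.get(label, 0.0) + len(tok) * weight(tok)
    best = max(scores, key=scores.get) if scores else None
    return best, scores


# Assumed back-translation of: we | can | perhaps | Pingan Insurance | society
tokens = ["我们", "可以", "或许", "平安保险", "社会"]
class_dict = {"我们": "A", "可以": "A", "或许": "A", "平安保险": "B", "社会": "C"}
best, scores = classify(tokens, class_dict)
```

Because the sum runs over every occurrence of a word, a word that appears several times contributes several times, which is how word frequency enters the score.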
Further, step 4 comprises the following operations:
Step 4.1: a search keyword is entered through the screening module; the retrieval module processes the search keyword received from the interactive module, performing preprocessing such as error correction, synonym conversion, and word segmentation;
Step 4.2: optimize the search keyword. Omission and classification are applied to the preprocessed search keyword, and the resulting class and the processed search keyword are output to the index module;
Step 4.3: the retrieval module exchanges information with the index module and directs the index module to call the database in the analysis module, which sends the complete microblog information corresponding to the data ids that match the search keyword to the interactive module;
Step 4.4: the interactive module stores the received data in the stored-data management module.
Further, step 4.2 comprises the following operations:
Step 4.2.1: the search keyword omission module processes the incoming data according to mined rules, finds the unimportant segmented words, and builds regular-expression rules;
Step 4.2.2: the input data is matched against the regular-expression rules;
Step 4.2.3: the search keyword classification module classifies the input data and looks up the corresponding ids in the text files used by the index module. These ids are sent to the analysis module, the complete microblog information for each id is fetched from the database of the analysis module, and it is sent to the interactive module. The classification is produced by matching against a classification table set up in advance.
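The rule-based omission in steps 4.2.1 and 4.2.2 might look roughly like the sketch below. The rules shown are invented English examples, not rules from the patent, whose rules would be mined from Chinese search logs; the function and variable names are likewise hypothetical.

```python
import re

# Hypothetical redundant phrases, standing in for rules mined from query logs.
OMIT_RULES = [
    r"\bplease\s+",
    r"\bhelp me (?:find|search for)\s+",
    r"\s+related microblogs?$",
]
OMIT_RE = re.compile("|".join(OMIT_RULES), flags=re.IGNORECASE)


def omit_redundant(query):
    """Remove the parts of the query covered by a rule and tidy the whitespace."""
    cleaned = OMIT_RE.sub(" ", query)
    return " ".join(cleaned.split())
```

With these example rules, `omit_redundant("Please help me find stock market related microblogs")` reduces the query to `"stock market"`, while a query that matches no rule is returned unchanged.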
The classification table in the Query classification module is built as follows:
A) Processing the Sogou input method dictionaries
The Sogou input method provides dictionaries for many fields; financial dictionaries for areas such as finance and economics, insurance, foreign trade, e-commerce, and stocks can be downloaded from its official website at http://pinyin.sogou.com/dict/. Mapping these dictionaries to the platform's classification system yields a <term, class_id> file.
B) Processing documents that already have a class
Documents that already have a class are segmented with the Paoding segmenter. The terms in the segmentation results whose length is greater than 3 are collected together with their corresponding classes, yielding a <term, class_id> file.
A shell script is then run over the <term, class_id> file to produce a file of records <term, class_id, chi-square value of the term for this class, variance of the term's chi-square values across all classes>. The variance measures how much the term's support varies from class to class: the larger the variance, the better the term distinguishes between classes. Terms whose variance is below a certain threshold are filtered out, and each remaining term is assigned the class for which its chi-square value is largest.
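The chi-square and variance computation is done with a shell script in the source; the following is a minimal Python sketch of the same idea, not the original script. It assumes the standard 2x2 chi-square statistic over document counts, and the input structures `term_class_counts` and `class_sizes` are hypothetical.

```python
from statistics import pvariance


def chi_square(a, b, c, d):
    """2x2 chi-square statistic.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_terms(term_class_counts, class_sizes, min_variance):
    """Keep terms whose chi-square values vary enough across classes and
    map each kept term to the class with its largest chi-square value."""
    total_docs = sum(class_sizes.values())
    result = {}
    for term, counts in term_class_counts.items():
        term_total = sum(counts.values())
        scores = {}
        for cls, size in class_sizes.items():
            a = counts.get(cls, 0)
            b = term_total - a
            c = size - a
            d = total_docs - size - b
            scores[cls] = chi_square(a, b, c, d)
        # Low variance means the term supports all classes about equally.
        if pvariance(list(scores.values())) >= min_variance:
            result[term] = max(scores, key=scores.get)
    return result
```

A term spread evenly over all classes gets a chi-square value near zero for every class, so its variance is small and it is dropped, whereas a term concentrated in one class is kept and assigned to that class.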
A random sample of the classification results is drawn for bad-case analysis and checked manually. Bad cases arise mainly for the following reasons. When chi-square values are used to extract center words, an improperly chosen threshold can let through terms that are too general for a given class; such overly general terms can be deleted. When several center-word terms appear in the same query, some bad cases result; these can be extracted and placed in a direct-match word list. Information such as city and place names also affects the classification results and can be deleted.
The classification process in the Query classification module is as follows:
A) Building the trie and the failure pointers
A trie is built. A trie, also called a word lookup tree or key tree, is a tree structure and a variant of the hash tree. Each node stores 1 byte, and memory for a node's children is allocated dynamically. Breadth-first search with a queue is used to add a failure pointer to each node. In GBK encoding, each Chinese character consists of 2 bytes. Because each node stores 1 byte, splitting characters into bytes can produce false matches: after two Chinese characters are split, the second byte of the first character and the first byte of the second character can form a new Chinese character. To avoid such false matches, English characters are coded as 0 and Chinese characters as 10, so that 01 is not treated as a character.
B) Multi-pattern matching with the longest-match strategy
During matching, if a node matches, matching of the query continues; after the end node of a center word has matched, matching of the query still continues. If a node fails to match, matching continues from the node that its failure pointer points to, and the center word and class id matched so far are output. When the whole query has been matched, the class ids of all matches are output.
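A trie with failure pointers built by breadth-first search is the structure used by the Aho-Corasick algorithm, and the sketch below follows that algorithm. It is a simplified illustration rather than the patent's implementation: it works on Python characters instead of GBK bytes, which sidesteps the byte-splitting problem, and it reports every center word found rather than applying the longest-match preference. All names and the sample dictionary are assumptions.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # character -> child node
        self.fail = None    # failure pointer
        self.outputs = []   # (center_word, class_id) pairs ending at this node


def build(term_classes):
    """Build the trie, then add failure pointers with a breadth-first pass."""
    root = Node()
    for word, class_id in term_classes.items():
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.outputs.append((word, class_id))
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            fail = node.fail
            while fail is not None and ch not in fail.children:
                fail = fail.fail
            child.fail = fail.children[ch] if fail is not None else root
            # Inherit the shorter words that end at the failure target.
            child.outputs = child.outputs + child.fail.outputs
            queue.append(child)
    return root


def match(root, query):
    """Return every (center_word, class_id) found in the query, in scan order."""
    hits = []
    node = root
    for ch in query:
        # On a mismatch, follow failure pointers instead of restarting the scan.
        while node is not root and ch not in node.children:
            node = node.fail
        node = node.children.get(ch, root)
        hits.extend(node.outputs)
    return hits
```

The class ids of a query can then be collected with `{class_id for _, class_id in match(root, query)}`. Keeping only the longest word among nested hits would be a further filtering step on top of this sketch.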
Description of Drawings
Fig. 1 is a structural diagram of the customized screening system for microblogs according to Embodiment 1 of the invention;
Fig. 2 is a flowchart of the customized screening method for microblogs according to Embodiment 1 of the invention;
Fig. 3 is an example diagram of the processing strategy used when a query hits multiple rules in the screening method according to Embodiment 1 of the invention.
Embodiment
The principles and features of the invention are described below with reference to the drawings. The examples given serve only to explain the invention and are not intended to limit its scope.
As shown in Fig. 1, the customized screening system for microblogs according to Embodiment 1 of the invention comprises a background module and an interactive module. The background module is used to acquire data, store data locally, build an index, and provide a search function.
The interactive module exchanges information with the background module and provides a web interface for interacting with the background module.
The background module comprises an acquisition module, an analysis module, an index module, and a retrieval module, which exchange information in that order. The acquisition module integrates different data acquisition methods to collect raw microblog data. The analysis module extracts, deduplicates, and filters the data sent by the acquisition module to obtain valid data, and classifies and stores the valid data; the filtering covers spam, advertisements, and pornographic or reactionary content. The index module performs Chinese and English word segmentation on the data sent by the analysis module, builds an inverted index and an incremental index from the segmentation results, and periodically deletes index entries according to the microblog status file. The retrieval module receives the search keyword sent by the interactive module, performs error correction, synonym conversion, word segmentation, and optimization on it, screens and sorts the search results, and returns the sorted results to the interactive module.
The interactive module comprises a permission control module, a query module, a screening module, a stored-data management module, and a special management module. The permission control module is used to control the different permissions that different users have on the system. The query module lets users view microblog information through ranking queries, tag search, and advanced search. The screening module screens data, adds user-defined topics to it, and stores it in the database. The stored-data management module displays the data in the database of the screening module. The special management module manages organization names, organization classes, and URLs; this management includes add, delete, modify, and query operations.
Said acquisition module comprises that network climbs delivery piece and microblogging API module; Said network is climbed the delivery piece URL webpage of appointment is grasped; And to the URL that chooses the request of sending obtains the original html page in website, comprise initialization URL, the filtration in URL storehouse and choosing of URL;
The microblogging API that said microblogging API module adopts existing microblogging platform to provide obtains the data of JSON form.
Said analysis module comprises data extraction module, data filter module, text classification module, data memory module; Said data extraction module receives network in the acquisition module and climbs html page that the delivery piece collects and through modes such as filtration or conversion or extractions it is formatted as the data of JSON, and with said data transmission to the data filter module that is formatted as JSON;
The data filtering module receives the JSON data output by the microblog API module of the acquisition module and the JSON data passed from the data extraction module, deduplicates and filters them to obtain valid data, and passes the valid data to the text classification module and the data storage module. The filtering covers spam, advertising-bot, and pornographic or reactionary data.
The text classification module classifies the valid data passed from the data filtering module and sends the classification results to the data storage module. Classification comprises segmenting the content of each valid microblog into words, matching the segmentation results against the classification dictionary, and returning the classification result.
The data storage module writes the data and classification results passed from the data filtering module and the text classification module to files, stores the file data by category, and also writes the attribute information of the extracted valid data to a file for preservation.
The index module comprises a text segmentation module and an index building module connected in sequence for data transfer. The text segmentation module uses the Paoding word segmenter ("Paoding Jie Niu", 庖丁解牛) together with a dictionary to segment the file content stored in the data storage module, producing the raw data for indexing. The index building module builds an inverted index and an incremental index over the data passed from the text segmentation module, producing the index data.
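As a rough illustration of what an inverted index with incremental updates and deletion involves, the following in-memory sketch maps each term to the set of microblog ids that contain it. The class and method names are hypothetical, and a real deployment would more likely rely on a search library than on hand-written structures.

```python
from collections import defaultdict


class InvertedIndex:
    """Minimal in-memory inverted index: term -> ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_terms = {}               # document id -> its terms, needed for deletion

    def add(self, doc_id, terms):
        """Incremental update: index one document without rebuilding the rest."""
        if doc_id in self.doc_terms:      # re-adding replaces the old entry
            self.delete(doc_id)
        self.doc_terms[doc_id] = set(terms)
        for term in self.doc_terms[doc_id]:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove a document, for example one marked deleted in a status file."""
        for term in self.doc_terms.pop(doc_id, ()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, terms):
        """Return the ids of documents that contain every query term."""
        sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()
```

Keeping `doc_terms` alongside the postings is what makes deletion cheap: the index knows which posting lists to touch without scanning all of them.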
The retrieval module exchanges information with the interactive module. The retrieval module comprises a search keyword processing module and a search keyword optimization module. The search keyword processing module receives the search keywords from the interactive module and preprocesses them; preprocessing comprises error correction, synonym conversion, word segmentation, and the like.
The search keyword optimization module applies omission conversion and classification to the preprocessed search keywords and sends the search keywords and their categories to the index module.
The data extraction module converts the original HTML pages obtained by the web crawler module into standard XML, locates the data nodes, adds the corresponding tags to the data, and maps the data to JSON format.
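The extraction idea (locate data nodes in a crawled page and map their fields to JSON) can be sketched as follows. This sketch substitutes Python's standard `html.parser`, which tolerates non-standard HTML, for the XML conversion and XSL step described in the patent. The CSS class names `wb-item`, `wb-author`, `wb-time`, and `wb-text` are invented for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical mapping from CSS class names (the "anchors") to output fields.
FIELD_BY_CLASS = {"wb-author": "author", "wb-time": "time", "wb-text": "content"}


class MicroblogExtractor(HTMLParser):
    """Collects one record per element whose class list contains 'wb-item'."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record currently being filled
        self._field = None    # field whose text is expected next

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "wb-item" in classes:  # a data node: start a new record
            self._current = {}
            self.records.append(self._current)
        for cls in classes:
            if cls in FIELD_BY_CLASS and self._current is not None:
                self._field = FIELD_BY_CLASS[cls]

    def handle_data(self, data):
        # Store the first non-blank text that follows an anchor element.
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None


def html_to_json(html):
    """Return the extracted records as a JSON array string."""
    parser = MicroblogExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps(parser.records, ensure_ascii=False)
```

Because the parser reacts to start tags and text only, unclosed tags in the source page do not stop extraction, which matters for pages that are not well-formed.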
As shown in Fig. 2, the customized screening method for microblogs described in Embodiment 1 of the present invention comprises the following steps:
Step 1: obtain microblog data in JSON format through the open API provided by the website;
Step 2: crawl specific microblog websites with a web crawler to obtain the original semi-structured HTML pages;
Step 3: the extraction module in the analysis module locates the data nodes: the original non-standard HTML page is converted to XML and, using the attributes of microblog messages, specific regions of the XML tree are searched; the relevant microblog data is extracted from them and the reference points, or anchors, are marked;
Step 4: use an XSL file to identify the anchors, specify the microblog attribute data to be retrieved from each anchor, and build a JSON output file in the corresponding format;
Step 5: deduplication of microblog data. Message-Digest Algorithm 5 (MD5) is used to compute a signature over the string "microblog data + publish time + author" (where "+" denotes string concatenation), and the signature is stored in the database. For each newly crawled microblog, the MD5 signature is computed and looked up in the database: if it already exists, the microblog is a duplicate; otherwise it is new, and its signature is stored in the database;
Step 6: filtering of microblog data. The words to be filtered are configured in a vocabulary, and multi-pattern matching is run over the microblog data to check whether any word in the vocabulary occurs in it;
Step 7: classify the text according to the classification dictionaries; each classification dictionary is stored in its own text file, named after its category;
Step 8: text segmentation of microblog data; the microblog content is segmented using a method based on the forward maximum matching algorithm;
Step 9: the index module builds an index over the data passed from the analysis module;
Step 10: process the search keywords: preprocess the search keywords from the interactive module with error correction, synonym conversion, word segmentation, and the like;
Step 11: optimize the search keywords: apply omission conversion and classification to the preprocessed search keywords, and send the search keywords and their categories to the index module;
Step 12: the user customizes the screening conditions through the management interface, and the analysis module saves the screened data to the database.
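Steps 5 and 6 above lend themselves to a short sketch. The deduplication part follows the description directly: an MD5 signature over the concatenated content, publish time, and author. For the filtering part, a compiled regular-expression alternation stands in for a dedicated multi-pattern matcher such as an Aho-Corasick automaton; that substitution, the in-memory `set` used in place of the database, and all function names are assumptions.

```python
import hashlib
import re


def signature(content, publish_time, author):
    """MD5 signature of content + publish time + author, joined by concatenation."""
    joined = content + publish_time + author
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers signatures; a set stands in for the signature table in the database."""

    def __init__(self):
        self.signatures = set()

    def is_new(self, content, publish_time, author):
        """Return True and record the signature if the post has not been seen."""
        sig = signature(content, publish_time, author)
        if sig in self.signatures:
            return False
        self.signatures.add(sig)
        return True


def build_filter(blocked_words):
    """Compile the vocabulary once; the returned function reports whether any word occurs."""
    words = sorted(set(blocked_words), key=len, reverse=True)
    if not words:
        return lambda content: False
    pattern = re.compile("|".join(re.escape(word) for word in words))
    return lambda content: pattern.search(content) is not None
```

A post would be kept only if `is_new(...)` returns True and the function returned by `build_filter(...)` returns False for its content.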
The above is merely a preferred embodiment of the present invention and is not intended to limit the present invention; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A customized screening system for microblogs, characterized in that it comprises a background module and an interactive module, wherein the background module is used to collect data, analyze data, store data locally, build an index, and provide a search function;
The interactive module exchanges information with the background module and provides a web interface for interacting with the background module;
The background module comprises an acquisition module, an analysis module, an index module, and a retrieval module that exchange information in sequence; the acquisition module collects raw microblog data;
The analysis module extracts, deduplicates, and filters the data passed from the acquisition module to obtain valid data, then classifies and stores the valid data; the filtering covers spam, advertisements, and pornographic or reactionary data;
The index module performs Chinese and English word segmentation on the data passed from the analysis module, builds an inverted index and an incremental index from the segmentation results, and periodically deletes index entries according to the microblog status file;
The retrieval module receives the search keywords sent by the interactive module, applies error correction, synonym conversion, word segmentation, and optimization to them, screens and ranks the retrieval results, and returns the ranked results to the interactive module.
2. The customized screening system according to claim 1, characterized in that the retrieval module comprises a query (search keyword) processing module and a query optimization module; the query processing module receives the query passed from the interactive module, processes it, and sends the processed query to the query optimization module;
The query optimization module applies omission conversion and classification to the query sent by the query processing module, sends the query and its category to the index module, and receives the results returned by the index module;
The query optimization module comprises a query omission module and a query classification module; the query omission module receives the data sent by the query processing module and matches it against regular expressions, and queries that do not match are omitted; the query classification module classifies the data from the query omission module by topic and passes the classified data to the index module;
The query omission module processes the incoming data according to mining rules, identifies unimportant tokens, and builds regular-expression rules against which subsequently input data is matched.
3. The customized screening system according to claim 1, characterized in that the interactive module comprises a permission control module, a query module, a screening module, a stored-data management module, and a special management module; the permission control module controls the different operation permissions that different users have on the system;
The query module lets users view microblog information through ranking queries, tag search, and advanced search;
The screening module screens data, attaches user-defined topics, and stores the result in the database;
The stored-data management module displays the data that the screening module has stored in the database;
The special management module manages the names of celebrities and organizations, the categories of celebrities and organizations, and URL web addresses.
4. The customized screening system according to claim 3, characterized in that the acquisition module comprises a web crawler module and a microblog API (application programming interface) module; the web crawler module crawls the specified URL web addresses, sends requests to the selected URLs to obtain the site's original HTML (HyperText Markup Language) pages, and sends them to the analysis module; the microblog API module uses the microblog API provided by an existing microblog platform to obtain data in JSON (a lightweight data-interchange format) and sends it to the analysis module;
The analysis module comprises a data extraction module, a data filtering module, a text classification module, and a data storage module; the data extraction module receives the HTML pages collected by the web crawler module of the acquisition module, formats them as JSON data, and passes the JSON-formatted data to the data filtering module; the data extraction module converts the original HTML pages obtained by the web crawler module into standard XML (Extensible Markup Language), locates the data nodes, adds the corresponding tags to the data, and maps the data to JSON format; the data filtering module receives the JSON data output by the microblog API module of the acquisition module and the JSON data passed from the data extraction module, deduplicates and filters them to obtain valid data, and passes the valid data to the text classification module and the data storage module; the text classification module classifies the valid data passed from the data filtering module and sends the classification results to the data storage module; the data storage module writes the data and classification results passed from the data filtering module and the text classification module to files, stores the file data separately, and also writes the attribute information of the extracted valid data to a file;
The data storage module comprises a database and text files; the database stores the complete data and sends data to the interactive module on user instruction; the text files store the id, content, and category of each data item, and data is passed to the interactive module when called by the index module;
The index module comprises a text segmentation module and an index building module; the text segmentation module uses the Paoding word segmenter ("Paoding Jie Niu", 庖丁解牛) together with a dictionary to segment the file content stored in the data storage module, producing the raw data for indexing; the index building module builds an inverted index and an incremental index over the data passed from the text segmentation module, producing the index data.
5. A customized screening method for microblogs, characterized in that it comprises the following steps:
Step 1: collect data from websites through the acquisition module;
Step 2: the analysis module filters the data from the acquisition module to obtain valid data;
Step 3: the index module builds an index over the data passed from the analysis module;
Step 4: the user enters a query request, and the relevant data in the analysis module is obtained through the retrieval module.
6. The customized screening method according to claim 5, characterized in that step 1 obtains data by two methods;
The system obtains microblog data in JSON (lightweight data-interchange) format from the open API provided by the website, through the microblog API module of the acquisition module;
The system crawls specific microblog websites through the web crawler module of the acquisition module to obtain the original semi-structured HTML pages.
7. The customized screening method according to claim 5, characterized in that step 2 comprises the following steps:
Step 2.1: determine whether the data was sent by the web crawler module or by the microblog API module; if it is a semi-structured HTML page sent by the web crawler module, go to step 2.2; if it is data sent by the microblog API module, go to step 2.4;
Step 2.2: locate the data nodes: the original non-standard HTML page is converted to XML and, using the attributes of microblog messages, specific regions of the XML tree are searched; the relevant microblog data is extracted from them and the reference points, or anchors, are marked;
Step 2.3: use an XSL file to identify the anchors, specify the microblog attribute data to be retrieved from each anchor, and build a JSON output file in the corresponding format;
Step 2.4: deduplicate the microblog data;
Step 2.5: filtering of microblog data: the words to be filtered are configured in a vocabulary, and multi-pattern matching is run over the microblog data to check whether any word in the vocabulary occurs in it; each classification dictionary is stored in its own text file, named after its category; for text segmentation, the microblog content is segmented using a method based on the forward maximum matching algorithm;
Step 2.6: the text classification module in the analysis module classifies the filtered data and stores the results in the database and the text files; the complete microblog data is stored in the database, and the data id, content, and category are stored in the text files.
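Forward maximum matching and dictionary-based classification, both named in the steps above, can be sketched briefly. The function names, the in-memory dictionaries (the patent stores one dictionary file per category), and the rule of choosing the category with the most matching tokens are assumptions; the patent states only that segmentation results are matched against the classification dictionary.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Scan left to right, always taking the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character is always accepted.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def classify(tokens, category_words):
    """Return the category whose word list matches the most tokens, or None if none match."""
    scores = {
        name: sum(1 for token in tokens if token in words)
        for name, words in category_words.items()
    }
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

For example, with a dictionary containing 微博, 数据, 微博数据, and 筛选, the text 微博数据筛选 is split into 微博数据 and 筛选, because the longer match 微博数据 wins over 微博 at the first position.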
8. The customized screening method according to claim 5, characterized in that step 4 comprises the following operations:
Step 4.1: search keywords are entered through the screening module, and the retrieval module processes them, preprocessing the search keywords from the interactive module with error correction, synonym conversion, word segmentation, and the like;
Step 4.2: optimize the search keywords: apply omission and classification to the preprocessed search keywords, and output the corresponding classification results and the processed search keywords to the index module;
Step 4.3: the retrieval module exchanges information with the index module and directs the index module to call the database in the analysis module, so that the complete microblog information corresponding to the data ids matching the search keywords is passed to the interactive module;
Step 4.4: the interactive module stores the received data in the stored-data management module.
9. The customized screening method according to claim 8, characterized in that step 4.2 comprises the following operations:
Step 4.2.1: the search keyword omission module processes the incoming data according to mining rules, identifies unimportant tokens, and builds regular-expression rules;
Step 4.2.2: the input data is matched against the regular-expression rules;
Step 4.2.3: the search keyword classification module classifies the input data and retrieves the corresponding ids from the text files via the index module; the ids are sent to the analysis module, and the complete microblog information for each id is taken from the analysis module's database and passed to the interactive module; the classification is produced by matching against a preset classification table.
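The omission steps (mine unimportant tokens, build regular-expression rules, match later input against them) can be sketched under stated assumptions. The patent does not define the mining rule or say precisely which side of a match is dropped. This sketch assumes that a token appearing in at least half of the logged queries is unimportant, and that tokens matching the resulting rule are removed from the query. All names are hypothetical.

```python
import re
from collections import Counter


def mine_unimportant(query_log, min_share=0.5):
    """Assumed mining rule: a token present in at least min_share of queries is unimportant."""
    counts = Counter(token for query in query_log for token in set(query))
    threshold = min_share * len(query_log)
    return {token for token, n in counts.items() if n >= threshold}


def build_rule(unimportant):
    """Turn the mined tokens into one compiled regular expression, or None if there are none."""
    if not unimportant:
        return None
    return re.compile("|".join(re.escape(token) for token in sorted(unimportant)))


def omit(tokens, rule):
    """Drop tokens that fully match the rule; keep the query intact if nothing would remain."""
    if rule is None:
        return list(tokens)
    kept = [token for token in tokens if not rule.fullmatch(token)]
    return kept or list(tokens)
```

Falling back to the original tokens when everything would be removed avoids sending an empty query to the index module.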
CN2012100656789A 2012-03-13 2012-03-13 Customized screening system and method for microblog Pending CN102622443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100656789A CN102622443A (en) 2012-03-13 2012-03-13 Customized screening system and method for microblog

Publications (1)

Publication Number Publication Date
CN102622443A true CN102622443A (en) 2012-08-01

Family

ID=46562362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100656789A Pending CN102622443A (en) 2012-03-13 2012-03-13 Customized screening system and method for microblog

Country Status (1)

Country Link
CN (1) CN102622443A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100306249A1 (en) * 2009-05-27 2010-12-02 James Hill Social network systems and methods

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
何屹: "基于Web分类技术的农业信息获取系统的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
廉捷等: "新浪微博数据挖掘方案", 《清华大学学报(自然科学版)》 *
段利国等: "限定语义距离的关键词同义扩展及精简", 《计算机工程与应用》 *

Cited By (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577464A (en) * 2012-08-02 2014-02-12 百度在线网络技术(北京)有限公司 Method and device for excavating badcase of search engine
CN103577464B (en) * 2012-08-02 2018-07-10 百度在线网络技术(北京)有限公司 A kind of method for digging and device of search engine bad example
CN102945246A (en) * 2012-09-28 2013-02-27 北界创想(北京)软件有限公司 Method and device for processing network information data
CN103823808A (en) * 2012-11-16 2014-05-28 云壤(北京)信息技术有限公司 System and method for searching web page by using microblog short link
CN103064888A (en) * 2012-12-10 2013-04-24 北京小米科技有限责任公司 Information publish method and device
US9454568B2 (en) 2013-01-23 2016-09-27 Tencent Technology (Shenzhen) Company Limited Method, apparatus and computer storage medium for acquiring hot content
WO2014114143A1 (en) * 2013-01-23 2014-07-31 Tencent Technology (Shenzhen) Company Limited Method, apparatus and computer storage medium for acquiring hot content
CN103136346A (en) * 2013-02-07 2013-06-05 珠海市君天电子科技有限公司 Method for identifying microblog fake advertisements
CN103150662A (en) * 2013-02-07 2013-06-12 珠海市君天电子科技有限公司 Method for identifying false commodity advertisement in Taobao
CN103150662B (en) * 2013-02-07 2016-07-06 珠海市君天电子科技有限公司 A kind of method identifying Taobao's falseness Commdity advertisement
CN103150353A (en) * 2013-02-18 2013-06-12 人民搜索网络股份公司 Method and device for acquiring microblog information
CN103150378A (en) * 2013-03-13 2013-06-12 珠海市君天电子科技有限公司 Method for identifying false favorable comments in microblog advertisements
CN103152347A (en) * 2013-03-13 2013-06-12 珠海市君天电子科技有限公司 Method for prompting microblog false advertisements
CN103150378B (en) * 2013-03-13 2016-04-06 珠海市君天电子科技有限公司 A kind of method identifying false favorable comment in microblogging advertisement
CN103312589A (en) * 2013-03-13 2013-09-18 四川天翼网络服务有限公司 Work micro-blog group internal communication system and method
CN103312589B (en) * 2013-03-13 2016-06-01 四川天翼网络服务有限公司 Work micro-blog group internal communication system and method
CN103200182B (en) * 2013-03-13 2016-01-27 珠海市君天电子科技有限公司 A kind of method identifying the microblogging marketing account propagating sham publicity
CN103200182A (en) * 2013-03-13 2013-07-10 珠海市君天电子科技有限公司 Method of identifying microblog marketing account spreading false advertisements
CN103152347B (en) * 2013-03-13 2016-11-16 珠海市君天电子科技有限公司 A kind of method that microblogging sham publicity is pointed out
CN104063390A (en) * 2013-03-20 2014-09-24 腾讯科技(深圳)有限公司 Microblog data processing method and system
CN103186675A (en) * 2013-04-03 2013-07-03 南京安讯科技有限责任公司 Automatic webpage classification method based on network hot word identification
CN103279483B (en) * 2013-04-23 2016-04-13 中国科学院计算技术研究所 A kind of topic Epidemic Scope appraisal procedure towards micro-blog and system
CN103279483A (en) * 2013-04-23 2013-09-04 中国科学院计算技术研究所 Topic prevalence range assessment method and system facing micro-blogs
CN103440139A (en) * 2013-09-11 2013-12-11 北京邮电大学 Acquisition method and tool facing microblog IDs (identitiesy) of mainstream microblog websites
CN103544275A (en) * 2013-10-22 2014-01-29 华为技术有限公司 Data processing method and device
CN103856565A (en) * 2014-03-18 2014-06-11 浪潮集团有限公司 E-commerce tax source management cloud collection monitoring method
CN105095271B (en) * 2014-05-12 2019-04-05 北京大学 Microblogging search method and microblogging retrieve device
CN105095270A (en) * 2014-05-12 2015-11-25 北京大学 Retrieval apparatus and retrieval method
CN105095271A (en) * 2014-05-12 2015-11-25 北京大学 Microblog retrieval method and microblog retrieval apparatus
CN105095270B (en) * 2014-05-12 2019-02-26 北京大学 Retrieve device and search method
CN104133834B (en) * 2014-06-09 2018-05-04 合肥工业大学 Specify the collection of region microblog data and processing method
CN104133834A (en) * 2014-06-09 2014-11-05 合肥工业大学 Designated area microblog data collecting and processing method
CN104281680B (en) * 2014-09-30 2018-08-21 百度在线网络技术(北京)有限公司 Data processing system, method and device for obtaining site resource
CN104281680A (en) * 2014-09-30 2015-01-14 百度在线网络技术(北京)有限公司 Data processing system, method and device for acquiring website resources
CN104375826A (en) * 2014-10-11 2015-02-25 北京中搜网络技术股份有限公司 High-availability microblog collecting platform and method
CN106326316A (en) * 2015-07-08 2017-01-11 腾讯科技(深圳)有限公司 Web page advertisement filtering method and device
CN105447719A (en) * 2015-12-01 2016-03-30 苏州铭冠软件科技有限公司 Data processing method suitable for big data analysis
CN105512261A (en) * 2015-12-02 2016-04-20 广州华多网络科技有限公司 Method and system for expressing front end lightweight statistical data
CN107222381B (en) * 2016-03-21 2020-03-06 北大方正集团有限公司 Microblog data propagation path determining method and device
CN107222381A (en) * 2016-03-21 2017-09-29 北大方正集团有限公司 The propagation path of microblog data determines method and apparatus
CN106021450A (en) * 2016-05-17 2016-10-12 华中科技大学 Event-oriented microblog search method
CN106021450B (en) * 2016-05-17 2019-06-18 华中科技大学 A kind of event-oriented microblogging searching method
CN106372049A (en) * 2016-08-31 2017-02-01 符文忠 Word document editor
CN108073604A (en) * 2016-11-10 2018-05-25 北京国双科技有限公司 Text handling method and device
CN108701038A (en) * 2017-01-24 2018-10-23 华为技术有限公司 A kind of method, terminal and the advertisement delivery system of terminal display advertisement
CN107704491A (en) * 2017-08-22 2018-02-16 腾讯科技(深圳)有限公司 Message treatment method and device
CN107911453A (en) * 2017-11-16 2018-04-13 北京锐安科技有限公司 A kind of data processing method and device for customizing client
CN108427639A (en) * 2018-01-24 2018-08-21 深圳壹账通智能科技有限公司 Automated testing method, application server and computer readable storage medium
CN108416264A (en) * 2018-01-29 2018-08-17 山东汇贸电子口岸有限公司 A kind of searching method and search module of supporting OCR to input
CN108415748B (en) * 2018-03-01 2021-06-01 广州南方人才资讯科技有限公司 Information display method and system, computer storage medium and device
CN108415748A (en) * 2018-03-01 2018-08-17 广州南方人才资讯科技有限公司 Method for information display and system, computer storage media and equipment
CN108897831A (en) * 2018-06-22 2018-11-27 济源职业技术学院 A kind of Artificial intelligent information screening system
CN109241432A (en) * 2018-09-07 2019-01-18 云南东巴文信息技术有限公司 Discrete data acquisition analysis system and method
CN110971476A (en) * 2018-09-29 2020-04-07 珠海格力电器股份有限公司 Method and system for analyzing file downloading behavior and intelligent terminal
CN111131268A (en) * 2019-12-27 2020-05-08 南京邮电大学 User data acquisition and storage system and method based on microblog platform
CN111342933B (en) * 2020-02-25 2022-06-07 卓望数码技术(深圳)有限公司 Data transmission method, device and medium
CN111342933A (en) * 2020-02-25 2020-06-26 卓望数码技术(深圳)有限公司 Data transmission method, device and medium
CN111614575A (en) * 2020-04-01 2020-09-01 宜通世纪科技股份有限公司 Deep packet inspection method, system and storage medium based on internet flow
CN111614575B (en) * 2020-04-01 2022-11-08 宜通世纪科技股份有限公司 Deep packet inspection method, system and storage medium based on internet flow
CN113742478A (en) * 2020-05-29 2021-12-03 国家计算机网络与信息安全管理中心 Directed screening framework and method for massive text data
CN113742478B (en) * 2020-05-29 2023-09-05 国家计算机网络与信息安全管理中心 Directional screening device and method for massive text data
CN112306998A (en) * 2020-10-13 2021-02-02 武汉中科通达高新技术股份有限公司 Commission data duplicate removal method, device and server
CN112306998B (en) * 2020-10-13 2023-11-24 武汉中科通达高新技术股份有限公司 Method, device and server for de-duplication of traffic and delegation data
CN112632361A (en) * 2020-12-29 2021-04-09 中科院计算技术研究所大数据研究院 Iterative data acquisition method
CN112860754A (en) * 2021-03-11 2021-05-28 恒基文化实业(深圳)有限公司 Data processing method for screening corresponding users based on big data
CN114925259A (en) * 2022-04-20 2022-08-19 北京网景盛世技术开发中心 Information acquisition and extraction method and system based on government portal and new media

Similar Documents

Publication Publication Date Title
CN102622443A (en) Customized screening system and method for microblog
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
Ratkiewicz et al. Truthy: mapping the spread of astroturf in microblog streams
CN103136360B (en) A kind of internet behavior markup engine and to should the behavior mask method of engine
CN103544210B (en) System and method for identifying webpage types
US8423565B2 (en) Information life cycle search engine and method
CN102171702B (en) Detection of confidential information
CN104850574B (en) Sensitive word filtering method for text-oriented information
CN103294781B (en) Method and apparatus for processing page data
US20050267915A1 (en) Method and apparatus for recognizing specific type of information files
Li et al. Markuplm: Pre-training of text and markup language for visually-rich document understanding
CN104268148B (en) Method and system for automatic extraction of forum page information based on time strings
CN102902703A (en) Network sensitive information-oriented screenshot discovery and locking callback method
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN103136358B (en) Method for automatically extracting forum data
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN105550189A (en) Ontology-based intelligent retrieval system for information security event
CN102890702A (en) Internet forum-oriented opinion leader mining method
CN103678412A (en) Document retrieval method and device
CN104199938B (en) RSS-based method and system for sending agricultural land information
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN106484797A (en) Accident summary abstracting method based on sparse study
CN114064851A (en) Multi-machine retrieval method and system for government office documents
CN103235827A (en) Method for automatically classifying and screening scientific and technological information
CN111859065A (en) Big data-based public opinion listening system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20120801
