CN103778200A - Method for extracting information source of message and system thereof - Google Patents
- Publication number
- CN103778200A, CN201410010836.XA
- Authority
- CN
- China
- Prior art keywords
- information source
- message
- extraction
- character
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for extracting an information source of a message and a system thereof. According to the method, the information source in the message is extracted by matching key words of an information source extracting rule base, and an information source type is judged by matching rules of the information source extracting rule base. The method comprises the steps of message analysis and information source extraction, wherein the message analysis comprises the steps of extracting characters in a text according to the input text and carrying out segmenting treatment on the characters to obtain different sub-clauses, and the information source extraction comprises the steps of carrying out key word matching on the sub-clauses according to the information source extracting rule base, extracting a useful element sequence from the sub-clauses, extracting the information source on the useful element sequence, and judging the information source type by matching the rules of the information source extracting rule base.
Description
Technical field
The present invention relates to the field of text mining, and in particular to a method and system for extracting the information source of a message.
Background
In recent years, with the development of Internet technology, all kinds of information have spread widely over the network. The quality and credibility of this information vary greatly: there are relatively rigorous traditional news media, and there are also emerging media of relatively poor credibility such as forums, blogs, and microblogs. How to extract useful information sources has therefore become a widely studied problem.
Information extraction (IE), as the name suggests, structures the information contained in text into a table-like organized form. The input to an information extraction system is raw text, and the output is information points in a fixed format. Extracting information points from various documents and integrating them in a unified form is the main task of information extraction.
Information extraction does not attempt to fully understand an entire document; it only analyzes the parts of the document that contain relevant information. Which information is relevant is determined by the domain fixed when the system is designed.
Information extraction is very useful for pulling specific required facts out of a large number of documents. Such document collections exist on the Internet, where information on the same topic is usually scattered across different websites and presented in different forms. Gathering this information together and storing it in structured form would be highly beneficial.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for extracting the information source of a message, so as to overcome the low extraction efficiency and complicated operation of information extraction techniques in the prior art.
To achieve the above object, the invention provides a method for extracting the information source of a message, characterized in that the method extracts the information source in the message by matching the keywords of an information source extraction rule base, and judges the information source type by matching the rules of the information source extraction rule base. The method comprises:
Message parsing step: extracting the characters in the input text and segmenting the characters into different clauses;
Information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base.
In the above method, the information source extraction rule base further comprises: a useful element library, true information source recognition rules, information source type recognition rules, and character type recognition rules.
In the above method, before the message parsing step, the method further comprises:
Message content adaptation step: shielding differences in the encoding or storage of messages and providing a unified interface for iteratively reading message characters.
In the above method, the method further comprises:
Information source statistics step: aggregating the extraction results for the information sources and computing statistics on the information sources.
In the above method, the message parsing step further comprises:
Message character reading step: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judging step: dividing the characters into different types according to the character type recognition rules;
Event response step: according to the type of each character, notifying the user to perform the extraction operation for that character type.
In the above method, the information source extraction step further comprises:
Index building step: building a TRIE keyword index from the useful element library;
Clause segmentation step: segmenting the characters from the event response step into different clauses;
Extraction processing step: performing keyword matching on the clauses according to the TRIE keyword index, extracting the information source, judging the authenticity of the information source, and determining the information source type;
Output step: outputting the information source and its information source type.
In the above method, the extraction processing step further comprises:
Information source extraction step: performing information source extraction clause by clause, and extracting a candidate information source or a list of candidate information sources according to the TRIE keyword index built from the useful element library;
Useful element extraction step: for the candidate information source or list of candidate information sources, extracting from the clause the useful elements and their position information within the clause;
True information source judging step: judging, by the predefined true information source recognition rules, whether the candidate information source is a true information source;
Information source type extraction step: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
In the above method, the useful element library contains useful elements, and the useful elements comprise: media name indicator words, date information, media reporting-action words, and media indicator words.
In the above method, the true information source recognition rules are heuristic rules formulated manually by observing messages, and rules can be added or modified.
In the above method, the true information source recognition rules comprise the following heuristic rule: if a clause contains only one candidate information source and contains a media reporting-action word, and in addition the candidate information source string ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a true information source.
In the above method, the information source types comprise: news media, forum, blog, and microblog.
In the above method, in the information source type extraction step, for an information source whose type is blog and/or microblog, the user name or blog website information is further extracted.
The present invention also provides a system for extracting the information source of a message, which uses the above method and is characterized in that the system comprises:
Message parsing module: performing encoding parsing on the input text, extracting the characters in the text, and segmenting the characters into different clauses;
Information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base.
In the above system, the information source extraction rule base further comprises: a useful element library, true information source recognition rules, information source type recognition rules, and character type recognition rules.
In the above system, the system further comprises:
Message content adaptation module: shielding differences in the encoding or storage of messages and providing a unified interface for iteratively reading message characters.
In the above system, the system further comprises:
Information source statistics module: aggregating the extraction results for the information sources and computing statistics on the information sources.
In the above system, the message parsing module further comprises:
Message character reading module: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judging module: dividing the characters into different types according to the character type recognition rules;
Event response module: according to the type of each character, notifying the user to perform the extraction operation for that character type.
In the above system, the information source extraction module further comprises:
Index building module: building a TRIE keyword index from the useful element library;
Clause segmentation module: segmenting the characters from the event response module into different clauses;
Extraction processing module: performing keyword matching on the clauses according to the TRIE keyword index, extracting the information source, judging the authenticity of the information source, and determining the information source type;
Output module: outputting the information source and its information source type.
In the above system, the extraction processing module further comprises:
Information source extraction module: performing information source extraction clause by clause, and extracting a candidate information source or a list of candidate information sources according to the TRIE keyword index built from the useful element library;
Useful element extraction module: for the candidate information source or list of candidate information sources, extracting from the clause the useful elements and their position information within the clause;
True information source judging module: judging, by the predefined true information source recognition rules, whether the candidate information source is a true information source;
Information source type extraction module: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
In the above system, in the information source type extraction module, for an information source whose type is blog or microblog, the user name and blog website information are further extracted.
Compared with the prior art, the beneficial effects of the present invention are:
1. The invention is built on a general, event-response-based information extraction framework that can be flexibly extended to implement specific extraction tasks.
2. The invention effectively integrates the information source extraction rule base, extracts the message source from a message and judges its type, which improves extraction efficiency and reduces operational difficulty.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention;
Fig. 2 is a schematic diagram of the message parsing step of the present invention;
Fig. 3 is a schematic diagram of the information source extraction step of the present invention;
Fig. 4 is a schematic diagram of the extraction processing step of the present invention;
Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention;
Fig. 6 is a schematic diagram of the message parsing steps in an embodiment of the present invention;
Fig. 7 is a schematic diagram of the message extraction steps in an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the message information source extraction system of the present invention;
Fig. 9 is a schematic structural diagram of the message information source extraction system in a specific embodiment of the present invention.
Reference numerals:
1 message content adaptation module; 2 message parsing module
3 information source extraction module; 4 information source statistics module
21 message character reading module; 22 character type judging module
23 event response module
31 index building module; 32 clause segmentation module
33 extraction processing module; 34 output module
331 information source extraction module; 332 useful element extraction module
333 true information source judging module; 334 information source type extraction module
S1~S4, S11~S13, S21~S24, S231~S234, S100~S102, S1031~S1034: steps of the embodiments of the present invention.
Embodiment
Specific embodiments of the present invention are given below and described in detail with reference to the drawings.
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention. As shown in Fig. 1, the method extracts the information source in a message by matching the keywords of the information source extraction rule base, and judges the information source type by matching the rules of the information source extraction rule base. The method comprises:
Message content adaptation step S1: shielding differences in the encoding or storage of messages and providing a unified interface for iteratively reading message characters;
Message parsing step S2: extracting the characters in the input text and segmenting the characters into different clauses;
Information source extraction step S3: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base;
Information source statistics step S4: aggregating the extraction results for the information sources and computing statistics on the information sources.
The information source extraction rule base further comprises: a useful element library, true information source recognition rules, information source type recognition rules, and character type recognition rules.
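The message content adaptation step S1 can be illustrated with a minimal Python sketch. It assumes the message arrives as a sequence of byte chunks in a known encoding; the function name `iter_characters` and the sample text are hypothetical and do not come from the patent. The sketch only shows how one character-iteration interface can hide the underlying encoding.

```python
import codecs
from typing import Iterable, Iterator


def iter_characters(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Yield decoded characters one at a time from an iterable of byte chunks."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    for chunk in chunks:
        # The incremental decoder buffers a multi-byte character that is
        # split across two chunks and emits it once it is complete.
        for ch in decoder.decode(chunk):
            yield ch
    # Flush buffered state; this raises if the stream ends mid-character.
    for ch in decoder.decode(b"", final=True):
        yield ch


# Hypothetical sample: a GBK byte stream split at an odd offset, so that
# one two-byte character straddles the chunk boundary.
raw = "据新华网报道".encode("gbk")
chunks = [raw[:3], raw[3:]]
chars = list(iter_characters(chunks, encoding="gbk"))
```

A caller that consumes `iter_characters` never sees bytes, so the same downstream parser can be reused for files, network streams, or database fields stored in different encodings.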
Fig. 2 is a schematic diagram of the message parsing step of the present invention. As shown in Fig. 2, the message parsing step S2 further comprises:
Message character reading step S21: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judging step S22: dividing the characters into different types according to the character type recognition rules;
Event response step S23: according to the type of each character, notifying the user to perform the extraction operation for that character type.
Fig. 3 is a schematic diagram of the information source extraction step of the present invention. As shown in Fig. 3, the information source extraction step S3 further comprises:
Index building step S31: building a TRIE keyword index from the useful element library;
Clause segmentation step S32: segmenting the characters from the event response step into different clauses;
Extraction processing step S33: performing keyword matching on the clauses according to the TRIE keyword index, extracting the information source, judging the authenticity of the information source, and determining the information source type;
Output step S34: outputting the information source and its information source type.
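The TRIE keyword index of step S31 and the keyword matching of step S33 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the names `TrieNode`, `build_trie`, and `match_keywords`, the longest-match scanning policy, and the Chinese keywords (assumed renderings of the translated examples) are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.label: Optional[str] = None  # element type if a keyword ends here


def build_trie(library: Dict[str, str]) -> TrieNode:
    """Build a character trie from a keyword -> element-type mapping."""
    root = TrieNode()
    for keyword, label in library.items():
        node = root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.label = label
    return root


def match_keywords(root: TrieNode, clause: str) -> List[Tuple[int, str, str]]:
    """Scan left to right, taking the longest keyword starting at each position.

    Returns (start offset, matched keyword, element type) triples.
    """
    hits = []
    i = 0
    while i < len(clause):
        node, j, best = root, i, None
        while j < len(clause) and clause[j] in node.children:
            node = node.children[clause[j]]
            j += 1
            if node.label is not None:
                best = (j, node.label)
        if best is None:
            i += 1
        else:
            end, label = best
            hits.append((i, clause[i:end], label))
            i = end
    return hits


# Hypothetical library entries, not the patent's actual rule base.
LIBRARY = {
    "报道": "report_action",
    "网": "media_name",
    "博客": "media_name",
    "据": "media_indicator",
}
trie = build_trie(LIBRARY)
hits = match_keywords(trie, "据新华网报道")
```

Because all keywords share one trie, a single pass over the clause matches every keyword type at once and records each match's position, which the later rule-matching steps rely on.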
Fig. 4 is a schematic diagram of the detailed steps of the message information source extraction method of the present invention. As shown in Fig. 4, the extraction processing step S33 further comprises:
Information source extraction step S331: performing information source extraction clause by clause, and extracting a candidate information source or a list of candidate information sources according to the TRIE keyword index built from the useful element library;
Useful element extraction step S332: for the candidate information source or list of candidate information sources, extracting from the clause the useful elements and their position information within the clause;
True information source judging step S333: judging, by the predefined true information source recognition rules, whether the candidate information source is a true information source;
Information source type extraction step S334: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
The useful element library contains useful elements, and the useful elements comprise: media name indicator words, date information, media reporting-action words, and media indicator words.
The true information source recognition rules are heuristic rules formulated manually by observing messages, and rules can be added or modified.
Further, the true information source recognition rules of the present invention comprise the following heuristic rule: if a clause contains only one candidate information source and contains a media reporting-action word, and in addition the candidate information source string ends with a media name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a true information source.
The information source types comprise: news media, forum, blog, and microblog.
In the information source type extraction step S334, for an information source whose type is blog and/or microblog, the user name or blog website information is further extracted.
The steps of a specific embodiment are described below with reference to the drawings. Fig. 5 is a schematic diagram of the steps of one embodiment of the message information source extraction method of the present invention; as shown in Fig. 5, the operation steps of this embodiment illustrate the information source extraction process.
The object of the invention is to provide a user-friendly information extraction technique, which includes extracting the information sources that appear in a message, automatically analyzing the type (news, forum, blog, microblog) and name of the message source, and extracting the user names of blogs and microblogs.
To achieve this object, the invention provides a rule-matching-based method and a rule base for information source extraction, comprising the following steps:
Step S100: read the rule base, extract the keywords and their type information from it, and build a TRIE keyword index.
Step S101: perform encoding parsing on the input text and extract a character stream from it, such as Chinese characters and punctuation.
Step S102: perform clause segmentation, dividing the input text into different clauses.
Step S103: process each clause with the following steps:
Step S1031: use the TRIE index built in advance to perform multi-keyword matching and date matching, dividing the clause into a sequence of "useful elements" and recording the position of each "useful element" in the clause. Useful elements include media name indicator words, media reporting-action words, media indicator words, and so on.
Step S1032: match the predefined rules one by one against the useful element sequence; if a candidate information source exists, extract it and judge whether it is a true information source.
Step S1033: by matching predefined rules, further judge the type of the extracted information source.
Step S1034: output the result.
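Clause segmentation (step S102) and the date-matching part of step S1031 can be sketched in Python as below. The delimiter set, the date pattern, and the sample sentence are assumptions made for illustration; the description itself only names the comma as a delimiter and gives translated date examples such as "April 1" and "June 24-25".

```python
import re
from typing import List, Tuple

# Assumed clause delimiters; the description only names the comma explicitly.
CLAUSE_DELIMITERS = "，,。；;！!？?"
# Assumed date pattern for forms such as "4月1日" or "6月24-25日".
DATE_PATTERN = re.compile(r"\d{1,2}月\d{1,2}(?:-\d{1,2})?日")


def split_clauses(text: str) -> List[str]:
    """Split text into non-empty clauses at any delimiter character."""
    parts = re.split("[" + re.escape(CLAUSE_DELIMITERS) + "]", text)
    return [p.strip() for p in parts if p.strip()]


def find_dates(clause: str) -> List[Tuple[int, str]]:
    """Return (start offset, matched text) for each date in a clause."""
    return [(m.start(), m.group()) for m in DATE_PATTERN.finditer(clause)]


# Hypothetical sample sentence.
text = "据《某某晚报》4月1日报道，会议于6月24-25日举行。"
clauses = split_clauses(text)
dates = [find_dates(c) for c in clauses]
```

Dates are matched with a pattern rather than through the keyword index because they form an open set; the recorded offsets let later rules check where a date sits relative to a candidate source.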
Fig. 6 is a schematic diagram of the message parsing steps in an embodiment of the present invention. As shown in Fig. 6, message parsing consists of three steps:
Step S200: read a message character. The Parser reads one character through the message character iteration interface; that is, the interface reads the message byte stream, assembles the bytes into an actual character (for example, one Chinese character) according to the corresponding encoding, and returns it to the Parser.
Step S201: judge the character type. Characters are divided into different types according to their functional role in extracting different elements, such as year, month, and day characters and certain special punctuation marks.
Step S202: notify the Listeners to respond to the event. According to the character type, each Listener (observer) is notified to execute its corresponding callback function in response to the character-read event.
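Steps S201 and S202 follow the observer pattern, which can be sketched in Python as below. The character classes in `classify`, the `subscribe` method, and the `ClauseCollector` listener are hypothetical; they only illustrate how a Parser might dispatch character-read events to callbacks registered per character type.

```python
from typing import Callable, Dict, Iterable, List


def classify(ch: str) -> str:
    """Assign an assumed character type; the real rules live in the rule base."""
    if ch in "年月日":
        return "date_unit"
    if ch.isdigit():
        return "digit"
    if ch in "，,。；;":
        return "clause_end"
    return "other"


class Parser:
    """Reads characters and notifies listeners registered per character type."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, char_type: str, callback: Callable[[str], None]) -> None:
        self._listeners.setdefault(char_type, []).append(callback)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:
            for callback in self._listeners.get(classify(ch), []):
                callback(ch)


class ClauseCollector:
    """A listener that accumulates characters and closes a clause on a delimiter."""

    def __init__(self) -> None:
        self.clauses: List[str] = []
        self._buffer: List[str] = []

    def on_char(self, ch: str) -> None:
        self._buffer.append(ch)

    def on_clause_end(self, ch: str) -> None:
        if self._buffer:
            self.clauses.append("".join(self._buffer))
            self._buffer = []


parser = Parser()
collector = ClauseCollector()
for char_type in ("date_unit", "digit", "other"):
    parser.subscribe(char_type, collector.on_char)
parser.subscribe("clause_end", collector.on_clause_end)
parser.parse("据新华网4月1日报道，会议举行。")
```

Keeping the Parser unaware of what its listeners do is what makes the framework general: a different extraction task would register different callbacks without changing the parsing loop.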
Information source extraction is in fact one concrete Listener implementation of the general extraction framework; it completes the extraction by continually responding to character-read events. Fig. 7 is a schematic diagram of the message extraction steps in an embodiment of the present invention. As shown in Fig. 7, the steps of this flow are as follows:
Step S301: we use punctuation marks such as "," to split the text into clauses, and then perform information source extraction clause by clause.
Step S302: we extract a candidate information source (usually enclosed in " " or " ") or a list of candidate information sources.
Step S303: if a candidate information source exists, we extract the useful elements and their positions from the clause. These useful elements and their positions help locate the true information source and judge its type. Here, the useful elements are of the following types:
A) Media name indicator words, such as "Times", "net", "news", "blog", "mhkc", and "evening paper". A candidate news source string that ends with a media name indicator word may be a true media name, such as "Sina's blog" or "Maeil Business Newspaper".
B) Date information. A candidate news source is often followed by the report date, such as "June 24-25" or "April 1".
C) Media reporting-action words, such as "message", "report", "reprinting", "comment", "publication", and "issue". These indicate that the clause may describe a news reporting action, which helps judge whether the candidate news source is a true news source.
D) Media indicator words, such as "within the border", "according to", "media", and "website". These usually appear around a candidate news source and indicate that the candidate string may be a media noun.
Step S304: on this basis, we can easily match the predefined rules one by one to judge whether the candidate information source (if any) is a true information source.
For example, one of the simplest heuristic rules is as follows: if a clause contains only one candidate information source and a media reporting-action word, and at the same time satisfies one of the following conditions, the candidate information source is judged to be a true information source:
A) The candidate news source string ends with a media name indicator word.
B) Date information appears in the clause containing the candidate news source string.
C) Media indicator words such as "within the border", "according to", "media", or "website" appear around the candidate news source string.
For example, the clause: domestic "NGO develops AC network" published an article on March 11, satisfies the above heuristic rule, so "NGO develops AC network" can be extracted as the information source.
The heuristic rules here are mainly formulated manually by observing messages. They may include many complex rules, and rules are continually added or modified. We have implemented an efficient, extensible information extraction system that flexibly supports adding and modifying rules.
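The simple heuristic rule above can be written as a small predicate. This Python sketch is an interpretation, not the patent's rule engine: the `Element` record and its `kind` labels are hypothetical, and condition C is simplified to "a media indicator word appears anywhere in the clause" instead of checking its distance from the candidate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    kind: str  # "media_name", "date", "report_action", or "media_indicator"
    text: str


def is_true_source(candidates: List[str], elements: List[Element]) -> bool:
    """One candidate, a reporting-action word, and at least one supporting cue."""
    if len(candidates) != 1:
        return False
    kinds = {e.kind for e in elements}
    if "report_action" not in kinds:
        return False
    candidate = candidates[0]
    # Condition A: the candidate string ends with a media name indicator word.
    ends_with_media_name = any(
        e.kind == "media_name" and candidate.endswith(e.text) for e in elements
    )
    # Condition B: the clause contains date information.
    has_date = "date" in kinds
    # Condition C (simplified): a media indicator word occurs in the clause.
    has_indicator = "media_indicator" in kinds
    return ends_with_media_name or has_date or has_indicator


# Hypothetical clause elements.
elements = [
    Element("media_indicator", "据"),
    Element("media_name", "网"),
    Element("report_action", "报道"),
]
accepted = is_true_source(["新华网"], elements)
```

Expressing each rule as an independent predicate over the element sequence is one way to keep rules easy to add or revise, which matches the stated goal of an extensible rule set.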
Step S305: we further judge the type of each extracted information source; the types comprise news media, forum, blog, and microblog. For blogs and microblogs, we additionally extract the user name and the blog or microblog site information. Here too we have formulated a series of rules, and the information source type is determined by matching them one by one. These rules use the useful element information provided by step S303, including the media name indicator word (if any) in the information source name and the other element information around it. For example, for the text: the micro-blog user of the www.xinhuanet.com "XXXX" advises, the extracted information source type is microblog, the user name is "XXXX", and the microblog site is "www.xinhuanet.com's microblogging".
Step S306: we output all information sources in the message together with their type information.
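The type judgment of step S305 can be sketched as a list of ordered rules. Everything specific in this Python sketch is an assumption: the suffix rules in `TYPE_RULES`, the default of "news media" when nothing matches, the `USER_PATTERN` expression, and the Chinese sample text, which is an assumed rendering of the translated microblog example.

```python
import re
from typing import Dict, Optional

# Hypothetical suffix-to-type rules, checked in order; not the patent's rule set.
TYPE_RULES = [("微博", "microblog"), ("博客", "blog"), ("论坛", "forum")]
# Assumed pattern: the word for "user" followed by a quoted user name.
USER_PATTERN = re.compile(r'用户[“"](?P<user>[^”"]+)[”"]')


def classify_source(source: str, clause: str) -> Dict[str, Optional[str]]:
    """Return the source name, its assumed type, and a user name if applicable."""
    source_type = "news media"  # assumed default when no rule matches
    for suffix, label in TYPE_RULES:
        if source.endswith(suffix):
            source_type = label
            break
    user = None
    if source_type in ("blog", "microblog"):
        # Only blogs and microblogs carry a user name worth extracting.
        match = USER_PATTERN.search(clause)
        if match:
            user = match.group("user")
    return {"source": source, "type": source_type, "user": user}


result = classify_source("新华网微博", "新华网微博用户“XXXX”建议")
```

Rule order matters in this sketch: a more specific suffix must be listed before a more general one so that the first match wins.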
The present invention also provides a system for extracting the information source of a message, which uses the above method. Fig. 8 is a schematic structural diagram of the system. As shown in Fig. 8, the system comprises:
Message content adaptation module 1: shielding differences in the encoding or storage of messages and providing a unified interface for iteratively reading message characters;
Message parsing module 2: performing encoding parsing on the input text, extracting the characters in the text, and segmenting the characters into different clauses;
Information source extraction module 3: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from each clause, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base;
Information source statistics module 4: aggregating the extraction results for the information sources and computing statistics on the information sources.
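The description does not say which statistics the information source statistics module computes. As one plausible reading, the Python sketch below counts how often each source and each source type occurs across extraction results; the function name `summarize` and the sample results are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(results: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count occurrences per source name and per source type.

    Each result is an assumed (source name, source type) pair.
    """
    by_source: Counter = Counter()
    by_type: Counter = Counter()
    for source, source_type in results:
        by_source[source] += 1
        by_type[source_type] += 1
    return {"by_source": by_source, "by_type": by_type}


# Hypothetical extraction results from several messages.
extracted = [
    ("新华网", "news media"),
    ("新华网微博", "microblog"),
    ("新华网", "news media"),
]
stats = summarize(extracted)
```

Counts like these could then be used, for instance, to rank sources by how often they are cited.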
The message parsing module 2 further comprises:
Message character reading module 21: reads the message byte stream and assembles the bytes into actual characters according to the encoding;
Character type judging module 22: classifies characters into different types according to the character type recognition rules;
Event response module 23: according to the character type, notifies the user to perform the extraction operation for that type of character.
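The byte-to-character assembly and character typing performed by modules 21 and 22 could look roughly like the following sketch. It relies on Python's standard incremental decoders so that a multi-byte character split across two chunks is still assembled correctly. The type names ("whitespace", "punctuation", "digit", "text") are assumed for illustration; the patent's own character type recognition rules are not reproduced here.

```python
import codecs
import unicodedata
from typing import Iterable, Iterator, List, Tuple


def read_characters(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Assemble a byte stream into characters, one at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk arrives.
        yield from decoder.decode(chunk)
    yield from decoder.decode(b"", final=True)


def char_type(ch: str) -> str:
    """Classify a character (hypothetical categories)."""
    if ch.isspace():
        return "whitespace"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    if ch.isdigit():
        return "digit"
    return "text"


def parse(chunks: Iterable[bytes], encoding: str = "utf-8") -> List[Tuple[str, str]]:
    """Return (character, type) pairs for the whole stream."""
    return [(ch, char_type(ch)) for ch in read_characters(chunks, encoding)]
```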
The information source extraction module 3 further comprises:
Index building module 31: builds a TRIE keyword index according to the useful element base;
Clause segmentation module 32: segments the characters from the event response step into different sub-clauses;
Extraction processing module 33: performs keyword matching on the different sub-clauses according to the TRIE keyword index, extracts information sources, judges the authenticity of the information sources, and determines the information source type;
Output module 34: outputs the information sources and their information source types.
The extraction processing module 33 further comprises:
Information source extraction module 331: performs information source extraction sub-clause by sub-clause and, using the TRIE keyword index built from the useful element base, extracts a candidate information source or a list of candidate information sources;
Useful element extraction module 332: according to the candidate information source or list of candidate information sources, extracts from the sub-clause the useful elements and the position of each useful element within the sub-clause;
Real information source judging module 333: judges, by the predefined real information source recognition rules, whether a candidate information source is a real information source;
Information source type extraction module 334: determines the information source type by matching the predefined information source type recognition rules against the useful elements.
In the information source type extraction module 334, for information sources whose type is blog or microblog, the user name and blog site information are further extracted.
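The TRIE keyword index used by modules 31, 331 and 332 can be sketched as below. This is a simplified illustration, not the patented implementation: the element labels ("media_name", "report_action") are hypothetical, and the choice of taking the longest keyword at each position is an assumption that the text does not state.

```python
from typing import Dict, List, Optional, Tuple


class Trie:
    """Keyword index; each stored keyword carries an element label."""

    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.label: Optional[str] = None  # set when a keyword ends at this node

    def insert(self, keyword: str, label: str) -> None:
        node = self
        for ch in keyword:
            node = node.children.setdefault(ch, Trie())
        node.label = label

    def find_all(self, clause: str) -> List[Tuple[int, int, str, str]]:
        """Return (start, end, keyword, label) tuples found in a sub-clause.

        At each position the longest matching keyword is taken (an assumption).
        """
        matches = []
        i = 0
        while i < len(clause):
            node, j, best = self, i, None
            while j < len(clause) and clause[j] in node.children:
                node = node.children[clause[j]]
                j += 1
                if node.label is not None:
                    best = (i, j, clause[i:j], node.label)
            if best is not None:
                matches.append(best)
                i = best[1]  # continue after the matched keyword
            else:
                i += 1
        return matches
```

The returned start and end offsets correspond to the position information that module 332 records for each useful element.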
The message information source extraction system is described below with reference to a specific embodiment of the invention. Fig. 9 is a structural diagram of the message information source extraction system in this embodiment. As shown in Fig. 9, the system comprises the following four layers:
1) Message content adaptation layer: shields differences in message encoding, storage mode and the like, and provides upper-layer modules with a consistent iterative interface for reading message characters, so that upper-layer modules need only be concerned with the extraction logic.
2) Parser layer: implements the overall information extraction flow based on event response. The observer design pattern is adopted here: the Parser is in effect a subject (Subject) on which a series of observers (Observer) are registered. The overall flow is as follows: message characters are read iteratively through the content adaptation layer; each character read is treated as an event, and every observer is notified to execute its callback function to handle the event.
3) Extractor layer: in effect corresponds to an observer (Listener), which completes a specific information extraction function by implementing concrete event response actions. Information source extraction is one concrete implementation of the Extractor layer: from the input message content it extracts information sources of types such as news, forum, blog and microblog. It provides name normalization for news and forum information sources, and user name and site name extraction for blog and microblog information sources.
4) Information source statistics layer: traverses the message database, reads each message and performs information source extraction on its content. Finally, all extraction results are aggregated, statistics such as the occurrence count and message category distribution of the extracted information sources are computed, and the statistics are written to the database.
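The observer arrangement of the Parser and Extractor layers can be sketched as follows. The class and method names (`Parser`, `register`, `on_char`, `on_end`, `ClauseSplitter`) and the delimiter set are assumptions chosen for illustration; the patent only states that each character read is an event and that registered observers handle it in callbacks.

```python
class Parser:
    """Subject: reads characters and notifies every registered observer."""

    def __init__(self):
        self._observers = []

    def register(self, observer):
        self._observers.append(observer)

    def parse(self, chars):
        for ch in chars:                # each character read is one event
            for observer in self._observers:
                observer.on_char(ch)
        for observer in self._observers:
            observer.on_end()           # signal the end of the message


class ClauseSplitter:
    """Observer: collects characters into sub-clauses at delimiters."""

    DELIMS = set(",.;!?,。;!?")      # hypothetical delimiter set

    def __init__(self):
        self.clauses = []
        self._buf = []

    def on_char(self, ch):
        if ch in self.DELIMS:
            self._flush()
        else:
            self._buf.append(ch)

    def on_end(self):
        self._flush()

    def _flush(self):
        clause = "".join(self._buf).strip()
        if clause:
            self.clauses.append(clause)
        self._buf = []
```

With this arrangement, a new extraction function can be added by registering another observer, without changing the Parser.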
Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the protection scope of the appended claims.
Claims (20)
1. A message information source extraction method, characterized in that the method extracts the information sources in a message by matching keywords of an information source extraction rule base, and determines the information source type by matching rules of the information source extraction rule base, the method comprising:
Message parsing step: according to the input text, extracting the characters in the text, and segmenting the characters into different sub-clauses;
Information source extraction step: performing keyword matching on the sub-clauses according to the information source extraction rule base, extracting a useful element sequence from the sub-clauses, extracting information sources from the useful element sequence, and determining the information source type by matching the rules of the information source extraction rule base.
2. The message information source extraction method according to claim 1, characterized in that the information source extraction rule base further comprises: a useful element base, real information source recognition rules, information source type recognition rules and character type recognition rules.
3. The message information source extraction method according to claim 1, characterized in that, before the message parsing step, the method further comprises:
Message content adaptation step: shielding differences in message encoding or storage mode, and providing a unified iterative interface for reading message characters.
4. The message information source extraction method according to claim 3, characterized in that the method further comprises:
Information source statistics step: aggregating the information source extraction results, and computing statistics on the information sources.
5. The message information source extraction method according to claim 1 or 2, characterized in that the message parsing step further comprises:
Message character reading step: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;
Character type judging step: classifying characters into different types according to the character type recognition rules;
Event response step: according to the character type, notifying the user to perform the extraction operation for that type of character.
6. The message information source extraction method according to claim 1, characterized in that the information source extraction step further comprises:
Index building step: building a TRIE keyword index according to the useful element base;
Clause segmentation step: segmenting the characters from the event response step into different sub-clauses;
Extraction processing step: performing keyword matching on the different sub-clauses according to the TRIE keyword index, extracting information sources, judging the authenticity of the information sources, and determining the information source type;
Output step: outputting the information sources and their information source types.
7. The message information source extraction method according to claim 6 or 2, characterized in that the extraction processing step further comprises:
Information source extraction step: performing information source extraction sub-clause by sub-clause and, using the TRIE keyword index built from the useful element base, extracting a candidate information source or a list of candidate information sources;
Useful element extraction step: according to the candidate information source or list of candidate information sources, extracting from the sub-clause the useful elements and the position of each useful element within the sub-clause;
Real information source judging step: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction step: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
8. The message information source extraction method according to claim 2, characterized in that the useful element base contains useful elements, the useful elements comprising: media name indicator words, date information, media reporting action words and media indicator words.
9. The message information source extraction method according to claim 2, characterized in that the real information source recognition rules are heuristic rules formulated manually by observing messages, and rules can be added or modified.
10. The message information source extraction method according to claim 9, characterized in that the real information source recognition rules comprise a heuristic rule: if a sub-clause contains only one candidate information source and contains a media reporting action word, and the candidate information source string ends with a media name indicator word, or date information appears in the sub-clause containing the candidate information source string, or a media indicator word appears in the characters following the candidate information source, then the candidate information source is judged to be a real information source.
11. The message information source extraction method according to claim 1, characterized in that the information source type comprises: news media, forum, blog and microblog.
12. The message information source extraction method according to claim 7, characterized in that, in the information source type extraction step, for an information source whose type is blog and/or microblog, the user name or blog site information is further extracted.
13. A message information source extraction system adopting the message information source extraction method according to any one of claims 1-12, characterized in that the system comprises:
Message parsing module: decoding the input text, extracting the characters in the text, and segmenting the characters into different sub-clauses;
Information source extraction module: performing keyword matching on the sub-clauses according to the information source extraction rule base, extracting a useful element sequence from the sub-clauses, extracting information sources from the useful element sequence, and determining the information source type by matching the rules of the information source extraction rule base.
14. The message information source extraction system according to claim 13, characterized in that the information source extraction rule base further comprises: a useful element base, real information source recognition rules, information source type recognition rules and character type recognition rules.
15. The message information source extraction system according to claim 13, characterized in that the system further comprises:
Message content adaptation module: shielding differences in message encoding or storage mode, and providing a unified iterative interface for reading message characters.
16. The message information source extraction system according to claim 13 or 14, characterized in that the system further comprises:
Information source statistics module: aggregating the information source extraction results, and computing statistics on the information sources.
17. The message information source extraction system according to claim 13, characterized in that the message parsing module further comprises:
Message character reading module: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;
Character type judging module: classifying characters into different types according to the character type recognition rules;
Event response module: according to the character type, notifying the user to perform the extraction operation for that type of character.
18. The message information source extraction system according to claim 13, characterized in that the information source extraction module further comprises:
Index building module: building a TRIE keyword index according to the useful element base;
Clause segmentation module: segmenting the characters from the event response step into different sub-clauses;
Extraction processing module: performing keyword matching on the different sub-clauses according to the TRIE keyword index, extracting information sources, judging the authenticity of the information sources, and determining the information source type;
Output module: outputting the information sources and their information source types.
19. The message information source extraction system according to claim 18 or 14, characterized in that the extraction processing module further comprises:
Information source extraction module: performing information source extraction sub-clause by sub-clause and, using the TRIE keyword index built from the useful element base, extracting a candidate information source or a list of candidate information sources;
Useful element extraction module: according to the candidate information source or list of candidate information sources, extracting from the sub-clause the useful elements and the position of each useful element within the sub-clause;
Real information source judging module: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction module: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
20. The message information source extraction system according to claim 19, characterized in that, in the information source type extraction module, for an information source whose type is blog or microblog, the user name and blog site information are further extracted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410010836.XA CN103778200B (en) | 2014-01-09 | 2014-01-09 | Method for extracting information source of message and system thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410010836.XA CN103778200B (en) | 2014-01-09 | 2014-01-09 | Method for extracting information source of message and system thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103778200A (en) | 2014-05-07 |
CN103778200B (en) | 2017-08-08 |
Family
ID=50570435
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410010836.XA (Active) CN103778200B (en) | 2014-01-09 | 2014-01-09 | Method for extracting information source of message and system thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103778200B (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100476800C (en) * | 2007-06-22 | 2009-04-08 | 腾讯科技(深圳)有限公司 | Method and system for cutting index participle |
CN101344889B (en) * | 2008-07-31 | 2011-04-13 | 中国农业大学 | Method and system for network information extraction |
CN101727461B (en) * | 2008-10-13 | 2012-11-21 | 中国科学院计算技术研究所 | Method for extracting content of web page |
CN102999534A (en) * | 2011-09-19 | 2013-03-27 | 北京金和软件股份有限公司 | Chinese word segmentation algorithm based on reverse maximum matching |
CN103150432B (en) * | 2013-03-07 | 2016-05-11 | 宁波成电泰克电子信息技术发展有限公司 | A kind of Internet public opinion analysis method |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104408101B (en) * | 2014-11-19 | 2018-01-09 | 南京大学 | A kind of full range Web information extracts integrated approach |
CN104408101A (en) * | 2014-11-19 | 2015-03-11 | 南京大学 | Whole-process Web information extraction integration method |
CN106815203B (en) * | 2015-12-01 | 2021-03-30 | 北京国双科技有限公司 | Method and device for analyzing amount of money in referee document |
CN106815203A (en) * | 2015-12-01 | 2017-06-09 | 北京国双科技有限公司 | A kind of amount of money analysis method and device in judgement document |
CN105447202A (en) * | 2015-12-31 | 2016-03-30 | 宁波公众信息产业有限公司 | Internet information collecting system |
CN106021439A (en) * | 2016-05-16 | 2016-10-12 | 腾讯科技(深圳)有限公司 | Communication number processing method and device |
CN106484767B (en) * | 2016-09-08 | 2019-06-21 | 中国科学院信息工程研究所 | A kind of event extraction method across media |
CN106484767A (en) * | 2016-09-08 | 2017-03-08 | 中国科学院信息工程研究所 | A kind of event extraction method across media |
CN108268438A (en) * | 2016-12-30 | 2018-07-10 | 腾讯科技(深圳)有限公司 | A kind of content of pages extracting method, device and client |
CN108268438B (en) * | 2016-12-30 | 2021-10-22 | 腾讯科技(深圳)有限公司 | Page content extraction method and device and client |
CN107423279A (en) * | 2017-04-11 | 2017-12-01 | 美林数据技术股份有限公司 | A kind of information extraction and analysis method of credit financing short message |
CN107169061A (en) * | 2017-05-02 | 2017-09-15 | 广东工业大学 | A kind of text multi-tag sorting technique for merging double information sources |
CN111090744A (en) * | 2019-12-17 | 2020-05-01 | 中科鼎富(北京)科技发展有限公司 | Stock market operation risk information mining method and device |
CN112380257A (en) * | 2020-11-26 | 2021-02-19 | 厦门市美亚柏科信息股份有限公司 | Network data stream locking method, terminal equipment and storage medium |
CN112597405A (en) * | 2020-12-17 | 2021-04-02 | 中国科学院计算技术研究所数字经济产业研究院 | Event external information source extraction method based on microblog platform |
Also Published As
Publication number | Publication date |
---|---|
CN103778200B (en) | 2017-08-08 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN103778200A (en) | Method for extracting information source of message and system thereof | |
Jung | Online named entity recognition method for microtexts in social networking services: A case study of twitter | |
CN100405371C (en) | Method and system for abstracting new word | |
CN103617169B (en) | A kind of hot microblog topic extracting method based on Hadoop | |
Kumar et al. | Analyzing Twitter sentiments through big data | |
CN104572849A (en) | Automatic standardized filing method based on text semantic mining | |
CN101593200A (en) | Chinese Web page classification method based on the keyword frequency analysis | |
CN101695082B (en) | Service organization method based on relation mining and device thereof | |
CN103294664A (en) | Method and system for discovering new words in open fields | |
CN104820686A (en) | Network search method and network search system | |
Rao et al. | CMEE-IL: Code Mix Entity Extraction in Indian Languages from Social Media Text@ FIRE 2016-An Overview. | |
CN105718585B (en) | Document and label word justice correlating method and its device | |
CN103020159A (en) | Method and device for news presentation facing events | |
CN102207948A (en) | Method for generating incident statement sentence material base | |
CN102622453A (en) | Body-based food security event semantic retrieval system | |
CN104978332B (en) | User-generated content label data generation method, device and correlation technique and device | |
CN101566995A (en) | Method and system for integral release of internet information | |
CN102169496A (en) | Anchor text analysis-based automatic domain term generating method | |
CN103077207A (en) | Method and system for analyzing microblog happiness index | |
CN102508830A (en) | Method and system for extracting social network from news document | |
CN114064851A (en) | Multi-machine retrieval method and system for government office documents | |
CN114792145B (en) | Standard digital management maintenance system and method based on knowledge graph | |
CN103440343B (en) | Knowledge base construction method facing domain service target | |
Bhardwaj et al. | Web scraping using summarization and named entity recognition (ner) | |
CN116628328A (en) | Web API recommendation method and device based on functional semantics and structural interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |