CN103778200A - Method for extracting information source of message and system thereof - Google Patents

Info

Publication number
CN103778200A
Authority
CN
China
Prior art keywords
information source
message
extraction
character
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410010836.XA
Other languages
Chinese (zh)
Other versions
CN103778200B (en
Inventor
刘春阳
程工
张旭
王卿
程学旗
吴琼
徐学可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS, National Computer Network and Information Security Management Center filed Critical Institute of Computing Technology of CAS
Priority to CN201410010836.XA priority Critical patent/CN103778200B/en
Publication of CN103778200A publication Critical patent/CN103778200A/en
Application granted granted Critical
Publication of CN103778200B publication Critical patent/CN103778200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for extracting an information source of a message and a system thereof. According to the method, the information source in the message is extracted by matching key words of an information source extracting rule base, and an information source type is judged by matching rules of the information source extracting rule base. The method comprises the steps of message analysis and information source extraction, wherein the message analysis comprises the steps of extracting characters in a text according to the input text and carrying out segmenting treatment on the characters to obtain different sub-clauses, and the information source extraction comprises the steps of carrying out key word matching on the sub-clauses according to the information source extracting rule base, extracting a useful element sequence from the sub-clauses, extracting the information source on the useful element sequence, and judging the information source type by matching the rules of the information source extracting rule base.

Description

A message information source extraction method and system thereof
Technical field
The present invention relates to the field of text mining, and in particular to a message information source extraction method and system.
Background
In recent years, with the development of Internet technology, all kinds of information have spread widely across the network. The quality and credibility of this information vary greatly: there are relatively formal traditional news media, and there are also emerging media of lower credibility such as forums, blogs, and microblogs. How to extract useful information sources has therefore become a widely studied problem.
Information extraction (IE), as the name suggests, structures the information contained in text into a table-like organized form. The input to an information extraction system is raw text, and the output is information points in a fixed format. Information points are extracted from various documents and then integrated in a unified form; this is the main task of information extraction.
Information extraction technology does not attempt to fully understand an entire document. It analyzes only the parts of the document that contain relevant information, and which information counts as relevant is determined by the domain fixed when the system is designed.
Information extraction technology is very useful for extracting specific needed facts from a large number of documents. The Internet is exactly such a document collection: information on the same topic is usually scattered across different websites and presented in different forms. Gathering this information together and storing it in structured form would be highly beneficial.
Summary of the invention
The technical problem to be solved by the present invention is to provide a message information source extraction method and system thereof that overcome the low extraction efficiency and complicated operation of information extraction techniques in the prior art.
To achieve the above object, the invention provides a message information source extraction method, characterized in that the method extracts the information source in a message by matching keywords of an information source extraction rule base, and judges the information source type by matching rules of the information source extraction rule base. The method comprises:
Message parsing step: according to the input text, extracting the characters in the text, and segmenting the characters into different clauses;
Information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from the clauses, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base.
The above message information source extraction method is characterized in that the information source extraction rule base further comprises: a useful element library, real information source recognition rules, information source type recognition rules, and character type recognition rules.
The above message information source extraction method is characterized in that, before the message parsing step, the method further comprises:
Message content adaptation step: shielding differences in the encoding or storage mode of messages, and providing a unified interface for reading message characters iteratively.
The above message information source extraction method is characterized in that the method further comprises:
Information source statistics step: aggregating the extraction results for the information sources, and computing statistical information about the information sources.
The above message information source extraction method is characterized in that the message parsing step further comprises:
Message character reading step: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;
Character type judgment step: classifying the characters into different types according to the character type recognition rules;
Event response step: according to the type of each character, notifying the user to perform the extraction operation for characters of that type.
The above message information source extraction method is characterized in that the information source extraction step further comprises:
Index building step: building a TRIE keyword index according to the useful element library;
Clause segmentation step: segmenting the characters from the event response step into different clauses;
Extraction processing step: performing keyword matching on the different clauses according to the TRIE keyword index, extracting information sources, judging the authenticity of the information sources, and determining the information source type;
Output step: outputting the information sources and the information source type information.
The above message information source extraction method is characterized in that the extraction processing step further comprises:
Information source extraction step: performing information source extraction clause by clause, and extracting a candidate information source or a candidate information source list using the TRIE keyword index built from the useful element library;
Useful element extraction step: according to the candidate information source or candidate information source list, extracting from the clause the useful elements and their position information within the clause;
Real information source judgment step: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction step: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
The above message information source extraction method is characterized in that the useful element library contains useful elements, and the useful elements comprise: media-name indicator words, date information, media reporting-action words, and media indicator words.
The above message information source extraction method is characterized in that the real information source recognition rules are heuristic rules, formulated manually by observing messages, and rules can be added or modified.
The above message information source extraction method is characterized in that the real information source recognition rules comprise a heuristic rule: if a clause contains only one candidate information source, a media reporting-action word appears, and in addition the candidate information source string ends with a media-name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
The above message information source extraction method is characterized in that the information source type comprises: news media, forum, blog, and microblog.
The above message information source extraction method is characterized in that, in the information source type extraction step, for an information source whose type is blog and/or microblog, the user name or blog website information is further extracted.
The present invention also provides a message information source extraction system that adopts the message information source extraction method, characterized in that the system comprises:
Message parsing module: according to the input text, performing encoding parsing, extracting the characters in the text, and segmenting the characters into different clauses;
Information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from the clauses, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base.
The above message information source extraction system is characterized in that the information source extraction rule base further comprises: a useful element library, real information source recognition rules, information source type recognition rules, and character type recognition rules.
The above message information source extraction system is characterized in that the system further comprises:
Message content adaptation module: shielding differences in the encoding or storage mode of messages, and providing a unified interface for reading message characters iteratively.
The above message information source extraction system is characterized in that the system further comprises:
Information source statistics module: aggregating the extraction results for the information sources, and computing statistical information about the information sources.
The above message information source extraction system is characterized in that the message parsing module further comprises:
Message character reading module: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;
Character type judgment module: classifying the characters into different types according to the character type recognition rules;
Event response module: according to the type of each character, notifying the user to perform the extraction operation for characters of that type.
The above message information source extraction system is characterized in that the information source extraction module further comprises:
Index building module: building a TRIE keyword index according to the useful element library;
Clause segmentation module: segmenting the characters from the event response step into different clauses;
Extraction processing module: performing keyword matching on the different clauses according to the TRIE keyword index, extracting information sources, judging the authenticity of the information sources, and determining the information source type;
Output module: outputting the information sources and the information source type information.
The above message information source extraction system is characterized in that the extraction processing module further comprises:
Information source extraction module: performing information source extraction clause by clause, and extracting a candidate information source or a candidate information source list using the TRIE keyword index built from the useful element library;
Useful element extraction module: according to the candidate information source or candidate information source list, extracting from the clause the useful elements and their position information within the clause;
Real information source judgment module: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction module: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
The above message information source extraction system is characterized in that, in the information source type extraction module, for an information source whose type is blog or microblog, the user name and blog website information are further extracted.
Compared with the prior art, the beneficial effects of the present invention are:
1. The invention builds on a general, event-response-based information extraction framework that can be flexibly extended to implement specific extraction tasks.
2. The invention effectively integrates the information source extraction rule base, extracts information sources from messages, and judges their types, which improves message information source extraction efficiency and reduces operational difficulty.
Brief description of the drawings
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention;
Fig. 2 is a schematic diagram of the message parsing step of the present invention;
Fig. 3 is a schematic diagram of the information source extraction step of the present invention;
Fig. 4 is a schematic diagram of the extraction processing step of the present invention;
Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention;
Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the present invention;
Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the present invention;
Fig. 8 is a structural diagram of the message information source extraction system of the present invention;
Fig. 9 is a structural diagram of the message information source extraction system of a specific embodiment of the present invention.
Reference numerals:
1 message content adaptation module
2 message parsing module
3 information source extraction module
4 information source statistics module
21 message character reading module
22 character type judgment module
23 event response module
31 index building module
32 clause segmentation module
33 extraction processing module
34 output module
331 information source extraction module
332 useful element extraction module
333 real information source judgment module
334 information source type extraction module
S1~S4, S11~S13, S21~S24, S231~S234, S100~S102, S1031~S1034: steps performed in the embodiments of the present invention.
Detailed description
Specific embodiments of the present invention are given below, and the invention is described in detail with reference to the drawings.
Fig. 1 is a schematic diagram of the steps of the message information source extraction method of the present invention. As shown in Fig. 1, the invention provides a message information source extraction method that extracts the information source in a message by matching keywords of the information source extraction rule base, and judges the information source type by matching rules of the information source extraction rule base. The method comprises:
Message content adaptation step S1: shielding differences in the encoding or storage mode of messages, and providing a unified interface for reading message characters iteratively;
Message parsing step S2: according to the input text, extracting the characters in the text, and segmenting the characters into different clauses;
Information source extraction step S3: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful element sequence from the clauses, extracting the information source from the useful element sequence, and judging the information source type by matching the rules of the information source extraction rule base;
Information source statistics step S4: aggregating the extraction results for the information sources, and computing statistical information about the information sources.
The information source extraction rule base further comprises: a useful element library, real information source recognition rules, information source type recognition rules, and character type recognition rules.
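As an illustration of the message content adaptation step S1, the following is a minimal Python sketch of a unified character iterator over byte streams in different encodings. The function name `iter_message_chars`, the `chunk_size` parameter, and the sample text are assumptions made for this example; the patent does not specify an interface or a programming language.

```python
import codecs
import io
from typing import BinaryIO, Iterator


def iter_message_chars(stream: BinaryIO, encoding: str = "utf-8",
                       chunk_size: int = 4096) -> Iterator[str]:
    """Yield decoded characters one at a time, whatever the byte encoding."""
    # An incremental decoder buffers partial multi-byte sequences that
    # straddle chunk boundaries, so every yielded item is a whole character.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield from decoder.decode(chunk)
    # Flush anything the decoder is still holding at end of input.
    yield from decoder.decode(b"", final=True)


# Hypothetical sample message stored in two different encodings.
sample = "据新华网报道"
chars_gbk = list(iter_message_chars(io.BytesIO(sample.encode("gbk")), "gbk", chunk_size=3))
chars_utf8 = list(iter_message_chars(io.BytesIO(sample.encode("utf-8")), "utf-8", chunk_size=4))
```

Both calls produce the same character sequence even though the chunk sizes deliberately cut through multi-byte characters, which is the property the later parsing steps rely on.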
Fig. 2 is a schematic diagram of the message parsing step of the present invention. As shown in Fig. 2, the message parsing step S2 further comprises:
Message character reading step S21: reading the message byte stream, and assembling the bytes into actual characters according to the encoding;
Character type judgment step S22: classifying the characters into different types according to the character type recognition rules;
Event response step S23: according to the type of each character, notifying the user to perform the extraction operation for characters of that type.
Fig. 3 is a schematic diagram of the information source extraction step of the present invention. As shown in Fig. 3, the information source extraction step S3 further comprises:
Index building step S31: building a TRIE keyword index according to the useful element library;
Clause segmentation step S32: segmenting the characters from the event response step into different clauses;
Extraction processing step S33: performing keyword matching on the different clauses according to the TRIE keyword index, extracting information sources, judging the authenticity of the information sources, and determining the information source type;
Output step S34: outputting the information sources and the information source type information.
Fig. 4 is a schematic diagram of the detailed steps of the message information source extraction method of the present invention. As shown in Fig. 4, the extraction processing step S33 further comprises:
Information source extraction step S331: performing information source extraction clause by clause, and extracting a candidate information source or a candidate information source list using the TRIE keyword index built from the useful element library;
Useful element extraction step S332: according to the candidate information source or candidate information source list, extracting from the clause the useful elements and their position information within the clause;
Real information source judgment step S333: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction step S334: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
The useful element library contains useful elements, and the useful elements comprise: media-name indicator words, date information, media reporting-action words, and media indicator words.
The real information source recognition rules are heuristic rules, formulated manually by observing messages, and rules can be added or modified.
Further, the real information source recognition rules of the present invention comprise a heuristic rule: if a clause contains only one candidate information source, a media reporting-action word appears, and in addition the candidate information source string ends with a media-name indicator word, or date information appears in the clause containing the candidate information source string, or a media indicator word appears around the candidate information source string, then the candidate information source is judged to be a real information source.
The information source type comprises: news media, forum, blog, and microblog.
In the information source type extraction step S334, for an information source whose type is blog and/or microblog, the user name or blog website information is further extracted.
The steps of a specific embodiment of the invention are described below with reference to the drawings. Fig. 5 is a schematic diagram of the steps of an embodiment of the message information source extraction method of the present invention. As shown in Fig. 5, the operating steps of this specific embodiment illustrate the message information source extraction process.
The object of the invention is to provide a user-friendly information extraction technique that extracts the information sources appearing in a message, automatically analyzes the type (news, forum, blog, microblog) and name of each information source, and extracts the user names of blogs and microblogs.
To achieve this object, the invention provides a rule-matching-based method and a rule base for information source extraction, comprising the following steps:
Step S100: Read the rule base, extract the keywords and their type information from it, and build the TRIE keyword index.
Step S101: According to the input text, perform encoding parsing and extract a character stream from the text, such as Chinese characters and punctuation marks.
Step S102: Perform sentence segmentation, dividing the input text into different clauses.
Step S103: Process each clause with the following steps:
Step S1031: Use the TRIE tree index built in advance to perform multi-keyword matching and date matching, dividing the clause into a sequence of "useful elements" while recording the position of each "useful element" in the clause. Useful elements include media-name indicator words, media reporting-action words, media indicator words, and so on.
Step S1032: Match the various predefined rules one by one against the useful element sequence. If a candidate information source exists, extract it and determine whether it is a real information source.
Step S1033: By matching predefined rules, further judge the type of each extracted information source.
Step S1034: Output the results.
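Steps S100 and S1031 can be sketched as follows in Python: a dictionary-based TRIE is built from a keyword table, and each clause is scanned with longest-match lookup to produce a useful element sequence with positions. The keyword table, the type labels, and the function names are hypothetical stand-ins for the rule base, which the patent does not list in full. Date matching is omitted here for brevity.

```python
from typing import Dict, List, Tuple

END = "$type"  # sentinel key marking the end of a keyword; never a single character


def build_trie(keywords: Dict[str, str]) -> dict:
    """Build a nested-dict TRIE mapping each keyword to its element type."""
    root: dict = {}
    for word, kind in keywords.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = kind
    return root


def match_elements(clause: str, trie: dict) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, text, type) for each longest keyword match, left to right."""
    found = []
    i = 0
    while i < len(clause):
        node, best, j = trie, None, i
        while j < len(clause) and clause[j] in node:
            node = node[clause[j]]
            j += 1
            if END in node:
                best = (i, j, clause[i:j], node[END])
        if best:
            found.append(best)
            i = best[1]  # continue after the matched keyword
        else:
            i += 1
    return found


# Hypothetical excerpt of a useful element library.
ELEMENTS = {
    "报道": "report_action", "消息": "report_action",
    "据": "media_indicator", "媒体": "media_indicator",
    "网": "media_name_indicator", "晚报": "media_name_indicator",
}
trie = build_trie(ELEMENTS)
elements = match_elements("据新华网报道", trie)
```

Recording the start and end offsets alongside each element is what lets later rules reason about whether an indicator word sits inside, at the end of, or around a candidate source string.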
Fig. 6 is a schematic diagram of the message parsing steps of an embodiment of the present invention. As shown in Fig. 6, message parsing consists of three steps:
Step S200: Read a message character. The Parser reads one character through the message character iteration interface; that is, the interface reads the message byte stream, assembles the bytes into an actual character (for example, one Chinese character) according to the corresponding encoding, and returns it to the Parser.
Step S201: Judge the character type. Characters are classified into different types according to the functional role they play in extracting different elements, such as year, month, and day markers and certain special punctuation marks.
Step S202: Notify the Listeners to respond to the event. According to the character type, each Listener (observer) is notified to execute its corresponding callback function in response to the character-read event.
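The three parsing steps above follow the observer pattern, which can be sketched in Python as below. The class names `Parser` and `ClauseCollector`, the type labels, and the punctuation set are assumptions for illustration only; the patent names a Parser and Listeners but does not define their interfaces.

```python
from typing import Callable, Iterable, List

CLAUSE_PUNCT = set(",，。;；!！?？")  # assumed clause-ending marks


def char_type(ch: str) -> str:
    """Classify a character by its role in extraction (simplified rule set)."""
    if ch in CLAUSE_PUNCT:
        return "punct"
    if ch in "年月日":
        return "date_unit"
    if ch.isdigit():
        return "digit"
    return "text"


class Parser:
    """Reads characters, classifies them, and notifies registered listeners."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str, str], None]] = []

    def add_listener(self, callback: Callable[[str, str], None]) -> None:
        self._listeners.append(callback)

    def parse(self, chars: Iterable[str]) -> None:
        for ch in chars:
            kind = char_type(ch)
            for callback in self._listeners:
                callback(ch, kind)  # character-read event


class ClauseCollector:
    """A listener that accumulates characters and closes a clause on punctuation."""

    def __init__(self) -> None:
        self.clauses: List[str] = []
        self._buffer: List[str] = []

    def on_char(self, ch: str, kind: str) -> None:
        if kind == "punct":
            self.flush()
        else:
            self._buffer.append(ch)

    def flush(self) -> None:
        if self._buffer:
            self.clauses.append("".join(self._buffer))
            self._buffer = []


parser = Parser()
collector = ClauseCollector()
parser.add_listener(collector.on_char)
parser.parse("据新华网3月11日报道，事件引发关注。")
collector.flush()  # close any clause left open at end of input
```

Because the parser only dispatches events, a new extraction task can be added by registering another listener without changing the parsing code, which matches the extensibility the framework is described as having.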
Information source extraction is in fact one concrete Listener implementation within the general extraction framework; it completes the information source extraction function by continuously responding to character-read events. Fig. 7 is a schematic diagram of the message extraction steps of an embodiment of the present invention. As shown in Fig. 7, the concrete steps of this flow are as follows:
Step S301: Punctuation marks such as "," are used to split the text into clauses, and information source extraction is then performed clause by clause.
Step S302: Extract the candidate information source (usually enclosed in paired quotation marks) or the candidate information source list.
Step S303: If a candidate information source exists, extract the useful elements and their position information from the clause. These useful elements and their positions help locate the real information source and judge its type. Here, useful elements include the following types:
A) Media-name indicator words, such as "Times", "net", "news", "blog", "mhkc", and "evening paper". A candidate news source string that ends with a media-name indicator word is likely to be a real media name, such as "Sina's blog" or "Maeil Business Newspaper".
B) Date information. A candidate news source is often followed by the report date, such as "June 24-25" or "April 1".
C) Media reporting-action words, such as "message", "report", "reprinting", "comment", "publication", and "issue". These indicate that the clause may be stating a news reporting action, which helps judge whether the candidate news source is a real news source.
D) Media indicator words, such as "within the border", "according to", "media", and "website". These usually appear around a candidate news source and indicate that the candidate news source string may be a media name.
Step S304: On this basis, the various predefined rules can easily be matched one by one to judge whether the candidate information source (if any) is a real information source.
For example, one of the simplest heuristic rules is as follows: if a clause contains only one candidate information source, a media reporting-action word appears, and one of the following conditions is also met, the candidate information source can be judged to be a real information source:
A) The candidate news source string ends with a media-name indicator word.
B) Date information appears in the clause containing the candidate news source string.
C) Media indicator words such as "within the border", "according to", "media", or "website" appear around the candidate news source string.
For example, the clause: within the border, "NGO Development Exchange Network" published an article on March 11, satisfies the above heuristic rule, so "NGO Development Exchange Network" can be extracted as the information source.
The heuristic rules here are mainly formulated manually by observing messages. They may include many complex rules, and rules are continually added or modified. An efficient, extensible information extraction system has been implemented that flexibly supports adding or modifying rules.
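The simplest heuristic rule described above can be sketched in Python as follows. The word lists are assumed back-translations of the examples in the text, the paired marks used to find candidates are an assumption, and "around the candidate" is simplified to "anywhere in the clause outside the candidate"; none of these details are fixed by the patent.

```python
import re
from typing import List

# Assumed word lists; the real rule base is not given in full.
REPORT_ACTIONS = ("报道", "消息", "转载", "评论")
MEDIA_NAME_INDICATORS = ("网", "新闻", "博客", "晚报", "时报")
MEDIA_INDICATORS = ("据", "媒体", "网站")

CANDIDATE_RE = re.compile(r"[“《]([^”》]+)[”》]")  # assumed candidate delimiters
DATE_RE = re.compile(r"\d{1,2}月\d{1,2}日")
CLAUSE_SPLIT_RE = re.compile(r"[,，。;；]")


def split_clauses(text: str) -> List[str]:
    """Split text into non-empty clauses at clause-ending punctuation."""
    return [clause for clause in CLAUSE_SPLIT_RE.split(text) if clause]


def real_sources(clause: str) -> List[str]:
    """Apply the single-candidate heuristic; return the accepted source, if any."""
    candidates = CANDIDATE_RE.findall(clause)
    if len(candidates) != 1:
        return []  # the rule only covers clauses with exactly one candidate
    candidate = candidates[0]
    context = CANDIDATE_RE.sub("", clause)  # clause text outside the candidate
    if not any(word in context for word in REPORT_ACTIONS):
        return []  # no reporting action, so no evidence of a news report
    ends_with_media_name = candidate.endswith(MEDIA_NAME_INDICATORS)  # condition A
    has_date = DATE_RE.search(context) is not None                    # condition B
    has_indicator = any(w in context for w in MEDIA_INDICATORS)       # condition C
    return [candidate] if (ends_with_media_name or has_date or has_indicator) else []


# Hypothetical sample: the first clause cites a source, the second only quotes speech.
clauses = split_clauses("据“新华网”3月11日报道，他说“很好”。")
results = [real_sources(clause) for clause in clauses]
```

Keeping each condition as a separate named boolean makes it straightforward to add or modify rules, in line with the statement that the rule set keeps evolving.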
Step S305: Further judge the type of each extracted information source, the types being news media, forum, blog, and microblog. For blogs and microblogs, the user name and the blog or microblog site information are further extracted. Here, too, a series of rules has been formulated, and the information source type is determined by matching the rules one by one. These rules use the useful element information provided by step S303, including any media-name indicator word in the information source name and the other element information around it. For example, for the text: www.xinhuanet.com microblog user "XXXX" suggests, the extracted information source type is microblog, the user name is "XXXX", and the microblog site is "www.xinhuanet.com's microblogging".
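A type rule of the kind described in step S305 might look like the following Python sketch, which classifies a source by the indicator word at the end of its name and pulls out the user name for blog and microblog sources. The suffix table, the regular expression, the `Source` record, and the Chinese sample clause are all assumptions for illustration; the patent does not give the actual rules.

```python
import re
from typing import NamedTuple, Optional

# Assumed mapping from trailing media-name indicator word to source type.
TYPE_BY_SUFFIX = (("微博", "microblog"), ("博客", "blog"), ("论坛", "forum"))

# Assumed pattern: <site ending in 微博/博客> 用户 “<user name>”
USER_RE = re.compile(r"(?P<site>[^，。“”]+?(?:微博|博客))用户“(?P<user>[^”]+)”")


class Source(NamedTuple):
    name: str
    kind: str
    user: Optional[str] = None


def classify(name: str) -> str:
    """Pick the source type from the indicator word that ends the name."""
    for suffix, kind in TYPE_BY_SUFFIX:
        if name.endswith(suffix):
            return kind
    return "news_media"  # fallback when no blog, microblog, or forum marker is found


def extract_user_source(clause: str) -> Optional[Source]:
    """Extract site, type, and user name from a blog or microblog attribution."""
    match = USER_RE.search(clause)
    if match is None:
        return None
    site = match.group("site")
    return Source(site, classify(site), match.group("user"))


source = extract_user_source("新华网微博用户“XXXX”建议")
```

Ordering the suffix table from most to least specific is a design choice of this sketch, so that a name ending in a longer indicator is not misclassified by a shorter one.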
Step S306: Output all information sources in the message together with their type information.
The present invention also provides a message information source extraction system that adopts the message information source extraction method. Fig. 8 is a structural diagram of the message information source extraction system of the present invention. As shown in Fig. 8, the system comprises:
Message content adaptation module 1: shields differences in message encoding or storage mode and provides a unified interface for iteratively reading message characters;
Message parsing module 2: parses the encoding of the input text, extracts the characters in the text, and segments the characters into clauses;
Information source extraction module 3: performs keyword matching on each clause according to the information source extraction rule base, extracts a useful-element sequence from the clause, extracts the information source from that sequence, and determines the information source type by matching rules in the information source extraction rule base;
Information source statistics module 4: aggregates the extraction results and computes statistics on the information sources.
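The aggregation performed by the statistics module might look like the following sketch. The input shape (pairs of source name and message category) and the function name `summarize` are assumptions made for illustration; the statistics computed here are occurrence counts and a per-source category distribution.

```python
from collections import Counter, defaultdict


def summarize(extractions):
    """Aggregate (source_name, message_category) pairs.

    Returns the occurrence count per source and, for each source,
    the distribution of message categories it appeared in.
    """
    counts = Counter()
    by_category = defaultdict(Counter)
    for source, category in extractions:
        counts[source] += 1
        by_category[source][category] += 1
    return counts, by_category
```

The two results could then be written to a database table keyed by source name.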
The message parsing module 2 further comprises:
Message character reading module 21: reads the message byte stream and assembles the bytes into actual characters according to the encoding;
Character type judging module 22: classifies characters into different types according to the character type recognition rules;
Event response module 23: according to the character type, notifies the user to perform the extraction operation for that type of character.
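The three sub-modules can be illustrated with a small sketch: bytes are assembled into characters with an incremental decoder, each character is classified, and a handler registered for that type is invoked. The type names, the set of clause-ending characters, and the functions `iter_chars`, `char_type`, and `dispatch` are hypothetical and are not taken from the patent.

```python
import codecs
import unicodedata

# Assumed clause-ending characters (full-width and ASCII).
SENTENCE_END = set("。！？；.!?;")


def iter_chars(chunks, encoding="utf-8"):
    """Assemble a stream of byte chunks into characters.

    The incremental decoder buffers a multi-byte character that is
    split across two chunks until all of its bytes have arrived.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        yield from decoder.decode(chunk)
    yield from decoder.decode(b"", final=True)


def char_type(ch):
    """Classify one character into a coarse, illustrative type."""
    if ch in SENTENCE_END:
        return "sentence_end"
    if ch.isspace():
        return "space"
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    if ch.isdigit():
        return "digit"
    return "other"


def dispatch(chunks, handlers, encoding="utf-8"):
    """Call the handler registered for each character's type, if any."""
    for ch in iter_chars(chunks, encoding):
        handler = handlers.get(char_type(ch))
        if handler is not None:
            handler(ch)
```

A caller could pass `handlers={"sentence_end": on_clause_end}` to be notified whenever a clause boundary is read.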
The information source extraction module 3 further comprises:
Index building module 31: builds a TRIE keyword index from the useful element base;
Clause segmentation module 32: segments the characters delivered by the event response step into clauses;
Extraction processing module 33: performs keyword matching on each clause according to the TRIE keyword index, extracts information sources, judges whether each information source is real, and determines the information source type;
Output module 34: outputs the information sources and their types.
The extraction processing module 33 further comprises:
Information source extraction module 331: performs information source extraction clause by clause and, using the TRIE keyword index built from the useful element base, extracts a candidate information source or a list of candidate information sources;
Useful element extraction module 332: based on the candidate information source or the candidate list, extracts from the clause the useful elements and their positions within the clause;
Real information source judging module 333: judges, by predefined real information source recognition rules, whether a candidate information source is a real information source;
Information source type extraction module 334: determines the information source type by matching predefined information source type recognition rules against the useful elements.
In the information source type extraction module 334, for information sources whose type is blog or microblog, the user name and blog site information are further extracted.
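A TRIE keyword index of the kind named above can be sketched with nested dictionaries. The class name `Trie`, the labels attached to keywords, and the longest-match scanning strategy are assumptions for illustration; the patent does not specify how its index resolves overlapping keywords.

```python
class Trie:
    """Minimal keyword trie built from nested dicts."""

    def __init__(self):
        self.root = {}

    def add(self, word, label):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = label  # the None key marks the end of a keyword

    def find_all(self, text):
        """Scan left to right, keeping the longest keyword at each position.

        Returns a list of (start, end, keyword, label) tuples, which gives
        both the useful elements and their positions in the clause.
        """
        hits = []
        i = 0
        while i < len(text):
            node, j, best = self.root, i, None
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if None in node:
                    best = (i, j, text[i:j], node[None])
            if best is not None:
                hits.append(best)
                i = best[1]  # continue after the matched keyword
            else:
                i += 1
        return hits
```

Each lookup walks at most the length of the longest keyword from a given position, so matching cost does not grow with the number of keywords in the base.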
The message information source extraction system is described below with reference to a specific embodiment of the present invention. Fig. 9 is a structural diagram of the message information source extraction system of this embodiment. As shown in Fig. 9, the system comprises the following four layers:
1) Message content adaptation layer: shields differences in message encoding, storage mode, and so on, and provides upper-layer modules with a consistent interface for iteratively reading message characters, so that upper-layer modules only need to be concerned with the extraction logic.
2) Parser layer: the overall event-driven information extraction flow. The observer design pattern is adopted here: the Parser is in effect a subject (Subject) with a series of registered observers (Observer). The overall flow is as follows: message characters are read iteratively through the content adaptation layer; each character read is treated as an event, and each observer is notified to execute its callback function to handle the event.
3) Extractor layer: in effect corresponds to an observer (Listener) which, by implementing concrete event response actions, performs concrete information extraction functions. Information source extraction is a concrete implementation of the Extractor layer: from the input message content it extracts information sources of types such as news, forum, blog, and microblog; it provides name normalization for news and forum information sources; and it provides user name and site name extraction for blog and microblog information sources.
4) Information source statistics layer: traverses the message database to read messages and performs information source extraction on each message's content. Finally, it aggregates all extraction results, computes statistics such as the occurrence count and message category distribution of the extracted information sources, and writes the statistics to the database.
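The relationship between the Parser layer and the Extractor layer can be sketched with the observer pattern. The class names `Parser` and `ClauseCollector`, the callback names `on_char` and `on_end`, and the delimiter set are hypothetical. The observer shown only groups characters into clauses, whereas a real Extractor would go on to perform source extraction.

```python
class Parser:
    """Subject: reads characters and notifies every registered observer."""

    def __init__(self, chars):
        self._chars = chars
        self._observers = []

    def register(self, observer):
        self._observers.append(observer)

    def run(self):
        # Each character read is one event delivered to all observers.
        for ch in self._chars:
            for observer in self._observers:
                observer.on_char(ch)
        # Signal end of message so observers can flush pending state.
        for observer in self._observers:
            observer.on_end()


class ClauseCollector:
    """Observer: splits the character stream into clauses at delimiters."""

    # Assumed delimiters; treating "." as one would also split URLs.
    DELIMITERS = set("，。！？；,.!?;")

    def __init__(self):
        self.clauses = []
        self._buffer = []

    def on_char(self, ch):
        if ch in self.DELIMITERS:
            self._flush()
        else:
            self._buffer.append(ch)

    def on_end(self):
        self._flush()

    def _flush(self):
        clause = "".join(self._buffer).strip()
        if clause:
            self.clauses.append(clause)
        self._buffer = []
```

Because the Parser only knows the callback interface, further extractors can be registered without changing the reading loop.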
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims (20)

1. A message information source extraction method, characterized in that the method extracts information sources from a message by matching keywords in an information source extraction rule base and determines the information source type by matching rules in the information source extraction rule base, the method comprising:
Message parsing step: according to the input text, extracting the characters in the text and segmenting the characters into clauses;
Information source extraction step: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful-element sequence from each clause, extracting the information source from the useful-element sequence, and determining the information source type by matching rules in the information source extraction rule base.
2. The message information source extraction method according to claim 1, characterized in that the information source extraction rule base further comprises: a useful element base, real information source recognition rules, information source type recognition rules, and character type recognition rules.
3. The message information source extraction method according to claim 1, characterized in that, before the message parsing step, the method further comprises:
Message content adaptation step: shielding differences in message encoding or storage mode and providing a unified interface for iteratively reading message characters.
4. The message information source extraction method according to claim 3, characterized in that the method further comprises:
Information source statistics step: aggregating the information source extraction results and computing statistics on the information sources.
5. The message information source extraction method according to claim 1 or 2, characterized in that the message parsing step further comprises:
Message character reading step: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judging step: classifying characters into different types according to the character type recognition rules;
Event response step: according to the character type, notifying the user to perform the extraction operation for that type of character.
6. The message information source extraction method according to claim 1, characterized in that the information source extraction step further comprises:
Index building step: building a TRIE keyword index from the useful element base;
Clause segmentation step: segmenting the characters delivered by the event response step into clauses;
Extraction processing step: performing keyword matching on each clause according to the TRIE keyword index, extracting information sources, judging whether each information source is real, and determining the information source type;
Output step: outputting the information sources and their types.
7. The message information source extraction method according to claim 6 or 2, characterized in that the extraction processing step further comprises:
Information source extraction step: performing information source extraction clause by clause and, using the TRIE keyword index built from the useful element base, extracting a candidate information source or a list of candidate information sources;
Useful element extraction step: based on the candidate information source or the candidate list, extracting from the clause the useful elements and their positions within the clause;
Real information source judging step: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction step: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
8. The message information source extraction method according to claim 2, characterized in that the useful element base contains useful elements, the useful elements comprising: media-name indicator words, date/time information, media reporting-action words, and media indicator words.
9. The message information source extraction method according to claim 2, characterized in that the real information source recognition rules are heuristic rules formulated manually by observing messages, and rules can be added or modified.
10. The message information source extraction method according to claim 9, characterized in that the real information source recognition rules include a heuristic rule: if a clause contains only one candidate information source, a media reporting-action word is present, and the candidate satisfies any of the following: the candidate information source string ends with a media-name indicator word, or date/time information appears in the clause containing the candidate information source string, or media indicator words appear around the candidate information source string, then the candidate information source is judged to be a real information source.
11. The message information source extraction method according to claim 1, characterized in that the information source types comprise: news media, forum, blog, and microblog.
12. The message information source extraction method according to claim 7, characterized in that, in the information source type extraction step, for information sources whose type is blog and/or microblog, the user name or blog site information is further extracted.
13. A message information source extraction system adopting the message information source extraction method according to any one of claims 1-12, characterized in that the system comprises:
Message parsing module: parsing the encoding of the input text, extracting the characters in the text, and segmenting the characters into clauses;
Information source extraction module: performing keyword matching on the clauses according to the information source extraction rule base, extracting a useful-element sequence from each clause, extracting the information source from the useful-element sequence, and determining the information source type by matching rules in the information source extraction rule base.
14. The message information source extraction system according to claim 13, characterized in that the information source extraction rule base further comprises: a useful element base, real information source recognition rules, information source type recognition rules, and character type recognition rules.
15. The message information source extraction system according to claim 13, characterized in that the system further comprises:
Message content adaptation module: shielding differences in message encoding or storage mode and providing a unified interface for iteratively reading message characters.
16. The message information source extraction system according to claim 13 or 14, characterized in that the system further comprises:
Information source statistics module: aggregating the information source extraction results and computing statistics on the information sources.
17. The message information source extraction system according to claim 13, characterized in that the message parsing module further comprises:
Message character reading module: reading the message byte stream and assembling the bytes into actual characters according to the encoding;
Character type judging module: classifying characters into different types according to the character type recognition rules;
Event response module: according to the character type, notifying the user to perform the extraction operation for that type of character.
18. The message information source extraction system according to claim 13, characterized in that the information source extraction module further comprises:
Index building module: building a TRIE keyword index from the useful element base;
Clause segmentation module: segmenting the characters delivered by the event response step into clauses;
Extraction processing module: performing keyword matching on each clause according to the TRIE keyword index, extracting information sources, judging whether each information source is real, and determining the information source type;
Output module: outputting the information sources and their types.
19. The message information source extraction system according to claim 18 or 14, characterized in that the extraction processing module further comprises:
Information source extraction module: performing information source extraction clause by clause and, using the TRIE keyword index built from the useful element base, extracting a candidate information source or a list of candidate information sources;
Useful element extraction module: based on the candidate information source or the candidate list, extracting from the clause the useful elements and their positions within the clause;
Real information source judging module: judging, by the predefined real information source recognition rules, whether the candidate information source is a real information source;
Information source type extraction module: determining the information source type by matching the predefined information source type recognition rules against the useful elements.
20. The message information source extraction system according to claim 19, characterized in that, in the information source type extraction module, for information sources whose type is blog or microblog, the user name and blog site information are further extracted.
CN201410010836.XA 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system Active CN103778200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410010836.XA CN103778200B (en) 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system

Publications (2)

Publication Number Publication Date
CN103778200A true CN103778200A (en) 2014-05-07
CN103778200B CN103778200B (en) 2017-08-08

Family

ID=50570435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410010836.XA Active CN103778200B (en) 2014-01-09 2014-01-09 A kind of message information source abstracting method and its system

Country Status (1)

Country Link
CN (1) CN103778200B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100476800C (en) * 2007-06-22 2009-04-08 腾讯科技(深圳)有限公司 Method and system for cutting index participle
CN101344889B (en) * 2008-07-31 2011-04-13 中国农业大学 Method and system for network information extraction
CN101727461B (en) * 2008-10-13 2012-11-21 中国科学院计算技术研究所 Method for extracting content of web page
CN102999534A (en) * 2011-09-19 2013-03-27 北京金和软件股份有限公司 Chinese word segmentation algorithm based on reverse maximum matching
CN103150432B (en) * 2013-03-07 2016-05-11 宁波成电泰克电子信息技术发展有限公司 A kind of Internet public opinion analysis method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408101B (en) * 2014-11-19 2018-01-09 南京大学 A kind of full range Web information extracts integrated approach
CN104408101A (en) * 2014-11-19 2015-03-11 南京大学 Whole-process Web information extraction integration method
CN106815203B (en) * 2015-12-01 2021-03-30 北京国双科技有限公司 Method and device for analyzing amount of money in referee document
CN106815203A (en) * 2015-12-01 2017-06-09 北京国双科技有限公司 A kind of amount of money analysis method and device in judgement document
CN105447202A (en) * 2015-12-31 2016-03-30 宁波公众信息产业有限公司 Internet information collecting system
CN106021439A (en) * 2016-05-16 2016-10-12 腾讯科技(深圳)有限公司 Communication number processing method and device
CN106484767B (en) * 2016-09-08 2019-06-21 中国科学院信息工程研究所 A kind of event extraction method across media
CN106484767A (en) * 2016-09-08 2017-03-08 中国科学院信息工程研究所 A kind of event extraction method across media
CN108268438A (en) * 2016-12-30 2018-07-10 腾讯科技(深圳)有限公司 A kind of content of pages extracting method, device and client
CN108268438B (en) * 2016-12-30 2021-10-22 腾讯科技(深圳)有限公司 Page content extraction method and device and client
CN107423279A (en) * 2017-04-11 2017-12-01 美林数据技术股份有限公司 A kind of information extraction and analysis method of credit financing short message
CN107169061A (en) * 2017-05-02 2017-09-15 广东工业大学 A kind of text multi-tag sorting technique for merging double information sources
CN111090744A (en) * 2019-12-17 2020-05-01 中科鼎富(北京)科技发展有限公司 Stock market operation risk information mining method and device
CN112380257A (en) * 2020-11-26 2021-02-19 厦门市美亚柏科信息股份有限公司 Network data stream locking method, terminal equipment and storage medium
CN112597405A (en) * 2020-12-17 2021-04-02 中国科学院计算技术研究所数字经济产业研究院 Event external information source extraction method based on microblog platform

Also Published As

Publication number Publication date
CN103778200B (en) 2017-08-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant