CN104346382A - Text analysis system and method employing language query - Google Patents
- Publication number
- CN104346382A (application CN201310330423.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- knowledge
- lql
- extracted
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text analysis system and method employing language query, which aims to extract required knowledge by acquiring Chinese text from the internet and analyzing it. It adopts Chinese word segmentation and LQL (Linguistics Query Language) techniques. Chinese word segmentation splits the Chinese text into words, and the segmented words are part-of-speech tagged. The LQL technique then applies LQL analysis to the segmented and tagged Chinese text to extract knowledge. The invention further provides error correction analysis for eliminating mistakenly extracted knowledge. The advantages of the system and method are that people who are not computer programmers can set LQL rules easily, and that the analysis is independent of the web page format and structure of the text content, which greatly enlarges the range of information that can be collected. The system and method are applicable to fields such as network information extraction, business information mining, information aggregation and network knowledge base establishment.
Description
Technical field
The invention belongs to the networking branch of computer science and specifically relates to a text analysis system and method using language query, applicable to network information extraction, business intelligence mining, information aggregation, the building of network knowledge bases and similar applications.
Background art
With the rapid development of the internet, the information on the network is growing explosively, and people are increasingly used to obtaining information online. However, because there is so much information on the network, people find it difficult to locate what they need even with search engines. In addition, the network often contains a great deal of irrelevant noise: although much information is retrieved, its content may be irrelevant or inaccurate.
People therefore want an intelligent tool that, following the user's wishes, helps remove the noise and filters the information that is really needed out of a large volume of information.
Traditional natural language processing (NLP) systems can use NLP techniques such as word segmentation and part-of-speech tagging, classification trees, synonyms and thesauri to extract meaning from text content. A large number of computer programs have accordingly been developed to extract knowledge from text content after NLP processing. However, developing computer programs is usually very time-consuming. Moreover, as time passes, more programs are needed to extract new knowledge, which makes the whole analysis system expensive to maintain. In many cases the extracted knowledge is ambiguous and must also be reviewed and corrected manually.
Chinese invention patent applications No. 200810142630.7 and No. 200910104805.X propose text analysis systems that use a classification tree to analyze text. However, such a system depends heavily on the structure of the blog or web page used as its input. For many text analysis systems, the sources of content (such as news articles from different news websites, or microblog posts) may not share a good or identical structure, which means that each website or each web page needs its own rules. In addition, the structure of a content source may change over time, and whenever it changes the classification tree must be rebuilt. None of this is cost-effective.
U.S. Patent Application Publication No. 2011/019671 and PCT International Publication No. WO2012/099970A1 propose a brand valuation system. The system collects sales and traffic data from a brand's website in order to assess the value of the brand. It also attempts to compare different brands in order to create brand indexes for certain industries. The problem with this system is that collecting the sales and traffic data of competitors' websites is quite difficult. In theory, such an index could be established if one organization could collect data from the different companies. In practice this is infeasible, because sales data is usually highly confidential.
Summary of the invention
To address the above problems, the invention discloses a text analysis system and method using language query. The invention uses Chinese word segmentation (Chinese Segmentation) and Linguistics Query Language (LQL) techniques. Chinese word segmentation splits Chinese text into words, and the segmented words are then given part-of-speech tags (Part-of-Speech, POS Tagging). The LQL technique further analyzes the segmented and part-of-speech-tagged Chinese text to extract the required knowledge.
According to one aspect of the invention, a text analysis system using language query is provided, the system comprising:
A text content input module, for inputting Chinese text into the text analysis system;
A Chinese word segmentation module, for segmenting the Chinese text into words;
A part-of-speech tagging module, for marking each segmented word with a part-of-speech label;
An application dictionary database, containing one or more application dictionaries, each of which contains one or more keywords;
A Linguistics Query Language (LQL) rule database, for storing one or more LQL rules, wherein setting an LQL rule comprises:
Defining the position in the Chinese text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), which is a sentence, a paragraph or a document;
Defining one or more matching conditions (MatchCriteria), each of which is a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
Defining the match pattern (MatchPattern), which specifies the matching condition: when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary; when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech label;
An LQL analysis module, for performing LQL analysis on the segmented and part-of-speech-tagged Chinese text according to the LQL rule and extracting the required knowledge, wherein the LQL analysis comprises:
Establishing the coverage defined by the LQL rule;
According to the part-of-speech label defined by the matching condition of the LQL rule, finding the words with that part-of-speech label in the segmented and tagged Chinese text;
According to the keywords defined by the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and tagged Chinese text;
When the matching conditions are satisfied within the coverage, extracting one or more words according to the extraction position defined by the LQL rule;
An extracted knowledge database, for storing the extracted knowledge.
According to another aspect of the invention, a text analysis method using the above system is provided, the method comprising:
S1: Obtain Chinese text;
S2: Use the Chinese word segmentation module to segment the Chinese text into words;
S3: Use the part-of-speech tagging module to tag the segmented words with parts of speech;
S4: In the LQL analysis module, use the LQL rule to perform LQL analysis on the segmented and tagged Chinese text in order to extract knowledge, wherein the LQL analysis comprises the following steps:
Establishing the coverage defined by the LQL rule;
According to the part-of-speech label defined by the matching condition of the LQL rule, finding the words with that part-of-speech label in the segmented and tagged Chinese text;
According to the keywords defined by the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and tagged Chinese text;
When the matching conditions are satisfied within the coverage, extracting one or more words according to the extraction position defined by the LQL rule.
According to a further aspect of the invention, a text analysis system using language query is provided, the system comprising:
A text content input module, for inputting text in a given language into the text analysis system;
A language word segmentation module, for segmenting the text into words;
A part-of-speech tagging module, for marking each segmented word with a part-of-speech label;
An application dictionary database, containing one or more application dictionaries;
A Linguistics Query Language (LQL) rule database, for storing one or more LQL rules, wherein setting an LQL rule comprises:
Defining the position in the text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), which is a sentence, a paragraph or a document;
Defining one or more matching conditions (MatchCriteria), each of which is a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
Defining the match pattern (MatchPattern), which specifies the matching condition: when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary; when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech label;
An LQL analysis module, for performing LQL analysis on the segmented and part-of-speech-tagged text according to the LQL rule and extracting the required knowledge, wherein the LQL analysis comprises:
Establishing the coverage defined by the LQL rule;
According to the part-of-speech label defined by the matching condition of the LQL rule, finding the words with that part-of-speech label in the segmented and tagged text;
According to the keywords defined by the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and tagged text;
When the matching conditions are satisfied within the coverage, extracting one or more words according to the extraction position defined by the LQL rule;
An extracted knowledge database, for storing the extracted knowledge.
According to the invention, the text analysis system using language query comprises a text content input module, a text syntax analysis module, a text word segmentation module, a part-of-speech tagging module, an LQL analysis module, an extracted knowledge database, a Chinese word segmentation dictionary, an LQL rule database, an application dictionary database, an error correction rule database, an error correction module, an LQL rule input interface, an application dictionary input interface and an error correction rule input interface.
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. Chinese word segmentation means cutting a sequence of Chinese characters into individual words. The Chinese word segmentation module segments Chinese text into words so that, as in English, a space separates each word in a Chinese sentence. The part-of-speech tagging module applies part-of-speech tags (POS Tagging) to the segmented words.
The Chinese word segmentation dictionary contains a term list in which each term has its part-of-speech tags and the frequency with which each tag occurs. The text word segmentation module and the part-of-speech tagging module segment and tag Chinese text on the basis of this dictionary.
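As a minimal sketch of how such a dictionary could drive both steps, the following Python code segments text by forward maximum matching and then gives each word its most frequent tag. Forward maximum matching is a common dictionary-based technique chosen here for illustration; the patent does not say which segmentation algorithm it uses. The dictionary entries, the frequencies, the `max_len` parameter and the `"x"` label for unknown words are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical segmentation dictionary: term -> {POS label: frequency}.
SEG_DICT: Dict[str, Dict[str, int]] = {
    "冬季": {"t": 120},
    "风暴": {"n": 80, "v": 2},
    "袭": {"v": 40},
    "菲": {"j": 15, "n": 3},
}


def segment(text: str, dictionary: Dict[str, Dict[str, int]],
            max_len: int = 4) -> List[str]:
    """Forward maximum matching: take the longest dictionary term at each position."""
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def tag(words: List[str], dictionary: Dict[str, Dict[str, int]],
        unknown: str = "x") -> List[Tuple[str, str]]:
    """Give each word its most frequent POS label; unknown words get `unknown`."""
    tagged: List[Tuple[str, str]] = []
    for word in words:
        freqs = dictionary.get(word)
        label = max(freqs, key=freqs.get) if freqs else unknown
        tagged.append((word, label))
    return tagged


if __name__ == "__main__":
    print(tag(segment("冬季风暴袭菲", SEG_DICT), SEG_DICT))
```

Picking the most frequent tag ignores context; a sequence model such as a hidden Markov model would use the neighbouring words as well.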
The application dictionary database contains one or more application dictionaries. Each application dictionary lists a series of keywords for a specific application. Application dictionaries can be used in setting LQL rules.
The LQL analysis module uses LQL rules to analyze the segmented and part-of-speech-tagged Chinese text and extract the required knowledge from it. Through the LQL rule input interface, users can set the LQL rules they need and store them in the LQL rule database. The extracted knowledge can be stored in the extracted knowledge database.
The error correction module uses error correction rules to analyze the extracted knowledge and delete knowledge that was extracted by mistake, thereby improving the accuracy of knowledge extraction. Through the error correction rule input interface, users can set error correction rules according to their needs. The error correction rules that are set can be stored in the error correction rule database.
According to one aspect of the invention, setting an LQL rule comprises:
Defining the position in the text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), which can be a sentence, a paragraph or a document;
Defining the matching conditions (MatchCriteria), each of which can be a phrase list (Phrase List), a word with a specific part-of-speech label (WORD POS) or a word without a specific part-of-speech label (WORD NOT POS);
Defining the match pattern (MatchPattern), which specifies the matching condition: for Phrase List, the match pattern can be a file name that points to a series of keywords in an application dictionary; for WORD POS or WORD NOT POS, the match pattern is a part-of-speech label;
Defining optional conditions (OptionalCriteria), which apply to a matching condition and can be defined with general regular expressions.
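The rule components listed above can be pictured as a small data structure. The Python sketch below is one possible in-memory representation, not the storage format of the LQL rule database, which the patent does not describe. The class names, field names and the `validate` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

COVERAGES = {"Sentence", "Paragraph", "Document"}
CRITERION_KINDS = {"PHRASE_LIST", "WORD_POS", "WORD_NOT_POS"}


@dataclass
class MatchCriterion:
    kind: str                       # MatchCriteria: one of CRITERION_KINDS
    pattern: str                    # MatchPattern: file name or POS label
    optional: Optional[str] = None  # OptionalCriteria, e.g. a repetition range


@dataclass
class LQLRule:
    extraction_positions: List[int]  # Extraction Position (1-based)
    coverage: str                    # Coverage
    criteria: List[MatchCriterion] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError when the rule uses an unknown coverage or criterion kind."""
        if self.coverage not in COVERAGES:
            raise ValueError(f"unknown coverage: {self.coverage}")
        for criterion in self.criteria:
            if criterion.kind not in CRITERION_KINDS:
                raise ValueError(f"unknown criterion kind: {criterion.kind}")


if __name__ == "__main__":
    rule = LQLRule(
        extraction_positions=[1, 3],
        coverage="Sentence",
        criteria=[
            MatchCriterion("WORD_POS", "nr"),
            MatchCriterion("WORD_NOT_POS", "nr", optional="{0-5}"),
            MatchCriterion("PHRASE_LIST", "nationality_word.txt"),
            MatchCriterion("WORD_POS", "ns"),
        ],
    )
    rule.validate()
    print(rule.coverage, len(rule.criteria))
```

Keeping the optional condition as a plain string leaves its interpretation to the analysis step, which matches the statement that it can be defined with a general regular expression.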
According to one aspect of the invention, the LQL analysis module uses an LQL rule to analyze the segmented and part-of-speech-tagged text, and the LQL analysis comprises:
Establishing the coverage defined by the LQL rule;
According to the part-of-speech label defined by the matching condition of the LQL rule, finding the words with that part-of-speech label in the segmented and tagged text;
According to the keywords defined by the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and tagged text;
When both a word with the part-of-speech label and a word identical to the keyword can be found within the coverage, the matching conditions are satisfied, and one or more words are extracted according to the extraction position defined by the LQL rule.
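The analysis steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: matching conditions are checked left to right within one coverage unit, other words may sit between matches, and only WORD POS and Phrase List conditions are handled. The function names, the tuple encoding of conditions and the English stand-in keywords are hypothetical; the dictionary file names are those used in the accident example in the embodiments.

```python
from typing import Dict, List, Optional, Set, Tuple

Token = Tuple[str, str]      # (word, part-of-speech label)
Criterion = Tuple[str, str]  # ("WORD_POS", label) or ("PHRASE_LIST", file name)


def lql_match(tokens: List[Token], criteria: List[Criterion],
              positions: List[int],
              dictionaries: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Match the criteria in order inside one coverage unit.

    Returns the words at the requested 1-based positions, or None when
    any criterion cannot be satisfied.
    """
    matched: List[str] = []
    cursor = 0
    for kind, pattern in criteria:
        found = None
        for i in range(cursor, len(tokens)):
            word, label = tokens[i]
            if kind == "WORD_POS" and label == pattern:
                found = i
            elif kind == "PHRASE_LIST" and word in dictionaries[pattern]:
                found = i
            if found is not None:
                break
        if found is None:
            return None
        matched.append(tokens[found][0])
        cursor = found + 1
    return [matched[p - 1] for p in positions]


def lql_analyze(units: List[List[Token]], criteria: List[Criterion],
                positions: List[int],
                dictionaries: Dict[str, Set[str]]) -> List[List[str]]:
    """Apply the rule to every coverage unit (for example, every sentence)."""
    results: List[List[str]] = []
    for tokens in units:
        extracted = lql_match(tokens, criteria, positions, dictionaries)
        if extracted is not None:
            results.append(extracted)
    return results


if __name__ == "__main__":
    dictionaries = {
        "accidentType_word.txt": {"windstorm", "earthquake", "tsunami"},
        "accident_word.txt": {"occurred in", "located in"},
    }
    criteria = [("PHRASE_LIST", "accidentType_word.txt"),
                ("PHRASE_LIST", "accident_word.txt"),
                ("WORD_POS", "ns")]
    sentences = [
        [("windstorm", "n"), ("occurred in", "v"), ("Philippines", "ns")],
        [("markets", "n"), ("rose", "v")],
    ]
    print(lql_analyze(sentences, criteria, [1, 3], dictionaries))
```

Passing sentences as the units corresponds to a Sentence coverage; passing whole paragraphs or documents as single token lists would correspond to the other two coverage settings.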
According to one aspect of the invention, setting an error correction rule comprises defining one or more values and a numerical comparison requirement. The values that can be defined are:
Whether the extracted knowledge is single-valued or multi-valued;
A threshold for the number of sources of the extracted knowledge;
A threshold for the number of times the knowledge was extracted; or
A threshold for the percentage that this number represents of all extracted knowledge.
The numerical comparison requirement compares a statistical value with the defined value, and can be greater than, less than or equal to.
When extracted knowledge fails one or more of the above error correction rules, that wrongly extracted knowledge can be deleted.
According to one aspect of the invention, the error correction module uses the error correction rules to analyze the knowledge extracted from the segmented and part-of-speech-tagged text, and the error correction analysis comprises:
Computing statistics over all extracted knowledge to obtain statistical values;
Comparing the statistical values with the values defined in the error correction rule;
Deleting the extracted knowledge that does not meet the numerical comparison requirement.
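The three error correction steps can be sketched in Python as below. The sketch assumes every extracted item is recorded as a (value, source) pair and that the comparison requirement is "greater than" for both thresholds. The function name, parameter names and sample observations are hypothetical. The thresholds 3 and 70% and the three candidate dates are taken from the date-of-birth example in the embodiments.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def error_correct(observations: List[Tuple[str, str]], min_sources: int,
                  min_share: float, single_valued: bool) -> List[str]:
    """Keep only the extracted values that pass the error correction rule.

    observations: (value, source) pairs for one kind of extracted knowledge.
    """
    counts: Dict[str, int] = defaultdict(int)
    sources: Dict[str, Set[str]] = defaultdict(set)
    # Step 1: compute the statistical values.
    for value, source in observations:
        counts[value] += 1
        sources[value].add(source)
    total = len(observations)
    # Steps 2 and 3: compare against the thresholds, drop what fails.
    kept = [value for value in counts
            if len(sources[value]) > min_sources
            and counts[value] / total > min_share]
    if single_valued and len(kept) > 1:
        # Assumption: for single-valued knowledge keep the most frequent value.
        kept = [max(kept, key=lambda value: counts[value])]
    return kept


if __name__ == "__main__":
    # Hypothetical observations, not the statistics shown in the patent figure.
    observations = [("06/07/1951", f"site{i}") for i in range(1, 9)]
    observations += [("07/06/1951", "site9"), ("06/07/1952", "site10")]
    print(error_correct(observations, min_sources=3, min_share=0.7,
                        single_valued=True))
```

For multi-valued knowledge the same function can be called with `single_valued=False` and a lower share threshold, so that several answers can survive.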
According to another aspect of the invention, a text analysis method using language query is provided, the method comprising the following steps:
S1: Using the LQL rule input interface, define the LQL rules;
S2: Using the application dictionary input interface, define the application dictionaries;
S3: Using the error correction rule input interface, define the error correction rules;
S4: Using the text content input module, obtain the text;
S5: Using the text syntax analysis module, perform syntax analysis on the text;
S6: Using the text word segmentation module, segment the text into words;
S7: Using the part-of-speech tagging module, tag the segmented words with parts of speech;
S8: In the LQL analysis module, use the LQL rules to perform LQL analysis on the segmented and tagged Chinese text in order to extract knowledge;
S9: Store the extracted knowledge in the extracted knowledge database;
S10: Using the error correction module and the error correction rules, delete the wrongly extracted knowledge to increase the accuracy of the extracted knowledge.
The advantage of the invention is that, because LQL rules are set in a form close to natural language rather than in a general-purpose computer language, people who are not computer programmers can also set LQL rules easily to extract knowledge, which lowers the difficulty of program development and effectively reduces system development and maintenance costs. In addition, the LQL rules that have been set can be accumulated and stored in the LQL rule database as a reference for new applications. Furthermore, the invention is independent of the web page format and structure of the text content, which greatly widens the range of information that can be collected.
According to various aspects of the invention, users can build different kinds of applications as needed simply by changing the LQL rules and updating the application dictionaries. Examples include a person search, which extracts relationships between people and organizations; a news search system, which links a news article to a place; and brand valuation, which monitors how well a brand is recognized on different social media platforms.
Brief description of the drawings
The accompanying drawings below will give those skilled in the art a better understanding of the invention and show its advantages more clearly. The drawings described here are for illustrating selected embodiments only, not all possible embodiments, and are not intended to limit the scope of the invention.
Fig. 1 is a block diagram of the text analysis system using language query according to the invention;
Fig. 2 shows a part-of-speech tagging method according to the invention;
Fig. 3 is a flowchart of the text analysis method using language query according to the invention;
Fig. 4 is a flowchart of the LQL analysis method according to the invention;
Fig. 5 is a flowchart of the error correction analysis according to the invention.
Detailed description
Fig. 1 shows a text analysis system according to an embodiment of the invention, comprising a text content input module 101, a text syntax analysis module 102, a text word segmentation module 103, a part-of-speech tagging module 104, an LQL analysis module 105, an extracted knowledge database 106, a Chinese word segmentation dictionary 107, an LQL rule database 108, an application dictionary database 109, an error correction rule database 110, an error correction module 111, an LQL rule input interface 112, an application dictionary input interface 113 and an error correction rule input interface 114.
The text content input module 101 inputs text content into the LQL text analysis system. The text content can be obtained from the internet or from non-internet sources. When the text content is on the internet, the text content input module 101 can use the application programming interface (Application Program Interface, API) that a website provides to obtain the text of the web pages exposed through the API. Alternatively, a web crawler can crawl websites in hypertext format and extract the hypertext-formatted text.
The text syntax analysis module 102 analyzes the grammar of the text content.
The text word segmentation module 103 performs Chinese word segmentation on the text content. For example, the Chinese sentence glossed in English as "winter storm hits Philippines, feared to claim hundred lives" can be segmented into the words: winter, storm, hit, Phil. (an abbreviation for the Philippines), feared, claim, hundred, lives.
The part-of-speech tagging module 104 tags the segmented words: each segmented word is marked, according to its part of speech, with a corresponding English letter code, the part-of-speech label. For example: winter/t, storm/n, hit/v, Phil./j, feared/d, claim/v, hundred/m, lives/n. Here t denotes a time word, n a noun, v a verb, j an abbreviation, d an adverb and m a numeral.
The figure below is a complete list of the part-of-speech labels according to the invention. In it, a denotes an adjective, Ag an adjectival morpheme, ad an adverbial adjective, an a nominal adjective, b a distinguishing word, and so on.
Preferably, the part-of-speech tagging module 104 uses the Viterbi algorithm for part-of-speech tagging. The Viterbi algorithm is a dynamic programming algorithm for finding the most probable sequence of hidden states, called the Viterbi path, that explains a sequence of observed events, especially in the context of Markov information sources and hidden Markov models. Another method is the forward algorithm, which computes the probability of the observed sequence of events and also belongs to probability theory. Fig. 2 shows an example of using the Viterbi algorithm for part-of-speech tagging according to the invention. For the sentence glossed as "winter storm hits Philippines, feared to claim hundred lives", the part-of-speech tags of the words are winter/t, storm/n, hit/v, Phil./j, feared/d, claim/v, hundred/m and lives/n.
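To make the dynamic programming step concrete, here is a minimal Python sketch of Viterbi decoding over a toy hidden Markov model, where the hidden states are part-of-speech labels and the observations are words. The probabilities, the two-tag tag set and the English sample words are hypothetical and are not taken from the patent; in practice they would be estimated from a tagged corpus or from dictionary frequencies.

```python
import math
from typing import Dict, List


def viterbi(words: List[str], tags: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-12) -> List[str]:
    """Return the most probable tag sequence (the Viterbi path) for `words`."""
    if not words:
        return []

    def lp(p: float) -> float:
        # Work in log space; `floor` stands in for unseen events.
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best path ending in tag t at word i.
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags,
                       key=lambda p: best[i - 1][p] + lp(trans_p[p].get(t, 0.0)))
            best[i][t] = (best[i - 1][prev] + lp(trans_p[prev].get(t, 0.0))
                          + lp(emit_p[t].get(words[i], 0.0)))
            back[i][t] = prev
    # Trace the best path backwards from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


if __name__ == "__main__":
    tags = ["n", "v"]
    start_p = {"n": 0.7, "v": 0.3}
    trans_p = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.8, "v": 0.2}}
    emit_p = {"n": {"storm": 0.5, "hits": 0.1, "city": 0.4},
              "v": {"storm": 0.1, "hits": 0.8, "city": 0.1}}
    print(viterbi(["storm", "hits", "city"], tags, start_p, trans_p, emit_p))
```

Log probabilities are used so that long sentences do not underflow; the forward algorithm mentioned above would replace the `max` over previous tags with a sum.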
The Chinese word segmentation dictionary 107 contains the term list and the corresponding part-of-speech tags, and is used for segmenting and tagging text. It can be defined or modified by the user.
The application dictionary database 109 contains at least one application dictionary. An application dictionary is set up for a particular application and records a series of keywords specific to it. Users can create, edit or delete application dictionaries through the application dictionary input interface 113. According to one embodiment of the invention, an application dictionary for brand analysis contains brand analysis keywords, for example fashion brands (LV, Gucci, etc.) or industry-specific terms (product names, models, etc.). These keywords can be used in setting LQL rules. The figure below shows an application dictionary according to the invention for finding the relationship between news and places.
The LQL analysis module 105 can apply LQL rules to the segmented and part-of-speech-tagged text to extract the required knowledge and store it in the extracted knowledge database 106. LQL is a scripting language similar to Structured Query Language (SQL), but LQL can extract the required information from unstructured text. In addition, LQL rules can be defined according to the application and the needs of the user. The LQL rule input interface 112 lets users enter LQL rules, which can be stored in the LQL rule database 108.
According to one embodiment of the invention, an LQL rule setting comprises:
Select means selection. Extraction Position is the position in the text of the knowledge to be extracted, expressed as a number. Select<Extraction Position> therefore selects the position in the text of the knowledge to be extracted.
Coverage is the scope of the LQL analysis, which can be a sentence (Sentence), a paragraph (Paragraph) or a document (Document).
MatchCriteria is the matching condition, which can be a phrase list (Phrase List), a word with a specific part-of-speech label (WORD POS) or a word without a specific part-of-speech label (WORD NOT POS).
MatchPattern is the match pattern, which specifies the matching condition. For Phrase List, the match pattern can be a file name that points to a series of keywords in an application dictionary. For WORD POS or WORD NOT POS, the match pattern is a part-of-speech label such as n, v or t.
OptionalCriteria is an optional condition that applies to a matching condition and can be defined with general regular expressions.
The following is an example for finding out what someone said.
In this LQL rule, Select<1,3> selects the positions in the text of the knowledge to be extracted. 1 and 3 denote the first and third matching conditions (the Word NOT pos condition is not counted). Sentence means that the coverage is a sentence. [Word pos="nr"] finds a word tagged as a person name; "nr" denotes a person name. [Word NOT pos="nr"]*{0-5} requires that the up to five words following the name that was found are not person names, to avoid cases involving two or more people. In [Phrase list="speech_word.txt"], "speech_word.txt" is a file name pointing to an application dictionary that contains a series of keywords such as propose, say, emphasize, point out, express, indicate, claim, estimate, think, reaffirm, predict and expect, all of which are synonyms of "say" and indicate that someone said something. When a sentence contains a word tagged as a person name together with one of the defined keywords, the matching conditions are satisfied, and the name (the first matching condition) and the one or more words after the keyword (the third matching condition, which is not shown here) are extracted. For example, from the sentence "Chen Dawen estimates that the stock will rise", the four words "Chen Dawen, stock, will, rise" are extracted according to this LQL rule.
The following is an example for analyzing a person's native place.
Select<1,3> selects the words at the positions of [Word pos="nr"] and [Word pos="ns"]. Sentence means that the coverage is a sentence. [Word pos="nr"] finds a word tagged as a person name. [Word NOT pos="nr"]*{0-5} requires that the up to five words following the name that was found are not person names, to avoid cases involving two or more people. In [Phrase list="nationality_word.txt"], "nationality_word.txt" is a file name pointing to an application dictionary that contains a series of keywords such as ancestral home and native place. [Word pos="ns"] finds a word tagged as a place name. When all four matching conditions above are satisfied in a sentence, the person name and the place name are extracted. For example, from the sentence "Wang Dawen's ancestral home is Taishan", "Wang Dawen" and "Taishan" are extracted.
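The behaviour of this rule on a tagged sentence can be sketched in Python as follows. The sketch rests on one reading of the rule that the patent does not spell out: the keyword must appear within five non-name words after the person name, and the place name may follow the keyword with other words in between (as "is" does in the example sentence). The function name, the `max_gap` parameter and the English stand-in tokens are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech label)


def extract_native_place(tokens: List[Token], keywords: Set[str],
                         max_gap: int = 5) -> Optional[Tuple[str, str]]:
    """Return (person name, place name) for the first match in one sentence."""
    for i, (name, label) in enumerate(tokens):
        if label != "nr":                      # [Word pos="nr"]
            continue
        # [Word NOT pos="nr"]*{0-5}: at most max_gap words, none of them a name.
        window_end = min(i + max_gap + 2, len(tokens))
        for j in range(i + 1, window_end):
            word, word_label = tokens[j]
            if word_label == "nr":
                break                          # a second person ends the window
            if word in keywords:               # [Phrase list="nationality_word.txt"]
                for place, place_label in tokens[j + 1:]:
                    if place_label == "ns":    # [Word pos="ns"]
                        return (name, place)
                break
    return None


if __name__ == "__main__":
    keywords = {"ancestral home", "native place"}  # stand-ins for the dictionary
    sentence = [("Wang Dawen", "nr"), ("'s", "u"), ("ancestral home", "n"),
                ("is", "v"), ("Taishan", "ns")]
    print(extract_native_place(sentence, keywords))
```

Stopping the window at a second person name is what keeps a sentence about two people from pairing the first person with the second person's native place.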
The following is an example for finding, in news content, the place where an accident happened.
Select<1,3> selects the words at the positions of [Phrase list="accidentType_word.txt"] and [Word pos="ns"]. Sentence means that the coverage is a sentence. [Phrase list="accidentType_word.txt"] finds keywords that denote an accident or disaster, such as windstorm, earthquake, tsunami and water disaster. [Phrase list="accident_word.txt"] finds keywords such as "occurred in" and "located in". [Word pos="ns"] finds a word whose part-of-speech label is place name (ns). When all three matching conditions above are satisfied in a sentence, the accident keyword and the place name are extracted. For example, from the sentence "A windstorm occurred in the Philippines", "windstorm" and "the Philippines" are extracted.
The following is an example used for brand analysis.
The LQL rule is: [brand name] + [new range/new product] + [new product name]. [brand name] is an application dictionary containing the names of a series of brands. [new range/new product] is an application dictionary containing a series of keywords that prefix product names, such as "new series". [new product name] is the product name to be found.
This LQL rule is:
Select<3> selects the word that follows the keyword from product_prefix.txt. Sentence means that the coverage is a sentence. [Phrase list="brand_name.txt"] finds the brand name keywords that brand_name.txt points to. [Phrase list="product_prefix.txt"] finds the product prefix keywords that product_prefix.txt points to. When both matching conditions above are satisfied in a sentence, the new product name is extracted. For example, from a sentence about the GUCCI 2011 "new series bamboo bag", "bamboo bag" can be extracted as the new product name.
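A minimal Python sketch of this brand rule is shown below. It assumes the sentence has already been segmented into words and that the product name is the single word immediately after the prefix keyword; the function name and the English stand-in tokens and dictionary contents are hypothetical.

```python
from typing import List, Set


def extract_new_products(words: List[str], brands: Set[str],
                         prefixes: Set[str]) -> List[str]:
    """Return the words that follow a product prefix appearing after a brand name."""
    products: List[str] = []
    for i, word in enumerate(words):
        if word not in brands:                 # [Phrase list="brand_name.txt"]
            continue
        # Stop one word early so that words[j + 1] always exists.
        for j in range(i + 1, len(words) - 1):
            if words[j] in prefixes:           # [Phrase list="product_prefix.txt"]
                products.append(words[j + 1])  # Select<3>: the word after the prefix
                break
    return products


if __name__ == "__main__":
    brands = {"GUCCI", "LV"}                    # stand-in for brand_name.txt
    prefixes = {"new series", "new product"}    # stand-in for product_prefix.txt
    sentence = ["GUCCI", "2011", "new series", "bamboo bag"]
    print(extract_new_products(sentence, brands, prefixes))
```

Because only the dictionaries change between applications, the same function would serve another industry once brand_name.txt and product_prefix.txt are replaced.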
Many times, multiple answer is extracted, but in the middle of only have one or several to be correct.Error correction module 111 according to error recovery rule, can delete some by the knowledge of error extraction.Error recovery rule inputting interface 114 sets and input error recovery rule for allowing user.Error recovery rule can be stored in error recovery rule database 110.In addition, this error correction module 111 can be added up the knowledge be extracted, to obtain statistical value.
The figure below illustrates an example of finding a person's date of birth.
This error correction rule is:
The answer is single-valued (because a person has only one date of birth);
The number of sources of the extracted knowledge must be greater than 3 (for example, the extracted knowledge is obtained from more than three different websites);
The count of the extracted knowledge, as a percentage of the count of all extracted knowledge, must be greater than 70%.
Here, 3 and 70% are the numerical values defined in this error correction rule, and "greater than" is the numerical comparison requirement defined in it; 3 and 70% can therefore also be called thresholds. The numbers in the figure are the statistical values of the extracted knowledge. Only 06/07/1951 meets the numerical comparison requirements above, because its number of sources (statistical value 6) is greater than 3 and its share of all extracted knowledge (statistical value 88%) is greater than 70%, so it is selected as the correct answer. The other two candidates, 07/06/1951 and 06/07/1952, are deleted.
The figure below illustrates an example of finding the places where earthquakes occurred.
This error correction rule is:
The answer can be multi-valued (because several earthquakes can occur in the same period);
The number of sources of the extracted knowledge must be greater than 3;
The count of the extracted knowledge, as a percentage of the count of all extracted knowledge, must be greater than 20%.
Here, 3 and 20% are the numerical values defined in this error correction rule, and "greater than" is the numerical comparison requirement defined in it; 3 and 20% can therefore also be called thresholds. Only Wenchuan County in Sichuan and Yushu in Qinghai meet the numerical comparison requirements above, so they are selected as correct answers. "Cloud river, Sichuan" has only one text source and accounts for only 2% of all extracted knowledge, so it is deleted.
The figure below illustrates an example of finding new product names.
This error correction rule is:
The answer can be multi-valued (because there can be several new products at the same time);
The number of sources of the extracted knowledge must be greater than 3;
The count of the extracted knowledge, as a percentage of the count of all extracted knowledge, must be greater than 20%.
Here, 3 and 20% are the thresholds in this error correction rule. "Bamboo bag" and "crime likes that undercurrent" meet the numerical comparison requirements above, so they are selected as correct answers. "Favorite undercurrent" fails to meet the requirements and is therefore deleted.
According to another aspect of the present invention, a text analysis method using language query is provided. As shown in Figure 3, the method comprises the following steps:
S301: using the LQL rule input interface, define LQL rules;
S302: using the application dictionary input interface, define application dictionaries;
S303: using the error correction rule input interface, define error correction rules;
S304: using the text content input module, obtain the text content;
S305: using the text grammar analysis module, perform grammatical analysis on the text;
S306: using the text word segmentation module, segment the text into words;
S307: using the part-of-speech tagging module, tag the segmented words with part-of-speech labels;
S308: in the LQL analysis module, apply the LQL rules to the segmented and tagged text and perform LQL analysis to extract knowledge;
S309: store the extracted knowledge in the extracted knowledge database;
S310: in the error correction module, delete wrongly extracted knowledge according to the error correction rules, so as to increase the accuracy of the extracted knowledge.
In step S308, as shown in Figure 4, the LQL analysis comprises the following steps:
S401: establish the coverage defined by the LQL rule;
S402: according to the part-of-speech label defined by a matching condition of the LQL rule, find the words with that part-of-speech label in the segmented and tagged text;
S403: according to the keywords defined by a matching condition of the LQL rule, find the words identical to those keywords in the segmented and tagged text;
S404: when the matching conditions are met within the coverage, extract one or more words from the segmented and tagged text at the extraction position defined by the LQL rule.
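Step S401 fixes the unit of text within which every matching condition must hold. A minimal Python sketch of that step is given here, assuming the three coverage values named in the claims (sentence, paragraph, document). The function name `coverage_units`, the blank-line paragraph convention, and the punctuation-based sentence splitting are illustrative assumptions; a real system would more likely split on the output of its segmentation module.

```python
import re
from typing import List


def coverage_units(text: str, coverage: str) -> List[str]:
    """Split text into the units within which all matching conditions must be met."""
    if coverage == "document":
        units = [text]
    elif coverage == "paragraph":
        # Assumption: paragraphs are separated by blank lines.
        units = re.split(r"\n\s*\n", text)
    elif coverage == "sentence":
        # Naive split after Chinese or Western sentence-ending punctuation.
        units = re.split(r"(?<=[。！？.!?])\s*", text)
    else:
        raise ValueError(f"unknown coverage: {coverage}")
    return [unit.strip() for unit in units if unit.strip()]


text = "A windstorm occurred in the Philippines. Rescue teams arrived!\n\nNo tsunami followed."
for unit in coverage_units(text, "sentence"):
    print(unit)
```

Each returned unit would then be segmented, tagged, and passed to the matching steps S402 to S404. The sentence splitter is deliberately simple and would also split inside decimals or abbreviations.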
In step S310, as shown in Figure 5, the error correction analysis comprises the following steps:
S501: compute statistics over the extracted knowledge to obtain statistical values;
S502: compare the statistical values with the numerical values defined by the error correction rule;
S503: delete the extracted knowledge that does not meet the numerical comparison requirements.
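Steps S501 to S503 can be sketched in Python as follows. This is an illustrative sketch only: the names `Rule` and `correct_errors` and the input layout (each candidate answer mapped to the source of every extraction of it) are assumptions. The sample counts are hypothetical as well, chosen so that the surviving date has 6 sources and an 88% share as in the date-of-birth example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    single_valued: bool   # True when only one answer can be correct
    min_sources: int      # number of distinct sources must be greater than this
    min_share: float      # share of all extractions must be greater than this


def correct_errors(extractions: Dict[str, List[str]], rule: Rule) -> List[str]:
    """Keep only the answers whose statistics satisfy the rule's thresholds."""
    total = sum(len(sources) for sources in extractions.values())
    if total == 0:
        return []
    kept = []
    for answer, sources in extractions.items():
        n_sources = len(set(sources))        # S501: number of distinct sources
        share = len(sources) / total         # S501: share of all extractions
        # S502: compare the statistics with the rule's thresholds ("greater than").
        if n_sources > rule.min_sources and share > rule.min_share:
            kept.append((share, answer))
    kept.sort(reverse=True)                  # highest share first
    if rule.single_valued:
        kept = kept[:1]
    return [answer for _, answer in kept]    # S503: everything else is deleted


# Hypothetical counts: 22 + 2 + 1 = 25 extractions in total.
births = {
    "06/07/1951": [f"site{i % 6}" for i in range(22)],  # 6 sources, 22/25 = 88%
    "07/06/1951": ["site1", "site2"],
    "06/07/1952": ["site3"],
}
print(correct_errors(births, Rule(single_valued=True, min_sources=3, min_share=0.70)))
# ['06/07/1951']
```

For a multi-valued rule such as the earthquake example, the same function would be called with `single_valued=False` and a lower share threshold such as 0.20.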
The text analysis method and system using language query according to the present invention are applicable not only to Chinese but also to other languages, such as English, German, Japanese, and Korean; it is only necessary to use a suitable word segmentation module and part-of-speech tagging module.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary hardware platform, or entirely by hardware, although in many cases the former is the better embodiment. Based on this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
Although the present invention has been illustrated and described, those skilled in the art will appreciate that changes can be made to the embodiments without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents.
Claims (21)
1. A text analysis system using language query, characterized in that the system comprises:
A text content input module, for inputting Chinese text into the text analysis system;
A Chinese word segmentation module, for segmenting the Chinese text into words;
A part-of-speech tagging module, for tagging the segmented words with part-of-speech labels;
An application dictionary database, comprising one or more application dictionaries, each application dictionary comprising one or more keywords;
A Language Query Language (LQL) rule database, for storing one or more LQL rules, wherein the setting of an LQL rule comprises:
Defining the position in the Chinese text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
Defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
Defining a match pattern (MatchPattern) for each matching condition, wherein when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary, and when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech label;
An LQL analysis module, for performing LQL analysis on the segmented and tagged Chinese text according to the LQL rule and extracting the required knowledge, wherein the LQL analysis comprises:
Establishing the coverage defined by the LQL rule;
Finding, in the segmented and tagged Chinese text, the words with the part-of-speech label defined by a matching condition of the LQL rule;
Finding, in the segmented and tagged Chinese text, the words identical to the keywords defined by a matching condition of the LQL rule;
When the matching conditions are met within the coverage, extracting one or more words at the position in the Chinese text that the LQL rule defines for the knowledge to be extracted;
An extracted knowledge database, for storing the extracted knowledge.
2. The text analysis system according to claim 1, characterized in that the system further comprises:
An error correction rule database, for storing one or more error correction rules;
An error correction module, which can use the error correction rules to perform error correction analysis on the extracted knowledge, so as to delete wrongly extracted knowledge and increase the accuracy of the extracted knowledge.
3. The text analysis system according to claim 2, characterized in that the error correction rule comprises one or more set numerical values and numerical comparison requirements; the error correction module computes statistics over the extracted knowledge to obtain statistical values and compares them with the numerical values; when the statistical value of a piece of extracted knowledge does not meet the numerical comparison requirement, that piece of extracted knowledge can be deleted.
4. The text analysis system according to claim 3, characterized in that the statistical value comprises the number of sources of the extracted knowledge, the count of the extracted knowledge, or the count of the extracted knowledge as a percentage of the count of all extracted knowledge.
5. The text analysis system according to claim 3, characterized in that the numerical value comprises a threshold for the number of sources of the extracted knowledge, a threshold for the count of the extracted knowledge, or a threshold for the count of the extracted knowledge as a percentage of the count of all extracted knowledge; the numerical comparison requirement compares the statistical value with the numerical value and requires the statistical value to be greater than, less than, or equal to the numerical value.
6. The text analysis system according to claim 1, characterized in that the system further comprises:
A text grammar analysis module, for analyzing the grammar of the Chinese text;
A Chinese word segmentation dictionary, comprising a term list in which each term has part-of-speech tags and the frequency of occurrence of each tag, used for segmenting the Chinese text into words and tagging their parts of speech;
An LQL rule input interface, for allowing the user to set LQL rules;
An application dictionary input interface, for allowing the user to set application dictionaries.
7. The text analysis system according to claim 2, characterized in that the system further comprises:
An error correction rule input interface, for allowing the user to enter error correction rules.
8. The text analysis system according to claim 1, characterized in that the Chinese text is obtained from the Internet.
9. The text analysis system according to claim 8, characterized in that an application programming interface or a web crawler is used to obtain the Chinese text from the Internet.
10. The text analysis system according to claim 1, characterized in that the Viterbi algorithm or the forward algorithm is used to tag the segmented words with parts of speech.
11. The text analysis system according to claim 1, characterized in that the matching condition further comprises a word without a specific part-of-speech tag (WORD NOT POS), whose match pattern is a part-of-speech label.
12. The text analysis system according to claim 1, characterized in that the setting of the LQL rule further comprises:
Defining an optional condition (OptionalCriteria) for a matching condition.
13. The text analysis system according to claim 2, characterized in that the error correction rule comprises setting whether the extracted knowledge is single-valued or multi-valued.
14. The text analysis system according to claim 1, characterized in that the system is applied to person search, news search, or brand analysis.
15. A text analysis method using the system according to claim 1, characterized in that the method comprises:
S1: obtaining Chinese text;
S2: using the Chinese word segmentation module, segmenting the Chinese text into words;
S3: using the part-of-speech tagging module, tagging the segmented words with part-of-speech labels;
S4: in the LQL analysis module, applying the LQL rule to the segmented and tagged Chinese text and performing LQL analysis to extract knowledge, wherein the LQL analysis comprises the following steps:
Establishing the coverage defined by the LQL rule;
Finding, in the segmented and tagged Chinese text, the words with the part-of-speech label defined by a matching condition of the LQL rule;
Finding, in the segmented and tagged Chinese text, the words identical to the keywords defined by a matching condition of the LQL rule;
When the matching conditions are met within the coverage, extracting one or more words at the position in the Chinese text that the LQL rule defines for the knowledge to be extracted.
16. The text analysis method according to claim 15, characterized in that the method further comprises:
Performing error correction analysis on the extracted knowledge according to an error correction rule, so as to delete wrongly extracted knowledge and increase the accuracy of the extracted knowledge.
17. The text analysis method according to claim 16, characterized in that the error correction analysis comprises:
Computing statistics over the extracted knowledge to obtain statistical values;
Comparing the statistical values with the numerical values defined by the error correction rule;
When the statistical value of a piece of extracted knowledge does not meet the numerical comparison requirement, deleting that piece of extracted knowledge.
18. The text analysis method according to claim 17, characterized in that the statistical value comprises the number of sources of the extracted knowledge, the count of the extracted knowledge, or the count of the extracted knowledge as a percentage of the count of all extracted knowledge.
19. The text analysis method according to claim 17, characterized in that the numerical value comprises a threshold for the number of sources of the extracted knowledge, a threshold for the count of the extracted knowledge, or a threshold for the count of the extracted knowledge as a percentage of the count of all extracted knowledge; the numerical comparison requirement compares the statistical value with the numerical value and requires the statistical value to be greater than, less than, or equal to the numerical value.
20. The text analysis method according to claim 15, characterized in that the method is applied to person search, news search, or brand analysis.
21. A text analysis system using language query, characterized in that the system is applicable to different languages and comprises:
A text content input module, for inputting text in the language into the text analysis system;
A language word segmentation module, for segmenting the text into words;
A part-of-speech tagging module, for tagging the segmented words with part-of-speech labels;
An application dictionary database, comprising one or more application dictionaries;
A Language Query Language (LQL) rule database, for storing one or more LQL rules, wherein the setting of an LQL rule comprises:
Defining the position in the text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
Defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
Defining a match pattern (MatchPattern) for each matching condition, wherein when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary, and when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech label;
An LQL analysis module, for performing LQL analysis on the segmented and tagged text according to the LQL rule and extracting the required knowledge, wherein the LQL analysis comprises:
Establishing the coverage defined by the LQL rule;
Finding, in the segmented and tagged text, the words with the part-of-speech label defined by a matching condition of the LQL rule;
Finding, in the segmented and tagged text, the words identical to the keywords defined by a matching condition of the LQL rule;
When the matching conditions are met within the coverage, extracting one or more words at the position in the text that the LQL rule defines for the knowledge to be extracted;
An extracted knowledge database, for storing the extracted knowledge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310330423.5A CN104346382B (en) | 2013-07-31 | 2013-07-31 | Text analysis system and method using language query |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104346382A (en) | 2015-02-11 |
CN104346382B (en) | 2017-08-29 |
Family
ID=52501997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310330423.5A Active CN104346382B (en) | Text analysis system and method using language query | 2013-07-31 | 2013-07-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104346382B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104778262A (en) * | 2015-04-21 | 2015-07-15 | 无锡天脉聚源传媒科技有限公司 | Searching method and searching device |
CN105243130A (en) * | 2015-09-29 | 2016-01-13 | 中国电子科技集团公司第三十二研究所 | Text processing system and method for data mining |
CN107870966A (en) * | 2017-08-11 | 2018-04-03 | 成都萌想科技有限责任公司 | A kind of recruitment general regulations data pick-up method based on semantic model |
CN109214005A (en) * | 2018-09-14 | 2019-01-15 | 南威软件股份有限公司 | A kind of clue extracting method and system based on Chinese word segmentation |
CN109558589A (en) * | 2018-11-12 | 2019-04-02 | 速度时空信息科技股份有限公司 | A kind of method and system of the free thought document based on Chinese words segmentation |
CN113239206A (en) * | 2021-06-18 | 2021-08-10 | 广东博维创远科技有限公司 | Judgment document accurate data classification analysis method and storage device capable of being read by computer |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20100069120A (en) * | 2008-12-16 | 2010-06-24 | 한국전자통신연구원 | Method for tagging morphology by using prosody modeling and its apparatus |
CN102207947A (en) * | 2010-06-29 | 2011-10-05 | 天津海量信息技术有限公司 | Direct speech material library generation method |
CN102253930A (en) * | 2010-05-18 | 2011-11-23 | 腾讯科技(深圳)有限公司 | Method and device for translating text |
CN102654873A (en) * | 2011-03-03 | 2012-09-05 | 苏州同程旅游网络科技有限公司 | Tourism information extraction and aggregation method based on Chinese word segmentation |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104778262A (en) * | 2015-04-21 | 2015-07-15 | 无锡天脉聚源传媒科技有限公司 | Searching method and searching device |
CN104778262B (en) * | 2015-04-21 | 2018-07-24 | 无锡天脉聚源传媒科技有限公司 | A kind of searching method and device |
CN105243130A (en) * | 2015-09-29 | 2016-01-13 | 中国电子科技集团公司第三十二研究所 | Text processing system and method for data mining |
CN107870966A (en) * | 2017-08-11 | 2018-04-03 | 成都萌想科技有限责任公司 | A kind of recruitment general regulations data pick-up method based on semantic model |
CN109214005A (en) * | 2018-09-14 | 2019-01-15 | 南威软件股份有限公司 | A kind of clue extracting method and system based on Chinese word segmentation |
CN109558589A (en) * | 2018-11-12 | 2019-04-02 | 速度时空信息科技股份有限公司 | A kind of method and system of the free thought document based on Chinese words segmentation |
CN113239206A (en) * | 2021-06-18 | 2021-08-10 | 广东博维创远科技有限公司 | Judgment document accurate data classification analysis method and storage device capable of being read by computer |
CN113239206B (en) * | 2021-06-18 | 2023-05-12 | 广东博维创远科技有限公司 | Judgment document accurate data classification analysis method and computer readable storage device |
Also Published As
Publication number | Publication date |
---|---|
CN104346382B (en) | 2017-08-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||