CN104346382B - Text analysis system and method using linguistic query - Google Patents
- Publication number
- CN104346382B (application CN201310330423.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- knowledge
- lql
- extracted
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text analysis system and method using linguistic query, which can obtain Chinese text from a network and analyze it in order to extract the required knowledge. The invention uses Chinese word segmentation and Linguistics Query Language (LQL) technology. Chinese word segmentation splits the Chinese text into words and tags each segmented word with its part of speech. The LQL technology then performs LQL analysis on the segmented and tagged Chinese text and extracts knowledge from it. The system also provides an error correction analysis for deleting knowledge that was extracted in error. An advantage of the invention is that non-programmers can also set up LQL rules easily. In addition, the invention is independent of the web format and structure of the text content, which greatly widens the range of information that can be collected. The invention is applicable to fields such as network information extraction, business intelligence mining, information fusion, and the construction of online knowledge bases.
Description
Technical field
The invention belongs to the networking branch of computer science and specifically relates to a text analysis system and method using linguistic query, applicable to fields such as network information extraction, business intelligence mining, information fusion, and the construction of online knowledge bases.
Background art
With the rapid development of the Internet, the information on the network is growing explosively, and people are increasingly accustomed to obtaining information online. However, because there is so much information on the network, people find it hard to locate what they need even with a search engine. In addition, a great deal of irrelevant noise often appears on the network: although much information can be retrieved, its content may be irrelevant or inaccurate.
An intelligent tool is therefore desirable that, according to the user's wishes, helps people remove the noise and filter out the information they really need from a large volume of information.
Traditional natural language processing (NLP) systems can use natural language processing techniques, such as part-of-speech tagging, classification trees, synonyms, and thesauri, to extract the central meaning from the content of a text. Many computer programs have accordingly been developed to extract knowledge from text that has been processed by NLP. However, developing computer programs is usually very time-consuming. Moreover, as time passes, more computer programs are needed to extract new knowledge, which makes the whole analysis system expensive to maintain. In many cases, because the extracted knowledge is ambiguous, manual review and correction are also required.
Chinese invention patent applications No. 200810142630.7 and No. 200910104805.X propose text analysis systems that analyze text using classification trees. However, such a system depends heavily on the structure of the blog or web page that serves as its input. For many text analysis systems, the content sources (such as news articles from different news websites, or microblog content) may not share a good or identical structure, which means that each website or each web page needs its own rules. In addition, the structure of a content source may change over time, and whenever the structure changes the classification tree must be rebuilt. None of this is cost-effective.
U.S. Patent Application Publication No. 2011/019671 and PCT International Publication No. WO2012/099970A1 propose brand valuation systems. Such a system collects the sales and traffic data of a brand's website in order to assess the value of the brand. It also attempts to compare different brands in order to create a brand index within an industry. The problem with this system is that collecting the sales and traffic data of competitors' websites is extremely difficult. In theory, the index could be established if an organization were able to collect the data from the different companies. In practice this is infeasible, because sales data is usually highly confidential.
Summary of the invention
In view of the above problems, the invention discloses a text analysis system and method using linguistic query. The invention uses Chinese word segmentation (Chinese Segmentation) and Linguistics Query Language (LQL) technology. Chinese word segmentation splits the Chinese text into words and performs part-of-speech tagging (Part-of-Speech Tagging, POS Tagging) on the segmented words. The LQL technology then further analyzes the segmented and part-of-speech-tagged Chinese text to extract the required knowledge.
According to one aspect of the invention, a text analysis system using linguistic query is provided, the system comprising:
a text content input module for inputting Chinese text into the text analysis system;
a Chinese word segmentation module for segmenting the Chinese text into words;
a part-of-speech tagging module for tagging each segmented word with a part-of-speech tag;
an application dictionary database comprising one or more application dictionaries, each application dictionary comprising one or more keywords;
a Linguistics Query Language (LQL) rule database for storing one or more LQL rules, wherein setting an LQL rule includes:
defining the position in the Chinese text of the knowledge to be extracted (Extraction Position);
defining a coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
defining a match pattern (MatchPattern) that specifies the matching condition: when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary; when the matching condition is a word with a specific part-of-speech tag, its match pattern is the part-of-speech tag;
an LQL analysis module for performing LQL analysis, according to the LQL rules, on the segmented and part-of-speech-tagged Chinese text and extracting the required knowledge, wherein the LQL analysis includes:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words identical to the keywords defined in the matching condition of the LQL rule;
when the matching conditions are met within the coverage, extracting one or more words according to the position of the extracted knowledge in the Chinese text as defined in the LQL rule;
an extracted knowledge database for storing the extracted knowledge.
According to another aspect of the invention, a text analysis method using the above system is provided, the method comprising:
S1: obtaining Chinese text;
S2: using the Chinese word segmentation module, segmenting the Chinese text into words;
S3: using the part-of-speech tagging module, tagging the segmented words with their parts of speech;
S4: in the LQL analysis module, using the LQL rules to perform LQL analysis on the segmented and tagged Chinese text in order to extract knowledge, wherein the LQL analysis comprises the following steps:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words identical to the keywords defined in the matching condition of the LQL rule;
when the matching conditions are met within the coverage, extracting one or more words according to the position of the extracted knowledge in the Chinese text as defined in the LQL rule.
According to one aspect of the invention, a text analysis system using linguistic query is provided, the system comprising:
a text content input module for inputting text in a given language into the text analysis system;
a language word segmentation module for segmenting the text into words;
a part-of-speech tagging module for tagging each segmented word with a part-of-speech tag;
an application dictionary database comprising one or more application dictionaries;
a Linguistics Query Language (LQL) rule database for storing one or more LQL rules, wherein setting an LQL rule includes:
defining the position in the text of the knowledge to be extracted (Extraction Position);
defining a coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
defining a match pattern (MatchPattern) that specifies the matching condition: when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary; when the matching condition is a word with a specific part-of-speech tag, its match pattern is the part-of-speech tag;
an LQL analysis module for performing LQL analysis, according to the LQL rules, on the segmented and part-of-speech-tagged text and extracting the required knowledge, characterized in that the LQL analysis includes:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged text, the words identical to the keywords defined in the matching condition of the LQL rule;
when the matching conditions are met within the coverage, extracting one or more words according to the position of the extracted knowledge in the text as defined in the LQL rule;
an extracted knowledge database for storing the extracted knowledge.
According to the invention, the text analysis system using linguistic query comprises a text content input module, a text grammar analysis module, a text word segmentation module, a part-of-speech tagging module, an LQL analysis module, an extracted knowledge database, a Chinese word segmentation dictionary, an LQL rule database, an application dictionary database, an error correction rule database, an error correction module, an LQL rule input interface, an application dictionary input interface, and an error correction rule input interface.
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. Chinese word segmentation means cutting a sequence of Chinese characters into individual words. The Chinese word segmentation module segments the Chinese text into words so that, as in English, the words of a Chinese sentence are separated by spaces. The part-of-speech tagging module performs part-of-speech tagging (POS Tagging) on the segmented words.
The Chinese word segmentation dictionary contains a list of terms, in which each term carries its part-of-speech tags and the frequency with which each tag occurs. The text word segmentation module and the part-of-speech tagging module segment the Chinese text into words and tag their parts of speech on the basis of this dictionary.
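The dictionary-driven segmentation and tagging described above can be sketched as follows. The patent does not say which segmentation algorithm the module uses, so this sketch substitutes forward maximum matching, a common dictionary-based technique, and assigns each word its most frequent tag. The dictionary contents, the function names, and the "x" tag for unknown words are illustrative assumptions, and Latin letters stand in for Chinese characters.

```python
from typing import Dict, List, Tuple

# Hypothetical segmentation dictionary: term -> {part-of-speech tag: frequency}.
SEG_DICT: Dict[str, Dict[str, int]] = {
    "ab": {"n": 8, "v": 2},
    "abc": {"n": 5},
    "d": {"v": 3},
    "de": {"v": 1, "n": 4},
}


def segment(text: str, dictionary: Dict[str, Dict[str, int]]) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary term."""
    max_len = max(len(term) for term in dictionary)
    words: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so that the loop advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def tag(words: List[str], dictionary: Dict[str, Dict[str, int]],
        unknown: str = "x") -> List[Tuple[str, str]]:
    """Give each word the tag with the highest recorded frequency."""
    tagged = []
    for word in words:
        tags = dictionary.get(word)
        tagged.append((word, max(tags, key=tags.get) if tags else unknown))
    return tagged


print(tag(segment("abcde", SEG_DICT), SEG_DICT))
```

Picking the most frequent tag ignores context; the Viterbi-based tagging that the embodiment prefers resolves ambiguous words using the neighbouring tags instead.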
The application dictionary database contains one or more application dictionaries. Each application dictionary lists a series of keywords for a specific application. Application dictionaries can be used in the setting of LQL rules.
The LQL analysis module uses LQL rules to analyze the segmented and part-of-speech-tagged Chinese text and extract the required knowledge from it. Using the LQL rule input interface, users can set the LQL rules they need for different purposes, and these rules are stored in the LQL rule database. The extracted knowledge can be stored in the extracted knowledge database.
The error correction module uses error correction rules to analyze the extracted knowledge and delete knowledge that was extracted in error, thereby improving the accuracy of knowledge extraction. Using the error correction rule input interface, users can set error correction rules for different needs. The error correction rules that have been set can be stored in the error correction rule database.
According to one aspect of the invention, LQL rule settings include:
defining the position in the text of the knowledge to be extracted (Extraction Position);
defining a coverage (Coverage), which can be a sentence, a paragraph, or a document;
defining matching conditions (MatchCriteria), each of which can be a phrase list (Phrase List), a word with a specific part-of-speech tag (WORD POS), or a word without a specific part-of-speech tag (WORD NOT POS);
defining a match pattern (MatchPattern) that specifies the matching condition: for a Phrase List, the match pattern can be a file name that points to a series of keywords in an application dictionary; for WORD POS or WORD NOT POS, the match pattern is a part-of-speech tag;
defining an optional condition (OptionalCriteria), which applies to a matching condition and can be defined by a general regular expression.
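One way to picture these settings is as a small record type. The sketch below is only an illustration: the class names, field names, and validation checks are assumptions made here, and the patent does not prescribe any in-memory representation for LQL rules.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

COVERAGES = ("Sentence", "Paragraph", "Document")
CRITERIA_KINDS = ("Phrase List", "WORD POS", "WORD NOT POS")


@dataclass(frozen=True)
class MatchCriterion:
    kind: str                       # one of CRITERIA_KINDS
    pattern: str                    # dictionary file name, or a part-of-speech tag
    optional: Optional[str] = None  # OptionalCriteria, kept as raw text here

    def __post_init__(self):
        if self.kind not in CRITERIA_KINDS:
            raise ValueError(f"unknown matching condition: {self.kind}")


@dataclass(frozen=True)
class LQLRule:
    extraction_positions: Tuple[int, ...]  # e.g. (1, 3) for Select<1,3>
    coverage: str                          # one of COVERAGES
    criteria: Tuple[MatchCriterion, ...]

    def __post_init__(self):
        if self.coverage not in COVERAGES:
            raise ValueError(f"unknown coverage: {self.coverage}")
        if not self.criteria:
            raise ValueError("a rule needs at least one matching condition")
        if any(p < 1 for p in self.extraction_positions):
            raise ValueError("extraction positions are counted from 1")


# A rule in the spirit of the nationality example given in the embodiment.
nationality_rule = LQLRule(
    extraction_positions=(1, 3),
    coverage="Sentence",
    criteria=(
        MatchCriterion("WORD POS", "nr"),
        MatchCriterion("Phrase List", "nationality_word.txt"),
        MatchCriterion("WORD POS", "ns"),
    ),
)
```

Keeping the rule as plain data, separate from the matching code, fits the stated goal that rules can be entered through an input interface and stored in a rule database.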
According to one aspect of the invention, the LQL analysis module uses LQL rules to analyze the segmented and part-of-speech-tagged text, the LQL analysis including:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged text, the words identical to the keywords defined in the matching condition of the LQL rule;
when the words with the part-of-speech tag and the words identical to the keywords can be found within the coverage, that is, when the matching conditions are met, extracting one or more words according to the position of the extracted knowledge in the text as defined in the LQL rule.
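The analysis steps above can be sketched in a few lines. This is a simplified reading, not the patented implementation: it assumes the matching conditions must be satisfied in order within one coverage unit, matches greedily from left to right without backtracking, and handles only Phrase List and WORD POS conditions. The document layout (paragraphs of sentences of tagged words) and all names are assumptions, and English words stand in for Chinese ones.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Token = Tuple[str, str]           # (word, part-of-speech tag)
Sentence = List[Token]
Document = List[List[Sentence]]   # paragraphs -> sentences -> tokens


def coverage_units(document: Document, coverage: str) -> List[List[Token]]:
    """Step 1: establish the coverage by cutting the document into units."""
    if coverage == "Sentence":
        return [s for p in document for s in p]
    if coverage == "Paragraph":
        return [[tok for s in p for tok in s] for p in document]
    if coverage == "Document":
        return [[tok for p in document for s in p for tok in s]]
    raise ValueError(f"unknown coverage: {coverage}")


def match_unit(tokens: List[Token], criteria: Sequence[Tuple[str, str]],
               dictionaries: Dict[str, Set[str]],
               select: Sequence[int]) -> Optional[List[str]]:
    """Steps 2-4: satisfy each condition in order, then pick the selected words."""
    matched: List[str] = []
    start = 0
    for kind, pattern in criteria:
        for i in range(start, len(tokens)):
            word, pos = tokens[i]
            if kind == "WORD POS":
                hit = pos == pattern
            else:  # "Phrase List": pattern names an application dictionary
                hit = word in dictionaries[pattern]
            if hit:
                matched.append(word)
                start = i + 1
                break
        else:
            return None  # this condition cannot be met inside the unit
    return [matched[p - 1] for p in select]


def lql_analyze(document, coverage, criteria, dictionaries, select):
    results = []
    for unit in coverage_units(document, coverage):
        found = match_unit(unit, criteria, dictionaries, select)
        if found is not None:
            results.append(found)
    return results


DICTS = {"nationality_word.txt": {"ancestral home", "native place"}}
RULE = [("WORD POS", "nr"), ("Phrase List", "nationality_word.txt"), ("WORD POS", "ns")]
DOC = [[[("Wang Dawen", "nr"), ("ancestral home", "n"), ("is", "v"), ("Taishan", "ns")]]]
print(lql_analyze(DOC, "Sentence", RULE, DICTS, (1, 3)))
```

Widening the coverage from Sentence to Paragraph lets a rule match when the name and the place fall in neighbouring sentences, at the cost of more false matches.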
According to one aspect of the invention, error correction rule settings include defining one or more numerical values and a numerical comparison requirement. The numerical value can be defined as:
whether the extracted knowledge is single-valued or multi-valued;
a threshold on the number of sources of the extracted knowledge;
a threshold on the number of times the knowledge is extracted; or
a threshold on the number of times the knowledge is extracted as a percentage of the total count of all extracted knowledge.
The numerical comparison requirement compares a statistical value with the defined numerical value, and can be greater than, less than, or equal to.
When extracted knowledge does not satisfy one or more of the above error correction rules, that erroneously extracted knowledge is deleted.
According to one aspect of the invention, the error correction module uses error correction rules to perform an error correction analysis on the knowledge extracted from the segmented and part-of-speech-tagged text, the error correction analysis including:
computing statistics over all the extracted knowledge to obtain statistical values;
comparing the statistical values with the numerical values defined in the error correction rules;
deleting the extracted knowledge that does not satisfy the numerical comparison requirement.
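The three steps of the error correction analysis can be sketched as a count-compare-filter routine. The statistic names ("count", "sources", "percent"), the representation of extracted knowledge as (knowledge, source) pairs, and the function signature are assumptions made for illustration; the single-valued or multi-valued setting is not modelled here.

```python
import operator
from collections import Counter
from typing import Hashable, List, Tuple

COMPARATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def error_correct(extracted: List[Tuple[Hashable, str]], statistic: str,
                  comparator: str, threshold: float) -> List[Tuple[Hashable, str]]:
    """Keep only the extracted knowledge whose statistic meets the comparison."""
    # Step 1: compute statistics over all extracted knowledge.
    counts = Counter(knowledge for knowledge, _ in extracted)
    sources = {k: len({s for kk, s in extracted if kk == k}) for k in counts}
    total = sum(counts.values())
    percent = {k: 100.0 * c / total for k, c in counts.items()}
    stats = {"count": counts, "sources": sources, "percent": percent}[statistic]
    # Steps 2 and 3: compare with the rule's value and drop what fails.
    compare = COMPARATORS[comparator]
    return [(k, s) for k, s in extracted if compare(stats[k], threshold)]


extracted = [
    (("Wang Dawen", "Taishan"), "site A"),
    (("Wang Dawen", "Taishan"), "site B"),
    (("Wang Dawen", "Beijing"), "site A"),
]
# Keep knowledge that was found in more than one source.
print(error_correct(extracted, "sources", ">", 1))
```

In this example the claim seen on only one site is treated as an extraction error and removed, which mirrors the idea of using agreement across sources as a cheap accuracy check.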
According to another aspect of the invention, a text analysis method using linguistic query is provided, the method comprising the following steps:
S1: using the LQL rule input interface, define the LQL rules;
S2: using the application dictionary input interface, define the application dictionaries;
S3: using the error correction rule input interface, define the error correction rules;
S4: using the text content input module, obtain the text;
S5: using the text grammar analysis module, perform grammar analysis on the text;
S6: using the text word segmentation module, segment the text into words;
S7: using the part-of-speech tagging module, tag the segmented words with their parts of speech;
S8: in the LQL analysis module, use the LQL rules to perform LQL analysis on the segmented and tagged Chinese text in order to extract knowledge;
S9: store the extracted knowledge in the extracted knowledge database;
S10: using the error correction module, delete the erroneously extracted knowledge according to the error correction rules, so as to increase the accuracy of the extracted knowledge.
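The order of steps S4 to S10 can be made explicit with a small driver function. Every stage is passed in as a callable because the patent does not fix their implementations; the parameter names and the use of a plain list as the extracted knowledge database are assumptions for illustration. Steps S1 to S3 (defining rules and dictionaries) are assumed to have happened before the callables were built.

```python
from typing import Any, Callable, List


def run_pipeline(text: str,
                 parse_grammar: Callable[[str], Any],
                 segment: Callable[[Any], List[str]],
                 tag: Callable[[List[str]], List[tuple]],
                 lql_analyze: Callable[[List[tuple]], List[Any]],
                 error_correct: Callable[[List[Any]], List[Any]],
                 knowledge_db: List[Any]) -> List[Any]:
    parsed = parse_grammar(text)                   # S5: grammar analysis
    words = segment(parsed)                        # S6: word segmentation
    tagged = tag(words)                            # S7: part-of-speech tagging
    extracted = lql_analyze(tagged)                # S8: LQL analysis
    knowledge_db.extend(extracted)                 # S9: store extracted knowledge
    knowledge_db[:] = error_correct(knowledge_db)  # S10: delete erroneous knowledge
    return knowledge_db


# Toy stages, only to show the data flow from text (S4) to corrected knowledge.
db = run_pipeline(
    "a b",
    parse_grammar=lambda t: t,
    segment=str.split,
    tag=lambda ws: [(w, "n") for w in ws],
    lql_analyze=lambda tagged: [w for w, _ in tagged],
    error_correct=lambda ks: [k for k in ks if k != "b"],
    knowledge_db=[],
)
print(db)
```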
An advantage of the invention is that, because LQL settings are very close to natural language rather than to a general-purpose computer language, non-programmers can also set up LQL rules easily in order to extract knowledge. This lowers the difficulty of program development and effectively reduces system development and maintenance costs. In addition, the LQL rules that have been set can be accumulated and stored in the LQL rule database for reference by new applications. Furthermore, the invention can be independent of the web page format and structure of the text content, which greatly widens the range of information that can be collected.
According to various aspects of the invention, users can build different types of applications as needed simply by changing the LQL rules and updating the application dictionaries. Examples include people search, which extracts the relationships between people and organizations; a news search system, which associates a news article with a place; and brand valuation, which monitors how well a brand is recognized on different social media platforms.
Brief description of the drawings
The following drawings will help those skilled in the art to understand the invention better and will show its advantages more clearly. The drawings described here are for the purpose of illustrating selected embodiments only, not all possible implementations, and are not intended to limit the scope of the invention.
Fig. 1 is a block diagram of the text analysis system using linguistic query according to the invention;
Fig. 2 shows a method of part-of-speech tagging according to the invention;
Fig. 3 is a flow chart of the text analysis method using linguistic query according to the invention;
Fig. 4 is a flow chart of the LQL analysis method according to the invention;
Fig. 5 is a flow chart of the error correction analysis according to the invention.
Detailed description of the embodiments
Fig. 1 shows a text analysis system according to one embodiment of the invention, comprising a text content input module 101, a text grammar analysis module 102, a text word segmentation module 103, a part-of-speech tagging module 104, an LQL analysis module 105, an extracted knowledge database 106, a Chinese word segmentation dictionary 107, an LQL rule database 108, an application dictionary database 109, an error correction rule database 110, an error correction module 111, an LQL rule input interface 112, an application dictionary input interface 113, and an error correction rule input interface 114.
The text content input module 101 is used to input text content into the LQL text analysis system. The text content can be obtained from the Internet or from sources other than the Internet. When the text content is on the Internet, the text content input module 101 can use an application programming interface (Application Program Interface, API) provided by a website to obtain the text of the web pages exposed through the API. Alternatively, a web crawler can be used to crawl websites in hypertext format and extract the text from the hypertext.
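Extracting text from a crawled hypertext page can be sketched with Python's standard `html.parser` module. The class name `TextExtractor`, the helper `html_to_text`, and the choice to skip `script` and `style` content are assumptions made for this sketch; fetching the page and calling a site's API are left out because the patent names no specific site or endpoint.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = ("<html><head><style>p{}</style><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>")
print(html_to_text(page))
```

Because only the text nodes are kept, the later analysis stages do not depend on how a particular site lays out its pages, which is the independence from page structure that the description emphasizes.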
The text grammar analysis module 102 is used to analyze the grammar of the text content.
The text word segmentation module 103 is used to perform Chinese word segmentation on the text content. For example, a Chinese sentence meaning "Winter storm strikes the Philippines, feared to take a hundred lives" can be segmented into the words (given here as English glosses): winter, storm, strike, Philippines (abbreviated), feared, take, hundred, lives.
The part-of-speech tagging module 104 tags the segmented words with their parts of speech, that is, each segmented word is marked with the English letter corresponding to its part of speech, the part-of-speech tag. For example: winter/t, storm/n, strike/v, Philippines/j, feared/d, take/v, hundred/m, lives/n. Here t denotes a time word, n a noun, v a verb, j an abbreviation, d an adverb, and m a numeral.
The table below is a list of part-of-speech tags according to the invention, in which a denotes an adjective, Ag an adjective morpheme, ad an adverbial adjective, an a nominal adjective, b a distinguishing word, and so on.
Preferably, the part-of-speech tagging module 104 uses the Viterbi algorithm for part-of-speech tagging. The Viterbi algorithm is a dynamic programming algorithm for finding the most probable sequence of hidden states, called the Viterbi path, that explains a sequence of observed events, especially in the context of Markov information sources and hidden Markov models. Another method is to use the forward algorithm, which computes the probability of a sequence of observed events and likewise belongs to the field of probability theory. Fig. 2 shows an example of using the Viterbi algorithm for part-of-speech tagging according to the invention. For the sentence "Winter storm strikes the Philippines, feared to take a hundred lives", the part-of-speech tags of the words are winter/t, storm/n, strike/v, Philippines/j, feared/d, take/v, hundred/m, and lives/n.
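To make the Viterbi step concrete, here is a minimal sketch of the algorithm over a toy hidden Markov model in which the hidden states are part-of-speech tags and the observations are words. The two-tag model and all probabilities are invented for illustration and are not taken from the patent; a real tagger would estimate them from a tagged corpus or from the frequencies in the segmentation dictionary.

```python
from typing import Dict, List


def viterbi(observations: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable tag sequence (the Viterbi path)."""
    if not observations:
        return []
    # best[t][s]: probability of the best path that ends in state s at time t.
    best = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                              key=lambda item: item[1])
            best[t][s] = score * emit_p[s].get(observations[t], 0.0)
            back[t][s] = prev
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical two-tag model: n = noun, v = verb.
STATES = ["n", "v"]
START = {"n": 0.6, "v": 0.4}
TRANS = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.8, "v": 0.2}}
EMIT = {"n": {"storm": 0.7, "strikes": 0.3}, "v": {"storm": 0.1, "strikes": 0.9}}

print(viterbi(["storm", "strikes"], STATES, START, TRANS, EMIT))
```

For long sentences the products of probabilities become very small, so practical implementations usually work with sums of log-probabilities instead.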
The Chinese word segmentation dictionary 107 contains the list of terms and their corresponding part-of-speech tags, and is used for segmenting and tagging the text. The Chinese word segmentation dictionary 107 can be defined or modified by the user.
The application dictionary database 109 contains at least one application dictionary. An application dictionary is set according to the application and records a series of keywords for that specific application. Using the application dictionary input interface 113, users can create, edit, or delete application dictionaries. According to one embodiment of the invention, the application dictionary of a brand analysis application contains keywords for brand analysis, for example fashion brands (LV, Gucci, etc.) or industry-specific terms (product names, models, etc.). These keywords can be used in LQL rule settings. The table below is an application dictionary according to the invention for finding relationships between news and places.
The LQL analysis module 105 extracts the required knowledge from the segmented and part-of-speech-tagged text according to the LQL rules and stores the knowledge in the extracted knowledge database 106. LQL is a scripting language similar to Structured Query Language (SQL), except that LQL can extract the required data from unstructured text. In addition, LQL rules can be defined according to the needs of the application's users. The LQL rule input interface 112 allows users to enter LQL rules, which can be stored in the LQL rule database 108.
According to one embodiment of the invention, LQL rule settings include:
Select means to select. Extraction Position is the position in the text of the knowledge to be extracted, expressed as numbers. Select<Extraction Position> therefore means selecting the position in the text of the knowledge to be extracted.
Coverage is the coverage of the LQL analysis, which can be a sentence (Sentence), a paragraph (Paragraph), or a document (Document).
MatchCriteria is the matching condition, which can be a phrase list (Phrase List), a word with a specific part-of-speech tag (WORD POS), or a word without a specific part-of-speech tag (WORD NOT POS).
MatchPattern is the match pattern, which specifies the matching condition. For a Phrase List, the match pattern can be a file name that points to a series of keywords in an application dictionary. For WORD POS or WORD NOT POS, the match pattern is a part-of-speech tag, such as n, v, or t.
OptionalCriteria is an optional condition that applies to a matching condition and can be defined by a general regular expression.
The following is an example used to find out what someone said.
In the LQL rule, Select<1,3> indicates the positions in the text of the knowledge to be extracted. 1 and 3 refer to the first and third matching conditions (the Word NOT pos condition is not counted). Sentence indicates that the coverage is a sentence. [Word pos="nr"] finds a word tagged as a person's name; "nr" denotes a person's name. [Word NOT pos="nr"]*{0-5} requires that the five words following the found name contain no word tagged as a person's name, to prevent a situation involving two or more people. For [Phrase list="speech_word.txt"], "speech_word.txt" is a file name that points to an application dictionary containing a series of keywords, such as propose, say, emphasize, point out, express, indicate, claim, expect, think, reaffirm, estimate, and predict, all of which are synonyms of "say" and indicate what someone has said. When a sentence contains a word with the person-name part-of-speech tag together with one of the defined keywords, so that the above matching conditions are satisfied, the name (the first matching condition) and the one or more words after the keyword (the third matching condition, which is not shown) are extracted. For example: "Chen Dawen estimates that stocks will rise." According to the LQL rule, the four words "Chen Dawen, stocks, will, rise" are extracted from this sentence.
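The matching behaviour described for this rule can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `who_said_what`, the English stand-in tokens, and the in-memory keyword set (standing in for "speech_word.txt") are all assumptions, and the input is assumed to be one sentence that has already been segmented and tagged.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def who_said_what(sentence: List[Token], speech_words: Set[str],
                  max_gap: int = 5) -> Optional[List[str]]:
    """Return [name, words after the speech keyword...] or None."""
    for i, (word, pos) in enumerate(sentence):
        if pos != "nr":          # first condition: a word tagged as a name
            continue
        # The keyword may follow after 0..max_gap words that are not names.
        last = min(i + 2 + max_gap, len(sentence))
        for j in range(i + 1, last):
            nxt_word, nxt_pos = sentence[j]
            if nxt_pos == "nr":  # a second name appears: give up on this name
                break
            if nxt_word in speech_words:
                rest = [w for w, _ in sentence[j + 1:]]
                if rest:         # third condition: one or more words follow
                    return [word] + rest
    return None


speech = {"says", "estimates", "predicts"}  # stand-in for the dictionary file
tagged = [("Chen Dawen", "nr"), ("estimates", "v"),
          ("stocks", "n"), ("will", "v"), ("rise", "v")]
print(who_said_what(tagged, speech))
```

Breaking out of the inner loop when a second name is met is one way to honour the "no name within the next five words" condition; other readings of that condition are possible.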
The following is an example used to analyze a person's native place.
Select<1,3> indicates that the selected words are at the positions of [Word pos="nr"] and [Word pos="ns"]. Sentence indicates that the coverage is a sentence. [Word pos="nr"] finds a word tagged as a person's name. [Word NOT pos="nr"]*{0-5} requires that the five words following the found name contain no word tagged as a person's name, to prevent a situation involving two or more people. For [Phrase list="nationality_word.txt"], "nationality_word.txt" is a file name that points to an application dictionary containing a series of keywords, such as ancestral home and native place. [Word pos="ns"] finds a word tagged as a place name. When all four matching conditions above are satisfied in a sentence, the words for the person's name and the place are extracted. For example: "Wang Dawen's ancestral home is Taishan." "Wang Dawen" and "Taishan" are extracted.
The following is an example used to find the places where accidents occur in news content.
Select<1,3> indicates that the selected words are at the positions of [Phrase list="accidentType_word.txt"] and [Word pos="ns"]. Sentence indicates that the coverage is a sentence. [Phrase list="accidentType_word.txt"] finds keywords that denote an accident, such as wind disaster, earthquake, tsunami, and water disaster. [Phrase list="accident_word.txt"] finds keywords such as "occurred in" and "located in". [Word pos="ns"] finds a word whose part-of-speech tag is place name (ns). When all three matching conditions above are satisfied in a sentence, the keyword denoting the accident and the place name are extracted. For example: "A wind disaster occurred in the Philippines." "Wind disaster" and "Philippines" are extracted.
The following is one example used for brand analysis.
The LQL rule has the form: [brand name] + [new series/new product] + [new product name]. [brand name] is an application dictionary that contains a series of brand names. [new series/new product] is an application dictionary that contains a series of brand name prefix keywords, such as "new series". [new product name] is the product name to be found.
In the corresponding LQL rule, Select<3> indicates that the word following the keyword in product_prefix.txt is selected. Sentence indicates that the coverage is a sentence. [Phrase list="brand_name.txt"] finds the brand name keywords pointed to by brand_name.txt. [Phrase list="product_prefix.txt"] finds the brand name prefix keywords pointed to by product_prefix.txt. When the two matching conditions above are satisfied in a sentence, the new product name is found and extracted. For example, in a sentence about the 2011 GUCCI "new series bamboo bag", "bamboo bag" is extracted as the new product name.
Often, multiple answers are extracted, but only one or a few of them are correct. The error correction module 111 can delete wrongly extracted knowledge according to error correction rules. The error correction rule input interface 114 allows the user to set and enter error correction rules. The error correction rules can be stored in the error correction rule database 110. In addition, the error correction module 111 can compute statistics on the extracted knowledge to obtain statistical values.
The following illustrates an example used to find a person's date of birth.
The error correction rule is:
The answer is unique, that is, single-valued (because a person has only one date of birth);
The number of sources of the extracted knowledge must be greater than 3 (for example, the extracted knowledge is obtained from more than three different websites);
The number of times the knowledge is extracted, as a percentage of all extracted knowledge, must be greater than 70%.
Here, 3 and 70% are the numerical values defined in the error correction rule, and "greater than" is the numerical comparison requirement defined in the error correction rule. 3 and 70% may therefore also be called thresholds. The numbers in the figure are the statistical values of the extracted knowledge. Only 06/07/1951 meets the above numerical comparison requirements, because its number of sources (a statistical value of 6) is greater than 3 and its share of all extracted knowledge (a statistical value of 88%) is also greater than 70%; it is therefore selected as the correct answer. The other two candidates, 07/06/1951 and 06/07/1952, are deleted.
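The threshold check in this example can be sketched as below. The function and class names are assumptions for illustration. Only the figures for 06/07/1951 (6 sources, 88%) come from the text; the counts for the other two dates are hypothetical values chosen so that the shares add up to 100%. How a single-valued rule should break ties is not stated in the text, so keeping the most frequent survivor is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Stats:
    sources: int  # number of distinct sources the answer was extracted from
    count: int    # number of times the answer was extracted


def apply_rule(candidates: Dict[str, Stats], min_sources: int,
               min_share: float, single_value: bool) -> List[str]:
    """Keep answers whose statistics exceed both thresholds."""
    total = sum(s.count for s in candidates.values())
    if total == 0:
        return []
    kept = [answer for answer, s in candidates.items()
            if s.sources > min_sources and s.count / total > min_share]
    if single_value and len(kept) > 1:
        # Assumption: for a single-valued answer keep the most frequent one.
        kept = [max(kept, key=lambda a: candidates[a].count)]
    return kept


candidates = {
    "06/07/1951": Stats(sources=6, count=88),  # 6 sources, 88% (from the text)
    "07/06/1951": Stats(sources=1, count=7),   # hypothetical statistics
    "06/07/1952": Stats(sources=1, count=5),   # hypothetical statistics
}
print(apply_rule(candidates, min_sources=3, min_share=0.70, single_value=True))
```

The same function covers multi-valued rules by passing `single_value=False` and a lower share threshold such as 0.20.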
The following illustrates an example used to find the places where earthquakes have occurred.
The error correction rule is:
There can be multiple answers, that is, multi-valued (because several earthquakes can occur in the same period);
The number of sources of the extracted knowledge must be greater than 3;
The number of times the knowledge is extracted, as a percentage of all extracted knowledge, must be greater than 20%.
Here, 3 and 20% are the numerical values defined in the error correction rule, and "greater than" is the numerical comparison requirement defined in the error correction rule. 3 and 20% may therefore also be called thresholds. Only Wenchuan, Sichuan and Yushu, Qinghai meet the above numerical comparison requirements, and they are therefore selected as correct answers. Yunchuan, Sichuan has only one text source, and its share of all extracted knowledge is only 2%, so it is deleted.
The following illustrates an example used to find new product names.
The error correction rule is:
There can be multiple answers, that is, multi-valued (because there can be several new products at the same time);
The number of sources of the extracted knowledge must be greater than 3;
The number of times the knowledge is extracted, as a percentage of all extracted knowledge, must be greater than 20%.
Here, 3 and 20% are the thresholds in the error correction rule. "Bamboo bag" and "crime love undercurrent" meet the above numerical comparison requirements and are therefore selected as correct answers. "Favorite undercurrent", however, fails to meet the above requirements and is therefore deleted.
According to another aspect of the present invention, a text analysis method using language query is provided. As shown in Fig. 3, the method includes the following steps:
S301: using the LQL rule input interface, define the LQL rules;
S302: using the application dictionary input interface, define the application dictionaries;
S303: using the error correction rule input interface, define the error correction rules;
S304: using the text content input module, obtain the text content;
S305: using the text syntax analysis module, perform syntax analysis on the text;
S306: using the text word segmentation module, segment the text into words;
S307: using the part-of-speech tagging module, tag the segmented words with their parts of speech;
S308: in the LQL analysis module, using the LQL rules, perform LQL analysis on the segmented and tagged text to extract knowledge;
S309: store the extracted knowledge in the extracted-knowledge database;
S310: using the error correction module, and according to the error correction rules, delete wrongly extracted knowledge to increase the accuracy of the extracted knowledge.
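The order of steps S304 to S310 can be pictured as a simple function composition. The sketch below is only an illustration under stated assumptions: `run_pipeline` and its parameters are hypothetical names, syntax analysis (S305) is omitted, and the segmenter, tagger, extractor, and corrector are toy stand-ins (whitespace splitting, a fixed place-name set, and a "seen more than once" filter) rather than the modules of the patent.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def run_pipeline(texts: Iterable[str],
                 segment: Callable[[str], List[str]],
                 tag: Callable[[List[str]], List[Token]],
                 extract: Callable[[List[Token]], List[str]],
                 correct: Callable[[List[str]], List[str]]) -> List[str]:
    extracted: List[str] = []              # stands in for the database (S309)
    for text in texts:                     # S304: obtain the text content
        words = segment(text)              # S306: word segmentation
        tagged = tag(words)                # S307: part-of-speech tagging
        extracted.extend(extract(tagged))  # S308: LQL analysis
    return correct(extracted)              # S310: error correction


# Toy stand-ins, for illustration only.
PLACES = {"Wenchuan", "Yunchuan"}


def toy_tag(words: List[str]) -> List[Token]:
    return [(w, "ns" if w in PLACES else "x") for w in words]


def toy_extract(tagged: List[Token]) -> List[str]:
    return [w for w, pos in tagged if pos == "ns"]


def toy_correct(answers: List[str]) -> List[str]:
    counts = Counter(answers)
    return [a for a, n in counts.items() if n > 1]


texts = ["quake hit Wenchuan", "Wenchuan quake news", "storm near Yunchuan"]
print(run_pipeline(texts, str.split, toy_tag, toy_extract, toy_correct))
```

Passing the stages in as callables reflects the point made elsewhere in the description that a different language only requires a different segmentation and tagging module.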
In step S308, as shown in Fig. 4, the LQL analysis includes the following steps:
S401: establish the coverage defined in the LQL rule;
S402: according to the part-of-speech tag defined in the matching condition of the LQL rule, find the words having that part-of-speech tag in the segmented and tagged text;
S403: according to the keywords defined in the matching condition of the LQL rule, find the words identical to those keywords in the segmented and tagged text;
S404: when the matching conditions are satisfied within the coverage, extract one or more words from the segmented and tagged text according to the position in the text of the knowledge to be extracted, as defined in the LQL rule.
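Step S401 amounts to cutting the text into the units over which matching is attempted. A minimal sketch follows, assuming that paragraphs are separated by blank lines and that sentences end in one of a small set of Western or Chinese terminators; the function name `split_coverage` and both of these conventions are assumptions, since the text does not say how coverage boundaries are detected.

```python
import re
from typing import List


def split_coverage(document: str, coverage: str) -> List[str]:
    """Split a document into the units named by the rule's coverage."""
    if coverage == "document":
        return [document]
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document)
                  if p.strip()]
    if coverage == "paragraph":
        return paragraphs
    if coverage == "sentence":
        units: List[str] = []
        for paragraph in paragraphs:
            # Assumption: a sentence ends right after one of these marks.
            pieces = re.split(r"(?<=[.!?。！？])\s*", paragraph)
            units.extend(s.strip() for s in pieces if s.strip())
        return units
    raise ValueError(f"unknown coverage: {coverage!r}")


doc = "Chen says yes. Wang says no.\n\nThird one here."
print(split_coverage(doc, "sentence"))
```

Each returned unit would then be segmented, tagged, and checked against the matching conditions on its own, so that a match never spans two coverage units.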
In step S310, as shown in Fig. 5, the error correction analysis includes the following steps:
S501: compute statistics on the extracted knowledge to obtain statistical values;
S502: compare the statistical values with the numerical values defined in the error correction rule;
S503: delete the extracted knowledge that does not meet the numerical comparison requirements.
The text analysis method and system using language query according to the present invention are applicable not only to Chinese but also to other languages, such as English, German, Japanese, and Korean; it is only necessary to use a suitable word segmentation module and part-of-speech tagging module.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary hardware platform, and of course can also be implemented entirely in hardware, although in many cases the former is the better embodiment. Based on this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
While the present invention has been shown and described, those skilled in the art will appreciate that changes may be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims (21)
1. A text analysis system using language query, characterized in that the system comprises:
a text content input module for inputting Chinese text into the text analysis system;
a Chinese word segmentation module for segmenting the Chinese text into words;
a part-of-speech tagging module for tagging the segmented words with part-of-speech tags;
an application dictionary database comprising one or more application dictionaries, each application dictionary comprising one or more keywords;
a language query language (LQL) rule database for storing one or more LQL rules, wherein setting an LQL rule comprises:
defining the position (Extraction Position) in the Chinese text of the knowledge to be extracted;
defining a coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
defining a match pattern (MatchPattern), the match pattern being used to define the matching condition, wherein, when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary, and, when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech tag;
an LQL analysis module for performing, according to the LQL rules, LQL analysis on the segmented and part-of-speech-tagged Chinese text and extracting the required knowledge, wherein the LQL analysis comprises:
establishing the coverage defined in the LQL rule;
according to the part-of-speech tag defined in the matching condition of the LQL rule, finding the words having that part-of-speech tag in the segmented and part-of-speech-tagged Chinese text;
according to the keywords defined in the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and part-of-speech-tagged Chinese text;
when the matching conditions are satisfied within the coverage, extracting one or more words according to the position in the text of the knowledge to be extracted, as defined in the LQL rule;
an extracted-knowledge database for storing the extracted knowledge.
2. The text analysis system according to claim 1, characterized in that the system further comprises:
an error correction rule database for storing one or more error correction rules;
an error correction module capable of using the error correction rules to perform error correction analysis on the extracted knowledge, so as to delete wrongly extracted knowledge and increase the accuracy of the extracted knowledge.
3. The text analysis system according to claim 2, characterized in that the error correction rule comprises setting one or more numerical values and one or more numerical comparison requirements; the error correction module computes statistics on the extracted knowledge to obtain statistical values and compares them with the numerical values; when the statistical value of an item of extracted knowledge does not meet the numerical comparison requirement, that item of extracted knowledge is deleted.
4. The text analysis system according to claim 3, characterized in that the statistical values comprise the number of sources of the extracted knowledge, the number of times the knowledge is extracted, or the percentage that the number of times the knowledge is extracted accounts for of all extracted knowledge.
5. The text analysis system according to claim 3, characterized in that the numerical values comprise a threshold for the number of sources of the extracted knowledge, a threshold for the number of times the knowledge is extracted, or a threshold for the percentage that the number of times the knowledge is extracted accounts for of all extracted knowledge; the numerical comparison requirement is a comparison of the statistical value with the numerical value, in which the statistical value is greater than, less than, or equal to the numerical value.
6. The text analysis system according to claim 1, characterized in that the system further comprises:
a text syntax analysis module for analyzing the syntax of the Chinese text;
a Chinese word segmentation dictionary comprising a word list, the part-of-speech tags of the words in the word list, and the frequency with which those part-of-speech tags occur, used for word segmentation and part-of-speech tagging of the Chinese text;
an LQL rule input interface for allowing a user to set the LQL rules;
an application dictionary input interface for allowing a user to set the application dictionaries.
7. The text analysis system according to claim 2, characterized in that the system further comprises:
an error correction rule input interface for allowing a user to enter error correction rules.
8. The text analysis system according to claim 1, characterized in that the Chinese text is obtained from the internet.
9. The text analysis system according to claim 8, characterized in that an application program interface or a web crawler is used to obtain the Chinese text on the internet.
10. The text analysis system according to claim 1, characterized in that the Viterbi algorithm or the forward algorithm is used to tag the segmented words with their parts of speech.
11. The text analysis system according to claim 1, characterized in that the matching condition further comprises a word without a specific part-of-speech tag (WORD NOT POS), whose match pattern is a part-of-speech tag.
12. The text analysis system according to claim 1, characterized in that setting an LQL rule further comprises:
defining an optional condition (OptionalCriteria) to be applied to the matching condition.
13. The text analysis system according to claim 2, characterized in that the error correction rule comprises setting whether the extracted knowledge is single-valued or multi-valued.
14. The text analysis system according to claim 1, characterized in that the system is applied to people search, a news search system, or brand analysis.
15. A text analysis method using the system according to claim 1, characterized in that the method comprises:
S1: obtaining Chinese text;
S2: using the Chinese word segmentation module, segmenting the Chinese text into words;
S3: using the part-of-speech tagging module, tagging the segmented words with their parts of speech;
S4: in the LQL analysis module, using the LQL rules, performing LQL analysis on the segmented and tagged Chinese text to extract knowledge, wherein the LQL analysis comprises the following steps:
establishing the coverage defined in the LQL rule;
according to the part-of-speech tag defined in the matching condition of the LQL rule, finding the words having that part-of-speech tag in the segmented and part-of-speech-tagged Chinese text;
according to the keywords defined in the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and part-of-speech-tagged Chinese text;
when the matching conditions are satisfied within the coverage, extracting one or more words according to the position in the text of the knowledge to be extracted, as defined in the LQL rule.
16. The text analysis method according to claim 15, characterized in that the method further comprises:
according to error correction rules, performing error correction analysis on the extracted knowledge, so as to delete wrongly extracted knowledge and increase the accuracy of the extracted knowledge.
17. The text analysis method according to claim 16, characterized in that the error correction analysis comprises:
computing statistics on the extracted knowledge to obtain statistical values;
comparing the statistical values with the numerical values defined in the error correction rule;
when the statistical value of an item of extracted knowledge does not meet the numerical comparison requirement, deleting that item of extracted knowledge.
18. The text analysis method according to claim 17, characterized in that the statistical values comprise the number of sources of the extracted knowledge, the number of times the knowledge is extracted, or the percentage that the number of times the knowledge is extracted accounts for of all extracted knowledge.
19. The text analysis method according to claim 17, characterized in that the numerical values comprise a threshold for the number of sources of the extracted knowledge, a threshold for the number of times the knowledge is extracted, or a threshold for the percentage that the number of times the knowledge is extracted accounts for of all extracted knowledge; the numerical comparison requirement is a comparison of the statistical value with the numerical value, in which the statistical value is greater than, less than, or equal to the numerical value.
20. The text analysis method according to claim 15, characterized in that the method is applied to people search, a news search system, or brand analysis.
21. A text analysis system using language query, characterized in that the system is applicable to different languages, the system comprising:
a text content input module for inputting text in the language into the text analysis system;
a language word segmentation module for segmenting the text into words;
a part-of-speech tagging module for tagging the segmented words with part-of-speech tags;
an application dictionary database comprising one or more application dictionaries;
a language query language (LQL) rule database for storing one or more LQL rules,
wherein setting an LQL rule comprises:
defining the position (Extraction Position) in the text of the knowledge to be extracted;
defining a coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
defining a match pattern (MatchPattern), the match pattern being used to define the matching condition, wherein, when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary, and, when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech tag;
an LQL analysis module for performing, according to the LQL rules, LQL analysis on the segmented and part-of-speech-tagged text and extracting the required knowledge, characterized in that the LQL analysis comprises:
establishing the coverage defined in the LQL rule;
according to the part-of-speech tag defined in the matching condition of the LQL rule, finding the words having that part-of-speech tag in the segmented and part-of-speech-tagged text;
according to the keywords defined in the matching condition of the LQL rule, finding the words identical to those keywords in the segmented and part-of-speech-tagged text;
when the matching conditions are satisfied within the coverage, extracting one or more words according to the position in the text of the knowledge to be extracted, as defined in the LQL rule;
an extracted-knowledge database for storing the extracted knowledge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310330423.5A CN104346382B (en) | 2013-07-31 | 2013-07-31 | Use the text analysis system and method for language inquiry |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104346382A CN104346382A (en) | 2015-02-11 |
CN104346382B true CN104346382B (en) | 2017-08-29 |
Family
ID=52501997
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310330423.5A Active CN104346382B (en) | 2013-07-31 | 2013-07-31 | Use the text analysis system and method for language inquiry |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104346382B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104778262B (en) * | 2015-04-21 | 2018-07-24 | 无锡天脉聚源传媒科技有限公司 | A kind of searching method and device |
CN105243130A (en) * | 2015-09-29 | 2016-01-13 | 中国电子科技集团公司第三十二研究所 | Text processing system and method for data mining |
CN107870966A (en) * | 2017-08-11 | 2018-04-03 | 成都萌想科技有限责任公司 | A kind of recruitment general regulations data pick-up method based on semantic model |
CN109214005A (en) * | 2018-09-14 | 2019-01-15 | 南威软件股份有限公司 | A kind of clue extracting method and system based on Chinese word segmentation |
CN109558589A (en) * | 2018-11-12 | 2019-04-02 | 速度时空信息科技股份有限公司 | A kind of method and system of the free thought document based on Chinese words segmentation |
CN113239206B (en) * | 2021-06-18 | 2023-05-12 | 广东博维创远科技有限公司 | Judgment document accurate data classification analysis method and computer readable storage device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102207947A (en) * | 2010-06-29 | 2011-10-05 | 天津海量信息技术有限公司 | Direct speech material library generation method |
CN102253930A (en) * | 2010-05-18 | 2011-11-23 | 腾讯科技(深圳)有限公司 | Method and device for translating text |
CN102654873A (en) * | 2011-03-03 | 2012-09-05 | 苏州同程旅游网络科技有限公司 | Tourism information extraction and aggregation method based on Chinese word segmentation |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101025814B1 (en) * | 2008-12-16 | 2011-04-04 | 한국전자통신연구원 | Method for tagging morphology by using prosody modeling and its apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||