CN104346382B - Text analysis system and method using a linguistics query language - Google Patents

Publication number
CN104346382B
Authority
CN
China
Prior art keywords
text
knowledge
lql
extracted
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310330423.5A
Other languages
Chinese (zh)
Other versions
CN104346382A (en
Inventor
倪伟定
蔡日星
蔡帆
蔡一帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hong Kong Polytechnic University HKPU
Original Assignee
Hong Kong Polytechnic University HKPU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hong Kong Polytechnic University HKPU
Priority to CN201310330423.5A
Publication of CN104346382A
Application granted
Publication of CN104346382B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

The invention discloses a text analysis system and method using a linguistics query language, which can obtain Chinese text from the network and analyze it to extract the required knowledge. The invention uses Chinese word segmentation and Linguistics Query Language (LQL) techniques. Chinese word segmentation splits the Chinese text into words and tags each segmented word with its part of speech. The LQL technique then performs LQL analysis on the segmented and tagged Chinese text and extracts knowledge. The system also provides an error-correction analysis for deleting wrongly extracted knowledge. An advantage of the invention is that people who are not computer programmers can also set up LQL rules easily. In addition, the invention is independent of the web format and structure of the text content, which greatly widens the range of information that can be collected. The invention is applicable to fields such as network information extraction, business intelligence mining, information fusion, and the building of networked knowledge bases.

Description

Text analysis system and method using a linguistics query language
Technical field
The invention belongs to the network branch of computer science, and in particular relates to a text analysis system and method using a linguistics query language, applicable to fields such as network information extraction, business intelligence mining, information fusion, and the building of networked knowledge bases.
Background art
With the rapid development of the Internet, the information on the network is growing explosively, and people are increasingly accustomed to obtaining information online. However, because there is so much information on the network, people find it hard to locate what they need even with a search engine. In addition, the network contains a great deal of irrelevant noise: much of the information that is retrieved may be unrelated or inaccurate.
It is therefore desirable to have an intelligent tool that, according to the user's wishes, helps people remove the noise and filter the information they really need out of this large volume of information.
Traditional natural language processing (NLP) systems can use NLP techniques such as part-of-speech tagging, classification trees, synonyms, and thesauri to extract the central meaning from text content. Many computer programs have accordingly been developed to extract knowledge from text processed by NLP. However, developing such programs is usually very time-consuming. Moreover, as time passes, more programs are needed to extract new knowledge, which makes the whole analysis system expensive to maintain. In many cases the extracted knowledge is ambiguous and must also be reviewed and corrected manually.
Chinese invention patent applications No. 200810142630.7 and No. 200910104805.X propose text analysis systems that analyze text using classification trees. However, those systems depend heavily on the structure of the blog or web page used as input. For many text analysis systems, the content sources (such as news articles from different news websites, or microblog posts) may not have a good or consistent structure, which means that each website or web page needs its own rules. In addition, the structure of a content source may change over time, and whenever it changes the classification tree must be rebuilt. None of this is cost-effective.
U.S. Patent Application Publication No. 2011/019671 and PCT International Publication No. WO2012/099970A1 propose a brand valuation system. The system collects sales and traffic data from brand websites in order to assess the value of a brand. It also attempts to compare different brands to create a brand index for an industry. The problem with this system is that collecting the sales and traffic data of competitors' websites is extremely difficult. In theory the index could be built if one organization could collect the data from the different companies, but in practice this is infeasible because sales data is usually highly confidential.
Summary of the invention
In view of the above problems, the invention discloses a text analysis system and method using a linguistics query language. The invention uses Chinese word segmentation (Chinese Segmentation) and Linguistics Query Language (LQL) techniques. Chinese word segmentation splits the Chinese text into words and tags each segmented word with its part of speech (Part-of-Speech, POS Tagging). The LQL technique then further analyzes the segmented and POS-tagged Chinese text to extract the required knowledge.
According to one aspect of the invention, a text analysis system using a linguistics query language is provided, the system comprising:
A text content input module, for inputting Chinese text into the text analysis system;
A Chinese word segmentation module, for splitting the Chinese text into words;
A part-of-speech tagging module, for tagging each segmented word with a part-of-speech tag;
An application dictionary database, including one or more application dictionaries, each application dictionary including one or more keywords;
A Linguistics Query Language (LQL) rule database, for storing one or more LQL rules, wherein setting an LQL rule includes:
Defining the position in the Chinese text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
Defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
Defining the match pattern (MatchPattern), which specifies the matching condition: when the matching condition is a phrase list, the match pattern is a file name that points to one or more keywords in the application dictionary; when the matching condition is a word with a specific part-of-speech tag, the match pattern is a part-of-speech tag;
An LQL analysis module, for performing LQL analysis on the segmented and POS-tagged Chinese text according to the LQL rules and extracting the required knowledge, wherein the LQL analysis includes:
Establishing the coverage defined in the LQL rule;
Finding, in the segmented and POS-tagged Chinese text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
Finding, in the segmented and POS-tagged Chinese text, the words identical to the keywords defined in the matching condition of the LQL rule;
When the matching conditions are satisfied within the coverage, extracting one or more words according to the extraction position defined in the LQL rule;
An extracted knowledge database, for storing the extracted knowledge.
According to another aspect of the invention, a text analysis method using the above system is provided, the method comprising:
S1: Obtaining Chinese text;
S2: Using the Chinese word segmentation module to split the Chinese text into words;
S3: Using the part-of-speech tagging module to tag the segmented words with their parts of speech;
S4: In the LQL analysis module, using the LQL rules to perform LQL analysis on the segmented and tagged Chinese text in order to extract knowledge, wherein the LQL analysis comprises the following steps:
Establishing the coverage defined in the LQL rule;
Finding, in the segmented and POS-tagged Chinese text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
Finding, in the segmented and POS-tagged Chinese text, the words identical to the keywords defined in the matching condition of the LQL rule;
When the matching conditions are satisfied within the coverage, extracting one or more words according to the extraction position defined in the LQL rule.
According to one aspect of the invention, a text analysis system using a linguistics query language is provided, the system comprising:
A text content input module, for inputting text in a language into the text analysis system;
A language word segmentation module, for splitting the text into words;
A part-of-speech tagging module, for tagging each segmented word with a part-of-speech tag;
An application dictionary database, including one or more application dictionaries;
A Linguistics Query Language (LQL) rule database, for storing one or more LQL rules, wherein setting an LQL rule includes:
Defining the position in the text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
Defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
Defining the match pattern (MatchPattern), which specifies the matching condition: when the matching condition is a phrase list, the match pattern is a file name that points to one or more keywords in the application dictionary; when the matching condition is a word with a specific part-of-speech tag, the match pattern is a part-of-speech tag;
An LQL analysis module, for performing LQL analysis on the segmented and POS-tagged text according to the LQL rules and extracting the required knowledge, characterized in that the LQL analysis includes:
Establishing the coverage defined in the LQL rule;
Finding, in the segmented and POS-tagged text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
Finding, in the segmented and POS-tagged text, the words identical to the keywords defined in the matching condition of the LQL rule;
When the matching conditions are satisfied within the coverage, extracting one or more words according to the extraction position defined in the LQL rule;
An extracted knowledge database, for storing the extracted knowledge.
According to the invention, the text analysis system using a linguistics query language includes a text content input module, a text grammar analysis module, a text segmentation module, a part-of-speech tagging module, an LQL analysis module, an extracted knowledge database, a Chinese word segmentation dictionary, an LQL rule database, an application dictionary database, an error-correction rule database, an error-correction module, an LQL rule input interface, an application dictionary input interface, and an error-correction rule input interface.
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a given specification. Chinese word segmentation means cutting a sequence of Chinese characters into individual words. The Chinese word segmentation module splits the Chinese text into words, so that, as in English, there is a space between the words of a Chinese sentence. The part-of-speech tagging module tags the segmented words with their parts of speech (POS Tagging).
The Chinese word segmentation dictionary includes a list of terms, together with the part-of-speech tags of each term and the frequency with which each tag occurs. The text segmentation module and the part-of-speech tagging module segment and tag the Chinese text on the basis of this dictionary.
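As a rough illustration of dictionary-based segmentation and tagging, the sketch below uses forward maximum matching and then gives each term its most frequent tag. The source does not name a segmentation algorithm, so forward maximum matching is a stand-in chosen for brevity; the dictionary entries, frequencies, the placeholder tag `x`, and the names `SEG_DICT` and `segment_and_tag` are all hypothetical.

```python
# Toy segmentation dictionary: term -> {POS tag: frequency}.
# All entries are hypothetical illustration data, not taken from the patent.
SEG_DICT = {
    "冬季": {"t": 120},
    "风暴": {"n": 80, "v": 2},
    "袭": {"v": 30},
    "菲": {"j": 12, "n": 3},
}
MAX_LEN = max(len(term) for term in SEG_DICT)


def segment_and_tag(text):
    """Forward maximum matching, then pick each term's most frequent tag."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest dictionary term starting at position i first.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in SEG_DICT:
                tags = SEG_DICT[candidate]
                tokens.append((candidate, max(tags, key=tags.get)))
                i += size
                break
        else:
            # Unknown character: emit it alone with the placeholder tag "x".
            tokens.append((text[i], "x"))
            i += 1
    return tokens


tagged = segment_and_tag("冬季风暴袭菲")
print(tagged)
```

A frequency-only choice of tag ignores context; the embodiment described later prefers a sequence model for tagging.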
The application dictionary database includes one or more application dictionaries. Each application dictionary describes a series of keywords for a specific application. Application dictionaries can be used in the setting of LQL rules.
The LQL analysis module uses LQL rules to analyze the segmented and POS-tagged Chinese text and extract the required knowledge from it. Using the LQL rule input interface, the user can set the LQL rules needed for different requirements, and the rules are stored in the LQL rule database. The extracted knowledge is stored in the extracted knowledge database.
The error-correction module uses error-correction rules to analyze the extracted knowledge and delete knowledge that was extracted wrongly, thereby improving the accuracy of knowledge extraction. Using the error-correction rule input interface, the user can set error-correction rules for different requirements. The rules are stored in the error-correction rule database.
According to one aspect of the invention, setting an LQL rule includes:
Defining the position in the text of the knowledge to be extracted (Extraction Position);
Defining the coverage (Coverage), which can be a sentence, a paragraph, or a document;
Defining the matching condition (MatchCriteria), which can be a phrase list (Phrase List), a word with a specific part-of-speech tag (WORD POS), or a word without a specific part-of-speech tag (WORD NOT POS);
Defining the match pattern (MatchPattern), which specifies the matching condition: for Phrase List, the match pattern can be a file name that points to a series of keywords in an application dictionary; for WORD POS or WORD NOT POS, the match pattern is a part-of-speech tag;
Defining an optional condition (OptionalCriteria), which is applied to a matching condition and can be defined by a general regular expression.
According to one aspect of the invention, the LQL analysis module uses the LQL rules to analyze the segmented and POS-tagged text, the LQL analysis including:
Establishing the coverage defined in the LQL rule;
Finding, in the segmented and POS-tagged text, the words that carry the part-of-speech tag defined in the matching condition of the LQL rule;
Finding, in the segmented and POS-tagged text, the words identical to the keywords defined in the matching condition of the LQL rule;
When, within the coverage, both a word with the part-of-speech tag and a word identical to the keyword can be found, that is, when the matching conditions are satisfied, extracting one or more words according to the extraction position defined in the LQL rule.
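The analysis steps above can be sketched as a small matcher. This is a minimal sketch under stated assumptions, not the patented implementation: the coverage is fixed to one sentence, criteria must match in order but need not be adjacent, the WORD NOT POS and OptionalCriteria conditions are left out, and the names `Criterion`, `matches`, and `analyse` are hypothetical. Extraction positions are 1-based, following the `Select<1,3>` convention used in the examples of the description.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    kind: str        # "WORD_POS" or "PHRASE_LIST"
    pattern: object  # a POS tag (str) or a set of keywords


def matches(criterion, token):
    """Check one (word, tag) token against one matching condition."""
    word, tag = token
    if criterion.kind == "WORD_POS":
        return tag == criterion.pattern
    if criterion.kind == "PHRASE_LIST":
        return word in criterion.pattern
    raise ValueError(f"unknown criterion kind: {criterion.kind}")


def analyse(sentences, criteria, extraction_positions):
    """Per sentence: match the criteria in order, then extract the words."""
    extracted = []
    for sentence in sentences:           # coverage: one sentence at a time
        found = []
        start = 0
        for criterion in criteria:
            index = next(
                (i for i in range(start, len(sentence))
                 if matches(criterion, sentence[i])),
                None,
            )
            if index is None:
                break                    # a condition failed: no extraction
            found.append(index)
            start = index + 1
        else:
            # All conditions satisfied: pick the words at the given positions.
            extracted.append(
                [sentence[found[p - 1]][0] for p in extraction_positions]
            )
    return extracted


rule = [
    Criterion("WORD_POS", "nr"),                                   # person name
    Criterion("PHRASE_LIST", {"ancestral home", "native place"}),  # keywords
    Criterion("WORD_POS", "ns"),                                   # place name
]
text = [
    [("Wang Dawen", "nr"), ("ancestral home", "n"), ("is", "v"), ("Taishan", "ns")],
    [("storm", "n"), ("strikes", "v")],
]
result = analyse(text, rule, extraction_positions=(1, 3))
print(result)
```

Only the first sentence satisfies all three conditions, so only its name and place are returned.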
According to one aspect of the invention, setting an error-correction rule includes defining one or more numerical values and a numerical comparison requirement. The numerical values can be defined by:
Defining whether the extracted knowledge is single-valued or multi-valued;
Defining a threshold for the number of sources of the extracted knowledge;
Defining a threshold for the number of times the knowledge is extracted; or
Defining a threshold for the percentage that the number of times the knowledge is extracted represents of all extracted knowledge.
The numerical comparison requirement compares a statistical value with the numerical value, and can be greater than, less than, or equal to.
When extracted knowledge does not satisfy one or more of the above error-correction rules, that wrongly extracted knowledge is deleted.
According to one aspect of the invention, the error-correction module uses the error-correction rules to analyze the knowledge extracted from the segmented and POS-tagged text, the error-correction analysis including:
Computing statistics over all the extracted knowledge, to obtain statistical values;
Comparing the statistical values with the numerical values defined in the error-correction rules;
Deleting the extracted knowledge that does not meet the numerical comparison requirement.
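The three error-correction steps can be sketched as follows. This is an illustrative sketch only: the statistic names (`sources`, `count`, `share`), the rule tuple format, and the function name `correct_errors` are assumptions made here, and resolving a single-valued answer by keeping the most frequent surviving value is also an assumption, since the source does not say how ties among candidates are resolved.

```python
import operator
from collections import Counter

# Comparison requirements: greater than, less than, or equal to.
COMPARATORS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def correct_errors(extractions, rules, single_valued=False):
    """extractions: list of (value, source) pairs; rules: (stat, op, threshold)."""
    # Step 1: compute statistics over all extracted knowledge.
    counts = Counter(value for value, _ in extractions)
    sources = {}
    for value, source in extractions:
        sources.setdefault(value, set()).add(source)
    total = len(extractions)

    kept = []
    for value, count in counts.items():
        stats = {
            "sources": len(sources[value]),  # number of distinct sources
            "count": count,                  # times this value was extracted
            "share": count / total,          # fraction of all extractions
        }
        # Steps 2 and 3: compare with each threshold, drop values that fail.
        if all(COMPARATORS[op](stats[name], threshold)
               for name, op, threshold in rules):
            kept.append(value)

    # Assumption: a single-valued answer keeps only the most frequent value.
    if single_valued and len(kept) > 1:
        kept = [max(kept, key=lambda v: counts[v])]
    return kept


birth_dates = [
    ("1970-01-01", "site-a"), ("1970-01-01", "site-b"),
    ("1970-01-01", "site-c"), ("1970-01-01", "site-d"),
    ("1971-01-01", "site-e"),
]
answer = correct_errors(birth_dates, [("sources", ">", 3)], single_valued=True)
print(answer)
```

The dates and site names are made-up test data; the value seen on only one site fails the "more than 3 sources" rule and is dropped.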
According to another aspect of the invention, a text analysis method using a linguistics query language is provided, the method comprising the following steps:
S1: Using the LQL rule input interface to define LQL rules;
S2: Using the application dictionary input interface to define application dictionaries;
S3: Using the error-correction rule input interface to define error-correction rules;
S4: Using the text content input module to obtain text;
S5: Using the text grammar analysis module to perform grammar analysis on the text;
S6: Using the text segmentation module to split the text into words;
S7: Using the part-of-speech tagging module to tag the segmented words with their parts of speech;
S8: In the LQL analysis module, using the LQL rules to perform LQL analysis on the segmented and tagged Chinese text in order to extract knowledge;
S9: Storing the extracted knowledge in the extracted knowledge database;
S10: Using the error-correction module to delete wrongly extracted knowledge according to the error-correction rules, in order to increase the accuracy of the extracted knowledge.
An advantage of the invention is that, because LQL rules are written in a form very close to natural language rather than a general computer language, people who are not computer programmers can also set up LQL rules easily to extract knowledge. This lowers the difficulty of program development and effectively reduces system development and maintenance costs. The LQL rules that have been set can also be accumulated and stored in the LQL rule database as references for new applications. In addition, the invention is independent of the web page format and structure of the text content, which greatly widens the range of information that can be collected.
According to various aspects of the invention, users can build different types of applications as needed simply by changing the LQL rules and updating the application dictionaries. Examples include a people search, which extracts relationships between people and organizations; a news search system, which links a news article to a place; and brand valuation, which monitors how well a brand is recognized on different social media platforms.
Brief description of the drawings
The following drawings will help those skilled in the art to better understand the invention and show its advantages more clearly. The drawings described here are only for illustrating selected embodiments, not all possible implementations, and are not intended to limit the scope of the invention.
Fig. 1 is a block diagram of the text analysis system using a linguistics query language according to the invention;
Fig. 2 shows a method of part-of-speech tagging according to the invention;
Fig. 3 is a flow chart of the text analysis method using a linguistics query language according to the invention;
Fig. 4 is a flow chart of the LQL analysis method according to the invention;
Fig. 5 is a flow chart of the error-correction analysis according to the invention.
Detailed description of the embodiments
Fig. 1 shows a text analysis system according to one embodiment of the invention, including a text content input module 101, a text grammar analysis module 102, a text segmentation module 103, a part-of-speech tagging module 104, an LQL analysis module 105, an extracted knowledge database 106, a Chinese word segmentation dictionary 107, an LQL rule database 108, an application dictionary database 109, an error-correction rule database 110, an error-correction module 111, an LQL rule input interface 112, an application dictionary input interface 113, and an error-correction rule input interface 114.
The text content input module 101 is used to input text content into the LQL text analysis system. The text content can be obtained from the Internet or from non-Internet sources. When the text content is on the Internet, the text content input module 101 can use the application program interface (Application Program Interface, API) provided by a website to obtain the text of the web pages exposed through the API. Alternatively, a web crawler can be used to crawl websites in hypertext format and extract the text from the hypertext.
The text grammar analysis module 102 is used to analyze the grammar of the text content.
The text segmentation module 103 is used to perform Chinese word segmentation on the text content. For example, a Chinese sentence meaning "winter storm strikes the Philippines, feared to take a hundred lives" can be segmented into: winter, storm, strikes, Philippines (abbreviated), feared, take, hundred, lives.
The part-of-speech tagging module 104 tags the segmented words: each segmented word is marked, according to its part of speech, with a corresponding English letter code, namely its part-of-speech tag. For example: winter/t, storm/n, strikes/v, Philippines/j, feared/d, take/v, hundred/m, lives/n. Here t denotes a time word, n a noun, v a verb, j an abbreviation, d an adverb, and m a numeral.
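The word/tag notation is easy to handle programmatically. The short sketch below parses such a tagged string into (word, tag) pairs and filters by tag. The English glosses stand in for the Chinese words of the example sentence, and the names `parse_tagged` and `TAG_NAMES` are hypothetical.

```python
# Tag meanings as listed in the description for this example.
TAG_NAMES = {
    "t": "time word", "n": "noun", "v": "verb",
    "j": "abbreviation", "d": "adverb", "m": "numeral",
}


def parse_tagged(text):
    """Turn 'word/tag, word/tag, ...' into a list of (word, tag) pairs."""
    tokens = []
    for item in text.split(","):
        # rpartition splits on the last "/", so words may contain slashes.
        word, _, tag = item.strip().rpartition("/")
        tokens.append((word, tag))
    return tokens


tagged = parse_tagged(
    "winter/t, storm/n, strikes/v, Philippines/j, "
    "feared/d, take/v, hundred/m, lives/n"
)
nouns = [word for word, tag in tagged if tag == "n"]
print(nouns)
print([TAG_NAMES[tag] for _, tag in tagged[:2]])
```

Representing the tagged text as (word, tag) pairs is the form the later matching sketches assume as well.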
The following is a part-of-speech tag list according to the invention, in which a denotes an adjective, Ag an adjectival morpheme, ad an adverbial adjective, an a nominal adjective, b a distinguishing word, and so on.
Preferably, the part-of-speech tagging module 104 uses the Viterbi algorithm for part-of-speech tagging. The Viterbi algorithm is a dynamic programming algorithm for finding the most probable sequence of hidden states, called the Viterbi path, that explains a sequence of observed events, especially in the context of Markov information sources and hidden Markov models. Another approach is to use the forward algorithm, which computes the probability of a sequence of observed events and also belongs to the field of probability theory. Fig. 2 shows an example of using the Viterbi algorithm for part-of-speech tagging according to the invention. For the sentence meaning "winter storm strikes the Philippines, feared to take a hundred lives", the tags of the individual words are winter/t, storm/n, strikes/v, Philippines/j, feared/d, take/v, hundred/m, and lives/n.
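For readers unfamiliar with the algorithm, a minimal Viterbi sketch for tagging under a hidden Markov model follows. The two-tag model and all of its probabilities are made-up toy values, the function name `viterbi` and its parameters are illustrative, and the input is assumed to be a non-empty word list; the patent does not give model parameters.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][tag] = (probability of the best path ending in `tag` at
    # position i, the previous tag on that path)
    best = [{
        tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
        for tag in tags
    }]
    for word in words[1:]:
        column = {}
        for tag in tags:
            # Choose the predecessor that maximises the path probability.
            column[tag] = max(
                (best[-1][prev][0] * trans_p[prev][tag]
                 * emit_p[tag].get(word, 0.0), prev)
                for prev in tags
            )
        best.append(column)

    # Backtrack from the most probable final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model: hypothetical probabilities for two tags only.
TAGS = ["n", "v"]
START = {"n": 0.7, "v": 0.3}
TRANS = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.6, "v": 0.4}}
EMIT = {
    "n": {"storm": 0.6, "strikes": 0.1},
    "v": {"storm": 0.1, "strikes": 0.7},
}

path = viterbi(["storm", "strikes"], TAGS, START, TRANS, EMIT)
print(path)
```

In practice the start, transition, and emission probabilities would be estimated from tagged text, and log-probabilities are commonly used to avoid underflow on long sentences.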
The Chinese word segmentation dictionary 107 includes the term list and the corresponding part-of-speech tags, and is used to segment and tag the text. The Chinese word segmentation dictionary 107 can be defined or modified by the user.
The application dictionary database 109 includes at least one application dictionary. An application dictionary is set up for a particular application and records a series of keywords for it. The user can use the application dictionary input interface 113 to create, edit, or delete application dictionaries. According to one embodiment of the invention, an application dictionary for a brand analysis application includes the keywords relevant to brand analysis, for example fashion brands (LV, Gucci, etc.) or industry-specific terms (product names, models, etc.). These keywords can be used in LQL rule settings. The following is an application dictionary according to the invention for finding relationships between news and places.
The LQL analysis module 105 extracts the required knowledge from the segmented and POS-tagged text according to the LQL rules, and stores the knowledge in the extracted knowledge database 106. LQL is a scripting language similar to Structured Query Language (SQL), but LQL extracts the required data from unstructured text. In addition, LQL rules can be defined according to the needs of the application user. The LQL rule input interface 112 lets the user input LQL rules, which are stored in the LQL rule database 108.
According to one embodiment of the invention, the settings of an LQL rule include:
Select means "select". Extraction Position is the position in the text of the knowledge to be extracted, expressed as a number. Select<Extraction Position> therefore selects the positions of the knowledge to be extracted in the text.
Coverage is the coverage of the LQL analysis, which can be a sentence (Sentence), a paragraph (Paragraph), or a document (Document).
MatchCriteria is the matching condition, which can be a phrase list (Phrase List), a word with a specific part-of-speech tag (WORD POS), or a word without a specific part-of-speech tag (WORD NOT POS).
MatchPattern is the match pattern, which specifies the matching condition. For Phrase List, the match pattern can be a file name that points to a series of keywords in an application dictionary. For WORD POS or WORD NOT POS, the match pattern is a part-of-speech tag, such as n, v, or t.
OptionalCriteria is an optional condition that is applied to a matching condition and can be defined by a general regular expression.
The following is an example for finding out who said what.
In this LQL rule, Select<1,3> selects the positions in the text of the knowledge to be extracted: 1 and 3 refer to the first and third matching conditions (the Word NOT pos condition is not counted). Sentence means that the coverage is a sentence. [Word pos="nr"] finds a word tagged as a person's name; "nr" denotes a person's name. [Word NOT pos="nr"]*{0-5} requires that the up to five words immediately after the name that was found are not tagged as a person's name, to exclude cases involving two or more people. In [Phrase List="speech_word.txt"], "speech_word.txt" is a file name that points to an application dictionary containing a series of keywords, such as propose, say, emphasize, point out, express, indicate, claim, expect, think, reaffirm, estimate, and predict, all of which are synonyms of "say" and indicate that someone has said something. When a sentence contains a word tagged as a person's name and one of the defined keywords, that is, when the above matching conditions are satisfied, the name (the first matching condition) and the one or more words after the keyword (the third matching condition, which is not shown) are extracted. For example, for the sentence "Chen Dawen estimates that the stock will rise", the four words "Chen Dawen", "stock", "will", and "rise" are extracted according to the LQL rule.
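The behaviour described for this rule can be imitated directly in code. The sketch below hard-codes the three conditions rather than parsing LQL syntax, which the source does not reproduce. It assumes that the unshown third condition captures every word after the keyword up to the end of the sentence, that `SPEECH_WORDS` stands in for the contents of `speech_word.txt`, and that the person's name in the example is rendered as "Chen Dawen"; the function name `who_said_what` is hypothetical.

```python
# Hypothetical stand-in for the keywords listed in speech_word.txt.
SPEECH_WORDS = {"says", "estimates", "predicts", "points out", "emphasizes"}


def who_said_what(sentence, max_gap=5):
    """sentence: list of (word, tag). Returns (name, words_said) or None."""
    for i, (word, tag) in enumerate(sentence):
        if tag != "nr":                        # [Word pos="nr"]
            continue
        # [Word NOT pos="nr"]*{0-5}: allow up to max_gap non-name words
        # between the name and the speech keyword.
        for k in range(i + 1, min(i + 2 + max_gap, len(sentence))):
            candidate, candidate_tag = sentence[k]
            if candidate in SPEECH_WORDS:      # [Phrase List="speech_word.txt"]
                said = [w for w, _ in sentence[k + 1:]]
                return word, said              # Select<1,3>
            if candidate_tag == "nr":
                break                          # a second name: stop this gap
    return None


sentence = [
    ("Chen Dawen", "nr"), ("estimates", "v"),
    ("stock", "n"), ("will", "v"), ("rise", "v"),
]
result = who_said_what(sentence)
print(result)
```

When two names precede the keyword, the inner loop stops at the second name and the outer loop retries from it, so the statement is attributed to the name nearest the keyword.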
The following is an example for analyzing a person's native place.
Select<1,3> means that the selected words are at the positions of [Word pos="nr"] and [Word pos="ns"]. Sentence means that the coverage is a sentence. [Word pos="nr"] finds a word tagged as a person's name. [Word NOT pos="nr"]*{0-5} requires that the up to five words immediately after the name that was found are not tagged as a person's name, to exclude cases involving two or more people. In [Phrase list="nationality_word.txt"], "nationality_word.txt" is a file name that points to an application dictionary containing a series of keywords, such as "ancestral home" and "native place". [Word pos="ns"] finds a word tagged as a place name. When all four matching conditions are satisfied in a sentence, the words for the person's name and the place are extracted. For example, for the sentence "Wang Dawen's ancestral home is Taishan", "Wang Dawen" and "Taishan" are extracted.
The following is an example for finding where an accident occurred in news content.
Select<1,3> means that the selected words are at the positions of [Phrase list="accidentType_word.txt"] and [Word pos="ns"]. Sentence means that the coverage is a sentence. [Phrase list="accidentType_word.txt"] finds keywords that denote an accident, such as windstorm disaster, earthquake, tsunami, and water disaster. [Phrase list="accident_word.txt"] finds keywords such as "occurred in" and "located in". [Word pos="ns"] finds a word whose part-of-speech tag is place name (ns). When all three matching conditions are satisfied in a sentence, the accident keyword and the place name are extracted. For example, for the sentence "A windstorm disaster occurred in the Philippines", "windstorm disaster" and "Philippines" are extracted.
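A compact way to picture this rule is a sliding window over three consecutive tokens. The sketch assumes the three conditions must be adjacent, which the source does not state, and uses English glosses in place of the Chinese keywords; `ACCIDENT_TYPES`, `ACCIDENT_WORDS`, and `accident_locations` are hypothetical names standing in for the two application dictionaries and the rule.

```python
# Hypothetical stand-ins for accidentType_word.txt and accident_word.txt.
ACCIDENT_TYPES = {"windstorm disaster", "earthquake", "tsunami"}
ACCIDENT_WORDS = {"occurred in", "located in"}


def accident_locations(sentences):
    """Return (accident, place) pairs, one per matching window."""
    results = []
    for sentence in sentences:  # coverage: one sentence at a time
        # Slide a window of three consecutive (word, tag) tokens.
        for (w1, _t1), (w2, _t2), (w3, t3) in zip(
            sentence, sentence[1:], sentence[2:]
        ):
            if w1 in ACCIDENT_TYPES and w2 in ACCIDENT_WORDS and t3 == "ns":
                results.append((w1, w3))  # Select<1,3>
    return results


news = [
    [("windstorm disaster", "n"), ("occurred in", "v"), ("Philippines", "ns")],
    [("earthquake", "n"), ("drill", "n"), ("today", "t")],
]
pairs = accident_locations(news)
print(pairs)
```

The second sentence mentions an accident keyword but no "occurred in" keyword or place name, so nothing is extracted from it.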
The following is an example used for brand analysis.
The LQL rule has the form: [brand name] + [new series/new product] + [new product name]. [brand name] is an application dictionary containing a series of brand names. [new series/new product] is an application dictionary containing a series of prefix keywords, such as "new series". [new product name] is the product name to be found.
The LQL rule is:
Select<3> selects the words that follow the keyword from product_prefix.txt. Sentence indicates that the coverage is a sentence. [Phrase list="brand_name.txt"] finds the brand-name keywords listed in brand_name.txt. [Phrase list="product_prefix.txt"] finds the prefix keywords listed in product_prefix.txt. When both matching conditions above are met in a sentence, the new product name is extracted. For example, from a sentence about GUCCI's 2011 "new series Bamboo Festival bag", "Bamboo Festival bag" is extracted as the new product name.
Often, multiple answers are extracted, but only one or a few of them are correct. The error correction module 111 can delete wrongly extracted knowledge according to error correction rules. The error correction rule input interface 114 allows the user to set and input error correction rules. The error correction rules can be stored in the error correction rule database 110. In addition, the error correction module 111 can compute statistics on the extracted knowledge to obtain statistical values.
The following illustrates an example used to find a person's date of birth.
The error correction rule is:
There is only one answer, i.e. the answer is single-valued (because a person has only one date of birth);
The number of sources of the extracted knowledge must be greater than 3 (for example, the extracted knowledge is obtained from more than three different websites);
The number of occurrences of the extracted knowledge, as a percentage of the number of all extracted knowledge, must be greater than 70%.
Here, 3 and 70% are the numerical values defined in the error correction rule, and "greater than" is the numerical comparison requirement defined in the rule. Therefore, 3 and 70% may also be called thresholds. The numbers in the figure are the statistical values of the extracted knowledge. Only 06/07/1951 meets the numerical comparison requirements above: its number of sources (statistical value 6) is greater than 3, and its share of all extracted knowledge (statistical value 88%) is greater than 70%. It is therefore selected as the correct answer. The other two candidates, 07/06/1951 and 06/07/1952, are deleted.
The following illustrates an example used to find the places where earthquakes occurred.
The error correction rule is:
There can be multiple answers, i.e. the answer is multi-valued (because several earthquakes can occur in the same period);
The number of sources of the extracted knowledge must be greater than 3;
The number of occurrences of the extracted knowledge, as a percentage of the number of all extracted knowledge, must be greater than 20%.
Here, 3 and 20% are the numerical values defined in the error correction rule, and "greater than" is the numerical comparison requirement defined in the rule. Therefore, 3 and 20% may also be called thresholds. Only Wenchuan, Sichuan and Yushu, Qinghai meet the numerical comparison requirements above, so they are selected as correct answers. Sichuan Yunchuan has only one text source, and its share of all extracted knowledge is only 2%, so it is deleted.
The following illustrates an example used to find new product names.
The error correction rule is:
There can be multiple answers, i.e. the answer is multi-valued (because several new products can exist at the same time);
The number of sources of the extracted knowledge must be greater than 3;
The number of occurrences of the extracted knowledge, as a percentage of the number of all extracted knowledge, must be greater than 20%.
Here, 3 and 20% are the thresholds in the error correction rule. "Bamboo Festival bag" and "Crime Love Undercurrent" meet the numerical comparison requirements above, so they are selected as correct answers. "Favorite Undercurrent" fails to meet the requirements above, so it is deleted.
According to another aspect of the present invention, a text analysis method using language query is provided. As shown in Fig. 3, the method includes the following steps:
S301: Using the LQL rule input interface, define the LQL rules;
S302: Using the application dictionary input interface, define the application dictionaries;
S303: Using the error correction rule input interface, define the error correction rules;
S304: Using the text content input module, obtain the text content;
S305: Using the text grammar analysis module, perform syntactic analysis on the text;
S306: Using the text word segmentation module, segment the text into words;
S307: Using the part-of-speech tagging module, tag the segmented words with their parts of speech;
S308: In the LQL analysis module, using the LQL rules, perform LQL analysis on the segmented and tagged text to extract knowledge;
S309: Store the extracted knowledge in the extracted-knowledge database;
S310: Using the error correction module, and according to the error correction rules, delete wrongly extracted knowledge to increase the accuracy of the extracted knowledge.
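The order of the run-time steps S304 to S310 can be sketched as a small Python pipeline. This is a rough illustration only: the function name analyze and its callable parameters are hypothetical, the syntactic analysis of S305 is omitted, and a plain list stands in for the extracted-knowledge database.

```python
from typing import Callable, Iterable, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def analyze(
    texts: Iterable[str],                           # S304: obtained text content
    segment: Callable[[str], List[str]],            # S306: word segmentation
    tag: Callable[[List[str]], List[Token]],        # S307: part-of-speech tagging
    lql_match: Callable[[List[Token]], List[str]],  # S308: LQL analysis
    correct: Callable[[List[str]], List[str]],      # S310: error correction
) -> List[str]:
    extracted: List[str] = []  # S309: stands in for the extracted-knowledge database
    for text in texts:
        tokens = tag(segment(text))
        extracted.extend(lql_match(tokens))
    return correct(extracted)
```

Passing the modules in as callables mirrors the statement in the description that other languages only need a suitable word segmentation module and part-of-speech tagging module.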
In step S308, as shown in Fig. 4, the LQL analysis includes the following steps:
S401: Establish the coverage defined in the LQL rule;
S402: According to the part-of-speech tag defined in the matching condition of the LQL rule, find the words having that part-of-speech tag in the segmented and tagged text;
S403: According to the keyword defined in the matching condition of the LQL rule, find the words identical to that keyword in the segmented and tagged text;
S404: When the matching conditions are met within the coverage, extract one or more words from the segmented and tagged text according to the position in the text of the knowledge to be extracted, as defined in the LQL rule.
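Step S401 restricts matching to one coverage unit at a time. A minimal Python sketch of that scoping follows, under several assumptions: the keywords "Paragraph" and "Document" are hypothetical (the examples in the description only show "Sentence"), paragraphs are assumed to be separated by blank lines, and sentences are assumed to end at one of the punctuation marks listed in the pattern.

```python
import re
from typing import List


def split_by_coverage(document: str, coverage: str) -> List[str]:
    """Split a document into the units within which a rule must match."""
    if coverage == "Document":
        units = [document]
    elif coverage == "Paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        units = re.split(r"\n\s*\n", document)
    elif coverage == "Sentence":
        # Assumption: a sentence is a run of text ending in 。！？ or . ! ?
        units = re.findall(r"[^。！？.!?]+[。！？.!?]?", document)
    else:
        raise ValueError(f"unknown coverage: {coverage}")
    return [u.strip() for u in units if u.strip()]
```

Each returned unit would then be segmented, tagged, and matched on its own, so that conditions satisfied in different sentences cannot combine into a single match when the coverage is a sentence.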
In step S310, as shown in Fig. 5, the error correction analysis includes the following steps:
S501: Compute statistics on the extracted knowledge to obtain statistical values;
S502: Compare the statistical values with the numerical values defined in the error correction rule;
S503: Delete the extracted knowledge that does not meet the numerical comparison requirements.
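Steps S501 to S503 can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the function name correct_errors, the (answer, source) input format, and the tie-breaking for single-valued rules (keep the most frequent surviving answer) are hypothetical, since the description does not say how a single-valued rule handles several surviving candidates.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def correct_errors(
    extractions: Iterable[Tuple[str, str]],  # (answer, source) pairs
    min_sources: int,
    min_share: float,
    single_valued: bool,
) -> List[str]:
    counts: Counter = Counter()
    sources: Dict[str, Set[str]] = defaultdict(set)
    for answer, source in extractions:  # S501: gather the statistical values
        counts[answer] += 1
        sources[answer].add(source)
    total = sum(counts.values())
    kept = [
        a for a in counts  # S502: compare each value with its threshold
        if len(sources[a]) > min_sources and counts[a] / total > min_share
    ]  # S503: answers failing either comparison are dropped
    if single_valued and len(kept) > 1:
        # Assumption: a single-valued rule keeps only the most frequent answer.
        kept = [max(kept, key=lambda a: counts[a])]
    return kept
```

With the thresholds from the date-of-birth example (min_sources=3, min_share=0.70, single_valued=True), an answer seen on six distinct sources out of seven extractions in total would be kept, and an answer seen once would be deleted. Those counts are made up for illustration and are not the figure's data.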
The text analysis method and system using language query according to the present invention are applicable not only to Chinese but also to other languages, such as English, German, Japanese, and Korean; it is only necessary to use a suitable word segmentation module and part-of-speech tagging module.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus the necessary hardware platform, or entirely in hardware, although in many cases the former is the preferred embodiment. Based on this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
Although the present invention has been shown and described, those skilled in the art will appreciate that changes may be made to the embodiments without departing from the principle and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims (21)

1. A text analysis system using language query, characterized in that the system comprises:
a text content input module for inputting Chinese text into the text analysis system;
a Chinese word segmentation module for segmenting the Chinese text into words;
a part-of-speech tagging module for assigning part-of-speech tags to the segmented words;
an application dictionary database comprising one or more application dictionaries, each application dictionary comprising one or more keywords;
a language query language (LQL) rule database for storing one or more LQL rules, wherein setting an LQL rule comprises:
defining the position (Extraction Position) in the Chinese text of the knowledge to be extracted;
defining a coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
defining a match pattern (MatchPattern), the match pattern being used to define the matching condition, wherein when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary, and when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech tag;
an LQL analysis module for performing, according to the LQL rule, LQL analysis on the segmented and part-of-speech-tagged Chinese text and extracting the required knowledge, wherein the LQL analysis comprises:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words having the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words identical to the keyword defined in the matching condition of the LQL rule;
when the matching conditions are met within the coverage, extracting one or more words according to the position in the text of the knowledge to be extracted, as defined in the LQL rule;
an extracted-knowledge database for storing the extracted knowledge.
2. The text analysis system according to claim 1, characterized in that the system further comprises:
an error correction rule database for storing one or more error correction rules;
an error correction module that uses the error correction rules to perform error correction analysis on the extracted knowledge, so as to delete wrongly extracted knowledge and increase the accuracy of the extracted knowledge.
3. The text analysis system according to claim 2, characterized in that the error correction rule comprises setting one or more numerical values and one or more numerical comparison requirements; the error correction module computes statistics on the extracted knowledge to obtain statistical values and compares them with the numerical values; and when the statistical value of a piece of extracted knowledge does not meet the numerical comparison requirement, that extracted knowledge is deleted.
4. The text analysis system according to claim 3, characterized in that the statistical value comprises the number of sources of the extracted knowledge, the number of occurrences of the extracted knowledge, or the number of occurrences of the extracted knowledge as a percentage of the number of all extracted knowledge.
5. The text analysis system according to claim 3, characterized in that the numerical value comprises a threshold for the number of sources of the extracted knowledge, a threshold for the number of occurrences of the extracted knowledge, or a threshold for the number of occurrences of the extracted knowledge as a percentage of the number of all extracted knowledge; and the numerical comparison requirement is a comparison of the statistical value with the numerical value, in which the statistical value is greater than, less than, or equal to the numerical value.
6. The text analysis system according to claim 1, characterized in that the system further comprises:
a text grammar analysis module for analyzing the grammar of the Chinese text;
a Chinese word segmentation dictionary comprising a list of terms, the part-of-speech tags of the terms in the list, and the frequency of occurrence of those part-of-speech tags, used for word segmentation and part-of-speech tagging of the Chinese text;
an LQL rule input interface for allowing a user to set the LQL rules;
an application dictionary input interface for allowing a user to set the application dictionaries.
7. The text analysis system according to claim 2, characterized in that the system further comprises:
an error correction rule input interface for allowing a user to input the error correction rules.
8. The text analysis system according to claim 1, characterized in that the Chinese text is obtained from the Internet.
9. The text analysis system according to claim 8, characterized in that an application program interface or a web crawler is used to obtain the Chinese text from the Internet.
10. The text analysis system according to claim 1, characterized in that the Viterbi algorithm or the forward algorithm is used to perform part-of-speech tagging on the segmented words.
11. The text analysis system according to claim 1, characterized in that the matching condition further comprises a word without a specific part-of-speech tag (WORD NOT POS), whose match pattern is a part-of-speech tag.
12. The text analysis system according to claim 1, characterized in that setting the LQL rule further comprises:
defining an optional condition (OptionalCriteria) for the matching condition.
13. The text analysis system according to claim 2, characterized in that the error correction rule comprises setting whether the extracted knowledge is single-valued or multi-valued.
14. The text analysis system according to claim 1, characterized in that the system is applied to people search, news search systems, or brand analysis.
15. A text analysis method using the system of claim 1, characterized in that the method comprises:
S1: obtaining Chinese text;
S2: using the Chinese word segmentation module, segmenting the Chinese text into words;
S3: using the part-of-speech tagging module, tagging the segmented words with their parts of speech;
S4: in the LQL analysis module, using the LQL rule, performing LQL analysis on the segmented and tagged Chinese text to extract knowledge, wherein the LQL analysis comprises the following steps:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words having the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged Chinese text, the words identical to the keyword defined in the matching condition of the LQL rule;
when the matching conditions are met within the coverage, extracting one or more words according to the position in the text of the knowledge to be extracted, as defined in the LQL rule.
16. The text analysis method according to claim 15, characterized in that the method further comprises:
performing, according to an error correction rule, error correction analysis on the extracted knowledge, so as to delete wrongly extracted knowledge and increase the accuracy of the extracted knowledge.
17. The text analysis method according to claim 16, characterized in that the error correction analysis comprises:
computing statistics on the extracted knowledge to obtain statistical values;
comparing the statistical values with the numerical values defined in the error correction rule;
when the statistical value of a piece of extracted knowledge does not meet the numerical comparison requirement, deleting that extracted knowledge.
18. The text analysis method according to claim 17, characterized in that the statistical value comprises the number of sources of the extracted knowledge, the number of occurrences of the extracted knowledge, or the number of occurrences of the extracted knowledge as a percentage of the number of all extracted knowledge.
19. The text analysis method according to claim 17, characterized in that the numerical value comprises a threshold for the number of sources of the extracted knowledge, a threshold for the number of occurrences of the extracted knowledge, or a threshold for the number of occurrences of the extracted knowledge as a percentage of the number of all extracted knowledge; and the numerical comparison requirement is a comparison of the statistical value with the numerical value, in which the statistical value is greater than, less than, or equal to the numerical value.
20. The text analysis method according to claim 15, characterized in that the method is applied to people search, news search systems, or brand analysis.
21. A text analysis system using language query, characterized in that the system is applicable to different languages and comprises:
a text content input module for inputting text in the language into the text analysis system;
a language word segmentation module for segmenting the text into words;
a part-of-speech tagging module for assigning part-of-speech tags to the segmented words;
an application dictionary database comprising one or more application dictionaries;
a language query language (LQL) rule database for storing one or more LQL rules,
wherein setting an LQL rule comprises:
defining the position (Extraction Position) in the text of the knowledge to be extracted;
defining a coverage (Coverage), the coverage being a sentence, a paragraph, or a document;
defining one or more matching conditions (MatchCriteria), each matching condition being a phrase list (Phrase List) or a word with a specific part-of-speech tag (WORD POS);
defining a match pattern (MatchPattern), the match pattern being used to define the matching condition, wherein when the matching condition is a phrase list, its match pattern is a file name that points to one or more keywords in the application dictionary, and when the matching condition is a word with a specific part-of-speech tag, its match pattern is a part-of-speech tag;
an LQL analysis module for performing, according to the LQL rule, LQL analysis on the segmented and part-of-speech-tagged text and extracting the required knowledge, characterized in that the LQL analysis comprises:
establishing the coverage defined in the LQL rule;
finding, in the segmented and part-of-speech-tagged text, the words having the part-of-speech tag defined in the matching condition of the LQL rule;
finding, in the segmented and part-of-speech-tagged text, the words identical to the keyword defined in the matching condition of the LQL rule;
when the matching conditions are met within the coverage, extracting one or more words according to the position in the text of the knowledge to be extracted, as defined in the LQL rule;
an extracted-knowledge database for storing the extracted knowledge.
CN201310330423.5A 2013-07-31 2013-07-31 Use the text analysis system and method for language inquiry Active CN104346382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310330423.5A CN104346382B (en) 2013-07-31 2013-07-31 Use the text analysis system and method for language inquiry

Publications (2)

Publication Number Publication Date
CN104346382A CN104346382A (en) 2015-02-11
CN104346382B true CN104346382B (en) 2017-08-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant