CN101833582A - Mining method and system for correlation of vocabulary entities based on template - Google Patents

Mining method and system for correlation of vocabulary entities based on template

Info

Publication number
CN101833582A
Authority
CN
China
Prior art keywords
vocabulary
correlation
speech
named entity
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010174505
Other languages
Chinese (zh)
Inventor
吴毓杰
卢阳正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN 201010174505
Publication of CN101833582A
Legal status: Pending

Abstract

The invention provides a template-based method and system for mining correlations among vocabulary entities. Working from part-of-speech patterns, stop words, and named-entity definitions that the user specifies in advance, the invention applies sequential pattern mining to find and present the associated patterns that satisfy a test of statistical independence or correlation. Every detail of the process is parameterized, and the user can extend the definitions with other document information such as time, date, and source. Within a limited time, the user can obtain the highly correlated named entities or lexical-relation patterns in a specified document set.

Description

Template-based method and system for mining correlations among vocabulary entities
Technical field
The present invention relates to a text mining method and system in the field of information processing and information retrieval, and in particular to a template-based method and system for mining correlations among vocabulary entities.
Background art
The present invention is intended for large document collections. It takes the annotation results of natural-language part-of-speech tagging, applies predefined named-entity rules, and then performs association mining with large-scale data mining rules. The invention draws on several fields: (1) natural language processing: term extraction, automatic part-of-speech tagging (Part-of-speech tagging), post term processing (Post term processing), and named entity recognition (Named entity recognition); (2) data mining: sequential pattern mining (Sequential pattern mining) and association mining (Association mining); (3) domain knowledge such as correlation-coefficient testing.
In terms of overall architecture, the present invention is an innovative, cross-disciplinary combination. Although existing techniques at home and abroad are each reasonably mature, the natural language processing and sequential mining techniques listed above have not yet been integrated into a single systematic framework. Earlier literature and inventions concentrate on mining general patterns rather than on mining correlated vocabulary in documents. Those techniques also focus on computing the similarity between one word and another, rather than on mining driven by a part-of-speech template. In natural language processing, term extraction techniques exist, but a method that is built on association, part-of-speech templates, and named-entity definitions, and that also works across languages, has not been seen.
In document mining, the main applications and methodological advances remain classification and clustering; they have not been extended to association mining at the word-phrase-named-entity level. The first problem is how to sort a large collection of documents into categories. Yang and Liu summarize, in their SIGIR '99 article, methods that classify documents and map them to category names predefined by the user (Yang, Y. and Liu, X., "A Re-Examination of Text Categorization Methods," Proceedings of SIGIR '99: 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42-49). For example, given newspaper news or electronic media documents as input, the machine automatically assigns each document a suitable category label such as politics, society, or film and drama, and organizes the documents under those categories. A further application is document clustering (for example hierarchical clustering or the well-known K-means), which takes documents that appear unrelated and, by combining lexical analysis with document similarity computation, estimates specific groups; the groups need not be specified in advance but are computed automatically by the machine. Although both methods are mainstream topics in document mining, they still operate at the level of whole articles and do not go down to the level of sentences or phrases. Because the word-phrase-named-entity level differs considerably from the article level, document mining techniques are fundamentally different from the method of this patent.
This patent builds on the strengths of that earlier work, namely powerful language processing and document mining. What those methods focus on, however, clearly differs from this patent. Word similarity computation may appear related to this patent, but the two are entirely different. This patent is not restricted to one language and is not restricted to individual words: the user first specifies the desired part-of-speech framework, and the patterns that are highly correlated statistically are then mined, subject to minimum threshold conditions such as confidence. Methods such as word similarity computation are mainly used to judge whether the similarity between two words is high or low, and thereby whether the words are identical or related. Moreover, such methods are in essence suitable only for Chinese and rely only on Chinese word segmentation results. This patent, by contrast, applies to many languages, analyzes text mainly by part of speech, and uses named-entity definitions (which may include specific vocabulary) to mine highly correlated patterns. It is therefore both a technical advance and applicable across different languages.
Summary of the invention
The invention provides a framework suitable for mining named-entity associations in multiple languages.
Working from part-of-speech patterns, stop words, and named-entity definitions that the user defines in advance, and from the associations among the patterns, the invention uses sequential pattern mining to extract and present the associated patterns that satisfy a test of statistical independence or correlation.
Every detailed adjustment of the process is parameterized and can be defined according to user preference, with other document information such as time, date, and source added. Within a limited time, the user can obtain the highly correlated named entities or lexical-relation patterns in a specified document set.
The invention discloses a template-based system for mining correlations among vocabulary entities, comprising: a natural language processing module, which tags the input source document with parts of speech to form a tagged document; a named-entity rule module, which defines the named-entity rules and the mining templates; a stop-word module, which stores the stop words; a pattern projection database, which stores the vocabulary that remains after stop-word removal and matches a mining template; an association mining module, which mines the named entities that meet the limiting conditions and the minimum occurrence count, in the order in which they appear in the document; and an association calculation module, which calculates the degree of correlation between the vocabulary items that meet the limiting conditions.
The natural language processing module comprises a word segmentation module and a word tagging module: the word segmentation module separates the words in the source document with blanks or symbols, and the word tagging module tags each word with its part of speech. The mining template merges the words that match a part-of-speech combination definition or a vocabulary definition into one named entity. The association mining module applies sequential pattern mining on top of the pattern projection database and mines the named entities that satisfy the minimum occurrence count in the order in which they appear in the document. The association calculation module uses a test of independence to calculate the degree of correlation between vocabulary items. The limiting conditions comprise the word distance, the maximum mined pattern length, and a threshold. The tests of independence comprise the chi-square test, correlation coefficient, information gain, expected mutual information, confidence, and related occurrence probability.
The invention also discloses a template-based method for mining correlations among vocabulary entities, the steps of which are:
1) tag the input source document with parts of speech to form a tagged document;
2) apply the named-entity rules and stop-word removal to the tagged document;
3) build the pattern projection database from the vocabulary that matches a mining template;
4) perform sequential pattern mining: mine the named entities that meet the limiting conditions and the minimum occurrence count, in the order in which they appear in the document;
5) calculate the degree of correlation between the mined vocabulary items.
Because the framework of the invention emphasizes applicability to multiple languages, it must be paired with a language-specific word extraction and part-of-speech tagging method that marks every word and its part of speech; the patterns that match the named-entity rules are then mined from the large document collection. In an embodiment, therefore, this component may obtain the words and parts of speech of a document with a method for a different language or with a different tagging method, after which named-entity vocabulary mining can proceed.
In an embodiment, the user builds the named-entity rule definitions from part-of-speech combinations: based on the user's observation, each run of consecutive words that matches a defined part-of-speech combination is merged into one named entity, which forms the basic unit of mining. Named-entity rules can be flexibly added, modified, and deleted.
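The merging of consecutive words into one named entity can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `merge_entities`, the output tag `NE`, the `joiner` argument, and the greedy longest-rule-first matching are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_entities(tokens: List[Token],
                   rules: Sequence[Tuple[str, ...]],
                   entity_tag: str = "NE",
                   joiner: str = "") -> List[Token]:
    """Merge each run of consecutive tokens whose tags match a rule.

    `rules` is a list of part-of-speech combinations, e.g. ("Nc", "Na").
    Longer rules are tried first (an assumption of this sketch).
    """
    ordered = sorted((r for r in rules if r), key=len, reverse=True)
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        for rule in ordered:
            window = tokens[i:i + len(rule)]
            if tuple(tag for _, tag in window) == tuple(rule):
                # The whole window becomes one named-entity unit.
                merged.append((joiner.join(w for w, _ in window), entity_tag))
                i += len(rule)
                break
        else:
            # No rule starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical tagged input; the words are illustrative only.
tagged = [("Taiwan", "Nc"), ("Semiconductor", "Na"),
          ("announced", "VE"), ("profit", "Na")]
merged = merge_entities(tagged, rules=[("Nc", "Na")], joiner=" ")
```

Because the rules are plain data, adding, modifying, or deleting a rule only changes the `rules` list passed in.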
In an embodiment, pattern mining takes all the named-entity basic units and, following their order in the article, uses a sequential mining method to extract the patterns that satisfy the user's definitions.
In an embodiment, association calculation takes all the vocabulary in the patterns, applies the correlation measure defined by the user, and lists the results ranked by degree of correlation.
Description of drawings
Figure 1 is a schematic diagram of the system architecture of the present invention;
Figure 2 is an operational flowchart of the present invention.
Embodiment
Specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Working from part-of-speech patterns, stop words, and named-entity definitions that the user defines in advance, and from the associations among the patterns, the invention uses sequential pattern mining to extract and present the associated patterns that satisfy a test of statistical independence or correlation.
Because the framework of the invention emphasizes applicability to multiple languages, it must be paired with a language-specific word extraction and part-of-speech tagging method that marks every word and its part of speech; the patterns that match the named-entity rules are then mined from the large document collection. In an embodiment, therefore, this component may obtain the words and parts of speech of a document with a method for a different language or with a different tagging method, after which named-entity vocabulary mining can proceed. The following description uses the mining of Chinese named-entity vocabulary as its example; other languages are handled by the invention in a similar way.
Figure 1 is a schematic diagram of the system architecture of the present invention. Input link or article 100, template definition 130, and stop-word removal 150 are the modules that involve the user. Input link or article 100 is the series of documents that the user supplies; template definition 130 is the module in which the user defines the named-entity rules; stop-word removal 150 holds the predefined stop words or stop-word table to be removed.
Web article extraction 110 represents a web crawler (web-crawler), which retrieves the articles the user specifies, either through the supplied network links or by crawling automatically. The user may also bypass this step and enter the article content into the system directly. Natural language processing module 120 comprises word segmentation 121 (also called word extraction) and part-of-speech tagging 122. Word segmentation 121 separates the words in a document with blanks or other defined symbols; part-of-speech tagging 122 tags each word with its part of speech.
Template definition 130 supplies the effective definitions for named-entity definition rules 140 and stop-word removal 150. After an input article or document has passed through part-of-speech tagging 122, the named-entity definition rules module 140 extracts the named entities that match the definitions and combines the multiple words that match a named-entity definition into one unit entity. The stop-word removal module 150 then removes the named entities and words that appear in the stop-word list. The named-entity projection table module 160 builds the named-entity projection table. Association mining 170 applies sequential pattern mining to that projection table. The association mining module 170 comprises pattern extraction 171 and pattern mining 172: pattern extraction module 171 extracts the patterns that match the named entities, and pattern mining module 172 mines the patterns that satisfy the association requirements. Named entities that satisfy the minimum occurrence count are mined in the order in which they appear in the document.
Finally, the association calculation module 180 applies the selected test of independence to the mined patterns and calculates the association between their words. Named-entity association set 190 outputs the named-entity vocabulary ranked by degree of association.
Figure 2 is an operational flowchart of the present invention. As shown in the figure, operation starts (S1) and the system enters its initial state (S2). The system judges whether the input document set is non-empty: if it is non-empty, the flow enters the next step (S3); otherwise it returns to the initial state (S2). With a non-empty input document set, the flow enters S4, which judges whether the parameter settings are legal: if they are illegal, the flow returns to the initial state; if they are legal, it proceeds to word segmentation (S5). After word segmentation (S5), automatic part-of-speech tagging (S6) is performed, followed by the named-entity definition rules (S7). Next, the flow judges whether the filter rules retain any important vocabulary (S8): if not, the flow ends (S12); if so, the pattern projection database is built (S9), the correlation calculation is performed (S10), and once S10 completes, the correlation-ranked output is produced (S11) and the flow ends (S12).
The detailed calculation process proposed by the invention, which covers the operating flow introduced above, is described below. The input is the articles to be analyzed together with the rules and templates that the user sets in advance; the output is the list of mined vocabulary associations.
Input (prepared in advance) includes but is not limited to:
○ the collected web pages, news, or general documents, denoted DText;
○ the part-of-speech and stop-word filtering rules, STag;
○ the named-entity rules, NRule;
○ the patterns to be extracted, Template;
   * for example: noun-noun, noun-noun-verb, verb-noun-adjective, and similar patterns;
○ the correlation measure and the threshold θ.
Output (output result)
○ the set of significant patterns
   * for example:
   ● Ma Ying-jeou - leads - Kuomintang (noun-verb-noun) triple → correlation = 90%
   ● young - Zhao Youting (noun-noun) pair → correlation = 80%
   ● Youda - panel (noun-noun) pair → correlation = 78%
Calculation procedure
STEP1: Perform word segmentation on all of DText;
STEP2: Automatically tag the part of speech of each word segmented from DText, giving TagText;
STEP3: Use NRule to extract the named entities in TagText;
STEP4: Use the STag definitions to remove stop words from TagText, giving STagText;
STEP5: Use Template to retain all the patterns to be mined in STagText, giving PaText;
STEP6: For each pattern set in PaText that matches Template, cross-reference the information in DText and build the projection database (ProjDB);
STEP7: Apply sequential pattern mining to mine each frequent sequential pattern (Frequent Pattern), enumerate all patterns (Patterns) whose word distance is within K, and accumulate their statistics;
STEP8: Use one of the following tests of independence to calculate the degree of correlation between words:
Chi-square test (Chi-square statistics)
Correlation coefficient (Correlation Coefficient)
Information gain (Information Gain)
Expected mutual information (Expectation Mutual Information)
Confidence (Confidence)
Related occurrence probability (Related Prob.)
STEP9: Calculate the above degree of correlation for each mined pattern; the patterns whose degree of correlation is higher than the threshold (θ) form the set Pat;
STEP10: Sort all of Pat.
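STEP8 to STEP10 can be illustrated with the chi-square statistic for a 2x2 contingency table. The sketch below is written under stated assumptions: the patent does not specify how the four cell counts are collected, so they are taken here as counts of text windows in which entities A and B occur or do not occur; the function names are hypothetical; and the threshold is compared in the same units as the score.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def chi_square_2x2(n11: int, n10: int, n01: int, n00: int) -> float:
    """Pearson chi-square statistic for a 2x2 table.

    n11: windows containing both A and B    n10: A without B
    n01: B without A                        n00: neither
    chi2 = N * (n11*n00 - n10*n01)^2 / (row1 * row2 * col1 * col2)
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_patterns(scores: Dict[Pair, float],
                  theta: float) -> List[Tuple[Pair, float]]:
    """STEP9: keep scores above theta. STEP10: sort them (ascending here)."""
    kept = [(pair, score) for pair, score in scores.items() if score > theta]
    return sorted(kept, key=lambda item: item[1])
```

A perfectly associated pair, such as `chi_square_2x2(10, 0, 0, 10)`, gives a large statistic, while an independent pair, such as `chi_square_2x2(5, 5, 5, 5)`, gives zero. The other measures listed in STEP8 could be substituted for `chi_square_2x2` without changing `rank_patterns`.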
Taking a Chinese document as input, for example, the user defines the following parameters:
Chinese documents (processed by Chinese word segmentation and Chinese part-of-speech tagging)
C = 5 (word distance of 5 words)
M = 10% (extract the top 10% most significant vocabulary)
θ = 70% (entity correlation must exceed 70%)
MinS = 1 (minimum number of occurrences of an entity)
MaxLen = 2 (maximum length of a mined pattern)
The chi-square test (Chi-square test) is the test of independence
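One way to carry these settings through a program is a small, validated configuration object. The sketch below is only an illustration: the class name, field names, and validation bounds are assumptions, while the default values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningParams:
    """Mining parameters of the embodiment (field names are hypothetical)."""
    max_distance: int = 5          # C: word distance
    top_fraction: float = 0.10     # M: share of most significant vocabulary
    theta: float = 0.70            # θ: correlation threshold
    min_support: int = 1           # MinS: minimum occurrence count
    max_len: int = 2               # MaxLen: maximum mined pattern length
    independence_test: str = "chi-square"

    def validate(self) -> None:
        """Raise ValueError when a setting is outside its assumed legal range."""
        if self.max_distance < 1:
            raise ValueError("C (max_distance) must be at least 1")
        if not 0.0 < self.top_fraction <= 1.0:
            raise ValueError("M (top_fraction) must be in (0, 1]")
        if not 0.0 <= self.theta <= 1.0:
            raise ValueError("theta must be in [0, 1]")
        if self.min_support < 1:
            raise ValueError("MinS (min_support) must be at least 1")
        if self.max_len < 1:
            raise ValueError("MaxLen (max_len) must be at least 1")


params = MiningParams()
params.validate()  # the embodiment's settings pass the assumed checks
```

A check of this kind corresponds to the step in the flowchart that tests whether the parameter settings are legal before word segmentation begins.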
Table 1 shows the source documents after Chinese word segmentation and part-of-speech tagging; it excerpts several passages of news text. In the table, "_" is the separator, and the Latin-letter code after each word is that word's part of speech, for example Na for common noun and Nb for proper noun. The detailed part-of-speech reference is given in Table 7.
[Image of Table 1 not reproduced]
Table 1
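Since Table 1 survives only as an image, the exact layout of the tagged text is not known. The sketch below assumes one plausible layout, whitespace-separated tokens of the form `word_TAG`, and shows how such text could be read into (word, tag) pairs. The function name and the sample words are hypothetical.

```python
from typing import List, Tuple


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    The split is made at the FIRST underscore so that tags which
    themselves contain an underscore, such as V_2, stay intact.
    """
    tokens = []
    for item in text.split():
        word, sep, tag = item.partition("_")
        if not sep or not word or not tag:
            raise ValueError(f"token {item!r} is not in word_TAG form")
        tokens.append((word, tag))
    return tokens


# Illustrative input only; not taken from Table 1.
sample = parse_tagged("TSMC_Nb has_V_2 profit_Na")
```

If the real data separate word and tag differently, only the `partition` call would need to change.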
After the named-entity rules are applied, the named-entity vocabulary takes the form shown in Table 2. The rules actually defined in this embodiment are shown in Table 3.
[Image of Table 2 not reproduced]
Table 2
[Image of Table 3 not reproduced]
Table 3
Next, the stop-word list is used to remove the stop words contained in the document above; Table 4 shows the document after stop-word removal. In this embodiment the stop-word list is defined as follows: be, also, have, after, big, think, investigation, point out, according to, no matter, the aspect, all, the space. The content marked with strikethrough in Table 4 is the part removed by stop-word filtering.
[Image of Table 4 not reproduced]
Table 4
According to the parts of speech that the user has chosen to retain, the words in the document whose part of speech is not retained are removed; Table 5 shows the document after removal of the non-retained parts of speech. In this embodiment the retained part-of-speech group is: A, Dfa, I, Na, Nb, Nc, Ncd, Nd, Neu, Nf, Ng, VA, VB, VC, VE, VG, VH, VJ.
[Image of Table 5 not reproduced]
Table 5
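The two filtering passes, stop-word removal followed by part-of-speech retention, amount to a simple filter over (word, tag) pairs. In the sketch below the retained tag set is the one listed for this embodiment; the function name and the sample tokens are illustrative assumptions.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Retained part-of-speech group of the embodiment.
KEEP_TAGS: FrozenSet[str] = frozenset(
    "A Dfa I Na Nb Nc Ncd Nd Neu Nf Ng VA VB VC VE VG VH VJ".split()
)


def filter_tokens(tokens: Iterable[Tuple[str, str]],
                  stop_words: Iterable[str],
                  keep_tags: FrozenSet[str] = KEEP_TAGS) -> List[Tuple[str, str]]:
    """Drop stop words, then drop words whose tag is not retained."""
    stops = set(stop_words)
    return [(word, tag) for word, tag in tokens
            if word not in stops and tag in keep_tags]


# Hypothetical tokens: "is" is dropped by its tag, "after" by the stop list.
kept = filter_tokens(
    [("profit", "Na"), ("is", "SHI"), ("after", "Ng"), ("leads", "VC")],
    stop_words={"after"},
)
```

Both the stop-word list and the retained tag set are plain data, so a user can change either without touching the filter itself.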
Sequential pattern mining is applied to the remaining vocabulary (the named entities). With the parameters set above, the following list of associated entities can be mined; each entry ends with its chi-square value, and the list is output sorted in ascending order. The details are shown in Table 6.
Associated entities found                                     χ²
Taiwan Semiconductor Manufacturing Co. - retains title        2.721
Taiwan Semiconductor Manufacturing Co. - leader               2.721
Taiwan Semiconductor Manufacturing Co. - group                2.7199
Taiwan Semiconductor Manufacturing Co. - enterprise           2.7199
Taiwan Semiconductor Manufacturing Co. - profit king          2.4551
group - profit king                                           6.8935
winning post - enterprise                                     8.5239
retains title - leading                                       8.5253

Mining parameters: MaxLen=2, C=5, M=10%, θ=70%, MinS=1, χ²-statistics
Table 6
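For MaxLen = 2, the candidate pairs behind a list like Table 6 can be found by a brute-force scan. The sketch below is a simplification and not the projection-database method described in the patent: it counts, for each ordered pair of entities, the number of sequences in which the second entity follows the first within the word distance C, and keeps the pairs that reach MinS. The function name and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Pair = Tuple[str, str]


def mine_pairs(sequences: Iterable[Sequence[str]],
               max_distance: int,
               min_support: int) -> Dict[Pair, int]:
    """Count ordered entity pairs that co-occur within max_distance words."""
    counts: Counter = Counter()
    for seq in sequences:
        seen = set()
        for i, left in enumerate(seq):
            # Entities at positions i+1 .. i+max_distance follow `left`.
            for right in seq[i + 1:i + 1 + max_distance]:
                if left != right:
                    seen.add((left, right))
        # Each pair counts once per sequence (support counting).
        counts.update(seen)
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Illustrative sequences of entity labels, one list per document.
docs = [["A", "B", "C"], ["A", "C"], ["B", "A"]]
frequent = mine_pairs(docs, max_distance=2, min_support=2)
```

Each surviving pair would then be scored with the chosen test of independence and compared with θ.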
Tag     Part of speech                        Tag     Part of speech
A       Non-predicative adjective             Nh      Pronoun
Caa     Coordinating conjunction              I       Interjection
Cab     Conjunction, e.g. "and so on"         P       Preposition
Cba     Conjunction, e.g. "if"                T       Modal particle
Cbb     Correlative conjunction               VA      Active intransitive verb
Da      Quantity adverb                       VAC     Active causative verb
Dfa     Pre-verbal degree adverb              VB      Active pseudo-transitive verb
Dfb     Post-verbal degree adverb             VC      Active transitive verb
Di      Aspect marker                         VCL     Active verb with locative object
Dk      Sentence adverb                       VD      Ditransitive verb
D       Adverb                                VE      Active verb with sentential object
Na      Common noun                           VF      Active verb with verbal object
Nb      Proper noun                           VG      Classificatory verb
Nc      Place noun                            VH      Stative intransitive verb
Ncd     Localizer                             VHC     Stative causative verb
Nd      Time noun                             VI      Stative pseudo-transitive verb
Neu     Numeral determiner                    VJ      Stative transitive verb
Nes     Specific determiner                   VK      Stative verb with sentential object
Nep     Demonstrative determiner              VL      Stative verb with verbal object
Neqa    Quantity determiner                   V_2     "have" (有)
Neqb    Post-posed quantity determiner        DE      的, 之, 得, 地
Nf      Measure word                          SHI     "be" (是)
Ng      Postposition                          FW      Foreign word

Tag                  Symbol or part of speech   Tag                    Symbol or part of speech
DASHCATEGORY         -                          EXCLANATIONCATEGORY
ETCCATEGORY                                     PARENTHESISCATEGORY    " " () []
COMMACATEGORY        ,                          PAUSECATEGORY          ,
PERIODCATEGORY                                  SPCHANGECATEGORY       //
QUESTIONCATEGORY                                DM                     Quantitative compound
COLONCATEGORY        :                          BM                     Bound morpheme
SEMICOLONCATEGORY
Table 7
What this specification describes are preferred embodiments of the invention; the above embodiments serve only to illustrate the technical solution of the invention and do not limit it. Any technical solution that a person skilled in the art can obtain under the concept of the invention through logical analysis, reasoning, or limited experimentation falls within the scope of protection of the invention.

Claims (10)

1. A template-based system for mining correlations among vocabulary entities, characterized in that it comprises:
a natural language processing module, which tags the input source document with parts of speech to form a tagged document;
a named-entity rule module, which defines the named-entity rules and the mining templates;
a stop-word module, which stores the stop words;
a pattern projection database, which stores the vocabulary that remains after stop-word removal and matches a mining template;
an association mining module, which mines the named entities that meet the limiting conditions and the minimum occurrence count, in the order in which they appear in the document; and
an association calculation module, which calculates the degree of correlation between the vocabulary items that meet the limiting conditions.
2. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the natural language processing module comprises a word segmentation module and a word tagging module, the word segmentation module separating the words in the source document with blanks or symbols, and the word tagging module tagging each word with its part of speech.
3. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the mining template merges the words that match a part-of-speech combination definition or a vocabulary definition into one named entity.
4. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the association mining module applies sequential pattern mining on top of the pattern projection database and mines the named entities that satisfy the minimum occurrence count in the order in which they appear in the document.
5. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the association calculation module uses a test of independence to calculate the degree of correlation between vocabulary items.
6. The system for mining correlations among vocabulary entities as claimed in claim 1, characterized in that the limiting conditions comprise: the word distance, the minimum occurrence count of a word, the maximum mined pattern length, and a threshold.
7. The system for mining correlations among vocabulary entities as claimed in claim 5, characterized in that the tests of independence comprise: the chi-square test, correlation coefficient, information gain, expected mutual information, confidence, and related occurrence probability.
8. A template-based method for mining correlations among vocabulary entities, the steps of which are:
1) tag the input source document with parts of speech to form a tagged document;
2) apply the named-entity rules and stop-word removal to the tagged document;
3) build the pattern projection database from the vocabulary that matches a mining template;
4) perform sequential pattern mining: mine the named entities that meet the limiting conditions and the minimum occurrence count, in the order in which they appear in the document;
5) calculate the degree of correlation between the mined vocabulary items.
9. The method for mining correlations among vocabulary entities as claimed in claim 8, characterized in that the mining template merges the words that match a part-of-speech combination definition or a vocabulary definition into one named entity.
10. The method for mining correlations among vocabulary entities as claimed in claim 8, characterized in that step 5) uses a test of independence, including the chi-square test, correlation coefficient, information gain, expected mutual information, confidence, and related occurrence probability, to calculate the degree of correlation between vocabulary items.
CN 201010174505 2010-05-04 2010-05-04 Mining method and system for correlation of vocabulary entities based on template Pending CN101833582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010174505 CN101833582A (en) 2010-05-04 2010-05-04 Mining method and system for correlation of vocabulary entities based on template

Publications (1)

Publication Number Publication Date
CN101833582A true CN101833582A (en) 2010-09-15

Family

ID=42717651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010174505 Pending CN101833582A (en) 2010-05-04 2010-05-04 Mining method and system for correlation of vocabulary entities based on template

Country Status (1)

Country Link
CN (1) CN101833582A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079025A (en) * 2006-06-19 2007-11-28 腾讯科技(深圳)有限公司 File correlation computing system and method
CN101174273A (en) * 2007-12-04 2008-05-07 清华大学 News event detecting method based on metadata analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《现代图书情报技术》 20071024 王昊等 HMM和CRFs在信息抽取应用中的比较研究 第57-63页,图4 1-10 , 第12期 2 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123293A (en) * 2013-04-24 2014-10-29 财团法人工业技术研究院 Alias query system and method thereof
CN104123293B (en) * 2013-04-24 2018-10-26 财团法人工业技术研究院 alias query system and method thereof
CN105243052A (en) * 2015-09-15 2016-01-13 浪潮软件集团有限公司 Corpus labeling method, device and system
CN112347767A (en) * 2021-01-07 2021-02-09 腾讯科技(深圳)有限公司 Text processing method, device and equipment
CN112347767B (en) * 2021-01-07 2021-04-06 腾讯科技(深圳)有限公司 Text processing method, device and equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20100915