CN104572621B - A decision-tree-based term determination method - Google Patents
A decision-tree-based term determination method
- Publication number: CN104572621B (application CN201510002515.XA)
- Authority
- CN
- China
- Prior art keywords
- term
- candidate terms
- decision tree
- morpheme
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
Claims (4)
- 1. A term determination method based on a decision tree, characterized by comprising:

  performing segmentation of arbitrary length on the original corpus in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes;

  determining a plurality of features that influence term judgment, and computing the value of each feature for each candidate term; the plurality of features comprising: the word frequency of the candidate term in the original corpus; the minimum, over all ways of splitting the candidate term into two parts of arbitrary length, of the mutual information between the two parts; the larger of the candidate term's left entropy and right entropy; the independent word-formation probability of the candidate term; the inside-word probability derived from the probability of each morpheme of the candidate term occurring at the prefix, in-word, and suffix positions in the historical corpus; and the domain probability of the candidate term;

  computing the independent word-formation probability $IPW(x)$ of each morpheme $x$ in a candidate term as

  $$IPW(x) = \frac{word(x)}{times(x)}$$

  where $word(x)$ is the number of times morpheme $x$ occurs as an independent word in the historical corpus, and $times(x)$ is the total number of occurrences of morpheme $x$ in the historical corpus;

  computing the independent word-formation probability $IPW(C)$ of a candidate term $C$ as

  $$IPW(C) = IPW(c_1 c_2 \cdots c_l) = IPW(c_1) \cdot IPW(c_2) \cdots IPW(c_l)$$

  where $c_1, c_2, \ldots, c_l$ are the morphemes at the corresponding positions of candidate term $C$;

  obtaining, from the probability of each morpheme occurring at the prefix, in-word, and suffix positions in the historical corpus, an inside-word probability table covering all morphemes, computed as

  $$IPC(x, 0) = \frac{times(x*)}{times(x*) + times(*x*) + times(*x)}$$

  $$IPC(x, 1) = \frac{times(*x*)}{times(x*) + times(*x*) + times(*x)}$$

  $$IPC(x, 2) = \frac{times(*x)}{times(x*) + times(*x*) + times(*x)}$$

  where "$*$" denotes the morphemes combined before or after morpheme $x$ to form a term, and $times(X)$ denotes the number of occurrences of term $X$ in the term corpus; $IPC(x, pos)$ is the probability of morpheme $x$ appearing at position $pos$, with $pos \in \{0, 1, 2\}$: 0 denotes the prefix position, 1 the in-word position, and 2 the suffix position;

  for an $l$-morpheme string to be evaluated, $C = c_1 c_2 \cdots c_l$, computing its inside-word probability $IPC$ from the inside-word probability table as

  $$IPC = \sqrt[3]{IPC(c_1, 0) \cdot IPC(c_l, 2) \cdot \frac{1}{l-2} \sum_{i=2}^{l-1} IPC(c_i, 1)}\,;$$

  computing the domain probability of a candidate term as

  $$PC = \sqrt[n]{\prod_{i=1}^{n} P(F\_{c_i})}$$

  where $P(F\_{c_i})$ is the occurrence probability in the historical corpus of each morpheme or morpheme combination of the candidate term, and $n$ is the number of morphemes or morpheme combinations of the candidate term;

  judging each candidate term's plurality of feature values in turn, in the decision tree used for term judgment, according to the generation order of the decision tree; and

  taking the candidate terms that pass the decision tree's judgment as new terms.
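A minimal Python sketch of the inside-word probability computation defined in claim 1; the corpus representation, function names, and toy morpheme data are illustrative assumptions rather than part of the patent.

```python
from collections import Counter

def build_position_counts(term_corpus):
    """Tally how often each morpheme occupies the prefix (0), in-word (1),
    and suffix (2) position across a corpus of terms, where each term is a
    list of morphemes."""
    counts = Counter()  # (morpheme, position) -> occurrence count
    for term in term_corpus:
        if len(term) < 2:                    # a term has at least two morphemes
            continue
        counts[(term[0], 0)] += 1            # x*  : prefix position
        for m in term[1:-1]:
            counts[(m, 1)] += 1              # *x* : in-word position
        counts[(term[-1], 2)] += 1           # *x  : suffix position
    return counts

def ipc(counts, x, pos):
    """IPC(x, pos): occurrences of x at pos over its occurrences at all
    three positions."""
    total = sum(counts[(x, p)] for p in (0, 1, 2))
    return counts[(x, pos)] / total if total else 0.0

def ipc_string(counts, morphemes):
    """Inside-word probability of an l-morpheme string (l >= 3): the cube
    root of IPC(c1,0) * IPC(cl,2) * mean IPC(ci,1) over interior morphemes."""
    l = len(morphemes)
    interior = sum(ipc(counts, c, 1) for c in morphemes[1:-1]) / (l - 2)
    return (ipc(counts, morphemes[0], 0) * ipc(counts, morphemes[-1], 2) * interior) ** (1 / 3)

# Toy term corpus (hypothetical morpheme sequences).
corpus = [["machine", "learn", "ing"], ["deep", "learn", "ing"], ["learn", "er"]]
pc = build_position_counts(corpus)
print(round(ipc(pc, "learn", 1), 3))   # → 0.667 ("learn" is in-word in 2 of its 3 occurrences)
print(round(ipc_string(pc, ["machine", "learn", "ing"]), 3))  # → 0.874
```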
- 2. The term determination method according to claim 1, characterized in that, before judging each candidate term's feature values in turn according to the generation order of the decision tree used for term judgment, the method further comprises: randomly selecting a certain number of consecutive already-recognized terms from the terminology bank; and constructing the decision tree from the selected terms and the plurality of features using the ID3 algorithm or the C4.5 algorithm.
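Claim 2's training-set selection can be sketched as follows; the negative (non-term) pool and the 0/1 labeling are assumptions added for illustration, since the claim only specifies drawing recognized terms from the terminology bank.

```python
import random

def sample_training_terms(term_bank, non_term_strings, k, seed=42):
    """Randomly draw k already-recognized terms (label 1) from the
    terminology bank and k non-term strings (label 0, an assumed
    counter-example pool) as training data for the decision tree."""
    rng = random.Random(seed)
    positives = [(t, 1) for t in rng.sample(term_bank, k)]
    negatives = [(s, 0) for s in rng.sample(non_term_strings, k)]
    return positives + negatives

rows = sample_training_terms(["neural net", "gradient descent", "overfitting"],
                             ["the of", "and to"], 2)
print(len(rows))  # → 4 (two positives plus two negatives)
```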
- 3. The term determination method according to claim 2, characterized in that constructing the decision tree from the selected terms and the plurality of features using the ID3 algorithm or the C4.5 algorithm comprises: taking each feature as a judgment node of the decision tree, and determining the generation order of the decision tree according to the relative magnitudes of the features' information gain or information gain ratio; wherein each judgment node carries the decision threshold of its corresponding feature, which forms the branches of the decision tree.
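A sketch of the information-gain ranking that claim 3 uses to fix the generation order of the judgment nodes; ID3 ranks by raw information gain, while C4.5 would divide the gain by the split's intrinsic entropy (gain ratio). The feature values, thresholds, and labels below are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain of splitting the samples on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional

def node_order(samples, labels, thresholds):
    """Order the judgment nodes (feature indices) by descending information
    gain, fixing the generation order of the decision tree."""
    gains = [information_gain([s[j] for s in samples], labels, thresholds[j])
             for j in range(len(thresholds))]
    return sorted(range(len(thresholds)), key=lambda j: -gains[j])

# Hypothetical samples: [frequency feature, entropy feature]; label 1 = term.
samples = [[0.9, 5.0], [0.8, 1.0], [0.2, 6.0], [0.1, 2.0]]
labels = [1, 1, 0, 0]
print(node_order(samples, labels, thresholds=[0.5, 3.0]))  # → [0, 1]
```

Feature 0 separates the classes perfectly at its threshold (gain 1 bit), so it becomes the first judgment node; feature 1's split leaves both sides mixed (gain 0) and comes last.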
- 4. The term determination method according to claim 3, characterized in that judging each candidate term's feature values in turn according to the generation order of the decision tree specifically comprises: comparing each feature value of the candidate term, in the generation order of the decision tree, with the decision threshold at the corresponding judgment node of the decision tree; and if the judgment succeeds at the leaf judgment node of the decision tree, labeling the candidate term as a new term.
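The sequential judgment of claim 4 can be sketched as a cascade of threshold comparisons; the node order, thresholds, feature vectors, and the "value must reach the threshold" acceptance rule are illustrative assumptions.

```python
def judge_candidate(feature_values, judgment_nodes):
    """Compare a candidate's feature values against the decision thresholds
    of the judgment nodes, in the tree's generation order. judgment_nodes is
    a list of (feature_index, threshold) pairs; only a candidate that
    succeeds at every node, through the leaf node, is labeled a new term."""
    for feature_index, threshold in judgment_nodes:
        if feature_values[feature_index] < threshold:
            return False   # judgment fails at this node: not a term
    return True            # judgment succeeded at the leaf node: new term

# Hypothetical generation order: feature 0 first, then feature 2.
nodes = [(0, 0.5), (2, 3.0)]
print(judge_candidate([0.8, 0.1, 4.2], nodes))  # → True
print(judge_candidate([0.8, 0.1, 2.0], nodes))  # → False
```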
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510002515.XA CN104572621B (en) | 2015-01-05 | 2015-01-05 | A decision-tree-based term determination method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104572621A CN104572621A (en) | 2015-04-29 |
CN104572621B true CN104572621B (en) | 2018-01-26 |
Family
ID=53088725
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510002515.XA Active CN104572621B (en) | 2015-01-05 | 2015-01-05 | A decision-tree-based term determination method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104572621B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649277B (en) * | 2016-12-29 | 2020-07-03 | Language Network (Wuhan) Information Technology Co., Ltd. | Dictionary entry method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101122919A (en) * | 2007-09-14 | 2008-02-13 | 中国科学院计算技术研究所 | Professional term extraction method and system |
CN102402501A (en) * | 2010-09-09 | 2012-04-04 | 富士通株式会社 | Term extraction method and device |
CN102708147A (en) * | 2012-03-26 | 2012-10-03 | 北京新发智信科技有限责任公司 | Recognition method for new words of scientific and technical terminology |
CN103049501A (en) * | 2012-12-11 | 2013-04-17 | 上海大学 | Chinese domain term recognition method based on mutual information and conditional random field model |
CN103778243A (en) * | 2014-02-11 | 2014-05-07 | 北京信息科技大学 | Domain term extraction method |
Non-Patent Citations (5)
Title |
---|
A Statistical Corpus-Based Term Extractor; Patrick Pantel et al.; Springer Berlin Heidelberg; 2000-12-31; full text *
A Statistical-Technique-Based Chinese Term Extraction Method; Liu Jian et al.; China Terminology; 2014-12-31; No. 5; full text *
Research on Recognition of Out-of-Vocabulary Words in Specialized Domains; Ju Fei; China Masters' Theses Full-text Database, Information Science and Technology series (monthly); 2013-12-15; Vol. 2013, No. S2; abstract, body p. 45 Section 7.1 and pp. 46-48 Section 7.3 *
Research on Seed-Expansion-Based Professional Term Recognition; Wang Weimin et al.; Application Research of Computers; 2012-11-30; Vol. 29, No. 11; full text *
Research on Statistics-Based Automatic Key Phrase Extraction from Chinese Text; Han Yan; China Masters' Theses Full-text Database, Information Science and Technology series (monthly); 2009-10-15; Vol. 2009, No. 10; body p. 30 Section 3.1.3 and pp. 32-33 *
Also Published As
Publication number | Publication date |
---|---|
CN104572621A (en) | 2015-04-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104572622B (en) | A term screening method | |
JP6721179B2 (en) | Causal relationship recognition device and computer program therefor | |
CN107122340B (en) | A similarity detection method for science and technology project reports based on synonym analysis | |
Nguyen et al. | AIDA-light: High-Throughput Named-Entity Disambiguation. |
CN104778256B (en) | A fast incremental clustering method for domain question-answering system consultations | |
CN105550170B (en) | A Chinese word segmentation method and device | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
CN108845982A (en) | A Chinese word segmentation method based on character-association features | |
CN106708798B (en) | Character string segmentation method and device | |
CN107688630B (en) | Semantic-based weakly supervised microblog multi-emotion dictionary expansion method | |
CN105787121B (en) | A microblog event summarization method based on multiple story lines | |
CN104298732B (en) | A personalized text ranking and recommendation method for network users | |
CN104598530B (en) | A domain term extraction method | |
CN105956158B (en) | A method for automatically extracting network neologisms from massive microblog text and user information | |
CN104346382B (en) | Text analysis system and method using linguistic queries | |
CN111460170A (en) | Word recognition method and device, terminal equipment and storage medium | |
CN107844608A (en) | A sentence similarity comparison method based on word vectors | |
CN108038166A (en) | A Chinese microblog sentiment analysis method based on the subjective and objective bias of lexical items | |
CN104572633A (en) | A method for determining the meanings of polysemous words | |
Wang et al. | Automatic scoring of Chinese fill-in-the-blank questions based on improved P-means |
CN106126497A (en) | A method for automatically mining correspondences between citing-document sections and cited-literature text fragments | |
JP2009015796A (en) | Apparatus and method for extracting multiple topics in text, program, and recording medium |
CN104572621B (en) | A decision-tree-based term determination method | |
CN110413985B (en) | Related text segment searching method and device | |
CN110110079A (en) | A social network spam user detection method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
ASS | Succession or assignment of patent right |
Owner name: WUHAN TRANSN INFORMATION TECHNOLOGY CO., LTD. Free format text: FORMER OWNER: YULIANWANG (WUHAN) INFORMATION TECHNOLOGY CO., LTD. Effective date: 20150805 |
|
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20150805 Address after: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six Applicant after: Wuhan Transn Information Technology Co., Ltd. Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 6, layer 206, six Applicant before: Language network (Wuhan) Information Technology Co., Ltd. |
|
CB02 | Change of applicant information | ||
CB02 | Change of applicant information |
Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six Applicant after: Language network (Wuhan) Information Technology Co., Ltd. Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six Applicant before: Wuhan Transn Information Technology Co., Ltd. |
|
GR01 | Patent grant | ||
GR01 | Patent grant |