CN104572621B - Term determination method based on decision tree - Google Patents

Term determination method based on decision tree

Info

Publication number
CN104572621B
CN104572621B (application CN201510002515.XA)
Authority
CN
China
Prior art keywords
mrow
term
candidate terms
decision tree
morpheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510002515.XA
Other languages
Chinese (zh)
Other versions
CN104572621A (en)
Inventor
江潮
张芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Transn Information Technology Co., Ltd.
Original Assignee
Language Network (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Language Network (wuhan) Information Technology Co Ltd filed Critical Language Network (wuhan) Information Technology Co Ltd
Priority to CN201510002515.XA priority Critical patent/CN104572621B/en
Publication of CN104572621A publication Critical patent/CN104572621A/en
Application granted granted Critical
Publication of CN104572621B publication Critical patent/CN104572621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Machine Translation (AREA)

Abstract

A term determination method based on a decision tree includes: cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes; determining multiple features that influence term determination and calculating the value of each feature of each candidate term; evaluating the multiple feature values of each candidate term against a decision tree used for term determination, in the order in which the decision tree was generated; and taking the candidate terms that pass the decision tree as new terms. The invention reduces the workload of manual processing and ensures that the obtained terms have high reliability and accuracy.

Description

Term determination method based on decision tree
Technical field
The invention belongs to the field of data mining technology, and in particular relates to a term determination method based on a decision tree.
Background technology
A domain term, or technical term, is an agreed symbol, carried by speech or writing, that expresses or delimits a professional concept. With the rapid development of science and technology, the continual emergence of new technologies and the fast evolution of Internet technology, the technical terms of specific domains keep expanding and being updated, so the traditional manual collection of domain terms can no longer meet practical demand, and automatic term extraction (ATE, Automatic Term Extraction) has become inevitable. In practical applications, extracting domain terms is of great significance for building domain ontologies, Chinese word segmentation, information extraction, lexicography, information retrieval, machine translation, text classification, automatic summarization, and so on.
At present, the domain term extraction methods used in industry analyze and judge vocabulary from a single aspect only, and the effect of domain term extraction is poor.
Summary of the invention
One object of the present invention is to provide a term determination method based on a decision tree, so as to solve the prior-art problem that the effect of domain term extraction is poor.
In some illustrative embodiments, the term determination method based on a decision tree includes: cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes; determining multiple features that influence term determination and calculating the value of each feature of each candidate term; evaluating the multiple feature values of each candidate term against a decision tree used for term determination, in the order in which the decision tree was generated; and taking the candidate terms that pass the decision tree as new terms.
Compared with the prior art, the illustrative embodiments of the invention have the following advantage:
the workload of manual processing is reduced, and the obtained terms have high reliability and accuracy.
Brief description of the drawings
The accompanying drawings described herein are provided for a further understanding of the invention and form a part of the application; the schematic embodiments of the invention and their description are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Fig. 1 is a flow chart according to an illustrative embodiment of the invention.
Detailed description of the embodiments
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, those skilled in the art will understand that the invention can be practised even without these specific details. In other instances, well-known methods, processes, components and circuits are not described in detail so as not to obscure the understanding of the invention.
As shown in Fig. 1, a term determination method based on a decision tree is disclosed, including:
S11, cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes;
S12, determining multiple features that influence term determination, and calculating the value of each feature of each candidate term;
S13, evaluating the multiple feature values of each candidate term against a decision tree used for term determination, in the order in which the decision tree was generated;
S14, taking the candidate terms that pass the decision tree as new terms.
This reduces the workload of manual processing and ensures that the obtained terms have high reliability and accuracy.
The above method is described in detail below.
For example: the source corpus "中华人民共和国" (People's Republic of China) is cut as follows. Cutting first into units of two morphemes yields six candidate terms: 中华, 华人, 人民, 民共, 共和, 和国. Cutting into units of three morphemes yields five candidate terms: 中华人, 华人民, 人民共, 民共和, 共和国. Cutting into units of four morphemes yields four candidate terms: 中华人民, 华人民共, 人民共和, 民共和国. Cutting into units of five morphemes yields three candidate terms: 中华人民共, 华人民共和, 人民共和国. Cutting into units of six morphemes yields two candidate terms: 中华人民共和, 华人民共和国. Cutting into units of seven morphemes yields the candidate term 中华人民共和国 itself. In total, 21 candidate terms are obtained.
The cutting process above is a simplified example given for ease of understanding the illustrative embodiment. In practice the source corpus may be a text or a collection of texts consisting of a large number of morphemes, and the cutting process is more complex. In addition, if a term is too long it should rather be understood as a sentence, so the length of terms needs to be limited by a maximum segmentation unit, for example a maximum of 10 morphemes.
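The cutting step S11 can be illustrated with a minimal sketch in Python, assuming each Chinese character counts as one morpheme; the function name cut_candidates and the default maximum segmentation unit are illustrative, not taken from the patent.

```python
def cut_candidates(corpus: str, max_segment: int = 10) -> list[str]:
    """Return every contiguous substring of 2..max_segment morphemes."""
    candidates = []
    n = len(corpus)
    for length in range(2, min(max_segment, n) + 1):   # at least two morphemes
        for start in range(n - length + 1):
            candidates.append(corpus[start:start + length])
    return candidates

print(len(cut_candidates("中华人民共和国")))  # 21 candidate terms, as in the example
```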
In some illustrative embodiments, the multiple features that influence term determination include:
the word frequency of the candidate term in the source corpus; the minimum of the mutual information of the two parts over all ways of splitting the candidate term into two parts of arbitrary length; the larger of the left entropy and the right entropy of the candidate term; the probability that the candidate term independently forms a word; the probability of occurrence, in the history corpus, of each morpheme of the candidate term at the prefix position, the in-word position and the suffix position; and the domain probability of the candidate term.
The acquisition of each of the above features is described in detail below:
1) Analyze the word frequency of the candidate term, i.e. obtain the number of occurrences of the candidate term in the source corpus.
2) Analyze the mutual information of the candidate term: over all ways of splitting the candidate term into two parts of arbitrary length, obtain the minimum of the mutual information of the two parts.
For example: for a candidate term C of length l morphemes, split at the k-th morpheme position, the front part obtained is c1~ck and the rear part is ck+1~cl.
The mutual information is calculated according to the following formula:
MI(C) = min_k log( P(c1c2…cl) / ( P(c1c2…ck) · P(ck+1ck+2…cl) ) ), k = 1, 2, …, l-1
where c1c2…cl are the morphemes at the corresponding positions of candidate term C, P(c1c2…cl) is the probability of occurrence of candidate term C in the source corpus, P(c1c2…ck) is the probability of occurrence of the front part of candidate term C in the source corpus, and P(ck+1ck+2…cl) is the probability of occurrence of the rear part of candidate term C in the source corpus.
The minimum of the mutual information is obtained as follows. For example: for the candidate term ABC, the first split gives A and BC, and the second split gives AB and C; the two mutual information values are 0.5 and 0.6 in turn, so 0.5 is taken as the mutual information of candidate term ABC.
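A minimal sketch of the minimum-mutual-information feature follows, assuming prob is an illustrative callable that returns the relative-frequency probability of a substring in the source corpus.

```python
import math

def min_mutual_information(term: str, prob) -> float:
    """prob(s) must return the occurrence probability of substring s."""
    values = []
    for k in range(1, len(term)):                       # every split point
        front, rear = term[:k], term[k:]
        values.append(math.log(prob(term) / (prob(front) * prob(rear))))
    return min(values)                                  # minimum over all splits
```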
3) Analyze the left and right entropy of the candidate term and determine the left-right entropy of the candidate term.
They are calculated according to the following formulas:
LH(C) = - Σ_{l∈L} P(lC|C) · log P(lC|C)
where LH(C) is the left entropy of candidate term C, L denotes the set of words that appear to the left of candidate term C, and P(lC|C) is the conditional probability that word l appears to the left of candidate term C;
RH(C) = - Σ_{r∈R} P(Cr|C) · log P(Cr|C)
where RH(C) is the right entropy of candidate term C, R denotes the set of words that appear to the right of candidate term C, and P(Cr|C) is the conditional probability that word r appears to the right of candidate term C;
LRH = max(LH(C), RH(C))
where LRH is the left-right entropy of candidate term C, obtained by taking the maximum of its left entropy and right entropy.
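A minimal sketch of the left-right entropy feature, assuming left_neighbors and right_neighbors are illustrative Counters of the words observed immediately to the left and right of the candidate term in the corpus.

```python
import math
from collections import Counter

def entropy(neighbor_counts: Counter) -> float:
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())

def left_right_entropy(left_neighbors: Counter, right_neighbors: Counter) -> float:
    # LRH = max(LH, RH), as in the formula above
    return max(entropy(left_neighbors), entropy(right_neighbors))
```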
4) Analyze the independence of the candidate term, i.e. the probability, in the history corpus, that each morpheme of the candidate term independently forms a word.
The probability IPW(x) that each morpheme x of the candidate term independently forms a word is calculated according to the following formula:
IPW(x) = word(x) / times(x)
where word(x) is the number of times morpheme x independently forms a word in the history corpus, and times(x) is the total number of times morpheme x occurs in the history corpus;
the probability IPW(C) that candidate term C independently forms a word is calculated according to the following formula:
IPW (C)=IPW (c1c2…cl)=IPW (c1)·IPW(c2)·…·IPW(cl)
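A minimal sketch of the independent word-formation probability, assuming word_count and total_count are illustrative lookups into the history-corpus statistics word(x) and times(x).

```python
def ipw_morpheme(x: str, word_count, total_count) -> float:
    # IPW(x) = word(x) / times(x)
    return word_count(x) / total_count(x)

def ipw_term(term: str, word_count, total_count) -> float:
    # IPW(C) is the product of the per-morpheme probabilities
    p = 1.0
    for morpheme in term:
        p *= ipw_morpheme(morpheme, word_count, total_count)
    return p
```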
5) Analyze the position at which each morpheme of the candidate term appears, and obtain the probability of occurrence of the morphemes at the different positions of the candidate term from an internal word probability table. The internal word probability table is obtained as follows: in an existing term corpus, calculate the probability that each morpheme x appears at the head, middle and tail of a term, thereby obtaining an internal word probability table covering all morphemes. The formulas are as follows:
IPC(x, 0) = times(x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 1) = times(*x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 2) = times(*x) / ( times(x*) + times(*x*) + times(*x) )
where "*" denotes the morphemes combined with morpheme x before or after it to form a term, and times(X) denotes the number of occurrences of term X in the term corpus. IPC(x, pos) denotes the probability that morpheme x appears at position pos. pos takes values in {0, 1, 2}: 0 denotes the prefix position, 1 the in-word position and 2 the suffix position.
For an l-morpheme string C = c1c2…cl to be evaluated, its internal word probability IPC is calculated from the internal word probability table obtained above as
IPC = [ IPC(c1, 0) · IPC(cl, 2) · (1/(l-2)) · Σ_{i=2}^{l-1} IPC(ci, 1) ]^(1/3)
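A minimal sketch of the internal word probability of a candidate string, assuming ipc_table is an illustrative precomputed mapping (morpheme, pos) -> IPC(x, pos); the handling of two-morpheme strings, for which 1/(l-2) is undefined, is an assumption not specified in the patent.

```python
def internal_word_probability(term: str, ipc_table: dict) -> float:
    l = len(term)
    head = ipc_table.get((term[0], 0), 0.0)    # IPC(c1, 0); unseen morphemes default to 0 here
    tail = ipc_table.get((term[-1], 2), 0.0)   # IPC(cl, 2)
    if l > 2:
        middle = sum(ipc_table.get((c, 1), 0.0) for c in term[1:-1]) / (l - 2)
    else:
        middle = 1.0                           # assumed fallback for l == 2
    return (head * tail * middle) ** (1 / 3)   # cube root of the product
```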
6) Analyze the probability of occurrence, in the history corpus, of each morpheme or morpheme combination of the candidate term, and determine the domain probability of the candidate term.
Count and calculate the probability of occurrence P(F_ci), in the history corpus, of each morpheme or morpheme combination of the candidate term;
then calculate the domain probability PC of the candidate term according to the following formula:
PC = [ Π_{i=1}^{n} P(F_ci) ]^(1/n)
where n is the number of morphemes or morpheme combinations of the candidate term.
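A minimal sketch of the domain probability, assuming p_field is an illustrative lookup of P(F_ci) for a morpheme or morpheme combination.

```python
def field_probability(components: list[str], p_field) -> float:
    n = len(components)
    product = 1.0
    for c in components:
        product *= p_field(c)
    return product ** (1 / n)   # geometric mean of the component probabilities
```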
By analyzing the candidate terms with respect to word frequency, mutual information, left-right entropy, independence, structure and domain, and using these as the features of the candidate terms, the obtained terms are made more reliable and accurate.
In some illustrative embodiments, before evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated, the method further includes:
randomly selecting from a terminology bank a certain number of consecutive terms that have already been recognized;
building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm.
In some illustrative embodiments, building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm includes:
taking each feature as a decision node of the decision tree, and determining the generation order of the decision tree according to the relative magnitudes of the information gain or the information gain ratio of the multiple features;
wherein each decision node has a decision threshold for the corresponding feature, which is used to form the branches of the decision tree.
In some illustrative embodiments, evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated specifically includes:
comparing each feature value of the candidate term, in the order in which the decision tree was generated, with the decision threshold at the corresponding decision node of the decision tree;
if the candidate term passes the decision at a leaf node of the decision tree, marking the candidate term as a new term.
Preferably, the illustrative embodiments described above are explained in detail below:
1. Feature items and their values
1) Word frequency WT
Cut the new corpus to obtain strings of arbitrary length from the new corpus, and take the obtained strings of arbitrary length as string set 1. Count the word frequency of each string in string set 1, i.e. count the number of occurrences in the new corpus of each string in string set 1.
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the word frequency of the candidate term is greater than the given threshold.
2) Mutual information MI
Mutual information is a concept from information theory used to measure the degree of correlation between two units; the larger the mutual information of a string, the more likely the string constitutes a term.
For an n-gram string, i.e. a string of length n, the mutual information is computed as follows: compute the mutual information of the two substrings obtained by every possible split of the n-gram string, and take the minimum as the mutual information of the n-gram string. Expressed as a formula:
let the n-gram string be C = c1c2…cn; its mutual information is calculated as
MI(C) = min_k log( P(c1c2…cn) / ( P(c1c2…ck) · P(ck+1ck+2…cn) ) )
where k ∈ {1, 2, …, n-1}.
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the mutual information of the candidate term is greater than the given threshold.
3) Left-right entropy LRH
In natural language processing, the left-right entropy of a string is an important statistical feature that reflects how active the string's context is; it is widely used in fields such as term extraction and new word detection. If a string has a large left-right entropy, its context collocations are rich and it has considerable flexibility and independence in use, which also indicates that the string is an unstable composition, i.e. the probability that the string is a term is relatively low.
The left-right entropy of a string is calculated as follows:
LH(C) = - Σ_{l∈L} P(lC|C) · log P(lC|C)
RH(C) = - Σ_{r∈R} P(Cr|C) · log P(Cr|C)
LRH(C) = max(LH(C), RH(C))
where L denotes the set of words that appear to the left of string C; R denotes the set of words that appear to the right of string C; P(lC|C) is the conditional probability that character l appears to the left of string C; and P(Cr|C) is the conditional probability that character r appears to the right of string C.
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the left-right entropy of the candidate term is greater than the given threshold.
4) Independent word probability IPW
For a string C, the larger its independent word probability IPW(C), the smaller the possibility that C is a term.
The independent word probability is calculated as follows.
For any character x, the possibility IPW(x) that x independently forms a word in a sentence is computed as
IPW(x) = word(x) / times(x)
where word(x) is the number of times character x independently forms a word, and times(x) is the number of occurrences of x in the new corpus;
then the independent word probability of candidate term C is computed as:
IPW (C)=IPW (c1c2…cn)=IPW (c1)·IPW(c2)·…·IPW(cn)
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the independent word probability of the candidate term is greater than the given threshold.
5) Internal word probability IPC
The internal word probability represents the probability that a character appears at a certain position within a term. IPC(x, pos) denotes the probability that character x appears at position pos; pos takes values in {0, 1, 2}, where 0 denotes the prefix position, 1 the in-word position and 2 the suffix position. The internal word probability expresses how well the characters at the head, middle and tail positions of a string conform to those of terms; the larger its value, the more likely the string is a term.
It is computed as follows: in the existing term corpus, calculate the probability that each character x appears at the head, middle and tail of a term, thereby obtaining an internal word probability table covering all characters. The calculation formulas are:
IPC(x, 0) = times(x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 1) = times(*x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 2) = times(*x) / ( times(x*) + times(*x*) + times(*x) )
where "*" denotes the strings combined with character x before or after it to form a term, and times(X) denotes the number of occurrences of term X in the term corpus.
For an n-gram string C = c1c2…cn to be evaluated, its internal word probability is computed from the internal word probability table obtained above as
IPC = [ IPC(c1, 0) · IPC(cn, 2) · (1/(n-2)) · Σ_{i=2}^{n-1} IPC(ci, 1) ]^(1/3)
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the internal word probability of the candidate term is greater than the given threshold.
6) Domain probability PC
The domain probability indicates the probability that the string belongs to the domain terminology.
Calculate the domain probability of each string in string set 6, remove the strings whose domain probability is below the given threshold, and obtain the final candidate term set.
For each string C in string set 6, calculate the probability of occurrence P(F_ci) of each of its characters in the existing term corpus, and compute the domain probability as
PC = [ Π_{i=1}^{n} P(F_ci) ]^(1/n)
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the domain probability of the candidate term is greater than the given threshold.
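A minimal sketch of converting the six raw feature values into the {0, 1} feature items described above, assuming illustrative feature names and a thresholds mapping; the patent does not fix concrete threshold values.

```python
FEATURES = ["WT", "MI", "LRH", "IPW", "IPC", "PC"]

def binarize(raw_values: dict, thresholds: dict) -> dict:
    """raw_values and thresholds both map feature name -> float."""
    return {f: int(raw_values[f] > thresholds[f]) for f in FEATURES}
```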
2. Building the term decision tree for candidate terms
The decision tree is built from a corpus of terms that has already been organized.
Input:
Training set D: a corpus of terms that has been organized and confirmed
Decision classes: C = {C1, C2}, where C1 = term and C2 = not a term
Feature item set: A = {A1, A2, A3, A4, A5, A6}, where A1 = WT, A2 = MI, A3 = LRH, A4 = IPW, A5 = IPC, A6 = PC
Threshold: th
Output: term decision tree T
Algorithm flow:
1) If all strings in D belong to the same class Ci, set the decision tree T to a single-node tree, take Ci as the class of that node, and return T;
2) If A = ∅, set T to a single-node tree, take the class Ci with the largest number of strings in D as the class of that node, and return T;
3) Otherwise, calculate the information gain ratio of each of the features A1~A6 with respect to D, and select the feature Aj with the largest information gain ratio;
4) If the information gain ratio of Aj is less than the threshold th, set T to a single-node tree, take the class Ci with the largest number of strings in D as the class of that node, and return T;
5) Otherwise, for each possible value of feature Aj, split D into multiple non-empty subsets Dk, take the class with the largest number of strings in Dk as the label to build a child node, form the decision tree T from the node and its child nodes, and return T;
6) For each child node k, with Dk as the training set and A - {Aj} as the feature set, recursively call steps 1)~5) to obtain the subtree Ti, and return Ti.
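A minimal sketch of the C4.5-style construction described by the algorithm flow above, operating on the binary feature items; the Node class, gain_ratio and build_tree names are illustrative, and the gain-ratio threshold th is passed in as in the input specification.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str = None                 # class label for leaf nodes ("term" / "not term")
    feature: str = None               # feature item tested at this node
    children: dict = field(default_factory=dict)   # feature value -> child Node

def class_entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, f):
    # rows: list of dicts mapping feature name -> {0, 1}; labels: parallel class list
    base = class_entropy(labels)
    split_ent, cond_ent = 0.0, 0.0
    for v in set(r[f] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[f] == v]
        p = len(sub) / len(labels)
        cond_ent += p * class_entropy(sub)
        split_ent -= p * math.log2(p)
    gain = base - cond_ent
    return gain / split_ent if split_ent > 0 else 0.0

def build_tree(rows, labels, features, th):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:          # steps 1) and 2)
        return Node(label=majority)
    best = max(features, key=lambda f: gain_ratio(rows, labels, f))   # step 3)
    if gain_ratio(rows, labels, best) < th:            # step 4)
        return Node(label=majority)
    node = Node(feature=best)                          # steps 5) and 6)
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node.children[v] = build_tree([rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [f for f in features if f != best], th)
    return node
```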
3. Determination of candidate terms
Cut the new corpus to obtain strings of arbitrary length from the new corpus, and take these strings as candidate terms.
Calculate the word frequency WT, mutual information MI, left-right entropy LRH, independent word probability IPW, internal word probability IPC and domain probability PC of these candidate terms, compare these values with their respective thresholds, and obtain the value of each feature item of each candidate term.
According to the values of the feature items of a candidate term, make the determination on the term decision tree T in the order in which the decision tree was generated.
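A minimal sketch of judging a candidate term with a tree built as above; Node and binarize are the illustrative helpers from the earlier sketches, not names used in the patent.

```python
def judge(node: Node, feature_items: dict) -> str:
    # Descend until a leaf is reached; assumes every encountered feature value
    # was seen during tree construction.
    while node.label is None:
        node = node.children[feature_items[node.feature]]
    return node.label   # "term" or "not term"

# Example usage (illustrative):
# result = judge(tree, binarize(raw_values, thresholds))
```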
The above description of the embodiments is only intended to help understand the method of the invention and its core idea; meanwhile, for those of ordinary skill in the art, there will be changes in specific implementations and scope of application according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (4)

1. A term determination method based on a decision tree, characterized by comprising:
cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes;
determining multiple features that influence term determination, and calculating the value of each feature of each candidate term;
the multiple features comprising: the word frequency of the candidate term in the source corpus; the minimum of the mutual information of the two parts over all ways of splitting the candidate term into two parts of arbitrary length; the larger of the left entropy and the right entropy of the candidate term; the probability that the candidate term independently forms a word; the probability of occurrence, in the history corpus, of each morpheme of the candidate term at the prefix position, the in-word position and the suffix position; and the domain probability of the candidate term;
wherein the probability IPW(x) that each morpheme x of the candidate term independently forms a word is calculated according to the following formula:
IPW(x) = word(x) / times(x)
wherein word(x) is the number of times morpheme x independently forms a word in the history corpus, and times(x) is the total number of times morpheme x occurs in the history corpus;
the probability IPW(C) that candidate term C independently forms a word is calculated according to the following formula:
    IPW (C)=IPW (c1c2…cl)=IPW (c1)·IPW(c2)·…·IPW(cl)
wherein c1, c2, …, cl are the morphemes at the corresponding positions of candidate term C;
an internal word probability table covering all morphemes is obtained from the probability of occurrence of each morpheme at the prefix position, the in-word position and the suffix position in the history corpus, calculated as follows:
IPC(x, 0) = times(x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 1) = times(*x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 2) = times(*x) / ( times(x*) + times(*x*) + times(*x) )
    Wherein " * " represents the front and rear morpheme combination with morpheme x composition terms, and times (X) represents the term X in term language material Occurrence number in storehouse;IPC (x, pos) represents that the morpheme x appears in position pos probability;Pos values are { 0,1,2 }, 0 Represent position prefix, 1 represent position in word, 2 represent positions in suffix;
for an l-morpheme string C = c1c2…cl to be evaluated, its internal word probability IPC is calculated from the internal word probability table as:
IPC = [ IPC(c1, 0) · IPC(cl, 2) · (1/(l-2)) · Σ_{i=2}^{l-1} IPC(ci, 1) ]^(1/3);
wherein the domain probability of the candidate term is calculated according to the following formula:
PC = [ Π_{i=1}^{n} P(F_ci) ]^(1/n)
wherein P(F_ci) is the probability of occurrence, in the history corpus, of each morpheme or morpheme combination of the candidate term, and n is the number of morphemes or morpheme combinations of the candidate term;
evaluating the multiple feature values of each candidate term against the decision tree used for term determination, in the order in which the decision tree was generated; and
taking the candidate terms that pass the decision tree as new terms.
2. The term determination method according to claim 1, characterized in that before evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated, the method further comprises:
randomly selecting from a terminology bank a certain number of consecutive terms that have already been recognized;
building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm.
3. The term determination method according to claim 2, characterized in that building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm comprises:
taking each feature as a decision node of the decision tree, and determining the generation order of the decision tree according to the relative magnitudes of the information gain or the information gain ratio of the multiple features;
wherein each decision node has a decision threshold for the corresponding feature, for forming the branches of the decision tree.
4. The term determination method according to claim 3, characterized in that evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated specifically comprises:
comparing each feature value of the candidate term, in the order in which the decision tree was generated, with the decision threshold at the corresponding decision node of the decision tree;
if the candidate term passes the decision at a leaf node of the decision tree, marking the candidate term as a new term.
CN201510002515.XA 2015-01-05 2015-01-05 Term determination method based on decision tree Active CN104572621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510002515.XA CN104572621B (en) 2015-01-05 2015-01-05 Term determination method based on decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510002515.XA CN104572621B (en) 2015-01-05 2015-01-05 Term determination method based on decision tree

Publications (2)

Publication Number Publication Date
CN104572621A CN104572621A (en) 2015-04-29
CN104572621B true CN104572621B (en) 2018-01-26

Family

ID=53088725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510002515.XA Active CN104572621B (en) 2015-01-05 2015-01-05 Term determination method based on decision tree

Country Status (1)

Country Link
CN (1) CN104572621B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649277B (en) * 2016-12-29 2020-07-03 语联网(武汉)信息技术有限公司 Dictionary entry method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122919A (en) * 2007-09-14 2008-02-13 中国科学院计算技术研究所 Professional term extraction method and system
CN102402501A (en) * 2010-09-09 2012-04-04 富士通株式会社 Term extraction method and device
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122919A (en) * 2007-09-14 2008-02-13 中国科学院计算技术研究所 Professional term extraction method and system
CN102402501A (en) * 2010-09-09 2012-04-04 富士通株式会社 Term extraction method and device
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Statistical Corpus-Based Term Extractor; Patrick Pantel et al.; Springer Berlin Heidelberg; 2000-12-31; entire document *
A Chinese term extraction method based on statistical techniques; Liu Jian et al.; China Terminology; 2014-12-31 (No. 5); entire document *
Research on recognition of out-of-vocabulary words in specialized domains; Ju Fei; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 2013-12-15; Vol. 2013 (No. S2); abstract, Section 7.1 on page 45 and Section 7.3 on pages 46-48 of the body text *
Research on a professional term recognition method based on seed expansion; Wang Weimin et al.; Application Research of Computers; 2012-11-30; Vol. 29 (No. 11); entire document *
Research on statistics-based automatic key-phrase extraction from Chinese text; Han Yan; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 2009-10-15; Vol. 2009 (No. 10); Section 3.1.3 on page 30 and pages 32-33 of the body text *

Also Published As

Publication number Publication date
CN104572621A (en) 2015-04-29

Similar Documents

Publication Publication Date Title
CN104572622B (en) A kind of screening technique of term
JP6721179B2 (en) Causal relationship recognition device and computer program therefor
CN107122340B (en) A kind of similarity detection method of the science and technology item return based on synonym analysis
Nguyen et al. AIDA-light: High-Throughput Named-Entity Disambiguation.
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN105550170B (en) A kind of Chinese word cutting method and device
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN108845982A (en) A kind of Chinese word cutting method of word-based linked character
CN106708798B (en) Character string segmentation method and device
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
CN104298732B (en) The personalized text sequence of network-oriented user a kind of and recommendation method
CN104598530B (en) A kind of method that field term extracts
CN105956158B (en) The method that network neologisms based on massive micro-blog text and user information automatically extract
CN104346382B (en) Use the text analysis system and method for language inquiry
CN111460170A (en) Word recognition method and device, terminal equipment and storage medium
CN107844608A (en) A kind of sentence similarity comparative approach based on term vector
CN108038166A (en) A kind of Chinese microblog emotional analysis method based on the subjective and objective skewed popularity of lexical item
CN104572633A (en) Method for determining meanings of polysemous word
Wang et al. Automatic scoring of Chinese fill-in-the-blank questions based on improved P-means
CN106126497A (en) A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
JP2009015796A (en) Apparatus and method for extracting multiplex topics in text, program, and recording medium
CN104572621B (en) A kind of term decision method based on decision tree
CN110413985B (en) Related text segment searching method and device
CN110110079A (en) A kind of social networks junk user detection method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: WUHAN TRANSN INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: YULIANWANG (WUHAN) INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20150805

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20150805

Address after: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Wuhan Transn Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 6, layer 206, six

Applicant before: Language network (Wuhan) Information Technology Co., Ltd.

CB02 Change of applicant information
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant before: Wuhan Transn Information Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant