CN104572621B - Term determination method based on decision tree - Google Patents

Term determination method based on decision tree

Info

Publication number
CN104572621B
CN104572621B (application CN201510002515.XA)
Authority
CN
China
Prior art keywords
mrow
term
candidate terms
decision tree
morpheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510002515.XA
Other languages
Chinese (zh)
Other versions
CN104572621A (en)
Inventor
江潮
张芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Transn Information Technology Co., Ltd.
Original Assignee
Language Network (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Language Network (wuhan) Information Technology Co Ltd filed Critical Language Network (wuhan) Information Technology Co Ltd
Priority to CN201510002515.XA priority Critical patent/CN104572621B/en
Publication of CN104572621A publication Critical patent/CN104572621A/en
Application granted granted Critical
Publication of CN104572621B publication Critical patent/CN104572621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Machine Translation (AREA)

Abstract

A term determination method based on a decision tree includes: cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes; determining multiple features that influence term determination and calculating the value of each feature of each candidate term; evaluating the multiple feature values of each candidate term against a decision tree used for term determination, in the order in which the decision tree was generated; and taking the candidate terms that pass the decision tree as new terms. The invention reduces the workload of manual processing and ensures that the obtained terms have high reliability and accuracy.

Description

Term determination method based on decision tree
Technical field
The invention belongs to the field of data mining technology, and in particular relates to a term determination method based on a decision tree.
Background technology
A domain term, or technical term, is an agreed symbol, carried by speech or writing, that expresses or delimits a professional concept. With the rapid development of science and technology, the continual emergence of new technologies and the fast evolution of Internet technology, the technical terms of specific domains keep expanding and being updated, so the traditional manual collection of domain terms can no longer meet practical demand, and automatic term extraction (ATE, Automatic Term Extraction) has become inevitable. In practical applications, extracting domain terms is of great significance for building domain ontologies, Chinese word segmentation, information extraction, lexicography, information retrieval, machine translation, text classification, automatic summarization, and so on.
At present, the domain term extraction methods used in industry analyze and judge vocabulary from a single aspect only, and the effect of domain term extraction is poor.
Summary of the invention
One object of the present invention is to provide a term determination method based on a decision tree, so as to solve the prior-art problem that the effect of domain term extraction is poor.
In some illustrative embodiments, the term determination method based on a decision tree includes: cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes; determining multiple features that influence term determination and calculating the value of each feature of each candidate term; evaluating the multiple feature values of each candidate term against a decision tree used for term determination, in the order in which the decision tree was generated; and taking the candidate terms that pass the decision tree as new terms.
Compared with the prior art, the illustrative embodiments of the invention have the following advantage:
the workload of manual processing is reduced, and the obtained terms have high reliability and accuracy.
Brief description of the drawings
The accompanying drawings described herein are provided for a further understanding of the invention and form a part of the application; the schematic embodiments of the invention and their description are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
Fig. 1 is a flow chart according to an illustrative embodiment of the invention.
Detailed description of the embodiments
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, those skilled in the art will understand that the invention can be practised even without these specific details. In other instances, well-known methods, processes, components and circuits are not described in detail so as not to obscure the understanding of the invention.
As shown in Fig. 1, a term determination method based on a decision tree is disclosed, including:
S11, cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes;
S12, determining multiple features that influence term determination, and calculating the value of each feature of each candidate term;
S13, evaluating the multiple feature values of each candidate term against a decision tree used for term determination, in the order in which the decision tree was generated;
S14, taking the candidate terms that pass the decision tree as new terms.
This reduces the workload of manual processing and ensures that the obtained terms have high reliability and accuracy.
The above method is described in detail below.
For example: the source corpus "中华人民共和国" (People's Republic of China) is cut as follows. Cutting first into units of two morphemes yields six candidate terms: 中华, 华人, 人民, 民共, 共和, 和国. Cutting into units of three morphemes yields five candidate terms: 中华人, 华人民, 人民共, 民共和, 共和国. Cutting into units of four morphemes yields four candidate terms: 中华人民, 华人民共, 人民共和, 民共和国. Cutting into units of five morphemes yields three candidate terms: 中华人民共, 华人民共和, 人民共和国. Cutting into units of six morphemes yields two candidate terms: 中华人民共和, 华人民共和国. Cutting into units of seven morphemes yields the candidate term 中华人民共和国 itself. In total, 21 candidate terms are obtained.
The cutting process above is a simplified example given for ease of understanding the illustrative embodiment. In practice the source corpus may be a text or a collection of texts consisting of a large number of morphemes, and the cutting process is more complex. In addition, if a term is too long it should rather be understood as a sentence, so the length of terms needs to be limited by a maximum segmentation unit, for example a maximum of 10 morphemes.
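The cutting step S11 can be illustrated with a minimal sketch in Python, assuming each Chinese character counts as one morpheme; the function name cut_candidates and the default maximum segmentation unit are illustrative, not taken from the patent.

```python
def cut_candidates(corpus: str, max_segment: int = 10) -> list[str]:
    """Return every contiguous substring of 2..max_segment morphemes."""
    candidates = []
    n = len(corpus)
    for length in range(2, min(max_segment, n) + 1):   # at least two morphemes
        for start in range(n - length + 1):
            candidates.append(corpus[start:start + length])
    return candidates

print(len(cut_candidates("中华人民共和国")))  # 21 candidate terms, as in the example
```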
In some illustrative embodiments, the multiple features that influence term determination include:
the word frequency of the candidate term in the source corpus; the minimum of the mutual information of the two parts over all ways of splitting the candidate term into two parts of arbitrary length; the larger of the left entropy and the right entropy of the candidate term; the probability that the candidate term independently forms a word; the probability of occurrence, in the history corpus, of each morpheme of the candidate term at the prefix position, the in-word position and the suffix position; and the domain probability of the candidate term.
The acquisition of each of the above features is described in detail below:
1) Analyze the word frequency of the candidate term, i.e. obtain the number of occurrences of the candidate term in the source corpus.
2) Analyze the mutual information of the candidate term: over all ways of splitting the candidate term into two parts of arbitrary length, obtain the minimum of the mutual information of the two parts.
For example: for a candidate term C of length l morphemes, split at the k-th morpheme position, the front part obtained is c1~ck and the rear part is ck+1~cl.
The mutual information is calculated according to the following formula:
MI(C) = min_k log( P(c1c2…cl) / ( P(c1c2…ck) · P(ck+1ck+2…cl) ) ), k = 1, 2, …, l-1
where c1c2…cl are the morphemes at the corresponding positions of candidate term C, P(c1c2…cl) is the probability of occurrence of candidate term C in the source corpus, P(c1c2…ck) is the probability of occurrence of the front part of candidate term C in the source corpus, and P(ck+1ck+2…cl) is the probability of occurrence of the rear part of candidate term C in the source corpus.
The minimum of the mutual information is obtained as follows. For example: for the candidate term ABC, the first split gives A and BC, and the second split gives AB and C; the two mutual information values are 0.5 and 0.6 in turn, so 0.5 is taken as the mutual information of candidate term ABC.
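A minimal sketch of the minimum-mutual-information feature follows, assuming prob is an illustrative callable that returns the relative-frequency probability of a substring in the source corpus.

```python
import math

def min_mutual_information(term: str, prob) -> float:
    """prob(s) must return the occurrence probability of substring s."""
    values = []
    for k in range(1, len(term)):                       # every split point
        front, rear = term[:k], term[k:]
        values.append(math.log(prob(term) / (prob(front) * prob(rear))))
    return min(values)                                  # minimum over all splits
```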
3) Analyze the left and right entropy of the candidate term and determine the left-right entropy of the candidate term.
They are calculated according to the following formulas:
LH(C) = - Σ_{l∈L} P(lC|C) · log P(lC|C)
where LH(C) is the left entropy of candidate term C, L denotes the set of words that appear to the left of candidate term C, and P(lC|C) is the conditional probability that word l appears to the left of candidate term C;
RH(C) = - Σ_{r∈R} P(Cr|C) · log P(Cr|C)
where RH(C) is the right entropy of candidate term C, R denotes the set of words that appear to the right of candidate term C, and P(Cr|C) is the conditional probability that word r appears to the right of candidate term C;
LRH = max(LH(C), RH(C))
where LRH is the left-right entropy of candidate term C, obtained by taking the maximum of its left entropy and right entropy.
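A minimal sketch of the left-right entropy feature, assuming left_neighbors and right_neighbors are illustrative Counters of the words observed immediately to the left and right of the candidate term in the corpus.

```python
import math
from collections import Counter

def entropy(neighbor_counts: Counter) -> float:
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())

def left_right_entropy(left_neighbors: Counter, right_neighbors: Counter) -> float:
    # LRH = max(LH, RH), as in the formula above
    return max(entropy(left_neighbors), entropy(right_neighbors))
```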
4) Analyze the independence of the candidate term, i.e. the probability, in the history corpus, that each morpheme of the candidate term independently forms a word.
The probability IPW(x) that each morpheme x of the candidate term independently forms a word is calculated according to the following formula:
IPW(x) = word(x) / times(x)
where word(x) is the number of times morpheme x independently forms a word in the history corpus, and times(x) is the total number of times morpheme x occurs in the history corpus;
the probability IPW(C) that candidate term C independently forms a word is calculated according to the following formula:
IPW (C)=IPW (c1c2…cl)=IPW (c1)·IPW(c2)·…·IPW(cl)
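A minimal sketch of the independent word-formation probability, assuming word_count and total_count are illustrative lookups into the history-corpus statistics word(x) and times(x).

```python
def ipw_morpheme(x: str, word_count, total_count) -> float:
    # IPW(x) = word(x) / times(x)
    return word_count(x) / total_count(x)

def ipw_term(term: str, word_count, total_count) -> float:
    # IPW(C) is the product of the per-morpheme probabilities
    p = 1.0
    for morpheme in term:
        p *= ipw_morpheme(morpheme, word_count, total_count)
    return p
```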
5) Analyze the position at which each morpheme of the candidate term appears, and obtain the probability of occurrence of the morphemes at the different positions of the candidate term from an internal word probability table. The internal word probability table is obtained as follows: in an existing term corpus, calculate the probability that each morpheme x appears at the head, middle and tail of a term, thereby obtaining an internal word probability table covering all morphemes. The formulas are as follows:
IPC(x, 0) = times(x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 1) = times(*x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 2) = times(*x) / ( times(x*) + times(*x*) + times(*x) )
where "*" denotes the morphemes combined with morpheme x before or after it to form a term, and times(X) denotes the number of occurrences of term X in the term corpus. IPC(x, pos) denotes the probability that morpheme x appears at position pos. pos takes values in {0, 1, 2}: 0 denotes the prefix position, 1 the in-word position and 2 the suffix position.
For an l-morpheme string C = c1c2…cl to be evaluated, its internal word probability IPC is calculated from the internal word probability table obtained above as
IPC = [ IPC(c1, 0) · IPC(cl, 2) · (1/(l-2)) · Σ_{i=2}^{l-1} IPC(ci, 1) ]^(1/3)
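A minimal sketch of the internal word probability of a candidate string, assuming ipc_table is an illustrative precomputed mapping (morpheme, pos) -> IPC(x, pos); the handling of two-morpheme strings, for which 1/(l-2) is undefined, is an assumption not specified in the patent.

```python
def internal_word_probability(term: str, ipc_table: dict) -> float:
    l = len(term)
    head = ipc_table.get((term[0], 0), 0.0)    # IPC(c1, 0); unseen morphemes default to 0 here
    tail = ipc_table.get((term[-1], 2), 0.0)   # IPC(cl, 2)
    if l > 2:
        middle = sum(ipc_table.get((c, 1), 0.0) for c in term[1:-1]) / (l - 2)
    else:
        middle = 1.0                           # assumed fallback for l == 2
    return (head * tail * middle) ** (1 / 3)   # cube root of the product
```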
6) Analyze the probability of occurrence, in the history corpus, of each morpheme or morpheme combination of the candidate term, and determine the domain probability of the candidate term.
Count and calculate the probability of occurrence P(F_ci), in the history corpus, of each morpheme or morpheme combination of the candidate term;
then calculate the domain probability PC of the candidate term according to the following formula:
PC = [ Π_{i=1}^{n} P(F_ci) ]^(1/n)
where n is the number of morphemes or morpheme combinations of the candidate term.
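A minimal sketch of the domain probability, assuming p_field is an illustrative lookup of P(F_ci) for a morpheme or morpheme combination.

```python
def field_probability(components: list[str], p_field) -> float:
    n = len(components)
    product = 1.0
    for c in components:
        product *= p_field(c)
    return product ** (1 / n)   # geometric mean of the component probabilities
```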
By analyzing the candidate terms with respect to word frequency, mutual information, left-right entropy, independence, structure and domain, and using these as the features of the candidate terms, the obtained terms are made more reliable and accurate.
In some illustrative embodiments, before evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated, the method further includes:
randomly selecting from a terminology bank a certain number of consecutive terms that have already been recognized;
building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm.
In some illustrative embodiments, building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm includes:
taking each feature as a decision node of the decision tree, and determining the generation order of the decision tree according to the relative magnitudes of the information gain or the information gain ratio of the multiple features;
wherein each decision node has a decision threshold for the corresponding feature, which is used to form the branches of the decision tree.
In some illustrative embodiments, evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated specifically includes:
comparing each feature value of the candidate term, in the order in which the decision tree was generated, with the decision threshold at the corresponding decision node of the decision tree;
if the candidate term passes the decision at a leaf node of the decision tree, marking the candidate term as a new term.
Preferably, the illustrative embodiments described above are explained in detail below:
1. Feature items and their values
1) Word frequency WT
Cut the new corpus to obtain strings of arbitrary length from the new corpus, and take the obtained strings of arbitrary length as string set 1. Count the word frequency of each string in string set 1, i.e. count the number of occurrences in the new corpus of each string in string set 1.
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the word frequency of the candidate term is greater than the given threshold.
2) Mutual information MI
Mutual information is a concept from information theory used to measure the degree of correlation between two units; the larger the mutual information of a string, the more likely the string constitutes a term.
For an n-gram string, i.e. a string of length n, the mutual information is computed as follows: compute the mutual information of the two substrings obtained by every possible split of the n-gram string, and take the minimum as the mutual information of the n-gram string. Expressed as a formula:
let the n-gram string be C = c1c2…cn; its mutual information is calculated as
MI(C) = min_k log( P(c1c2…cn) / ( P(c1c2…ck) · P(ck+1ck+2…cn) ) )
where k ∈ {1, 2, …, n-1}.
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the mutual information of the candidate term is greater than the given threshold.
3) Left-right entropy LRH
In natural language processing, the left-right entropy of a string is an important statistical feature that reflects how active the string's context is; it is widely used in fields such as term extraction and new word detection. If a string has a large left-right entropy, its context collocations are rich and it has considerable flexibility and independence in use, which also indicates that the string is an unstable composition, i.e. the probability that the string is a term is relatively low.
The left-right entropy of a string is calculated as follows:
LH(C) = - Σ_{l∈L} P(lC|C) · log P(lC|C)
RH(C) = - Σ_{r∈R} P(Cr|C) · log P(Cr|C)
LRH(C) = max(LH(C), RH(C))
where L denotes the set of words that appear to the left of string C; R denotes the set of words that appear to the right of string C; P(lC|C) is the conditional probability that character l appears to the left of string C; and P(Cr|C) is the conditional probability that character r appears to the right of string C.
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the left-right entropy of the candidate term is greater than the given threshold.
4) Independent word probability IPW
For a string C, the larger its independent word probability IPW(C), the smaller the possibility that C is a term.
The independent word probability is calculated as follows.
For any character x, the possibility IPW(x) that x independently forms a word in a sentence is computed as
IPW(x) = word(x) / times(x)
where word(x) is the number of times character x independently forms a word, and times(x) is the number of occurrences of x in the new corpus;
then the independent word probability of candidate term C is computed as:
IPW (C)=IPW (c1c2…cn)=IPW (c1)·IPW(c2)·…·IPW(cn)
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the independent word probability of the candidate term is greater than the given threshold.
5) Internal word probability IPC
The internal word probability represents the probability that a character appears at a certain position within a term. IPC(x, pos) denotes the probability that character x appears at position pos; pos takes values in {0, 1, 2}, where 0 denotes the prefix position, 1 the in-word position and 2 the suffix position. The internal word probability expresses how well the characters at the head, middle and tail positions of a string conform to those of terms; the larger its value, the more likely the string is a term.
It is computed as follows: in the existing term corpus, calculate the probability that each character x appears at the head, middle and tail of a term, thereby obtaining an internal word probability table covering all characters. The calculation formulas are:
IPC(x, 0) = times(x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 1) = times(*x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 2) = times(*x) / ( times(x*) + times(*x*) + times(*x) )
where "*" denotes the strings combined with character x before or after it to form a term, and times(X) denotes the number of occurrences of term X in the term corpus.
For an n-gram string C = c1c2…cn to be evaluated, its internal word probability is computed from the internal word probability table obtained above as
IPC = [ IPC(c1, 0) · IPC(cn, 2) · (1/(n-2)) · Σ_{i=2}^{n-1} IPC(ci, 1) ]^(1/3)
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the internal word probability of the candidate term is greater than the given threshold.
6) Domain probability PC
The domain probability indicates the probability that the string belongs to the domain terminology.
Calculate the domain probability of each string in string set 6, remove the strings whose domain probability is below the given threshold, and obtain the final candidate term set.
For each string C in string set 6, calculate the probability of occurrence P(F_ci) of each of its characters in the existing term corpus, and compute the domain probability as
PC = [ Π_{i=1}^{n} P(F_ci) ]^(1/n)
According to a given threshold, this feature item takes a value in {0, 1}, indicating whether the domain probability of the candidate term is greater than the given threshold.
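A minimal sketch of converting the six raw feature values into the {0, 1} feature items described above, assuming illustrative feature names and a thresholds mapping; the patent does not fix concrete threshold values.

```python
FEATURES = ["WT", "MI", "LRH", "IPW", "IPC", "PC"]

def binarize(raw_values: dict, thresholds: dict) -> dict:
    """raw_values and thresholds both map feature name -> float."""
    return {f: int(raw_values[f] > thresholds[f]) for f in FEATURES}
```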
2. Building the term decision tree for candidate terms
The decision tree is built from a corpus of terms that has already been organized.
Input:
Training set D: a corpus of terms that has been organized and confirmed
Decision classes: C = {C1, C2}, where C1 = term and C2 = not a term
Feature item set: A = {A1, A2, A3, A4, A5, A6}, where A1 = WT, A2 = MI, A3 = LRH, A4 = IPW, A5 = IPC, A6 = PC
Threshold: th
Output: term decision tree T
Algorithm flow:
1) If all strings in D belong to the same class Ci, set the decision tree T to a single-node tree, take Ci as the class of that node, and return T;
2) If A = ∅, set T to a single-node tree, take the class Ci with the largest number of strings in D as the class of that node, and return T;
3) Otherwise, calculate the information gain ratio of each of the features A1~A6 with respect to D, and select the feature Aj with the largest information gain ratio;
4) If the information gain ratio of Aj is less than the threshold th, set T to a single-node tree, take the class Ci with the largest number of strings in D as the class of that node, and return T;
5) Otherwise, for each possible value of feature Aj, split D into multiple non-empty subsets Dk, take the class with the largest number of strings in Dk as the label to build a child node, form the decision tree T from the node and its child nodes, and return T;
6) For each child node k, with Dk as the training set and A - {Aj} as the feature set, recursively call steps 1)~5) to obtain the subtree Ti, and return Ti.
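A minimal sketch of the C4.5-style construction described by the algorithm flow above, operating on the binary feature items; the Node class, gain_ratio and build_tree names are illustrative, and the gain-ratio threshold th is passed in as in the input specification.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str = None                 # class label for leaf nodes ("term" / "not term")
    feature: str = None               # feature item tested at this node
    children: dict = field(default_factory=dict)   # feature value -> child Node

def class_entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, f):
    # rows: list of dicts mapping feature name -> {0, 1}; labels: parallel class list
    base = class_entropy(labels)
    split_ent, cond_ent = 0.0, 0.0
    for v in set(r[f] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[f] == v]
        p = len(sub) / len(labels)
        cond_ent += p * class_entropy(sub)
        split_ent -= p * math.log2(p)
    gain = base - cond_ent
    return gain / split_ent if split_ent > 0 else 0.0

def build_tree(rows, labels, features, th):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:          # steps 1) and 2)
        return Node(label=majority)
    best = max(features, key=lambda f: gain_ratio(rows, labels, f))   # step 3)
    if gain_ratio(rows, labels, best) < th:            # step 4)
        return Node(label=majority)
    node = Node(feature=best)                          # steps 5) and 6)
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node.children[v] = build_tree([rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [f for f in features if f != best], th)
    return node
```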
3. Determination of candidate terms
Cut the new corpus to obtain strings of arbitrary length from the new corpus, and take these strings as candidate terms.
Calculate the word frequency WT, mutual information MI, left-right entropy LRH, independent word probability IPW, internal word probability IPC and domain probability PC of these candidate terms, compare these values with their respective thresholds, and obtain the value of each feature item of each candidate term.
According to the values of the feature items of a candidate term, make the determination on the term decision tree T in the order in which the decision tree was generated.
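A minimal sketch of judging a candidate term with a tree built as above; Node and binarize are the illustrative helpers from the earlier sketches, not names used in the patent.

```python
def judge(node: Node, feature_items: dict) -> str:
    # Descend until a leaf is reached; assumes every encountered feature value
    # was seen during tree construction.
    while node.label is None:
        node = node.children[feature_items[node.feature]]
    return node.label   # "term" or "not term"

# Example usage (illustrative):
# result = judge(tree, binarize(raw_values, thresholds))
```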
The above description of the embodiments is only intended to help understand the method of the invention and its core idea; meanwhile, for those of ordinary skill in the art, there will be changes in specific implementations and scope of application according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (4)

1. A term determination method based on a decision tree, characterized by comprising:
cutting a source corpus into segments of arbitrary length in units of morphemes to obtain a number of candidate terms, wherein each candidate term consists of at least two morphemes;
determining multiple features that influence term determination, and calculating the value of each feature of each candidate term;
the multiple features comprising: the word frequency of the candidate term in the source corpus; the minimum of the mutual information of the two parts over all ways of splitting the candidate term into two parts of arbitrary length; the larger of the left entropy and the right entropy of the candidate term; the probability that the candidate term independently forms a word; the probability of occurrence, in the history corpus, of each morpheme of the candidate term at the prefix position, the in-word position and the suffix position; and the domain probability of the candidate term;
wherein the probability IPW(x) that each morpheme x of the candidate term independently forms a word is calculated according to the following formula:
IPW(x) = word(x) / times(x)
wherein word(x) is the number of times morpheme x independently forms a word in the history corpus, and times(x) is the total number of times morpheme x occurs in the history corpus;
the probability IPW(C) that candidate term C independently forms a word is calculated according to the following formula:
    IPW (C)=IPW (c1c2…cl)=IPW (c1)·IPW(c2)·…·IPW(cl)
wherein c1, c2, …, cl are the morphemes at the corresponding positions of candidate term C;
an internal word probability table covering all morphemes is obtained from the probability of occurrence of each morpheme at the prefix position, the in-word position and the suffix position in the history corpus, calculated as follows:
IPC(x, 0) = times(x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 1) = times(*x*) / ( times(x*) + times(*x*) + times(*x) )
IPC(x, 2) = times(*x) / ( times(x*) + times(*x*) + times(*x) )
    Wherein " * " represents the front and rear morpheme combination with morpheme x composition terms, and times (X) represents the term X in term language material Occurrence number in storehouse;IPC (x, pos) represents that the morpheme x appears in position pos probability;Pos values are { 0,1,2 }, 0 Represent position prefix, 1 represent position in word, 2 represent positions in suffix;
for an l-morpheme string C = c1c2…cl to be evaluated, its internal word probability IPC is calculated from the internal word probability table as:
IPC = [ IPC(c1, 0) · IPC(cl, 2) · (1/(l-2)) · Σ_{i=2}^{l-1} IPC(ci, 1) ]^(1/3);
wherein the domain probability of the candidate term is calculated according to the following formula:
PC = [ Π_{i=1}^{n} P(F_ci) ]^(1/n)
wherein P(F_ci) is the probability of occurrence, in the history corpus, of each morpheme or morpheme combination of the candidate term, and n is the number of morphemes or morpheme combinations of the candidate term;
evaluating the multiple feature values of each candidate term against the decision tree used for term determination, in the order in which the decision tree was generated; and
taking the candidate terms that pass the decision tree as new terms.
2. The term determination method according to claim 1, characterized in that before evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated, the method further comprises:
randomly selecting from a terminology bank a certain number of consecutive terms that have already been recognized;
building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm.
3. The term determination method according to claim 2, characterized in that building the decision tree from the selected terms and the multiple features using the ID3 algorithm or the C4.5 algorithm comprises:
taking each feature as a decision node of the decision tree, and determining the generation order of the decision tree according to the relative magnitudes of the information gain or the information gain ratio of the multiple features;
wherein each decision node has a decision threshold for the corresponding feature, for forming the branches of the decision tree.
4. The term determination method according to claim 3, characterized in that evaluating the multiple feature values of each candidate term against the decision tree used for term determination in the order in which the decision tree was generated specifically comprises:
comparing each feature value of the candidate term, in the order in which the decision tree was generated, with the decision threshold at the corresponding decision node of the decision tree;
if the candidate term passes the decision at a leaf node of the decision tree, marking the candidate term as a new term.
CN201510002515.XA 2015-01-05 2015-01-05 Term determination method based on decision tree Active CN104572621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510002515.XA CN104572621B (en) 2015-01-05 2015-01-05 Term determination method based on decision tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510002515.XA CN104572621B (en) 2015-01-05 2015-01-05 Term determination method based on decision tree

Publications (2)

Publication Number Publication Date
CN104572621A CN104572621A (en) 2015-04-29
CN104572621B true CN104572621B (en) 2018-01-26

Family

ID=53088725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510002515.XA Active CN104572621B (en) 2015-01-05 2015-01-05 Term determination method based on decision tree

Country Status (1)

Country Link
CN (1) CN104572621B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649277B (en) * 2016-12-29 2020-07-03 语联网(武汉)信息技术有限公司 Dictionary entry method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122919A (en) * 2007-09-14 2008-02-13 中国科学院计算技术研究所 Professional term extraction method and system
CN102402501A (en) * 2010-09-09 2012-04-04 富士通株式会社 Term extraction method and device
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101122919A (en) * 2007-09-14 2008-02-13 中国科学院计算技术研究所 Professional term extraction method and system
CN102402501A (en) * 2010-09-09 2012-04-04 富士通株式会社 Term extraction method and device
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Statistical Corpus-Based Term Extractor; Patrick Pantel et al.; Springer Berlin Heidelberg; 2000-12-31; entire document *
A Chinese term extraction method based on statistical techniques; Liu Jian et al.; China Terminology; 2014-12-31 (No. 5); entire document *
Research on recognition of out-of-vocabulary words in specialized domains; Ju Fei; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 2013-12-15; Vol. 2013 (No. S2); abstract, Section 7.1 on page 45 and Section 7.3 on pages 46-48 of the body text *
Research on a professional term recognition method based on seed expansion; Wang Weimin et al.; Application Research of Computers; 2012-11-30; Vol. 29 (No. 11); entire document *
Research on statistics-based automatic key-phrase extraction from Chinese text; Han Yan; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly); 2009-10-15; Vol. 2009 (No. 10); Section 3.1.3 on page 30 and pages 32-33 of the body text *

Also Published As

Publication number Publication date
CN104572621A (en) 2015-04-29

Similar Documents

Publication Publication Date Title
CN104572622B (en) A kind of screening technique of term
JP6721179B2 (en) Causal relationship recognition device and computer program therefor
CN107122340B (en) A kind of similarity detection method of the science and technology item return based on synonym analysis
Nguyen et al. AIDA-light: High-Throughput Named-Entity Disambiguation.
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN105550170B (en) A kind of Chinese word cutting method and device
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN108845982A (en) A kind of Chinese word cutting method of word-based linked character
CN106708798B (en) Character string segmentation method and device
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN105787121B (en) A kind of microblogging event summary extracting method based on more story lines
CN104298732B (en) The personalized text sequence of network-oriented user a kind of and recommendation method
CN104598530B (en) A kind of method that field term extracts
CN105956158B (en) The method that network neologisms based on massive micro-blog text and user information automatically extract
CN104346382B (en) Use the text analysis system and method for language inquiry
CN111460170A (en) Word recognition method and device, terminal equipment and storage medium
CN107844608A (en) A kind of sentence similarity comparative approach based on term vector
CN108038166A (en) A kind of Chinese microblog emotional analysis method based on the subjective and objective skewed popularity of lexical item
CN104572633A (en) Method for determining meanings of polysemous word
Wang et al. Automatic scoring of Chinese fill-in-the-blank questions based on improved P-means
CN106126497A (en) A kind of automatic mining correspondence executes leader section and the method for cited literature textual content fragment
JP2009015796A (en) Apparatus and method for extracting multiplex topics in text, program, and recording medium
CN104572621B (en) A kind of term decision method based on decision tree
CN110413985B (en) Related text segment searching method and device
CN110110079A (en) A kind of social networks junk user detection method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: WUHAN TRANSN INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: YULIANWANG (WUHAN) INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20150805

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20150805

Address after: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Wuhan Transn Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 6, layer 206, six

Applicant before: Language network (Wuhan) Information Technology Co., Ltd.

CB02 Change of applicant information
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant before: Wuhan Transn Information Technology Co., Ltd.

GR01 Patent grant
GR01 Patent grant