CN105488098A - Field difference based new word extraction method - Google Patents

Field difference based new word extraction method

Info

Publication number
CN105488098A
Authority
CN
China
Prior art keywords
word
candidate
field
difference
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510711219.7A
Other languages
Chinese (zh)
Other versions
CN105488098B (en)
Inventor
史树敏
周新宇
黄河燕
史胜清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201510711219.7A priority Critical patent/CN105488098B/en
Publication of CN105488098A publication Critical patent/CN105488098A/en
Application granted granted Critical
Publication of CN105488098B publication Critical patent/CN105488098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a field difference based new word extraction method and belongs to the technical field of natural language processing applications. The method comprises: first, comparing the difference in word distribution between different fields to obtain difference word seeds; second, expanding the difference word seeds in an n-gram manner to construct a candidate word set, and removing repeated words from the candidate word set according to the field difference value; and finally, for each word in the candidate word set, taking the field difference value, the cohesion degree and the word-formation probability as measures and removing candidate words with relatively low field difference to obtain the new words. Compared with the prior art, the method selects seed words using the difference information between corpora of different fields and expands the seed words through n-grams to obtain the candidate word set; the new words among the candidates are then selected automatically using the information of the words themselves together with the field difference information, so that both the number of new words discovered and the accuracy of new word discovery are remarkably improved.

Description

A new word extraction method based on field difference
Technical field
The present invention relates to a method of new word extraction, and in particular to a new word extraction method based on field difference, belonging to the technical field of natural language processing applications.
Background technology
Network neologisms are special expressions or words that appear and become popular along with the Internet. They usually derive from popular film, television and Internet expressions, or are coined and widely accepted in response to some social phenomenon. Network neologisms occur frequently in web texts such as forum posts and microblogs. Statistics show that more than 1000 neologisms enter daily use in China every year. According to related research, more than 60% of word segmentation errors are caused by network neologisms, and the accuracy of new word recognition directly affects the performance of intelligent information processing systems. For example, in the text sentiment analysis task of intelligent information processing, fixed phrase collocations embody sentiment polarity, and if a neologism phrase cannot be recognized correctly, the judged sentiment polarity is distorted. Take the web review of a product "表达十分高大上" (roughly "the expression is very gao-da-shang"): here "高大上" should actually be treated as a network neologism that as a whole expresses the positive sentiment of "高端大气上档次" ("high-end, stylish, a cut above"). However, in almost all current application systems the annotated sequence produced by word segmentation is "表达/v 十分/adv 高/adj 大/adj 上/adv", that is, the neologism is cut into single characters; this erroneous segmentation loses the positive sentiment of the sentence and seriously harms subsequent intelligent analysis. Therefore, effective recognition of neologisms is of great importance in the field of natural language processing.
At present, new word extraction methods fall into two classes: rule-based methods and statistics-based methods. The main idea of rule-based methods is to start from the word-formation principles of neologisms, use them as the theoretical basis for building resources that help recognize new words, then study the linguistic properties of words and build a dedicated word-formation rule base from the natural attributes of words. Rule-based methods recognize neologisms with high accuracy, but they require very strong linguistic skill and relevant domain background. Statistics-based methods realize new word recognition mainly in two ways. One treats new word extraction as an indispensable part of word segmentation and infers the most probable segmentation with a statistical model, thereby obtaining the new words; classical statistical models include Conditional Random Fields (CRF) and gradient-descent-trained models based on frequency features. The other treats new word extraction as an independent task and usually requires part-of-speech (POS) tagging as preprocessing. Because network neologisms are produced in real time, spread quickly and change dynamically, purely rule-based methods are often ineffective, while purely statistical methods suffer from sparse training data and the difficulty of extracting effective features. Most current researchers therefore combine rules and statistics to exploit their respective advantages, but these methods all ignore the informational characteristics of the corpora themselves, namely the difference in meaning (intension) of the same word across different domain topics, which manifests itself as different distributions of the same word under different domain topics.
Summary of the invention
Aimed at the neologisms that are continuously created and used on the Internet, the present invention proposes a new word extraction method based on field difference. The method makes full use of the characteristics of corpora from different fields and, under the existing general evaluation systems, effectively improves the accuracy of new word recognition.
The idea of the present invention is to obtain difference word seeds by comparing the difference in word distribution between different fields, to expand the difference words in an n-gram manner to build a candidate word set, and then, for each word in the candidate word set, to take the field difference value, the cohesion degree and the word-formation probability as criteria and further extract the new words.
The definitions used in the present invention are as follows:
Definition 1: field difference word, a single character that embodies field difference. Such a character reflects domain characteristics, and its occurrence frequency differs greatly between corpora of different fields. For example, if the ratio of the occurrence frequency f_internet(c) of a character c in the web corpus to its occurrence frequency f_news(c) in the news corpus exceeds a threshold λ, then c is called a field difference word. Because single characters combine to form words, the present invention also assumes that if a character exhibits such a difference, the words formed from it exhibit a corresponding difference in word distribution.
Definition 2: repeated words. If a word W_a and a word W_b satisfy the condition that one contains the other as a substring, W_b and W_a are called repeated words of each other. For example, "liking general greatly running quickly" (W_a) and "general greatly run quickly" (W_b).
Definition 3: field difference value DV (Difference Value), the measure of field difference. It is calculated from the occurrence frequency f_internet(W) of a word W in the web corpus and its occurrence frequency f_news(W) in the news corpus, where f_internet(W) denotes the occurrence frequency of word W in the web corpus and f_news(W) denotes its occurrence frequency in the news corpus.
Definition 4: cohesion degree CV (Concrete Value), a quantitative index of whether a word is correctly segmented. For example, "cinema" (电影院) can be glued together in two ways: "film" (电影) + "institute" (院) and "electricity" (电) + "movie theatre" (影院). For any word W = c_1 c_2 (where c_1 and c_2 denote the characters or sub-words composing the word), all possible ways of gluing it together are enumerated, the corresponding value is computed for each, and the minimum is taken as the cohesion degree of the word.
Definition 5: word-formation probability NWP (New Word Probability), an index for judging whether a character sequence forms a word. For example, "love to say" and "love to eat" are composed of single characters, but their NWP is very low, meaning that neither forms a word.
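For illustration only, the three measures of Definitions 3 to 5 can be computed as in the following Python sketch; the helper names (dv, cv, nwp) and the dictionary-based frequency tables are assumptions of this sketch, not part of the claimed method, and the base-10 logarithm is chosen because it reproduces the numeric values given in the embodiment below.

import math

def dv(f_s1, f_s2):
    # Definition 3: field difference value, DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W))).
    return math.log10((1.0 + f_s1) / (1.0 + f_s2))

def cv(word, freq):
    # Definition 4: cohesion degree, the minimum of f(W) / (f(c1) * f(c2)) over every
    # split of the word into two parts; `freq` is assumed to hold the corpus
    # frequency of the word and of all of its substrings.
    return min(freq[word] / (freq[word[:i]] * freq[word[i:]])
               for i in range(1, len(word)))

def nwp(word, freq, single):
    # Definition 5: word-formation probability, the product over the characters of
    # P(c) / (1 - P(c)) with P(c) = (f(c) - Single(c)) / f(c), where Single(c)
    # counts how often c is left as a single character by a word segmenter.
    prod = 1.0
    for c in word:
        p = (freq[c] - single[c]) / freq[c]
        if p >= 1.0:
            return float('inf')   # character never occurs alone after segmentation
        prod *= p / (1.0 - p)
    return prod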
The object of the present invention is achieved by the following steps:
A new word extraction method based on field difference comprises the following steps:
Step 1: contrast the input corpus S_1 of the field from which new words are to be obtained with a corpus S_2 of another field to obtain field difference word seeds.
Preferably, the field difference word seeds are obtained by the following steps (an illustrative sketch follows this list):
(1) count, for each character "c", its occurrence frequency f_s1(c) in S_1 and its occurrence frequency f_s2(c) in S_2;
(2) compute the difference value of each character between S_1 and S_2 by the following formula:
D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
(3) set a threshold λ; if the difference value D_word_seg(c) of character "c" exceeds λ, take "c" as a difference word seed.
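A minimal sketch of this seed-selection step, assuming the corpora are given as collections of sentence strings; the function name and the use of raw character counts follow the embodiment and are not a prescribed implementation.

from collections import Counter

def difference_word_seeds(corpus_s1, corpus_s2, lam=2.0):
    # corpus_s1: sentences of the field from which new words are sought (e.g. web text);
    # corpus_s2: sentences of the contrast field (e.g. news); lam is the threshold λ.
    f_s1 = Counter(ch for sentence in corpus_s1 for ch in sentence)
    f_s2 = Counter(ch for sentence in corpus_s2 for ch in sentence)
    seeds = set()
    for ch, f1 in f_s1.items():
        d = f1 / (1.0 + f_s2[ch])       # D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
        if d >= lam:                    # the embodiment keeps characters with d >= λ
            seeds.add(ch)
    return seeds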
Step 2: expand the field difference word seeds to build the candidate word set Set_candidate.
Preferably, the expansion is performed in an n-gram manner by the following steps (an illustrative sketch follows this list):
(1) in corpus S_1, take n = 2, 3, 4 and 5 respectively and obtain all corresponding n-grams; keep an n-gram if it contains any difference word, count its occurrence frequency, and add it to the candidate word set Set_candidate;
(2) compare the frequency of every candidate word W in Set_candidate with a predetermined threshold; if its frequency is below the threshold, delete W from Set_candidate.
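The n-gram expansion of step 2 can be sketched as follows; min_freq stands in for the unspecified frequency threshold of sub-step (2) and is an assumption of this sketch.

from collections import Counter

def build_candidates(corpus_s1, seeds, min_freq=2, n_values=(2, 3, 4, 5)):
    # Enumerate all n-grams of the target-field corpus, keep those that contain
    # at least one difference word seed, then drop candidates below min_freq.
    ngram_freq = Counter()
    for sentence in corpus_s1:
        for n in n_values:
            for i in range(len(sentence) - n + 1):
                gram = sentence[i:i + n]
                if any(ch in seeds for ch in gram):
                    ngram_freq[gram] += 1
    return {w: f for w, f in ngram_freq.items() if f >= min_freq}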
Step 3: remove repeated words from the candidate word set Set_candidate according to the field difference of the candidate words.
Preferably, the field difference of a candidate word W can be computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency with which W occurs in corpus S_1 and f_s2(W) denotes the frequency with which W occurs in corpus S_2.
Further, in order to obtain a better deduplication effect, the field difference of repeated words can take the cohesion degree into account together with the field difference value. That is, according to Definition 2, all repeated words in the candidate word set Set_candidate are found; the repeated words are compared pairwise, the one with the larger weight is kept and the one with the smaller weight is discarded; this process is repeated until Set_candidate no longer contains repeated words. The detailed process is as follows:
(1) according to Definition 2, take n = 2, 3, 4 and 5 and compare all words in Set_candidate to find all repeated words, where n denotes the number of characters contained in a word of the set Set_candidate;
(2) compute the cohesion degree CV(W) according to Definition 4 and the field difference value DV(W) according to Definition 3 for each repeated word, with the following formulas:
Cohesion degree:
CV(W) = min( f(W) / (f(c_1) × f(c_2)) )
where the minimum is taken over all ways of splitting W into two parts W = c_1 c_2;
Field difference value:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
Further, the weights V defined by the formula below are compared pairwise between repeated words, and the word with the larger weight is kept:
V(W) = α^n × DV(W) + CV(W)
Set_candidate = Set_candidate − argmin_{W ∈ {W_1, W_2}} V(W)
where α is a parameter expressing the tolerance allowed for the difference between different n-grams, n denotes the number of characters in word W, c_i denotes the i-th character or sub-word of W, and W_1 and W_2 are two words that are repeated words of each other;
(3) steps (1) and (2) are repeated until the candidate word set no longer contains repeated words (an illustrative sketch of this loop follows).
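The deduplication loop of step 3 can be sketched as below, reusing the dv and cv helpers from the definitions; treating "repeated words" as substring containment is inferred from the examples in the embodiment, and the quadratic pairwise scan is only for illustration.

def remove_repeated_words(candidates, f_s1, f_s2, alpha=1.1):
    # candidates: dict {candidate word: frequency in S1}.
    # f_s1 / f_s2: frequency tables (e.g. Counters) over S1 / S2 that also cover
    # the characters and substrings of every candidate, as needed by cv().
    # For each repeated pair, the member with the smaller weight
    # V(W) = alpha**n * DV(W) + CV(W) (n = number of characters of W) is deleted,
    # and the scan repeats until no repeated pair remains.
    def weight(w):
        return alpha ** len(w) * dv(f_s1[w], f_s2[w]) + cv(w, f_s1)

    changed = True
    while changed:
        changed = False
        words = list(candidates)
        for w1 in words:
            for w2 in words:
                if (w1 != w2 and w1 in candidates and w2 in candidates
                        and w1 in w2):               # w1 is contained in w2
                    loser = min((w1, w2), key=weight)
                    del candidates[loser]
                    changed = True
    return candidates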
Step 4: remove from Set_candidate the candidate words whose field difference is low; add the candidate words whose field difference is higher than a predetermined threshold γ to the new word set Y and output it to obtain all new words.
Preferably, the field difference of a candidate word W can be computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency with which W occurs in corpus S_1 and f_s2(W) denotes the frequency with which W occurs in corpus S_2.
Further, the field difference can be obtained as follows: for each candidate word in the candidate word set Set_candidate, compute its field difference value (DV), cohesion degree (CV) and word-formation probability (NWP) according to Definitions 3, 4 and 5 respectively, and combine them in a certain proportion into a single characterization, specifically as follows (an illustrative sketch follows this list):
(1) compute the difference value DV(W) of candidate word W according to the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
(2) compute the word-formation probability NWP(W) of candidate word W according to the following formulas:
NWP(W) = ∏_{i=1}^{n} P(c_i) / (1 − P(c_i))
P(c_i) = (f(c_i) − Single(c_i)) / f(c_i)
where f(c_i) denotes the occurrence frequency of character c_i of W, and Single(c_i) denotes the frequency with which c_i still occurs as a single character after a word segmentation tool has been applied;
(3) compute the cohesion degree CV(W) of candidate word W according to the following formula:
CV(W) = min( f(W) / (f(c_1) × f(c_2)) )
where the minimum is taken over all ways of splitting W into two parts W = c_1 c_2;
(4) normalize the difference value (DV), the word-formation probability (NWP) and the cohesion degree (CV) respectively, with the following normalization formula:
s_j = (X_j − X_min) / (X_max − X_min)
where X_j is the current value of the measure for the j-th word (difference value, word-formation probability or cohesion degree), X_min denotes the minimum of that measure over all words, and X_max denotes its maximum over all words;
(5) compute the weight V of candidate word W according to the following formula:
V(W) = a × DV(W) + b × CV(W) + c × NWP(W)
where a, b and c respectively denote the proportions of the difference value, the cohesion degree and the word-formation probability in the weight V.
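A sketch of the final scoring of step 4, again reusing the dv, cv and nwp helpers; the default parameter values are those used in the embodiment, and the dictionary inputs are assumptions of this sketch rather than a prescribed interface.

def select_new_words(candidates, f_s1, f_s2, single,
                     a=0.6, b=0.4, c=-0.2, gamma=0.4):
    # For each remaining candidate compute DV, CV and NWP, min-max normalise each
    # measure over the candidate set, combine them as
    # V(W) = a*DV'(W) + b*CV'(W) + c*NWP'(W), and keep candidates with V > gamma.
    raw = {w: (dv(f_s1[w], f_s2[w]), cv(w, f_s1), nwp(w, f_s1, single))
           for w in candidates}

    def normaliser(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0          # avoid dividing by zero if all values are equal
        return lambda x: (x - lo) / span

    norms = [normaliser([v[k] for v in raw.values()]) for k in range(3)]
    return {w for w, (d, coh, prob) in raw.items()
            if a * norms[0](d) + b * norms[1](coh) + c * norms[2](prob) > gamma}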
Beneficial effect
Compared with the prior art, the present invention selects seed words by using the difference information between corpora of different fields and expands them by n-grams to obtain the candidate word set; it then automatically selects the new words among the candidate words by using the information of the words themselves together with the field difference information, thereby significantly increasing the number of new words discovered and significantly improving the accuracy of new word discovery.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the field difference based new word extraction method of an embodiment of the present invention;
Fig. 2 is a schematic diagram comparing the method of the present invention with four existing new word extraction methods in terms of the number of new words recognized and the recognition accuracy.
Embodiment
The method of the present invention is described in further detail below in conjunction with the accompanying drawings and an embodiment.
Embodiment
In the present embodiment, a web corpus is used as S_1 and a news corpus as S_2, and the method of the present invention is described in detail by way of example.
The web corpus is a selection of forum posts, as shown in Table 1:
Table 1:
The news corpus is a selection of news text from April 4, 2001, as shown in Table 2:
Table 2:
The field difference based new word extraction method, whose processing flow is shown in Fig. 1, comprises the following steps:
Step 1: obtain the field difference word seeds.
A field difference word is a character that occurs noticeably more often in one corpus than in the other. There are many ways of obtaining field difference words; this embodiment simply decides whether a character can be used as a field difference word seed by whether its difference value between the two corpora is higher than a predetermined threshold, specifically as follows:
Count the frequency with which each character occurs in the web corpus and the frequency with which it occurs in the news corpus; then compute the difference value between the two; finally set the threshold λ to 2 and take every character whose difference value is greater than or equal to λ as a difference word. The set of difference words obtained is shown in Table 3:
Table 3:
Step 2: expand the difference word seeds to obtain the candidate word set.
There are many ways of expanding the difference words into candidate words, for example by a dictionary or in an n-gram manner. The n-gram manner is adopted in this embodiment, specifically as follows: in the web corpus, take n = 2, 3, 4 or 5 respectively and obtain all n-gram strings; keep an n-gram if it contains any difference word, and delete it if it is a meaningless string. For example, from "good beautiful mew star people" the following n-grams can be extracted:
2-gram: {"good drift", "beautiful", "bright", "mew", "mew star", "star people"},
3-gram: {"good beautiful", "beautiful", "bright mew", "mew star", "mew star people"},
4-gram: {"good beautiful", "beautiful mew", "bright mew star", "mew star people"}, and 5-gram: {"good beautiful mew", "beautiful mew star", "bright mew star people"}.
Then the frequency of each of these n-grams is counted; a threshold is set, and a word W is selected as a candidate word when its frequency f(W) exceeds the threshold and it contains any of the above difference words. The candidate word set finally obtained is shown in Table 4:
Table 4:
Step 3: remove repeated words.
First, according to Definition 2, all repeated words in the candidate word set Set_candidate are found. The repeated word pairs found for "mew star people" are: {mew star, mew star people}, {star people, mew star people}, {mew star people, mew star people}, {mew star people, the mew star people of love}.
Secondly, within each pair of repeated words, the candidate word with the larger field difference is retained. Here the field difference could simply be characterized by the frequencies with which the candidate word occurs in the two corpora; to overcome the influence of the differing corpus sizes that a simple frequency difference would introduce, this embodiment characterizes it by the logarithm of the ratio of the two frequencies, as shown in the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
Further, experiments have shown that a better deduplication effect is obtained if the field difference takes into account not only the field difference value DV of the formula above but also the cohesion degree CV, that is, if the field difference is given by the weight that combines the two as in the following formula:
V(W) = α^n × DV(W) + CV(W)
Therefore, the cohesion degree and difference value of each of the above words are computed according to Definitions 3 and 4. Taking {mew star people, the mew star people of love} as an example of removing repeated words: the frequency of "mew star people" is 6, the frequency of "the mew star people of love" is 3, and the frequency of each in the news corpus is 0, so:
DV(mew star people) = log((6+1)/(0+1)) = 0.845
DV(the mew star people of love) = log((3+1)/(0+1)) = 0.602
CV(mew star people): the word has two ways of being glued together, "mew"+"star people" and "mew star"+"people", whose cohesion values are respectively
CV("mew"+"star people") = 6/(8×6) = 0.125
CV("mew star"+"people") = 6/(6×7) = 0.143.
The smaller value is taken as the cohesion degree of the word "mew star people":
CV(mew star people) = 0.125
Similarly, CV(the mew star people of love) has four ways of being glued together: "love"+"mew star people", "love"+"mew star people", "mew of love"+"star people" and "the mew star of love"+"people".
Their cohesion values are respectively:
CV("love"+"mew star people") = 3/(4×4) = 0.188
CV("love"+"mew star people") = 3/(3×6) = 0.167
CV("mew of love"+"star people") = 3/(3×6) = 0.167
CV("the mew star of love"+"people") = 3/(3×7) = 0.143
The smallest value is taken as the cohesion degree of the word "the mew star people of love":
CV(the mew star people of love) = 0.143
Taking the parameter α = 1.1:
V(mew star people) = 0.845 × 1.1^3 + 0.125 = 1.249
V(the mew star people of love) = 0.602 × 1.1^5 + 0.143 = 1.113
Therefore, in this deduplication step "mew star people" is retained and "the mew star people of love" is deleted. Step 3 is performed on all repeated words in Set_candidate until no repeated words remain. The candidate words finally determined are shown in Table 5:
Table 5:
Step 4: screen the candidate words according to the field difference, obtain the new word set and output it.
As in Step 3, the field difference could be characterized by taking the logarithm of the ratio of the candidate word's frequencies in the different corpora; experiments have shown, however, that a better effect is obtained if the field difference combines the field difference value DV, the word-formation probability NWP and the cohesion degree CV in a certain proportion, as in the following formula:
V(W) = a × DV(W) + b × CV(W) + c × NWP(W)
For each candidate word in the candidate word set Set_candidate, compute its field difference value, cohesion degree and word-formation probability according to Definitions 3, 4 and 5 respectively:
Taking the word "mew star people" as an example again:
Difference value: DV(mew star people) = log((6+1)/(0+1)) = 0.845
Cohesion degree: CV(mew star people) = 6/(8×6) = 0.125 (taking the split "mew"+"star people", which gives the minimum)
Word-formation probability:
In this embodiment the ICTCLAS word segmentation tool is used; after segmenting the text above, Single(mew) = 8, Single(star) = 6 and Single(people) = 7 are obtained; furthermore f(mew) = 8, f(star) = 6, f(people) = 7 and f(mew star people) = 6; therefore P(c_i) = 0 for every character and NWP(mew star people) = 0.
Further, to obtain a better extraction effect, the three values above need to be normalized before they are combined into the field difference weight.
Among the 7 words shown in Table 5, the maxima and minima of the three measures are respectively:
DV_max = 0.903; DV_min = 0.176;
CV_max = 0.25; CV_min = 0.071;
NWP_max = 1; NWP_min = 0;
After normalization, the three values of "mew star people" are respectively:
DV' = (0.845 − 0.176)/(0.903 − 0.176) = 0.920, CV' = (0.125 − 0.071)/(0.25 − 0.071) = 0.302, NWP' = 0.
Taking a = 0.6, b = 0.4 and c = −0.2:
V(mew star people) = 0.6 × 0.920 + 0.4 × 0.302 − 0.2 × 0 = 0.6728
The field differences of all the words in Table 5 obtained in this way are shown in Table 6:
Table 6:
Taking the threshold γ = 0.4 and filtering out all words whose field difference is below γ, the new word set obtained is {building-owner, mew star people, tinkling of pieces of jade body}.
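The arithmetic of the worked example above can be reproduced with a few lines of Python; the numeric inputs are exactly the frequency values stated in the embodiment, so this is only a check of the numbers, not a full implementation.

import math

# Step 3: weights of the repeated pair, with alpha = 1.1.
dv_keep = math.log10((6 + 1) / (0 + 1))            # DV(mew star people)             = 0.845
dv_drop = math.log10((3 + 1) / (0 + 1))            # DV(the mew star people of love) = 0.602
cv_keep = min(6 / (8 * 6), 6 / (6 * 7))            # CV(mew star people)             = 0.125
v_keep = dv_keep * 1.1 ** 3 + cv_keep              # about 1.249 -> retained
v_drop = dv_drop * 1.1 ** 5 + 0.143                # about 1.113 -> deleted

# Step 4: combined score with a = 0.6, b = 0.4, c = -0.2.
dv_norm = (0.845 - 0.176) / (0.903 - 0.176)        # 0.920
cv_norm = (0.125 - 0.071) / (0.250 - 0.071)        # 0.302
nwp_norm = 0.0                                     # P(c_i) = 0 for every character
v_final = 0.6 * dv_norm + 0.4 * cv_norm - 0.2 * nwp_norm   # about 0.673 > gamma = 0.4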
Experimental results:
To verify the validity of the field difference based new word extraction method of the embodiment of the present invention, the experiment uses three days of Sina Weibo microblog posts (June 6-8), 10,237,813 posts in total, plus 3,524,584 posts from the Baidu "Li Yi the Great" forum, as the web corpus, and Xinhua News Agency news from all of 1993 to 2004, 9,517,292 sentences in total, as the news corpus. The existing new word extraction methods CV, NWP, EMI and PNWD are compared with the DV and DV+CV+NWP methods proposed by the present invention in terms of the number of new words recognized and the recognition accuracy; the comparison results are shown in Fig. 2.
CV and NWP are statistical new word extraction methods generally understood by those skilled in the art and are not described further here.
EMI: the Enhanced Mutual Information algorithm proposed by Zhang et al. in 2009, whose formula is:
EMI(W) = log( (F / N) / ∏_{i=1}^{n} ((F_i − F) / N) )
where the word W = w_1 w_2 … w_n, w_i is each character forming the word, n is the number of characters forming the word, F denotes the number of occurrences of the word W, F_i denotes the number of occurrences of the character w_i, and N denotes the size of the corpus. The idea of this algorithm is to measure the dependence of the word on each of its characters; the larger the value, the more likely the sequence forms a word.
PNWD: the pattern-based new word detection (Pattern New Word Detection) algorithm proposed by Huang et al. in 2014. The core idea of this algorithm is to use POS tagging information and a seed vocabulary to automatically select patterns that match phrase templates such as <ad, *, au>, and then to use these patterns to automatically extract newly appearing words.
As shown in Fig. 2, the x-axis represents the first k words and the y-axis represents the average precision AP(k) of the first k words. It can be seen from the figure that, compared with the benchmark methods EMI, CV and NWP, both DV and DV+CV+NWP achieve better results; compared with the benchmark PNWD, DV and DV+CV+NWP also perform better, while CV and NWP are slightly less accurate than PNWD when the result set is small and improve clearly again as the result set grows. This is because PNWD can only find neologisms of descriptive parts of speech and ignores neologisms of other parts of speech, so after it has efficiently recognized the descriptive neologisms, its recognition rate for neologisms of other parts of speech drops. DV achieves very good results mainly because the method makes full use of the difference between different fields, and neologisms embody this field difference well. The recognition accuracy of CV and NWP is slightly worse mainly because they judge 2-gram words poorly: a 2-gram word is split into two single characters, each of which occurs with high probability on its own, so the CV and NWP values of the 2-gram are extremely low and it is not easily recognized; since 2-gram words account for a large share of neologisms, these two methods are not ideal. DV+CV+NWP combines the advantages of the three methods DV, CV and NWP and obtains the best result. Therefore, compared with traditional methods, the field difference based new word extraction method proposed by the present invention can achieve higher accuracy and discover more neologisms.
The above shows and describes the basic principles, main features and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principles of the invention. Various changes and improvements can be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope of the invention, which is defined by the appended claims and their equivalents.

Claims (10)

1. A new word extraction method based on field difference, characterized in that it comprises the following steps:
Step 1: contrast the input corpus S_1 of the field from which new words are to be obtained with a corpus S_2 of another field to obtain field difference word seeds;
Step 2: expand the field difference word seeds to build the candidate word set Set_candidate;
Step 3: remove repeated words from the candidate word set Set_candidate according to the field difference of the candidate words;
Step 4: remove from Set_candidate the candidate words whose field difference is low, add the candidate words whose field difference is higher than a predetermined threshold γ to the new word set Y, and output it to obtain all new words.
2. The new word extraction method based on field difference according to claim 1, characterized in that the field difference word seeds are obtained by the following procedure:
(1) count, for each character "c", its occurrence frequency f_s1(c) in S_1 and its occurrence frequency f_s2(c) in S_2;
(2) compute the difference value of each character between S_1 and S_2 by the following formula:
D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
(3) set a threshold λ; if the difference value D_word_seg(c) of character "c" exceeds λ, take "c" as a difference word seed.
3. The new word extraction method based on field difference according to claim 1, characterized in that λ = 2.
4. The new word extraction method based on field difference according to claim 1, characterized in that the expansion of the field difference word seeds to build the candidate word set Set_candidate is performed in an n-gram manner, the detailed process being as follows:
(1) in corpus S_1, take n = 2, 3, 4 and 5 respectively and obtain all corresponding n-grams; keep an n-gram if it contains any difference word, count its occurrence frequency, and add it to the candidate word set Set_candidate;
(2) compare the frequency of every candidate word W in Set_candidate with a predetermined threshold; if its frequency is below the threshold, delete W from Set_candidate.
5. The new word extraction method based on field difference according to claim 1, characterized in that the field difference of the candidate word W is computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency with which the word W occurs in corpus S_1 and f_s2(W) denotes the frequency with which the word W occurs in corpus S_2.
6. The new word extraction method based on field difference according to any one of claims 1 to 4, characterized in that the removal of repeated words from the candidate word set Set_candidate according to the field difference is performed by the following steps:
(1) take n = 2, 3, 4 or 5 and compare all words in Set_candidate to find all repeated words, where n denotes the number of characters contained in a word of the set Set_candidate;
(2) for each repeated word found, take the cohesion degree CV and the field difference value DV into account, compute its weight V by the following formulas, keep the word with the larger weight and remove the word with the smaller weight, thereby achieving deduplication:
V(W) = α^n × DV(W) + CV(W);
CV(W) = min( f(W) / (f(c_1) × f(c_2)) );
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)));
where α is a parameter expressing the tolerance allowed for the difference between different n-grams, c_i denotes the i-th character or sub-word of the word W, and W = c_1 c_2;
(3) steps (1) and (2) are repeated until the candidate word set no longer contains repeated words.
7. The new word extraction method based on field difference according to claim 6, characterized in that α = 1.1.
8. The new word extraction method based on field difference according to any one of claims 1 to 5, characterized in that the "field difference" used in removing the candidate words with lower field difference from Set_candidate is the value (weight) V obtained by combining the field difference value DV, the word-formation probability NWP and the cohesion degree CV in a certain proportion, and is obtained by the following process:
(1) compute the difference value DV(W) of candidate word W according to the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
(2) compute the word-formation probability NWP(W) of candidate word W according to the following formulas:
NWP(W) = ∏_{i=1}^{n} P(c_i) / (1 − P(c_i))
P(c_i) = (f(c_i) − Single(c_i)) / f(c_i)
where f(c_i) denotes the occurrence frequency of the character c_i, and Single(c_i) denotes the frequency with which c_i occurs as a single character after a word segmentation tool has been applied;
(3) compute the cohesion degree CV(W) of candidate word W according to the following formula:
CV(W) = min( f(W) / (f(c_1) × f(c_2)) )
(4) normalize the difference value (DV), the word-formation probability (NWP) and the cohesion degree (CV) respectively, with the following normalization formula:
s_j = (X_j − X_min) / (X_max − X_min)
where X_j is the current value of the measure for the j-th word (difference value, word-formation probability or cohesion degree), X_min denotes the minimum of that measure over all words, and X_max denotes its maximum over all words;
(5) compute the weight V of candidate word W according to the following formula:
V(W) = a × DV(W) + b × CV(W) + c × NWP(W)
where a, b and c respectively denote the proportions of the difference value, the cohesion degree and the word-formation probability in the weight V.
9. The new word extraction method based on field difference according to claim 8, characterized in that a = 0.6, b = 0.4 and c = −0.2.
10. The new word extraction method based on field difference according to any one of claims 1 to 5, 7 or 9, characterized in that γ = 0.4.
CN201510711219.7A 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness Active CN105488098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510711219.7A CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510711219.7A CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Publications (2)

Publication Number Publication Date
CN105488098A true CN105488098A (en) 2016-04-13
CN105488098B CN105488098B (en) 2019-02-05

Family

ID=55675073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510711219.7A Active CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Country Status (1)

Country Link
CN (1) CN105488098B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN110472140A (en) * 2019-07-17 2019-11-19 腾讯科技(深圳)有限公司 Object words recommending method, device and electronic equipment
CN110634145A (en) * 2018-06-22 2019-12-31 青岛日日顺物流有限公司 Warehouse checking method based on image processing
CN112668331A (en) * 2021-03-18 2021-04-16 北京沃丰时代数据科技有限公司 Special word mining method and device, electronic equipment and storage medium
CN113051912A (en) * 2021-04-08 2021-06-29 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN101119334A (en) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system and equipment for obtaining neology
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN101119334A (en) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system and equipment for obtaining neology
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MINLIE HUANG et al.: "New Word Detection for Sentiment Analysis", 《PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
QIU L et al.: "A Method for Automatic POS Guessing of Chinese Unknown Words", 《PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS》 *
刘华: "一种快速获取领域新词语的新方法", 《中文信息学报》 *
张海军等: "中文新词识别技术综述", 《计算机科学》 *
杜聪慧: "面向互联网数据的新词发现平台的设计与实现", 《万方数据》 *
段宇锋等: "基于N-Gram的专业领域中文新词识别研究", 《现代图书情报技术》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN106126495B (en) * 2016-06-16 2019-03-12 北京捷通华声科技股份有限公司 One kind being based on large-scale corpus prompter method and apparatus
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN110634145A (en) * 2018-06-22 2019-12-31 青岛日日顺物流有限公司 Warehouse checking method based on image processing
CN110472140A (en) * 2019-07-17 2019-11-19 腾讯科技(深圳)有限公司 Object words recommending method, device and electronic equipment
CN110472140B (en) * 2019-07-17 2023-10-31 腾讯科技(深圳)有限公司 Object word recommendation method and device and electronic equipment
CN112668331A (en) * 2021-03-18 2021-04-16 北京沃丰时代数据科技有限公司 Special word mining method and device, electronic equipment and storage medium
CN113051912A (en) * 2021-04-08 2021-06-29 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate
CN113051912B (en) * 2021-04-08 2023-01-20 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate

Also Published As

Publication number Publication date
CN105488098B (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN105488098A (en) Field difference based new word extraction method
CN109815336B (en) Text aggregation method and system
CN106776713A (en) It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN107992542A (en) A kind of similar article based on topic model recommends method
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN105740229B (en) The method and device of keyword extraction
CN107480122A (en) A kind of artificial intelligence exchange method and artificial intelligence interactive device
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN106933972B (en) The method and device of data element are defined using natural language processing technique
CN108388554B (en) Text emotion recognition system based on collaborative filtering attention mechanism
CN106610955A (en) Dictionary-based multi-dimensional emotion analysis method
CN106599054A (en) Method and system for title classification and push
CN107180084A (en) Word library updating method and device
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN106227768B (en) A kind of short text opining mining method based on complementary corpus
CN110728144B (en) Extraction type document automatic summarization method based on context semantic perception
CN109858034A (en) A kind of text sentiment classification method based on attention model and sentiment dictionary
CN107463703A (en) English social media account number classification method based on information gain
CN109214445A (en) A kind of multi-tag classification method based on artificial intelligence
CN108875034A (en) A kind of Chinese Text Categorization based on stratification shot and long term memory network
CN107463715A (en) English social media account number classification method based on information gain
CN106681986A (en) Multi-dimensional sentiment analysis system
CN109614493A (en) A kind of text condensation recognition methods and system based on supervision term vector
CN109271513A (en) A kind of file classification method, computer-readable storage media and system
CN108319584A (en) A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant