CN105488098A - Field difference based new word extraction method - Google Patents

Field difference based new word extraction method

Info

Publication number
CN105488098A
Authority
CN
China
Prior art keywords
word
candidate
field
difference
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510711219.7A
Other languages
Chinese (zh)
Other versions
CN105488098B (en)
Inventor
史树敏
周新宇
黄河燕
史胜清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201510711219.7A priority Critical patent/CN105488098B/en
Publication of CN105488098A publication Critical patent/CN105488098A/en
Application granted granted Critical
Publication of CN105488098B publication Critical patent/CN105488098B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a field difference based new word extraction method and belongs to the technical field of natural language processing applications. The method comprises: first, comparing the difference in word distribution between different fields to obtain difference word seeds; second, expanding the difference word seeds in an n-gram manner to construct a candidate word set, and removing repeated words from the candidate word set according to the field difference value; and finally, for each word in the candidate word set, taking the field difference value, the cohesion degree and the word-formation probability as measures and removing candidate words with relatively low field difference to obtain the new words. Compared with the prior art, the method selects seed words using the difference information between corpora of different fields and expands the seed words through n-grams to obtain the candidate word set; the new words among the candidates are then selected automatically using the information of the words themselves together with the field difference information, so that both the number of new words discovered and the accuracy of new word discovery are remarkably improved.

Description

A new word extraction method based on field difference
Technical field
The present invention relates to a method of new word extraction, and in particular to a new word extraction method based on field difference, belonging to the technical field of natural language processing applications.
Background technology
Network neologisms are special expressions or words that appear and become popular along with the Internet. They usually derive from popular film, television and Internet expressions, or are coined and widely accepted in response to some social phenomenon. Network neologisms occur frequently in web texts such as forum posts and microblogs. Statistics show that more than 1000 neologisms enter daily use in China every year. According to related research, more than 60% of word segmentation errors are caused by network neologisms, and the accuracy of new word recognition directly affects the performance of intelligent information processing systems. For example, in the text sentiment analysis task of intelligent information processing, fixed phrase collocations embody sentiment polarity, and if a neologism phrase cannot be recognized correctly, the judged sentiment polarity is distorted. Take the web review of a product "表达十分高大上" (roughly "the expression is very gao-da-shang"): here "高大上" should actually be treated as a network neologism that as a whole expresses the positive sentiment of "高端大气上档次" ("high-end, stylish, a cut above"). However, in almost all current application systems the annotated sequence produced by word segmentation is "表达/v 十分/adv 高/adj 大/adj 上/adv", that is, the neologism is cut into single characters; this erroneous segmentation loses the positive sentiment of the sentence and seriously harms subsequent intelligent analysis. Therefore, effective recognition of neologisms is of great importance in the field of natural language processing.
At present, new word extraction methods fall into two classes: rule-based methods and statistics-based methods. The main idea of rule-based methods is to start from the word-formation principles of neologisms, use them as the theoretical basis for building resources that help recognize new words, then study the linguistic properties of words and build a dedicated word-formation rule base from the natural attributes of words. Rule-based methods recognize neologisms with high accuracy, but they require very strong linguistic skill and relevant domain background. Statistics-based methods realize new word recognition mainly in two ways. One treats new word extraction as an indispensable part of word segmentation and infers the most probable segmentation with a statistical model, thereby obtaining the new words; classical statistical models include Conditional Random Fields (CRF) and gradient-descent-trained models based on frequency features. The other treats new word extraction as an independent task and usually requires part-of-speech (POS) tagging as preprocessing. Because network neologisms are produced in real time, spread quickly and change dynamically, purely rule-based methods are often ineffective, while purely statistical methods suffer from sparse training data and the difficulty of extracting effective features. Most current researchers therefore combine rules and statistics to exploit their respective advantages, but these methods all ignore the informational characteristics of the corpora themselves, namely the difference in meaning (intension) of the same word across different domain topics, which manifests itself as different distributions of the same word under different domain topics.
Summary of the invention
Aimed at the neologisms that are continuously created and used on the Internet, the present invention proposes a new word extraction method based on field difference. The method makes full use of the characteristics of corpora from different fields and, under the existing general evaluation systems, effectively improves the accuracy of new word recognition.
The idea of the present invention is to obtain difference word seeds by comparing the difference in word distribution between different fields, to expand the difference words in an n-gram manner to build a candidate word set, and then, for each word in the candidate word set, to take the field difference value, the cohesion degree and the word-formation probability as criteria and further extract the new words.
The definitions used in the present invention are as follows:
Definition 1: field difference word, a single character that embodies field difference. Such a character reflects domain characteristics, and its occurrence frequency differs greatly between corpora of different fields. For example, if the ratio of the occurrence frequency f_internet(c) of a character c in the web corpus to its occurrence frequency f_news(c) in the news corpus exceeds a threshold λ, then c is called a field difference word. Because single characters combine to form words, the present invention also assumes that if a character exhibits such a difference, the words formed from it exhibit a corresponding difference in word distribution.
Definition 2: repeated words. If a word W_a and a word W_b satisfy the condition that one contains the other as a substring, W_b and W_a are called repeated words of each other. For example, "liking general greatly running quickly" (W_a) and "general greatly run quickly" (W_b).
Definition 3: field difference value DV (Difference Value), the measure of field difference. It is calculated from the occurrence frequency f_internet(W) of a word W in the web corpus and its occurrence frequency f_news(W) in the news corpus, where f_internet(W) denotes the occurrence frequency of word W in the web corpus and f_news(W) denotes its occurrence frequency in the news corpus.
Definition 4: cohesion degree CV (Concrete Value), a quantitative index of whether a word is correctly segmented. For example, "cinema" (电影院) can be glued together in two ways: "film" (电影) + "institute" (院) and "electricity" (电) + "movie theatre" (影院). For any word W = c_1 c_2 (where c_1 and c_2 denote the characters or sub-words composing the word), all possible ways of gluing it together are enumerated, the corresponding value is computed for each, and the minimum is taken as the cohesion degree of the word.
Definition 5: word-formation probability NWP (New Word Probability), an index for judging whether a character sequence forms a word. For example, "love to say" and "love to eat" are composed of single characters, but their NWP is very low, meaning that neither forms a word.
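For illustration only, the three measures of Definitions 3 to 5 can be computed as in the following Python sketch; the helper names (dv, cv, nwp) and the dictionary-based frequency tables are assumptions of this sketch, not part of the claimed method, and the base-10 logarithm is chosen because it reproduces the numeric values given in the embodiment below.

import math

def dv(f_s1, f_s2):
    # Definition 3: field difference value, DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W))).
    return math.log10((1.0 + f_s1) / (1.0 + f_s2))

def cv(word, freq):
    # Definition 4: cohesion degree, the minimum of f(W) / (f(c1) * f(c2)) over every
    # split of the word into two parts; `freq` is assumed to hold the corpus
    # frequency of the word and of all of its substrings.
    return min(freq[word] / (freq[word[:i]] * freq[word[i:]])
               for i in range(1, len(word)))

def nwp(word, freq, single):
    # Definition 5: word-formation probability, the product over the characters of
    # P(c) / (1 - P(c)) with P(c) = (f(c) - Single(c)) / f(c), where Single(c)
    # counts how often c is left as a single character by a word segmenter.
    prod = 1.0
    for c in word:
        p = (freq[c] - single[c]) / freq[c]
        if p >= 1.0:
            return float('inf')   # character never occurs alone after segmentation
        prod *= p / (1.0 - p)
    return prod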
The object of the present invention is achieved by the following steps:
A new word extraction method based on field difference comprises the following steps:
Step 1: contrast the input corpus S_1 of the field from which new words are to be obtained with a corpus S_2 of another field to obtain field difference word seeds.
Preferably, the field difference word seeds are obtained by the following steps (an illustrative sketch follows this list):
(1) count, for each character "c", its occurrence frequency f_s1(c) in S_1 and its occurrence frequency f_s2(c) in S_2;
(2) compute the difference value of each character between S_1 and S_2 by the following formula:
D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
(3) set a threshold λ; if the difference value D_word_seg(c) of character "c" exceeds λ, take "c" as a difference word seed.
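A minimal sketch of this seed-selection step, assuming the corpora are given as collections of sentence strings; the function name and the use of raw character counts follow the embodiment and are not a prescribed implementation.

from collections import Counter

def difference_word_seeds(corpus_s1, corpus_s2, lam=2.0):
    # corpus_s1: sentences of the field from which new words are sought (e.g. web text);
    # corpus_s2: sentences of the contrast field (e.g. news); lam is the threshold λ.
    f_s1 = Counter(ch for sentence in corpus_s1 for ch in sentence)
    f_s2 = Counter(ch for sentence in corpus_s2 for ch in sentence)
    seeds = set()
    for ch, f1 in f_s1.items():
        d = f1 / (1.0 + f_s2[ch])       # D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
        if d >= lam:                    # the embodiment keeps characters with d >= λ
            seeds.add(ch)
    return seeds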
Step 2: expand the field difference word seeds to build the candidate word set Set_candidate.
Preferably, the expansion is performed in an n-gram manner by the following steps (an illustrative sketch follows this list):
(1) in corpus S_1, take n = 2, 3, 4 and 5 respectively and obtain all corresponding n-grams; keep an n-gram if it contains any difference word, count its occurrence frequency, and add it to the candidate word set Set_candidate;
(2) compare the frequency of every candidate word W in Set_candidate with a predetermined threshold; if its frequency is below the threshold, delete W from Set_candidate.
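The n-gram expansion of step 2 can be sketched as follows; min_freq stands in for the unspecified frequency threshold of sub-step (2) and is an assumption of this sketch.

from collections import Counter

def build_candidates(corpus_s1, seeds, min_freq=2, n_values=(2, 3, 4, 5)):
    # Enumerate all n-grams of the target-field corpus, keep those that contain
    # at least one difference word seed, then drop candidates below min_freq.
    ngram_freq = Counter()
    for sentence in corpus_s1:
        for n in n_values:
            for i in range(len(sentence) - n + 1):
                gram = sentence[i:i + n]
                if any(ch in seeds for ch in gram):
                    ngram_freq[gram] += 1
    return {w: f for w, f in ngram_freq.items() if f >= min_freq}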
Step 3: remove repeated words from the candidate word set Set_candidate according to the field difference of the candidate words.
Preferably, the field difference of a candidate word W can be computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency with which W occurs in corpus S_1 and f_s2(W) denotes the frequency with which W occurs in corpus S_2.
Further, in order to obtain a better deduplication effect, the field difference of repeated words can take the cohesion degree into account together with the field difference value. That is, according to Definition 2, all repeated words in the candidate word set Set_candidate are found; the repeated words are compared pairwise, the one with the larger weight is kept and the one with the smaller weight is discarded; this process is repeated until Set_candidate no longer contains repeated words. The detailed process is as follows:
(1) according to Definition 2, take n = 2, 3, 4 and 5 and compare all words in Set_candidate to find all repeated words, where n denotes the number of characters contained in a word of the set Set_candidate;
(2) compute the cohesion degree CV(W) according to Definition 4 and the field difference value DV(W) according to Definition 3 for each repeated word, with the following formulas:
Cohesion degree:
CV(W) = min( f(W) / (f(c_1) × f(c_2)) )
where the minimum is taken over all ways of splitting W into two parts W = c_1 c_2;
Field difference value:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
Further, the weights V defined by the formula below are compared pairwise between repeated words, and the word with the larger weight is kept:
V(W) = α^n × DV(W) + CV(W)
Set_candidate = Set_candidate − argmin_{W ∈ {W_1, W_2}} V(W)
where α is a parameter expressing the tolerance allowed for the difference between different n-grams, n denotes the number of characters in word W, c_i denotes the i-th character or sub-word of W, and W_1 and W_2 are two words that are repeated words of each other;
(3) steps (1) and (2) are repeated until the candidate word set no longer contains repeated words (an illustrative sketch of this loop follows).
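The deduplication loop of step 3 can be sketched as below, reusing the dv and cv helpers from the definitions; treating "repeated words" as substring containment is inferred from the examples in the embodiment, and the quadratic pairwise scan is only for illustration.

def remove_repeated_words(candidates, f_s1, f_s2, alpha=1.1):
    # candidates: dict {candidate word: frequency in S1}.
    # f_s1 / f_s2: frequency tables (e.g. Counters) over S1 / S2 that also cover
    # the characters and substrings of every candidate, as needed by cv().
    # For each repeated pair, the member with the smaller weight
    # V(W) = alpha**n * DV(W) + CV(W) (n = number of characters of W) is deleted,
    # and the scan repeats until no repeated pair remains.
    def weight(w):
        return alpha ** len(w) * dv(f_s1[w], f_s2[w]) + cv(w, f_s1)

    changed = True
    while changed:
        changed = False
        words = list(candidates)
        for w1 in words:
            for w2 in words:
                if (w1 != w2 and w1 in candidates and w2 in candidates
                        and w1 in w2):               # w1 is contained in w2
                    loser = min((w1, w2), key=weight)
                    del candidates[loser]
                    changed = True
    return candidates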
Step 4: remove from Set_candidate the candidate words whose field difference is low; add the candidate words whose field difference is higher than a predetermined threshold γ to the new word set Y and output it to obtain all new words.
Preferably, the field difference of a candidate word W can be computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency with which W occurs in corpus S_1 and f_s2(W) denotes the frequency with which W occurs in corpus S_2.
Further, the field difference can be obtained as follows: for each candidate word in the candidate word set Set_candidate, compute its field difference value (DV), cohesion degree (CV) and word-formation probability (NWP) according to Definitions 3, 4 and 5 respectively, and combine them in a certain proportion into a single characterization, specifically as follows (an illustrative sketch follows this list):
(1) compute the difference value DV(W) of candidate word W according to the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
(2) compute the word-formation probability NWP(W) of candidate word W according to the following formulas:
NWP(W) = ∏_{i=1}^{n} P(c_i) / (1 − P(c_i))
P(c_i) = (f(c_i) − Single(c_i)) / f(c_i)
where f(c_i) denotes the occurrence frequency of character c_i of W, and Single(c_i) denotes the frequency with which c_i still occurs as a single character after a word segmentation tool has been applied;
(3) compute the cohesion degree CV(W) of candidate word W according to the following formula:
CV(W) = min( f(W) / (f(c_1) × f(c_2)) )
where the minimum is taken over all ways of splitting W into two parts W = c_1 c_2;
(4) normalize the difference value (DV), the word-formation probability (NWP) and the cohesion degree (CV) respectively, with the following normalization formula:
s_j = (X_j − X_min) / (X_max − X_min)
where X_j is the current value of the measure for the j-th word (difference value, word-formation probability or cohesion degree), X_min denotes the minimum of that measure over all words, and X_max denotes its maximum over all words;
(5) compute the weight V of candidate word W according to the following formula:
V(W) = a × DV(W) + b × CV(W) + c × NWP(W)
where a, b and c respectively denote the proportions of the difference value, the cohesion degree and the word-formation probability in the weight V.
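A sketch of the final scoring of step 4, again reusing the dv, cv and nwp helpers; the default parameter values are those used in the embodiment, and the dictionary inputs are assumptions of this sketch rather than a prescribed interface.

def select_new_words(candidates, f_s1, f_s2, single,
                     a=0.6, b=0.4, c=-0.2, gamma=0.4):
    # For each remaining candidate compute DV, CV and NWP, min-max normalise each
    # measure over the candidate set, combine them as
    # V(W) = a*DV'(W) + b*CV'(W) + c*NWP'(W), and keep candidates with V > gamma.
    raw = {w: (dv(f_s1[w], f_s2[w]), cv(w, f_s1), nwp(w, f_s1, single))
           for w in candidates}

    def normaliser(values):
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0          # avoid dividing by zero if all values are equal
        return lambda x: (x - lo) / span

    norms = [normaliser([v[k] for v in raw.values()]) for k in range(3)]
    return {w for w, (d, coh, prob) in raw.items()
            if a * norms[0](d) + b * norms[1](coh) + c * norms[2](prob) > gamma}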
Beneficial effect
Compared with the prior art, the present invention selects seed words by using the difference information between corpora of different fields and expands them by n-grams to obtain the candidate word set; it then automatically selects the new words among the candidate words by using the information of the words themselves together with the field difference information, thereby significantly increasing the number of new words discovered and significantly improving the accuracy of new word discovery.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the field difference based new word extraction method of an embodiment of the present invention;
Fig. 2 is a schematic diagram comparing the method of the present invention with four existing new word extraction methods in terms of the number of new words recognized and the recognition accuracy.
Embodiment
The method of the present invention is described in further detail below in conjunction with the accompanying drawings and an embodiment.
Embodiment
In the present embodiment, a web corpus is used as S_1 and a news corpus as S_2, and the method of the present invention is described in detail by way of example.
The web corpus is a selection of forum posts, as shown in Table 1:
Table 1:
The news corpus is a selection of news text from April 4, 2001, as shown in Table 2:
Table 2:
The field difference based new word extraction method, whose processing flow is shown in Fig. 1, comprises the following steps:
Step 1: obtain the field difference word seeds.
A field difference word is a character that occurs noticeably more often in one corpus than in the other. There are many ways of obtaining field difference words; this embodiment simply decides whether a character can be used as a field difference word seed by whether its difference value between the two corpora is higher than a predetermined threshold, specifically as follows:
Count the frequency with which each character occurs in the web corpus and the frequency with which it occurs in the news corpus; then compute the difference value between the two; finally set the threshold λ to 2 and take every character whose difference value is greater than or equal to λ as a difference word. The set of difference words obtained is shown in Table 3:
Table 3:
Step 2: expand the difference word seeds to obtain the candidate word set.
There are many ways of expanding the difference words into candidate words, for example by a dictionary or in an n-gram manner. The n-gram manner is adopted in this embodiment, specifically as follows: in the web corpus, take n = 2, 3, 4 or 5 respectively and obtain all n-gram strings; keep an n-gram if it contains any difference word, and delete it if it is a meaningless string. For example, from "good beautiful mew star people" the following n-grams can be extracted:
2-gram: {"good drift", "beautiful", "bright", "mew", "mew star", "star people"},
3-gram: {"good beautiful", "beautiful", "bright mew", "mew star", "mew star people"},
4-gram: {"good beautiful", "beautiful mew", "bright mew star", "mew star people"}, and 5-gram: {"good beautiful mew", "beautiful mew star", "bright mew star people"}.
Then the frequency of each of these n-grams is counted; a threshold is set, and a word W is selected as a candidate word when its frequency f(W) exceeds the threshold and it contains any of the above difference words. The candidate word set finally obtained is shown in Table 4:
Table 4:
Step 3: remove repeated words.
First, according to Definition 2, all repeated words in the candidate word set Set_candidate are found. The repeated word pairs found for "mew star people" are: {mew star, mew star people}, {star people, mew star people}, {mew star people, mew star people}, {mew star people, the mew star people of love}.
Secondly, within each pair of repeated words, the candidate word with the larger field difference is retained. Here the field difference could simply be characterized by the frequencies with which the candidate word occurs in the two corpora; to overcome the influence of the differing corpus sizes that a simple frequency difference would introduce, this embodiment characterizes it by the logarithm of the ratio of the two frequencies, as shown in the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
Further, experiments have shown that a better deduplication effect is obtained if the field difference takes into account not only the field difference value DV of the formula above but also the cohesion degree CV, that is, if the field difference is given by the weight that combines the two as in the following formula:
V(W) = α^n × DV(W) + CV(W)
Therefore, the cohesion degree and difference value of each of the above words are computed according to Definitions 3 and 4. Taking {mew star people, the mew star people of love} as an example of removing repeated words: the frequency of "mew star people" is 6, the frequency of "the mew star people of love" is 3, and the frequency of each in the news corpus is 0, so:
DV(mew star people) = log((6+1)/(0+1)) = 0.845
DV(the mew star people of love) = log((3+1)/(0+1)) = 0.602
CV(mew star people): the word has two ways of being glued together, "mew"+"star people" and "mew star"+"people", whose cohesion values are respectively
CV("mew"+"star people") = 6/(8×6) = 0.125
CV("mew star"+"people") = 6/(6×7) = 0.143.
The smaller value is taken as the cohesion degree of the word "mew star people":
CV(mew star people) = 0.125
Similarly, CV(the mew star people of love) has four ways of being glued together: "love"+"mew star people", "love"+"mew star people", "mew of love"+"star people" and "the mew star of love"+"people".
Their cohesion values are respectively:
CV("love"+"mew star people") = 3/(4×4) = 0.188
CV("love"+"mew star people") = 3/(3×6) = 0.167
CV("mew of love"+"star people") = 3/(3×6) = 0.167
CV("the mew star of love"+"people") = 3/(3×7) = 0.143
The smallest value is taken as the cohesion degree of the word "the mew star people of love":
CV(the mew star people of love) = 0.143
Taking the parameter α = 1.1:
V(mew star people) = 0.845 × 1.1^3 + 0.125 = 1.249
V(the mew star people of love) = 0.602 × 1.1^5 + 0.143 = 1.113
Therefore, in this deduplication step "mew star people" is retained and "the mew star people of love" is deleted. Step 3 is performed on all repeated words in Set_candidate until no repeated words remain. The candidate words finally determined are shown in Table 5:
Table 5:
Step 4: screen the candidate words according to the field difference, obtain the new word set and output it.
As in Step 3, the field difference could be characterized by taking the logarithm of the ratio of the candidate word's frequencies in the different corpora; experiments have shown, however, that a better effect is obtained if the field difference combines the field difference value DV, the word-formation probability NWP and the cohesion degree CV in a certain proportion, as in the following formula:
V(W) = a × DV(W) + b × CV(W) + c × NWP(W)
For each candidate word in the candidate word set Set_candidate, compute its field difference value, cohesion degree and word-formation probability according to Definitions 3, 4 and 5 respectively:
Taking the word "mew star people" as an example again:
Difference value: DV(mew star people) = log((6+1)/(0+1)) = 0.845
Cohesion degree: CV(mew star people) = 6/(8×6) = 0.125 (taking the split "mew"+"star people", which gives the minimum)
Word-formation probability:
In this embodiment the ICTCLAS word segmentation tool is used; after segmenting the text above, Single(mew) = 8, Single(star) = 6 and Single(people) = 7 are obtained; furthermore f(mew) = 8, f(star) = 6, f(people) = 7 and f(mew star people) = 6; therefore P(c_i) = 0 for every character and NWP(mew star people) = 0.
Further, to obtain a better extraction effect, the three values above need to be normalized before they are combined into the field difference weight.
Among the 7 words shown in Table 5, the maxima and minima of the three measures are respectively:
DV_max = 0.903; DV_min = 0.176;
CV_max = 0.25; CV_min = 0.071;
NWP_max = 1; NWP_min = 0;
After normalization, the three values of "mew star people" are respectively:
DV' = (0.845 − 0.176)/(0.903 − 0.176) = 0.920, CV' = (0.125 − 0.071)/(0.25 − 0.071) = 0.302, NWP' = 0.
Taking a = 0.6, b = 0.4 and c = −0.2:
V(mew star people) = 0.6 × 0.920 + 0.4 × 0.302 − 0.2 × 0 = 0.6728
The field differences of all the words in Table 5 obtained in this way are shown in Table 6:
Table 6:
Taking the threshold γ = 0.4 and filtering out all words whose field difference is below γ, the new word set obtained is {building-owner, mew star people, tinkling of pieces of jade body}.
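The arithmetic of the worked example above can be reproduced with a few lines of Python; the numeric inputs are exactly the frequency values stated in the embodiment, so this is only a check of the numbers, not a full implementation.

import math

# Step 3: weights of the repeated pair, with alpha = 1.1.
dv_keep = math.log10((6 + 1) / (0 + 1))            # DV(mew star people)             = 0.845
dv_drop = math.log10((3 + 1) / (0 + 1))            # DV(the mew star people of love) = 0.602
cv_keep = min(6 / (8 * 6), 6 / (6 * 7))            # CV(mew star people)             = 0.125
v_keep = dv_keep * 1.1 ** 3 + cv_keep              # about 1.249 -> retained
v_drop = dv_drop * 1.1 ** 5 + 0.143                # about 1.113 -> deleted

# Step 4: combined score with a = 0.6, b = 0.4, c = -0.2.
dv_norm = (0.845 - 0.176) / (0.903 - 0.176)        # 0.920
cv_norm = (0.125 - 0.071) / (0.250 - 0.071)        # 0.302
nwp_norm = 0.0                                     # P(c_i) = 0 for every character
v_final = 0.6 * dv_norm + 0.4 * cv_norm - 0.2 * nwp_norm   # about 0.673 > gamma = 0.4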
Experimental results:
To verify the validity of the field difference based new word extraction method of the embodiment of the present invention, the experiment uses three days of Sina Weibo microblog posts (June 6-8), 10,237,813 posts in total, plus 3,524,584 posts from the Baidu "Li Yi the Great" forum, as the web corpus, and Xinhua News Agency news from all of 1993 to 2004, 9,517,292 sentences in total, as the news corpus. The existing new word extraction methods CV, NWP, EMI and PNWD are compared with the DV and DV+CV+NWP methods proposed by the present invention in terms of the number of new words recognized and the recognition accuracy; the comparison results are shown in Fig. 2.
CV and NWP are statistical new word extraction methods generally understood by those skilled in the art and are not described further here.
EMI: the Enhanced Mutual Information algorithm proposed by Zhang et al. in 2009, whose formula is:
EMI(W) = log( (F / N) / ∏_{i=1}^{n} ((F_i − F) / N) )
where the word W = w_1 w_2 … w_n, w_i is each character forming the word, n is the number of characters forming the word, F denotes the number of occurrences of the word W, F_i denotes the number of occurrences of the character w_i, and N denotes the size of the corpus. The idea of this algorithm is to measure the dependence of the word on each of its characters; the larger the value, the more likely the sequence forms a word.
PNWD: the pattern-based new word detection (Pattern New Word Detection) algorithm proposed by Huang et al. in 2014. The core idea of this algorithm is to use POS tagging information and a seed vocabulary to automatically select patterns that match phrase templates such as <ad, *, au>, and then to use these patterns to automatically extract newly appearing words.
As shown in Fig. 2, the x-axis represents the first k words and the y-axis represents the average precision AP(k) of the first k words. It can be seen from the figure that, compared with the benchmark methods EMI, CV and NWP, both DV and DV+CV+NWP achieve better results; compared with the benchmark PNWD, DV and DV+CV+NWP also perform better, while CV and NWP are slightly less accurate than PNWD when the result set is small and improve clearly again as the result set grows. This is because PNWD can only find neologisms of descriptive parts of speech and ignores neologisms of other parts of speech, so after it has efficiently recognized the descriptive neologisms, its recognition rate for neologisms of other parts of speech drops. DV achieves very good results mainly because the method makes full use of the difference between different fields, and neologisms embody this field difference well. The recognition accuracy of CV and NWP is slightly worse mainly because they judge 2-gram words poorly: a 2-gram word is split into two single characters, each of which occurs with high probability on its own, so the CV and NWP values of the 2-gram are extremely low and it is not easily recognized; since 2-gram words account for a large share of neologisms, these two methods are not ideal. DV+CV+NWP combines the advantages of the three methods DV, CV and NWP and obtains the best result. Therefore, compared with traditional methods, the field difference based new word extraction method proposed by the present invention can achieve higher accuracy and discover more neologisms.
The above shows and describes the basic principles, main features and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principles of the invention. Various changes and improvements can be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the claimed scope of the invention, which is defined by the appended claims and their equivalents.

Claims (10)

1. A new word extraction method based on field difference, characterized in that it comprises the following steps:
Step 1: contrast the input corpus S_1 of the field from which new words are to be obtained with a corpus S_2 of another field to obtain field difference word seeds;
Step 2: expand the field difference word seeds to build the candidate word set Set_candidate;
Step 3: remove repeated words from the candidate word set Set_candidate according to the field difference of the candidate words;
Step 4: remove from Set_candidate the candidate words whose field difference is low, add the candidate words whose field difference is higher than a predetermined threshold γ to the new word set Y, and output it to obtain all new words.
2. The new word extraction method based on field difference according to claim 1, characterized in that the field difference word seeds are obtained by the following procedure:
(1) count, for each character "c", its occurrence frequency f_s1(c) in S_1 and its occurrence frequency f_s2(c) in S_2;
(2) compute the difference value of each character between S_1 and S_2 by the following formula:
D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
(3) set a threshold λ; if the difference value D_word_seg(c) of character "c" exceeds λ, take "c" as a difference word seed.
3. The new word extraction method based on field difference according to claim 1, characterized in that λ = 2.
4. The new word extraction method based on field difference according to claim 1, characterized in that the expansion of the field difference word seeds to build the candidate word set Set_candidate is performed in an n-gram manner, the detailed process being as follows:
(1) in corpus S_1, take n = 2, 3, 4 and 5 respectively and obtain all corresponding n-grams; keep an n-gram if it contains any difference word, count its occurrence frequency, and add it to the candidate word set Set_candidate;
(2) compare the frequency of every candidate word W in Set_candidate with a predetermined threshold; if its frequency is below the threshold, delete W from Set_candidate.
5. The new word extraction method based on field difference according to claim 1, characterized in that the field difference of the candidate word W is computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency with which the word W occurs in corpus S_1 and f_s2(W) denotes the frequency with which the word W occurs in corpus S_2.
6. The new word extraction method based on field difference according to any one of claims 1 to 4, characterized in that the removal of repeated words from the candidate word set Set_candidate according to the field difference is performed by the following steps:
(1) take n = 2, 3, 4 or 5 and compare all words in Set_candidate to find all repeated words, where n denotes the number of characters contained in a word of the set Set_candidate;
(2) for each repeated word found, take the cohesion degree CV and the field difference value DV into account, compute its weight V by the following formulas, keep the word with the larger weight and remove the word with the smaller weight, thereby achieving deduplication:
V(W) = α^n × DV(W) + CV(W);
CV(W) = min( f(W) / (f(c_1) × f(c_2)) );
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)));
where α is a parameter expressing the tolerance allowed for the difference between different n-grams, c_i denotes the i-th character or sub-word of the word W, and W = c_1 c_2;
(3) steps (1) and (2) are repeated until the candidate word set no longer contains repeated words.
7. The new word extraction method based on field difference according to claim 6, characterized in that α = 1.1.
8. The new word extraction method based on field difference according to any one of claims 1 to 5, characterized in that the "field difference" used in removing the candidate words with lower field difference from Set_candidate is the value (weight) V obtained by combining the field difference value DV, the word-formation probability NWP and the cohesion degree CV in a certain proportion, and is obtained by the following process:
(1) compute the difference value DV(W) of candidate word W according to the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
(2) compute the word-formation probability NWP(W) of candidate word W according to the following formulas:
NWP(W) = ∏_{i=1}^{n} P(c_i) / (1 − P(c_i))
P(c_i) = (f(c_i) − Single(c_i)) / f(c_i)
where f(c_i) denotes the occurrence frequency of the character c_i, and Single(c_i) denotes the frequency with which c_i occurs as a single character after a word segmentation tool has been applied;
(3) compute the cohesion degree CV(W) of candidate word W according to the following formula:
CV(W) = min( f(W) / (f(c_1) × f(c_2)) )
(4) normalize the difference value (DV), the word-formation probability (NWP) and the cohesion degree (CV) respectively, with the following normalization formula:
s_j = (X_j − X_min) / (X_max − X_min)
where X_j is the current value of the measure for the j-th word (difference value, word-formation probability or cohesion degree), X_min denotes the minimum of that measure over all words, and X_max denotes its maximum over all words;
(5) compute the weight V of candidate word W according to the following formula:
V(W) = a × DV(W) + b × CV(W) + c × NWP(W)
where a, b and c respectively denote the proportions of the difference value, the cohesion degree and the word-formation probability in the weight V.
9. The new word extraction method based on field difference according to claim 8, characterized in that a = 0.6, b = 0.4 and c = −0.2.
10. The new word extraction method based on field difference according to any one of claims 1 to 5, 7 or 9, characterized in that γ = 0.4.
CN201510711219.7A 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness Active CN105488098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510711219.7A CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510711219.7A CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Publications (2)

Publication Number Publication Date
CN105488098A true CN105488098A (en) 2016-04-13
CN105488098B CN105488098B (en) 2019-02-05

Family

ID=55675073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510711219.7A Active CN105488098B (en) 2015-10-28 2015-10-28 A kind of new words extraction method based on field otherness

Country Status (1)

Country Link
CN (1) CN105488098B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN110472140A (en) * 2019-07-17 2019-11-19 腾讯科技(深圳)有限公司 Object words recommending method, device and electronic equipment
CN110634145A (en) * 2018-06-22 2019-12-31 青岛日日顺物流有限公司 Warehouse checking method based on image processing
CN112668331A (en) * 2021-03-18 2021-04-16 北京沃丰时代数据科技有限公司 Special word mining method and device, electronic equipment and storage medium
CN113051912A (en) * 2021-04-08 2021-06-29 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN101119334A (en) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system and equipment for obtaining neology
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1340804A (en) * 2000-08-30 2002-03-20 国际商业机器公司 Automatic new term fetch method and system
CN101119334A (en) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system and equipment for obtaining neology
CN102708147A (en) * 2012-03-26 2012-10-03 北京新发智信科技有限责任公司 Recognition method for new words of scientific and technical terminology
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MINLIE HUANG et al.: "New Word Detection for Sentiment Analysis", 《PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
QIU L et al.: "A Method for Automatic POS Guessing of Chinese Unknown Words", 《PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS》 *
刘华: "一种快速获取领域新词语的新方法", 《中文信息学报》 *
张海军等: "中文新词识别技术综述", 《计算机科学》 *
杜聪慧: "面向互联网数据的新词发现平台的设计与实现", 《万方数据》 *
段宇锋等: "基于N-Gram的专业领域中文新词识别研究", 《现代图书情报技术》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126495A (en) * 2016-06-16 2016-11-16 北京捷通华声科技股份有限公司 A kind of based on large-scale corpus prompter method and apparatus
CN106126495B (en) * 2016-06-16 2019-03-12 北京捷通华声科技股份有限公司 One kind being based on large-scale corpus prompter method and apparatus
CN108845982A (en) * 2017-12-08 2018-11-20 昆明理工大学 A kind of Chinese word cutting method of word-based linked character
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN110634145A (en) * 2018-06-22 2019-12-31 青岛日日顺物流有限公司 Warehouse checking method based on image processing
CN110472140A (en) * 2019-07-17 2019-11-19 腾讯科技(深圳)有限公司 Object words recommending method, device and electronic equipment
CN110472140B (en) * 2019-07-17 2023-10-31 腾讯科技(深圳)有限公司 Object word recommendation method and device and electronic equipment
CN112668331A (en) * 2021-03-18 2021-04-16 北京沃丰时代数据科技有限公司 Special word mining method and device, electronic equipment and storage medium
CN113051912A (en) * 2021-04-08 2021-06-29 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate
CN113051912B (en) * 2021-04-08 2023-01-20 云南电网有限责任公司电力科学研究院 Domain word recognition method and device based on word forming rate

Also Published As

Publication number Publication date
CN105488098B (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN105488098A (en) Field difference based new word extraction method
CN109815336B (en) Text aggregation method and system
CN106776713A (en) It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN107992542A (en) A kind of similar article based on topic model recommends method
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN105740229B (en) The method and device of keyword extraction
CN107480122A (en) A kind of artificial intelligence exchange method and artificial intelligence interactive device
CN112989802B (en) Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium
CN106933972B (en) The method and device of data element are defined using natural language processing technique
CN108388554B (en) Text emotion recognition system based on collaborative filtering attention mechanism
CN106610955A (en) Dictionary-based multi-dimensional emotion analysis method
CN106599054A (en) Method and system for title classification and push
CN107180084A (en) Word library updating method and device
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN106227768B (en) A kind of short text opining mining method based on complementary corpus
CN110728144B (en) Extraction type document automatic summarization method based on context semantic perception
CN109858034A (en) A kind of text sentiment classification method based on attention model and sentiment dictionary
CN107463703A (en) English social media account number classification method based on information gain
CN109214445A (en) A kind of multi-tag classification method based on artificial intelligence
CN108875034A (en) A kind of Chinese Text Categorization based on stratification shot and long term memory network
CN107463715A (en) English social media account number classification method based on information gain
CN106681986A (en) Multi-dimensional sentiment analysis system
CN109614493A (en) A kind of text condensation recognition methods and system based on supervision term vector
CN109271513A (en) A kind of file classification method, computer-readable storage media and system
CN108319584A (en) A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant