CN105488098A - Field difference based new word extraction method - Google Patents
- Publication number
- CN105488098A CN105488098A CN201510711219.7A CN201510711219A CN105488098A CN 105488098 A CN105488098 A CN 105488098A CN 201510711219 A CN201510711219 A CN 201510711219A CN 105488098 A CN105488098 A CN 105488098A
- Authority
- CN
- China
- Prior art keywords
- word
- candidate
- field
- difference
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a field-difference-based new word extraction method and belongs to the technical field of natural language processing applications. The method comprises: first, comparing differences in word distribution between fields to obtain difference character seeds; second, expanding the seeds in an n-gram manner to construct a candidate word set, and removing repeated words from the candidate set according to their field difference values; and finally, for each word in the candidate set, taking the field difference value, the coagulation degree, and the word-formation probability as measures, removing candidate words with relatively low field difference to obtain the new words. Compared with the prior art, the method selects seed characters by using difference information between corpora of different fields and expands the seeds through n-grams to obtain the candidate word set; it then automatically selects new words from the candidates by using word-internal information together with field difference information, so that both the number of new words discovered and the accuracy of new word discovery are remarkably improved.
Description
Technical field
The present invention relates to a new word extraction method, in particular a new word extraction method based on field difference, and belongs to the technical field of natural language processing applications.
Background technology
Network neologisms are special expressions or words that appear and become popular along with the Internet. They usually derive from popular film, television, and Internet phrases, or are coined in response to some social phenomenon and gain general acceptance. Network neologisms appear frequently in network-domain text such as mhkc posts and microblogs. Statistics show that more than 1000 neologisms enter daily life in China every year. According to related research, more than 60% of word segmentation errors are caused by network neologisms, and the accuracy of new word recognition directly affects the performance of intelligent information processing systems. For example, in the text sentiment analysis task of intelligent information processing, fixed phrase collocations can convey sentiment polarity; if a neologism phrase cannot be correctly recognized, the judged sentiment polarity will be distorted. Consider the product review "expresses very tall and big on": here "tall and big on" should be treated as a single network neologism whose whole expresses the positive sentiment "high-end, classy, upscale". In nearly all current application systems, however, word segmentation produces the tag sequence "expresses/v very/adv tall/adj big/adj on/adv"; that is, the neologism is cut into single characters, and the erroneous segmentation loses the positive sentiment of the sentence, seriously affecting downstream intelligent analysis. Effective recognition of neologisms is therefore of great significance in the field of natural language processing.
At present, new word extraction methods fall into two classes: rule-based and statistics-based. The main idea of rule-based methods is to focus on the word-formation principles of neologisms, using them as a theoretical basis to build a reference corpus that helps identify new words; the linguistic properties of words are then studied to construct a dedicated word-formation rule base from their natural attributes. Rule-based methods achieve high recognition accuracy but require very strong linguistic expertise and relevant domain background. Statistics-based methods realize new word recognition in two main ways. One treats new word extraction as an integral part of word segmentation and infers the most probable segmentation boundaries through a statistical model, thereby obtaining new words; classical statistical models include Conditional Random Fields (CRF) and gradient-descent-trained models based on frequency features. The other treats new word extraction as an independent task, usually requiring Part-Of-Speech (POS) tagging as pre-processing. Because network neologisms are timely, widely circulated, and dynamically changing, purely rule-based methods often perform poorly, while purely statistical methods suffer from sparse training data and the difficulty of extracting effective features. Most current research combines rules and statistics to exploit their respective advantages, but these methods all ignore an informative property of the corpora themselves: the same word differs in meaning (intension) across field topics, which manifests as different distributions of the same word under different field topics.
Summary of the invention
Aimed at the neologisms constantly produced and used on the network, the present invention proposes a new word extraction method based on field difference. The method makes full use of the characteristics of corpora from different fields and, under the existing general evaluation system, effectively improves the accuracy of new word recognition.
The idea of the invention is: compare the difference in word distribution between fields to obtain difference character seeds; expand the seeds in an n-gram manner to build a candidate word set; then, for each word in the candidate set, take the field difference value, the coagulation degree, and the word-formation probability as criteria and extract further to obtain the new words.
The related definitions used in the present invention are as follows:
Definition 1: field difference character. A single character that embodies field difference: it reflects domain features, and its frequency of occurrence differs greatly between corpora of different fields. For example, if the ratio of the frequency f_internet(c) of character c in the web corpus to its frequency f_news(c) in the news corpus exceeds a threshold λ, then c is called a field difference character. If a single character can signal field difference in this way, the present invention assumes that the words formed from it also exhibit this difference in word distribution.
Definition 2: repeated words. If words W_a and W_b satisfy the repetition condition (given as a formula in the original; intuitively, one word contains the other as a substring), then W_a and W_b are said to be repeated words of each other. For example: "liking general greatly running quickly" (W_a) and "general greatly run quickly" (W_b).
Definition 3: field difference value DV (Difference Value), a measure of field difference, calculated from the frequency f_internet(W) of word W in the web corpus and its frequency f_news(W) in the news corpus:
DV(W) = log((1 + f_internet(W)) / (1 + f_news(W)))
where f_internet(W) denotes the frequency of W in the web corpus and f_news(W) the frequency of W in the news corpus.
Definition 4: coagulation degree CV (Concrete Value), a quantitative index of whether a word is correctly segmented. For example, "cinema" has the two binary splits "film" + "house" and "electric" + "movie theatre". For any word W = c_1c_2 (where c_1 and c_2 are the characters or sub-words forming the word), enumerate all its possible binary splits, compute the corresponding weight of each, and take the minimum as the coagulation degree of the word.
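Definition 4 can be sketched in code. The split weight is reconstructed from the worked numbers in the embodiment below (e.g. CV("mew" + "star people") = 6/(8 × 6)), i.e. CV(W) is assumed to be the minimum over all binary splits of f(W) / (f(left) · f(right)); the function name and the ASCII placeholder strings are illustrative, not from the patent.

```python
def coagulation(word, freq):
    """Coagulation degree CV(W): minimum over all binary splits of
    f(W) / (f(left) * f(right)).  `freq` maps a string to its corpus count."""
    fw = freq[word]
    return min(fw / (freq[word[:i]] * freq[word[i:]])
               for i in range(1, len(word)))

# ASCII stand-in for the embodiment's three-character example word:
freq = {"mxr": 6, "m": 8, "xr": 6, "mx": 6, "r": 7}
cv = coagulation("mxr", freq)  # min(6/(8*6), 6/(6*7)) = 0.125
```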
Definition 5: word-formation probability NWP (New Word Probability), an index judging whether a character sequence forms a word. For example, "loves to say" and "loves to eat" are composed of single characters, but their NWP is very low, indicating that neither forms a word.
The object of the present invention is achieved by the following steps:
A new word extraction method based on field difference comprises the following steps:
Step 1: compare the input corpus S_1 of the field from which neologisms are to be obtained with a corpus S_2 of another field, to obtain field difference character seeds.
Preferably, the field difference character seeds are obtained by the following steps:
(1) count the frequency f_s1(c) and f_s2(c) with which each character c occurs in S_1 and S_2 respectively;
(2) compute the difference value of each character between S_1 and S_2:
D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
(3) set a threshold λ; if the difference value D_word_seg(c) of character c exceeds λ, take c as a difference character seed.
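Step 1 can be sketched as below; the difference-value formula D(c) = f_s1(c) / (1 + f_s2(c)) is a reconstruction of the garbled formula above, and all identifiers are illustrative.

```python
from collections import Counter

def seed_chars(s1_text, s2_text, lam=2.0):
    """Step 1 sketch: a character c is a difference seed when
    D(c) = f_s1(c) / (1 + f_s2(c)) exceeds the threshold lam."""
    f1, f2 = Counter(s1_text), Counter(s2_text)
    return {c for c, f in f1.items() if f / (1 + f2[c]) > lam}

# toy corpora: 'x' is frequent only on the S_1 side
seeds = seed_chars("xxxxxy", "yyyy", lam=2.0)  # -> {'x'}
```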
Step 2: expand the field difference character seeds to build the candidate word set Set_candidate.
Preferably, the expansion adopts an n-gram manner, as follows:
(1) in corpus S_1, take n = 2, 3, 4, 5 in turn and obtain all corresponding n-grams; retain each n-gram that contains any difference character seed, count its frequency of occurrence, and add it to the candidate word set Set_candidate;
(2) for every candidate word W in Set_candidate, compare its frequency with a predetermined threshold; if its frequency falls below the threshold, delete W from Set_candidate.
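The n-gram expansion of Step 2 can be sketched as follows. The frequency threshold symbol appears only as an image in the original, so `min_freq` is an assumed stand-in; identifiers are illustrative.

```python
from collections import Counter

def candidate_set(s1_text, seeds, min_freq=2):
    """Step 2 sketch: collect all n-grams (n = 2..5) of the corpus that
    contain at least one seed character, then drop those whose frequency
    falls below min_freq."""
    grams = Counter()
    for n in range(2, 6):
        for i in range(len(s1_text) - n + 1):
            g = s1_text[i:i + n]
            if any(c in seeds for c in g):
                grams[g] += 1
    return {g: f for g, f in grams.items() if f >= min_freq}

cands = candidate_set("abxcd abxcd abx", {"x"}, min_freq=2)
```

Note that n-grams without a seed character (such as "cd" here) never enter the candidate set, which keeps the set small before the later filtering steps.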
Step 3: remove repeated words from Set_candidate according to the field difference of the candidate words.
Preferably, the field difference of a candidate word W can be computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency of W in corpus S_1 and f_s2(W) the frequency of W in corpus S_2.
Further, to obtain a better de-duplication effect, the field difference of repeated words can combine the coagulation degree with the field difference value. That is, according to Definition 2, find all repeated words in Set_candidate; compare each pair of repeated words, retain the one with the larger weight, and discard the smaller; repeat this process until Set_candidate no longer contains repeated words. The detailed process is as follows:
(1) according to Definition 2, take n = 2, 3, 4, 5 and compare all words in Set_candidate to find all repeated words, where n denotes the number of characters in a word of Set_candidate;
(2) compute the coagulation degree CV(W) of Definition 4 and the field difference value DV(W) of Definition 3 for each repeated word:
Coagulation degree: CV(W), the minimum split weight of Definition 4;
Field difference value: DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
then compare the weighted values V pairwise between repeated words, as in the following formula, and keep the word with the larger weight:
V(W) = α^n · DV(W) + CV(W)
where α is a parameter representing the tolerance allowed between n-grams of different lengths, n is the number of characters in word W, c_i denotes the i-th character or sub-word of W, and w_1 and w_2 are two mutually repeated words;
(3) repeat steps (1) and (2) until the candidate word set no longer contains repeated words.
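The de-duplication loop of Step 3 can be sketched as below. A base-10 logarithm is assumed for DV, which matches the embodiment's worked value log(7) = 0.845; the substring test stands in for the repeated-word condition of Definition 2, and n is taken as the candidate's character count. All identifiers are illustrative.

```python
import math

def dv(f1, f2):
    # DV(W) = log10((1 + f_s1(W)) / (1 + f_s2(W)))
    return math.log10((1 + f1) / (1 + f2))

def dedup(cands, cv, f_s1, f_s2, alpha=1.1):
    """Keep, from each repeated-word pair (one a substring of the other),
    the word with the larger weight V(W) = alpha**n * DV(W) + CV(W)."""
    def weight(w):
        return alpha ** len(w) * dv(f_s1.get(w, 0), f_s2.get(w, 0)) + cv.get(w, 0.0)
    kept = set(cands)
    changed = True
    while changed:
        changed = False
        for a in sorted(kept):
            for b in sorted(kept):
                if a != b and a in b:  # a repeats inside b
                    kept.discard(a if weight(a) < weight(b) else b)
                    changed = True
                    break
            if changed:
                break
    return kept

f1 = {"mxr": 6, "amxr": 3}            # web-corpus frequencies (ASCII stand-ins)
cv = {"mxr": 0.125, "amxr": 0.143}    # coagulation degrees from Definition 4
survivors = dedup({"mxr", "amxr"}, cv, f1, {})  # keeps the higher-weighted "mxr"
```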
Step 4: remove the candidate words with relatively low field difference from Set_candidate, add the candidate words whose value exceeds a predetermined threshold γ to the new word set Y, and output all new words.
Preferably, the field difference of a candidate word W can be computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency of W in corpus S_1 and f_s2(W) the frequency of W in corpus S_2.
Further, the field difference can be characterized as follows: for each candidate word in Set_candidate, compute its field difference value (DV), word-formation probability (NWP), and coagulation degree (CV) according to Definitions 3, 4, and 5 respectively, and combine them in proportion, as follows:
(1) compute the difference value DV(W) of candidate word W:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
(2) compute the word-formation probability NWP(W) of candidate word W, where f(c_i) denotes the frequency of occurrence of character c_i in W, and single(c_i) denotes the frequency with which c_i occurs as a standalone token after segmentation with a word segmentation tool;
(3) compute the coagulation degree CV(W) of candidate word W according to Definition 4;
(4) normalize the difference value (DV), word-formation probability (NWP), and coagulation degree (CV) separately, using min-max normalization:
X'_j = (X_j − X_min) / (X_max − X_min)
where X_j is the current value of the j-th word (difference value, word-formation probability, or coagulation degree), X_min is the minimum of that value over all words, and X_max is its maximum;
(5) compute the weight V of candidate word W:
V(W) = a·DV(W) + b·CV(W) + c·NWP(W)
where a, b and c are the proportions of the difference value, coagulation degree, and word-formation probability in the weight V.
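The combination of Step 4 can be sketched as follows. Min-max normalization is assumed for step (4), consistent with the embodiment's numbers (DV 0.845 → 0.920 given DV_min = 0.176 and DV_max = 0.903); identifiers are illustrative.

```python
def min_max(values):
    # X' = (X - X_min) / (X_max - X_min)
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values coincide
    return {w: (x - lo) / span for w, x in values.items()}

def field_difference(dv_map, cv_map, nwp_map, a=0.6, b=0.4, c=-0.2):
    """V(W) = a*DV'(W) + b*CV'(W) + c*NWP'(W) over normalized scores."""
    dvn, cvn, nwpn = min_max(dv_map), min_max(cv_map), min_max(nwp_map)
    return {w: a * dvn[w] + b * cvn[w] + c * nwpn[w] for w in dv_map}

# three toy candidates spanning the embodiment's min/max values
dv_map = {"mxr": 0.845, "w1": 0.176, "w2": 0.903}
cv_map = {"mxr": 0.125, "w1": 0.071, "w2": 0.250}
nwp_map = {"mxr": 0.0, "w1": 1.0, "w2": 0.5}
scores = field_difference(dv_map, cv_map, nwp_map)
```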
Beneficial effect
Compared with the prior art, the present invention selects seed characters by using difference information between corpus fields and expands them through n-grams to obtain the candidate word set; it then uses word-internal information together with field difference information to select the new words among the candidates automatically, thereby remarkably increasing both the number and the accuracy of new word discovery.
Accompanying drawing explanation
Fig. 1 is a schematic flowchart of the field-difference-based new word extraction method of the embodiment of the present invention;
Fig. 2 is a schematic comparison of the present method with four existing new word extraction methods in terms of new word recognition quantity and accuracy.
Embodiment
The method of the invention is described in further detail below with reference to the accompanying drawings and an embodiment.
Embodiment
This embodiment describes the method of the invention in detail using a web corpus as S_1 and a news corpus as S_2.
The web corpus is an mhkc post, as shown in Table 1:
Table 1:
The news corpus is a news item of April 4, 2001, as shown in Table 2:
Table 2:
The field-difference-based new word extraction method, whose processing flow is shown in Fig. 1, comprises the following steps:
Step 1: obtain field difference character seeds.
A field difference character is a character that occurs markedly more often in one corpus than in the other. Field difference characters can be obtained in various ways; this embodiment simply decides whether a character can serve as a seed by whether the difference between its frequencies of occurrence in the two corpora exceeds a predetermined threshold, as follows:
Count the frequency with which each character occurs in the web corpus and its frequency in the news corpus; then compute the difference value of the two; finally set the threshold λ to 2, and take every character whose difference value is greater than or equal to λ as a difference character. The resulting set of difference characters is shown in Table 3:
Table 3:
Step 2: expand the difference character seeds to obtain the candidate word set.
The seeds can be expanded into candidate words in various ways, for example through a dictionary or in an n-gram manner; this embodiment adopts the n-gram manner, as follows: in the web corpus, take n = 2, 3, 4, 5 in turn and obtain all n-gram strings; retain those n-grams that contain any difference character, and delete meaningless strings. For example, from "good beautiful mew star people" the following n-grams can be extracted:
2-gram: {"good drift", "beautiful", "bright", "mew", "mew star", "star people"},
3-gram: {"good beautiful", "beautiful", "bright mew", "mew star", "mew star people"},
4-gram: {"good beautiful", "beautiful mew", "bright mew star", "mew star people"}, and
5-gram: {"good beautiful mew", "beautiful mew star", "bright mew star people"}
Then count the frequency of each of these n-grams and set a frequency threshold; when the frequency f(W) of a word W exceeds the threshold and W contains any of the above difference characters, W is selected as a candidate word. The candidate word set finally obtained is shown in Table 4:
Table 4:
Step 3: remove repeated words.
First, according to Definition 2, find all repeated words in the candidate word set Set_candidate. The repeated word pairs found for "mew star people" are: {mew star, mew star people}, {star people, mew star people}, {mew star people, mew star people}, {mew star people, the mew star people of love}.
Next, between each pair of repeated words, retain the candidate whose field difference is larger. Here the field difference could simply be characterized by the frequencies with which a candidate occurs in the two corpora; to overcome the influence of differing corpus sizes that a simple frequency difference would bring, this embodiment instead takes the logarithm of the ratio of the two, as shown in the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
Further, experiments show that a better de-duplication effect is obtained if the field difference considers not only the field difference value DV above but also the coagulation degree CV, i.e. the field difference is given by the combined weight:
V(W) = α^n · DV(W) + CV(W)
Therefore, according to Definitions 3 and 4, compute the coagulation degree and difference value of each of the above words. Take the pair {mew star people, the mew star people of love} as an example of removing repeated words: the frequency of "mew star people" is 6, the frequency of "the mew star people of love" is 3, and both frequencies in the news corpus are 0; then:
DV(mew star people) = log((6 + 1) / (0 + 1)) = 0.845
DV(the mew star people of love) = log((3 + 1) / (0 + 1)) = 0.602
CV(mew star people): "mew star people" has the two binary splits "mew" + "star people" and "mew star" + "people", whose values are respectively
CV("mew" + "star people") = 6 / (8 × 6) = 0.125
CV("mew star" + "people") = 6 / (6 × 7) = 0.143.
Take the smaller value as the coagulation degree of the word "mew star people":
CV(mew star people) = 0.125
Likewise, "the mew star people of love" has four binary splits: "love" + "mew star people", "love mew" + "star people", "mew of love" + "star people", and "the mew star of love" + "people".
Their values are respectively:
CV("love" + "mew star people") = 3 / (4 × 4) = 0.185
CV("love mew" + "star people") = 3 / (3 × 6) = 0.167
CV("mew of love" + "star people") = 3 / (3 × 6) = 0.167
CV("the mew star of love" + "people") = 3 / (3 × 7) = 0.143. Take the smallest value as the coagulation degree of the word "the mew star people of love":
CV(the mew star people of love) = 0.143
Taking the parameter α = 1.1:
V(mew star people) = 0.845 × 1.1^3 + 0.125 = 1.249
V(the mew star people of love) = 0.602 × 1.1^5 + 0.143 = 1.113
Thus "mew star people" is retained in this de-duplication and "the mew star people of love" is deleted. Step 3 is performed on all repeated words in Set_candidate until no repeated words remain. The candidate words finally determined are shown in Table 5:
Table 5:
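The Step-3 arithmetic above can be checked in a few lines (base-10 logarithm assumed, since log10(7) = 0.845 matches the patent's DV value; the variable names are illustrative stand-ins for the Chinese example words):

```python
import math

dv_mxr = math.log10((6 + 1) / (0 + 1))    # DV(mew star people) = 0.845
dv_amxr = math.log10((3 + 1) / (0 + 1))   # DV(the mew star people of love) = 0.602
cv_mxr = min(6 / (8 * 6), 6 / (6 * 7))    # coagulation degree = 0.125
v_mxr = dv_mxr * 1.1 ** 3 + cv_mxr        # weight of the 3-character word
v_amxr = dv_amxr * 1.1 ** 5 + 0.143       # weight of the longer word
# v_mxr > v_amxr, so "mew star people" survives the de-duplication
```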
Step 4: screen the candidate words according to field difference to obtain the new word set, and output it.
As in Step 3, the field difference could be characterized by the logarithm of the ratio of a candidate's frequencies in the two corpora; but experiments show that a better effect is obtained if the field difference combines the field difference value DV, the word-formation probability NWP, and the coagulation degree CV in proportion:
V(W) = a·DV(W) + b·CV(W) + c·NWP(W)
For each candidate word in Set_candidate, compute its field difference value, word-formation probability, and coagulation degree according to Definitions 3, 4, and 5 respectively.
Still taking the word "mew star people" as an example:
Difference value: DV(mew star people) = log((6 + 1) / (0 + 1)) = 0.845
Coagulation degree: CV(mew star people) = 6 / (8 × 6) = 0.125 (the split "mew" + "star people" gives the minimum)
Word-formation probability: this embodiment uses the ICTCLAS segmentation tool, which after segmentation gives single(mew) = 8, single(star) = 6, single(people) = 7; with f(mew) = 8, f(star) = 6, f(people) = 7 and f(mew star people) = 6, the word-formation probability NWP(mew star people) = 0.
Further, to obtain a better extraction effect, the above three values need to be normalized before being combined into the field difference weight.
Among the 7 words of Table 5, the maxima and minima of the three values are:
DV_max = 0.903; DV_min = 0.176;
CV_max = 0.25; CV_min = 0.071;
NWP_max = 1; NWP_min = 0;
After normalization, the three values of "mew star people" are 0.920, 0.302, and 0 respectively.
Taking a = 0.6, b = 0.4, c = −0.2:
V(mew star people) = 0.6 × 0.920 + 0.4 × 0.302 − 0.2 × 0 = 0.6728
The field differences of all the words in Table 5 obtained in this way are shown in Table 6:
Table 6:
Taking the threshold γ = 0.4 and filtering out all words whose field difference is lower than γ, the new word set obtained is {building-owner, mew star people, tinkling of pieces of jade body}.
Experimental results:
To verify the validity of the field-difference-based new word extraction method of the embodiment, the experiment uses three days of Sina microblog posts (June 6-8), totalling 10,237,813 posts, plus 3,524,584 posts from the Baidu "the large Supreme Being of Li Yi" forum, as the web corpus, and uses Xinhua News Agency news data from 1993 to 2004, totalling 9,517,292 sentences, as the news corpus. The existing new word extraction methods CV, NWP, EMI, and PNWD are compared with the DV and DV+CV+NWP methods proposed by the present invention in terms of new word recognition quantity and accuracy; the comparison results are shown in Fig. 2.
CV and NWP are statistical new word extraction methods generally understood by those skilled in the art and are not described further here.
EMI: the Enhanced Mutual Information algorithm proposed by Zhang et al. in 2009. For a word W = w_1w_2…w_n, where each w_i is a component word and n is the number of component words, let F denote the number of occurrences of word W and F_i the number of occurrences of word w_i. The idea of the algorithm is to measure the dependence of the whole word on each component word; the larger the value, the more likely the sequence forms a word.
PNWD: the pattern-based new word detection (Pattern New Word Detection) algorithm proposed by Huang et al. in 2014. Its core idea is to use POS tagging information and a seed vocabulary to automatically select templates matching phrase patterns such as <ad, *, au>, and then use these templates to automatically extract newly appearing vocabulary.
As shown in Fig. 2, the x-axis represents the top k words and the y-axis the average precision AP(k) of the top k words. The figure shows that, compared with the baselines EMI, CV, and NWP, both DV and DV+CV+NWP achieve better results; compared with the baseline PNWD, DV and DV+CV+NWP are also better, while CV and NWP are slightly less accurate than PNWD when the result set is small but clearly improve as the result set grows. This is because PNWD can only find neologisms of the parts of speech described by its patterns and ignores neologisms of other parts of speech, so after efficiently identifying the former, its recognition rate for the latter declines. DV achieves very good results mainly because it fully exploits the difference between fields, and neologisms are good at embodying this field difference. CV and NWP are slightly less accurate mainly because they judge 2-gram vocabulary poorly: a 2-gram splits into two single characters, each of which occurs with high probability on its own, so the two scores of a 2-gram are extremely low and the word is hard to recognize; since 2-gram vocabulary makes up a large share of neologisms, these two methods are less than ideal. DV+CV+NWP combines the advantages of the DV, CV, and NWP methods and obtains the best results. Therefore, compared with classical methods, the field-difference-based new word extraction method proposed by the present invention achieves higher accuracy and discovers more neologisms.
The above shows and describes the basic principle, principal features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not restricted to the described embodiments; the embodiments and the specification merely illustrate the principle of the invention. Without departing from the spirit and scope of the present invention, the invention admits various changes and modifications, all of which fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.
Claims (10)
1. A new word extraction method based on field difference, characterized by comprising the following steps:
Step 1: compare the input corpus S_1 of the field from which neologisms are to be obtained with a corpus S_2 of another field, to obtain field difference character seeds;
Step 2: expand the field difference character seeds to build the candidate word set Set_candidate;
Step 3: remove repeated words from Set_candidate according to the field difference of the candidate words;
Step 4: remove the candidate words with relatively low field difference from Set_candidate, add the candidate words whose value exceeds a predetermined threshold γ to the new word set Y, and output all new words.
2. The field-difference-based new word extraction method according to claim 1, characterized in that the field difference character seeds are obtained by the following procedure:
(1) count the frequency f_s1(c) and f_s2(c) with which each character c occurs in S_1 and S_2 respectively;
(2) compute the difference value of each character between S_1 and S_2:
D_word_seg(c) = f_s1(c) / (1 + f_s2(c))
(3) set a threshold λ; if the difference value D_word_seg(c) of character c exceeds λ, take c as a difference character seed.
3. The field-difference-based new word extraction method according to claim 1, characterized in that λ = 2.
4. The field-difference-based new word extraction method according to claim 1, characterized in that expanding the field difference character seeds to build the candidate word set Set_candidate is carried out in an n-gram manner, as follows:
(1) in corpus S_1, take n = 2, 3, 4, 5 in turn and obtain all corresponding n-grams; retain each n-gram that contains any difference character, count its frequency of occurrence, and add it to the candidate word set Set_candidate;
(2) for every candidate word W in Set_candidate, compare its frequency with a predetermined threshold; if its frequency falls below the threshold, delete W from Set_candidate.
5. The field-difference-based new word extraction method according to claim 1, characterized in that the field difference of the candidate word W is computed by the following formula:
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)))
where f_s1(W) denotes the frequency of word W in corpus S_1 and f_s2(W) the frequency of word W in corpus S_2.
6. The field-difference-based new word extraction method according to any one of claims 1-4, characterized in that removing repeated words from Set_candidate according to field difference is carried out by the following steps:
(1) take n = 2, 3, 4, 5 and compare all words in Set_candidate to find all repeated words, where n denotes the number of characters in a word of Set_candidate;
(2) for each repeated word found, combine the coagulation degree CV and the field difference value DV, compute its weight V by the following formulas, retain the word with the larger weight, and remove the word with the smaller weight, thereby achieving de-duplication:
V(W) = α^n · DV(W) + CV(W);
DV(W) = log((1 + f_s1(W)) / (1 + f_s2(W)));
where α is a parameter representing the tolerance allowed between n-grams of different lengths, c_i denotes the i-th character or sub-word of word W, and W = c_1c_2;
(3) repeat steps (1) and (2) until the candidate word set no longer contains repeated words.
7. The field-difference-based new word extraction method according to claim 6, characterized in that α = 1.1.
8., according to the arbitrary described a kind of new term fetch method based on field otherness of claim 1-5, it is characterized in that, described removal Set
candidate" field difference " in the candidate word that middle field difference is lower be by field difference value DV, become word rate NWP and solidifying right CV comprehensive according to a certain percentage after value (weight) V, obtain especially by following process:
(1) according to following formula calculated candidate word W difference value DV (W):
DV(W)=log(1+f
s1(W)/(1+f
s2(W)))
(2) calculating the word-formation rate NWP(W) of candidate word W, wherein f(c_i) denotes the frequency of occurrence of character c_i, and Single(c_i) denotes the frequency with which c_i appears as a single-character word after segmentation with a word segmentation tool;
(3) calculating the cohesion CV(W) of candidate word W;
(4) normalizing the difference value (DV), the word-formation rate (NWP) and the cohesion (CV) respectively, wherein in the normalization formula X_j denotes the current value of the j-th word (its difference value, word-formation rate or cohesion), X_min denotes the minimum of this value over all words, and X_max denotes the maximum of this value over all words;
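The normalization formula itself is not reproduced in the text above; given the definitions of X_min and X_max, a standard min-max rescaling to [0, 1] is the natural reading, sketched here as an assumption:

```python
def min_max_normalize(values):
    """X'_j = (X_j - X_min) / (X_max - X_min), mapping each score into [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # all candidate words share the same score: avoid division by zero
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Rescaling all three scores to a common range is what makes the fixed mixing proportions a, b and c of the next step meaningful.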
(5) calculating the weight V of candidate word W according to the following formula:
V(W) = a*DV(W) + b*CV(W) + c*NWP(W)
wherein a, b and c denote the respective proportions of the difference value, the cohesion and the word-formation rate in the weight V.
9. The field-difference-based new word extraction method according to claim 8, characterized in that a = 0.6, b = 0.4, c = -0.2.
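Combining claim 8's weighting formula with the claim-9 coefficients gives the following sketch; the three inputs are assumed to be one candidate word's already-normalized scores:

```python
def final_weight(dv, cv, nwp, a=0.6, b=0.4, c=-0.2):
    """V(W) = a*DV(W) + b*CV(W) + c*NWP(W); defaults are the claim-9 values."""
    return a * dv + b * cv + c * nwp

print(final_weight(1.0, 0.5, 0.2))  # 0.6 + 0.2 - 0.04 = 0.76 (up to float rounding)
```

Candidates are then ranked by V and those with low field difference fall to the bottom; note that c is negative, so the word-formation-rate term reduces rather than increases the weight.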
10. The field-difference-based new word extraction method according to any one of claims 1-5, 7 or 9, characterized in that γ = 0.4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510711219.7A CN105488098B (en) | 2015-10-28 | 2015-10-28 | A kind of new words extraction method based on field otherness |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105488098A true CN105488098A (en) | 2016-04-13 |
CN105488098B CN105488098B (en) | 2019-02-05 |
Family
ID=55675073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510711219.7A Active CN105488098B (en) | 2015-10-28 | 2015-10-28 | A kind of new words extraction method based on field otherness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105488098B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1340804A (en) * | 2000-08-30 | 2002-03-20 | 国际商业机器公司 | Automatic new term fetch method and system |
CN101119334A (en) * | 2007-09-21 | 2008-02-06 | 腾讯科技(深圳)有限公司 | Method, system and equipment for obtaining neology |
CN102708147A (en) * | 2012-03-26 | 2012-10-03 | 北京新发智信科技有限责任公司 | Recognition method for new words of scientific and technical terminology |
CN103294664A (en) * | 2013-07-04 | 2013-09-11 | 清华大学 | Method and system for discovering new words in open fields |
Non-Patent Citations (6)
Title |
---|
MINLIE HUANG et al.: "New Word Detection for Sentiment Analysis", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics *
QIU L et al.: "A Method for Automatic POS Guessing of Chinese Unknown Words", Proceedings of the 22nd International Conference on Computational Linguistics *
LIU Hua: "A New Method for Rapidly Acquiring Domain New Words" (in Chinese), Journal of Chinese Information Processing *
ZHANG Haijun et al.: "A Survey of Chinese New Word Recognition Techniques" (in Chinese), Computer Science *
DU Conghui: "Design and Implementation of a New Word Discovery Platform for Internet Data" (in Chinese), Wanfang Data *
DUAN Yufeng et al.: "Research on N-Gram-Based Chinese New Word Recognition in Specialized Domains" (in Chinese), New Technology of Library and Information Service *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106126495A (en) * | 2016-06-16 | 2016-11-16 | 北京捷通华声科技股份有限公司 | A kind of based on large-scale corpus prompter method and apparatus |
CN106126495B (en) * | 2016-06-16 | 2019-03-12 | 北京捷通华声科技股份有限公司 | One kind being based on large-scale corpus prompter method and apparatus |
CN108845982A (en) * | 2017-12-08 | 2018-11-20 | 昆明理工大学 | A kind of Chinese word cutting method of word-based linked character |
CN108845982B (en) * | 2017-12-08 | 2021-08-20 | 昆明理工大学 | Chinese word segmentation method based on word association characteristics |
CN110634145A (en) * | 2018-06-22 | 2019-12-31 | 青岛日日顺物流有限公司 | Warehouse checking method based on image processing |
CN110472140A (en) * | 2019-07-17 | 2019-11-19 | 腾讯科技(深圳)有限公司 | Object words recommending method, device and electronic equipment |
CN110472140B (en) * | 2019-07-17 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Object word recommendation method and device and electronic equipment |
CN112668331A (en) * | 2021-03-18 | 2021-04-16 | 北京沃丰时代数据科技有限公司 | Special word mining method and device, electronic equipment and storage medium |
CN113051912A (en) * | 2021-04-08 | 2021-06-29 | 云南电网有限责任公司电力科学研究院 | Domain word recognition method and device based on word forming rate |
CN113051912B (en) * | 2021-04-08 | 2023-01-20 | 云南电网有限责任公司电力科学研究院 | Domain word recognition method and device based on word forming rate |
Also Published As
Publication number | Publication date |
---|---|
CN105488098B (en) | 2019-02-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105488098A (en) | Field difference based new word extraction method | |
CN109815336B (en) | Text aggregation method and system | |
CN106776713A (en) | A clustering method for massive short texts based on word-vector semantic analysis | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN107038480A (en) | A kind of text sentiment classification method based on convolutional neural networks | |
CN105740229B (en) | The method and device of keyword extraction | |
CN107480122A (en) | A kind of artificial intelligence exchange method and artificial intelligence interactive device | |
CN112989802B (en) | Bullet screen keyword extraction method, bullet screen keyword extraction device, bullet screen keyword extraction equipment and bullet screen keyword extraction medium | |
CN106933972B (en) | The method and device of data element are defined using natural language processing technique | |
CN108388554B (en) | Text emotion recognition system based on collaborative filtering attention mechanism | |
CN106610955A (en) | Dictionary-based multi-dimensional emotion analysis method | |
CN106599054A (en) | Method and system for title classification and push | |
CN107180084A (en) | Word library updating method and device | |
CN107688630B (en) | Semantic-based weakly supervised microblog multi-emotion dictionary expansion method | |
CN106227768B (en) | A kind of short text opinion mining method based on complementary corpus | |
CN110728144B (en) | Extraction type document automatic summarization method based on context semantic perception | |
CN109858034A (en) | A kind of text sentiment classification method based on attention model and sentiment dictionary | |
CN107463703A (en) | English social media account number classification method based on information gain | |
CN109214445A (en) | A kind of multi-tag classification method based on artificial intelligence | |
CN108875034A (en) | A kind of Chinese Text Categorization based on stratification shot and long term memory network | |
CN107463715A (en) | English social media account number classification method based on information gain | |
CN106681986A (en) | Multi-dimensional sentiment analysis system | |
CN109614493A (en) | A kind of text condensation recognition methods and system based on supervision term vector | |
CN109271513A (en) | A kind of file classification method, computer-readable storage media and system | |
CN108319584A (en) | A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||