CN106649666A - Left-right recursion-based new word discovery method - Google Patents
Left-right recursion-based new word discovery method
- Publication number: CN106649666A
- Application number: CN201611152464.XA
- Authority: CN (China)
- Prior art keywords: word, corpus, new, information entropy, recursion
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a new word discovery method based on left and right recursion. It belongs to the technical field of search engines and grew out of practical work on lexical analysis and fast retrieval. The method uses information entropy to measure the randomness of the left-neighbor set and the right-neighbor set of a character string, and classifies new words into entity nouns, derived words, abbreviations, compound words and digit-combination words. Its processing steps are corpus preprocessing, position set calculation, set traversal, admission judgment, word frequency calculation, left recursion, right recursion and merging.
Description
Technical field
The invention belongs to the technical field of search engines and grew out of the construction and practical use of lexical analysis and fast retrieval. It can be used for efficient classification and retrieval of ordinary commercial data, and also for specialized sorting and screening in special domains such as public security and the military.
Background technology
With the rapid informatization of every industry, the volume of data in each vertical domain keeps growing, and much of that data is useless and needs no special treatment.
How to retrieve quickly from massive data, and how to mine useful information efficiently and intelligently from a vertical domain, has become a major problem in the development of intelligent search engines. Many search engine technologies have appeared, but most of them cannot effectively retrieve or intelligently recommend the specialized terms and special expressions of a particular domain. Existing search engine technology therefore cannot meet current industry needs, which has driven the development of distributed intelligent search engines.
Vertical domains often contain many specialized terms and new words that are absent from existing dictionaries. If a corpus contains such words, the lexical analysis module will make errors when processing it. An automated collection function for these specialized terms and new words is therefore needed. Adding the collected words to a dictionary builds a dedicated dictionary for the vertical domain, which improves the efficiency and precision of the lexical analysis module and hence the search efficiency and accuracy of the search engine.
In general, when data enters the distributed search engine it also enters the new word dictionary construction flow: new word discovery is run on the input corpus, and any new word that does not appear in the existing dictionary is added to the new word dictionary.
Existing new word discovery methods are generally either rule-based or statistics-based. The earliest methods were rule-based: they study the internal and external construction rules of new words, build a rule base from them, and use that rule base as the criterion for finding new words. Statistics-based methods enumerate all strings of length at most n, compute their word frequency and mutual information, and accept a string as a new word if these measures meet preset thresholds.
Both approaches have advantages and drawbacks. Rule-based methods have relatively high accuracy and efficiency, but building the rule base requires a great deal of manual rule extraction, and the rule base must be updated continually as the language evolves, so the approach is neither adaptive nor easily extensible. Statistics-based methods are automatic, but they find many high-frequency garbage strings and cannot find very long new words such as ethnic-minority personal names and transliterated names.
A survey of new word discovery techniques shows that most current methods search for new words within a fixed window, which prevents longer new words from being found. We invent a new word discovery method based on left and right recursion that greatly improves the accuracy of new word discovery. With this method, a high-accuracy, adaptive vertical-domain dictionary can be built easily, and the dictionary becomes more complete as the data volume grows. For special domains, the method can substantially improve word segmentation accuracy when indexing data.
The left-right recursion new word discovery method effectively solves the above problems faced by existing methods.
Summary of the invention
The invention discloses a left-right recursion new word discovery method. The method consists of eight steps: corpus preprocessing, position set calculation, set traversal, admission judgment, word frequency calculation, left recursion, right recursion and merging.
The method is designed as follows.
The method evaluates a candidate new word by three indices: word frequency, mutual information and information entropy.
(1) Word frequency
Word frequency is the frequency with which a string occurs in the corpus. The higher the frequency, the more likely the string is a new word; once the word frequency reaches a certain threshold, the string is considered a candidate new word. The formula is:
$$\mathrm{TF}(X) = \frac{N(X)}{N} \tag{1}$$
where N(X) is the number of times string X occurs and N is the total number of characters in the corpus.
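The word frequency definition above (occurrence count N(X) divided by the corpus size N) can be sketched in a few lines of Python. This is only an illustration: it assumes the corpus is a single in-memory string, counts overlapping occurrences, and uses a hypothetical function name, `word_frequency`, that does not come from the patent text.

```python
def word_frequency(corpus: str, candidate: str) -> float:
    """Return N(X) / N: occurrences of `candidate` divided by corpus length.

    Overlapping occurrences are counted (an assumption of this sketch).
    """
    if not corpus or not candidate:
        return 0.0
    count = 0
    start = corpus.find(candidate)
    while start != -1:
        count += 1
        # Move one character forward so overlapping matches are found too.
        start = corpus.find(candidate, start + 1)
    return count / len(corpus)


if __name__ == "__main__":
    print(word_frequency("abcabcab", "ab"))
```

A candidate would then be kept only if this value reaches a threshold chosen for the corpus at hand; the text does not state a specific threshold.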
(2) Mutual information
Mutual information first appeared as a measure in information theory; it quantifies the amount of information shared between two sets of events. The larger the mutual information between two event sets, the stronger their correlation, and vice versa. Mutual information is a common tool for model analysis in computational linguistics. Because it places no restriction on the relation between feature words and classes, it is often used to measure how well a feature matches a class in text classification.
In new word discovery, mutual information reveals the degree of association between two strings. The mutual information of strings X and Y is:
$$MI(X, Y) = \log \frac{p(XY)}{p(X)\,p(Y)} \tag{2}$$
where X and Y are strings or single characters; p(XY) is the probability that X and Y occur together in the input corpus; and p(X) and p(Y) are the probabilities that X and Y, respectively, occur in the input corpus.
For example, suppose that in a corpus of 30,000,000 characters "airport" (飞机场) occurs 215 times, a probability of about 7.1667 × 10^-6; "aircraft" (飞机) occurs 2899 times, a probability of about 9.6633 × 10^-5; and "field" (场) occurs 5384 times, a probability of about 1.7947 × 10^-4. If "aircraft" and "field" were unrelated, the probability of "airport" would in theory be 9.6633 × 10^-5 × 1.7947 × 10^-4, about 1.734 × 10^-8. The actual value, however, is about 413 times the theoretical value. This shows that "aircraft" and "field" are not stitched together by chance; the two strings have a definite association. But "airport" might also be formed from "fly" (飞) and "airfield" (机场), so in new word discovery we usually measure the mutual information of a string by its average mutual information, which is defined as follows (formula (3)):
where X1, X2, ..., Xn are strings or single characters and p(X1, X2, ..., Xn) is the probability that X1, X2, ..., Xn occur together in the input corpus.
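The two-string mutual information described above can be checked numerically against the airport example. The sketch below is a minimal illustration, not the patent's implementation: the function names are hypothetical, probabilities are estimated as count divided by corpus size, and the natural logarithm is an assumption because the text does not state a base. It covers only the pairwise case, not the average mutual information.

```python
import math


def association_ratio(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Return p(XY) / (p(X) * p(Y)), with each probability estimated as count / total."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return p_xy / (p_x * p_y)


def mutual_information(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Return log(p(XY) / (p(X) * p(Y))); natural log is an assumption of this sketch."""
    return math.log(association_ratio(count_xy, count_x, count_y, total))


if __name__ == "__main__":
    # Counts quoted in the example: "airport" 215, "aircraft" 2899, "field" 5384,
    # in a corpus of 30,000,000 characters.
    total = 30_000_000
    print(association_ratio(215, 2899, 5384, total))
    print(mutual_information(215, 2899, 5384, total))
```

For the counts quoted in the example, the ratio of observed to expected probability comes out at roughly 413, which is what makes "aircraft" + "field" look like a genuine unit rather than a chance pairing.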
(3) Information entropy
Information entropy measures uncertainty, that is, the average amount of information carried by an outcome: the more random the outcomes, the higher the entropy. The formula is:
$$H(X) = -\sum_{i} p(X_i)\,\log p(X_i) \tag{4}$$
where p(X_i) is the probability that event X_i occurs.
In new word discovery, we use information entropy to measure the randomness of the left-neighbor set and the right-neighbor set of a string; the higher the randomness, the more likely the string is a new word. For example, in the corpus "I love eating Chongqing hotpot, I love watching Chongqing beauties.", "Chongqing" occurs twice. Its left-neighbor set is {eat, watch} and its right-neighbor set is {fire, beauty} (the single characters adjacent to it in the Chinese text). By the entropy formula we obtain:
$$H_L = H_R = -\left(\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{2}\log\tfrac{1}{2}\right) = \log 2$$
When computing the information entropy of a string, we generally use the average of its left and right entropy to measure the string's degree of freedom:
$$H(X) = \frac{H_L(X) + H_R(X)}{2} \tag{5}$$
where H_L(X) is the left information entropy of string X and H_R(X) is the right information entropy of string X.
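The left and right neighbor entropy of the "Chongqing" example can be sketched as follows. Several things here are assumptions: the Chinese sentence is reconstructed from the translated example, the helper names are hypothetical, neighbors are single characters, and the natural logarithm is used because the text does not state a base.

```python
import math
from collections import Counter


def neighbor_counts(corpus: str, s: str):
    """Count the characters immediately left and right of each occurrence of `s`."""
    left, right = Counter(), Counter()
    start = corpus.find(s)
    while start != -1:
        end = start + len(s)
        if start > 0:
            left[corpus[start - 1]] += 1
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(s, start + 1)
    return left, right


def entropy(counts: Counter) -> float:
    """Shannon entropy of a neighbor distribution (natural log, by assumption)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def average_entropy(corpus: str, s: str) -> float:
    """Mean of the left and right neighbor entropy of `s`."""
    left, right = neighbor_counts(corpus, s)
    return (entropy(left) + entropy(right)) / 2


if __name__ == "__main__":
    # Assumed reconstruction of the example sentence.
    corpus = "我爱吃重庆火锅，我爱看重庆美女。"
    left, right = neighbor_counts(corpus, "重庆")
    print(dict(left), dict(right), average_entropy(corpus, "重庆"))
```

With two equally likely neighbors on each side, both entropies equal log 2, matching the worked example.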
Existing new word discovery techniques cannot find longer new words and also produce garbage strings; this method solves both problems effectively.
(1) Input corpus preprocessing
The input corpus often has irregular formatting, or contains useless garbage strings in Chinese or English text. This interferes heavily with new word discovery and introduces errors into the calculation of the new word indices. The method therefore preprocesses the text of the input corpus. The main steps are as follows:
First step: Use regular-expression filtering to delete special tags unrelated to the text, such as HTML tags and XML tags.
Second step: Convert full-width symbols in the corpus to half-width, convert Traditional Chinese to Simplified Chinese, and normalize line beginnings.
Third step: Delete redundant whitespace characters in the corpus, such as spaces, newlines and tabs.
Fourth step: Delete special symbols unrelated to the text, including ASCII codes, special encodings and garbled characters.
Fifth step: Delete non-text data in the text, such as pictures, audio and video.
Sixth step: Split the corpus into individual sentences at sentence-break markers, i.e. "。", "!", "?", "…", ";", spaces, newlines and so on. If a sentence contains paired symbols, such as quotation marks or book-title marks, the left and right symbols must match. Finally, number the resulting sentences.
Seventh step: To avoid errors in the new word indices caused by large numbers of identical sentences in the corpus, compute a hash value for each sentence produced in the sixth step and deduplicate sentences with identical hash codes, so that the input corpus does not contain many identical sentences or paragraphs.
The preprocessing used by the method makes the corpus normalized, accurate and meaningful. It largely avoids index calculation errors caused by defects and errors in the corpus itself, and so improves the accuracy of new word discovery.
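Part of the preprocessing pipeline above can be sketched as follows. This is a reduced illustration under stated assumptions, not the patent's implementation: it covers only tag removal, full-width to half-width conversion, sentence splitting and hash deduplication; it omits Traditional-to-Simplified conversion, special-symbol removal and paired-symbol matching; and all names, regular expressions and the choice of SHA-256 are hypothetical.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")               # crude HTML/XML tag pattern (assumption)
SENTENCE_END_RE = re.compile(r"[。!?…;\s]+")   # break markers after width conversion


def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                    # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(raw: str):
    """Return a numbered list of unique sentences as (number, sentence) pairs."""
    text = TAG_RE.sub("", raw)
    text = to_half_width(text)
    sentences = [s for s in SENTENCE_END_RE.split(text) if s]
    seen, unique = set(), []
    for sentence in sentences:
        digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        if digest not in seen:                # drop sentences with identical hashes
            seen.add(digest)
            unique.append(sentence)
    return list(enumerate(unique))


if __name__ == "__main__":
    print(preprocess("<p>我爱重庆！</p>我爱重庆！你好。"))
```

Splitting on whitespace also discards redundant spaces, newlines and tabs, so the third step is folded into the sentence split in this sketch.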
(2) New word classification design
The method classifies new words as entity nouns, derived words, abbreviations, compound words and digit-combination words. In new word discovery, Chinese corpora are harder to process than English corpora because Chinese grammar is particularly complex. English words are separated directly by spaces, which makes new word discovery simple: split the text on spaces and match the resulting words against a dictionary. As for morphological change, English word forms vary in complex ways but the variations are entirely regular, whereas the variations in Chinese follow no rules. Chinese has a strict word order, and different word orders express entirely different meanings; for example, "do not fully approve" and "completely disapprove" mean quite different things. Function words are used very widely in Chinese. Studies show that the most common function words have usage frequencies as high as 3%-5%, so new word discovery is often misled by them, for example when detecting the slang word "diaosi" (rendered literally as "Cock silks"). The set of Chinese characters is relatively stable: although the vocabulary built from characters is endlessly varied, new characters rarely appear, which is an advantage for Chinese new word discovery.
In summary, the characteristics of Chinese differ greatly from those of other languages, and special information processing methods and rules are needed to effectively improve the accuracy of Chinese new word discovery.
The method mainly targets words that are not contained in existing dictionaries. Research shows that most new words fall into the following five classes:
The first class is entity nouns: mainly personal names, place names and organization names. Personal names are further divided by their features into Han personal names, ethnic-minority personal names, transliterated English names and so on. Entity nouns are one of the main kinds of word this method discovers.
The second class is derived words: mainly new words formed by adding a specific suffix to an existing word, such as "aging" (literally "old-age-ization"). When performing new word discovery, the method can find derived words of this kind.
The third class is abbreviations: these arise from very long terms, where a few characters of the term are used to stand for the whole. For example, the abbreviation of "China national football team" is "national football", and the abbreviation of the U.S. men's professional basketball league is "NBA".
The fourth class is compound words: these are formed mainly by combining verbs and nouns, merging two or more words, such as "software engineering" and "Area A of Chongqing University, Shapingba District, Chongqing, China". The former is a word composed of two nouns; the latter is composed of several place-name phrases: "China", "Shapingba District", "Chongqing University" and "Area A".
The fifth class is digit-combination words: these are formed by combining digits, dates, telephone numbers, codes and the like.
We designed a rule base for the above classes. Rules in the rule base can be added automatically or manually. As the number of rules grows, new word discovery becomes more and more accurate.
Because rules sometimes exclude one another, the method uses a priority strategy: when rules conflict, the rule to apply is selected by priority.
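A prioritized rule base of the kind described above could be organized as in the sketch below. Everything concrete here is hypothetical and chosen only for illustration: the `Rule` structure, the three example rules, their patterns and their priority values are not taken from the patent, which does not list its rules.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    category: str                      # one of the five new word classes
    priority: int                      # larger value wins on conflict (assumption)
    matches: Callable[[str], bool]     # predicate deciding whether the rule applies


# Illustrative rules only; a real rule base would be far larger.
RULES = [
    Rule("digit-combination word", 3,
         lambda s: re.fullmatch(r"[0-9][0-9\-/:.]*", s) is not None),
    Rule("derived word", 2, lambda s: len(s) > 1 and s.endswith("化")),
    Rule("compound word", 1, lambda s: len(s) >= 4),
]


def classify(candidate: str, rules=RULES) -> Optional[str]:
    """Return the category of the highest-priority matching rule, or None."""
    hits = [rule for rule in rules if rule.matches(candidate)]
    if not hits:
        return None
    return max(hits, key=lambda rule: rule.priority).category


if __name__ == "__main__":
    for text in ["2016-12-14", "老年化", "软件工程", "我"]:
        print(text, classify(text))
```

In this sketch "2016-12-14" satisfies both the digit-combination rule and the compound-word rule, and the higher priority decides the outcome, which is the conflict-resolution behaviour the text describes.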
(3) New word evaluation index design
The new word evaluation indices are word frequency, mutual information and information entropy. Word frequency is the number of times a string occurs; only when it reaches the threshold do we consider that the string may be a new word. Looking at the occurrence count alone is not enough. For example, "的电影" (the particle "的" followed by "film") occurs 492 times and "电影院" ("cinema") occurs 323 times. "的电影" is more frequent than "电影院", yet we prefer to treat "电影院" as a word, because we feel that "film" and "院" cohere more strongly than "的" and "film". So once the word frequency reaches the threshold, we also use mutual information to express the internal cohesion of the word. Judging by word frequency and mutual information alone is still not enough; we also need to look at a string's external behaviour through information entropy. Take the two strings "century" and "deeds". We say only a limited number of things such as "last century", "next century" and "this century", so the left-neighbor set of "century" is small, and we prefer to treat "last century", "next century" and "this century" as new words. For "deeds", we can say "glorious deeds", "outstanding deeds", "personal deeds", "candidate's deeds" and many more; its left-neighbor set is large, so we can take "deeds" itself as a new word. Information entropy measures how freely a string is used: the larger its left-neighbor and right-neighbor sets, the more likely it is to be a new word. After the left and right information entropy of a string have been computed, left recursion or right recursion is carried out according to the left or right entropy.
The left-right recursion new word discovery method can find new words of any length. Its steps are as follows:
Step 1: Preprocess the corpus, as described above.
Step 2: Calculate the position set, that is, compute:
$$W = \{(w_1, POS_1), (w_2, POS_2), \ldots, (w_m, POS_m)\}, \qquad POS_i = \{pos_i^1, pos_i^2, \ldots, pos_i^{n_i}\}$$
where w_1, w_2, ..., w_m are the mutually distinct characters that occur in the input corpus; (w_i, POS_i) is an element of W; pos_i^j is an element of POS_i and denotes the position of the j-th occurrence of the i-th character w_i in the input corpus; and POS_i is the set of all positions at which w_i occurs.
For example, for the corpus "I am Chinese, I love China" (我是中国人，我爱中国), the set W is {('我', {0, 6}), ('是', {1}), ('中', {2, 8}), ('国', {3, 9}), ('人', {4}), ('爱', {7})}. The position set W records the positions at which each character occurs.
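The position set of Step 2 maps each distinct character to the positions where it occurs. The sketch below reproduces the example, assuming the Chinese sentence "我是中国人，我爱中国" as the original of the translated corpus and assuming that punctuation is skipped while still occupying a position (which is what the example's indices imply). The function name is hypothetical.

```python
from collections import defaultdict


def position_set(corpus: str) -> dict:
    """Map each distinct character to the sorted list of positions where it occurs.

    Punctuation and whitespace are not recorded as entries, but they still
    advance the position counter.
    """
    positions = defaultdict(list)
    for index, ch in enumerate(corpus):
        if ch.isalnum():               # True for CJK characters, letters and digits
            positions[ch].append(index)
    return dict(positions)


if __name__ == "__main__":
    for ch, pos in position_set("我是中国人，我爱中国").items():
        print(ch, pos)
```

Storing positions rather than bare counts lets later steps look up the left and right neighbors of every occurrence without rescanning the whole corpus.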
Step 3: Set traversal: traverse the set W, taking out each character w_i in turn, i.e. word = w_i, i = i + 1.
Step 4: Admission judgment: check whether word satisfies the admission rules of the rule base. If not, return to Step 3; if so, go to Step 5. Admission means that the candidate conforms to the new word classification, which comprises entity nouns, derived words, abbreviations, compound words and digit-combination words. For these five classes we designed a corresponding rule base, whose rules can be added automatically or manually. Admission is therefore decided according to these five types.
Step 5: Word frequency calculation: compute the frequency with which word occurs. If it is below the threshold, return to Step 3; if it is above the threshold, go to Step 6.
Step 6: Left recursion: perform left recursion on word, denoted createPrefixTree(word, W), to obtain the left-recursion new word set P1. It consists of the following sub-steps:
a. Compute the left information entropy and mutual information of word from W. If they meet the threshold requirements, go to sub-step b; if not, go to sub-step c;
b. If the existing dictionary S does not contain word, add word to the new word set P1. Check whether the traversal of W is complete: if not, return to Step 3; otherwise go to Step 7;
c. Compute the left-neighbor character set of word, pre = {pre_1, pre_2, ..., pre_k}:
Step 7: Right recursion: perform createSufTree(word, W) on word to obtain the right-recursion new word set P2.
a. Compute the right information entropy and mutual information of word from W. If they meet the threshold requirements, go to sub-step b; if not, go to sub-step c;
b. If the existing dictionary S does not contain word, add word to the new word set P2. Check whether the traversal of W is complete: if not, return to Step 3; otherwise go to Step 8;
c. Compute the right-neighbor character set of word, suf = {suf_1, suf_2, ..., suf_k}
Step 8: Merge: compute the intersection of P1 and P2 to obtain the discovered new word set P3.
Description of the drawings
The structure and workflow of the invention are illustrated below with reference to the drawings, in which:
Fig. 1 is a flow chart of the left-right recursion new word discovery method
Fig. 2 is a block diagram of input corpus preprocessing
Fig. 3 is a flow chart of left recursion
Fig. 4 is a flow chart of right recursion
Specific embodiment
The embodiments of the "left-right recursion new word discovery method" of the invention are further described below with reference to the drawings.
(1) Input corpus preprocessing
Input corpus preprocessing involves seven steps: regular-expression filtering, half-width conversion, whitespace deletion, special symbol deletion, non-text symbol deletion, sentence splitting and hash deduplication.
The order of these steps can be changed. The method provides one implementation as verification, in which preprocessing is performed in the order regular-expression filtering, half-width conversion, whitespace deletion, special symbol deletion, non-text symbol deletion, sentence splitting, hash deduplication. Regular-expression filtering deletes special tags unrelated to the text, such as HTML tags and XML tags. Half-width conversion converts full-width symbols to half-width, converts Traditional Chinese to Simplified Chinese, and normalizes line beginnings. Whitespace deletion removes redundant spaces, newlines and tabs. Special symbol deletion removes ASCII codes, special encodings and garbled characters. Non-text symbol deletion removes non-text data from the text. Sentence splitting divides the corpus into individual sentences at sentence-break markers, i.e. "。", "!", "?", "…", ";", spaces and newlines; if a sentence contains paired symbols such as quotation marks or book-title marks, the left and right symbols must match, and the resulting sentences are finally numbered. Hash deduplication computes a hash value for each sentence and removes sentences with identical hash codes.
This preprocessing makes the corpus normalized, accurate and meaningful, and avoids index calculation errors caused by defects and errors in the corpus itself.
(2) New word discovery method
This new word discovery method is based on left and right recursion and can effectively discover very long new words such as ethnic-minority personal names, transliterated English personal names and transliterated English place names. Its basic flow is as follows. Take any element from the position set and denote it e. First compute its left information entropy and mutual information and compare them with the thresholds. If they are greater than or equal to the thresholds and the dictionary S does not contain e, add e to the set P1. If they are below the thresholds, compute the left-neighbor character set of e; then, for each element of that set, compute the left information entropy and mutual information, apply the threshold test, and decide whether to add the result to P1: if it qualifies it is added, otherwise the recursion continues. Right recursion is carried out in the same way, and the qualifying new words are added to the set P2. Finally, the intersection of P1 and P2 gives the final new word set.
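The left-recursion flow described above can be sketched as a recursive function. This is a simplified illustration under several assumptions and should not be read as the patent's createPrefixTree: it tests only the left entropy (the mutual information test is omitted), scans the corpus string directly instead of using the position set, treats every adjacent character as a neighbor, and adds a hypothetical `max_len` guard to bound the recursion. All names are illustrative.

```python
import math
from collections import Counter


def left_neighbors(corpus: str, s: str) -> Counter:
    """Count the characters immediately to the left of each occurrence of `s`."""
    counts = Counter()
    start = corpus.find(s)
    while start != -1:
        if start > 0:
            counts[corpus[start - 1]] += 1
        start = corpus.find(s, start + 1)
    return counts


def entropy(counts: Counter) -> float:
    """Shannon entropy of a neighbor distribution (natural log, by assumption)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def left_expand(corpus, word, threshold, dictionary=frozenset(), max_len=10):
    """Extend `word` to the left until its left entropy meets `threshold`.

    Returns the set of qualifying strings that are not already in `dictionary`.
    """
    neighbors = left_neighbors(corpus, word)
    if entropy(neighbors) >= threshold:
        # Left boundary is free enough: accept unless the word is already known.
        return set() if word in dictionary else {word}
    if len(word) >= max_len:
        return set()
    found = set()
    for pre in neighbors:
        # Prepend each left neighbor and test the longer candidate recursively.
        found |= left_expand(corpus, pre + word, threshold, dictionary, max_len)
    return found


if __name__ == "__main__":
    print(left_expand("xabc yabc zabc wabc", "c", threshold=1.0))
```

In the toy corpus, "c" is always preceded by "b" and "bc" always by "a", so both have zero left entropy and are extended; "abc" has four different left neighbors and is accepted. A right-recursion counterpart would mirror this by appending right neighbors, and the two result sets would then be intersected.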
We implemented the new word discovery method and used 200 real Baidu Baike articles on the theme "film" as experimental data. Ethnic-minority personal names, transliterated personal names, transliterated place names and similar words occur frequently in these 200 articles.
The implementation steps are as follows:
1. Run an existing new word discovery method on the 200 articles, record the new words found, and compute the precision, recall and F value.
2. Run this new word discovery method on the 200 articles, record the new words found, and compute the precision, recall and F value.
3. Compare and analyse the results of the existing method and of this method.
Experimental result and analysis
The new word discovery methods and resultses of table 1 are contrasted
As Table 1 shows, with the window size set to 4 the traditional new word discovery method can only find
new words with a string length of less than 4. Strings such as "Coronis handkerchief", "Leonardo", "livre",
"speed with swash", and "Humphrey" are clearly only parts of new words: the new words found by the existing
method are tied to the configured window size, so their length cannot exceed the window size. This method,
by contrast, is based on left-right recursion, and the new words it finds are not limited by string length.
We used three metrics, precision, recall, and the F value (F-measure), to evaluate the new word discovery
results, where
Precision is computed as:
$$P = \frac{n_1}{n_2}$$
Recall is computed as:
$$R = \frac{n_1}{n_3}$$
The F value is computed as:
$$F = \frac{2PR}{P + R}$$
In the formulas above, $n_1$ is the number of new words correctly identified; $n_2$ is the total number of
word strings identified; $n_3$ is the total number of new words in the corpus.
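The three metrics can be sketched in a few lines of Python. The counts n1, n2, and n3 follow the definitions above; the function name `evaluate` is illustrative, and the F value is assumed here to be the usual harmonic mean of precision and recall, since the formula itself is not legible in this text.

```python
def evaluate(n1: int, n2: int, n3: int) -> tuple[float, float, float]:
    """Return (precision, recall, F) from the three counts.

    n1: new words correctly identified
    n2: total word strings identified
    n3: total new words in the corpus
    """
    precision = n1 / n2 if n2 else 0.0
    recall = n1 / n3 if n3 else 0.0
    denom = precision + recall
    # Assumption: F is the harmonic mean of precision and recall.
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value


p, r, f = evaluate(n1=8, n2=10, n3=16)
```

The zero guards only keep the sketch from dividing by zero on an empty result set; they are not part of the described method.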
Table 2. Evaluation results for each metric
As the evaluation results in Table 2 show, our new word discovery method markedly improves the discovery
of longer new words, and it also improves on the other metrics.
(3) Left recursion method
Left recursion is carried out as follows:
First, take any element from the location set and denote it word;
Second, compute the left information entropy of word according to formula (4) and its example;
Third, check against the preset information entropy threshold whether the computed left information entropy
meets the threshold. If it does not, compute the left-adjacent set of word, take any element pre from that
set, combine pre and word into pre+word, return to the previous step to recompute the left information
entropy, and test again, and so on;
Fourth, if the computed left information entropy meets the threshold, compute the mutual information
according to formulas (2) and (3);
Fifth, check against the preset mutual information threshold whether the computed mutual information meets
the threshold. If it does not, compute the corresponding left-adjacent set, take any element pre from that
set, combine pre and word into pre+word, return to the previous step to recompute the left information
entropy, and test again, and so on;
Sixth, if the computed mutual information meets the threshold, check whether word or pre+word already
exists in the dictionary. If it does not, add it to dictionary P1; if it does, take a new word from the
location set and repeat the preceding steps.
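The entropy test that drives the left recursion can be illustrated with a small Python sketch. Formula (4) is not reproduced in this part of the text, so the sketch assumes the common definition: the Shannon entropy of the distribution of characters that appear immediately to the left of the candidate. The function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def left_neighbors(sentences, word):
    """Count the characters that appear immediately to the left of `word`."""
    counts = Counter()
    for s in sentences:
        start = s.find(word)
        while start != -1:
            if start > 0:
                counts[s[start - 1]] += 1
            start = s.find(word, start + 1)
    return counts


def left_entropy(sentences, word):
    """Shannon entropy (base 2) of the left-neighbor distribution (assumed form)."""
    counts = left_neighbors(sentences, word)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


corpus = ["abcd", "xbcd", "abce"]
h_bc = left_entropy(corpus, "bc")  # preceded by "a" twice and "x" once
h_cd = left_entropy(corpus, "cd")  # always preceded by "b"
```

A candidate whose left neighbor is always the same character, such as "cd" here, has zero left entropy, which is the situation in which the steps above extend the candidate to pre+word and test again.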
(4) Right recursion method
Right recursion is carried out as follows:
First, take any element from the location set and denote it word;
Second, compute its right information entropy according to formula (4) and its example;
Third, check against the preset information entropy threshold whether the computed right information
entropy meets the threshold. If it does not, compute the right-adjacent set of word, take any element suf
from that set, combine word and suf into word+suf, return to the previous step to recompute the right
information entropy, and test again, and so on;
Fourth, if the computed right information entropy meets the threshold, compute the mutual information
according to formulas (2) and (3);
Fifth, check against the preset mutual information threshold whether the computed mutual information meets
the threshold. If it does not, compute the corresponding right-adjacent set, take any element suf from that
set, combine word and suf into word+suf, return to the previous step to recompute the right information
entropy, and test again, and so on;
Sixth, if the computed mutual information meets the threshold, check whether word or word+suf already
exists in the dictionary. If it does not, add it to dictionary P2; if it does, take a new word from the
location set and repeat the preceding steps.
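The control flow of the right recursion can be sketched as follows. This is an illustration under several assumptions, not the patented implementation: right entropy is taken to be the Shannon entropy of the right-neighbor distribution, the mutual information of formulas (2) and (3) is passed in as a caller-supplied function `mi_fn`, every element of the right-adjacent set is explored rather than a single arbitrary one, and `max_len` is a guard added here to bound the search.

```python
import math
from collections import Counter


def right_neighbors(sentences, word):
    """Count the characters that appear immediately to the right of `word`."""
    counts = Counter()
    for s in sentences:
        start = s.find(word)
        while start != -1:
            end = start + len(word)
            if end < len(s):
                counts[s[end]] += 1
            start = s.find(word, start + 1)
    return counts


def right_entropy(sentences, word):
    """Shannon entropy (base 2) of the right-neighbor distribution (assumed form)."""
    counts = right_neighbors(sentences, word)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def grow_right(sentences, word, mi_fn, h_min, mi_min, max_len=12):
    """Extend `word` to the right until both thresholds are met; return dictionary P2."""
    p2, seen, stack = set(), set(), [word]
    while stack:
        cand = stack.pop()
        if cand in seen or len(cand) > max_len:
            continue
        seen.add(cand)
        if right_entropy(sentences, cand) >= h_min and mi_fn(cand) >= mi_min:
            p2.add(cand)  # a set ignores candidates that are already present
        else:
            # Threshold not met: try every word+suf from the right-adjacent set.
            stack.extend(cand + suf for suf in right_neighbors(sentences, cand))
    return p2


corpus = ["abcx", "abcy", "abcz", "qabc"]
found = grow_right(corpus, "a", mi_fn=lambda c: 1.0, h_min=1.0, mi_min=0.5)
```

Starting from "a", the sketch passes through "ab" (always followed by "c") and stops at "abc", whose right neighbors vary. The left recursion is the mirror image, prepending pre instead of appending suf.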
Claims (9)
1. A left-right recursion new word discovery method, comprising eight steps: corpus preprocessing [1],
location set calculation [2], set traversal [3], inclusion judgment [4], word frequency calculation [5],
left recursion [6], right recursion [7], and merging [8]:
Corpus preprocessing [1]: the input corpus is processed by regular-expression filtering, full-width/half-width
conversion, whitespace deletion, deletion of irrelevant special symbols, non-text deletion, and sentence
splitting at punctuation;
Location set calculation [2]: compute the positions at which the words of the input corpus occur in the
input corpus;
Set traversal [3]: traverse the location set;
Inclusion judgment [4]: judge whether each element of the location set satisfies the inclusion rules;
Word frequency calculation [5]: compute the frequency with which each word occurs;
Left recursion [6]: for each word, take the words to its left in turn to form candidate new words and judge them;
Right recursion [7]: for each word, take the words to its right in turn to form candidate new words and judge them;
Merging [8]: merge the set of new words found by left recursion with the set of new words found by right recursion.
2. The left-right recursion new word discovery method according to claim 1, characterized in that: corpus
preprocessing [1] processes the input corpus by regular-expression filtering, full-width/half-width
conversion, whitespace deletion, deletion of irrelevant special symbols, non-text deletion, and sentence
splitting at punctuation; wherein regular-expression filtering deletes the HTML tags and XML tags contained
in the corpus; full-width/half-width conversion converts full-width characters in the corpus to half-width
characters and Traditional Chinese to Simplified Chinese; whitespace deletion deletes redundant spaces,
newlines, and tabs in the corpus; deletion of irrelevant special symbols deletes the ASCII codes,
special-domain encodings, and garbled symbols contained in the corpus; non-text deletion deletes picture,
sound, and video data in the text; sentence splitting divides the corpus into individual sentences at full
stops, exclamation marks, question marks, ellipses, semicolons, spaces, and newlines; to avoid errors in the
new-word metrics caused by large numbers of identical sentences in the corpus, a hash value is computed for
each resulting sentence, and sentences with identical hash codes are deduplicated.
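A minimal Python sketch of part of this preprocessing is given below, using only the standard library. It is an assumption-laden illustration: Unicode NFKC normalization stands in for full-width to half-width conversion, SHA-256 stands in for the unspecified hash function, the tag pattern is a simple approximation, and Traditional-to-Simplified conversion, special-symbol deletion, and non-text deletion are omitted. All names are illustrative.

```python
import hashlib
import re
import unicodedata

# Approximate pattern for HTML/XML tags.
TAG_RE = re.compile(r"<[^>]+>")
# Sentence delimiters after normalization: full stop, exclamation mark,
# question mark, semicolon, and whitespace (NFKC turns an ellipsis into "...").
SPLIT_RE = re.compile(r"[。.!?;\s]+")


def preprocess(text: str) -> list[str]:
    """Strip tags, normalize width, split into sentences, drop duplicates."""
    text = TAG_RE.sub("", text)
    # NFKC maps full-width letters, digits, and punctuation to half-width forms.
    text = unicodedata.normalize("NFKC", text)
    sentences, seen = [], set()
    for sentence in SPLIT_RE.split(text):
        if not sentence:
            continue
        digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        if digest not in seen:  # keep only the first sentence per hash code
            seen.add(digest)
            sentences.append(sentence)
    return sentences


result = preprocess("<p>ＡＢＣ。你好！你好！</p>\n新词 发现")
```

Splitting on whitespace also removes redundant spaces, newlines, and tabs as a side effect. Splitting on "." will break decimal numbers, which a real pipeline would need to handle.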
3. The left-right recursion new word discovery method according to claim 1, characterized in that: location
set calculation [2] computes the positions at which the words of the input corpus occur in the input corpus;
the location set is computed as:
$$W = \{(w_1, POS_1), (w_2, POS_2), \ldots, (w_m, POS_m)\}$$
where $w_1, w_2, \ldots, w_m$ denote the mutually distinct words that occur in the input corpus; each pair $(w_i, POS_i)$ associates the i-th word $w_i$ with $POS_i$, the set of positions at which $w_i$ occurs; each element of $POS_i$ is the position of the j-th occurrence of $w_i$ in the input corpus.
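The location set can be pictured as a mapping from each distinct word to the list of positions where it occurs. The sketch below assumes the corpus has already been split into tokens (single characters in the example); the function name `build_location_set` and the use of token indices as positions are illustrative choices, not details given in the claim.

```python
from collections import defaultdict


def build_location_set(tokens):
    """Map each distinct token to the ordered list of indices where it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return dict(positions)


location_set = build_location_set(list("abcab"))
```

The j-th entry of each list is the position of the j-th occurrence of that word, which matches the description of the elements of the set.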
4. The left-right recursion new word discovery method according to claim 1, characterized in that: set
traversal [3] traverses the location set; denoting the location set as W, each word w_i is taken from W in
turn and stored in the variable word, i.e., word = w_i, i = i + 1.
5. The left-right recursion new word discovery method according to claim 1, characterized in that: inclusion
judgment [4] judges whether each element of the location set satisfies the inclusion rules; the inclusion
rules accept words of five types: substantive nouns, derivatives, abbreviations, compound words, and
numeric combination words.
6. The left-right recursion new word discovery method according to claim 1, characterized in that: word
frequency calculation [5] computes the frequency with which each word occurs; the frequency of a character
string X is:
$$\frac{N(X)}{N}$$
where N(X) is the number of times the character string X occurs and N is the total number of words in the corpus.
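As a small illustration, the frequency N(X)/N can be computed for every word at once. The sketch assumes a pre-tokenized corpus and uses the token count as N; the function name is illustrative.

```python
from collections import Counter


def frequencies(tokens):
    """Return N(X) / N for every distinct token X, with N the number of tokens."""
    total = len(tokens)
    counts = Counter(tokens)
    return {token: count / total for token, count in counts.items()}


freq = frequencies(list("aab"))
```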
7. The left-right recursion new word discovery method according to claim 1, characterized in that: left
recursion [6] means that, for each word, the words to its left are taken in turn to form candidate new words,
which are then judged; denoting the current word as word, the steps comprise: (1) computing the left
information entropy of word; (2) checking against the preset information entropy threshold whether the
computed left information entropy meets the threshold and, if it does not, computing the left-adjacent set of
word, the left-adjacent set being the set of words to the left of word; (3) taking any element pre from the
left-adjacent set, combining pre and word into pre+word, returning to the previous step to recompute the left
information entropy, and judging again; (4) if the computed left information entropy meets the threshold,
computing the mutual information; (5) checking against the preset mutual information threshold whether the
computed mutual information meets the threshold and, if it does not, computing the corresponding
left-adjacent set; (6) taking any element pre from the left-adjacent set, combining pre and word into
pre+word, returning to the previous step to recompute the left information entropy, and judging again, and so
on; (7) if the computed mutual information meets the threshold, judging whether word or pre+word already
exists in the dictionary; if it does not, adding it to dictionary P1; if it does, taking a new word from the
location set and repeating the preceding steps.
8. The left-right recursion new word discovery method according to claim 1, characterized in that: right
recursion [7] means that, for each word, the words to its right are taken in turn to form candidate new
words, which are then judged; denoting the current word as word, the steps comprise: (1) computing the right
information entropy of word; (2) checking against the preset information entropy threshold whether the
computed right information entropy meets the threshold and, if it does not, computing the right-adjacent set
of word, the right-adjacent set being the set of words to the right of word; (3) taking any element suf from
the right-adjacent set, combining word and suf into word+suf, returning to the previous step to recompute the
right information entropy, and judging again; (4) if the computed right information entropy meets the
threshold, computing the mutual information; (5) checking against the preset mutual information threshold
whether the computed mutual information meets the threshold and, if it does not, computing the corresponding
right-adjacent set; (6) taking any element suf from the right-adjacent set, combining word and suf into
word+suf, returning to the previous step to recompute the right information entropy, and judging again, and
so on; (7) if the computed mutual information meets the threshold, judging whether word or word+suf already
exists in the dictionary; if it does not, adding it to dictionary P2; if it does, taking a new word from the
location set and repeating the preceding steps.
9. The left-right recursion new word discovery method according to claim 1, characterized in that: merging
[8] merges the set of new words found by left recursion with the set of new words found by right recursion;
denoting the set found by left recursion as P1 and the set found by right recursion as P2, merging P1 and P2
means taking the intersection of P1 and P2.
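Because the merge is defined as an intersection, only candidates confirmed from both directions survive. A one-function Python sketch, with illustrative names and toy data:

```python
def merge(p1, p2):
    """Keep only the candidates found by both left and right recursion."""
    return set(p1) & set(p2)


final_words = merge({"abc", "bcd", "xyz"}, {"abc", "xyz", "klm"})
```

A string that looks complete on its left boundary but is still a fragment on its right (or the reverse) appears in only one of the two dictionaries and is therefore dropped.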
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611152464.XA CN106649666A (en) | 2016-11-30 | 2016-11-30 | Left-right recursion-based new word discovery method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611152464.XA CN106649666A (en) | 2016-11-30 | 2016-11-30 | Left-right recursion-based new word discovery method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649666A true CN106649666A (en) | 2017-05-10 |
Family
ID=58822481
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611152464.XA Pending CN106649666A (en) | 2016-11-30 | 2016-11-30 | Left-right recursion-based new word discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649666A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101131705A (en) * | 2007-09-27 | 2008-02-27 | 中国科学院计算技术研究所 | New word discovering method and system thereof |
CN104572622A (en) * | 2015-01-05 | 2015-04-29 | 语联网(武汉)信息技术有限公司 | Term filtering method |
CN106095736A (en) * | 2016-06-07 | 2016-11-09 | 华东师范大学 | A kind of method of field neologisms extraction |
Non-Patent Citations (1)
Title |
---|
邢恩军 等: "基于上下文词频词汇量指标的新词发现方法", 《计算机应用与软件》 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107103902B (en) * | 2017-06-14 | 2020-02-04 | 上海适享文化传播有限公司 | Complete speech content recursive recognition method |
CN107103902A (en) * | 2017-06-14 | 2017-08-29 | 上海适享文化传播有限公司 | Complete speech content recurrence recognition methods |
CN107992570A (en) * | 2017-11-29 | 2018-05-04 | 北京小度信息科技有限公司 | Character string method for digging, device, electronic equipment and computer-readable recording medium |
CN108021558A (en) * | 2017-12-27 | 2018-05-11 | 北京金山安全软件有限公司 | Keyword recognition method and device, electronic equipment and storage medium |
CN108595433A (en) * | 2018-05-02 | 2018-09-28 | 北京中电普华信息技术有限公司 | A kind of new word discovery method and device |
CN108846033A (en) * | 2018-05-28 | 2018-11-20 | 北京邮电大学 | The discovery and classifier training method and apparatus of specific area vocabulary |
CN108846033B (en) * | 2018-05-28 | 2022-04-08 | 北京邮电大学 | Method and device for discovering specific domain vocabulary and training classifier |
CN109299230A (en) * | 2018-09-06 | 2019-02-01 | 华泰证券股份有限公司 | A kind of customer service public sentiment hot word data digging system and method |
CN110222157A (en) * | 2019-06-20 | 2019-09-10 | 贵州电网有限责任公司 | A kind of new word discovery method based on mass text |
CN110674252A (en) * | 2019-08-26 | 2020-01-10 | 银江股份有限公司 | High-precision semantic search system for judicial domain |
CN112633852A (en) * | 2020-12-30 | 2021-04-09 | 广东电网有限责任公司电力调度控制中心 | Examination system of business document |
CN113609844A (en) * | 2021-07-30 | 2021-11-05 | 国网山西省电力公司晋城供电公司 | Electric power professional word bank construction method based on hybrid model and clustering algorithm |
CN113609844B (en) * | 2021-07-30 | 2024-03-08 | 国网山西省电力公司晋城供电公司 | Electric power professional word stock construction method based on hybrid model and clustering algorithm |
CN115858771A (en) * | 2022-01-11 | 2023-03-28 | 北京中关村科金技术有限公司 | Word searching method and device and computer readable storage medium |
CN115495507A (en) * | 2022-11-17 | 2022-12-20 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
CN115495507B (en) * | 2022-11-17 | 2023-03-24 | 江苏鸿程大数据技术与应用研究院有限公司 | Engineering material information price matching method, system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170510 |