CN106649666A

CN106649666A - Left-right recursion-based new word discovery method

Info

Publication number: CN106649666A
Application number: CN201611152464.XA
Authority: CN
Inventors: 尹云飞; 刘欢; 曾亚飞
Original assignee: Chongqing University; Langchao Electronic Information Industry Co Ltd
Current assignee: Chongqing University; Inspur Electronic Information Industry Co Ltd
Priority date: 2016-11-30
Filing date: 2016-11-30
Publication date: 2017-05-10

Abstract

The invention discloses a left-right recursion-based new word discovery method, which belongs to the technical field of search engines and derives from lexical analysis, quick retrieval use and research and development practice. According to the method, the randomness of a left neighbor set and a right neighbor set of a character string is measured by using an information entropy; new words are classified into concrete nouns, derived words, abbreviated words, compound words and digit combination words; and the processing steps of the method include corpus preprocessing, position set calculation, set traversing, reception judgment, word frequency calculation, left recursion, right recursion and combination.

Description

A left-right recursive new word discovery method

Technical field

The invention belongs to the technical field of search engines and derives from the construction and practical use of lexical analysis and fast retrieval. It can be used not only for the efficient classification and retrieval of ordinary commercial data, but also for specialized classification in special domains such as public security and military affairs.

Background art

With the rapid informatization of every industry, the volume of data in each vertical domain keeps growing, and much of that data is useless and needs no special treatment.

How to retrieve quickly from massive data, and how to mine useful information efficiently and intelligently from a vertical domain, has become a major problem in the development of intelligent search engines. Many search engine techniques have appeared as the field has developed, but most of them cannot effectively retrieve or intelligently recommend the specialized terms and special modes of expression found in special domains. Existing search engine technology therefore cannot meet current industry demand, which has driven the development of distributed intelligent search engines.

Vertical domains frequently contain specialized vocabulary and new terms that are absent from existing dictionaries. If the corpus contains such words, the lexical analysis logic module will make errors when processing it. An automated collection function for specialized vocabulary and new terms is therefore needed, so that these words can be added to a dictionary and a dictionary dedicated to the vertical domain can be built. This improves the processing efficiency and precision of the lexical analysis logic module in the search engine, and thereby the search efficiency and accuracy of the search engine.

In general, when data enters the distributed search engine it also enters the new word dictionary construction flow: new word discovery is performed on the input corpus, and any new word found that does not occur in the existing dictionary is added to the new word dictionary.

Existing new word discovery methods are usually either rule-based or statistics-based. The earliest methods were rule-based: they study the internal and external construction rules of new words to form a rule base, and use that rule base as the criterion for finding new words. Statistics-based methods find all strings of length no greater than n, compute word frequency and mutual information for them, and accept a string as a new word if these measures meet preset thresholds.

Both approaches have advantages and disadvantages. Rule-based methods achieve relatively high accuracy and efficiency, but building the rule base requires substantial manual effort for rule extraction, and the rule base must be updated continually as language evolves; the method is therefore not adaptive and scales poorly. Statistics-based methods automate the discovery process, but they find many high-frequency garbage strings and cannot find very long new words, such as ethnic minority personal names and transliterated names.

A survey of new word discovery techniques shows that most current methods search for new words with a window-based pattern, which prevents longer new words from being found. We have invented a new word discovery method based on left-right recursion that substantially improves the accuracy of new word discovery. With this method an adaptive, high-accuracy vertical-domain dictionary can be built easily, and the dictionary becomes more complete as the volume of data grows. For special domains, the method can greatly improve the accuracy of word segmentation when data is indexed.

The left-right recursive new word discovery method effectively solves the above problems faced by existing methods.

Summary of the invention

The invention discloses a left-right recursive new word discovery method. The method consists of eight steps: corpus preprocessing, position set calculation, set traversal, reception judgment (whether a candidate may be admitted to the dictionary), word frequency calculation, left recursion, right recursion, and merging.

The design of the method is described in detail below.

The method evaluates a candidate new word by three indicators: word frequency, mutual information, and information entropy.

(1) Word frequency

Count how often a string occurs in the corpus. The more frequently a string occurs, the more likely it is to be a new word; once its frequency reaches a threshold, the string is considered a candidate new word. The frequency is computed as:

$$TF(X) = \frac{N(X)}{N} \qquad (1)$$

where $N(X)$ is the number of times string X occurs and $N$ is the total number of characters in the corpus.
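The word frequency indicator can be sketched in a few lines of Python. This is a minimal illustration, assuming that N(X) counts overlapping occurrences and that N is the character count of the corpus; the function name `word_frequency` is hypothetical.

```python
def word_frequency(x: str, corpus: str) -> float:
    """Relative frequency N(X) / N of string x in the corpus.

    Assumption: overlapping occurrences are all counted, and N is the
    number of characters in the corpus.
    """
    if not x or not corpus:
        return 0.0
    # Count every start index at which x occurs.
    n_x = sum(1 for i in range(len(corpus) - len(x) + 1) if corpus.startswith(x, i))
    return n_x / len(corpus)


# Latin letters stand in for Chinese characters in this toy corpus.
print(word_frequency("ab", "abcabcab"))
```

A string would then be kept as a candidate only if this value reaches the chosen threshold.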

(2) Mutual information

Mutual information originated as a measure in information theory; it quantifies the information shared between one set of events and another. The larger the mutual information between two event sets, the stronger their correlation, and vice versa. Mutual information is a common tool for analysing computational linguistic models. Because it places no restriction on the relationship between feature words and categories, it is frequently used to measure the degree of association between features and categories in text classification.

In new word discovery, mutual information reveals the degree of association between one string and another. The mutual information of strings X and Y is computed as:

$$MI(X, Y) = \log \frac{p(XY)}{p(X)\,p(Y)} \qquad (2)$$

where X and Y are strings or single characters; $p(XY)$ is the probability that X and Y occur together, as the concatenated string XY, in the input corpus; and $p(X)$ and $p(Y)$ are the probabilities that X and Y, respectively, occur in the input corpus.

For example, in a corpus of 30 million characters, "airport" occurs 215 times, a probability of about 7.1667 × 10^-6; "aircraft" occurs 2899 times, a probability of about 9.6633 × 10^-5; and "field" occurs 5384 times, a probability of about 1.7947 × 10^-4. If "aircraft" and "field" were unrelated, the probability of "airport" (that is, "aircraft" followed by "field") would in theory be 9.6633 × 10^-5 × 1.7947 × 10^-4, about 1.7343 × 10^-8. The actual value, however, is about 413 times the theoretical one, which shows that "aircraft" and "field" are not joined by chance; the two strings have a definite association. The same string could equally be split at a different point, into "fly" followed by a shorter word that also means "airport", so in new word discovery we usually measure a string by its average mutual information, defined as follows:
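The arithmetic of this example can be checked directly from the three counts. The sketch below recomputes the probabilities, the product expected under independence, and the ratio between observed and expected values; the base-2 logarithm is an assumption, since the text does not state a base.

```python
import math

N = 30_000_000                       # corpus size in characters
n_xy, n_x, n_y = 215, 2899, 5384     # counts of "airport", "aircraft", "field"

p_xy, p_x, p_y = n_xy / N, n_x / N, n_y / N
expected = p_x * p_y                 # probability if the two parts were independent
ratio = p_xy / expected              # how many times more often the pair really occurs
pmi = math.log2(ratio)               # mutual information of the split (base 2 assumed)

print(f"p(XY)={p_xy:.4e}  p(X)p(Y)={expected:.4e}  ratio={ratio:.1f}  MI={pmi:.2f}")
```

Because N cancels, the ratio equals `n_xy * N / (n_x * n_y)`, so it can be computed from raw counts without forming the small probabilities first.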

where X₁, X₂, ..., X_n denote strings or single characters, and p(X₁, X₂, ..., X_n) is the probability that X₁, X₂, ..., X_n occur together in the input corpus.
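The exact form of the average mutual information is not reproduced in this text. One commonly used form, given here purely as an assumption, compares the probability of the whole string with the mean of the independence products over every two-way split. The function name and the `prob` callback are hypothetical.

```python
import math
from typing import Callable


def average_mutual_information(s: str, prob: Callable[[str], float]) -> float:
    """Assumed form: log2( p(s) / mean over k of p(s[:k]) * p(s[k:]) ).

    `prob` maps a string to its probability in the corpus. This is an
    illustrative definition, not necessarily the one used in the patent.
    """
    if len(s) < 2:
        raise ValueError("a single character cannot be split")
    products = [prob(s[:k]) * prob(s[k:]) for k in range(1, len(s))]
    return math.log2(prob(s) / (sum(products) / len(products)))


# Toy probabilities; Latin letters stand in for Chinese characters.
probs = {"abc": 0.25, "a": 0.25, "bc": 0.25, "ab": 0.5, "c": 0.5}
print(average_mutual_information("abc", probs.__getitem__))
```

Averaging over all split points means that a string is not rewarded merely because one particular split happens to pair two rare parts.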

(3) Information entropy

Information entropy quantifies uncertainty: the higher the entropy, the more random the events and the less certain information is available about them. Information entropy is computed as:

$$H(X) = -\sum_{i} p(X_i) \log p(X_i) \qquad (4)$$

where $p(X_i)$ is the probability that event $X_i$ occurs.

In new word discovery, information entropy measures the randomness of the left-neighbour set and the right-neighbour set of a string; the higher the randomness, the more likely the string is to be a new word. For example, in the corpus "I like to eat Chongqing hot pot, and I like to watch Chongqing beauties", "Chongqing" occurs twice; its left-neighbour set is {eat, see} and its right-neighbour set is {fire, beautiful} (the first characters of "hot pot" and "beauties"). Each neighbour occurs once, so applying the entropy formula gives:

$$H_L = H_R = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2$$

When scoring a string by information entropy, we generally use the average of its left and right entropies to measure its degree of freedom:

$$H(X) = \frac{H_L(X) + H_R(X)}{2} \qquad (5)$$

where $H_L(X)$ is the left information entropy of string X and $H_R(X)$ is its right information entropy.
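The "Chongqing" example can be reproduced with a short sketch. The neighbour sets are taken from the example, with English glosses standing in for the neighbouring characters. The base-2 logarithm and the use of an arithmetic mean for the left-right average are assumptions.

```python
import math
from collections import Counter


def neighbour_entropy(neighbours: Counter) -> float:
    """Shannon entropy (base 2, an assumed base) of a neighbour distribution."""
    total = sum(neighbours.values())
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


# English glosses stand in for the characters adjacent to "Chongqing".
left = Counter(["eat", "see"])
right = Counter(["fire", "beautiful"])

h_left = neighbour_entropy(left)
h_right = neighbour_entropy(right)
h_avg = (h_left + h_right) / 2       # assumed: arithmetic mean of the two sides

print(h_left, h_right, h_avg)
```

A string whose neighbours are always the same character would score zero on that side, which signals that it is probably a fragment of a longer word.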

Existing new word discovery techniques cannot find long new words and also return garbage strings; the present method effectively solves both problems.

(1) Input corpus preprocessing

Input corpora are often irregularly formatted, or contain useless garbage strings mixed into the Chinese and English text. This interferes severely with new word discovery and introduces errors into the computed indicators. The method therefore preprocesses the input corpus as text, in the following main steps:

Step 1: Use regular-expression filtering to delete HTML tags, XML tags, and other special tags unrelated to the text.

Step 2: Convert full-width symbols in the corpus to half-width, convert traditional Chinese to simplified Chinese, and normalize the first line.

Step 3: Delete redundant whitespace characters such as spaces, newlines, and tabs.

Step 4: Delete special characters unrelated to the text, including ASCII codes, special encodings, and garbled characters.

Step 5: Delete non-text data such as pictures, sound, and video.

Step 6: Split the corpus into individual sentences at sentence-break markers, namely "。", "！", "？", "…", "；", spaces, and newlines. If a sentence contains paired symbols such as quotation marks or book-title marks, the left and right symbols must match. Finally, number the resulting sentences.

Step 7: To prevent large numbers of identical sentences from distorting the new word indicators, compute a hash of each sentence produced in Step 6 and remove sentences with identical hash codes, so that the input corpus does not contain many identical sentences or paragraphs.
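The preprocessing steps above can be sketched as a single function. This is a simplified illustration under stated assumptions: traditional-to-simplified conversion, first-line normalization, non-text removal, and paired-symbol matching are omitted; SHA-256 is an arbitrary choice of hash; and all names are hypothetical.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML/XML tags
SENTENCE_END_RE = re.compile(r"[。!?;…\s]+")    # delimiters after width conversion
# Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space to half-width.
FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20


def preprocess(raw: str) -> list:
    """Return (sentence number, sentence) pairs with duplicate sentences removed."""
    text = TAG_RE.sub("", raw)                           # step 1: strip tags
    text = text.translate(FULL_TO_HALF)                  # step 2: full-width -> half-width
    text = re.sub(r"\s+", " ", text).strip()             # step 3: collapse whitespace
    text = "".join(ch for ch in text                     # step 4: drop control and
                   if ch.isprintable() and ch != "\ufffd")  # replacement characters
    sentences = [s for s in SENTENCE_END_RE.split(text) if s]  # step 6: split
    seen, result = set(), []
    for number, sentence in enumerate(sentences):        # step 6: numbering
        digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        if digest not in seen:                           # step 7: hash deduplication
            seen.add(digest)
            result.append((number, sentence))
    return result


print(preprocess("<p>我爱重庆！我爱重庆！</p>\n<b>你好　世界。</b>"))
```

Each surviving sentence keeps the number it received before deduplication, so positions in the original text remain traceable.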

The preprocessing used by the method makes the corpus regular, accurate, and meaningful. It largely avoids indicator errors caused by defects and noise in the corpus itself, and so improves the accuracy of new word discovery.

(2) New word classification design

The method classifies new words as concrete nouns, derived words, abbreviated words, compound words, and digit combination words. In new word discovery, Chinese corpora are harder to process than English corpora because Chinese grammar is particularly complex. English words are separated directly by spaces, which makes discovery simple: split on spaces and match the resulting words against a dictionary. In terms of morphology, English word forms change in complex but entirely regular ways, whereas changes in Chinese follow no fixed rules. Chinese has strict word order, and different orders carry entirely different meanings; for example, "not entirely approve" and "entirely not approve" mean quite different things. Function words are used very widely in Chinese; studies show that the usage frequency of the most common function words reaches 3%-5%, so new word discovery often mis-identifies a new word (the example given is the slang term rendered "Cock silks"). Chinese characters form a fairly stable set: although the words composed from them vary endlessly, new characters rarely appear, which is an advantage for Chinese new word discovery. In summary, the characteristics of Chinese differ greatly from those of other languages, and special information-processing methods and rules are needed to improve the accuracy of Chinese new word discovery effectively.

The method targets words that are not included in existing dictionaries. Research shows that most new words fall into the following five categories:

The first category is concrete nouns (named entities): mainly personal names, place names, and organization names. Personal names are further divided by their features into Han Chinese names, ethnic minority names, transliterated English names, and so on. Concrete nouns are one of the main kinds of word this method discovers.

The second category is derived words: new words formed by adding a specific suffix to an existing word, such as "aging" (literally "old age" plus the suffix "-ization"). The method can discover derived words of this kind.

The third category is abbreviated words: for some very long terms, a few of their characters are used to stand for the whole term. For example, the abbreviation of "China national football team" is "national football team", and the abbreviation of the "U.S. men's professional basketball league" is "NBA".

The fourth category is compound words: words formed mainly from verbs and nouns by merging two or more words, such as "software engineering" and "Area A, Chongqing University, Shaba District, Chongqing, China". The former combines two nouns; the latter combines several place-name phrases, namely "China", "Shaba District", "Chongqing University", and "Area A".

The fifth category is digit combination words: words formed by combining digits, such as dates, telephone numbers, and codes.

We have designed a rule base for the above categories. Rules can be added to the rule base either automatically or by hand, and new word discovery becomes more accurate as the number of rules grows.

Because rules are sometimes mutually exclusive, the method adopts a prioritization strategy: when two rules conflict, the rule to apply is selected by priority.

(3) New word evaluation indicator design

The evaluation indicators are word frequency, mutual information, and information entropy. Word frequency is the number of times a string occurs; only when the frequency reaches the threshold do we consider the string a possible new word. Frequency alone is not enough, however. For example, a string consisting of a function word followed by "film" occurs 492 times, while "cinema" occurs only 323 times. The former is more frequent, yet we are more inclined to treat "cinema" as a word, because "film" and "institute" (the final character of "cinema") cohere more tightly than the function word and "film". So once the frequency threshold is met, mutual information is also needed to show the internal cohesion of a word. Judging by word frequency and mutual information alone is still not enough; the external behaviour of a string must also be examined through information entropy. Take the two strings "century" and "deed". With "century" we can say only a limited number of things, such as "last century", "next century", and "this century", so its left-neighbour set is very small and we are more inclined to treat "last century", "next century", and "this century" themselves as new words. With "deed" we can say "brilliant deeds", "outstanding deeds", "a person's deeds", "a candidate's deeds", and many more; its left-neighbour set is very large, so we can take "deed" itself as a new word. Information entropy measures how freely a string is used: the larger its left-neighbour and right-neighbour sets, the more likely the string is a new word. After the left and right information entropies of a string are computed, left recursion or right recursion is carried out according to the left or right entropy.

The method can discover new words of any length. Its steps are as follows:

Step 1: Preprocess the corpus, as described above.

Step 2: Position set calculation, i.e. compute:

$$W = \{(w_1, POS_1), (w_2, POS_2), \ldots, (w_m, POS_m)\}, \qquad POS_i = \{pos_{i1}, pos_{i2}, \ldots\}$$

where $w_1, w_2, \ldots, w_m$ are the mutually distinct characters that occur in the input corpus; $(w_i, POS_i)$ is one element of W; $POS_i$ is the set of positions at which the i-th character $w_i$ occurs; and $pos_{ij}$, an element of $POS_i$, is the position of the j-th occurrence of $w_i$ in the input corpus.

For example, for the corpus "I am Chinese, and I love China", the set W is {('I', {0, 6}), ('being', {1}), ('in', {2, 8}), ('state', {3, 9}), ('people', {4}), ('love', {7})}. Each entry is one Chinese character, glossed here in English, and position 5 is the comma, which is not recorded. The position set W thus records the positions at which each character occurs.
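The position set can be built in one pass over the corpus. The sketch below assumes the example sentence was originally 我是中国人，我爱中国 (a back-translation that reproduces the listed positions) and that punctuation is skipped; the function name is hypothetical.

```python
from collections import defaultdict


def position_set(corpus: str) -> dict:
    """Map each distinct character to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, ch in enumerate(corpus):
        if ch.isalnum():               # assumption: punctuation and spaces are not recorded
            positions[ch].append(index)
    return dict(positions)


# Assumed original of the example sentence "I am Chinese, and I love China".
W = position_set("我是中国人，我爱中国")
for ch, pos in W.items():
    print(ch, pos)
```

Storing positions rather than bare counts lets later steps look up the characters immediately to the left and right of every occurrence without rescanning the corpus.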

Step 3: Set traversal, i.e. traverse the set W, taking out each character w_i in turn: word = w_i, i = i + 1.

Step 4: Reception judgment, i.e. judge whether word satisfies the reception rules of the rule base. If not, return to Step 3; if so, go to Step 5. Reception refers to whether a new word conforms to the new word classification criteria, which comprise concrete nouns, derived words, abbreviated words, compound words, and digit combination words. We have designed a rule base for these five categories, whose rules can be added automatically or by hand; reception is therefore judged according to these five types.

Step 5: Word frequency calculation, i.e. compute the frequency with which word occurs. If it is below the threshold, return to Step 3; if it is above the threshold, go to Step 6.

Step 6: Left recursion, i.e. perform left recursion on word, denoted createPrefixTree(word, W), to obtain the left-recursion new word set P₁. It consists of the following sub-steps:

a. Compute the left information entropy and mutual information of word from W. If both meet the threshold requirements, go to sub-step b; otherwise go to sub-step c.

b. If the existing dictionary S does not contain word, add word to the new word set P₁. Then check whether the set W has been fully traversed: if not, return to Step 3; otherwise go to Step 7.

c. Compute the left-neighbour character set of word, pre = {pre₁, pre₂, ..., pre_k}, and repeat from sub-step a with each combined string pre_i + word.

Step 7: Right recursion, i.e. perform createSufTree(word, W) on word to obtain the right-recursion new word set P₂.

a. Compute the right information entropy and mutual information of word from W. If both meet the threshold requirements, go to sub-step b; otherwise go to sub-step c.

b. If the existing dictionary S does not contain word, add word to the new word set P₂. Then check whether the set W has been fully traversed: if not, return to Step 3; otherwise go to Step 8.

c. Compute the right-neighbour character set of word, suf = {suf₁, suf₂, ..., suf_k}, and repeat from sub-step a with each combined string word + suf_i.

Step 8: Merging, i.e. compute the intersection of the sets P₁ and P₂ to obtain the discovered new word set P₃.

Brief description of the drawings

The structure and workflow of the invention are illustrated below with reference to the accompanying drawings, in which:

Fig. 1 is a flow chart of the left-right recursive new word discovery method

Fig. 2 is a block diagram of input corpus preprocessing

Fig. 3 is a flow chart of left recursion

Fig. 4 is a flow chart of right recursion

Detailed description of the embodiments

The embodiments of the left-right recursive new word discovery method of the invention are further described below with reference to the drawings.

(1) Input corpus preprocessing

Input corpus preprocessing comprises seven steps: regular-expression filtering, half-width conversion, whitespace deletion, special character deletion, non-text deletion, sentence splitting, and hash deduplication.

The preprocessing steps can be reordered. This description gives one ordering as a verified implementation: regular-expression filtering, half-width conversion, whitespace deletion, special character deletion, non-text deletion, sentence splitting, and hash deduplication. Regular-expression filtering deletes HTML tags, XML tags, and other special tags unrelated to the text. Half-width conversion converts full-width symbols to half-width, converts traditional Chinese to simplified Chinese, and normalizes the first line. Whitespace deletion removes redundant spaces, newlines, and tabs. Special character deletion removes ASCII codes, special encodings, and garbled characters contained in the corpus. Non-text deletion removes non-text data from the text. Sentence splitting divides the corpus into individual sentences at the sentence-break markers "。", "！", "？", "…", "；", space, and newline; if a sentence contains paired symbols such as quotation marks or book-title marks, the left and right symbols must match; finally the resulting sentences are numbered. Hash deduplication computes a hash of each sentence and removes sentences with identical hash codes.

This preprocessing makes the corpus regular, accurate, and meaningful, and avoids indicator errors caused by defects and noise in the corpus itself.

(2) new word discovery method

The method is based on left-right recursion and can effectively discover very long new words such as ethnic minority personal names, transliterated English personal names, and transliterated English place names. Its basic flow is as follows. Take any element from the position set and denote it e. First compute its left information entropy and mutual information and compare them with the thresholds. If both are greater than or equal to the thresholds and the dictionary S does not contain e, add e to the dictionary P₁. If they are below the thresholds, compute the left-neighbour character set of e; then, for each element of that set, combine it with e, compute the left information entropy and mutual information of the combined string, and apply the same threshold test and the same test for adding to P₁: add the string if the conditions are met, otherwise continue the recursion. Right recursion proceeds in the same way and adds the qualifying new words to the dictionary P₂. Finally, take the intersection of P₁ and P₂ to obtain the final new word set.

We implemented the method and used 200 real articles from Baidu Baike on the theme of "film" as experimental data. Words such as ethnic minority personal names, transliterated personal names, and transliterated place names occur frequently in these 200 articles.

The specific implementation steps are as follows:

1. Run an existing new word discovery method on the 200 articles, record the new words found, and compute precision, recall, and F value.

2. Run the present new word discovery method on the 200 articles, record the new words found, and compute precision, recall, and F value.

3. Compare and analyse the results of the existing method and the present method.

Experimental results and analysis

Table 1. Comparison of the results of the new word discovery methods

As Table 1 shows, with the window size set to 4 the traditional method can only find new words whose string length is less than 4. Strings such as "Coronis handkerchief", "Leonardo", "livre", "speed with swash", and "Humphrey" are clearly only parts of new words, because the new words found by the existing method are closely tied to the configured window size: the length of a new word cannot exceed the window size. The present method, by contrast, is based on left-right recursion, and the new words it finds are not limited by string length.

We evaluated the new word discovery results using three indicators: precision, recall, and F value (F-measure), where

Precision is computed as:

$$P = \frac{n_1}{n_2}$$

Recall is computed as:

$$R = \frac{n_1}{n_3}$$

The F value is computed as:

$$F = \frac{2PR}{P + R}$$

In the above formulas, $n_1$ is the number of new words correctly identified; $n_2$ is the total number of strings identified; and $n_3$ is the total number of new words in the corpus.
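The three evaluation indicators follow directly from the counts n₁, n₂, and n₃. The sketch below computes them from a set of discovered strings and a set of reference new words; the F value is taken to be the usual harmonic mean of precision and recall, which is an assumption, and the function name and toy data are hypothetical.

```python
def evaluate(found: set, gold: set) -> tuple:
    """Return (precision, recall, F) for discovered strings against reference new words."""
    n1 = len(found & gold)             # new words correctly identified
    n2 = len(found)                    # all strings identified
    n3 = len(gold)                     # all new words in the corpus
    precision = n1 / n2 if n2 else 0.0
    recall = n1 / n3 if n3 else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Toy data, not the experimental results of the patent.
p, r, f = evaluate({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

The guards against empty sets simply avoid division by zero when a method returns nothing or the reference list is empty.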

Table 2. Evaluation results for each indicator

As the evaluation results in Table 2 show, our method markedly improves the discovery of longer new words, and it also improves on the other indicators.

(3) Left recursion method

Left recursion is implemented as follows:

First, take any element from the position set and denote it word.

Second, compute the left information entropy of word according to formula (4) and its example.

Third, check the computed left information entropy against the preset information entropy threshold. If it does not meet the threshold, compute the left-neighbour set of word, take any element pre from that set, combine pre and word into pre+word, return to the previous step to recompute the left information entropy, and judge again; and so on.

Fourth, if the computed left information entropy meets the threshold, compute the mutual information according to formulas (2) and (3).

Fifth, check the computed mutual information against the preset mutual information threshold. If it does not meet the threshold, compute the corresponding left-neighbour set, take any element pre from that set, combine pre and word into pre+word, return to the step that recomputes the left information entropy, and judge again; and so on.

Sixth, if the computed mutual information meets the threshold, check whether word or pre+word already exists in the dictionary. If it does not, add it to the dictionary P₁; if it does, take a new word from the position set and repeat the preceding steps.
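The left recursion described above can be sketched as a small recursive function. This is an illustration under several assumptions, not the patented implementation: the corpus is a list of preprocessed sentences rather than the position set W, the mutual information test uses the weakest two-way split as a stand-in for formulas (2) and (3), logarithms are base 2, and all names and thresholds are hypothetical.

```python
import math
from collections import Counter


def occurrences(s, sentences):
    """Yield (sentence, start index) for every occurrence of s, overlaps included."""
    for sentence in sentences:
        start = sentence.find(s)
        while start != -1:
            yield sentence, start
            start = sentence.find(s, start + 1)


def left_neighbours(s, sentences):
    """Count the characters that immediately precede s."""
    return Counter(sent[i - 1] for sent, i in occurrences(s, sentences) if i > 0)


def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def weakest_split_mi(s, sentences, total_chars):
    """Lowest mutual information over all two-way splits of s (an assumed stand-in)."""
    if len(s) < 2:
        return float("-inf")           # a single character has no split, so it always fails
    def p(x):
        return sum(1 for _ in occurrences(x, sentences)) / total_chars
    return min(math.log2(p(s) / (p(s[:k]) * p(s[k:]))) for k in range(1, len(s)))


def create_prefix_tree(word, sentences, known, h_min, mi_min, found=None):
    """Extend word leftwards until its left entropy and mutual information pass."""
    found = set() if found is None else found
    total_chars = sum(len(s) for s in sentences)
    neighbours = left_neighbours(word, sentences)
    if entropy(neighbours) >= h_min and weakest_split_mi(word, sentences, total_chars) >= mi_min:
        if word not in known:          # only strings missing from dictionary S are new
            found.add(word)
        return found
    for pre in neighbours:             # otherwise recurse on every pre + word
        create_prefix_tree(pre + word, sentences, known, h_min, mi_min, found)
    return found


# Latin letters stand in for Chinese characters.
sentences = ["xabc", "yabc", "zabc"]
print(create_prefix_tree("c", sentences, known=set(), h_min=1.0, mi_min=1.0))
```

Starting from "c", whose only left neighbour is "b", the function grows the string to "bc" and then "abc", where the left neighbours become varied and the test passes. The recursion ends because each call lengthens the string by one character and no neighbours exist at the start of a sentence. Right recursion would mirror this with following characters, and the final result is the intersection of the two sets.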

(4) Right recursion method

Right recursion is implemented as follows:

First, take any element from the position set and denote it word.

Second, compute the right information entropy of word according to formula (4) and its example.

Third, check the computed right information entropy against the preset information entropy threshold. If it does not meet the threshold, compute the right-neighbour set of word, take any element suf from that set, combine word and suf into word+suf, return to the previous step to recompute the right information entropy, and judge again; and so on.

Fourth, if the computed right information entropy meets the threshold, compute the mutual information according to formulas (2) and (3).

Fifth, check the computed mutual information against the preset mutual information threshold. If it does not meet the threshold, compute the corresponding right-neighbour set, take any element suf from that set, combine word and suf into word+suf, return to the step that recomputes the right information entropy, and judge again; and so on.

Sixth, if the computed mutual information meets the threshold, check whether word or word+suf already exists in the dictionary. If it does not, add it to the dictionary P₂; if it does, take a new word from the position set and repeat the preceding steps.

Claims

1. A left-right recursion new word discovery method, comprising eight steps: corpus preprocessing [1], position set calculation [2], set traversal [3], reception judgment [4], word frequency calculation [5], left recursion [6], right recursion [7], and combination [8]:

Corpus preprocessing [1]: the input corpus is processed by regular-expression filtering, full-width/half-width conversion, whitespace deletion, deletion of irrelevant special symbols, non-text deletion, and punctuation-based segmentation;

Position set calculation [2]: calculate the positions at which each word in the input corpus occurs in the input corpus;

Set traversal [3]: traverse the position set;

Reception judgment [4]: judge whether each element in the position set meets the requirements of the reception rule;

Word frequency calculation [5]: calculate the frequency with which each word occurs;

Left recursion [6]: for each word, successively take the words on its left to form new-word candidates and judge them;

Right recursion [7]: for each word, successively take the words on its right to form new-word candidates and judge them;

Combination [8]: merge the new-word set found by left recursion with the new-word set found by right recursion.

2. The left-right recursion new word discovery method according to claim 1, characterized in that: corpus preprocessing [1] processes the input corpus by regular-expression filtering, full-width/half-width conversion, whitespace deletion, deletion of irrelevant special symbols, non-text deletion, and punctuation-based segmentation; wherein regular-expression filtering deletes the HTML tags and XML tags contained in the corpus; full-width/half-width conversion converts full-width characters in the corpus into half-width characters and converts Traditional Chinese into Simplified Chinese; whitespace deletion deletes redundant spaces, line breaks, and tab characters in the corpus; deletion of irrelevant special symbols deletes the ASCII codes, special-domain codes, and garbled symbols contained in the corpus; non-text deletion deletes picture, sound, and video data in the text; punctuation-based segmentation splits the corpus into individual sentences at full stops, exclamation marks, question marks, ellipses, semicolons, spaces, and line breaks; and, in order to avoid errors in the new-word indicators caused by large numbers of identical sentences in the corpus, a hash value is computed for each segmented sentence and sentences with identical hash codes are deduplicated.
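A minimal sketch of part of this preprocessing chain is given below, covering tag removal, full-width to half-width conversion, punctuation-based segmentation, and hash-based deduplication. It is an illustration under assumptions, not the claimed implementation: the regular expressions, the choice of SHA-256 as the hash, and the names `to_half_width` and `preprocess` are all hypothetical, and Traditional-to-Simplified conversion, special-symbol deletion, and non-text deletion are omitted because they need mapping tables or rules the text does not specify.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")          # crude HTML/XML tag pattern (assumed)
SPLIT_RE = re.compile(r"[。!?;…\s]+")     # sentence delimiters after conversion


def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width punctuation, digits, letters
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def preprocess(raw):
    """Return the deduplicated list of sentences extracted from raw text."""
    text = to_half_width(TAG_RE.sub("", raw))
    sentences = [s for s in SPLIT_RE.split(text) if s]
    seen, unique = set(), []
    for sentence in sentences:
        digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
        if digest not in seen:             # drop sentences with an identical hash
            seen.add(digest)
            unique.append(sentence)
    return unique


raw = "<p>新词发现！新词发现！左右递归？</p>\n<b>ＡＢＣ１２３</b>"
sentences = preprocess(raw)
```

Splitting on whitespace also discards redundant spaces, line breaks, and tab characters, so in this sketch the whitespace-deletion step is folded into segmentation.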

3. The left-right recursion new word discovery method according to claim 1, characterized in that: position set calculation [2] calculates the positions at which each word in the input corpus occurs in the input corpus; wherein the position set is computed as:

\left\{\begin{matrix} W = \{(w_{1}, {POS}_{1}), (w_{2}, {POS}_{2}), \ldots, (w_{i}, {POS}_{i}), \ldots, (w_{m}, {POS}_{m})\} \\ {POS}_{i} = \{w_{i}{pos}_{i_{1}}, w_{i}{pos}_{i_{2}}, \ldots, w_{i}{pos}_{i_{n}}\} \end{matrix}\right.

Here w_1, w_2, ..., w_m denote the mutually distinct words that occur in the input corpus; (w_i, POS_i) pairs the i-th word with its position set POS_i; w_i pos_{i_j} is an element of that set and denotes the position of the j-th occurrence of the i-th word w_i in the input corpus; and POS_i denotes all positions at which the i-th word w_i occurs.
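One way to picture the structure W is as a mapping from each distinct candidate string to the list of offsets at which it occurs. The sketch below assumes that a "word" is any substring of up to `max_len` characters and that a position is a character offset into a single corpus string; both are illustrative assumptions, as is the function name `position_sets`.

```python
from collections import defaultdict


def position_sets(corpus, max_len=2):
    """Map every substring of length 1..max_len to its list of start offsets."""
    positions = defaultdict(list)
    for start in range(len(corpus)):
        for length in range(1, max_len + 1):
            if start + length <= len(corpus):
                positions[corpus[start:start + length]].append(start)
    return dict(positions)


W = position_sets("新词新词典", max_len=2)
# W["新词"] plays the role of POS_i for w_i = "新词"
```

Each key corresponds to a w_i and each list to its POS_i, so the number of occurrences of w_i is simply the length of the list.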

4. The left-right recursion new word discovery method according to claim 1, characterized in that: set traversal [3] traverses the position set; denoting the position set as W, each word w_i is taken out of W in turn and stored in the variable word, i.e. word = w_i, i = i + 1.

5. The left-right recursion new word discovery method according to claim 1, characterized in that: reception judgment [4] judges whether each element in the position set meets the requirements of the reception rule; the reception rule means that words are accepted according to five types: concrete nouns, derived words, abbreviated words, compound words, and digit combination words.

6. The left-right recursion new word discovery method according to claim 1, characterized in that: word frequency calculation [5] calculates the frequency with which each word occurs, the frequency formula being:

p(X) = \frac{N(X)}{N}
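As a small worked illustration of this formula, the sketch below assumes that N(X) is the number of occurrences of X and that N is the total number of counted occurrences; the claim itself does not restate these definitions, so both readings, and the function name `frequencies`, are assumptions.

```python
from collections import Counter


def frequencies(tokens):
    """Return p(X) = N(X) / N for every distinct X in tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


p = frequencies(["新词", "发现", "新词", "方法"])
```

Here "新词" occurs twice among four counted items, so p("新词") is 2/4.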

7. The left-right recursion new word discovery method according to claim 1, characterized in that: left recursion [6] means that, for each word, the words on its left are successively taken to form new-word candidates and judged; denoting the current word as word, the steps include: (1) calculate the left information entropy of word; (2) check against the preset information entropy threshold whether the calculated left information entropy meets the threshold, and if it does not, calculate the left neighbor set of word, the left neighbor set being the set of words to the left of word; (3) take any element pre from the left neighbor set, combine pre and word into pre+word, return to the previous step to recalculate the left information entropy, and judge again; (4) if the calculated left information entropy meets the threshold, calculate the mutual information; (5) check against the preset mutual information threshold whether the calculated mutual information meets the threshold, and if it does not, calculate the corresponding left neighbor set; (6) take any element pre from the left neighbor set, combine pre and word into pre+word, return to the previous step to recalculate the left information entropy, and judge again, and so on; (7) if the calculated mutual information meets the threshold, judge whether word or pre+word already exists in the dictionary; if it does not exist, add it to dictionary P1; if it does exist, take a new word from the position set and repeat the preceding steps.

8. The left-right recursion new word discovery method according to claim 1, characterized in that: right recursion [7] means that, for each word, the words on its right are successively taken to form new-word candidates and judged; denoting the current word as word, the steps include: (1) calculate the right information entropy of word; (2) check against the preset information entropy threshold whether the calculated right information entropy meets the threshold, and if it does not, calculate the right neighbor set of word, the right neighbor set being the set of words to the right of word; (3) take any element suf from the right neighbor set, combine word and suf into word+suf, return to the previous step to recalculate the right information entropy, and judge again; (4) if the calculated right information entropy meets the threshold, calculate the mutual information; (5) check against the preset mutual information threshold whether the calculated mutual information meets the threshold, and if it does not, calculate the corresponding right neighbor set; (6) take any element suf from the right neighbor set, combine word and suf into word+suf, return to the previous step to recalculate the right information entropy, and judge again, and so on; (7) if the calculated mutual information meets the threshold, judge whether word or word+suf already exists in the dictionary; if it does not exist, add it to dictionary P2; if it does exist, take a new word from the position set and repeat the preceding steps.

9. The left-right recursion new word discovery method according to claim 1, characterized in that: combination [8] merges the new-word set found by left recursion with the new-word set found by right recursion; denoting the new-word set found by left recursion as P1 and the new-word set found by right recursion as P2, merging P1 with P2 means taking the elements common to P1 and P2, that is, their intersection.