CN108897842A

CN108897842A - Computer readable storage medium and computer system

Info

Publication number: CN108897842A
Application number: CN201810678724.XA
Authority: CN
Inventors: 朱频频
Original assignee: Shanghai Zhizhen Intelligent Network Technology Co Ltd
Current assignee: Shanghai Zhizhen Intelligent Network Technology Co Ltd
Priority date: 2015-10-27
Filing date: 2015-10-27
Publication date: 2018-11-27
Anticipated expiration: 2035-10-27
Also published as: CN108897842B; CN105389349B; CN105389349A

Abstract

The invention discloses a kind of computer readable storage medium and computer systems.It is stored with program on the medium, which, which is performed, realizes dictionary update method, the method includes：The corpus received is pre-processed, to obtain text data；Branch's processing is carried out to the text data, obtains phrase data；Word segmentation processing is carried out to the phrase data according to the independent word for including in basic dictionary, with the term data after being segmented；Processing is combined to the term data after the adjacent participle, to generate candidate data string；Judgement processing is carried out to the candidate data string, to find neologisms；If it was found that the neologisms are added to the basic dictionary by neologisms, to update the basic dictionary.The present invention can reduce dictionary maintenance cost, promote dictionary and update efficiency.

Description

Computer readable storage medium and computer system

It is on October 27th, 2015 that the application, which is the applying date, and application No. is 201510706335.X, invention and created name is The divisional application of " dictionary update method and device ".

Technical field

The present invention relates to intelligent interaction field more particularly to a kind of computer readable storage medium and computer systems.

Background technique

In the various fields of Chinese information processing, it is required to complete corresponding function based on dictionary.For example, in intelligent retrieval In system or Intelligent dialogue system, by participle, problem retrieval, similarity mode, answering for search result or Intelligent dialogue is determined Case etc., wherein it is that minimum unit is calculated that each process, which is by word, the basis of calculating is word dictionary, so word Dictionary has the performance of whole system very big influence.

Socio-cultural progress and transition, the fast development of economic business often drive the variation of language, and most quickly Embody language change is exactly the appearance of neologisms.Especially in specific area, if can timely update word after neologisms appearance Dictionary has conclusive influence to the system effectiveness of the Intelligent dialogue system where word dictionary.

It is all to add neologisms into dictionary by the way of artificial in the prior art.It include independent word, neologisms in dictionary It is exactly at least following three sources of newfound independent word：The neologisms in field that client provides；The language provided by client Expect the neologisms of discovery；The neologisms found during operation.

Fig. 1 be in the prior art it is a kind of update dictionary flow chart, including：

S11, it is artificial by reading discovery candidate data string；

S12 judges whether candidate data string is included in existing dictionary by retrieval；

S13 is added to when candidate data string is not included in dictionary using the candidate data string as new independent word Have in dictionary to form new dictionary.

But above-mentioned artificial working method causes the maintenance cost of dictionary high, low efficiency, and is easy to happen omission, finally Prevent neologisms from being added in dictionary in time.

Summary of the invention

Present invention solves the technical problem that being how to reduce dictionary maintenance cost, promotes dictionary and update efficiency.

In order to solve the above technical problems, the embodiment of the present invention provides a kind of computer readable storage medium, it is stored thereon with Program, which, which is performed, realizes dictionary update method, the method includes：

The corpus received is pre-processed, to obtain text data；

Branch's processing is carried out to the text data, obtains phrase data；

Word segmentation processing is carried out to the phrase data according to the independent word for including in basic dictionary, with the word after being segmented Language data；

Processing is combined to the term data after the adjacent participle, to generate candidate data string；

Judgement processing is carried out to the candidate data string, to find neologisms；

If it was found that the neologisms are added to the basic dictionary by neologisms, to update the basic dictionary.

Optionally, the generation candidate data string, including：It will be adjacent in the phrase data of same a line using Bigram model Word is as candidate data string.

Optionally, the method also includes：The phrase data is segmented again according to updated basic dictionary Processing, combined treatment and judgement processing, and the basic dictionary is constantly updated using the neologisms found every time.

Optionally, described that judgement processing is carried out to the candidate data string, to find that neologisms include：Internal judgment and/or Outside judgement；

The internal judgment includes：It calculates candidate data and conspires to create the probability characteristics value for neologisms, the candidate data conspires to create For neologisms probability characteristics value within a preset range when, the candidate data string be neologisms；

The outside judges：Calculate the probability that each word and its outside word in the candidate data string constitute neologisms Characteristic value, removes each word and its outside word constitutes candidate data string of the probability characteristics value of neologisms outside preset range, remains Remaining candidate data string is neologisms.

Optionally, the calculating candidate data conspires to create comprises at least one of the following for the probability characteristics value of neologisms：

Calculate the frequency, frequency or the frequency and frequency calculating occurred according to the candidate data string that candidate data string occurs Obtained numerical value；

Calculate the mutual information in candidate data string between each term data；

Calculate the boundary term data of candidate data string and the comentropy of inside term data.

Optionally, when the type that the candidate data that need to be calculated conspires to create the probability characteristics value for neologisms is more than one, Judge whether within a preset range to calculate the preceding probability characteristics value of order, only the candidate to probability characteristics value within a preset range The calculating of the serial data progress posterior probability characteristics value of order.

Optionally, the probability characteristics value for calculating each word and its outside word in the candidate data string and constituting neologisms Including：Calculate the boundary term data of candidate data string and the comentropy of outside term data.

Optionally, described that judgement processing is carried out to the candidate data string, include to find neologisms successively：

The frequency for calculating the candidate data string removes candidate data string of the frequency outside preset range；

The mutual information for calculating the remaining candidate data string, removes candidate data of the mutual information outside preset range String；

The comentropy for calculating remaining the candidate data string boundary term data and inside term data, removes the letter Cease candidate data string of the entropy outside preset range；

The comentropy for calculating remaining the candidate data string boundary term data and outside term data, removes the letter Cease candidate data string of the entropy outside preset range；

The remaining candidate data string is as neologisms.

Optionally, the method also includes：The length range of candidate data string is set, to exclude length in the length model Candidate data string except enclosing.

The embodiment of the present invention also provides a kind of computer system, has electronic data processing capability, including dictionary more new clothes It sets, described device includes：Pretreatment unit, branch's processing unit, word segmentation processing unit, combined treatment unit, new word discovery list Member and updating unit；Wherein：

The pretreatment unit, suitable for being pre-processed to the corpus received, to obtain text data；

Branch's processing unit is suitable for carrying out branch's processing to the text data, obtains phrase data；

The word segmentation processing unit, suitable for dividing according to the term data for including in basic dictionary the phrase data Word processing, with the term data after being segmented；

The combined treatment unit, suitable for being combined processing to the term data after the adjacent participle, to generate Candidate data string；

The new word discovery unit, suitable for carrying out judgement processing to the candidate data string, to find neologisms；

The updating unit is suitable for after finding neologisms, and the neologisms are added to the basic dictionary, to update the base Plinth dictionary.

Compared with prior art, the technical solution of the embodiment of the present invention has the advantages that：

By being pre-processed to corpus, branch's processing, word segmentation processing, to obtain the list that the corresponding basic dictionary of corpus includes Only word generates candidate data string by combined treatment, by handling candidate data string judgement, to find neologisms.The above process It realizes and corpus is automatically processed, so as to reduce the update cost of dictionary；Corpus is handled based on computer The efficiency that dictionary update can be promoted, avoids omitting, and guarantees the accuracy that dictionary updates.

Further, the candidate data that need to be calculated conspire to create the probability characteristics value for neologisms type it is more than one when, lead to It crosses and successively candidate data string is judged, judge whether within a preset range to calculate the preceding probability characteristics value of order, it is only right The candidate data string of probability characteristics value within a preset range carries out the calculating of the posterior probability characteristics value of order, it is possible to reduce order Posterior computer capacity is promoted to reduce calculation amount and updates efficiency.

Further, according to updated basic dictionary again to the phrase data carry out word segmentation processing, combined treatment and Judgement is handled, and constantly updates the basic dictionary using the neologisms found every time, will not obtained neologisms and is used as stopping dictionary more New condition promotes the reliability of dictionary so as to comprehensively be updated to dictionary.

In addition, by the length range of setting candidate data string, it is adjacent except the length range to exclude length Term data, so that only the calculating of probability characteristics value need to be carried out to adjacent word data of the length in the length range, finally The calculation amount of dictionary update can be further decreased, is promoted and updates efficiency.

Detailed description of the invention

Fig. 1 be in the prior art it is a kind of update dictionary flow chart；

Fig. 2 is a kind of application schematic diagram of dictionary updating device in the embodiment of the present invention；

Fig. 3 is a kind of flow chart of dictionary update method in the embodiment of the present invention；

Fig. 4 is a kind of flow chart for the specific implementation for finding neologisms step in the embodiment of the present invention；

Fig. 5 is a kind of structural schematic diagram of dictionary updating device in the embodiment of the present invention；

Fig. 6 is the structural schematic diagram of new word discovery unit in the embodiment of the present invention；

Fig. 7 is a kind of structural schematic diagram of internal judgment unit in the embodiment of the present invention.

Specific embodiment

As previously mentioned, being all to add neologisms into dictionary by the way of artificial in the prior art.Added by manual type Neologisms are added easily to omit；Due to being limited by artificial treatment speed, efficiency is lower；The maintenance cost of dictionary also by manually at Originally it raises.

The embodiment of the present invention is handled corpus by computer, and corpus is unified for and was found suitable for computer new word The format of journey generates candidate data string, sets suitable condition and screen to candidate data string, to find neologisms.Based on Calculation machine discovery neologisms can promote the efficiency of dictionary update, avoid omitting, and guarantee the accuracy that dictionary updates.

It is understandable to enable above-mentioned purpose of the invention, feature and beneficial effect to become apparent, with reference to the accompanying drawing to this The specific embodiment of invention is described in detail.

Fig. 2 is a kind of application schematic diagram of dictionary updating device in the embodiment of the present invention.

Dictionary updating device 22 is suitable for receiving corpus 21, corpus is pre-processed based on basic dictionary 23, is handled in lines, Word segmentation processing, combined treatment and judge that processing is handled, to find neologisms, if discovery neologisms, the neologisms are added to described Basic dictionary 23, to update the basic dictionary 23.Basic dictionary 23 can be the form of database.

Wherein, dictionary updating device 22 can be located in the electronic computer system with electronic data processing capability, electricity Sub- computer system can use minicomputer, can also use large server；It can be separate unit calculating, server cluster Or distributed server system.

Since dictionary updating device 22 is located in electronic computer system, corpus is handled by computer, thus The processing speed to corpus can be greatly improved, human resources are saved, reduces processing cost, promotes treatment effeciency, in time efficiently Accurately update dictionary.

Fig. 3 is a kind of flow chart of dictionary update method in the embodiment of the present invention.

S31 pre-processes the corpus received, to obtain text data.

Corpus can be the corpus in the corresponding field of dictionary application system, that is, in some specific field, when having It may include the text paragraph of neologisms when neologisms occur.For example, corpus can in dictionary application when bank's intelligent Answer System To be article, question answering system FAQs or the system log etc. of bank's offer.

The diversity in corpus source, which can be, to be updated dictionary more comprehensive, but simultaneously, Format Type is more in corpus, is Convenient for carrying out subsequent processing to corpus, corpus need to be pre-processed, obtain text data.

In specific implementation, the uniform format of corpus can be text formatting by the pretreatment, and filter dirty word, sensitivity One of word and stop words are a variety of.It, can wouldn't energy by current techniques when being text formatting by the uniform format of corpus The information filtering for being converted to text formatting is fallen.

S32 carries out branch's processing to the text data, obtains phrase data.

Branch's processing can be to corpus according to punctuate branch, such as the punctuates such as fullstop, comma, exclamation, question mark is occurring Punishment row.Obtaining phrase data herein is the primary segmentation to corpus, in order to the range of the subsequent word segmentation processing of determination.

S33 carries out word segmentation processing to the phrase data according to the independent word for including in basic dictionary, after obtaining participle Term data.

Basic dictionary includes multiple independent words, and the length of different individually words can be different.In specific implementation, based on basis The process that dictionary carries out word segmentation processing can use one of the two-way maximum matching method of dictionary, HMM method and CRF method or more Kind.

The word segmentation processing is to carry out word segmentation processing to the phrase data of same a line, so that the term data after participle is located at Same a line, and the term data is all included in the independent word in dictionary.

When the dictionary difference of use, different word segmentation results can be obtained.

Due in conversational system, by participle, problem retrieval, similarity mode, determining the processes such as answer reality in field The process of the intelligent replying of existing problem, is all to be calculated using independent word as minimum unit, is divided herein according to basic dictionary The process of word processing is similar in the running participle process of conversational system, difference be word segmentation processing based on dictionary in have Difference.

S34 is combined processing to the term data after the adjacent participle, to generate candidate data string.

Word segmentation processing is carried out according to current basic dictionary, it may appear that by should be as the word of a word in some field The case where language data are divided into multiple term datas, the update of dictionary is namely based on currently segmenting as a result, setting condition filters out Dictionary is added by the candidate data string that should be used as neologisms for the candidate data string.Candidate data string is generated as above-mentioned The premise of screening process can be completed using various ways.

If regarding adjacent word all in corpus as candidate data string, the calculation amount of dictionary more new system is excessively huge Greatly, efficiency is lower, and is located at the adjacent word that do not go together and also has no the meaning calculated.Therefore adjacent word can be screened, Generate candidate data string.

In specific implementation, it can use Bigram model using word two neighboring in the phrase data of same a line as time Select serial data.

Assuming that a sentence S can be expressed as a sequence S=w1w2 ... wn, language model is exactly to require that sentence S's is general Rate p (S)：

P (S)=p (w1, w2, w3, w4, w5 ..., wn)

=p (w1) p (w2 | w1) p (w3 | w1, w2) ... p (wn | w1, w2 ..., wn-1) (1)

Probability statistics are based on Ngram model in formula (1), and the calculation amount of probability is too big, can not be applied in practical application. (Markov Assumption) is assumed based on Markov：The appearance of next word only relies upon one or several before it Word.Assuming that the appearance of next word relies on a word before it, then have：

P (S)=p (w1) p (w2 | w1) p (w3 | w1, w2) ... p (wn | w1, w2 ..., wn-1)

=p (w1) p (w2 | w1) p (w3 | w2) ... p (wn | wn-1) (2)

Assuming that the appearance of next word relies on two words before it, then have：

P (S)=p (w1) p (w2 | w1) p (w3 | w1, w2) ... p (wn | w1, w2 ..., wn-1)

=p (w1) p (w2 | w1) p (w3 | w1, w2) ... p (wn | wn-1, wn-2) (3)

Formula (2) is the calculation formula of Bigram probability, and formula (3) is the calculation formula of Trigram probability.Pass through setting The more constraint informations occurred to next word can be set in bigger n value, have bigger discrimination；By being arranged more Small n, the number that candidate data string occurs in dictionary update is more, can provide more reliable statistical information, has higher Reliability.

Theoretically, n value is bigger, and reliability is higher, and in existing processing method, Trigram's is most.But Bigram's Calculation amount is smaller, and system effectiveness is higher.

In specific implementation, the length range of candidate data string can also be set, to exclude length in the length range Except candidate data string.So as to obtain the neologisms of different length range according to demand, to be applied to different scenes.Example Such as, the lesser range of length range numerical value is set, to obtain the word on grammatical meaning, is applied to intelligent Answer System；Setting The biggish range of length range numerical value, to obtain phrase or short sentence, with the keyword etc. as literature search catalogue.

S35 carries out judgement processing to the candidate data string, to find neologisms.

In specific implementation, described that judgement processing is carried out to the candidate data string, to find that neologisms can pass through inside Judgement discovery judges discovery by external, can also pass through internal judgment and external judgement discovery jointly.

The internal judgment may include：It calculates candidate data and conspires to create the probability characteristics value for neologisms, when the candidate number According to conspire to create for neologisms probability characteristics value within a preset range when, the candidate data string be neologisms.

The outside judges：It calculates each word and its outside word in the candidate data string and constitutes neologisms Probability characteristics value, removes each word and its outside word constitutes the candidate data of the probability characteristics value of neologisms within a preset range String, remaining candidate data string are neologisms.

In specific implementation, candidate data, which conspires to create, passes through given threshold reality in preset range for the probability characteristics value of neologisms Existing, the specific value of threshold value is set according to the type and demand of probability characteristics value.

In specific implementation, it includes following a kind of or more that the calculating candidate data, which conspires to create as the probability characteristics value of neologisms, Kind：The frequency, frequency or the frequency occurred according to the candidate data string and frequency for calculating the appearance of candidate data string are calculated Numerical value；Calculate the mutual information in candidate data string between each term data；Calculate candidate data string boundary term data with The comentropy of inside term data.

The frequency that candidate data string occurs refers to the number that candidate data string occurs in corpus, and frequency filtering is waited for judging The connecting times for selecting serial data then filter out the candidate data string when the frequency is lower than a certain threshold value；What candidate data string occurred Total word amount is related in the number and corpus that frequency occurs with it.The frequency and frequency meter that will be occurred according to the candidate data string Obtained numerical value is higher as the probability characteristics value accuracy of the candidate data string.In an embodiment of the present invention, according to institute The frequency and frequency for stating the appearance of candidate data string are calculated probability characteristics value and can use TF-IDF (Term Frequency- Inverse Document Frequency) technology.

TF-IDF is a kind of common weighting technique prospected for information retrieval and information, to assess some words for The significance level of one file set or a copy of it file in a corpus, that is, the significance level in corpus.Word The importance of word is with the directly proportional increase of number that it occurs hereof, but the frequency that can occur in corpus with it simultaneously Rate is inversely proportional decline.

The main thought of TF-IDF is：If the frequency TF high that some word or phrase occur in an article, and Seldom occur in other articles, then it is assumed that this word or phrase have good class discrimination ability, are adapted to classify.TF- IDF is actually：TF*IDF, TF word frequency (Term Frequency), the anti-document frequency of IDF (Inverse Document Frequency).TF indicates frequency (another theory that entry occurs in document d：TF word frequency (Term Frequency) refers to The number that some given word occurs in this document).The main thought of IDF is：If the document comprising entry t is got over Less, that is, n is smaller, and IDF is bigger, then illustrates that entry t has good class discrimination ability.If wrapped in certain a kind of document C The number of files of the t containing entry is m, and the total number of documents that other classes include t is k, it is clear that all number of files n=m+k comprising t work as m When big, n is also big, and the value of the IDF obtained according to IDF formula can be small, just illustrates that entry t class discrimination is indifferent.It is (another One says：The anti-document frequency of IDF (Inverse Document Frequency) refers to that the document comprising entry is fewer, and IDF is bigger, Then illustrate that entry has good class discrimination ability.) but in fact, if an entry is frequent in the document of a class Occur, that is, frequently occurred in corpus, then illustrates that the entry can represent the feature of text of this class very well, it is such Entry should assign higher weight to them, and select the Feature Words as the class text to distinguish and other class documents.? Being exactly can be using such entry as the neologisms in the field of dictionary application.

Formula is shown in the definition of mutual information (Mutual Information, MI)：

Mutual information reflects the cooccurrence relation of candidate data string with wherein term data, the candidate being made of two independent words The mutual information of serial data is a value (mutual information between i.e. two independent words), as a candidate data string W and wherein term data When co-occurrence frequency is high, i.e., when frequency of occurrence is close, it is known that the mutual information MI of candidate data string W is close to 1, that is to say, that this when A possibility that selecting serial data W to become a word is very big.If the value very little of mutual information MI, close to 0, then illustrate that W is almost impossible As a word, unlikely become a neologisms.Mutual information reflects the degree of dependence inside a candidate data string, thus It can be used to judge whether candidate data string can be likely to become neologisms.

Comentropy is to the probabilistic measurement of stochastic variable, and calculation formula is as follows：

H (X)=- ∑ p (x_i)log p(x_i)

Comentropy is bigger, indicates that the uncertainty of variable is bigger；The probability that i.e. each possible value occurs is average.Such as The probability that some value of fruit variable occurs is 1, then entropy is 0.Show that variable only works as the generation of former value, is an inevitable thing Part.

Using this property of entropy, each independent term data is successively fixed to candidate data string, is calculated in the word number The comentropy occurred according to another word in the case of appearance.If in candidate data string string (w1w2) with the right combination of term data w1 The right side comentropy of term data be greater than threshold value, and with the left side comentropy of the left combination of term data w2 also greater than threshold value, Then think that the candidate data string is likely to become neologisms.Calculation formula is as follows：

H₁(W)=∑_x∈X(#XW>0)P (x | W) log P (x | W), wherein X is all term data collection for appearing in the left side W It closes；H₁(W) the left side comentropy for being term data W.

H₂(W)=∑_x∈Y(#WY>0)P (y | W) log P (y | W), wherein Y is all term data collection appeared on the right of W It closes, H₂(W) the right side comentropy for being term data W.

In specific implementation, if the candidate data that need to be calculated conspires to create the type of the probability characteristics value for neologisms more than one Kind, then it may determine that whether within a preset range to calculate the preceding probability characteristics value of order, only to probability characteristics value in default model Candidate data string in enclosing carries out the calculating of the posterior probability characteristics value of order.The only time to probability characteristics value within a preset range Serial data is selected to carry out the calculating of the posterior probability characteristics value of order, it is possible to reduce the posterior computer capacity of order, to reduce meter Calculation amount, lifting system efficiency.

In specific implementation, the probability for calculating each word and its outside word in the candidate data string and constituting neologisms Characteristic value includes：Calculate the boundary term data of candidate data string and the comentropy of outside term data.

It calculates the entropy of term data and the term data on the outside of it in candidate data string and embodies word on the outside of the term data The confusion degree of language data.For example, by calculating candidate data string W₁W₂Middle left side term data W₁Left side comentropy, right side Term data W₂Right side comentropy may determine that term data W₁And W₂The confusion degree in outside, so as to pre- by setting If range is screened, excludes each word and its outside word constitutes candidate number of the probability characteristics value of neologisms outside preset range According to string.

Illustrate so that candidate data string only includes two independent words (w1w2) as an example, independent word w1 and adjacent candidate data string In independent word there is an outside comentropy, individually there is word w2 an inside to believe in independent word w1 and same candidate data string Cease entropy；Independent word w2 has an inside comentropy, independent word w2 and adjacent time with word w1 independent in same candidate data string Select the independent word in serial data that there is an outside comentropy, i.e., the independent word of centrally located (non-end) all has one Inside comentropy and outside comentropy.

On the inside of progress when the judgement of comentropy or outside comentropy, need to two inside letters in a candidate data string Breath entropy or two outside comentropies are all judged that only there are two inside comentropies or two outside comentropies to be all located at default model When enclosing, just think that the inside comentropy of the candidate data string or outside comentropy are located in preset range；Otherwise, as long as there is one Inside comentropy or an outside comentropy are located at outside preset range, are considered as inside comentropy or the outside of the candidate data string Comentropy is located at outside preset range.

For example, two adjacent candidate data strings are respectively：The candidate being made of independent word " I " and independent word " handling " Serial data；The candidate data string being made of independent word " North China " and independent word " mall ".The internal information of two candidate data strings Entropy is respectively：Comentropy between independent word " I " and independent word " handling "：Between independent word " North China " and independent word " mall " Comentropy.External information entropy between two candidate data strings is：Letter between independent word " handling " and independent word " North China " Cease entropy.

In an embodiment of the present invention, after completing internal judgment to candidate data string, think possible to through internal judgment Candidate data string as neologisms carries out external judgement, excludes the probability characteristics value that each word and its outside word constitute neologisms and exists Candidate data string outside preset range.

S36 judges whether to find neologisms, if discovery neologisms, then follow the steps S37.If not finding neologisms, then follow the steps S39 terminates dictionary and updates.

The neologisms are then added to the basic dictionary by S37, to update the basic dictionary.

In specific implementation, it is also an option that executing following steps：

S38 carries out word segmentation processing to the phrase data again according to updated basic dictionary, the word after being segmented Language data.After step S38 is finished, execute step S34 again, so as to according to updated basic dictionary again to institute It states phrase data and carries out word segmentation processing, combined treatment and judgement processing, and constantly update the base using the neologisms found every time Plinth dictionary.Until judging through step S36, when not finding neologisms, terminates dictionary and update.

Since the length of neologisms is likely larger than 2, place can be iterated to word segmentation processing, new word discovery and post-processing Reason, the dictionary used when carrying out word segmentation processing next time once post-process obtained new dictionary before being exactly, are segmented next time The length for handling obtained candidate data string once adds 1 than preceding, and iteration time can be limited by the length limitation to neologisms Number.

For the sake of accurate, manually examined when neologisms can be added in dictionary in last time iterative process It looks into.

The phrase data is carried out at word segmentation processing, combined treatment and judgement again according to updated basic dictionary It manages, and constantly updates the basic dictionary using the neologisms found every time, neologisms will not obtained as the item for stopping dictionary update Part promotes the reliability of dictionary so as to comprehensively be updated to dictionary.

In specific implementation, step S31 to step S37 can be executed, only to realize that a dictionary updates；In step S35, The candidate data string is carried out in judgement processing, internal judgment can be only carried out, can also only carry out external judgement, Huo Zheye It can choose and not only carry out internal judgment but also carry out external judgement.

When carrying out internal judgment, following probability characteristics value can be calculated：The frequency, frequency or the root that candidate data string occurs The numerical value that the frequency and frequency occurred according to the candidate data string is calculated；It calculates in candidate data string between each term data Mutual information and calculate candidate data string boundary term data and inside term data comentropy.Or selection calculating is above-mentioned general One or both of rate characteristic value.

In a specific example, the corpus that receives be voice data " I handle North China mall Long Card need how long when Between？".Being pre-processed by first time by above-mentioned language data process is text data；It is handled by the first sub-branch by the text The text data of data and other rows is distinguished；This article notebook data is divided by first time word segmentation processing：I, handle, North China, Mall, Long Card, needs, more, long and these independent words of time.

Following candidate data string is obtained by first time combined treatment：I handles, handles North China, North China mall, quotient Tall building Long Card, Long Card need, need it is more, how long, for a long time；The frequency is calculated by first time, remove " I handles " and " handles China The two candidate data strings of north "；By first time calculate mutual information, removal " needing more ", " how long " and " long-time " these three Candidate data string；The comentropy with outside term data is calculated by first time, removes " Long Card needs " this candidate data string, To obtain neologisms " North China mall ", " North China mall " is added in basic dictionary.

This article notebook data is divided by second of word segmentation processing：I, handle, be North China mall, Long Card, needs, more, long These independent words with the time；Following candidate data string is obtained by second of combined treatment：I handles, handles North China quotient Tall building, North China mall Long Card, Long Card need, need it is more, how long, for a long time；Calculate the frequency by second, remove " I handles " and " handling North China mall " the two candidate data strings；By second of calculating mutual information, removal " needing more ", " how long " and it is " long These three candidate data strings of time "；Calculate comentropy with outside term data by second, removal " Long Card needs " this Candidate data string to obtain neologisms " North China mall Long Card ", and " North China mall Long Card " is added in basic dictionary.

It can continue to carry out word segmentation processing, combined treatment according to the basic dictionary for including " North China mall Long Card " below and sentence Disconnected processing, and the basic dictionary is constantly updated using the neologisms found every time.

It should be noted that in the above example, followed by judgement processing in, both can be to all candidate datas String re-starts judgement；Also it can recorde previous judging result, thus before can calling directly to same candidate data string The judging result in face；The candidate data string including neologisms can also only be formed, thus only to include neologisms candidate data string into Row judgement.

Fig. 4 is a kind of flow chart for the specific implementation for finding neologisms step in the embodiment of the present invention, and wherein step S351 is extremely Step S353 is the specific implementation of step S35 as shown in Figure 3, and the explanation carried out for flow chart in Fig. 3 is herein not Repeat explanation.

S351 calculates the frequency that candidate data string occurs.

Whether within a preset range S352 judges the frequency that the candidate data string occurs, if the candidate data string goes out The existing frequency within a preset range, thens follow the steps S353；If the frequency that the candidate data string occurs is not within a preset range, Then follow the steps S361.

S353 calculates the mutual information in candidate data string between each term data.It is understood that mutual information at this time It calculates and is carried out only for the candidate data string of the frequency within a preset range.

Whether within a preset range S354 judges mutual information in candidate data string between each term data, if candidate number Within a preset range according to the mutual information between term data each in string, S355 is thened follow the steps；If each word in candidate data string Mutual information between language data within a preset range, does not then follow the steps S361.

S355 calculates the boundary term data of candidate data string and the comentropy of inside term data.

It is understood that the calculating with the comentropy of inside term data of candidate data string exists only for the frequency at this time In preset range and the candidate data string of mutual information within a preset range carries out.

Whether the comentropy of S356, the boundary term data for judging candidate data string and inside term data is in preset range It is interior, if the comentropy of the boundary term data of candidate data string and inside term data within a preset range, thens follow the steps S357；If the comentropy of the boundary term data of candidate data string and inside term data within a preset range, does not execute step Rapid S361.

S357 calculates the boundary term data of candidate data string and the comentropy of outside term data.

It is understood that selecting the calculating of the boundary term data of serial data and the comentropy of outside term data at this time only For the frequency preset range, mutual information within a preset range, and the comentropy of boundary term data and inside term data exists Candidate data string in preset range carries out.

Whether the comentropy of S358, the boundary term data for judging candidate data string and outside term data is in preset range It is interior, if the comentropy of the boundary term data of phase candidate data string and outside term data within a preset range, thens follow the steps S362；If the comentropy of the boundary term data of candidate data string and outside term data within a preset range, does not execute step Rapid S361.

Step S361 and step S362 is that two kinds of step S36 in Fig. 3 differentiate as a result, wherein step S361 is through judging not It was found that neologisms, step S362 is to find neologisms through judgement.

In embodiments of the present invention, due to successively calculate the frequency, mutual information, candidate data string boundary term data with it is interior The comentropy of side term data, and the difficulty in computation of above-mentioned three kinds of probability characteristics values is incremented by, the preceding calculating of order can exclude Not candidate data string within a preset range, the candidate data string being excluded are no longer participate in the posterior calculating of order, so as to It saves and calculates the time, improve the efficiency of dictionary update method.

The embodiment of the present invention also provides a kind of dictionary updating device, as shown in Figure 5.

Dictionary updating device 22 includes：Pretreatment unit 221, branch's processing unit 222, word segmentation processing unit 223, combination Processing unit 224, new word discovery unit 225 and updating unit 226, wherein：

The pretreatment unit 221, suitable for being pre-processed to the corpus received, to obtain text data；

Branch's processing unit 222 is suitable for carrying out branch's processing to the text data, obtains phrase data；

The word segmentation processing unit 223 divides the phrase data according to the term data for including in basic dictionary Word processing, with the term data after being segmented；

The combined treatment unit 224 is combined processing to the term data after the adjacent participle, is waited with generating Select serial data；

The new word discovery unit 225 carries out judgement processing to the candidate data string, to find neologisms；

The updating unit 226 is suitable for after finding neologisms, and the neologisms are added to the basic dictionary, to update State basic dictionary.

In specific implementation, combined treatment unit 224 is suitable for utilizing Bigram model by phase in the phrase data of same a line Adjacent word is as candidate data string.

In specific implementation, dictionary updating device 22 can also include：Iteration unit 227 is updated, is suitable on the basis Dictionary indicates that the word segmentation processing unit is based on updated basic dictionary after updating, carry out at participle to the phrase data Reason indicates that the combined treatment unit generates candidate data string, indicate the new word discovery unit to the candidate data string into Row judgement processing to find neologisms, and indicates that the updating unit updates the basic dictionary using the neologisms of discovery；If not sending out Existing neologisms, then terminate the update of basic dictionary.

In specific implementation, the new word discovery unit 225 may include：Internal judgment unit 2251 (referring to Fig. 6, with Lower combination Fig. 6 is illustrated) and/or external judging unit 2252；Wherein：

The internal judgment unit 2251 conspires to create the probability characteristics value for neologisms, the candidate suitable for calculating candidate data When serial data becomes the probability characteristics value of neologisms within a preset range, which is neologisms；

The external judging unit 2252 is constituted newly suitable for calculating each word and its outside word in the candidate data string The probability characteristics value of word, removes each word and its outside word constitutes candidate number of the probability characteristics value of neologisms outside preset range According to string, remaining candidate data string is neologisms.

In specific implementation, the internal judgment unit 2251 conspires to create the probability characteristics for neologisms suitable for calculating candidate data Value comprises at least one of the following：

In specific implementation, when the candidate data that need to be calculated conspires to create the type of the probability characteristics value for neologisms more than one When kind, the internal judgment unit 2251 is suitable for judging whether within a preset range to calculate the preceding probability characteristics value of order, only The calculating of the posterior probability characteristics value of order is carried out to the candidate data string of probability characteristics value within a preset range.

In specific implementation, the internal judgment unit 2251 (referring to Fig. 7, being illustrated below in conjunction with Fig. 7) can wrap It includes：Frequency filter element 22511, mutual information filter element 22512 and internal information entropy filter element 22513；The outside Judging unit 2252 includes external information entropy filter element；Wherein：

The frequency filter element 22511 removes the frequency default suitable for calculating the frequency of the candidate data string Candidate data string outside range；

The mutual information filter element 22512 is suitable for calculating after frequency filter element filtering, the remaining time The mutual information for selecting serial data removes candidate data string of the mutual information outside preset range；

The internal information entropy filter element 22513 is suitable for calculating after mutual information filter element filtering, remaining The comentropy of the candidate data string boundary term data and inside term data removes the comentropy outside preset range Candidate data string；

The external information entropy filter element is suitable for calculating after internal information entropy filter element filtering, remaining The comentropy of the candidate data string boundary term data and outside term data removes the comentropy outside preset range Candidate data string.

In specific implementation, the external judging unit 2252 is suitable for calculating the boundary term data of candidate data string and outer The comentropy of side term data.

In specific implementation, the pretreatment unit 221 is suitable for the uniform format of corpus being text formatting；It filters dirty One of word, sensitive word and stop words are a variety of.

In specific implementation, the word segmentation processing unit 223 be suitable for using the two-way maximum matching method of dictionary, HMM method and One of CRF method is a variety of.

In specific implementation, dictionary updating device 22 further includes：Length filter elements 228 are suitable for setting candidate data string Length range, to exclude candidate data string of the length except the length range.

The embodiment of the present invention is by pre-processing corpus, branch is handled, word segmentation processing, to obtain the corresponding basis of corpus The independent word that dictionary includes generates candidate data string by combined treatment, new to find by handling candidate data string judgement Word.The above process, which realizes, automatically processes corpus, so as to reduce cost of labor；Based on computer to corpus at Reason can also promote the efficiency and accuracy of dictionary update.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of above-described embodiment is can It is completed with instructing relevant hardware by program, which can be stored in a computer readable storage medium, storage Medium may include：ROM, RAM, disk or CD etc..

Although present disclosure is as above, present invention is not limited to this.Anyone skilled in the art are not departing from this It in the spirit and scope of invention, can make various changes or modifications, therefore protection scope of the present invention should be with claim institute Subject to the range of restriction.

Claims

1. a kind of computer readable storage medium, is stored thereon with program, which is characterized in that the program is performed realization dictionary Update method, the method includes：

The corpus received is pre-processed, to obtain text data；

Branch's processing is carried out to the text data, obtains phrase data；

Word segmentation processing is carried out to the phrase data according to the independent word for including in basic dictionary, with the word number after being segmented According to；

2. computer readable storage medium according to claim 1, which is characterized in that the generation candidate data string, packet It includes：Using Bigram model using adjacent word in the phrase data of same a line as candidate data string.

3. computer readable storage medium according to claim 1 or 2, which is characterized in that the method also includes：According to Updated basis dictionary carries out word segmentation processing, combined treatment and judgement to the phrase data again and handles, and using every time It was found that neologisms constantly update the basic dictionary.

4. computer readable storage medium according to claim 1, which is characterized in that it is described to the candidate data string into Row judgement processing, to find that neologisms include：Internal judgment and/or external judgement；

The internal judgment includes：It calculates candidate data and conspires to create the probability characteristics value for neologisms, it is new that the candidate data, which conspires to create, The probability characteristics value of word within a preset range when, the candidate data string be neologisms；

The outside judges：Calculate the probability characteristics that each word and its outside word in the candidate data string constitute neologisms Value, removes each word and its outside word constitutes candidate data string of the probability characteristics value of neologisms outside preset range, remaining Candidate data string is neologisms.

5. computer readable storage medium according to claim 4, which is characterized in that the calculating candidate data conspire to create for The probability characteristics value of neologisms comprises at least one of the following：

The frequency, frequency or the frequency occurred according to the candidate data string and frequency for calculating the appearance of candidate data string are calculated Numerical value；

6. computer readable storage medium according to claim 5, which is characterized in that when the candidate data that need to be calculated Conspire to create the probability characteristics value for neologisms type it is more than one when, whether judge to calculate the preceding probability characteristics value of order default In range, the calculating of the posterior probability characteristics value of order is only carried out to the candidate data string of probability characteristics value within a preset range.

7. computer readable storage medium according to claim 4, which is characterized in that described to calculate the candidate data string In each word and its outside word constitute neologisms probability characteristics value include：Calculate the boundary term data of candidate data string and outer The comentropy of side term data.

8. computer readable storage medium according to claim 1, which is characterized in that it is described to the candidate data string into Row judgement is handled, and includes to find neologisms successively：

The mutual information for calculating the remaining candidate data string removes candidate data string of the mutual information outside preset range；

The comentropy for calculating remaining the candidate data string boundary term data and inside term data, removes the comentropy Candidate data string outside preset range；

The comentropy for calculating remaining the candidate data string boundary term data and outside term data, removes the comentropy Candidate data string outside preset range；

The remaining candidate data string is as neologisms.

9. computer readable storage medium according to claim 1, which is characterized in that the method also includes：Setting is waited The length range of serial data is selected, to exclude candidate data string of the length except the length range.

10. a kind of computer system has electronic data processing capability, which is characterized in that including dictionary updating device, the dress Set including：Pretreatment unit, branch's processing unit, word segmentation processing unit, combined treatment unit, new word discovery unit and update Unit；Wherein：

The word segmentation processing unit, suitable for being carried out at participle according to the term data for including in basic dictionary to the phrase data Reason, with the term data after being segmented；

The combined treatment unit, suitable for being combined processing to the term data after the adjacent participle, to generate candidate Serial data；

The updating unit is suitable for after finding neologisms, and the neologisms are added to the basic dictionary, to update the basic word Allusion quotation.