CN103885938B

CN103885938B - Industry spelling mistake checking method based on user feedback

Info

Publication number: CN103885938B
Application number: CN201410149427.8A
Authority: CN
Inventors: 杨明; 罗军舟; 倪俊辉; 马成平; 任新才
Original assignee: Southeast University; Focus Technology Co Ltd
Current assignee: Southeast University; Focus Technology Co Ltd
Priority date: 2014-04-14
Filing date: 2014-04-14
Publication date: 2015-04-22
Anticipated expiration: 2034-04-14
Also published as: CN103885938A

Abstract

The invention discloses an industry spelling mistake checking method based on user feedback. The method checks English text for spelling errors using the N-gram method and a user dictionary designed by category, and recommends correct words by searching a large corpus database, thereby achieving spell checking that is tied to the user. The N-gram method, a basic method of natural language processing, detects errors in the text from the features of words or sentences and from the statistical information in the corpus. The categorized user dictionary, working with the statistical data of the corpus and the current user's historical information, selects the recommended words most relevant to the incorrect words in the user's input text. The Viterbi algorithm searches the database for the word chain with the largest product of conditional probabilities, which improves both the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used.

Description

Industry spelling mistake checking method based on user feedback

Technical field

The present invention is an English spelling error checking method. It draws on corpora containing large amounts of language information, statistical natural language models, and the hidden Markov model (HMM), and belongs to the field of natural language processing, in particular English spell checking.

Background technology

The abbreviations used in the present invention are defined first:

NLP (Natural Language Processing): natural language processing;

BNC (British National Corpus): British National Corpus;

LDC (Linguistic Data Consortium): Linguistic Data Consortium;

LD (Levenshtein Distance): edit distance;

N-gram: N-gram grammar, a sequence of N consecutive words.

Spell checking (Spelling Checker) is an important branch and basic component of NLP. It turns natural language into error-free, intelligible text and naturally supports higher-level NLP technologies such as machine translation, speech synthesis, and speech recognition. It can also make user interfaces friendlier and more intelligent, which gives it significant practical value.

Early NLP mainly used methods based on syntactic and semantic rules. With the rise of corpus construction and corpus linguistics, processing large-scale real text became the main goal of natural language processing. After many years of development, rule-based methods still cannot overcome their limits in accuracy and efficiency, and statistical methods have gradually shown greater advantages in the field. Corpus-based statistical learning is used more and more to acquire linguistic knowledge in natural language processing, and this includes spell checking. Statistical methods mainly involve two aspects: the corpus and the statistical language model.

Many organizations and research institutions provide their own corpora and associated statistics, for example the Chinese and English news corpora used in text classification research, the BNC, the LDC, the more than 4200 free e-books provided by Project Gutenberg, ten thousand randomly sampled papers from the DBLP resource, and the UCI classification evaluation data.

Brants and Franz of Google tokenized web page text in the manner of the Penn Treebank and produced more than 1T of data in total; the details are shown in Table 1. The 5-gram corpus that Google released, built on 1T of web page text data, is currently a relatively comprehensive English corpus for statistical methods. It provides statistics for 1-grams through 5-grams and is a rich source of analysis data for statistical natural language processing.

On the corpus side, a dictionary gives word correction its most basic capability, the detection of non-word errors. A standard dictionary that is extensible and has a good management interface provides users with basic word checking and improves system performance. A corpus that supports statistical methods is the foundation of spell checking: it gives the natural language processing model usable data of considerable scale with complete and accurate information. A semantics-based corpus is an excellent model for dividing professional domains, but because grammar rules are inefficient, this approach is not practical. An industry-classified corpus therefore has to be realized indirectly by statistical methods.

Traditional spell checking focuses on non-word errors, in which a correct word is typed as an invalid word. The usual approach uses a reliable dictionary and a fixed distance measure such as LD. Because building a reliable dictionary by hand is very costly, the dictionaries used by traditional spell checkers are fairly small. As statistical models were introduced into spell checking, the error model and the N-gram language model became the key components of spell checking systems. Kukich proposed applying transition matrices of error probabilities and feature vectors to spelling correction, which became the basis of later N-gram implementations. Brill and Moore showed that a good statistical model is the key to improving spell checking accuracy, but building such an error model requires extensive manual annotation of correction pairs, which is expensive. Whitelaw et al. used Web text to reduce this cost to some extent. With the development of Web technology and applications, spell checking has attracted increasing attention and more error types have been described, such as omitted letters, wrongly inserted letters, transposed letters, wrongly merged words, wrongly split words, and misused words. These methods mainly address three problems: locating input errors, searching the candidate word space, and building a scoring function for candidate words.

Most existing spell checking models are offline models based on the N-gram model, and this approach has become the mainstream of spell checking research. The main idea is to use an extended Bayesian formula to compute statistical information in natural language; its defining feature is the use of statistical methods, which makes the model simple and efficient. The tools mainly used in current research are the N-gram model, the extended Bayesian formula, and the hidden Markov model. The work divides into computing word probabilities with the Bayesian formula, using the hidden Markov model to solve for the parameters of the N-gram model, and solving the hidden Markov model quickly within the Bayesian formula. Model efficiency and practicality are the problems this field most urgently needs to solve.

Summary of the invention

Technical problem: In a spell checking system the corpus is the foundation of the whole model, and the computation and queries performed on it inevitably become the performance bottleneck of the system. If the corpus is based on grammar rules, queries easily suffer poor performance caused by rule explosion; if it only counts word frequencies, the results are inaccurate because the statistics are insufficient. On the model side, matching by a single measure alone yields check results with large errors, while using only an N-gram computation model has a large impact on system performance. The technical problem addressed by the present invention is that existing systems lack dynamic adjustment based on user feedback and cannot make effective combined use of information from multiple corpora. To make effective use of multiple corpora, the invention combines a user dictionary, an industry corpus, and a core corpus through weighted calculation. This method offers fast queries, accurate results, and high adaptability to context; it automatically adjusts how much each part of the corpus is used for different users and text environments, which improves system efficiency while preserving the accuracy of the results. The invention uses the Viterbi algorithm to compute the Markov chain in the N-gram model and obtain the set of most probable correct words. Within each corpus, the probability of every possible word is computed from the N-1 words preceding the incorrect word; weights are then computed from the LD measure and from the part of the corpus in which the word is found, giving a recommendation list sorted by the probability that each word is the correct one. Based on the correct word the user selects and its context, the statistics of the user's text are entered into the system's corpora. After the system obtains the new statistics, it revises the word frequencies and conditional probabilities of the relevant records in the corpus data tables according to the statistical algorithm of the N-gram model, so that the corpora stay synchronized with the user's actual usage and record the statistics of all historical texts, completing the overall update of the spell checking system.

Technical solution:

To solve the above technical problem, the present invention uses N-gram corpus data and related statistical methods to propose an industry spelling mistake checking method based on user feedback. The method is as follows:

An industry spelling mistake checking method based on user feedback comprises the steps of:

1) Acquisition and establishment of the corpora and the user dictionary:

The corpus is divided into a core corpus and an industry corpus. As the core statistical data storing language information, it holds the lexical, syntactic, and semantic information of the overall statistical language and of professional terms. During spell checking, the core corpus and the industry corpus supply the spell checking model with all word and sentence information, that is, the global data of the whole language. In addition, special corpus information about the user is obtained from the dictionary that the user builds;

Data tables are defined in a database to store the overall corpus and the user's corpus information;

2) Construction of the spell checking model:

The spell checking model is built by applying the N-gram model to the statistical information of the corpus to obtain the word chain combination with the largest conditional probability. The steps comprise:

21) Judging word correctness: each word in the text is matched against the core corpus; if the word is not in the core corpus, the industry corpus and the user dictionary are checked in turn; if the word is found in none of these three data tables, it is judged to be an incorrect word and the next step is carried out;

22) Recommending correct words: for the words in each corpus that are close to the incorrect word in edit distance, the probability of each word and its joint probability with the context are calculated; the correct words most relevant to the incorrect word are then calculated using the weight of each corpus, and the several words with the largest weighted posterior probability across all corpora are selected to form the recommendation list of correct words;

3) The text input by the user is processed by the error checking and word recommendation of the spell checking model;

4) Updating the user-related text statistics, dictionary, and corpora: statistics are computed over the text input by the user and the correct words the user selects, and the correct word information and context information of the text are entered into the user dictionary, the core corpus, and the corresponding industry corpus.
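The lookup order of step 21) can be illustrated with a minimal Python sketch. It assumes that each of the three data tables can be treated as an in-memory set of lowercase words; the function names and the punctuation stripping are illustrative assumptions and are not part of the patent text.

```python
def is_known_word(word, core_corpus, industry_corpus, user_dictionary):
    """Check the core corpus first, then the industry corpus, then the user dictionary."""
    for table in (core_corpus, industry_corpus, user_dictionary):
        if word in table:
            return True
    return False


def find_incorrect_words(text, core_corpus, industry_corpus, user_dictionary):
    """Return the words of `text` that appear in none of the three tables."""
    incorrect = []
    for token in text.split():
        # Assumption: surrounding punctuation is ignored and matching is case-insensitive.
        word = token.strip(".,;:!?\"'()").lower()
        if word and not is_known_word(word, core_corpus, industry_corpus, user_dictionary):
            incorrect.append(word)
    return incorrect
```

Words returned by `find_incorrect_words` would then be passed on to the recommendation step 22).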

In step 1), the necessary conditions for a valid corpus and user dictionary comprise:

(1) The user dictionary contains no incorrect words: every entry must be a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry or special word defined by the user;

(2) The core corpus is large enough, has no bias toward an industry or a time period, and must include N-gram information, which provides the basic word context statistics;

(3) The industry corpus is preliminarily divided according to demand and then grows naturally from users' choices; a single user can be a user of several industry corpora.

In step 21), the Viterbi algorithm is used in the N-gram model to quickly compute the probability of the current word in the core corpus and the industry corpus and to obtain the joint probability of the current word occurring with the preceding N-1 words, which decides whether the current word is correct.

In step 22), the N-gram model is used to search the industry corpus and the core corpus for the position of the incorrect word, and matching is performed in the user dictionary by edit distance and word occurrence probability, to obtain the list of most probable words. Based on the probability of each word in the different corpora, the recommended word list is sorted by weighted probability, and the sorted recommendations are then provided to the user.
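The edit-distance matching mentioned in step 22) can be sketched as follows. This is a standard dynamic-programming Levenshtein distance; the `max_distance` cutoff of 2 and the function names are assumptions made for the example, since the patent does not state a threshold.

```python
def levenshtein(a, b):
    """Levenshtein distance (LD) between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def close_words(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance of `word`, nearest first."""
    scored = [(levenshtein(word, w), w) for w in vocabulary]
    return [w for d, w in sorted(scored) if d <= max_distance]
```

For the misspelling `actuater`, both `actuate` and `actuator` are at distance 1, so a probability-based ranking is still needed to choose between them.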

In step 4), after the system has checked the text input by the user for errors, it calculates the text statistics of the user's input, which provide update information for the N-gram data in the user dictionary and the corpora. After the corresponding data tables have been updated, the error checking service is provided using the new corpus data and user dictionary.
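A minimal sketch of the update in step 4) is given below. It assumes that each data table can be modelled as a `collections.Counter` keyed by word tuples; the helper names and the choice of counting all orders from 1 to `max_n` are illustrative assumptions.

```python
from collections import Counter


def count_ngrams(words, max_n=3):
    """Count every 1-gram up to max_n-gram in a list of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def apply_feedback(corrected_text, tables, max_n=3):
    """Add the N-gram counts of the corrected text to every table in `tables`."""
    delta = count_ngrams(corrected_text.lower().split(), max_n)
    for table in tables:
        table.update(delta)  # Counter.update adds counts rather than replacing them
    return delta
```

In a call such as `apply_feedback(text, [user_dictionary, core_corpus, industry_corpus])`, the three tables would stand for the user dictionary, the core corpus, and the industry corpus that matches the user.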

In this method, the statistical language model uses the hidden Markov model to search the corpora for the words that give the context-dependent word chain at the position of the incorrect word the highest probability of occurrence, and takes these words as the list of correct words. Each corpus has a different weight; the probability of a word in each corpus is weighted by the weight of that corpus to obtain the sorted list of recommended words. The user completes the spell check by choosing among the recommended words.
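The search for the word chain with the largest probability product can be sketched with a small Viterbi routine. This is a simplified illustration, not the patented implementation: it assumes a bigram model (N=2), a caller-supplied `bigram_prob(prev, word)` function, `#` as the sentence-start marker, and a small probability floor so that unseen pairs do not produce `log(0)`.

```python
import math


def viterbi(candidate_sets, bigram_prob, floor=1e-9):
    """Pick one word per position so that the product of bigram probabilities is maximal.

    candidate_sets: list of non-empty lists, the candidate words for each position.
    bigram_prob(prev, word): P(word | prev), with '#' as the start-of-sentence marker.
    """
    # best[w] = (log probability of the best chain ending in w, that chain)
    best = {"#": (0.0, [])}
    for candidates in candidate_sets:
        nxt = {}
        for w in candidates:
            score, chain = max(
                ((s + math.log(max(bigram_prob(p, w), floor)), c)
                 for p, (s, c) in best.items()),
                key=lambda t: t[0],
            )
            nxt[w] = (score, chain + [w])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

Working in log space turns the product of conditional probabilities into a sum, which avoids underflow on long chains; only the best chain ending in each candidate is kept at every position, so the cost grows with the square of the candidate set size rather than exponentially with sentence length.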

This method is based on the N-gram natural language processing model. It uses a core corpus, corpora classified by industry, a user dictionary, and a statistical language model to provide error checking and correct word recommendation for the text input by the user. After the user inputs a piece of text, the server tokenizes it and splits it into a set of word chains under the N-gram grammar, and then calculates the conditional probability in the corpus of the last word of each word chain. The statistical language model computes the several words with the highest probability as the candidate set of correct words. If the original word is in the candidate set, it is judged to be correct; otherwise the user selects a word from the candidate set as the correct word.

The present invention addresses the efficiency and practicality problems of spell checking systems. It uses weighted calculation over classified corpora, combined with the LD measure and a search algorithm, and checks spelling by detecting errors first and recommending afterwards, so that fast error checking and strongly context-relevant word recommendation are both achieved efficiently. Using the Viterbi algorithm, it proposes a statistical language model that can quickly compute the list of words with the largest weighted probability in the corpora for a word in the user's text. The user who receives the recommended word list selects the correct word according to the actual situation and feeds it back to the system. The system adds the word the user selected and its context statistics to the corpora related to the user: the statistical model computes the update data for this word in the core corpus, the industry corpus, and the user dictionary and adds it to the data tables. The next user text to arrive is checked with the new data, so the system provides spell checking that adapts to the actual usage environment and to different users.

Beneficial effects: the present invention uses the corpora efficiently and adjusts its data according to actual user feedback, which makes the system practical, fast to check, and well synchronized (the corpus data is updated promptly according to usage). By combining several different corpora, it achieves efficient spell checking in an environment with many users and highly concurrent requests.

Brief description of the drawings

Fig. 1 is a diagram of the N-gram statistical model of the present invention.

Fig. 2 is a structural diagram of the spell checking system of the present invention.

Fig. 3 is a flow chart of a specific embodiment of the present invention.

Fig. 4 is a diagram of the spell checking function modules.

Fig. 5 is the Google 1T N-gram data information table.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings and a concrete example.

The industry spelling mistake checking method based on user feedback of the present invention mainly solves two problems of current spell checking: the lack of association with the user, and the need to search a large corpus quickly. It involves natural language processing, user dictionary design, database search, and related techniques. The method uses a user dictionary designed by category, applies the N-gram method to check English text for spelling errors, and recommends correct words by searching a large corpus database, thereby achieving spell checking that is associated with the user. The N-gram model (Fig. 1), a basic method of natural language processing, detects errors in the text from the features of words or sentences and the statistical information in the corpus. The categorized user dictionary, based on the current user's historical information and combined with the corpus statistics, selects the recommended words most relevant to the incorrect words in the user's input text. The Viterbi algorithm finds the word chain with the largest product of conditional probabilities in the database, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used. The structure of the whole system and the division of each part into functional modules are shown in Fig. 2 and Fig. 4. The design principles and implementation details of each part are described below.

1. Acquisition and establishment of the corpora and the user dictionary:

The corpus is divided into a core corpus and an industry corpus. As the core statistical data storing language information, it holds the lexical, syntactic, and semantic information of the overall statistical language and of professional terms. During spell checking, the corpus supplies the spell checking model with all word and sentence information, that is, the global data of the statistical language. In addition, special corpus information about the user is obtained from the dictionary that the user builds, and the user's historical information is recorded by computing statistics over the text the user inputs. Data tables are defined in a database to store each corpus and the user's input information. The specific table structures are as follows:

(1) User dictionary table structure

(2) Unigram data table structure

(3) Bigram data table structure

(4) Industry corpus table structure

(5) Weight data table structure
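The column definitions of these five tables are not reproduced in this text. Purely as an illustration of how such tables could be laid out, the following sketch creates five SQLite tables with Python's standard `sqlite3` module. Every table name, column name, and key is a hypothetical choice made for this example and should not be read as the structure defined in the patent.

```python
import sqlite3

# Hypothetical schema: names, columns, and keys are assumptions for illustration only.
SCHEMA = """
CREATE TABLE user_dictionary (user_id INTEGER, word TEXT, count INTEGER,
                              PRIMARY KEY (user_id, word));
CREATE TABLE unigram (word TEXT PRIMARY KEY, count INTEGER);
CREATE TABLE bigram (word1 TEXT, word2 TEXT, count INTEGER,
                     PRIMARY KEY (word1, word2));
CREATE TABLE industry_corpus (industry TEXT, word1 TEXT, word2 TEXT, count INTEGER,
                              PRIMARY KEY (industry, word1, word2));
CREATE TABLE weight (user_id INTEGER PRIMARY KEY, w_h REAL, w_p REAL, w_c REAL);
"""


def open_database(path=":memory:"):
    """Open (or create) a database and create the five illustrative tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

In this layout the `weight` table holds one row per user with the three corpus weights, which corresponds to the per-user weights W_H, W_P, and W_C used later in the description.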

2. Construction of the spell checking model:

The spell checking model is built by applying the N-gram model to the statistical information of the corpora to obtain the word chain combination with the largest weighted conditional probability across the corpora. In building the model, the goal is to judge the correctness of each word accurately and to make the recommendation list practical, without increasing the complexity of data matching and sorting during computation. Based on the global statistical model, the user's needs, and the text information, all corpus data is used to find the weighted probabilities of all possible words. Taking into account the edit distance of each word and its probability in each corpus, an optimal recommendation list for the current word is produced by weighting. The spell checking model performs error checking and word recommendation on the text input by the user, as shown in Fig. 3. The model is divided into two stages:

A) Generating the best candidate set for a word

Definition of the word occurrence probability: when N=3, let the two words preceding the checked word, word, in the text be word1 and word2. The four-tuple (word1, word2, word, COUNT) is retrieved from the corpus, and the probability of word is the ratio of COUNT to the sum of COUNT over the whole corpus. If word is the second word of a sentence, then word1='#'; if word is the first word of a sentence, then word1=word2='#'. COUNT is the total number of times this word combination occurs in the corpus.
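The definition above for N=3 can be written out as a short Python sketch. It assumes the four-tuples are held in a dictionary mapping (word1, word2, word) to COUNT; the function names are illustrative.

```python
def context_of(words, i):
    """Return (word1, word2) for words[i], padding with '#' at the start of a sentence."""
    padded = ["#", "#"] + list(words)
    return padded[i], padded[i + 1]


def word_probability(words, i, trigram_counts):
    """Ratio of the COUNT of (word1, word2, word) to the sum of all COUNT values."""
    total = sum(trigram_counts.values())
    if total == 0:
        return 0.0
    word1, word2 = context_of(words, i)
    return trigram_counts.get((word1, word2, words[i]), 0) / total
```

For the sentence `open the valve`, the first word has context `('#', '#')`, the second has `('#', 'open')`, and the third has `('open', 'the')`.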

A valid corpus and user dictionary must meet the following 3 conditions:

(1) The user dictionary contains no incorrect words: every entry must be a correct word taken from a recognized standard dictionary such as Oxford or Longman, or a special string entered by the user. Here N=1, and the word probability for this part is calculated from the COUNT in the two-tuple (word, COUNT) together with the edit distance of the word itself;

(2) The core corpus is large enough to ensure that the probability calculations in the model are statistically meaningful. Given the amount of data in the corpus, N-gram entries with COUNT<=200 are excluded from the calculation. The core corpus has no bias toward an industry or a time period, and each time a user uses the system, the statistics of the user's text are added to the core corpus;

(3) The industry corpus is preliminarily divided into several broad categories according to application needs, and new industries or combinations can be generated according to users' usage, so the industry fields in the industry corpus keep expanding with use.

First, the words in the text are matched against the dictionary to judge their correctness. If a word is not in the user dictionary, the words in the dictionary that are close to it in edit distance are computed to obtain a candidate word set. Second, for each word in the candidate word set, together with the N words preceding it in the text, the probability is calculated in the industry corpus and the core corpus in turn using the formula

p(w_i \mid w_1 w_2 \cdots w_{i-1}) = \frac{C(w_1 w_2 \cdots w_{i-1} w_i)}{C(w_1 w_2 \cdots w_{i-1})}.
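The count ratio in this formula can be computed directly, as in the following sketch. It assumes a dictionary `counts` that maps word tuples of any length to their occurrence count C; returning 0.0 when the history has never been seen is a choice made for this example.

```python
def conditional_probability(chain, counts):
    """p(w_i | w_1 ... w_{i-1}) = C(w_1 ... w_i) / C(w_1 ... w_{i-1})."""
    history = counts.get(tuple(chain[:-1]), 0)
    if history == 0:
        return 0.0  # assumption: an unseen history contributes zero probability
    return counts.get(tuple(chain), 0) / history
```

For example, if `open the` occurs 10 times and `open the valve` occurs 4 times, the conditional probability of `valve` after `open the` is 4/10.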

The final word weight W_W is calculated from the weights W_H, W_P, and W_C of the user dictionary, the industry corpus, and the core corpus:

W_W = W_H \cdot p_1 + W_P \cdot p_2 + W_C \cdot p_3

Here p_1, p_2, and p_3 are the probabilities of occurrence of the word in the user dictionary, the industry corpus, and the core corpus respectively, and W_H + W_P + W_C = 1. The weights are calculated from how often each corpus is called during the user's use; initially W_H = W_P = W_C. After the words whose weight is less than the threshold W_T are rejected, the candidate word set is obtained.
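The weighted combination and the threshold filter can be sketched as follows. The function names and the dictionary layout are assumptions made for the example; the formula itself and the rejection of words below W_T come from the text.

```python
def word_weight(p1, p2, p3, w_h, w_p, w_c):
    """W_W = W_H*p1 + W_P*p2 + W_C*p3."""
    return w_h * p1 + w_p * p2 + w_c * p3


def filter_candidates(probabilities, weights, threshold):
    """Score every candidate and drop those whose weight is below the threshold W_T.

    probabilities: {word: (p1, p2, p3)} for user dictionary, industry corpus, core corpus.
    weights: (W_H, W_P, W_C), expected to sum to 1.
    """
    scored = {w: word_weight(*p, *weights) for w, p in probabilities.items()}
    return {w: s for w, s in scored.items() if s >= threshold}
```

With equal initial weights of 1/3, a candidate with probabilities (0.3, 0.6, 0.3) gets a weight of about 0.4 and survives a threshold of 0.1, while one with (0.0, 0.03, 0.06) gets about 0.03 and is rejected.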

Table 1 gives the pseudocode of the word weight calculation:

B) Recommending the correct word

If the original word is present in the candidate word set, it is confirmed to be a correct word. Otherwise, the weighted probability of each candidate word is calculated from the corpus weights and the joint probability of the word chain, the candidate word set is sorted by weighted probability, and the sorted words are formed into a recommendation list that is sent to the user.
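Stage B can be sketched as a small function over already scored candidates. The `top_k` limit, the alphabetical tie-breaking, and the return shape are assumptions for the example; the text only says that the sorted words form the recommendation list.

```python
def recommend(original, scored_candidates, top_k=5):
    """Return (is_correct, recommendation_list) for one checked word.

    scored_candidates: {word: weighted probability} from the candidate stage.
    """
    if original in scored_candidates:
        return True, []  # the original word is confirmed as correct
    # Highest weighted probability first; ties are broken alphabetically.
    ranked = sorted(scored_candidates, key=lambda w: (-scored_candidates[w], w))
    return False, ranked[:top_k]
```

A call such as `recommend("vlave", {"valve": 0.4, "slave": 0.2, "value": 0.1})` reports the word as incorrect and lists `valve` first.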

3. Updating the user-related text statistics, dictionary, and corpora

After the user has received the word recommendations and selected the correct word, the user's input text also becomes part of the corpus. The system computes statistics over the user's corrected text and enters the N-gram information of the correct text into the user dictionary, the core corpus, and the corresponding industry corpus, adding the occurrence counts and context data to these data tables. In addition, according to the corpus in which the correct word chosen by the user is found, the weights W_H, W_P, and W_C that this user applies to each corpus are recalculated.
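The text does not give the formula used to recalculate the weights. One possible scheme, shown here only as an assumption, makes each weight proportional to the number of times the user's chosen word came from that corpus, with add-one smoothing so that a new user starts with three equal weights that sum to 1.

```python
SOURCES = ("user", "industry", "core")  # correspond to W_H, W_P, W_C


def recompute_weights(selection_counts):
    """Turn per-corpus selection counts into weights that sum to 1.

    selection_counts: {'user': n, 'industry': n, 'core': n}; missing keys count as 0.
    Add-one smoothing is an assumption of this sketch, not a detail from the patent.
    """
    smoothed = {k: selection_counts.get(k, 0) + 1 for k in SOURCES}
    total = sum(smoothed.values())
    return {k: v / total for k, v in smoothed.items()}
```

With no selections recorded, all three weights are 1/3, which matches the initial condition W_H = W_P = W_C; after three selections from the industry corpus, the industry weight rises to 4/6.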

The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the claims appended to the present invention.

Claims

1. An industry spelling mistake checking method based on user feedback, characterized by comprising the steps of:

Step one, acquisition and establishment of the corpora and the user dictionary:

The corpus is divided into a user dictionary, a core corpus, and an industry corpus. As the core statistical data storing language information, it holds the lexical, syntactic, and semantic information of the whole language. During spell checking, the corpus supplies the spell checking model with all word and sentence information, that is, the global data of the whole language. In addition, new corpus information about the user is obtained from the text the user inputs and the user's usage, and the corpora and the user dictionary are updated;

Data tables are defined in a database to store the overall corpus and the user's input information;

Step two, construction of the spell checking model:

The spell checking model is built by applying the N-gram model to the statistical information of the corpus to obtain the word chain combination with the largest conditional probability;

Step three, the system interaction interface processes the text input by the user using the error checking and word recommendation of the spell checking model;

Step four, updating the user-related text statistics, dictionary, and corpora: statistics are computed over the user's input and the correct words the user selects, and the word information and context of the correct text are entered into the user dictionary, the core corpus, and the corresponding industry corpus.

2. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that, in step one, the necessary conditions for a valid corpus and user dictionary comprise:

(1) The dictionary contains no incorrect words: every entry must be a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry or special word defined by the user;

(2) The core corpus is large enough, has no bias toward an industry or a time period, and must include N-gram information, which provides the basic word chain statistics;

(3) The industry corpus is preliminarily divided according to demand and then grows naturally from users' choices; a given user can be a user of several industry corpora;

(4) The user dictionary is a dictionary built according to the user's input needs, and the user is allowed to manage it.

3. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that step two specifically comprises:

Step 2.1, judging word correctness: each word in the text is matched against the standard dictionary; if the word is not in the standard dictionary, the industry corpus and the user dictionary are checked in turn; if the word is found in none of these three data tables, it is judged to be an incorrect word and the next step is carried out;

Step 2.2, recommending correct words: based on edit distance and the joint probability of the word chain, the correct words most relevant to the incorrect word are calculated by weighting across the corpora, and the several words with the largest combined probability are selected to form the recommendation list for the incorrect word.

4. The industry spelling mistake checking method based on user feedback according to claim 3, characterized in that, in step 2.1, the Viterbi algorithm is used in the N-gram model to quickly match the occurrence probability of the current word in each corpus and to obtain the joint probability of the current word occurring with the preceding N-1 words, which decides whether the current word is correct.

5. The industry spelling mistake checking method based on user feedback according to claim 3, characterized in that, in step 2.2, the recommended word list is sorted by edit distance and word occurrence probability, and the recommendation results are then provided to the user; the weight used to sort the word list is obtained by weighting the probability of the word in each corpus.

6. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that, in step four, after the text input by the user has been checked for errors, the text statistics of the user's input are calculated to provide update information for the N-gram data in the user dictionary and the corpora, and after the corresponding data tables have been updated, the error checking service is provided using the new corpus data and dictionary.

7. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that the hidden Markov model is used to search the corpora for the words that give the context-dependent word chain at the position of the incorrect word the highest probability of occurrence, and these words are taken as the list of correct words; each corpus has a different weight, and the probability of a word in each corpus is weighted by the weight of that corpus to obtain the sorted list of recommended words; the user completes the spell check by choosing among the recommended words.

8. The industry spelling mistake checking method based on user feedback according to claim 1, characterized in that a user dictionary, a core corpus, an industry corpus, and a statistical language model are used; after the user inputs a piece of text, the server tokenizes the text and splits it into a set of word chains under the N-gram grammar, and then calculates the conditional probability in the corpus of the last word of each word chain; the statistical language model computes the several words with the highest probability as the candidate set of correct words; if the original word is in the candidate set, the original word is judged to be correct, and otherwise the user selects a word from the candidate set as the correct word.

9. The industry spelling mistake checking method based on user feedback according to claim 8, characterized in that the statistical language model uses the Viterbi algorithm to compute the list of words with the largest weighted probability in the corpora for a word in the user's text; the user who receives the recommended word list selects the correct word according to the actual situation, and the word and its context statistics are added to the corpora related to the user; the statistical language model computes the update data for this word in the user dictionary, the core corpus, and the industry corpus and adds it to the data tables, and the next user text to arrive is checked for spelling errors with the new data.