CN103885938A - Industry spelling mistake checking method based on user feedback

Info

Publication number: CN103885938A
Application number: CN201410149427.8A
Authority: CN
Inventors: 杨明; 罗军舟; 倪俊辉; 马成平; 任新才
Original assignee: Southeast University; Focus Technology Co Ltd
Current assignee: Southeast University; Focus Technology Co Ltd
Priority date: 2014-04-14
Filing date: 2014-04-14
Publication date: 2014-06-25
Anticipated expiration: 2034-04-14
Also published as: CN103885938B

Abstract

The invention discloses an industry spelling error checking method based on user feedback. The method checks English text for spelling errors using an N-gram method and a user dictionary designed by category, and recommends correct words by searching a large corpus database, so that spelling error checking is tied to the individual user. The N-gram method, a basic method of natural language processing, checks errors in the text according to the features of words or sentences and the statistical information in the corpus. The categorized user dictionary draws on the current user's historical information and, together with the corpus statistics, selects the recommended words most relevant to the wrong words in the text entered by the user. The Viterbi algorithm is used to find the word chain with the largest product of conditional probabilities in the database, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used.

Description

Industry spelling error checking method based on user feedback

Technical field

The present invention is an English spelling error checking method that uses corpora containing large amounts of language information, statistical natural language models, hidden Markov models (HMM), and related techniques. It relates to natural language processing, in particular to the field of English spell checking.

Background technology

The abbreviations used in the present invention are defined first:

NLP: Natural Language Processing;

BNC: British National Corpus;

LDC: Linguistic Data Consortium;

LD: Levenshtein Distance (edit distance);

N-gram: N-gram grammar.

Spelling error checking (spell checking) is an important branch and basic component of NLP. It turns natural language input into error-free, intelligible text and naturally supports higher-level NLP technologies such as machine translation, speech synthesis, and speech recognition. It can also make user interfaces friendlier and more intelligent, and so has significant practical value.

Early NLP mainly used methods based on syntactic and semantic rules. With the rise of corpus construction and corpus linguistics, processing large-scale real text became the main goal of natural language processing. After many years of development, rule-based methods still cannot overcome their limits in accuracy and efficiency, and statistical methods have gradually shown greater advantages in the field. Natural language processing, spelling error checking included, increasingly uses statistics-based automatic learning to acquire linguistic knowledge. Statistics-based methods mainly involve two aspects: the corpus and the statistical language model.

Many organizations and research institutions provide corpora and their associated statistics, for example Chinese and English news corpora for text classification research, the BNC, the LDC, the more than 4200 free e-books provided by Project Gutenberg, ten thousand randomly sampled papers from the DBLP resource, and the UCI classification evaluation data.

Brants and Franz of Google tokenized web page text in the manner of the Penn Treebank and produced more than 1T of data in total; the details are shown in Table 1. The 5-gram corpus based on 1T of web page text data released by Google is currently one of the more comprehensive English corpora for statistical methods. It provides statistical information for 1-grams through 5-grams and is a rich source of analysis data for statistics-based natural language processing.

On the corpus side, a dictionary gives word correction its most basic capability, non-word error checking. A well-designed standard dictionary that is extensible and has a good management interface provides users with basic word detection and improves system performance. A corpus that supports statistical methods is the basis of spelling error checking: it supplies the natural language processing model with usable data of considerable scale and with full and accurate information. A semantics-based corpus is a good model for dividing professional domains, but because syntax rules are inefficient this approach is not practical. A statistical method is therefore needed to realize the industry-classified corpus indirectly.

Traditional spelling error checking focuses on non-word errors, in which a correct word is entered as an invalid word. The usual approach uses a reliable dictionary and a fixed distance measure such as LD. Because building a reliable dictionary by hand is very costly, the dictionaries used by traditional spell checkers are relatively small. As statistical models were introduced into spelling error checking, the error model and the N-gram language model became the key components of a spell checking system. Kukich proposed the error-probability transition matrix and the use of feature vectors in spelling correction, which formed the basis for later N-gram methods. Brill and Moore showed that a good statistical model is the key to improving spell checking accuracy, but building such an error model requires a large amount of manual labelling of correction phrases, which is expensive. Whitelaw et al. improved this efficiency to some extent by using Web text. With the development of Web technology and applications, spelling error checking has received more attention and more error types have been identified, such as omitted letters, wrongly added letters, transposed letters, wrongly merged or split words, and misused words. These methods mainly address detecting input errors, searching the candidate word space, and building a scoring function for candidate words.

Most existing spelling error checking models are offline models based on the N-gram model, and this approach has become the mainstream of spell checking research. The main idea is to compute the statistical information in natural language with an extended Bayesian formula; the defining feature is the use of statistical methods, which keeps the model simple and efficient. The main tools of current research are the N-gram model, the extended Bayesian formula, and the hidden Markov model. The work divides into computing word probabilities with the Bayesian formula, using the hidden Markov model to solve for the parameters of the N-gram model and the Bayesian formula, and solving the hidden Markov model quickly. The efficiency and practicality of the model are the problems this field most urgently needs to solve.

Summary of the invention

Technical problem: In a spelling error checking system the corpus is the foundation of the whole model, and the computation and queries performed on it inevitably become the performance bottleneck of the system. If the corpus is based on syntax rules, or only counts word frequencies, queries tend to suffer either from poor performance caused by rule explosion or from inaccurate results caused by insufficient statistics. As for the checking model, matching by a single measure gives results with large errors, while relying on the N-gram computation model alone has a large impact on system performance. The technical problem addressed by the present invention is that existing systems lack the ability to adjust dynamically on the basis of user feedback and cannot make effective combined use of multiple kinds of corpus information. To make effective use of multiple corpora, the method combines a user dictionary, an industry corpus, and a core corpus and weights their results. The method offers fast queries, accurate results, and high adaptability to context; it automatically adjusts how the different parts of the corpus are used for different users and text environments, which improves system efficiency and protects the accuracy of the results. The present invention uses the Viterbi algorithm to compute the Markov chain in the N-gram model and obtains the set of words most likely to be correct. In the corpus, the probability of each possible word is computed from the N-1 words preceding the wrong word; weights are computed from the LD measure and from the part of the corpus in which the word is found, giving a recommendation list sorted by the probability that each word is the correct one. Based on the correct word chosen by the user and its context, the statistics of the user's text are entered into the system's corpus. After the system obtains the new statistical information, it revises the word frequencies and conditional probabilities of the relevant records in the corpus data tables according to the statistical algorithm of the N-gram model, so that the corpus stays synchronized with the user's actual usage and records the statistics of all historical texts. This completes the overall update of the spelling error checking system.

Technical solution:

To solve the above technical problem, the present invention uses N-gram corpus data and related statistical methods and proposes an industry spelling error checking method based on user feedback. The method is as follows:

An industry spelling error checking method based on user feedback comprises the steps of:

1) Obtaining and building the corpus and the user dictionary:

The corpus is divided into a core corpus and an industry corpus. As the core statistics that store language information, they hold the lexical, syntactic, and semantic information of the general statistical language and of industry terms. During spelling error checking, the core corpus and the industry corpus provide the spell checking model with all word and sentence information, that is, the global data of the whole language. In addition, special corpus information about the user is obtained from the dictionary the user builds;

Data tables are defined in the database to store the global corpus and the user's corpus information;

2) Building the spell checking model:

The spelling error checking model is built by computing over the statistical information of the corpus with the N-gram model to obtain the word chain combination with the maximum conditional probability. The steps comprise:

21) Judging the correctness of a word: each word in the text is matched against the core corpus; if the word is not in the core corpus, the industry corpus and the user dictionary are then checked in turn. If the word is in none of these three data tables, it is judged to be a wrong word and the next step is carried out;

22) Recommending correct words: for the words in each corpus that are close to the wrong word in edit distance, the probability of each word and its joint probability with the context are computed; the correct words most relevant to the wrong word are then computed using the weight of each corpus, and the several words with the highest weighted posterior probability across all corpora form the recommendation list of correct words;

3) The text entered by the user is processed by the error checking and word recommendation of the spell checking model;

4) Updating the user-related text statistics, dictionary, and corpus: the text entered by the user and the correct words selected are counted, and the correct word information and context information of the text are recorded into the user dictionary, the core corpus, and the corresponding industry corpus.
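The lookup order of step 21) can be sketched as follows. This is a minimal illustration, not the patented implementation: the three sources are modelled as plain Python sets rather than database tables, and lowercasing the word before lookup is an assumption that the text does not state.

```python
def is_misspelled(word, core_corpus, industry_corpus, user_dictionary):
    """Return True if the word is absent from all three sources (step 21)."""
    token = word.lower()  # assumption: lookups are case-insensitive
    # Order follows step 21: core corpus, then industry corpus, then user dictionary.
    for source in (core_corpus, industry_corpus, user_dictionary):
        if token in source:
            return False
    return True


# Hypothetical toy data, for illustration only.
core = {"the", "price", "of", "is"}
industry = {"fob", "incoterms"}
user = {"focustech"}
flagged = [w for w in "the prise of FOB".split()
           if is_misspelled(w, core, industry, user)]
```

Only words that fail all three lookups are passed on to the recommendation step, so industry terms and user-defined words are never flagged.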

In step 1), the necessary conditions for a valid corpus and user dictionary comprise:

(1) The user dictionary contains no wrong words: every entry must be a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry term or special word defined by the user;

(2) The core corpus is large enough, has no bias toward a particular industry or time period, and must include N-gram information, so as to provide basic statistics on word context;

(3) The industry corpus is initially divided according to demand and then grows naturally according to users' selections; a single user may be a user of several industry corpora.

In step 21), the Viterbi algorithm is used in the N-gram model to quickly compute the probability of the current word in the core corpus and the industry corpus, and to obtain the joint probability of the current word occurring with the preceding N-1 words; the correctness of the current word is judged on that basis.

In step 22), the N-gram model is used to search the industry corpus and the core corpus at the position of the wrong word, and the user dictionary is matched by edit distance and word occurrence probability, to obtain the list of most likely words. Because each word has a probability in each of the different corpora, the recommended word list is sorted by weighted probability, and the sorted recommendations are then presented to the user.
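The edit-distance matching in step 22) can be illustrated with the standard dynamic-programming Levenshtein distance. The helper `close_words` and its `max_distance=2` cutoff are assumptions made for this sketch; the text does not give a distance threshold.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def close_words(wrong_word, vocabulary, max_distance=2):
    """Words within max_distance of wrong_word, nearest first, ties alphabetical."""
    scored = sorted((levenshtein(wrong_word, w), w) for w in vocabulary)
    return [w for distance, w in scored if distance <= max_distance]
```

For example, `close_words("prise", {"price", "prize", "praise", "product"})` keeps the three words at distance 1 and drops "product".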

In step 4), after the system has checked the text entered by the user for errors, it computes the text statistics of the user's input, which supply update information for the N-gram data in the user dictionary and the corpus. Once the corresponding data tables are updated, the error checking service runs on the new corpus data and user dictionary.
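A minimal sketch of the update in step 4), assuming the N-gram data is held in an in-memory `collections.Counter` keyed by word tuples instead of the database tables the method describes. The function name is hypothetical, and the '#' padding follows the sentence-start convention used in the embodiment.

```python
from collections import Counter


def update_ngram_counts(counts, corrected_words, n=3):
    """Add every 1..n-gram of the corrected text to counts, in place."""
    padded = ["#"] * (n - 1) + list(corrected_words)
    for end in range(n - 1, len(padded)):
        for size in range(1, n + 1):
            # N-gram of length `size` ending at position `end`.
            counts[tuple(padded[end - size + 1:end + 1])] += 1
    return counts


counts = Counter()
update_ngram_counts(counts, ["the", "price"], n=2)
```

Calling the function once per corrected text accumulates occurrence counts, so later checks see the user's actual usage.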

The statistical language model used in this method applies a hidden Markov model to find, in the corpus, the words whose word chain with the context at the wrong word's position has the highest occurrence probability, and takes them as the list of correct words. Each corpus has a different weight; the sorted list of recommended words is obtained by weighting each word's probability in a corpus by that corpus's weight. Spelling error checking is completed when the user selects one of the recommended words.
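The text does not spell out the states and transitions of the hidden Markov model, so the following is only a sketch under stated assumptions: each text position has a list of candidate words, transitions are bigram conditional probabilities (N=2), there is no separate emission term, and unseen bigrams receive a small floor probability. The function name and the `floor` parameter are hypothetical.

```python
import math


def viterbi_best_chain(candidates, bigram_prob, floor=1e-12):
    """Return the word chain with the largest product of conditional probabilities.

    candidates: list of candidate-word lists, one per text position.
    bigram_prob: dict mapping (previous_word, word) to P(word | previous_word);
                 '#' stands for the sentence start.
    """
    # best[w] = (log-probability of the best chain ending in w, that chain)
    best = {w: (math.log(bigram_prob.get(("#", w), floor)), [w])
            for w in candidates[0]}
    for position in candidates[1:]:
        step = {}
        for w in position:
            # Viterbi recursion: keep only the best predecessor for each word.
            score, chain = max(
                (prev_score + math.log(bigram_prob.get((prev, w), floor)), prev_chain)
                for prev, (prev_score, prev_chain) in best.items()
            )
            step[w] = (score, chain + [w])
        best = step
    return max(best.values())[1]
```

Working in log space avoids underflow when many small probabilities are multiplied, and keeping one best predecessor per word is what makes the search linear in the text length.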

This method is based on the N-gram natural language processing model and uses a core corpus, an industry-classified corpus, a user dictionary, and a statistical language model to provide error checking and correct-word recommendation for text entered by the user. After the user enters a piece of text, the server tokenizes it and splits it into a set of word chains under the N-gram grammar, and then computes the conditional probability in the corpus of the last word of each word chain. The statistical language model computes the several words with the highest probability as the candidate set of correct words. If the original word is in the candidate set, it is judged to be correct; otherwise the user selects a word from the candidate set as the correct word.

The present invention addresses the efficiency and practicality of spelling error checking systems. It combines weighted computation over classified corpora with the LD measure and a search algorithm, and checks spelling by first detecting errors and then recommending words, so that error checking and context-aware word recommendation are both fast and efficient. Using the Viterbi algorithm, it proposes a statistical language model that quickly computes, for the words in the user's text, the list of words with the highest weighted probability in the corpus. The user who receives the recommended word list selects the correct word according to the actual situation and feeds it back to the system. The system adds the word selected by the user and its context statistics to the corpora related to that user: the statistical model computes the updated data for the word in the core corpus, the industry corpus, and the user dictionary and adds it to the data tables, and the next user text to arrive is checked against the new data. The system can therefore check spelling according to the actual usage environment and the individual user.

Beneficial effects: the present invention uses the corpus efficiently and adjusts its data according to actual user feedback, which makes the system highly practical, fast at checking, and well synchronized (the corpus data is updated promptly according to usage). By combining several different corpora, it can perform efficient spelling error checking in an environment with many users and highly concurrent requests.

Brief description of the drawings

Fig. 1 is a diagram of the N-gram statistical model of the present invention.

Fig. 2 is a structure diagram of the spelling error checking system of the present invention.

Fig. 3 is a flowchart of a specific embodiment of the present invention.

Fig. 4 is a diagram of the functional modules of spelling error checking.

Fig. 5 is a table of the Google 1T N-gram data.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and a concrete example.

The industry spelling error checking method based on user feedback of the present invention mainly solves two problems of current spelling error checking: the lack of association with the user and the need to search a large corpus quickly. It involves natural language processing, user dictionary design, database search, and related techniques. The method uses a user dictionary designed by category, checks English text for spelling errors with the N-gram method, and recommends correct words by searching a large corpus database, so that spelling error checking is associated with the user. The N-gram model (Fig. 1), a basic method of natural language processing, checks errors in the text according to the features of words or sentences and the statistical information in the corpus. The categorized user dictionary draws on the current user's historical information and, together with the corpus statistics, selects the recommended words most relevant to the wrong words in the user's input text. The Viterbi algorithm is used to find the word chain with the largest product of conditional probabilities in the database, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used. The structure of the whole system and the division into functional modules are shown in Fig. 2 and Fig. 4. The design principles and implementation details of each part are described below.

1. Obtaining and building the corpus and the user dictionary:

The corpus is divided into a core corpus and an industry corpus. As the core statistics that store language information, they hold the lexical, syntactic, and semantic information of the general statistical language and of industry terms. During spelling error checking, the corpus provides the spell checking model with all word and sentence information, that is, the global data of the statistical language. In addition, special corpus information about the user is obtained from the dictionary the user builds, and the user's historical information is recorded by counting the text the user enters. Data tables are defined in the database to store each corpus and the user's input information. The table structures are as follows:

(1) User dictionary table structure

(2) Unigram data table structure

(3) Bigram data table structure

(4) Industry corpus table structure

(5) Weight data table structure

2. Building the spelling error checking model:

The spelling error checking model is built by computing over the statistical information of the corpus with the N-gram model to obtain the word chain combination with the maximum weighted conditional probability across the corpora. In building the model, the goals are to judge the correctness of each word accurately and to produce a useful recommendation list, without increasing the complexity of data matching and sorting during computation. Based on the global statistical model, the user's needs, and the text information, all corpus data is used to find the weighted probabilities of all possible words. Taking into account the edit distance of each word and its probability in each corpus, an optimal recommendation list for the current word is produced by weighted computation. The spell checking model performs error checking and word recommendation on the text entered by the user, as shown in Fig. 3. The model has two stages:

A) Generating the best candidate set of words

Definition of word occurrence probability: when N=3, let the two words preceding the checked word word in the text be word1 and word2. Take the four-tuple (word1, word2, word, COUNT) from the corpus and compute the ratio of COUNT to the sum of COUNT over the whole corpus; this ratio is the probability of word. Here word1 and word2 are the two words preceding the current word word in the text. If word is the second word of a sentence, word1='#'; if word is the first word of a sentence, word1=word2='#'. COUNT is the total number of times this word combination occurs in the corpus.
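The definition above can be sketched for N=3 as follows. The function names and the dictionary that maps (word1, word2, word) to COUNT are illustrative assumptions standing in for the corpus data table.

```python
def trigram_context(words, index):
    """Return (word1, word2) for words[index], padding with '#' at the sentence start."""
    padded = ["#", "#"] + list(words)
    return padded[index], padded[index + 1]


def word_probability(words, index, trigram_counts):
    """COUNT of (word1, word2, word) divided by the sum of all COUNT values."""
    word1, word2 = trigram_context(words, index)
    total = sum(trigram_counts.values())
    if total == 0:
        return 0.0
    return trigram_counts.get((word1, word2, words[index]), 0) / total
```

With this padding the first word of a sentence gets the context ('#', '#') and the second word gets ('#', first word), matching the two special cases in the definition.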

A valid corpus and user dictionary must satisfy the following three conditions:

(1) The user dictionary contains no wrong words: every entry must be a correct word taken from a recognized standard dictionary such as Oxford or Longman, or a special string entered by the user. Here N=1, and the word probability for this part is calculated from the COUNT in the two-tuple (word, COUNT) together with the edit distance of the word itself;

(2) The core corpus is large enough to guarantee that the probability calculations in the model are statistically meaningful. Given the volume of data in the corpus, N-gram data with COUNT<=200 is not included in the calculation. The core corpus has no bias toward a particular industry or time period, and each time a user uses the system the statistics of the user's text are added to the core corpus;

(3) According to application demand, the industry corpus is initially divided into several broad classes, and new industries or combinations can be generated according to users' usage; that is, the range of industries in the industry corpus keeps expanding with use.
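The COUNT<=200 rule in condition (2) amounts to a simple filter. The sketch below assumes the N-gram data is a dictionary from word tuples to counts; `prune_low_counts` is a hypothetical helper name.

```python
def prune_low_counts(ngram_counts, min_count=200):
    """Keep only N-grams whose COUNT is strictly greater than min_count."""
    return {ngram: count for ngram, count in ngram_counts.items()
            if count > min_count}
```

Entries with exactly 200 occurrences are dropped, because the condition excludes COUNT<=200.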

First the words in the text are matched against the dictionary to judge their correctness. If a word is not in the user dictionary, the close words in the dictionary are computed by edit distance to obtain the candidate word set. Then, for each word in the candidate set, together with the words preceding it in the text, the probability is computed in turn from the industry corpus and the core corpus using the formula

$$p(w_i \mid w_1 w_2 \cdots w_{i-1}) = \frac{C(w_1 w_2 \cdots w_{i-1} w_i)}{C(w_1 w_2 \cdots w_{i-1})}.$$
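Reading C(·) as the number of times a word sequence occurs in the corpus, the formula is a ratio of two counts. The sketch below assumes the counts are held in two dictionaries keyed by word tuples; returning 0.0 for an unseen context is a choice made here, not something the text specifies.

```python
def conditional_probability(chain, ngram_counts, context_counts):
    """Estimate p(w_i | w_1 ... w_{i-1}) as C(w_1 ... w_i) / C(w_1 ... w_{i-1}).

    chain: the words w_1 ... w_i, with the word being scored last.
    """
    numerator = ngram_counts.get(tuple(chain), 0)
    denominator = context_counts.get(tuple(chain[:-1]), 0)
    return numerator / denominator if denominator else 0.0
```

For instance, if "the price" occurs 50 times and "the price of" occurs 20 times, the estimate for "of" after "the price" is 20/50 = 0.4.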

The final word weight $W_w$ is calculated from the weights $W_h$, $W_p$, $W_c$ of the user dictionary, the industry corpus, and the core corpus. Here $p_1$, $p_2$, $p_3$ are the occurrence probabilities of the word in the user dictionary, the industry corpus, and the core corpus respectively, and $W_h + W_p + W_c = 1$. The weights are calculated from how each corpus is called during the user's use; initially $W_h = W_p = W_c$. After the words whose weight is less than the threshold $W_t$ are rejected, the candidate word set is obtained.

$$W_w = W_h p_1 + W_p p_2 + W_c p_3$$
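The weighted sum and the threshold test can be written directly. The function names and the dictionary layout are hypothetical, and the default weights are the equal initial weights described in the text.

```python
def word_weight(p_user, p_industry, p_core, weights=(1 / 3, 1 / 3, 1 / 3)):
    """W_w = W_h*p1 + W_p*p2 + W_c*p3, with weights given as (W_h, W_p, W_c)."""
    w_h, w_p, w_c = weights
    return w_h * p_user + w_p * p_industry + w_c * p_core


def filter_candidates(probabilities, weights, threshold):
    """Reject words whose weight is below the threshold W_t.

    probabilities: dict mapping word to (p1, p2, p3).
    """
    scored = {word: word_weight(*p, weights=weights)
              for word, p in probabilities.items()}
    return {word: score for word, score in scored.items() if score >= threshold}
```

With weights (0.5, 0.3, 0.2), a word with probabilities (0.2, 0.5, 0.3) scores 0.1 + 0.15 + 0.06 = 0.31 and survives a threshold of 0.05, while one with (0.0, 0.01, 0.05) scores 0.013 and is rejected.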

Table 1 is the pseudocode of the word weight calculation algorithm:

B) Recommending correct words

If the original word is present in the candidate word set, it is confirmed to be a correct word. Otherwise, the weighted probability of each candidate word is computed from the corpus weights and the word chain joint probability, the candidate set is sorted by weighted probability, and the sorted words are sent to the user as the recommendation list.
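Stage B can be sketched as a membership test followed by a sort. The `top_k` cutoff is an assumption for this sketch; the text only says that the sorted words form the recommendation list.

```python
def recommend(original_word, weighted_candidates, top_k=5):
    """Return (is_correct, recommendations).

    weighted_candidates: dict mapping candidate word to its weighted probability.
    """
    if original_word in weighted_candidates:
        return True, []  # the original word is confirmed as correct
    ranked = sorted(weighted_candidates, key=weighted_candidates.get, reverse=True)
    return False, ranked[:top_k]
```

For example, `recommend("prise", {"price": 0.31, "prize": 0.12, "praise": 0.2})` reports the word as wrong and ranks "price" first.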

3. Updating the user-related text statistics, dictionary, and corpus

After the user receives the word recommendations and selects the correct word, the user's input text also becomes part of the corpus. The system counts the text as corrected by the user and records the N-gram information of the correct text into the user dictionary, the core corpus, and the corresponding industry corpus, adding the occurrence counts and context data to these data tables. According to which corpus contains the correct word the user chose, the system then recalculates the weights $W_h$, $W_p$, $W_c$ that apply to this user when each corpus is called.
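The text does not give the formula for recalculating the weights, so the following is one plausible sketch and not the patented rule: each weight is made proportional to how often the user's chosen word came from that source, with additive smoothing so that a new user starts from equal weights. The key names and the `smoothing` parameter are assumptions.

```python
def recompute_weights(selection_counts, smoothing=1.0):
    """Return (W_h, W_p, W_c), normalized so that they sum to 1.

    selection_counts: dict with optional keys 'user', 'industry', 'core' giving
    how many times the user's chosen word was found in that source.
    """
    keys = ("user", "industry", "core")
    smoothed = [selection_counts.get(key, 0) + smoothing for key in keys]
    total = sum(smoothed)
    return tuple(value / total for value in smoothed)
```

With no history the function returns three equal weights, which matches the stated initial condition; as selections accumulate, the source the user relies on most gains weight.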

The present invention may also have many other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, and all such changes and modifications shall fall within the protection scope of the appended claims.

Claims

1. An industry spelling error checking method based on user feedback, characterized by comprising the steps of:

Step 1, obtaining and building the corpus and the user dictionary:

The corpus is divided into a user dictionary, a core corpus, and an industry corpus. As the core statistics that store language information, they hold the lexical, syntactic, and semantic information of the whole language. During spelling error checking, the corpus provides the spelling error checking model with all word and sentence information, that is, the global data of the whole language. In addition, new corpus information about the user is obtained from the text the user enters and from the user's usage, and the corpus and the user dictionary are updated;

Data tables are defined in the database to store the global corpus and the user's input information;

Step 2, building the spelling error checking model:

The spelling error checking model is built by computing over the statistical information of the corpus with the N-gram model to obtain the word chain combination with the maximum conditional probability;

Step 3, the system's interactive interface processes the text entered by the user using the error checking and word recommendation of the spelling error checking model;

Step 4, updating the user-related text statistics, dictionary, and corpus: the user's input and the correct words selected are counted, and the word information and context statistics of the correct text are recorded into the user dictionary, the core corpus, and the corresponding industry corpus.

2. The industry spelling error checking method based on user feedback according to claim 1, characterized in that, in step 1, the necessary conditions for a valid corpus and user dictionary comprise:

(1) The dictionary contains no wrong words: every entry must be a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry term or special word defined by the user;

(2) The core corpus is large enough, has no bias toward a particular industry or time period, and must include N-gram information, so as to provide basic word chain statistics;

(3) The industry corpus is initially divided according to demand and then grows naturally according to users' selections; a given user may be a user of several industry corpora;

(4) The user dictionary is a dictionary built according to the user's input needs and can be managed by the user.

3. The industry spelling error checking method based on user feedback according to claim 1, characterized in that step 2 specifically comprises:

Step 2.1, judging the correctness of a word: each word in the text is matched against the standard dictionary; if the word is not in the standard dictionary, the industry corpus and the user dictionary are then checked in turn; if the word is in none of these three data tables, it is judged to be a wrong word and the next step is carried out;

Step 2.2, recommending correct words: based on edit distance and word chain joint probability, the correct words most relevant to the wrong word are computed by weighting across the corpora, and the several words with the highest combined probability are selected to form the recommendation list for the wrong word.

4. The industry spelling error checking method based on user feedback according to claim 3, characterized in that, in step 2.1, the Viterbi algorithm is used in the N-gram model to quickly match the occurrence probability of the current word in each corpus and to obtain the joint probability of the current word occurring with the preceding N-1 words, and the correctness of the current word is judged on that basis.

5. The industry spelling error checking method based on user feedback according to claim 3, characterized in that, in step 2.2, the recommended word list is sorted by edit distance and word occurrence probability, and the recommendations are then presented to the user; the weights used to sort the word list are obtained by weighting each word's probability in each corpus.

6. The industry spelling error checking method based on user feedback according to claim 1, characterized in that, in step 4, after the text entered by the user has been checked for errors, the text statistics of the user's input are computed to supply update information for the N-gram data in the user dictionary and the corpus; once the corresponding data tables are updated, the error checking service is provided with the new corpus data and dictionary.

7. The industry spelling error checking method based on user feedback according to claim 1, characterized in that a hidden Markov model is used to find, in the corpus, the words whose word chain with the context at the wrong word's position has the highest occurrence probability, and these are taken as the list of correct words; each corpus has a different weight, and the sorted list of recommended words is obtained by weighting each word's probability in a corpus by that corpus's weight; spelling error checking is completed when the user selects one of the recommended words.

8. The industry spelling error checking method based on user feedback according to claim 1, characterized in that a user dictionary, a core corpus, an industry corpus, and a statistical language model are used; after the user enters a piece of text, the server tokenizes it and splits it into a set of word chains under the N-gram grammar, and then computes the conditional probability in the corpus of the last word of each word chain; the statistical language model computes the several words with the highest probability as the candidate set of correct words; if the original word is in the candidate set, it is judged to be correct; otherwise the user selects a word from the candidate set as the correct word.

9. The industry spelling error checking method based on user feedback according to claim 8, characterized in that the statistical language model uses the Viterbi algorithm to compute, for the words in the user's text, the list of words with the highest weighted probability in the corpus; the user who receives the recommended word list selects the correct word according to the actual situation, and the word and its context statistics are added to the corpora related to that user; the statistical language model computes the updated data for the word in the user dictionary, the core corpus, and the industry corpus and adds it to the data tables, and the next user text to arrive is checked for spelling errors against the new data.