CN103885938A - Industry spelling mistake checking method based on user feedback - Google Patents

Industry spelling mistake checking method based on user feedback

Info

Publication number
CN103885938A
Authority
CN
China
Prior art keywords
word
corpus
user
industry
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410149427.8A
Other languages
Chinese (zh)
Other versions
CN103885938B (en)
Inventor
杨明
罗军舟
倪俊辉
马成平
任新才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Focus Technology Co Ltd
Original Assignee
Southeast University
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University and Focus Technology Co Ltd
Priority to CN201410149427.8A
Publication of CN103885938A
Application granted
Publication of CN103885938B
Legal status: Expired - Fee Related

Abstract

The invention discloses an industry spelling error checking method based on user feedback. The method checks English text for spelling errors using an N-gram method and a user dictionary designed by category, and recommends correct words by searching a large corpus database, thereby achieving spelling checking that is tied to the user. The N-gram method, a basic method of natural language processing, detects errors in the text from the characteristics of words or sentences and the statistical information in a corpus. The categorized user dictionary draws on the current user's historical information and, combined with the corpus statistics, selects the recommended words most relevant to the erroneous words in the user's input text. The Viterbi algorithm is used to find the word chain with the largest product of conditional probabilities in the database, which improves both the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used.

Description

Industry spelling error checking method based on user feedback
Technical field
The present invention is an English spelling error checking method. It draws on related technologies such as corpora containing large amounts of language information, statistical natural language models, and the hidden Markov model (HMM), and belongs to the field of natural language processing, in particular English spelling checking.
Background art
The abbreviations used in the present invention are defined first:
NLP: Natural Language Processing;
BNC: British National Corpus;
LDC: Linguistic Data Consortium;
LD: Levenshtein Distance, i.e. edit distance;
N-gram: N-gram grammar (a sequence of N consecutive words).
Spelling checking (the Spelling Checker) is an important branch and a basic component of NLP. It turns natural language into error-free, intelligible text and naturally supports higher-level NLP technologies such as machine translation, speech synthesis, and speech recognition. The technique can also make user interfaces friendlier and more intelligent, so it has significant practical value.
Early NLP mainly used methods based on syntactic and semantic rules. With the rise of corpus construction and corpus linguistics, processing large-scale real text became the main goal of natural language processing. After many years of development, rule-based methods still cannot overcome their limits in accuracy and efficiency, while statistical methods have gradually shown greater advantages in the field. Natural language processing increasingly uses statistics-based automatic learning to acquire linguistic knowledge, and spelling checking is no exception. Statistics-based methods mainly involve two things: corpora and statistical language models.
Multiple organizations and research institutions provide their own corpora and associated statistics, such as the Chinese and English news corpora used in text classification research, the BNC, the LDC, the more than 4,200 free e-books provided by Project Gutenberg, a Chinese DBLP resource of some ten thousand randomly sampled papers, and the UCI classification evaluation data.
Brants and Franz of Google tokenized web page text in the Penn Treebank style, producing more than 1T of data in total; the details are shown in Table 1. The 5-gram corpus that Google published from this 1T of web text is currently one of the more comprehensive English corpora for statistical methods. It provides statistics for 1-grams through 5-grams and is a rich source of analysis data for statistical natural language processing.
On the corpus side, a dictionary gives word correction its most basic capability, the detection of non-word errors. A standard dictionary that is well designed, extensible, and equipped with a good management interface provides users with basic word detection and improves system performance. A corpus that supports statistical methods is the foundation of spelling checking: it supplies natural language processing models with usable data of considerable scale and rich, accurate information. A semantics-based corpus is a good model for dividing text into professional domains, but grammar rules are so inefficient that this approach is not practical. The industry-classified corpus therefore has to be realized indirectly by statistical methods.
Traditional spelling checking focuses on non-word errors, in which a correct word is typed as an invalid word. The usual approach combines a reliable dictionary with a fixed distance measure such as LD. Because building a reliable dictionary by hand is very costly, the dictionaries used by traditional spelling checkers are fairly small. As statistical models were introduced into spelling checking, the error model and the N-gram language model became the key components of a spelling checking system. Kukich proposed applying a transition matrix of error probabilities and feature vectors to spelling correction, which became the basis of later N-gram methods. Brill and Moore showed that a good statistical model is the key to improving spelling checking accuracy, but building such an error model requires a large amount of manual annotation of correction pairs, which is costly. Whitelaw et al. used Web text to improve this efficiency to some extent. With the development of Web technologies and applications, spelling checking has received more and more attention and more error types have been identified, such as omitted letters, wrongly inserted letters, transposed letters, wrongly merged or split words, and misused words. These methods mainly address three problems: detecting input errors, searching the space of candidate words, and building a scoring function for candidate words.
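The dictionary-plus-LD approach described above can be illustrated with a minimal sketch. The function names and the `max_distance` cutoff are illustrative assumptions, not part of the patent; the distance itself is the standard unit-cost Levenshtein distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def close_words(word, dictionary, max_distance=2):
    """Dictionary words within max_distance edits of word, nearest first.

    max_distance is a hypothetical cutoff chosen for this example.
    """
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]
```

For instance, `close_words("machne", ["machine", "marine", "engine"])` keeps the two words that are at most two edits away and drops the third.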
Most existing spelling checking models are offline models based on the N-gram model, which has become the mainstream of spelling checking research. The main idea is to compute statistical information about natural language with an extended Bayesian formula; the defining feature is the use of statistical methods, which keeps the model simple and efficient. The tools mainly used in current research are the N-gram model, the extended Bayesian formula, and the hidden Markov model. The work divides into computing word probabilities with the Bayesian formula, using the hidden Markov model to solve for the parameters of the N-gram model and the Bayesian formula, and solving the hidden Markov model quickly. The efficiency and practicality of the model are the problems this field most urgently needs to solve.
Summary of the invention
Technical problem: In a spelling checking system the corpus is the foundation of the whole model, and the computation and queries performed on it inevitably become the performance bottleneck of the whole system. If the corpus is based on grammar rules, queries tend to perform poorly because of rule explosion; if it only counts word frequencies, results tend to be inaccurate because the statistics are insufficient. As for the checking model, simply matching on a single measure gives results with large errors, while using only an N-gram computation model heavily affects system performance. The technical problem to be solved by the present invention is that existing systems lack the ability to adjust dynamically from user feedback and cannot make effective combined use of information from multiple corpora. To address the latter, the method combines a user dictionary, an industry corpus, and a core corpus through weighted calculation. The method offers fast queries, accurate results, and good adaptability to context; it automatically adjusts how much each part of the corpus is used for different users and text environments, which improves system efficiency and ensures accurate results. The invention uses the Viterbi algorithm to compute the Markov chain in the N-gram model and obtain the set of most probable correct words. For each possible word, the probability is computed in the corpus from the N-1 words preceding the erroneous word; weights are then computed from the LD measure and from the part of the corpus in which the word lies, giving a recommendation list sorted by the probability that each word is the correct one. Based on the correct word chosen by the user and its context, the statistics of the user's text are entered into the system's corpus. After obtaining the new statistics, the system revises the word frequencies and conditional probabilities of the relevant records in the corpus data tables according to the statistical algorithm of the N-gram model, so that the corpus stays synchronized with the user's actual usage and records the statistics of all historical texts, completing the overall update of the spelling checking system.
Technical solution:
To solve the above technical problem, the present invention uses N-gram corpus data and related statistical methods to provide an industry spelling error checking method based on user feedback. The method is as follows:
An industry spelling error checking method based on user feedback comprises the steps of:
1) Acquiring and building the corpus and user dictionary:
The corpus is divided into a core corpus and an industry corpus. As the core statistical data that stores language information, it holds the lexical, syntactic, and semantic information of the overall statistical language and of industry terminology. During spelling checking, the core corpus and the industry corpus provide the checking model with all word and sentence information, that is, the global data of the whole language. In addition, user-specific corpus information is obtained from the dictionary that each user builds.
Data tables are defined in a database to store the global corpus and the user corpus information.
2) Building the spelling checking model:
The spelling checking model applies the N-gram model to the corpus statistics to obtain the word chain combination with the largest conditional probability. The steps comprise:
21) Judging the correctness of a word: each word in the text is matched against the core corpus; if the word is not in the core corpus, the industry corpus and the user dictionary are checked in turn. If the word is in none of these three data tables, it is judged to be erroneous and the next step is carried out.
22) Recommending correct words: for the words in each corpus that are close to the erroneous word in edit distance, the word probability and the joint probability with the context are computed; the correct words most relevant to the erroneous word are then computed using the weight of each corpus, and the several words with the largest weighted probability across all corpora form the recommendation list.
3) Processing the text entered by the user with the error checking and word recommendation of the spelling checking model.
4) Updating the user-related text statistics, dictionary, and corpus: statistics are gathered on the text entered by the user and the correct words selected, and the correct word information and context information in the text are counted into the user dictionary, the core corpus, and the corresponding industry corpus.
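The lookup order of step 21) can be sketched as follows. This is only an illustration under assumptions: the three data tables are represented as in-memory Python sets, and the names `find_source` and `is_misspelled` are hypothetical.

```python
def find_source(word, sources):
    """Return the name of the first source that contains word, or None.

    sources is an ordered list of (name, set_of_words) pairs; following
    step 21) the order would be core corpus, industry corpus, user dictionary.
    """
    for name, words in sources:
        if word in words:
            return name
    return None


def is_misspelled(word, sources):
    """A word is judged erroneous only if no source contains it."""
    return find_source(word, sources) is None


# Toy data standing in for the three data tables.
sources = [
    ("core", {"check", "the", "pressure"}),
    ("industry", {"flange", "gasket"}),
    ("user", {"acme-valve"}),
]
```

Returning the name of the matching source, rather than only a yes/no answer, is a design choice of this sketch; it makes it easy to record which corpus was consulted when weights are later adjusted.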
In step 1), a valid corpus and user dictionary must satisfy the following conditions:
(1) The user dictionary contains no erroneous words: every entry is either a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry or special word defined by the user;
(2) The core corpus is large enough, has no bias toward a particular industry or time period, and includes N-gram information, so that it can provide basic statistics on word context;
(3) The industry corpus is initially divided according to demand and then grows naturally from users' selections; a single user may use several industry corpora.
In step 21), the Viterbi algorithm is used in the N-gram model to quickly compute the probability of the current word in the core corpus and the industry corpus, and to obtain the joint probability of the current word with the preceding N-1 words, which determines whether the current word is correct.
In step 22), the N-gram model is used to locate the position of the erroneous word in the industry corpus and the core corpus, and matching is performed in the user dictionary by edit distance and word probability, to obtain the list of most probable words. Using the probability of each word in the different corpora, the recommended word list is sorted by weighted probability, and the sorted recommendations are then presented to the user.
In step 4), after the system has checked the user's input text for errors, it computes the text statistics of the input and uses them to update the N-gram data in the user dictionary and the corpus. Once the corresponding data tables are updated, error checking is served from the new corpus data and user dictionary.
The statistical language model used in this method employs a hidden Markov model to find, in the corpus, the words that give the context-related word chain at the erroneous word's position the highest probability of occurrence; these words form the list of correct words. Each corpus has a different weight, and the sorted recommendation list is obtained by weighting each word's probability in a corpus by that corpus's weight. The spelling check is completed when the user selects a recommended word.
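As an illustration of how a Viterbi search finds the word chain with the largest product of conditional probabilities, here is a minimal sketch. It assumes a bigram model (N=2), a dictionary of conditional probabilities keyed by `(previous_word, word)`, the `#` sentence-start marker, and a small `floor` probability for unseen pairs; the patent does not specify these details.

```python
import math


def viterbi(candidates, bigram_prob, floor=1e-9):
    """Most probable word chain through a lattice of candidate words.

    candidates: list of lists, the candidate words for each text position.
    bigram_prob: dict mapping (prev_word, word) -> conditional probability.
    floor: assumed probability for pairs missing from bigram_prob.
    """
    # best[w] = (log-probability of the best chain ending in w, that chain)
    best = {
        w: (math.log(bigram_prob.get(("#", w), floor)), [w])
        for w in candidates[0]
    }
    for options in candidates[1:]:
        step = {}
        for w in options:
            # Extend every surviving chain by w and keep the best one.
            score, chain = max(
                (s + math.log(bigram_prob.get((p, w), floor)), c)
                for p, (s, c) in best.items()
            )
            step[w] = (score, chain + [w])
        best = step
    return max(best.values())[1]


probs = {
    ("#", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("the", "cart"): 0.3,
    ("cat", "sat"): 0.3,
    ("cart", "sat"): 0.01,
}
chain = viterbi([["the"], ["cat", "cart"], ["sat"]], probs)
```

In this toy lattice "cart" is locally more likely after "the", but the chain through "cat" has the larger overall product, which is the behaviour the global search is meant to capture. Working in log space avoids underflow when chains are long.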
The method is based on the N-gram natural language processing model and uses a core corpus, corpora classified by industry, a user dictionary, and a statistical language model to provide error checking and correct-word recommendation for text entered by the user. After the user enters a passage of text, the server tokenizes it and splits it into the set of word chains under the N-gram grammar, and then computes the conditional probability in the corpus of the last word in each word chain. The statistical language model takes the several most probable words as the candidate set of correct words. If the original word is in the candidate set, it is judged to be correct; otherwise the user selects a word from the candidate set as the correct word.
To address the efficiency and practicality of spelling checking systems, the present invention uses weighted calculation over classified corpora, combined with the LD measure and a search algorithm, and checks spelling by first detecting errors and then recommending words. This allows fast, efficient error checking and the recommendation of words strongly related to the context. A statistical language model using the Viterbi algorithm is proposed that quickly computes the list of words in the user's text with the largest weighted probability in the corpus. A user who receives the recommendation list selects the correct word according to the actual situation and feeds it back to the system, and the system adds the selected word and its context statistics to the corpus associated with that user: the statistical model computes the updated data for the word in the core corpus, the industry corpus, and the user dictionary and adds it to the data tables, and the new data is used to check the next user text that arrives. The system can thus provide spelling checking adapted to the actual usage environment and to different users.
Beneficial effects: the corpus is used efficiently and the data is adjusted from actual user feedback, which makes the system practical, fast at checking, and well synchronized (the corpus data is updated promptly according to usage). Because several different corpora are used in combination, efficient spelling checking can be achieved in a multi-user environment with highly concurrent requests.
Brief description of the drawings
Fig. 1 is a diagram of the N-gram statistical model of the present invention.
Fig. 2 is a structural diagram of the spelling checking system of the present invention.
Fig. 3 is a flow chart of a specific embodiment of the present invention.
Fig. 4 is a diagram of the functional modules of the spelling checker.
Fig. 5 is a table of information on the Google 1T N-gram data.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and a concrete example.
The industry spelling error checking method based on user feedback mainly addresses two shortcomings of current spelling checking, the lack of association with the user and the difficulty of searching a large corpus quickly, and involves natural language processing, user dictionary design, and database search. The method uses a user dictionary designed by category, checks English text for spelling errors with the N-gram method, and recommends correct words by searching a large corpus database, thereby achieving spelling checking associated with the user. The N-gram model (Fig. 1), a basic method of natural language processing, detects errors in the text from word or sentence features and the statistical information in the corpus. The categorized user dictionary draws on the current user's historical information and, combined with the corpus statistics, selects the recommended words most relevant to the erroneous words in the user's input text. The Viterbi algorithm is used to find the word chain with the largest product of conditional probabilities in the database, improving the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the database statistics are used. The structure of the whole system and the division into functional modules are shown in Fig. 2 and Fig. 4. The design principles and implementation details of each part are described below.
1. Acquiring and building the corpus and user dictionary:
The corpus is divided into a core corpus and an industry corpus. As the core statistical data that stores language information, it holds the lexical, syntactic, and semantic information of the overall statistical language and of industry terminology. During spelling checking, the corpus provides the checking model with all word and sentence information, that is, the global data of the statistical language. In addition, user-specific corpus information is obtained from the dictionary that each user builds, and the user's history is recorded by gathering statistics on the text the user enters. Data tables are defined in the database to store each corpus and the user's input information. The table structures are as follows:
(1) User dictionary table structure
[Table image BDA0000490734800000071 in the original publication; not reproduced here]
(2) Unigram data table structure
[Table image BDA0000490734800000072 in the original publication; not reproduced here]
(3) Bigram data table structure
[Table image BDA0000490734800000073 in the original publication; not reproduced here]
(4) Industry corpus table structure
[Table image BDA0000490734800000081 in the original publication; not reproduced here]
(5) Weight data table structure
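The five kinds of tables listed above could be laid out along the following lines. This is a hypothetical sketch only: the column definitions of the original tables are not available in this text, so every table name, column name, and key below is an assumption, chosen to match the tuples (word, COUNT) and the weights W_H, W_P, W_C mentioned elsewhere in the description. SQLite is used purely for convenience.

```python
import sqlite3

# Hypothetical schema; all names and columns are assumptions for illustration.
SCHEMA = """
CREATE TABLE user_dictionary (
    user_id TEXT, word TEXT, freq INTEGER,
    PRIMARY KEY (user_id, word));
CREATE TABLE unigram (
    word TEXT PRIMARY KEY, freq INTEGER);
CREATE TABLE bigram (
    word1 TEXT, word TEXT, freq INTEGER,
    PRIMARY KEY (word1, word));
CREATE TABLE industry_corpus (
    industry TEXT, word1 TEXT, word TEXT, freq INTEGER,
    PRIMARY KEY (industry, word1, word));
CREATE TABLE corpus_weight (
    user_id TEXT PRIMARY KEY, w_h REAL, w_p REAL, w_c REAL);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.executescript(SCHEMA)
conn.execute("INSERT INTO unigram VALUES (?, ?)", ("valve", 1200))
conn.execute(
    "INSERT INTO corpus_weight VALUES (?, ?, ?, ?)",
    ("user-1", 1 / 3, 1 / 3, 1 / 3),
)
```

Keeping one weight row per user reflects the statement that weights are recalculated per user; whether the original design stores them this way is not known from the text.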
2. Building the spelling checking model:
The spelling checking model applies the N-gram model to the corpus statistics to obtain the word chain combination with the largest weighted conditional probability across the corpora. In building the model, the goals are to judge the correctness of each word accurately and to make the recommendation list useful, without adding to the complexity of data matching and sorting during computation. Based on the overall statistical model, the user's needs, and the text information, all corpus data is used to find the weighted probability of every possible word. Taking into account each word's edit distance and its probability in each corpus, an optimal recommendation list for the current word is produced by weighted calculation. The spelling checking model performs error checking and word recommendation on the text entered by the user, as shown in Fig. 3. The model has two stages:
A) Generating the best candidate set for a word
Definition of the word occurrence probability: for N=3, let the two words preceding the checked word (denoted word) in the text be word1 and word2. Take the 4-tuple (word1, word2, word, COUNT) from the corpus and compute the ratio of COUNT to the sum of COUNT over the whole corpus; this is the probability of word. Here word1 and word2 are the two words preceding the current word in the text. If word is the second word in the sentence, word1='#'; if word is the first word in the sentence, word1=word2='#'. COUNT is the total number of times this word combination occurs in the corpus.
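The definition above, including the `#` padding at the start of a sentence, can be sketched as follows. The in-memory `counts` dictionary stands in for the corpus table of (word1, word2, word, COUNT) tuples, and the function names are illustrative assumptions.

```python
def context(words, i, n=3, pad="#"):
    """The n-1 words before position i, left-padded with the '#' marker."""
    before = list(words[max(0, i - (n - 1)):i])
    return tuple([pad] * (n - 1 - len(before)) + before)


def relative_frequency(counts, words, i):
    """COUNT of (word1, word2, word) divided by the sum of all COUNT values."""
    key = context(words, i) + (words[i],)
    total = sum(counts.values())
    return counts.get(key, 0) / total if total else 0.0


sentence = ["check", "the", "valve"]
counts = {
    ("check", "the", "valve"): 3,
    ("#", "#", "check"): 1,
}
```

With this toy table, the third word of the sentence has context `("check", "the")` and a relative frequency of 3 out of 4 total counts.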
A valid corpus and user dictionary must satisfy the following three conditions:
(1) The user dictionary contains no erroneous words: every entry is either a correct word taken from a recognized standard dictionary such as Oxford or Longman, or a special string entered by the user. Here N=1, and the word probability for this part is computed from the COUNT in the 2-tuple (word, COUNT) together with the edit distance of the word itself;
(2) The core corpus is large enough to ensure that the probability calculations in the model are statistically meaningful. Given the amount of data in the corpus, N-gram data with COUNT<=200 is excluded from the calculation. The corpus has no bias toward a particular industry or time period, and each time a user uses the system, the statistics of the user's text are added to the core corpus;
(3) The industry corpus is initially divided into several broad categories according to application needs, and new industries or combinations can be generated from users' usage, so the range of industries covered by the industry corpus keeps expanding with use.
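The cutoff in condition (2) amounts to a simple filter over the N-gram counts. A minimal sketch, assuming the counts are held in a dictionary keyed by word tuples (the name `prune` is hypothetical):

```python
def prune(counts, min_count=200):
    """Keep only N-grams whose COUNT is strictly greater than min_count.

    The default of 200 follows the rule that entries with COUNT<=200
    are excluded from the calculation.
    """
    return {gram: c for gram, c in counts.items() if c > min_count}


raw = {("check", "the"): 950, ("chek", "the"): 12, ("the", "valve"): 200}
kept = prune(raw)
```

Note that an entry with a count of exactly 200 is dropped, since the rule excludes COUNT<=200.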
First, the words in the text are matched against the dictionary to judge their correctness. If a word is not in the user dictionary, the words in the dictionary that are close to it in edit distance are computed, giving a candidate word set. Next, for each word in the candidate set, combined with the preceding N words in the text, the probability is computed from the industry corpus and the core corpus in turn using the formula
$$p(w_i \mid w_1 w_2 \cdots w_{i-1}) = \frac{C(w_1 w_2 \cdots w_{i-1} w_i)}{C(w_1 w_2 \cdots w_{i-1})},$$
where $C(\cdot)$ is the number of times the word sequence occurs in the corpus.
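The conditional probability formula is a ratio of two counts and can be sketched directly. The assumption here is that a single dictionary `counts` holds the occurrence counts of word sequences of every length needed; the function name is illustrative.

```python
def conditional_probability(counts, chain):
    """p(w_i | w_1 ... w_{i-1}) = C(w_1 ... w_i) / C(w_1 ... w_{i-1}).

    counts maps tuples of words to their occurrence counts; chain is the
    full sequence w_1 ... w_i. Returns 0.0 when the context was never seen.
    """
    chain = tuple(chain)
    numerator = counts.get(chain, 0)
    denominator = counts.get(chain[:-1], 0)
    return numerator / denominator if denominator else 0.0


counts = {
    ("check", "the"): 40,
    ("check", "the", "valve"): 10,
    ("check", "the", "value"): 2,
}
p_valve = conditional_probability(counts, ["check", "the", "valve"])
p_value = conditional_probability(counts, ["check", "the", "value"])
```

Returning 0.0 for an unseen context is a simplification of this sketch; a production model would normally apply some form of smoothing or back-off.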
The final word weight $W_W$ is computed from the weights $W_H$, $W_P$, and $W_C$ of the user dictionary, the industry corpus, and the core corpus:
$$W_W = W_H \cdot p_1 + W_P \cdot p_2 + W_C \cdot p_3$$
Here $p_1$, $p_2$, and $p_3$ are the probabilities of occurrence of the word in the user dictionary, the industry corpus, and the core corpus respectively, and $W_H + W_P + W_C = 1$. The weights are computed from how often each corpus is called during the user's use of the system, with $W_H = W_P = W_C$ initially. After words whose weight is below the threshold $W_T$ are rejected, the candidate word set is obtained.
Table 1 gives the pseudocode for the word weight calculation:
[Table 1 is image BDA0000490734800000101 in the original publication; not reproduced here]
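The weighted combination and the threshold filter can be sketched as follows. This is not the pseudocode of Table 1, which is unavailable in this text; it is a minimal illustration of the weight formula, with equal initial weights as stated and a threshold value of 0.01 that is purely an assumption.

```python
def word_weight(p1, p2, p3, w_h=1 / 3, w_p=1 / 3, w_c=1 / 3):
    """W_W = W_H*p1 + W_P*p2 + W_C*p3, where W_H + W_P + W_C = 1.

    p1, p2, p3 are the word's probabilities in the user dictionary,
    the industry corpus, and the core corpus.
    """
    return w_h * p1 + w_p * p2 + w_c * p3


def candidate_set(probabilities, weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.01):
    """Score every word and drop those whose weight is below the threshold.

    probabilities maps word -> (p1, p2, p3); threshold stands in for W_T
    and its default value is an assumption of this example.
    """
    scored = {w: word_weight(*p, *weights) for w, p in probabilities.items()}
    return {w: s for w, s in scored.items() if s >= threshold}


probs = {"valve": (0.3, 0.6, 0.0), "value": (0.0, 0.0, 0.003)}
kept = candidate_set(probs)
```

With equal weights, "valve" scores about 0.3 and stays in the set, while "value" scores about 0.001 and is rejected.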
B) Recommending correct words
If the original word is present in the candidate word set, it is confirmed to be a correct word. Otherwise, the weighted probability of each candidate word is computed from the corpus weights and the word chain joint probability, the candidate set is sorted by weighted probability, and the sorted words are sent to the user as the recommendation list.
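Stage B reduces to a membership test followed by a sort. A minimal sketch, assuming the weighted probabilities have already been computed and are passed in as a dictionary; the `top_k` limit and the alphabetical tie-break are assumptions of this example.

```python
def recommend(original, weighted, top_k=5):
    """Build the recommendation list for one checked word.

    weighted maps each candidate word to its weighted probability.
    Returns an empty list when the original word is itself a candidate
    (it is judged correct); otherwise returns up to top_k candidates,
    highest weighted probability first, ties broken alphabetically.
    """
    if original in weighted:
        return []
    ranked = sorted(weighted, key=lambda w: (-weighted[w], w))
    return ranked[:top_k]


scores = {"valve": 0.30, "value": 0.20, "vale": 0.25}
suggestions = recommend("valv", scores, top_k=2)
```

The explicit tie-break keeps the output deterministic when two candidates have the same weighted probability.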
3. Updating the user-related text statistics, dictionary, and corpus
After the user receives the recommendations and selects the correct word, the user's input text also becomes part of the corpus. The system gathers statistics on the text as amended by the user and counts the N-gram information of the corrected text into the user dictionary, the core corpus, and the corresponding industry corpus, adding the occurrence counts and context data to these data tables. According to which corpus contains the correct word chosen by the user, the system also recalculates this user's weights $W_H$, $W_P$, $W_C$ for calling each corpus.
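The feedback step can be sketched in two parts: adding the N-gram counts of the corrected text to each table, and recomputing the per-user weights. The text does not give the exact weight update rule, so the rule below (weights proportional to how often each source supplied the chosen word, with add-one smoothing so that the initial weights are equal) is an assumption, as are all function names.

```python
from collections import Counter


def ngrams(words, n, pad="#"):
    """All n-grams ending at each word, padded with '#' at the sentence start."""
    padded = [pad] * (n - 1) + list(words)
    return [tuple(padded[i:i + n]) for i in range(len(words))]


def record_feedback(words, tables):
    """Add the N-gram counts of a corrected text to every table.

    tables is a list of (Counter, n) pairs, e.g. the user dictionary with
    n=1 and the core and industry corpora with n=3.
    """
    for table, n in tables:
        table.update(ngrams(words, n))


def recompute_weights(hits):
    """Return (W_H, W_P, W_C) from per-source hit counts (assumed rule).

    hits counts how often the user's chosen word came from the 'user',
    'industry', or 'core' source; add-one smoothing keeps every weight
    positive and gives equal weights before any feedback.
    """
    keys = ("user", "industry", "core")
    smoothed = [hits.get(k, 0) + 1 for k in keys]
    total = sum(smoothed)
    return tuple(s / total for s in smoothed)


user, core = Counter(), Counter()
record_feedback(["check", "the", "valve"], [(user, 1), (core, 3)])
weights = recompute_weights({"user": 2})
```

After this call the user table holds three unigram counts and the core table holds three trigram counts, and the weights shift toward the user dictionary while still summing to 1.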
The present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make corresponding changes and variations according to the invention, and all such changes and variations shall fall within the scope of protection of the appended claims.

Claims (9)

1. An industry spelling error checking method based on user feedback, characterized in that it comprises the steps of:
Step 1, acquiring and building the corpus and user dictionary:
The corpus is divided into a user dictionary, a core corpus, and an industry corpus. As the core statistical data that stores language information, it holds the lexical, syntactic, and semantic information of the whole language. During spelling checking, the corpus provides the spelling checking model with all word and sentence information, that is, the global data of the whole language. In addition, new corpus information about the user is obtained from the text the user enters and the user's usage, and the corpus and user dictionary are updated;
Data tables are defined in a database to store the global corpus and the user's input information;
Step 2, building the spelling checking model:
The spelling checking model applies the N-gram model to the corpus statistics to obtain the word chain combination with the largest conditional probability;
Step 3, the system's interactive interface processes the text entered by the user using the error checking and word recommendation of the spelling checking model;
Step 4, updating the user-related text statistics, dictionary, and corpus: statistics are gathered on the user's input and the correct words selected, and the word information and context statistics of the corrected text are counted into the user dictionary, the core corpus, and the corresponding industry corpus.
2. The industry spelling error checking method based on user feedback according to claim 1, characterized in that, in step 1, a valid corpus and user dictionary must satisfy the following conditions:
(1) The dictionary contains no erroneous words: its entries are correct words taken from recognized standard dictionaries such as Oxford or Longman, and industry or special words defined by the user;
(2) The core corpus is large enough, has no industry or timeliness bias, and includes N-gram information, so that it can provide basic word chain statistics;
(3) The industry corpus is initially divided according to demand and then grows naturally from users' selections; a given user may use several industry corpora;
(4) The user dictionary is a dictionary built according to the user's input needs and can be managed by the user.
3. The industry spelling error checking method based on user feedback according to claim 1, characterized in that step 2 specifically comprises:
Step 2.1, judging the correctness of a word: each word in the text is matched against the standard dictionary; if the word is not in the standard dictionary, the industry corpus and the user dictionary are checked in turn; if the word is in none of these three data tables, it is judged to be erroneous and the next step is carried out;
Step 2.2, recommending correct words: according to edit distance and word chain joint probability, the correct words most relevant to the erroneous word are computed by weighting across the corpora, and the several words with the largest combined probability form the recommendation list for the erroneous word.
4. the industry misspelling inspection method based on user feedback according to claim 3, it is characterized in that, in described step 2.1, use the viterbi algorithm current word of Rapid matching probability of occurrence in each corpus in N-gram model, and obtain current word and front N-1 the joint probability that word occurs, realize the judgement to current word correctness.
5. the industry misspelling inspection method based on user feedback according to claim 3, is characterized in that, in described step 2.2, by editing distance and word probability of occurrence, recommendation word list is sorted, and so rear line provides recommendation results; The weights of word list of being used for sorting are that the probability in each corpus is weighted acquisition to word.
6. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, in described step 4, text to user's input carries out after bug check, calculate the text statistical information in user's input, for the N-gram data in user dictionary and corpus provide lastest imformation, after corresponding tables of data is upgraded, provide bug check service by new corpus data and dictionary.
7. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, use Hidden Markov Model (HMM) to check to make in corpus using the highest word of the context dependent word chain probability of occurrence of wrong word position as correct word list, each corpus has different weights, the weighted calculation of probability and corpus by word in corpus, the recommendation word list after being sorted; Misspelling inspection completes the selection of recommending word by user.
8. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, adopt user dictionary, core corpus, industry corpus and statistical language model, input after one section of text user, server carries out element to text, be the word chain set under the N unit syntax by text dividing, thereby calculate the conditional probability of last word in corpus in each word chain; Statistical language model calculates the word of several maximum probabilities as the alternative set of correct word, if former word in alternative set, judges that former word is correct, otherwise user selects a word as correct word from alternative set.
9. The industry misspelling inspection method based on user feedback according to claim 8, characterized in that the statistical language model adopts the Viterbi algorithm to calculate, for the user's text, the list of words with the highest weighted probability in the corpora; the user who receives the recommended word list selects the correct word according to the actual situation, and that word and its context statistics are added to the corpus associated with the user; the statistical language model calculates the update data for this word in the user dictionary, the core corpus and the industry corpus and adds it to the data tables, and the new data is used to check the next incoming user text for spelling errors.
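
The ranking and feedback steps recited in claims 5 to 9 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes bigram (N = 2) word chains, simple count-ratio probability estimates, and a plain best-candidate search in place of the Viterbi algorithm. The corpus weights, the `Corpus` class, the `recommend` function, the `max_dist` cut-off and the sample sentences are all hypothetical names and values chosen for illustration; the claims do not specify them.

```python
from collections import Counter

# Hypothetical corpus weights: the claims state that each corpus has its
# own weight but do not give values.
CORPUS_WEIGHTS = {"user": 0.5, "industry": 0.3, "core": 0.2}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


class Corpus:
    """Unigram and bigram counts for one corpus (user, industry or core)."""

    def __init__(self, sentences=()):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            self.update(sentence)

    def update(self, sentence: str) -> None:
        """Add the statistics of one sentence (the feedback step)."""
        words = sentence.lower().split()
        self.unigrams.update(words)
        self.bigrams.update(zip(words, words[1:]))

    def cond_prob(self, prev: str, word: str) -> float:
        """Count-ratio estimate of P(word | prev); 0.0 if prev is unseen."""
        total = self.unigrams[prev]
        return self.bigrams[(prev, word)] / total if total else 0.0


def weighted_prob(corpora, prev, word):
    """Combine the per-corpus conditional probabilities by corpus weight."""
    return sum(CORPUS_WEIGHTS[name] * corpus.cond_prob(prev, word)
               for name, corpus in corpora.items())


def recommend(corpora, prev, word, top_k=3, max_dist=2):
    """Return (is_correct, candidates) for `word` following `prev`.

    Candidates are sorted by edit distance first, then by weighted
    probability (highest first). The word is judged correct if it is
    itself among the candidates.
    """
    vocab = set()
    for corpus in corpora.values():
        vocab.update(corpus.unigrams)
    scored = []
    for cand in vocab:
        p = weighted_prob(corpora, prev, cand)
        d = edit_distance(word, cand)
        if p > 0 and d <= max_dist:
            scored.append((d, -p, cand))
    ranked = [cand for _, _, cand in sorted(scored)[:top_k]]
    return word in ranked, ranked


# Illustrative data only.
corpora = {
    "user": Corpus(["stainless steel pipe"]),
    "industry": Corpus(["stainless steel pipe", "stainless steel plate",
                        "carbon steel pipe"]),
    "core": Corpus(["steel pipe", "steal the show"]),
}

before = recommend(corpora, "steel", "flagne")      # no candidate known yet
corpora["user"].update("stainless steel flange")    # user feedback
after = recommend(corpora, "steel", "flagne")       # "flange" is now offered
```

In this sketch, a word that never follows the preceding word in any corpus gets a weighted probability of zero and is therefore flagged, while feedback from the user changes later recommendations simply by updating the counts of the user corpus. A fuller version would score whole word chains, for example with the Viterbi algorithm named in claim 9, rather than one position at a time.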
CN201410149427.8A 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback Expired - Fee Related CN103885938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410149427.8A CN103885938B (en) 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410149427.8A CN103885938B (en) 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback

Publications (2)

Publication Number Publication Date
CN103885938A true CN103885938A (en) 2014-06-25
CN103885938B CN103885938B (en) 2015-04-22

Family

ID=50954833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410149427.8A Expired - Fee Related CN103885938B (en) 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback

Country Status (1)

Country Link
CN (1) CN103885938B (en)

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112447A (en) * 2014-07-28 2014-10-22 科大讯飞股份有限公司 Method and system for improving statistical language model accuracy
CN105206267A (en) * 2015-09-09 2015-12-30 中国科学院计算技术研究所 Voice recognition error correction method with integration of uncertain feedback and system thereof
CN105654955A (en) * 2016-03-18 2016-06-08 华为技术有限公司 Voice recognition method and device
CN106294325A (en) * 2016-08-11 2017-01-04 海信集团有限公司 The optimization method and device of spatial term statement
CN106528616A (en) * 2016-09-30 2017-03-22 厦门快商通科技股份有限公司 Language error correcting method and system for use in human-computer interaction process
CN106708893A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Error correction method and device for search query term
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN107291730A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word
CN107305542A (en) * 2016-04-21 2017-10-31 珠海金山办公软件有限公司 A kind of spell checking methods and device
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
WO2018103128A1 (en) * 2016-12-09 2018-06-14 Hong Kong Applied Science and Technology Research Institute Company Limited System and method for organizing and processing feature based data structures
CN108628827A (en) * 2018-04-11 2018-10-09 广州视源电子科技股份有限公司 Candidate word appraisal procedure, device, computer equipment and storage medium
CN109033065A (en) * 2018-06-01 2018-12-18 昆明理工大学 A kind of English word spelling inspection method
CN109145287A (en) * 2018-07-05 2019-01-04 广东外语外贸大学 Indonesian word error-detection error-correction method and system
CN109542247A (en) * 2018-11-14 2019-03-29 腾讯科技(深圳)有限公司 Clause recommended method and device, electronic equipment, storage medium
CN110020432A (en) * 2019-03-29 2019-07-16 联想(北京)有限公司 A kind of information processing method and information processing equipment
CN110073349A (en) * 2016-12-15 2019-07-30 微软技术许可有限责任公司 Consider the word order suggestion of frequency and formatted message
CN110147546A (en) * 2019-04-03 2019-08-20 苏州驰声信息科技有限公司 A kind of syntactic correction method and device of Oral English Practice
US10402435B2 (en) 2015-06-30 2019-09-03 Microsoft Technology Licensing, Llc Utilizing semantic hierarchies to process free-form text
CN110489723A (en) * 2019-08-19 2019-11-22 绍兴数纺科技有限公司 A kind of data error detection and error correction system of dyeing information system
CN110532572A (en) * 2019-09-12 2019-12-03 四川长虹电器股份有限公司 Spell checking methods based on the tree-like naive Bayesian of TAN
CN110600011A (en) * 2018-06-12 2019-12-20 中国移动通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN111259654A (en) * 2018-11-30 2020-06-09 北京嘀嘀无限科技发展有限公司 Text error detection method and device
US10679008B2 (en) 2016-12-16 2020-06-09 Microsoft Technology Licensing, Llc Knowledge base for analysis of text
CN111523532A (en) * 2020-04-14 2020-08-11 广东小天才科技有限公司 Method for correcting OCR character recognition error and terminal equipment
CN111737980A (en) * 2020-06-22 2020-10-02 桂林电子科技大学 Method for correcting English text word use errors
CN111859920A (en) * 2020-06-19 2020-10-30 北京国音红杉树教育科技有限公司 Method and system for identifying word spelling errors and electronic equipment
CN112328737A (en) * 2019-07-17 2021-02-05 北方工业大学 Spelling data generation method
CN113095072A (en) * 2019-12-23 2021-07-09 华为技术有限公司 Text processing method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060015326A1 (en) * 2004-07-14 2006-01-19 International Business Machines Corporation Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
CN102298577A (en) * 2011-09-21 2011-12-28 深圳市万兴软件有限公司 Method and device for detecting spelling of document edition
CN102937949A (en) * 2012-10-15 2013-02-20 福建榕基软件股份有限公司 Method and system for checking English spelling in rich text editor

Cited By (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112447B (en) * 2014-07-28 2017-08-25 安徽普济信息科技有限公司 Method and system for improving accuracy of statistical language model
CN104112447A (en) * 2014-07-28 2014-10-22 科大讯飞股份有限公司 Method and system for improving statistical language model accuracy
US10402435B2 (en) 2015-06-30 2019-09-03 Microsoft Technology Licensing, Llc Utilizing semantic hierarchies to process free-form text
CN105206267A (en) * 2015-09-09 2015-12-30 中国科学院计算技术研究所 Voice recognition error correction method with integration of uncertain feedback and system thereof
CN105206267B (en) * 2015-09-09 2019-04-02 中国科学院计算技术研究所 A kind of the speech recognition errors modification method and system of fusion uncertainty feedback
CN106708893B (en) * 2015-11-17 2018-09-28 华为技术有限公司 Search query word error correction method and device
CN106708893A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Error correction method and device for search query term
WO2017084506A1 (en) * 2015-11-17 2017-05-26 华为技术有限公司 Method and device for correcting search query term
CN105654955B (en) * 2016-03-18 2019-11-12 华为技术有限公司 Audio recognition method and device
CN105654955A (en) * 2016-03-18 2016-06-08 华为技术有限公司 Voice recognition method and device
CN107291730A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word
CN107291730B (en) * 2016-03-31 2020-07-31 阿里巴巴集团控股有限公司 Method and device for providing correction suggestion for query word and probability dictionary construction method
CN107291775B (en) * 2016-04-11 2020-07-31 北京京东尚科信息技术有限公司 Method and device for generating repairing linguistic data of error sample
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN107305542B (en) * 2016-04-21 2018-11-16 珠海金山办公软件有限公司 A kind of spell checking methods and device
CN107305542A (en) * 2016-04-21 2017-10-31 珠海金山办公软件有限公司 A kind of spell checking methods and device
CN106294325A (en) * 2016-08-11 2017-01-04 海信集团有限公司 The optimization method and device of spatial term statement
CN106294325B (en) * 2016-08-11 2019-01-04 海信集团有限公司 The optimization method and device of spatial term sentence
CN106528616A (en) * 2016-09-30 2017-03-22 厦门快商通科技股份有限公司 Language error correcting method and system for use in human-computer interaction process
CN106528616B (en) * 2016-09-30 2019-12-17 厦门快商通科技股份有限公司 Language error correction method and system in human-computer interaction process
WO2018103128A1 (en) * 2016-12-09 2018-06-14 Hong Kong Applied Science and Technology Research Institute Company Limited System and method for organizing and processing feature based data structures
US10127219B2 (en) 2016-12-09 2018-11-13 Hong Kong Applied Science and Technology Research Institute Company Limited System and method for organizing and processing feature based data structures
CN110073349A (en) * 2016-12-15 2019-07-30 微软技术许可有限责任公司 Consider the word order suggestion of frequency and formatted message
CN110073349B (en) * 2016-12-15 2023-10-10 微软技术许可有限责任公司 Word order suggestion considering frequency and formatting information
US10679008B2 (en) 2016-12-16 2020-06-09 Microsoft Technology Licensing, Llc Knowledge base for analysis of text
CN107122346B (en) * 2016-12-28 2018-02-27 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
US11314921B2 (en) 2017-06-05 2022-04-26 Baidu Online Network Technology (Beijing) Co., Ltd. Text error correction method and apparatus based on recurrent neural network of artificial intelligence
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN108628827A (en) * 2018-04-11 2018-10-09 广州视源电子科技股份有限公司 Candidate word appraisal procedure, device, computer equipment and storage medium
CN109033065A (en) * 2018-06-01 2018-12-18 昆明理工大学 A kind of English word spelling inspection method
CN110600011B (en) * 2018-06-12 2022-04-01 中国移动通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN110600011A (en) * 2018-06-12 2019-12-20 中国移动通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN109145287A (en) * 2018-07-05 2019-01-04 广东外语外贸大学 Indonesian word error-detection error-correction method and system
CN109542247B (en) * 2018-11-14 2023-03-24 腾讯科技(深圳)有限公司 Sentence recommendation method and device, electronic equipment and storage medium
CN109542247A (en) * 2018-11-14 2019-03-29 腾讯科技(深圳)有限公司 Clause recommended method and device, electronic equipment, storage medium
CN111259654A (en) * 2018-11-30 2020-06-09 北京嘀嘀无限科技发展有限公司 Text error detection method and device
CN111259654B (en) * 2018-11-30 2023-09-15 北京嘀嘀无限科技发展有限公司 Text error detection method and device
CN110020432A (en) * 2019-03-29 2019-07-16 联想(北京)有限公司 A kind of information processing method and information processing equipment
CN110147546B (en) * 2019-04-03 2023-05-26 苏州驰声信息科技有限公司 Grammar correction method and device for spoken English
CN110147546A (en) * 2019-04-03 2019-08-20 苏州驰声信息科技有限公司 A kind of syntactic correction method and device of Oral English Practice
CN112328737A (en) * 2019-07-17 2021-02-05 北方工业大学 Spelling data generation method
CN112328737B (en) * 2019-07-17 2023-05-05 北方工业大学 Spelling data generation method
CN110489723A (en) * 2019-08-19 2019-11-22 绍兴数纺科技有限公司 A kind of data error detection and error correction system of dyeing information system
CN110532572A (en) * 2019-09-12 2019-12-03 四川长虹电器股份有限公司 Spell checking methods based on the tree-like naive Bayesian of TAN
CN113095072A (en) * 2019-12-23 2021-07-09 华为技术有限公司 Text processing method and device
CN111523532A (en) * 2020-04-14 2020-08-11 广东小天才科技有限公司 Method for correcting OCR character recognition error and terminal equipment
CN111859920A (en) * 2020-06-19 2020-10-30 北京国音红杉树教育科技有限公司 Method and system for identifying word spelling errors and electronic equipment
CN111859920B (en) * 2020-06-19 2024-06-04 北京国音红杉树教育科技有限公司 Word misspelling recognition method, system and electronic equipment
CN111737980B (en) * 2020-06-22 2023-05-16 桂林电子科技大学 Correction method for use errors of English text words
CN111737980A (en) * 2020-06-22 2020-10-02 桂林电子科技大学 Method for correcting English text word use errors

Also Published As

Publication number Publication date
CN103885938B (en) 2015-04-22

Similar Documents

Publication Publication Date Title
CN103885938B (en) Industry spelling mistake checking method based on user feedback
US10997370B2 (en) Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
US9218390B2 (en) Query parser derivation computing device and method for making a query parser for parsing unstructured search queries
CN110489760A (en) Based on deep neural network text auto-collation and device
CN106844331A (en) Sentence similarity calculation method and system
CN103314369B (en) Machine translation apparatus and method
CN110348003A (en) Method and device for extracting effective text information
CN117251455A (en) Intelligent report generation method and system based on large model
Ma et al. Improving Chinese spell checking with bidirectional LSTMs and confusionset-based decision network
CN104750676A (en) Machine translation processing method and device
Ganji et al. Novel textual features for language modeling of intra-sentential code-switching data
CN110750967B (en) Pronunciation labeling method and device, computer equipment and storage medium
Rosner et al. A tagging algorithm for mixed language identification in a noisy domain.
Melero et al. Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Lee Natural Language Processing: A Textbook with Python Implementation
Sharma et al. Contextual multilingual spellchecker for user queries
CN110807096A (en) Information pair matching method and system on small sample set
Wu A computational neural network model for college English grammar correction
CN114970541A (en) Text semantic understanding method, device, equipment and storage medium
Sreeram et al. A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model.
Wang Research on cultural translation based on neural network
Sreeram et al. Language modeling for code-switched data: Challenges and approaches
Sreeram et al. Exploiting Parts-of-Speech for improved textual modeling of code-switching data
Byambadorj et al. Normalization of transliterated mongolian words using Seq2Seq model with limited data
Li Construction of English Translation Model Based on Improved Fuzzy Semantic Optimal Control of GLR Algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150422
