CN103885938B - Industry spelling mistake checking method based on user feedback - Google Patents

Industry spelling mistake checking method based on user feedback

Info

Publication number
CN103885938B
Authority
CN
China
Prior art keywords
word
corpus
user
industry
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410149427.8A
Other languages
Chinese (zh)
Other versions
CN103885938A (en
Inventor
杨明
罗军舟
倪俊辉
马成平
任新才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Focus Technology Co Ltd
Original Assignee
Southeast University
Focus Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University, Focus Technology Co Ltd
Priority to CN201410149427.8A
Publication of CN103885938A
Application granted
Publication of CN103885938B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an industry spelling error checking method based on user feedback. The method checks English text for spelling errors using an N-gram method together with a user dictionary organized by category, and recommends correct words by searching a large corpus database, so that spelling checking is tailored to the individual user. The N-gram method, a basic technique of natural language processing, detects errors in the text from the characteristics of words or sentences and from statistical information in the corpus. The categorized user dictionary, drawing on the current user's historical information and the corpus statistics, selects the recommended words most relevant to each incorrect word in the text the user entered. The Viterbi algorithm is used to search the database for the word chain with the largest product of conditional probabilities, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used.

Description

Industry spelling error checking method based on user feedback
Technical field
The present invention is an English spelling error checking method. It uses a corpus containing a large amount of language information, statistical natural language models, the hidden Markov model (HMM) and related techniques, and belongs to the field of natural language processing, in particular English spell checking.
Background art
The abbreviations used in the present invention are defined first:
NLP (Natural Language Processing): natural language processing;
BNC (British National Corpus): British National Corpus;
LDC (Linguistic Data Consortium): Linguistic Data Consortium;
LD (Levenshtein Distance): edit distance;
N-gram: N-gram grammar, a language model over sequences of N consecutive words.
Spelling error checking (the spelling checker) is an important branch and a basic building block of NLP. It turns natural language into error-free, intelligible text and so naturally supports higher-level NLP technologies such as machine translation, speech synthesis and speech recognition. It can also make user interfaces friendlier and more intelligent, which gives it significant practical value.
Early NLP mainly used methods based on syntactic and semantic rules. With the rise of corpus construction and corpus linguistics, processing large-scale real text became the main goal of natural language processing. After many years of development, rule-based methods still could not overcome their limits in accuracy and efficiency, while statistical methods gradually showed greater advantages in the field. Statistics-based automatic learning methods are increasingly used to acquire linguistic knowledge in natural language processing, and spelling error checking is no exception. Statistics-based methods mainly involve two things: the corpus and the statistical language model.
Many organizations and research institutions provide their own corpora and related statistics, for example the Chinese and English news corpora used in text classification research, the BNC, the LDC, the more than 4,200 free e-books provided by Project Gutenberg, a Chinese DBLP resource of ten thousand randomly sampled papers, and the UCI classification evaluation data sets.
Brants and Franz of Google tokenized web page text in the Penn Treebank style and produced more than 1T of data in total; the details are shown in Table 1. The 5-gram corpus that Google released, built from this 1T of web text, is currently one of the more comprehensive English corpora for statistics-based methods. It provides statistics for 1-grams through 5-grams and is a rich data source for statistics-based natural language processing.
On the corpus side, a dictionary provides the most basic capability for word correction, namely non-word error checking. A standard dictionary that is extensible and has a good management interface gives users basic word detection and improves system performance. A corpus that supports statistical methods is the foundation of spelling error checking; it supplies natural language processing models with data that is large in scale and rich and accurate in information. A semantics-based corpus is an excellent model for dividing text by professional domain, but because syntax rules are inefficient this approach has not become practical. A statistical method is therefore needed to build an industry-classified corpus indirectly.
Traditional spelling error checking focuses on non-word errors, in which a correct word is mistyped as an invalid word. The usual approach relies on a reliable dictionary and a fixed distance measure such as LD. Because building a reliable dictionary by hand is very costly, the dictionaries used by traditional spell checkers are fairly small. As statistical models were introduced into spelling error checking, the error model and the N-gram language model became key components of spell checking systems. Kukich proposed applying a transition matrix of error probabilities and feature vectors to spelling correction, which became the basis of later N-gram implementations. Brill and Moore showed that a good statistical model is the key to improving spell checking accuracy, but building such an error model requires a large amount of manual labeling of correction pairs, which is expensive. Whitelaw et al. used Web text to reduce this cost to some extent. With the development of Web technologies and applications, spelling error checking has received more and more attention and more error types have been described, such as omitted letters, wrongly inserted letters, transposed letters, wrongly merged words, wrongly split words and misused words. The problems these methods mainly address are detecting input errors, searching the space of candidate words and building a scoring function for candidates.
Most existing spelling error checking models are offline models based on the N-gram model, and this approach is now the mainstream of spell checking research. The main idea is to compute statistical information about natural language with an extended Bayesian formula; its chief characteristic is that the statistical approach keeps the model simple and efficient. The tools mainly used in current research are the N-gram model, the extended Bayesian formula and the hidden Markov model. The work divides into computing word probabilities with the Bayesian formula, using the hidden Markov model to estimate the N-gram model parameters, and solving the hidden Markov model within the Bayesian formula quickly. The efficiency and practicality of the model are problems this field urgently needs to solve.
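For reference, the N-gram approximation that this family of models relies on can be written as below. This is the standard textbook form, not a formula quoted from the patent; $w_i$ denotes the $i$-th word of a sentence of $m$ words, and the second step assumes that each word depends only on the $N-1$ words before it.

```latex
P(w_1 w_2 \cdots w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1 \cdots w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N+1} \cdots w_{i-1})
```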
Summary of the invention
Technical problem: In a spelling error checking system the corpus is the foundation of the whole model, and the computation and queries against it inevitably become the performance bottleneck of the system. If the corpus is based on syntax rules, queries tend to perform poorly because of rule explosion; if it only counts word frequencies, results tend to be inaccurate because the statistics are insufficient. On the model side, a checker either simply matches against a single measure, which gives results with large errors, or uses only an N-gram computation model, which weighs heavily on system performance. The technical problem addressed by the present invention is that such systems lack the ability to adjust dynamically based on user feedback and cannot effectively use information from several corpora together. To make effective use of multiple corpora, the invention combines a user dictionary, an industry corpus and a core corpus and computes a weighted result. This method offers fast queries, accurate results and high adaptability to context; it automatically adjusts how much each part of the corpus is used for different users and text environments, which improves system efficiency while preserving accuracy.
The invention uses the Viterbi algorithm to compute the Markov chain in the N-gram model and obtain the set of most probable correct words. For each possible word, a probability is computed in the corpus from the N-1 words preceding the incorrect word; weights are then computed from the LD measure and from the part of the corpus in which the word is found, yielding a recommendation list sorted by the probability that each word is the correct one. Based on the correct word chosen by the user and its context, the statistics of the user's text are added to the system's corpus. Once the system has the new statistics, it revises the word frequencies and conditional probabilities of the related records in the corpus data tables according to the statistical algorithm of the N-gram model. This keeps the corpus synchronized with the user's actual usage, records the statistics of all historical texts, and completes the overall update of the spelling error checking system.
Technical solution:
To solve the above technical problem, the present invention uses N-gram corpus data and the related statistical methods and proposes an industry spelling error checking method based on user feedback. The method is as follows:
An industry spelling error checking method based on user feedback comprises the steps of:
1) Acquiring and building the corpus and the user dictionary:
The corpus is divided into a core corpus and industry corpora. As the core statistical data that stores language information, it holds the lexical, syntactic and semantic information of the language as a whole and of professional terminology. During spelling error checking, the core corpus and the industry corpora supply the spell checking model with all word and sentence information, that is, the global data of the whole language. In addition, corpus information specific to the user is obtained from the dictionary the user builds;
Data tables are defined in a database to store the overall corpus and the user's corpus information;
2) Building the spell checking model:
The spell checking model is built by computing over the corpus statistics with the N-gram model to obtain the word chain combination with the largest conditional probability. The steps comprise:
21) Judging the correctness of a word: each word in the text is matched against the core corpus; if the word is not in the core corpus, the industry corpus and the user dictionary are checked in turn; if the word is found in none of these three kinds of data tables, it is judged to be an incorrect word and the next step is carried out;
22) Recommending correct words: words within a close edit distance of the incorrect word are taken from each corpus, and the probability of each of these words and its joint probability with the context are calculated; the correct words most relevant to the incorrect word are then computed using the weight of each corpus, and the several words with the largest weighted posterior probability across all corpora are selected to form the recommendation list of correct words;
3) Processing the text entered by the user with the error checking and word recommendation of the spell checking model;
4) Updating the user-related text statistics, dictionary and corpora: statistics are gathered over the text entered by the user and the correct words the user selected, and the correct word information and context information in the text are added to the user dictionary, the core corpus and the corresponding industry corpus.
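The lookup order in step 21) can be sketched as follows. This is a minimal illustration, not the patented implementation: the three sources are modeled as plain Python sets instead of database tables, matching is assumed to be case-insensitive, and the function names `locate_word` and `find_incorrect_words` are hypothetical.

```python
def locate_word(word, core_corpus, industry_corpus, user_dictionary):
    """Return the name of the first source that contains the word, or None.

    The sources are consulted in the order given in step 21):
    core corpus, then industry corpus, then user dictionary.
    """
    token = word.lower()  # assumption: matching ignores letter case
    sources = (
        ("core", core_corpus),
        ("industry", industry_corpus),
        ("user", user_dictionary),
    )
    for name, vocabulary in sources:
        if token in vocabulary:
            return name
    return None


def find_incorrect_words(words, core_corpus, industry_corpus, user_dictionary):
    """Collect the words that appear in none of the three sources."""
    return [
        w for w in words
        if locate_word(w, core_corpus, industry_corpus, user_dictionary) is None
    ]
```

Only the words returned by `find_incorrect_words` would go on to the recommendation step; words found in any source are treated as correct at this stage.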
In step 1), the necessary conditions for a valid corpus and user dictionary include:
(1) the user dictionary contains no incorrect words; every entry must be either a correct word taken from a recognized standard dictionary such as Oxford or Longman, or an industry-specific or special word defined by the user;
(2) the core corpus is large enough, is free of bias toward any industry or time period, and must contain N-gram information so as to provide basic word context statistics;
(3) the industry corpora are initially divided according to requirements and are then generated naturally from users' selections; a single user may be a user of several industry corpora.
In step 21), the Viterbi algorithm is used within the N-gram model to quickly calculate the probability of the current word in the core corpus and the industry corpus, and to obtain the joint probability of the current word occurring with the preceding N-1 words, thereby judging whether the current word is correct.
In step 22), the N-gram model is used to search the industry corpus and the core corpus for the position of the incorrect word, and the user dictionary is matched by edit distance and word occurrence probability, to obtain the list of most probable words. The recommended word list is sorted by a weighted combination of each word's probability in the different corpora, and the sorted recommendations are then presented to the user.
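The edit-distance matching mentioned in step 22) can be illustrated with the classic dynamic-programming form of the Levenshtein distance. The sketch below makes several assumptions: the vocabulary is an in-memory list, every candidate is compared exhaustively, and the cutoff `max_distance=2` is an illustrative value that does not come from the patent.

```python
def levenshtein(a, b):
    """Levenshtein distance between strings a and b (unit-cost edits)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def close_words(word, vocabulary, max_distance=2):
    """Return vocabulary words within max_distance edits, nearest first."""
    scored = sorted((levenshtein(word, v), v) for v in vocabulary)
    return [v for distance, v in scored if distance <= max_distance]
```

A production system over a large corpus would more likely index the vocabulary (for example by length or by character n-grams) than scan it linearly; the linear scan is used here only to keep the example short.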
In step 4), after the system has checked the text entered by the user for errors, it computes the text statistics of the user's input to provide update information for the N-gram data in the user dictionary and the corpora. Once the corresponding data tables have been updated, the error checking service is provided from the new corpus data and user dictionary.
The statistical language model used in this method employs the hidden Markov model to look up, in the corpora, the words that give the context word chain at the position of the incorrect word the highest probability of occurrence, and takes them as the correct word list. Each corpus has a different weight; the sorted list of recommended words is obtained by a weighted calculation over the probability of each word in each corpus and the weight of that corpus. Spelling error checking is completed when the user selects one of the recommended words.
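The patent text does not spell out its Viterbi formulation, so the following is only a sketch of how a word chain with the largest product of conditional probabilities could be found. It assumes a bigram model (N=2), a dictionary `bigram_prob` mapping (previous word, word) pairs to conditional probabilities, a small `floor` probability for unseen pairs, and at least one candidate per position; all of these names and choices are illustrative.

```python
import math


def best_word_chain(candidates, bigram_prob, start="#", floor=1e-9):
    """Viterbi search for the most probable chain, one word per position.

    candidates  -- list of non-empty lists of candidate words, one per position
    bigram_prob -- dict mapping (previous_word, word) to P(word | previous_word)
    """
    # best[w] holds (log probability, chain) of the best chain ending in w.
    best = {start: (0.0, [])}
    for options in candidates:
        new_best = {}
        for word in options:
            best_score, best_chain = -math.inf, []
            for prev, (prev_score, prev_chain) in best.items():
                p = max(bigram_prob.get((prev, word), 0.0), floor)
                score = prev_score + math.log(p)  # logs avoid underflow
                if score > best_score:
                    best_score, best_chain = score, prev_chain
            new_best[word] = (best_score, best_chain + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]
```

Because only the best chain ending in each candidate is kept at every position, the cost grows with the number of positions times the square of the candidates per position, rather than with the number of all possible chains.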
The method is based on the N-gram natural language processing model and uses a core corpus, corpora classified by industry, a user dictionary and a statistical language model to provide error checking and correct word recommendation for text entered by the user. After the user enters a piece of text, the server tokenizes it and splits it into the set of word chains under the N-gram grammar, and then calculates the conditional probability in the corpus of the last word of each word chain. The statistical language model takes the several words with the largest probability as the candidate set of correct words. If the original word is in the candidate set it is judged correct; otherwise the user selects a word from the candidate set as the correct word.
To address the efficiency and practicality of spelling error checking systems, the present invention uses a weighted calculation over classified corpora, combined with the LD measure and a search algorithm, and checks spelling by first detecting errors and then making recommendations. It achieves fast error checking and recommends words with strong contextual relevance. Using the Viterbi algorithm, it proposes a statistical language model that quickly computes the list of words with the largest weighted probability in the corpora for each word in the user's text. The user, having received the recommended word list, selects the correct word according to the actual situation and feeds it back to the system. The system adds the selected word and its context statistics to the corpora related to that user: the statistical model calculates the updated data for the word in the core corpus, the industry corpus and the user dictionary and adds it to the data tables, and the new data is used to check the next user text that arrives. In this way the system can provide spelling error checking adapted to the actual usage environment and to different users.
Beneficial effects: the present invention uses the corpora efficiently and adjusts its data based on actual user feedback, which makes the system practical, fast at checking and well synchronized (corpus data is updated promptly according to usage). By combining several different corpora, it can perform efficient spelling error checking in a multi-user environment with highly concurrent requests.
Brief description of the drawings
Fig. 1 is a diagram of the N-gram statistical model of the present invention.
Fig. 2 is a structural diagram of the spelling error checking system of the present invention.
Fig. 3 is a flow chart of a specific embodiment of the present invention.
Fig. 4 is a diagram of the functional modules of spelling error checking.
Fig. 5 is a table of Google 1T N-gram data information.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and a concrete example.
The industry spelling error checking method based on user feedback of the present invention mainly addresses two shortcomings of current spelling error checking: the lack of association with the user and the lack of fast search over a large corpus. It involves natural language processing, user dictionary design, database search and related techniques. The method uses a user dictionary organized by category, applies the N-gram method to check English text for spelling errors, and recommends correct words by searching a large corpus database, thereby achieving spelling error checking that is associated with the user. The N-gram model (Fig. 1), a basic technique of natural language processing, detects errors in the text from the characteristics of words or sentences and the statistical information in the corpus. The categorized user dictionary, based on the current user's historical information and combined with the corpus statistics, selects the recommended words most relevant to each incorrect word in the user's input text. The Viterbi algorithm is used to find the word chain with the largest product of conditional probabilities in the database, which improves the computational efficiency of the hidden Markov model on a large corpus and the efficiency with which the statistical information in the database is used. The structure of the whole system and the division of its functional modules are shown in Fig. 2 and Fig. 4. The design ideas and implementation details of each part are described below.
1. Acquiring and building the corpus and the user dictionary:
The corpus is divided into a core corpus and industry corpora. As the core statistical data that stores language information, it holds the lexical, syntactic and semantic information of the language as a whole and of professional terminology. During spelling error checking, the corpus supplies the spell checking model with all word and sentence information, that is, the global data of the statistical language. In addition, corpus information specific to the user is obtained from the dictionary the user builds, and the user's historical information is recorded by gathering statistics over the text the user enters. Data tables are defined in a database to store each corpus and the user's input information. The table structures are as follows:
(1) User dictionary table structure
(2) Unigram data table structure
(3) Bigram data table structure
(4) Industry corpus table structure
(5) Weight data table structure
2. Building the spelling error checking model:
The spelling error checking model is built by computing over the corpus statistics with the N-gram model to obtain the word chain combination with the largest weighted conditional probability across the corpora. During model construction, the goals are to judge the correctness of each word accurately and to make the recommendation list practical, without increasing the complexity of data matching and sorting in the computation. Based on the global statistical model, the user's needs and the text information, all corpus data is used to find the weighted probabilities of all possible words. Taking into account the edit distance of each word and its probability in each corpus, an optimal recommendation list for the current word is produced by weighted calculation. The spell checking model performs error checking and word recommendation on the text entered by the user, as shown in Fig. 3. The model has two stages:
A) Generating the best candidate set for a word
Definition of word occurrence probability: when N=3, let word1 and word2 be the two words that precede the checked word, word, in the text. The four-tuple (word1, word2, word, COUNT) is retrieved from the corpus, and the ratio of COUNT to the sum of COUNT over the whole corpus is calculated; this ratio is the probability of word. If word is the second word of the sentence, then word1='#'; if word is the first word of the sentence, then word1=word2='#'. COUNT is the total number of times this word combination occurs in the corpus.
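The definition above can be sketched in a few lines. The example assumes the corpus is held as a dictionary that maps each (word1, word2, word) tuple to its COUNT; the function names are hypothetical and the in-memory dictionary stands in for the database tables the patent describes.

```python
def context_tuples(sentence_words, n=3, pad="#"):
    """Build one n-gram per word, padding the sentence start with '#'."""
    padded = [pad] * (n - 1) + list(sentence_words)
    return [tuple(padded[i:i + n]) for i in range(len(sentence_words))]


def word_probability(ngram, counts):
    """Ratio of this n-gram's COUNT to the sum of COUNT over the corpus."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty corpus gives no evidence for any word
    return counts.get(ngram, 0) / total
```

For the sentence "the valve leaks" with N=3, `context_tuples` yields ('#', '#', 'the'), ('#', 'the', 'valve') and ('the', 'valve', 'leaks'), which matches the padding rule for the first and second word of a sentence.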
A valid corpus and user dictionary must meet the following three conditions:
(1) The user dictionary contains no incorrect words; every entry must be either a correct word taken from a recognized standard dictionary such as Oxford or Longman, or a special string entered by the user. Here N=1, and the probability of such a word is calculated from the COUNT in the two-tuple (word, COUNT) together with the edit distance of the word itself;
(2) The core corpus is large enough to ensure that the probabilities computed in the model are statistically meaningful. Given the volume of data in the corpus, N-gram entries with COUNT <= 200 are excluded from the calculation. The core corpus has no bias toward any industry or time period, and each time a user uses the system the statistics of the user's text are added to the core corpus;
(3) The industry corpora are initially divided into several broad categories according to application requirements, and new industries or combinations can be generated from users' usage; that is, the industry fields covered by the industry corpora keep expanding with use.
First, the words in the text are matched against the dictionary to judge their correctness. If a word is not in the user dictionary, words close to it in edit distance are computed from the dictionary to obtain the candidate word set. Next, for each word in the candidate word set, together with the N words preceding it in the text, the probability is calculated in the industry corpus and then in the core corpus using the formula $p(w_i \mid w_1 w_2 \cdots w_{i-1}) = \frac{C(w_1 w_2 \cdots w_{i-1} w_i)}{C(w_1 w_2 \cdots w_{i-1})}$, where $C(\cdot)$ is the number of times the word sequence occurs in the corpus.
The final word weight $W_W$ is calculated from the weights $W_H$, $W_P$ and $W_C$ of the user dictionary, the industry corpus and the core corpus, where $p_1$, $p_2$ and $p_3$ are the occurrence probabilities of the word in the user dictionary, the industry corpus and the core corpus respectively. The weights satisfy $W_H + W_P + W_C = 1$ and are calculated from how often each corpus is called while the user uses the system; initially $W_H = W_P = W_C$. After the words whose weight is below the threshold $W_T$ are rejected, the candidate word set is obtained.
$$W_W = W_H \, p_1 + W_P \, p_2 + W_C \, p_3$$
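The conditional probability and the weighted combination can be sketched together as follows. This is an illustrative sketch and not the pseudocode of the patent's Table 1, which is not reproduced in this text. The count dictionary, the function names, the equal initial weights and the threshold value 0.05 are all assumptions chosen for the example.

```python
def conditional_probability(history, word, counts):
    """Estimate p(word | history) as C(history + word) / C(history)."""
    history = tuple(history)
    denominator = counts.get(history, 0)
    if denominator == 0:
        return 0.0  # unseen history: no estimate available
    return counts.get(history + (word,), 0) / denominator


def weighted_score(p1, p2, p3, w_h, w_p, w_c):
    """W_W = W_H * p1 + W_P * p2 + W_C * p3."""
    return w_h * p1 + w_p * p2 + w_c * p3


def filter_candidates(probabilities, weights=(1 / 3, 1 / 3, 1 / 3), threshold=0.05):
    """Keep candidates whose weighted score reaches the threshold W_T.

    probabilities -- dict mapping word to (p1, p2, p3), its probability in
                     the user dictionary, industry corpus and core corpus
    Returns the surviving words, highest score first.
    """
    scored = [(weighted_score(*p, *weights), w) for w, p in probabilities.items()]
    kept = [(score, w) for score, w in scored if score >= threshold]
    return [w for score, w in sorted(kept, key=lambda item: (-item[0], item[1]))]
```

Keeping the three probabilities separate until the final sum makes it possible to change the per-user weights later without recomputing any corpus statistics.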
Table 1 gives the pseudocode of the word weight calculation:
B) Recommending correct words
If the original word is present in the candidate word set, it is confirmed to be a correct word. Otherwise, the weighted probability of each candidate word is calculated from the corpus weights and the joint probability of the word chain, the candidate word set is sorted by weighted probability, and the sorted words form the recommendation list sent to the user.
3. Updating the user-related text statistics, dictionary and corpora
After the user receives the word recommendations and selects the correct word, the text entered by the user also becomes part of the corpus. The system gathers statistics over the corrected text and adds the N-gram information of the correct text to the user dictionary, the core corpus and the corresponding industry corpus, increasing the specific occurrence counts and context data in these data tables. In addition, according to which corpus contains the correct word the user chose, the weights $W_H$, $W_P$ and $W_C$ used when this user calls each corpus are recalculated.
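A feedback update of this kind might look like the sketch below. The patent does not give the exact rule for recalculating the weights, so the rule used here, making each weight proportional to how often the user's chosen word came from that source, is an assumption, as are the function names, the bigram setting `n=2` and the in-memory `Counter` that stands in for the data tables.

```python
from collections import Counter


def record_feedback(corrected_words, ngram_counts, n=2, pad="#"):
    """Add the n-grams of a corrected sentence to an n-gram count table."""
    padded = [pad] * (n - 1) + list(corrected_words)
    for i in range(len(corrected_words)):
        ngram_counts[tuple(padded[i:i + n])] += 1


def recompute_weights(selection_counts, sources=("user", "industry", "core")):
    """Derive weights summing to 1 from how often each source was chosen.

    Assumed rule: weight = share of the user's selections from that source.
    With no selections yet, all sources get the same weight.
    """
    total = sum(selection_counts.get(s, 0) for s in sources)
    if total == 0:
        return {s: 1 / len(sources) for s in sources}
    return {s: selection_counts.get(s, 0) / total for s in sources}
```

In use, `record_feedback` would be called once per corrected text for each table that should learn from it, and `recompute_weights` would be called with that user's running tally of which source supplied each accepted word.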
The present invention may have many other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the invention, and all such changes and modifications shall fall within the scope of protection of the claims appended to the invention.

Claims (9)

1. An industry spelling error checking method based on user feedback, characterized by comprising the steps of:
Step 1, acquiring and building the corpus and the user dictionary:
The corpus is divided into a user dictionary, a core corpus and industry corpora. As the core statistical data that stores language information, it holds the lexical, syntactic and semantic information of the whole language. During spelling error checking, the corpus supplies the spelling error checking model with all word and sentence information, that is, the global data of the whole language. In addition, new corpus information about the user is obtained from the text the user enters and from the user's usage, and the corpus and user dictionary are updated;
Data tables are defined in a database to store the overall corpus and the user's input information;
Step 2, building the spelling error checking model:
The spelling error checking model is built by computing over the corpus statistics with the N-gram model to obtain the word chain combination with the largest conditional probability;
Step 3, the system's interactive interface processes the text entered by the user using the error checking and word recommendation of the spelling error checking model;
Step 4, updating the user-related text statistics, dictionary and corpora: statistics are gathered over the user's input and the correct words selected, and the word information and context statistics of the correct text are added to the user dictionary, the core corpus and the corresponding industry corpus.
2. The industry spelling error checking method based on user feedback according to claim 1, characterized in that, in step 1, the necessary conditions for a valid corpus and user dictionary include:
(1) the dictionary contains no incorrect words; its entries must be correct words taken from a recognized standard dictionary such as Oxford or Longman, together with industry-specific or special words defined by the user;
(2) the core corpus is large enough, is free of bias toward any industry or time period, and must contain N-gram information so as to provide basic word chain statistics;
(3) the industry corpora are initially divided according to requirements and are then generated naturally from users' selections; a given user may be a user of several industry corpora;
(4) the user dictionary is a dictionary built according to the user's input needs and can be managed by the user.
3. The industry spelling error checking method based on user feedback according to claim 1, characterized in that step 2 specifically comprises:
Step 2.1, judging the correctness of a word: each word in the text is matched against the standard dictionary; if the word is not in the standard dictionary, the industry corpus and the user dictionary are checked in turn; if the word is found in none of these three kinds of data tables, it is judged to be an incorrect word and the next step is carried out;
Step 2.2, recommending correct words: based on edit distance and word chain joint probability, the correct words most relevant to the incorrect word are computed by a weighted calculation over the corpora, and the several words with the largest combined probability are selected to form the recommendation list for the incorrect word.
4. the industry misspelling inspection method based on user feedback according to claim 3, it is characterized in that, in described step 2.1, use viterbi algorithm Rapid matching current word probability of occurrence in each corpus in N-gram model, and obtain the joint probability that current word and a front N-1 word occur, realize the judgement to current word correctness.
5. the industry misspelling inspection method based on user feedback according to claim 3, is characterized in that, in described step 2.2, sorted to recommendation word list by editing distance and word probability of occurrence, right rear line provides recommendation results; Being used for the weights of ordered word list is be weighted acquisition to the probability of word in each corpus.
6. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, in described step 4, after bug check is carried out to the text of user's input, calculate the text statistical information in user's input, for the N-gram data in user dictionary and corpus provide lastest imformation, after corresponding tables of data being upgraded, provide bug check service by new corpus data and dictionary.
7. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, use Hidden Markov Model (HMM) to check in corpus to make using the highest word of the context-sensitive word chain probability of occurrence of incorrect word position as correct word list, each corpus has different weights, by the probability of word in corpus and the weighted calculation of corpus, obtain the recommendation word list after sorting; Misspelling inspection is completed recommending the selection of word by user.
8. the industry misspelling inspection method based on user feedback according to claim 1, it is characterized in that, adopt user dictionary, core corpus, industry corpus and statistical language model, after user inputs one section of text, server carries out element to text, be the word chain set under the N unit syntax by text dividing, thus calculate the conditional probability of last word in corpus in each word chain; Statistical language model calculates the alternative set of word as correct word of several maximum probabilities, if former word is in alternative set, then judge that former word is correct, otherwise user selects from alternative set a word as correct word.
9. The industry spelling mistake checking method based on user feedback according to claim 8, characterized in that the statistical language model adopts the Viterbi algorithm to calculate, for the words in the user's text, the word list with the highest weighted probability in the corpora; the user who receives the recommended word list selects the correct word according to the actual situation, and the correct word and its context statistics are added to the corpus associated with that user; the updated data for this word in the user dictionary, the core corpus and the industry corpus are calculated by the statistical language model and added to the data tables, and the new data are used to check the spelling of the next user text to arrive.
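
The ordering of recommended words described in claims 5 and 7 can be illustrated with a minimal Python sketch. The claims do not state how edit distance and weighted probability are combined, so the sketch assumes that edit distance is the primary key and the corpus-weighted probability breaks ties. The function names, corpus weights and probabilities are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# A corpus is modelled as (weight, {word: probability}); this layout is an assumption.
Corpus = Tuple[float, Dict[str, float]]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def weighted_probability(word: str, corpora: List[Corpus]) -> float:
    """Sum over corpora of corpus weight times the word's probability in that corpus."""
    return sum(weight * probs.get(word, 0.0) for weight, probs in corpora)


def rank_candidates(misspelled: str, candidates: List[str],
                    corpora: List[Corpus]) -> List[str]:
    """Sort by ascending edit distance, then by descending weighted probability."""
    return sorted(
        candidates,
        key=lambda w: (edit_distance(misspelled, w),
                       -weighted_probability(w, corpora)),
    )


# Hypothetical data: an industry corpus weighted above a core corpus.
industry = {"bolt": 0.02, "boiler": 0.01}
core = {"bolt": 0.001, "boll": 0.0005, "boiler": 0.002}
corpora = [(0.6, industry), (0.4, core)]

print(rank_candidates("bolr", ["boiler", "boll", "bolt"], corpora))
# prints ['bolt', 'boll', 'boiler']
```

Here "bolt" and "boll" are both one edit away from "bolr", so the weighted probability decides between them; "boiler" needs two insertions and is listed last despite its higher probability.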
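
The word chain search of claims 8 and 9 can be sketched in the same way. The sketch below assumes a bigram model (N = 2), a hypothetical start token `<s>`, a small floor probability for unseen word pairs, and a non-empty candidate list at every position; none of these details are specified in the claims. It runs the Viterbi algorithm over the candidate words at each position and returns the chain with the largest product of conditional probabilities.

```python
import math
from typing import Dict, List, Tuple


def viterbi_word_chain(lattice: List[List[str]],
                       bigram: Dict[Tuple[str, str], float],
                       floor: float = 1e-9) -> List[str]:
    """Return the most probable word chain through a candidate lattice.

    lattice[i] holds the candidate words for position i (assumed non-empty).
    bigram[(prev, word)] is P(word | prev); unseen pairs fall back to `floor`.
    """
    # best maps the last word of a partial chain to (log-probability, chain).
    best: Dict[str, Tuple[float, List[str]]] = {"<s>": (0.0, [])}
    for candidates in lattice:
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for word in candidates:
            # Keep only the best predecessor for this word (the Viterbi step).
            score, chain = max(
                ((lp + math.log(bigram.get((prev, word), floor)), ch)
                 for prev, (lp, ch) in best.items()),
                key=lambda item: item[0],
            )
            nxt[word] = (score, chain + [word])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical bigram probabilities; the third position is the suspect word.
bigram = {
    ("<s>", "check"): 0.1,
    ("check", "the"): 0.5,
    ("the", "valve"): 0.02,
    ("the", "value"): 0.03,
    ("the", "slave"): 0.001,
    ("valve", "pressure"): 0.2,
    ("value", "pressure"): 0.001,
}
lattice = [["check"], ["the"], ["valve", "value", "slave"], ["pressure"]]

print(viterbi_word_chain(lattice, bigram))
# prints ['check', 'the', 'valve', 'pressure']
```

In this example "value" is more probable than "valve" immediately after "the", but the following word "pressure" makes the chain through "valve" the more probable one overall, which is the effect of scoring whole chains rather than single words.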
CN201410149427.8A 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback Expired - Fee Related CN103885938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410149427.8A CN103885938B (en) 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback

Publications (2)

Publication Number Publication Date
CN103885938A (en) 2014-06-25
CN103885938B (en) 2015-04-22

Family

ID=50954833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410149427.8A Expired - Fee Related CN103885938B (en) 2014-04-14 2014-04-14 Industry spelling mistake checking method based on user feedback

Country Status (1)

Country Link
CN (1) CN103885938B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104112447B (en) * 2014-07-28 2017-08-25 安徽普济信息科技有限公司 Method and system for improving accuracy of statistical language model
US10402435B2 (en) 2015-06-30 2019-09-03 Microsoft Technology Licensing, Llc Utilizing semantic hierarchies to process free-form text
CN105206267B (en) * 2015-09-09 2019-04-02 中国科学院计算技术研究所 A kind of the speech recognition errors modification method and system of fusion uncertainty feedback
CN106708893B (en) * 2015-11-17 2018-09-28 华为技术有限公司 Search query word error correction method and device
CN105654955B (en) * 2016-03-18 2019-11-12 华为技术有限公司 Audio recognition method and device
CN107291730B (en) * 2016-03-31 2020-07-31 阿里巴巴集团控股有限公司 Method and device for providing correction suggestion for query word and probability dictionary construction method
CN107291775B (en) * 2016-04-11 2020-07-31 北京京东尚科信息技术有限公司 Method and device for generating repairing linguistic data of error sample
CN107305542B (en) * 2016-04-21 2018-11-16 珠海金山办公软件有限公司 A kind of spell checking methods and device
CN106294325B (en) * 2016-08-11 2019-01-04 海信集团有限公司 The optimization method and device of spatial term sentence
CN106528616B (en) * 2016-09-30 2019-12-17 厦门快商通科技股份有限公司 Language error correction method and system in human-computer interaction process
US10127219B2 (en) 2016-12-09 2018-11-13 Hong Kong Applied Science and Technoloy Research Institute Company Limited System and method for organizing and processing feature based data structures
US10089297B2 (en) * 2016-12-15 2018-10-02 Microsoft Technology Licensing, Llc Word order suggestion processing
US10679008B2 (en) 2016-12-16 2020-06-09 Microsoft Technology Licensing, Llc Knowledge base for analysis of text
CN107122346B (en) * 2016-12-28 2018-02-27 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN108628827A (en) * 2018-04-11 2018-10-09 广州视源电子科技股份有限公司 Candidate word evaluation method and device, computer equipment and storage medium
CN109033065A (en) * 2018-06-01 2018-12-18 昆明理工大学 A kind of English- word spelling inspection method
CN110600011B (en) * 2018-06-12 2022-04-01 中国移动通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN109145287B (en) * 2018-07-05 2022-11-29 广东外语外贸大学 Indonesia word error detection and correction method and system
CN109542247B (en) * 2018-11-14 2023-03-24 腾讯科技(深圳)有限公司 Sentence recommendation method and device, electronic equipment and storage medium
CN111259654B (en) * 2018-11-30 2023-09-15 北京嘀嘀无限科技发展有限公司 Text error detection method and device
CN110020432B (en) * 2019-03-29 2021-09-14 联想(北京)有限公司 Information processing method and information processing equipment
CN110147546B (en) * 2019-04-03 2023-05-26 苏州驰声信息科技有限公司 Grammar correction method and device for spoken English
CN112328737B (en) * 2019-07-17 2023-05-05 北方工业大学 Spelling data generation method
CN110489723A (en) * 2019-08-19 2019-11-22 绍兴数纺科技有限公司 A kind of data error detection and error correction system of dyeing information system
CN110532572A (en) * 2019-09-12 2019-12-03 四川长虹电器股份有限公司 Spell checking methods based on the tree-like naive Bayesian of TAN
CN113095072B (en) * 2019-12-23 2024-06-28 华为技术有限公司 Text processing method and device
CN111523532A (en) * 2020-04-14 2020-08-11 广东小天才科技有限公司 Method for correcting OCR character recognition error and terminal equipment
CN113743092A (en) * 2020-05-27 2021-12-03 阿里巴巴集团控股有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN111859920B (en) * 2020-06-19 2024-06-04 北京国音红杉树教育科技有限公司 Word misspelling recognition method, system and electronic equipment
CN111737980B (en) * 2020-06-22 2023-05-16 桂林电子科技大学 Correction method for use errors of English text words
CN118152428A (en) * 2024-05-09 2024-06-07 烟台海颐软件股份有限公司 Prediction and enhancement method and device for query instruction of electric power customer service system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298577A (en) * 2011-09-21 2011-12-28 深圳市万兴软件有限公司 Method and device for detecting spelling of document edition
CN102937949A (en) * 2012-10-15 2013-02-20 福建榕基软件股份有限公司 Method and system for checking English spelling in rich text editor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4652737B2 (en) * 2004-07-14 2011-03-16 インターナショナル・ビジネス・マシーンズ・コーポレーション Word boundary probability estimation device and method, probabilistic language model construction device and method, kana-kanji conversion device and method, and unknown word model construction method,

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150422
