CN108563632A - Method, system, computer device and storage medium for correcting misspelled characters - Google Patents

Method, system, computer device and storage medium for correcting misspelled characters

Info

Publication number
CN108563632A
Authority
CN
China
Prior art keywords
word
misspelling
sentence
probability
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810271932.8A
Other languages
Chinese (zh)
Inventor
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Technology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Technology Co Ltd
Priority to CN201810271932.8A priority Critical patent/CN108563632A/en
Publication of CN108563632A publication Critical patent/CN108563632A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation

Abstract

The present invention relates to a method, system, computer device and storage medium for correcting misspelled characters. The method includes: obtaining the misspelled character at each position of a sentence to be corrected, and selecting confusable characters from the confusion set of that character to form the candidate set for the corresponding position, where the confusion set is the set of multiple characters whose spelling is similar to that of the character; taking the Cartesian product of the candidate sets over all positions to obtain multiple candidate sentences; inputting each candidate sentence into a pre-trained misspelling correction model to detect and compute the probability score of the candidate sentence; and selecting a candidate sentence according to the probability scores to correct the sentence. The technical solution of the invention corrects misspellings in text input accurately and efficiently.

Description

Method, system, computer device and storage medium for correcting misspelled characters
Technical field
The present invention relates to the technical field of computer software, and in particular to a method, system, computer device and storage medium for correcting misspelled characters.
Background
With the continuous development of computer software technology, techniques for retrieving, extracting and translating text information have gradually matured, but there is still no accurate and efficient method for proofreading text.
Correcting wrong characters is the core of text proofreading, because wrong characters seriously degrade the quality of a text. For example, press releases have very strict requirements regarding wrong characters: if the wrong characters in a manuscript are not corrected in time, erroneous information may be passed on to readers. Correcting wrong characters in text is therefore of great significance.
Traditional methods for correcting input errors are mainly statistics-based. Such a method builds a statistical language model from features of the surrounding characters and words, and depends on that model. While the model is being built, data sparsity can seriously affect the efficiency and precision of correction, which makes it difficult to correct misspellings in text input accurately and efficiently.
Summary of the invention
In view of the above difficulty of correcting misspellings in text input accurately and efficiently, it is necessary to provide a method, system, computer device and storage medium for correcting misspelled characters.
A method for correcting misspelled characters includes the following steps:
Obtaining the misspelled character at each position of the sentence to be corrected, and selecting confusable characters from the confusion set of the misspelled character to form the candidate set for the corresponding position, where the confusion set is the set of multiple characters whose spelling is similar to that of the character;
Taking the Cartesian product of the candidate sets over all positions to obtain multiple candidate sentences;
Inputting each candidate sentence into a pre-trained misspelling correction model to detect and compute the probability score of the candidate sentence;
Selecting a candidate sentence according to the probability scores to correct the sentence to be corrected.
In the above method, for the misspelled character obtained at each position of the sentence to be corrected, confusable characters are selected from its confusion set to form the candidate set for that position. The Cartesian product of the candidate sets over all positions yields multiple candidate sentences, which are input into the pre-trained misspelling correction model to detect and compute their probability scores. A candidate sentence is then selected according to the probability scores to correct the sentence. The technical solution corrects misspellings in text input accurately and efficiently.
In one embodiment, the step of selecting confusable characters from the confusion set of the misspelled character to form the candidate set for the corresponding position includes:
Obtaining the K characters with the highest probability of occurrence in the confusion set of the misspelled character to form the candidate set for the corresponding position, where K >= 2 and the probability of occurrence is the probability that each candidate character in the confusion set of the misspelled character occurs at the current position;
The step of selecting a candidate sentence according to the probability scores to correct the sentence to be corrected includes: replacing the sentence to be corrected with the candidate sentence that has the highest probability score.
In one embodiment, the method for correcting misspelled characters further includes:
Using the misspelling correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set; and identifying the misspelled characters in the sentence according to these probabilities of occurrence.
In one embodiment, the step of using the pre-trained misspelling correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set includes:
Inputting a character of the sentence to be corrected into the misspelling correction model to obtain the probability vector over all characters at the next position, and reading the probability of occurrence of the next character from that vector;
Obtaining the confusion set of the character, and using the misspelling correction model to detect the probability of occurrence, at the current position, of each candidate character in that confusion set.
In one embodiment, the step of identifying the misspelled characters in the sentence to be corrected according to the probabilities of occurrence includes:
If the probability of occurrence of the current character is greater than a first probability threshold, judging that the character is not misspelled;
If the probability of occurrence of the current character is less than the first probability threshold and greater than a second probability threshold, judging that the character is not misspelled when its probability of occurrence is the highest in its confusion set, and judging that it is misspelled otherwise.
In one embodiment, the step of inputting each candidate sentence into the pre-trained misspelling correction model to detect and compute its probability score includes:
Inputting each candidate sentence into the pre-trained misspelling correction model to detect the probability of occurrence of the character at each position;
Adding or multiplying the probabilities of occurrence of the characters at all positions to obtain the probability score of the candidate sentence.
In one embodiment, the method for correcting misspelled characters further includes:
Obtaining natural-language corpus data and building a training model for misspelling detection;
Preprocessing the corpus data to obtain training sentences;
Training the training model with the training sentences to obtain the misspelling detection model.
In one embodiment, the step of preprocessing the corpus data to obtain training sentences includes:
Deleting redundant content from the corpus data used for the training model, and replacing non-character data with letters;
Splitting the sentences in the corpus data into units of single characters and the replacement letters, and adding a sentence-start tag and a sentence-end tag at the beginning and end of each sentence to generate the training sentences.
In one embodiment, a training model for unidirectional misspelling detection is built on recurrent neural network technology, and the training model is trained with forward-input training sentences to obtain a unidirectional misspelling detection model.
In one embodiment, a training model for bidirectional misspelling detection is built on a long short-term memory (LSTM) neural network and natural-language corpus data, and the training model is trained with forward-input and backward-input training sentences to obtain a bidirectional misspelling detection model.
In one embodiment, the confusion set is stored in a file as key-value pairs, where the key is the pinyin of a Chinese character and the value is the set of characters pronounced with that pinyin.
A system for correcting misspelled characters includes:
A selection module, configured to obtain the misspelled character at each position of the sentence to be corrected and to select confusable characters from the confusion set of the misspelled character to form the candidate set for the corresponding position, where the confusion set is the set of multiple characters whose spelling is similar to that of the character;
A product module, configured to take the Cartesian product of the candidate sets over all positions to obtain multiple candidate sentences;
A computing module, configured to input each candidate sentence into the pre-trained misspelling correction model to detect and compute the probability score of the candidate sentence;
A correction module, configured to select a candidate sentence according to the probability scores to correct the sentence to be corrected.
In the above system, for the misspelled character obtained at each position of the sentence to be corrected, confusable characters are selected from its confusion set to form the candidate set for that position. The Cartesian product of the candidate sets over all positions yields multiple candidate sentences, which are input into the pre-trained misspelling correction model to detect and compute their probability scores. A candidate sentence is then selected according to the probability scores to correct the sentence. The technical solution corrects misspellings in text input accurately and efficiently.
A computer device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor; when the processor executes the computer program, it implements the above method for correcting misspelled characters.
Through the computer program running on the processor, the above computer device corrects misspellings in text input accurately and efficiently.
A computer storage medium stores a computer program which, when executed by a processor, implements the above method for correcting misspelled characters.
Through the computer program it stores, the above computer storage medium corrects misspellings in text input accurately and efficiently.
Description of the drawings
Fig. 1 is a flowchart of the method for correcting misspelled characters in one embodiment;
Fig. 2 is a flowchart of the method for correcting misspelled characters in another embodiment;
Fig. 3 is a flowchart of training the misspelling detection model in one embodiment;
Fig. 4 is a schematic diagram of the unidirectional training model;
Fig. 5 is a schematic diagram of the prediction result of the unidirectional training model;
Fig. 6 is a schematic diagram of the bidirectional training model;
Fig. 7 is a schematic diagram of the prediction result of the bidirectional training model;
Fig. 8 is a flowchart of computing the probability score;
Fig. 9 is a structural diagram of the system for correcting misspelled characters in one embodiment;
Fig. 10 is a structural diagram of the system for correcting misspelled characters in another embodiment;
Fig. 11 is a diagram of the internal structure of the computer device in one embodiment.
Detailed description
To facilitate understanding, the present invention is described more fully below with reference to the accompanying drawings, which show preferred embodiments of the invention. The invention can, however, be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure will be thorough and complete.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the invention. The terms used in this description are intended only to describe specific embodiments and are not intended to limit the invention.
The technical solution provided by the embodiments of the invention can be applied to terminal devices including PCs, smartphones, tablet computers and personal digital assistants. A text input program can run on the terminal device to enter text content; when a character is misspelled, the text content is corrected by the correction scheme provided by the embodiments.
Referring to Fig. 1, which is a flowchart of the method for correcting misspelled characters in one embodiment, the method includes the following steps:
S20: obtain the misspelled character at each position of the sentence to be corrected, and select confusable characters from the confusion set of the misspelled character to form the candidate set for the corresponding position, where the confusion set is the set of multiple characters whose spelling is similar to that of the character.
S30: take the Cartesian product of the candidate sets over all positions to obtain multiple candidate sentences.
S40: input each candidate sentence into the pre-trained misspelling correction model to detect and compute the probability score of the candidate sentence.
The misspelling correction model can be obtained in advance by training on a large number of character samples, and the confusion set stores, for each character, the candidate characters with which it is easily confused in spelling. The model can detect the probability of occurrence of each character of the sentence at its current position and, at the same time, the probability of occurrence at that position of each candidate character in the character's confusion set.
S50: select a candidate sentence according to the probability scores to correct the sentence to be corrected.
In the technical solution of the above embodiment, for the misspelled character obtained at each position of the sentence to be corrected, confusable characters are selected from its confusion set to form the candidate set for that position. The Cartesian product of the candidate sets over all positions yields multiple candidate sentences, which are input into the pre-trained misspelling correction model to detect and compute their probability scores. A candidate sentence is then selected according to the probability scores to correct the sentence. The technical solution corrects misspellings in text input accurately and efficiently.
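Steps S30 to S50 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function name `correct_sentence` and its parameters are hypothetical, and the `score` callable stands in for the trained model, which is not reproduced here.

```python
from itertools import product

def correct_sentence(sentence, misspelled_positions, candidate_sets, score):
    """Return the candidate sentence with the highest probability score.

    sentence: the sentence to be corrected (a string of characters)
    misspelled_positions: indices judged to hold a misspelled character
    candidate_sets: dict mapping each such index to its candidate characters
    score: callable returning the probability score of a sentence
           (hypothetical stand-in for the trained model)
    """
    if not misspelled_positions:
        return sentence
    pools = [candidate_sets[i] for i in misspelled_positions]
    best, best_score = sentence, float("-inf")
    # S30: the Cartesian product enumerates every combination of candidates.
    for combo in product(*pools):
        chars = list(sentence)
        for pos, ch in zip(misspelled_positions, combo):
            chars[pos] = ch
        candidate = "".join(chars)
        # S40: score the candidate sentence.
        s = score(candidate)
        # S50: keep the candidate with the highest score.
        if s > best_score:
            best, best_score = candidate, s
    return best
```

The number of candidate sentences is the product of the candidate-set sizes, so limiting each set to a few characters keeps the enumeration small.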
In one embodiment, the step in S20 of selecting confusable characters from the confusion set of the misspelled character to form the candidate set for the corresponding position includes:
Obtaining the K characters with the highest probability of occurrence in the confusion set of the misspelled character to form the candidate set for the corresponding position, where K >= 2 and the probability of occurrence is the probability that each candidate character in the confusion set of the misspelled character occurs at the current position.
Correspondingly, the method in step S50 of selecting a candidate sentence according to the probability scores to correct the sentence to be corrected may include: replacing the sentence to be corrected with the candidate sentence that has the highest probability score.
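The top-K selection can be sketched as below, assuming the model's probabilities for the confusion set at the current position are already available as a dictionary. The function name `top_k_candidates` and the sample probabilities in the usage note are illustrative assumptions.

```python
def top_k_candidates(confusion_probs, k=2):
    """Pick the k characters with the highest probability of occurrence.

    confusion_probs: dict mapping each character of the confusion set to its
                     probability at the current position (assumed given)
    k: number of candidates to keep; the text requires K >= 2
    """
    if k < 2:
        raise ValueError("the text requires K >= 2")
    ranked = sorted(confusion_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [ch for ch, _ in ranked[:k]]
```

For instance, `top_k_candidates({"化": 1e-5, "华": 0.3, "花": 0.01}, k=2)` keeps "华" and "花" in that order.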
In one embodiment, the process in step S20 of obtaining the misspelled character at each position of the sentence to be corrected may include:
Using the misspelling correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set; and identifying the misspelled characters in the sentence according to these probabilities of occurrence.
In this embodiment, the misspelling correction model can be obtained in advance by training on a large number of character samples; that is, it is the misspelling correction model referred to in step S40. The model can be used to detect the probability of occurrence of each character and each candidate character at the current position.
In one embodiment, referring to Fig. 2, which is a flowchart of the method for correcting misspelled characters in another embodiment, the method may further include:
S10: train the misspelling detection model.
Referring to Fig. 3, which is a flowchart of training the misspelling detection model in one embodiment, step S10 mainly includes:
S101: obtain natural-language corpus data and build a training model for misspelling detection;
S102: preprocess the corpus data to obtain training sentences;
Further, the corpus data may be preprocessed as follows:
Delete redundant content from the corpus data used for the training model, replace non-character data with letters, split the sentences in the corpus data into units of single characters and the replacement letters, add a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, and so on.
S103: train the training model with the training sentences to obtain the misspelling detection model.
In the above embodiment, preprocessing generates training sentences suitable for model training. Data cleaning deletes useless symbols from the corpus data, as well as sentences containing uncommon Chinese characters, repeated sentences, and sentences with fewer than 2 Chinese characters.
For example, continuous runs of Arabic numerals, English words or English abbreviations are uniformly replaced with letters. One option is to replace each continuous run of digits with the capital letter N and each continuous run of English letters with the capital letter C; which letter is used can be changed and configured as needed. A comparison of text before and after replacement is as follows:
Before replacement | After replacement
2017年4月5日 (April 5, 2017) | N年N月N日
ABC第二产业园 (ABC Second Industrial Park) | C第二产业园
After the replacement, a sentence-start tag and a sentence-end tag can also be added at the beginning and end of each sentence: for example, "<s>" at the beginning and "</s>" at the end. The sentences in the corpus data are then split into units of single characters and the replacement letters to generate a corpus package that can be used for model training. Part of the data in the generated corpus package is as follows:
Preprocessing the natural-language corpus data into training sentences allows the training model to be trained in a targeted way, which improves the efficiency of model training and thereby the accuracy of the probabilities output by the misspelling detection model.
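The replacement and tagging steps described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `preprocess` is hypothetical, runs of ASCII digits and ASCII letters are matched with a regular expression, and the data-cleaning rules (removing useless symbols, uncommon characters and duplicates) are left out.

```python
import re

def preprocess(sentence):
    """Turn a raw sentence into a list of training tokens.

    Each run of digits becomes "N" and each run of English letters becomes "C".
    Both replacements happen in a single pass so that an inserted "N" is not
    itself rewritten to "C" by a later letter replacement.
    """
    def placeholder(match):
        return "N" if match.group()[0].isdigit() else "C"

    replaced = re.sub(r"[0-9]+|[A-Za-z]+", placeholder, sentence)
    # Split into single characters and add the start and end tags.
    return ["<s>"] + list(replaced) + ["</s>"]
```

Applied to "2017年4月5日", the sketch yields the tokens `<s>`, N, 年, N, 月, N, 日, `</s>`.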
For the model training of step S103, the embodiments of the invention provide several language models; a unidirectional language model and a bidirectional language model are described below as examples.
In line with the preceding embodiments, model parameters such as the number of layers and the number of neurons of the neural network can be configured according to the required detection accuracy and actual needs. For example, a two-layer RNN can be built with dropout regularization between layers; the input layer uses 4000 neurons, corresponding to 4000 commonly used Chinese characters, the hidden layer uses 400 neurons, and the output layer uses a softmax classification function whose output is the predicted probability of occurrence of each character.
When the training model is trained with the training sentences in the corpus package, it takes each training sentence in turn and, beginning at the sentence-start tag, reads the characters of the sentence one by one. From the information about the characters at the positions before the current position, it predicts the character most likely to occur at the current position. After training and tuning, the training model produces the desired output.
In one implementation, refer to Fig. 4 and Fig. 5: Fig. 4 is a schematic diagram of the unidirectional training model, and Fig. 5 is a schematic diagram of its prediction result. A training model for unidirectional misspelling detection can be built on recurrent neural network (RNN) technology and trained with forward-input training sentences to obtain a unidirectional misspelling detection model.
For example, the input of the training model is "<s>中华人民共和国" (People's Republic of China) and the desired output is "中华人民共和国</s>"; that is, for the training sentence "中华人民共和国" the corresponding prediction result should be as shown in Fig. 5. The trained model that produces the expected output is used as the misspelling detection model: it examines the characters of a sentence to be detected and outputs the probability of occurrence of each character at its current position.
Here, the preceding context is the information about the characters at the positions before the current position in the sentence to be detected. Combining this preceding context increases the accuracy of the probability detected for the character at the current position.
In another implementation, refer to Fig. 6 and Fig. 7: Fig. 6 is a schematic diagram of the bidirectional training model, and Fig. 7 is a schematic diagram of its prediction result. A training model for bidirectional misspelling detection can be built on a bidirectional long short-term memory neural network (Bi-LSTM) and natural-language corpus data, and trained with forward-input and backward-input training sentences to obtain a bidirectional misspelling detection model.
The input of the training model is of two kinds, forward input and backward input. For the sentence "中华人民共和国" (People's Republic of China), to keep the predictions of the two directions consistent, the forward input is "<s>中华人民共和" and the backward input is "华人民共和国</s>", while the desired output in both cases is "中华人民共和国". That is, for this sentence the corresponding prediction result should be as shown in Fig. 7. For example, the prediction of the character "中" is determined jointly by "<s>" and "华人民共和国</s>", which makes full use of the context and improves efficiency.
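The input and target sequences described for the two models can be built as in the sketch below. The function names are hypothetical, and the sketch covers only how the sequences are shifted, not the networks themselves.

```python
BOS, EOS = "<s>", "</s>"

def unidirectional_pair(chars):
    """Input is the sentence prefixed with <s>; target is the sentence followed by </s>."""
    return [BOS] + chars, chars + [EOS]

def bidirectional_inputs(chars):
    """Forward input supplies the left neighbour of each position, backward
    input supplies the right neighbour; the target is the sentence itself."""
    forward = [BOS] + chars[:-1]
    backward = chars[1:] + [EOS]
    return forward, backward, chars
```

In the bidirectional case, position i of the target is predicted from `forward[i]` and everything before it, together with `backward[i]` and everything after it, so all three sequences have the same length.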
Once the model is trained, it can be used to spell-check newly input sentences. In the RNN model, for example, each step outputs through softmax the probability vector over all characters at the next position, and the probability of occurrence of the next character is read from that vector.
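Reading a character's probability from the softmax output can be sketched as follows. The `logits` list stands in for the raw scores of one model step and `vocab` for the character vocabulary; both are assumptions for illustration, since the trained network is not reproduced here.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability vector that sums to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def char_probability(logits, vocab, ch):
    """Probability of occurrence of `ch` at the next position."""
    probs = softmax(logits)
    return probs[vocab.index(ch)]
```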
In the preceding embodiments, using the misspelling correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set may include:
(1) Inputting a character of the sentence to be corrected into the misspelling correction model to obtain the probability vector over all characters at the next position, and reading the probability of occurrence of the next character from that vector.
For example, using the above unidirectional misspelling detection model to check the sentence "中化人民共和国", the probabilities of occurrence obtained are:
中 0.0267950482666
化 5.48984644411e-07
人 0.214276000857
民 0.0538657493889
共 0.0275610154495
和 0.038463984794
国 0.042061101339
In this sentence, "华" has been miswritten as "化". The probability of "化" at its position is 5.48984644411e-07, far smaller than the probabilities of the correctly spelled characters, so it can be used to detect the misspelling.
As another example, using the above bidirectional misspelling detection model to check "中化人民共和国", the probabilities obtained are:
中 0.0108770169318
化 1.73152820935e-05
人 0.919607996941
民 0.365396946669
共 0.999733150005
和 0.854933917522
国 0.988406062126
Here "华" has been miswritten as "化". The probability of "化" is 1.73152820935e-05, far smaller than the probabilities of the correctly spelled characters, so it can be used to detect the misspelling.
(2) Obtaining the confusion set of the character, and using the misspelling correction model to detect the probability of occurrence, at the current position, of each candidate character in that confusion set.
In addition, the embodiments of the invention also provide a scheme for handling the confusion set. For example, it can be stored in a file as key-value pairs: the pronunciation (pinyin) of a Chinese character is the key and the Chinese characters with that pronunciation are the value, so that the characters sharing one key form one easily-confused subset.
Further, polyphonic characters can also be taken into account by placing a polyphonic character in the easily-confused subsets of all its pronunciations. For example, for "会" both pronunciations "hui" and "kuai" are considered, whereas for "为" only "wei" is considered, ignoring the difference between the second and fourth tones.
With the scheme of the above embodiment, the probability of occurrence of each character in the sentence to be corrected can be determined, together with the probability of occurrence at the current position of each candidate character in the confusion set formed by the characters with similar spelling.
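The key-value layout of the confusion set, including the handling of polyphonic characters, can be sketched as below. The `readings` table of toneless pinyin per character is a hypothetical input, and the function names are illustrative; the file format itself is not specified in the text.

```python
from collections import defaultdict

def build_confusion_sets(readings):
    """Group characters by toneless pinyin.

    readings: dict mapping a character to its list of toneless pinyin
              readings (hypothetical input; polyphones list several)
    Returns a dict: pinyin key -> set of characters with that pronunciation.
    """
    by_pinyin = defaultdict(set)
    for ch, pinyins in readings.items():
        for p in pinyins:
            by_pinyin[p].add(ch)
    return by_pinyin

def confusion_set(ch, readings, by_pinyin):
    """Union of the easily-confused subsets for every reading of `ch`."""
    result = set()
    for p in readings.get(ch, []):
        result |= by_pinyin[p]
    return result
```

Because tones are dropped from the keys, "为" appears only under "wei", while "会" appears under both "hui" and "kuai".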
Further, the method in the preceding embodiments of identifying the misspelled characters in the sentence to be corrected according to the probabilities of occurrence may be as follows:
If the probability of occurrence of the current character is greater than the first probability threshold, the character is judged not to be misspelled. If the probability of occurrence of the current character is less than the first probability threshold and greater than the second probability threshold, the character is judged not to be misspelled when its probability of occurrence is the highest in its confusion set, and is judged to be misspelled otherwise.
In this method, once the misspelling correction model is trained, it is used to spell-check a newly input sentence to be corrected, for example "中化人民共合国"; the probabilities obtained are:
中 0.00217751576565
化 8.42674562591e-05
人 0.701624631882
民 0.118908688426
共 0.000807654316
合 3.34586762545e-05
国 0.0664190202951
Assume that the first probability threshold is set to 0.1 and the second probability threshold to 0.0003. When the probability identified by the spelling-error correction model is greater than 0.1, the character is considered not to be misspelled. If the probability is less than 0.1 and greater than 0.0003, it is further determined whether the character's probability of occurrence is the largest in its corresponding confusion set; if so, the character is likewise judged not to be misspelled. Otherwise, the character is judged to be misspelled.
In the table above, the probabilities of the two characters "People" and "The people" are both greater than the first probability threshold, so neither is considered misspelled. The probability of occurrence of "In" is less than the first probability threshold and greater than the second, but it is the largest in its confusion set, so this character is not misspelled. The probabilities of "Change" and "It closes" are both less than the second probability threshold, so these two characters are misspelled. The probability of "Altogether" lies between 0.0003 and 0.1, but it is not the largest in its confusion set, so "Altogether" is also considered misspelled.
The scheme of the above embodiment judges the probability of occurrence of the current character in combination with its probability of occurrence within its corresponding confusion set, and can therefore identify more accurately whether the character is misspelled.
In one embodiment, step S40, in which the candidate sentences are respectively input into the pre-trained spelling-error correction model for detection and the probability operation value of each candidate sentence is calculated, may use the following method:
The candidate sentences are respectively input into the spelling-error correction model to detect the probability of occurrence of the character at each position; the probabilities of occurrence of the characters at all positions are then added or multiplied to obtain the probability operation value of the candidate sentence.
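As a minimal sketch of this calculation, assuming the per-position probabilities have already been obtained from the model, the probability operation value might be computed as follows. The function name and the `mode` parameter are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math

def probability_operation_value(char_probs, mode="sum"):
    """Combine per-position probabilities into one score for a sentence."""
    if mode == "sum":
        return sum(char_probs)
    if mode == "product":
        return math.prod(char_probs)
    raise ValueError("mode must be 'sum' or 'product'")

# Per-character probabilities of the example sentence, as listed in the description.
probs = [0.00217751576565, 8.42674562591e-05, 0.701624631882,
         0.118908688426, 0.000807654316, 3.34586762545e-05, 0.0664190202951]

print(probability_operation_value(probs, "sum"))
print(probability_operation_value(probs, "product"))
```

With the product, long sentences yield very small values; summing log-probabilities is a common way to avoid floating-point underflow, although the description itself mentions only addition and multiplication.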
Referring to Fig. 8, which is a flowchart of calculating the probability operation value: for example, the character "Change" has the pinyin "hua", and the two characters with the highest probability of occurrence in its confusion set are "change" and "China"; similarly, the candidate characters for "Altogether" are "altogether" and "work", and the candidate characters for "It closes" are "it closes" and "and". Taking the Cartesian product of the candidate characters at these three positions yields the following candidate sentences:
Cartesian product                        Probability operation value
Middleization people Gong He states      0.537909173025
Middleization people work and state      1.02907576627
Middleization people Gong He states      0.891207945057
Zhong Hua people's republics             4.13897197429
Chinese people Gong He states            2.7150827029
Chinese people's work and state          3.08748058451
Chinese people Gong He states            3.30365468572
The People's Republic of China (PRC)     6.82562798262
Here the first column is the result of the Cartesian product, and the second column is the probability operation value calculated for each candidate sentence after the Cartesian product results are input into the spelling-error correction model again. As described above, the value may be calculated by adding the probabilities of the characters in each candidate sentence, or by multiplying them. The candidate sentence with the largest probability operation value is selected to replace the sentence to be corrected. In the table above, comparison shows that "The People's Republic of China (PRC)" has the largest probability operation value, so the sentence to be corrected is replaced with this correct sentence.
In the scheme of the above embodiment, after the misspelled characters are identified, the k characters with the highest probability of occurrence are selected from each corresponding confusion set as candidate characters. By taking the Cartesian product of the candidate sets, the candidate sentence with the largest probability operation value can be selected from the multiple groups of candidate sentences to replace the sentence to be corrected, so that spelling errors in text input are corrected accurately and the efficiency of correction is improved.
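The candidate generation and selection described above can be sketched with `itertools.product`. Everything named here is an assumption for illustration: the helper names, the toy scoring function that stands in for the trained model, the invented probabilities, and the Chinese characters, which are an assumed reading of the translated example (中化人民共合国 as the input, 中华人民共和国 as the correct sentence).

```python
from itertools import product

def top_k_candidates(confusion_probs, k=2):
    """Pick the k most probable characters from one confusion set."""
    return sorted(confusion_probs, key=confusion_probs.get, reverse=True)[:k]

def candidate_sentences(sentence, candidates_by_position):
    """Yield each sentence built from one candidate per flagged position."""
    positions = sorted(candidates_by_position)
    for combo in product(*(candidates_by_position[p] for p in positions)):
        chars = list(sentence)
        for pos, ch in zip(positions, combo):
            chars[pos] = ch
        yield "".join(chars)

def best_correction(sentence, candidates_by_position, score):
    """Return the candidate sentence with the largest score."""
    return max(candidate_sentences(sentence, candidates_by_position), key=score)

# Toy stand-in for the model: counts positions that agree with a reference.
REFERENCE = "中华人民共和国"
def toy_score(s):
    return sum(a == b for a, b in zip(s, REFERENCE))

sentence = "中化人民共合国"
candidates = {
    1: top_k_candidates({"化": 0.1, "华": 0.8, "花": 0.05}),  # invented probabilities
    4: ["共", "工"],
    5: ["合", "和"],
}
print(len(list(candidate_sentences(sentence, candidates))))  # 2 * 2 * 2 candidates
print(best_correction(sentence, candidates, toy_score))
```

The number of candidate sentences is the product of the candidate-set sizes, so it grows quickly with k and with the number of flagged positions; keeping k small, as in the example with two candidates per position, keeps the re-scoring step cheap.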
Referring to Fig. 9, which is a schematic structural diagram of a system for correcting misspelled characters according to one embodiment, the system includes:
A selection module 20, configured to obtain the misspelled character at each position of the sentence to be corrected and to select confusable characters from the confusion set of the misspelled character, forming the candidate set for the corresponding position; wherein the confusion set is a set of multiple characters whose spelling is similar to that of the character;
A product module 30, configured to take the Cartesian product of the candidate sets at the respective positions, obtaining multiple groups of candidate sentences;
A calculation module 40, configured to input the candidate sentences respectively into the pre-trained spelling-error correction model for detection and to calculate the probability operation value of each candidate sentence;
A correction module 50, configured to select a candidate sentence according to the probability operation values to correct the sentence to be corrected.
The above system for correcting misspelled characters obtains the misspelled character at each position of the sentence to be corrected and selects confusable characters from its confusion set to form the candidate set for the corresponding position; it then takes the Cartesian product of the candidate sets at the respective positions, inputs the resulting groups of candidate sentences into the pre-trained spelling-error correction model for detection, and calculates their probability operation values; finally, it selects a candidate sentence according to the probability operation values to correct the sentence to be corrected. This technical solution corrects spelling errors in text input accurately and efficiently.
Further, referring to Fig. 10, which is a schematic structural diagram of a system for correcting misspelled characters according to another embodiment, the system further includes a training module 10 for training the spelling-error detection model. The training mainly includes: establishing a training model for spelling-error detection using natural-language corpus data; preprocessing the corpus data to obtain training corpus sentences; and training the training model with the training corpus sentences to obtain the spelling-error detection model.
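The claims describe the preprocessing as deleting redundant content, replacing non-text data with letters, splitting each sentence character by character, and adding start and end tags. A rough sketch of such a step is given below. The tag strings, the placeholder letter, and the choices of what counts as redundant content (whitespace) and as non-text data (digit runs) are all assumptions, since the description does not specify them.

```python
import re

START, END = "<s>", "</s>"   # hypothetical sentence start and end tags
PLACEHOLDER = "N"            # hypothetical letter standing in for non-text data

def preprocess(sentence):
    """Turn one raw corpus sentence into a list of training tokens."""
    # Assumption: whitespace is treated as redundant content and removed.
    sentence = re.sub(r"\s+", "", sentence)
    # Assumption: runs of digits are the non-text data replaced by one letter.
    sentence = re.sub(r"[0-9]+(?:\.[0-9]+)?", PLACEHOLDER, sentence)
    # Split into single characters and wrap with the start and end tags.
    return [START, *sentence, END]

print(preprocess("今年 2018 年"))
```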
In addition, an embodiment of the present invention provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the above method for correcting misspelled characters is implemented.
The above computer device, through the computer program running on the processor, corrects spelling errors in text input accurately and efficiently.
Furthermore, an embodiment of the present invention provides a computer storage medium on which a computer program is stored; when the program is executed by a processor, the above method for correcting misspelled characters is implemented.
The above computer storage medium, through the computer program stored on it, corrects spelling errors in text input accurately and efficiently.
Referring to Fig. 11, which is a schematic diagram of the internal structure of a computer device in one embodiment: the computer device includes a processor, a non-volatile storage medium, an internal memory, a display and a network interface connected through a system bus. The non-volatile storage medium of the computer device can store an operating system and a computer program implementing a voice communication apparatus; when executed, the computer program can cause the processor to perform a voice communication method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The internal memory can store a computer program which, when executed by the processor, can cause the processor to perform the method for correcting misspelled characters. The network interface of the computer device is used for network communication. The display screen is used to display application interfaces and the like, for example an instant-messaging chat interface or an operation interface for text correction. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a button, trackball or trackpad provided on the housing of the computer device, or an external keyboard, trackpad or mouse. The touch layer and the display screen together constitute a touch screen.
Those skilled in the art will understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the terminal to which the solution is applied; a specific terminal may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The technical solution provided by the embodiments of the present invention combines language models based on RNN and Bi-LSTM neural networks with confusion sets to automatically correct wrongly written characters in a sentence, making full use of the contextual information of the sentence and improving the performance of spelling detection. Furthermore, the Cartesian product of the candidate characters is taken and the sentence with the largest probability operation value is selected for the correction, so that deep learning can be carried out autonomously and spelling errors can be corrected automatically.
The above technical solution can be applied to the detection of spelling errors in various texts, for example checking for wrongly written characters in compositions and news manuscripts. For compositions, wrongly written characters affect the quality of the composition; pointing them out gives guidance to students, and whether wrongly written characters are present can also serve as an evaluation dimension for scoring compositions. News releases have very strict requirements regarding wrongly written characters; if a user inputs a wrongly written character, warning the author and providing the correctly spelled character can improve the efficiency of the author's writing.
The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification. Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes the steps described in the above methods. The storage medium is, for example, a ROM/RAM, a magnetic disk or an optical disc.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent shall be determined by the appended claims.

Claims (14)

1. A method for correcting misspelled characters, characterized by comprising the following steps:
obtaining the misspelled character at each position of a sentence to be corrected, and selecting confusable characters from the confusion set of the misspelled character to form a candidate set for the corresponding position; wherein the confusion set is a set of multiple characters whose spelling is similar to that of the character;
taking the Cartesian product of the candidate sets at the respective positions to obtain multiple groups of candidate sentences;
inputting the candidate sentences respectively into a pre-trained spelling-error correction model for detection and calculating a probability operation value of each candidate sentence;
selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected.
2. The method for correcting misspelled characters according to claim 1, characterized in that the step of selecting confusable characters from the confusion set of the misspelled character to form the candidate set for the corresponding position comprises:
obtaining the K confusable characters with the highest probability of occurrence in the confusion set of the misspelled character to form the candidate set for the corresponding position; wherein K ≥ 2, and the probability of occurrence is the probability that each candidate character in the confusion set corresponding to the misspelled character occurs at the current position;
and the step of selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected comprises:
replacing the sentence to be corrected with the candidate sentence having the largest probability operation value.
3. The method for correcting misspelled characters according to claim 2, characterized by further comprising:
using the spelling-error correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its corresponding confusion set; and identifying the misspelled characters in the sentence to be corrected according to the probability of occurrence.
4. The method for correcting misspelled characters according to claim 3, characterized in that the step of using the pre-trained spelling-error correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its corresponding confusion set comprises:
inputting the characters of the sentence to be corrected into the spelling-error correction model for detection, obtaining for each character the probability vector of the characters at its next position, and obtaining the probability of occurrence of the next character from the probability vector of each character;
obtaining the confusion set of the character, and using the spelling-error correction model to detect the probability that each candidate character in the confusion set of the character occurs at the current position.
5. The method for correcting misspelled characters according to claim 3, characterized in that the step of identifying the misspelled characters in the sentence to be corrected according to the probability of occurrence comprises:
if the probability of occurrence of the current character is greater than a first probability threshold, judging that the character is not misspelled;
if the probability of occurrence of the current character is less than the first probability threshold and greater than a second probability threshold, judging that the character is not misspelled when its probability of occurrence is the largest in its corresponding confusion set, and otherwise judging that the character is misspelled.
6. The method for correcting misspelled characters according to claim 1, characterized in that the step of inputting the candidate sentences respectively into the pre-trained spelling-error correction model for detection and calculating the probability operation value of each candidate sentence comprises:
inputting the candidate sentences respectively into the pre-trained spelling-error correction model to detect the probability of occurrence of the character at each position;
adding or multiplying the probabilities of occurrence of the characters at the respective positions to obtain the probability operation value of the candidate sentence.
7. The method for correcting misspelled characters according to claim 1, characterized by further comprising:
establishing a training model for spelling-error detection using natural-language corpus data;
preprocessing the corpus data to obtain training corpus sentences;
training the training model with the training corpus sentences to obtain the spelling-error detection model.
8. The method for correcting misspelled characters according to claim 7, characterized in that the step of preprocessing the corpus data to obtain training corpus sentences comprises:
deleting redundant content from the corpus data of the training model, and replacing non-text data with letters;
splitting the sentences in the corpus data in units of characters and of the letters, and adding a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, to generate the training corpus sentences.
9. The method for correcting misspelled characters according to claim 8, characterized in that a training model for unidirectional spelling-error detection is established based on recurrent neural network technology; and the training model is trained with forward-input training corpus sentences to obtain a unidirectional spelling-error detection model.
10. The method for correcting misspelled characters according to claim 7, characterized in that a training model for bidirectional spelling-error detection is established based on a long short-term memory neural network and natural-language corpus data; and the training model is trained with forward-input and backward-input training corpus sentences to obtain a bidirectional spelling-error detection model.
11. The method for correcting misspelled characters according to claim 1, characterized in that the confusion set is stored in a file as key-value pairs; wherein the key is the pinyin of a Chinese character and the value is the set of characters that have this pinyin.
12. A system for correcting misspelled characters, characterized by comprising:
a selection module, configured to obtain the misspelled character at each position of a sentence to be corrected and to select confusable characters from the confusion set of the misspelled character, forming a candidate set for the corresponding position; wherein the confusion set is a set of multiple characters whose spelling is similar to that of the character;
a product module, configured to take the Cartesian product of the candidate sets at the respective positions, obtaining multiple groups of candidate sentences;
a calculation module, configured to input the candidate sentences respectively into a pre-trained spelling-error correction model for detection and to calculate a probability operation value of each candidate sentence;
a correction module, configured to select a candidate sentence according to the probability operation values to correct the sentence to be corrected.
13. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for correcting misspelled characters according to any one of claims 1 to 11.
14. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for correcting misspelled characters according to any one of claims 1 to 11.
CN201810271932.8A 2018-03-29 2018-03-29 Modification method, system, computer equipment and the storage medium of word misspelling Pending CN108563632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810271932.8A CN108563632A (en) 2018-03-29 2018-03-29 Modification method, system, computer equipment and the storage medium of word misspelling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810271932.8A CN108563632A (en) 2018-03-29 2018-03-29 Modification method, system, computer equipment and the storage medium of word misspelling

Publications (1)

Publication Number Publication Date
CN108563632A true CN108563632A (en) 2018-09-21

Family

ID=63533433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810271932.8A Pending CN108563632A (en) 2018-03-29 2018-03-29 Modification method, system, computer equipment and the storage medium of word misspelling

Country Status (1)

Country Link
CN (1) CN108563632A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
US20160275070A1 (en) * 2015-03-19 2016-09-22 Nuance Communications, Inc. Correction of previous words and other user text input errors
CN106202153A (en) * 2016-06-21 2016-12-07 广州智索信息科技有限公司 The spelling error correction method of a kind of ES search engine and system
CN106528532A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text error correction method and device and terminal
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN106610930A (en) * 2015-10-22 2017-05-03 科大讯飞股份有限公司 Foreign language writing automatic error correction method and system
CN106959977A (en) * 2016-01-12 2017-07-18 广州市动景计算机科技有限公司 Candidate collection computational methods and device, word error correction method and device in word input
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
CN107608963A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 A kind of Chinese error correction based on mutual information, device, equipment and storage medium

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766538A (en) * 2018-11-21 2019-05-17 北京捷通华声科技股份有限公司 A kind of text error correction method, device, electronic equipment and storage medium
CN109766538B (en) * 2018-11-21 2023-12-15 北京捷通华声科技股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN109543022A (en) * 2018-12-17 2019-03-29 北京百度网讯科技有限公司 Text error correction method and device
US11080492B2 (en) 2018-12-17 2021-08-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for correcting error in text
CN112329446B (en) * 2019-07-17 2023-05-23 北方工业大学 Chinese spelling checking method
CN112256953A (en) * 2019-07-22 2021-01-22 腾讯科技(深圳)有限公司 Query rewriting method and device, computer equipment and storage medium
CN112256953B (en) * 2019-07-22 2023-11-14 腾讯科技(深圳)有限公司 Query rewrite method, query rewrite apparatus, computer device, and storage medium
CN110457688B (en) * 2019-07-23 2023-11-24 广州视源电子科技股份有限公司 Error correction processing method and device, storage medium and processor
CN110457688A (en) * 2019-07-23 2019-11-15 广州视源电子科技股份有限公司 Correction processing method and device, storage medium and processor
CN110472243A (en) * 2019-08-08 2019-11-19 河南大学 A kind of Chinese spell checking methods
CN110807319A (en) * 2019-10-31 2020-02-18 北京奇艺世纪科技有限公司 Text content detection method and device, electronic equipment and storage medium
CN110807319B (en) * 2019-10-31 2023-07-25 北京奇艺世纪科技有限公司 Text content detection method, detection device, electronic equipment and storage medium
CN110852074A (en) * 2019-11-07 2020-02-28 三角兽(北京)科技有限公司 Method and device for generating correction statement, storage medium and electronic equipment
CN111597908A (en) * 2020-04-22 2020-08-28 深圳中兴网信科技有限公司 Test paper correcting method and test paper correcting device
CN111626049A (en) * 2020-05-27 2020-09-04 腾讯科技(深圳)有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
CN111626049B (en) * 2020-05-27 2022-12-16 深圳市雅阅科技有限公司 Title correction method and device for multimedia information, electronic equipment and storage medium
WO2022105180A1 (en) * 2020-11-19 2022-05-27 平安科技(深圳)有限公司 Chinese spelling error correction method and apparatus, computer device and storage medium

Similar Documents

Publication Publication Date Title
CN108491392A (en) Modification method, system, computer equipment and the storage medium of word misspelling
CN108563632A (en) Modification method, system, computer equipment and the storage medium of word misspelling
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN106202153A (en) The spelling error correction method of a kind of ES search engine and system
CN108519973A (en) Detection method, system, computer equipment and the storage medium of word spelling
CN103823794A (en) Automatic question setting method about query type short answer question of English reading comprehension test
CN108563634A (en) Recognition methods, system, computer equipment and the storage medium of word misspelling
CN106991085A (en) The abbreviation generation method and device of a kind of entity
Quattrini Li et al. Polispell: an adaptive spellchecker and predictor for people with dyslexia
Lee et al. Linguistic rules based Chinese error detection for second language learning
CN104239289A (en) Syllabication method and syllabication device
Madi et al. A proposed Arabic grammatical error detection tool based on deep learning
CN114925170B (en) Text proofreading model training method and device and computing equipment
CN109086274A (en) English social media short text time expression recognition method based on restricted model
Tan et al. Spelling error correction with BERT based on character-phonetic
CN110147546A (en) A kind of syntactic correction method and device of Oral English Practice
Čibej et al. Normalisation, tokenisation and sentence segmentation of Slovene tweets
Madanagopal et al. Reinforced sequence training based subjective bias correction
CN115310433A (en) Data enhancement method for Chinese text proofreading
Sodhar et al. Exploration of Sindhi Corpus Through Statistical Analysis on the Basis of Reality
Zheng et al. Why press backspace? Understanding user input behaviors in Chinese Pinyin input method
Li et al. Data augmentation of incorporating real error patterns and linguistic knowledge for grammatical error correction
Sun et al. Mining sequential patterns and tree patterns to detect erroneous sentences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180921