CN108563632A - Method, system, computer device and storage medium for correcting misspelled characters - Google Patents
- Publication number
- CN108563632A (application CN201810271932.8A)
- Authority
- CN
- China
- Prior art keywords
- word
- misspelling
- sentence
- probability
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
Abstract
The present invention relates to a method, system, computer device and storage medium for correcting misspelled characters. The method includes: obtaining the misspelled character at each position of a sentence to be corrected, and selecting confusion characters from the confusion set of that misspelled character to form the candidate set for the corresponding position, where a confusion set is a set of multiple characters with similar spellings; taking the Cartesian product of the candidate sets at the respective positions to obtain multiple candidate sentences; inputting each candidate sentence into a pre-trained spelling-error correction model, which evaluates it and calculates its probability operation value; and selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected. The technical solution of the present invention corrects spelling errors in text input accurately and efficiently.
Description
Technical field
The present invention relates to the technical field of computer software, and in particular to a method, system, computer device and storage medium for correcting misspelled characters.
Background
With the continuous development of computer software technology, techniques for retrieving, extracting and translating text information have gradually matured; however, there is still no accurate and efficient method for proofreading text.
Correcting erroneous characters is the core step of text proofreading, because such errors seriously degrade the quality of a text. For example, press releases have very strict requirements regarding erroneous characters: if the errors in a manuscript are not corrected in time, incorrect information may be conveyed to readers. Correcting erroneous characters in text is therefore of great significance.
Traditional methods for correcting input errors are mainly statistics-based. Such methods build a statistical language model from contextual features such as characters and words, and depend on that model. When the statistical language model is built, data sparsity can seriously impair the efficiency and accuracy of correction, making it difficult to correct spelling errors in text input accurately and efficiently.
Summary of the invention
In view of the above problem that spelling errors in text input are difficult to correct accurately and efficiently, it is necessary to provide a method, system, computer device and storage medium for correcting misspelled characters.
A method for correcting misspelled characters includes the following steps:
obtaining the misspelled character at each position of a sentence to be corrected, and selecting confusion characters from the confusion set of that misspelled character to form the candidate set for the corresponding position, where a confusion set is a set of multiple characters with similar spellings;
taking the Cartesian product of the candidate sets at the respective positions to obtain multiple candidate sentences;
inputting each candidate sentence into a pre-trained spelling-error correction model, which evaluates it and calculates its probability operation value; and
selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected.
In the above method, confusion characters are selected from the confusion set of the misspelled character at each position of the sentence to be corrected to form the candidate set for that position; the Cartesian product of the candidate sets is then taken to obtain multiple candidate sentences, which are input into the pre-trained spelling-error correction model to calculate their probability operation values; finally, a candidate sentence is selected according to those values to correct the sentence. This technical solution corrects spelling errors in text input accurately and efficiently.
In one embodiment, the step of selecting confusion characters from the confusion set of the misspelled character to form the candidate set for the corresponding position includes:
selecting, from the confusion set of the misspelled character, the K confusion characters with the highest probability of occurrence to form the candidate set for the corresponding position, where K ≥ 2 and the probability of occurrence is the probability that each candidate character in the confusion set of the misspelled character occurs at the current position;
and the step of selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected includes: replacing the sentence to be corrected with the candidate sentence that has the largest probability operation value.
In one embodiment, the method further includes:
using the spelling-error correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set; and identifying the misspelled characters in the sentence to be corrected according to those probabilities of occurrence.
In one embodiment, the step of using the pre-trained spelling-error correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set includes:
inputting a character of the sentence to be corrected into the spelling-error correction model to obtain the probability vector over all characters at the next position, and reading the probability of occurrence of the next character from that vector; and
obtaining the confusion set of the character, and using the spelling-error correction model to detect the probability of occurrence, at the current position, of each candidate character in that confusion set.
In one embodiment, the step of identifying the misspelled characters in the sentence to be corrected according to the probabilities of occurrence includes:
if the probability of occurrence of the current character is greater than a first probability threshold, determining that the character is not misspelled; and
if the probability of occurrence of the current character is less than the first probability threshold and greater than a second probability threshold, determining that the character is not misspelled when its probability of occurrence is the largest in its confusion set, and that it is misspelled otherwise.
In one embodiment, the step of inputting each candidate sentence into the pre-trained spelling-error correction model to evaluate it and calculate its probability operation value includes:
inputting each candidate sentence into the pre-trained spelling-error correction model to detect the probability of occurrence of the character at each position; and
adding or multiplying the probabilities of occurrence of the characters at all positions to obtain the probability operation value of the candidate sentence.
In one embodiment, the method further includes:
establishing a training model for spelling-error detection using natural-language corpus data;
preprocessing the corpus data to obtain training corpus sentences; and
training the training model with the training corpus sentences to obtain the spelling-error detection model.
In one embodiment, the step of preprocessing the corpus data to obtain training corpus sentences includes:
deleting redundant content from the corpus data used for the training model, and replacing non-character data with letters; and
splitting the sentences in the corpus data into units of single characters and the substituted letters, and adding a sentence-start marker and a sentence-end marker at the beginning and end of each sentence to generate the training corpus sentences.
In one embodiment, a training model for unidirectional spelling-error detection is established based on recurrent neural network technology, and the training model is trained on forward-input training corpus sentences to obtain a unidirectional spelling-error detection model.
In one embodiment, a training model for bidirectional spelling-error detection is established based on a long short-term memory neural network and natural-language corpus data, and the training model is trained on forward-input and backward-input training corpus sentences to obtain a bidirectional spelling-error detection model.
In one embodiment, the confusion set is stored in a file as key-value pairs, where the key is the pinyin of a Chinese character and the value is the set of characters pronounced with that pinyin.
A system for correcting misspelled characters includes:
a selection module, configured to obtain the misspelled character at each position of a sentence to be corrected and to select confusion characters from the confusion set of that misspelled character to form the candidate set for the corresponding position, where a confusion set is a set of multiple characters with similar spellings;
a product module, configured to take the Cartesian product of the candidate sets at the respective positions to obtain multiple candidate sentences;
a calculation module, configured to input each candidate sentence into a pre-trained spelling-error correction model, which evaluates it and calculates its probability operation value; and
a correction module, configured to select a candidate sentence according to the probability operation values to correct the sentence to be corrected.
In the above system, confusion characters are selected from the confusion set of the misspelled character at each position of the sentence to be corrected to form the candidate set for that position; the Cartesian product of the candidate sets is then taken to obtain multiple candidate sentences, which are input into the pre-trained spelling-error correction model to calculate their probability operation values; finally, a candidate sentence is selected according to those values to correct the sentence. This technical solution corrects spelling errors in text input accurately and efficiently.
A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the above method for correcting misspelled characters is implemented.
Through the computer program running on its processor, the above computer device corrects spelling errors in text input accurately and efficiently.
A computer storage medium stores a computer program which, when executed by a processor, implements the above method for correcting misspelled characters.
Through the computer program it stores, the above computer storage medium corrects spelling errors in text input accurately and efficiently.
Brief description of the drawings
Fig. 1 is a flow chart of a method for correcting misspelled characters according to one embodiment;
Fig. 2 is a flow chart of a method for correcting misspelled characters according to another embodiment;
Fig. 3 is a flow chart of training the spelling-error detection model according to one embodiment;
Fig. 4 is a schematic diagram of the unidirectional training model;
Fig. 5 is a schematic diagram of the prediction result of the unidirectional training model;
Fig. 6 is a schematic diagram of the bidirectional training model;
Fig. 7 is a schematic diagram of the prediction result of the bidirectional training model;
Fig. 8 is a flow chart of calculating the probability operation value;
Fig. 9 is a schematic structural diagram of a system for correcting misspelled characters according to one embodiment;
Fig. 10 is a schematic structural diagram of a system for correcting misspelled characters according to another embodiment;
Fig. 11 is a schematic diagram of the internal structure of a computer device in one embodiment.
Detailed description
To facilitate understanding of the present invention, it is described more fully below with reference to the accompanying drawings, which show preferred embodiments of the invention. The invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure will be more thorough and complete.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which the invention belongs. The terms used in the description are only for the purpose of describing specific embodiments and are not intended to limit the invention.
The technical solutions provided by the embodiments of the present invention can be applied to terminal devices such as personal computers, smartphones, tablet computers and personal digital assistants. A text input program can run on the terminal device to enter text content, and when characters are misspelled, the text content is corrected by the correction scheme provided by the embodiments of the present invention.
Referring to Fig. 1, which is a flow chart of a method for correcting misspelled characters according to one embodiment, the method includes the following steps:
S20: obtain the misspelled character at each position of a sentence to be corrected, and select confusion characters from the confusion set of that misspelled character to form the candidate set for the corresponding position, where a confusion set is a set of multiple characters with similar spellings.
S30: take the Cartesian product of the candidate sets at the respective positions to obtain multiple candidate sentences.
S40: input each candidate sentence into the pre-trained spelling-error correction model, which evaluates it and calculates its probability operation value.
The spelling-error correction model may be obtained in advance by training on a large number of character samples. The confusion set stores, for each character, the candidate characters it is easily confused with in spelling. The correction model can detect the probability of occurrence of each character of the sentence to be corrected at its current position and, at the same time, the probability of occurrence at that position of each candidate character in the character's confusion set.
S50: select a candidate sentence according to the probability operation values to correct the sentence to be corrected.
In the technical solution of the above embodiment, confusion characters are selected from the confusion set of the misspelled character at each position of the sentence to be corrected to form the candidate set for that position; the Cartesian product of the candidate sets is then taken to obtain multiple candidate sentences, which are input into the pre-trained spelling-error correction model to calculate their probability operation values; finally, a candidate sentence is selected according to those values to correct the sentence. This technical solution corrects spelling errors in text input accurately and efficiently.
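The Cartesian-product step S30 can be illustrated with a minimal Python sketch. The function name `candidate_sentences`, the sample sentence and the candidate sets are assumptions chosen for illustration only; positions without a candidate set simply keep their original character.

```python
from itertools import product


def candidate_sentences(sentence, candidates_by_position):
    """Build every candidate sentence from per-position candidate sets.

    sentence: the sentence to be corrected, as a string of characters.
    candidates_by_position: dict mapping a position index to the list of
        candidate characters for that (misspelled) position.
    """
    # One option list per position: the candidate set if the position was
    # flagged as misspelled, otherwise just the original character.
    options = [candidates_by_position.get(i, [ch]) for i, ch in enumerate(sentence)]
    # itertools.product enumerates one choice per position (Cartesian product).
    return ["".join(combo) for combo in product(*options)]


# Illustrative input: three flagged positions with K = 2 candidates each.
sentences = candidate_sentences(
    "中化人民共合国",
    {1: ["化", "华"], 4: ["共", "工"], 5: ["合", "和"]},
)
```

With two candidates at each of three positions, the sketch yields 2 × 2 × 2 = 8 candidate sentences, each of which would then be scored in step S40.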
In one embodiment, the step in S20 of selecting confusion characters from the confusion set of the misspelled character to form the candidate set for the corresponding position includes:
selecting, from the confusion set of the misspelled character, the K confusion characters with the highest probability of occurrence to form the candidate set for the corresponding position, where K ≥ 2 and the probability of occurrence is the probability that each candidate character in the confusion set of the misspelled character occurs at the current position.
Correspondingly, the method in step S50 of selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected may include: replacing the sentence to be corrected with the candidate sentence that has the largest probability operation value.
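As a rough sketch of this embodiment, the top-K selection and the final choice of the highest-scoring candidate sentence might look as follows in Python. The function names and all probability values are hypothetical; in the described method the probabilities would come from the spelling-error correction model.

```python
def top_k_candidates(confusion_probs, k=2):
    """Return the k characters of a confusion set with the highest
    probability of occurrence at the current position."""
    if k < 2:
        raise ValueError("K must be at least 2")
    # Sort characters by probability, largest first, and keep the first k.
    ranked = sorted(confusion_probs, key=confusion_probs.get, reverse=True)
    return ranked[:k]


def select_correction(sentence_scores):
    """Return the candidate sentence with the largest probability operation value."""
    return max(sentence_scores, key=sentence_scores.get)


# Hypothetical model outputs for the confusion set of one misspelled position.
hua_probs = {"化": 0.03, "华": 0.61, "花": 0.02, "话": 0.001}
candidates = top_k_candidates(hua_probs, k=2)

# Hypothetical probability operation values for two candidate sentences.
best = select_correction({"中化人民共合国": 0.1, "中华人民共和国": 0.9})
```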
In one embodiment, the process in step S20 of obtaining the misspelled character at each position of the sentence to be corrected may include:
using the spelling-error correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set; and identifying the misspelled characters in the sentence to be corrected according to those probabilities of occurrence.
In the above embodiment, the spelling-error correction model may be obtained in advance by training on a large number of character samples; that is, it is the spelling-error correction model referred to in step S40. This model can be used to detect the probability of occurrence of each character and each candidate character at the current position.
In one embodiment, referring to Fig. 2, which is a flow chart of a method for correcting misspelled characters according to another embodiment, the method of the embodiment of the present invention may further include:
S10: train the spelling-error detection model.
Referring to Fig. 3, which is a flow chart of training the spelling-error detection model according to one embodiment, step S10 mainly includes:
S101: establish a training model for spelling-error detection using natural-language corpus data.
S102: preprocess the corpus data to obtain training corpus sentences.
Further, the corpus data may be preprocessed by:
deleting redundant content from the corpus data used for the training model, replacing non-character data with letters, splitting the sentences in the corpus data into units of single characters and the substituted letters, adding sentence-start and sentence-end markers at the beginning and end of each sentence, and so on.
S103: train the training model with the training corpus sentences to obtain the spelling-error detection model.
In the above embodiment, preprocessing generates training corpus sentences suitable for model training. Data cleaning deletes useless symbols from the corpus data, as well as sentences that contain uncommon Chinese characters, repeated sentences, and sentences containing fewer than 2 Chinese characters.
For example, each continuous run of Arabic numerals, English words or English abbreviations is uniformly replaced with a letter: a continuous run of digits can be replaced with the capital letter N, and a continuous run of English letters with the capital letter C. Which letter is used can be changed and configured as needed. A comparison of text before and after replacement is as follows:
Before replacement | After replacement |
On April 5th, 2017 | The N N months No. N |
ABC secondary industry garden | C secondary industry garden |
After the replacement, a sentence-start marker and a sentence-end marker can also be added at the beginning and end of each sentence. For example, the marker "<s>" can be added at the beginning of the sentence and "</s>" at the end; the sentences in the corpus data are then split into units of single characters and substituted letters, generating a corpus data package that can be used for model training. Part of the data in the generated corpus data package is as follows:
Preprocessing the natural-language corpus data into training corpus sentences allows the training model to be trained in a targeted way, which improves training efficiency and thereby the accuracy of the probabilities output by the spelling-error detection model.
For the model training of step S103, the embodiments of the present invention provide several language models; a unidirectional language model and a bidirectional language model are described below as examples.
In combination with the preceding embodiments, model parameters such as the number of layers of the neural network and the number of neurons can be configured according to the required detection accuracy and actual needs. For example, a two-layer RNN can be established with dropout regularization added between the layers; the input layer uses 4000 neurons, corresponding to 4000 commonly used Chinese characters, the hidden layer uses 400 neurons, and the output layer uses the softmax classification function, whose output values are the predicted probabilities of occurrence of each character.
When the training model is trained with the training sentences in the corpus data package, the model takes each training sentence in turn and, beginning from the sentence-start marker, reads the characters of the sentence one by one; from the information of the characters at the positions preceding the current position, it predicts the character most likely to occur at the current position. After training and tuning, the training model produces the expected output.
As one implementation, refer to Fig. 4 and Fig. 5: Fig. 4 is a schematic diagram of the unidirectional training model, and Fig. 5 is a schematic diagram of its prediction result. A training model for unidirectional spelling-error detection can be established based on recurrent neural network (RNN) technology, and trained on forward-input training corpus sentences to obtain a unidirectional spelling-error detection model.
For example, the input to the training model is "<s>中华人民共和国" ("People's Republic of China") and the expected output is "中华人民共和国</s>"; that is, for the training sentence "中华人民共和国", the corresponding prediction result should be as shown in Fig. 5. The trained model that produces the expected output is used as the spelling-error detection model: it examines the characters of a sentence to be checked and outputs the probability of occurrence of each character at its current position.
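The relationship between the forward input and the expected output can be shown with a short Python sketch. The helper name `forward_pairs` is an assumption; it only builds the aligned input/target pairs and does not train any network.

```python
def forward_pairs(chars):
    """Pair each input token with the character the model should predict next.

    chars: the characters of one training sentence, without markers.
    """
    inputs = ["<s>"] + chars      # what the model reads, step by step
    targets = chars + ["</s>"]    # what it is expected to output at each step
    return list(zip(inputs, targets))


pairs = forward_pairs(list("中华人民共和国"))
```

Each pair reads as "after seeing this token (and everything before it), predict that token", which matches the shifted input and output strings in the example above.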
Here, the preceding context is the information of the characters at the positions before the current position in the sentence to be checked. Combining this preceding context increases the accuracy of the detected probability of occurrence of the character at the current position.
As another implementation, refer to Fig. 6 and Fig. 7: Fig. 6 is a schematic diagram of the bidirectional training model, and Fig. 7 is a schematic diagram of its prediction result. A training model for bidirectional spelling-error detection can be established based on a bidirectional long short-term memory neural network (Bi-LSTM) and natural-language corpus data, and trained on forward-input and backward-input training corpus sentences to obtain a bidirectional spelling-error detection model.
The input of the training model is divided into two kinds, a forward input and a backward input. To keep the predictions of the two directions consistent for the sentence "中华人民共和国" ("People's Republic of China"), the forward input is "<s>中华人民共和" and the backward input is "华人民共和国</s>", while the expected output of both is "中华人民共和国". That is, for the sentence "中华人民共和国", the corresponding prediction result should be as shown in Fig. 7. For example, the prediction of the character "中" is determined jointly by "<s>" and "华人民共和国</s>", which makes full use of the contextual information and improves efficiency.
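The left and right contexts that jointly determine each prediction in the bidirectional case can be sketched as follows. The function name `bidirectional_contexts` is hypothetical, and the sketch only constructs the contexts; the Bi-LSTM itself is not modeled.

```python
def bidirectional_contexts(chars):
    """For each character, return (left context, target, right context).

    The left context starts at the sentence-start marker and the right
    context ends at the sentence-end marker.
    """
    padded = ["<s>"] + chars + ["</s>"]
    contexts = []
    # Skip the two markers: only real characters are prediction targets.
    for i in range(1, len(padded) - 1):
        contexts.append((padded[:i], padded[i], padded[i + 1:]))
    return contexts


contexts = bidirectional_contexts(list("中华人民共和国"))
left, target, right = contexts[0]
```

For the first character, the left context is just the start marker and the right context is the rest of the sentence plus the end marker, as in the example in the text.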
Once the model has been trained, it can be used to spell-check newly input sentences. In the RNN model, for example, each step outputs through softmax the probability vector over all characters at the next position, and the probability of occurrence of the next character is read from that vector.
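A minimal Python sketch of this lookup is given below. The tiny vocabulary and the raw scores (logits) are invented for illustration; in the described method the vocabulary would cover the common Chinese characters and the scores would come from the network's output layer.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability vector that sums to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_char_probability(logits, vocab, char):
    """Read the probability of `char` from the softmax probability vector."""
    probs = softmax(logits)
    return probs[vocab.index(char)]


# Hypothetical three-character vocabulary and output scores for one step.
vocab = ["华", "化", "花"]
p_hua = next_char_probability([4.0, 0.5, 1.0], vocab, "华")
```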
In the preceding embodiments, using the spelling-error correction model to detect the probability of occurrence, at the current position, of each character in the sentence to be corrected and of each candidate character in its confusion set may include:
(1) inputting a character of the sentence to be corrected into the spelling-error correction model to obtain the probability vector over all characters at the next position, and reading the probability of occurrence of the next character from that vector.
For example, using the above unidirectional spelling-error detection model to check the sentence "中化人民共和国", the following probabilities of occurrence are obtained:
Character | Probability of occurrence |
中 | 0.0267950482666 |
化 | 5.48984644411e-07 |
人 | 0.214276000857 |
民 | 0.0538657493889 |
共 | 0.0275610154495 |
和 | 0.038463984794 |
国 | 0.042061101339 |
In this sentence, "华" has been mistakenly written as "化". The probability of "化" at its position is 5.48984644411e-07, far smaller than the probabilities of the correctly spelled characters, so this value can be used to detect the misspelled character.
As another example, using the above bidirectional spelling-error detection model to check "中化人民共和国", the following probabilities are obtained:
Character | Probability of occurrence |
中 | 0.0108770169318 |
化 | 1.73152820935e-05 |
人 | 0.919607996941 |
民 | 0.365396946669 |
共 | 0.999733150005 |
和 | 0.854933917522 |
国 | 0.988406062126 |
Here "华" has been mistakenly written as "化". The probability of "化" is 1.73152820935e-05, far smaller than the probabilities of the correctly spelled characters, so it can be used to detect the misspelled character.
(2) obtaining the confusion set of the character, and using the spelling-error correction model to detect the probability of occurrence, at the current position, of each candidate character in that confusion set.
In addition, the embodiments of the present invention provide a scheme for handling confusion sets. For example, the confusion sets can be stored in a file as key-value pairs, with the pronunciation of a Chinese character as the key and the characters having that pronunciation as the value; the multiple characters with the same pronunciation under one key form an easily confused subset. In other words, the key is the pinyin of a Chinese character and the value is the set of characters pronounced with that pinyin.
Furthermore, polyphonic characters can be taken into account: a polyphonic character is stored in the easily confused subsets of all of its pronunciations. For example, for the character "会" both pronunciations "hui" and "kuai" are considered, whereas for the character "为" only "wei" is considered, ignoring the difference between the second and fourth tones.
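The key-value storage and the handling of polyphonic characters can be sketched in Python as below. The function names, the small `readings` table and the use of JSON as the file format are all assumptions made for illustration; the source only states that the confusion sets are stored in a file as key-value pairs keyed by toneless pinyin.

```python
import json
from collections import defaultdict


def build_confusion_sets(readings):
    """Map each toneless pinyin (key) to the set of characters read that way.

    readings: dict mapping a character to its list of toneless pinyin
        readings; a polyphonic character has more than one entry.
    """
    sets = defaultdict(set)
    for ch, pinyins in readings.items():
        for pinyin in pinyins:
            # A polyphonic character is added under every one of its readings.
            sets[pinyin].add(ch)
    return dict(sets)


def confusion_set(ch, readings, sets):
    """Union of the easily confused subsets for all readings of `ch`."""
    result = set()
    for pinyin in readings.get(ch, []):
        result |= sets.get(pinyin, set())
    return result


def to_json(sets):
    """Serialize the key-value pairs; sorted lists keep the output stable."""
    return json.dumps({k: sorted(v) for k, v in sets.items()}, ensure_ascii=False)


# Hypothetical pronunciation table (tones deliberately ignored).
readings = {"会": ["hui", "kuai"], "汇": ["hui"], "快": ["kuai"],
            "为": ["wei"], "位": ["wei"]}
sets = build_confusion_sets(readings)
hui_set = confusion_set("会", readings, sets)
```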
The scheme of the above embodiment determines the probability of occurrence of each character in the sentence to be corrected, as well as the probability of occurrence at the current position of each candidate character in its confusion set, which consists of multiple characters with similar spellings.
Further, the method in the preceding embodiment of identifying the misspelled characters in the sentence to be corrected according to the probabilities of occurrence may be as follows:
If the probability of occurrence of the current character is greater than the first probability threshold, the character is judged not to be misspelled. If the probability of occurrence of the current character is less than the first probability threshold and greater than the second probability threshold, the character is judged not to be misspelled when its probability of occurrence is the largest in its confusion set, and misspelled otherwise.
In this method, after the spelling-error correction model has been trained, it is used to spell-check a newly input sentence to be corrected, for example "中化人民共合国"; the probabilities obtained are as follows:
Character | Probability of occurrence |
中 | 0.00217751576565 |
化 | 8.42674562591e-05 |
人 | 0.701624631882 |
民 | 0.118908688426 |
共 | 0.000807654316 |
合 | 3.34586762545e-05 |
国 | 0.0664190202951 |
Assume the first probability threshold is set to 0.1 and the second probability threshold to 0.0003. When the probability output by the spelling-error correction model for a character is greater than 0.1, the character is considered not misspelled. If the probability is less than 0.1 and greater than 0.0003, it is further checked whether the character's probability of occurrence is the largest in its confusion set; if so, the character is also judged not to be misspelled. Otherwise, the character is judged to be misspelled.
In the above table, the probabilities of "人" and "民" are both greater than the first probability threshold, so these two characters are considered not misspelled. The probability of "中" is less than the first probability threshold and greater than the second, but it is the largest in its confusion set, so this character is not misspelled. The probabilities of "化" and "合" are both less than the second probability threshold, so these two characters are misspelled. The probability of "共" lies between 0.0003 and 0.1 but is not the largest in its confusion set, so "共" is also considered misspelled.
The scheme of the above embodiment judges the probability of occurrence of the current word in combination with the word's probability of occurrence within its corresponding confusion word set, and can therefore identify more accurately whether the word is misspelled.
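The two-threshold rule described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `is_misspelled`, the dictionary layout of the confusion word set, and the handling of a probability exactly equal to a threshold are assumptions of this sketch, and the confusion-set probabilities used in the usage lines are invented for illustration.

```python
from typing import Mapping


def is_misspelled(word: str,
                  prob: float,
                  confusion_probs: Mapping[str, float],
                  first_threshold: float = 0.1,
                  second_threshold: float = 0.0003) -> bool:
    """Return True if `word` is judged misspelled at the current position.

    `confusion_probs` maps every word of the confusion word set (including
    `word` itself) to its probability of occurrence at this position.
    """
    # Above the first threshold: accepted outright.
    if prob > first_threshold:
        return False
    # Between the two thresholds: accepted only if it is the most probable
    # member of its confusion word set.
    if prob > second_threshold:
        most_probable = max(confusion_probs, key=confusion_probs.get)
        return most_probable != word
    # At or below the second threshold: misspelled (boundary handling is an
    # assumption; the description does not state it).
    return True


# Probabilities of the checked word come from the example table; the
# probabilities of the other confusion-set members are hypothetical.
print(is_misspelled("人", 0.701624631882, {"人": 0.701624631882}))
print(is_misspelled("共", 0.000807654316, {"共": 0.000807654316, "工": 0.002}))
```

Passing the thresholds as parameters keeps the 0.1 and 0.0003 values of the example as defaults while allowing other settings.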
In one embodiment, step S40, in which the candidate sentences are respectively input into the pre-trained spelling error correction model for detection and the probability operation value of each candidate sentence is calculated, may use the following method:
Each candidate sentence is input into the spelling error correction model to detect the probability of occurrence of the word at each position; the probabilities of occurrence of the words at all positions are then added together or multiplied together to obtain the probability operation value of the candidate sentence.
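The calculation of the probability operation value can be sketched as follows. The function name `sentence_score` and the `mode` parameter are assumptions of this sketch; the per-position probabilities in the usage lines are the seven values listed for the example sentence.

```python
import math
from typing import Sequence


def sentence_score(position_probs: Sequence[float], mode: str = "sum") -> float:
    """Combine per-position probabilities into one probability operation value."""
    if mode == "sum":
        # Addition of the probabilities at all positions.
        return math.fsum(position_probs)
    if mode == "product":
        # Multiplication of the probabilities at all positions.
        return math.prod(position_probs)
    raise ValueError(f"unknown mode: {mode!r}")


example_probs = [
    0.00217751576565, 8.42674562591e-05, 0.701624631882, 0.118908688426,
    0.000807654316, 3.34586762545e-05, 0.0664190202951,
]
print(sentence_score(example_probs, mode="sum"))
print(sentence_score(example_probs, mode="product"))
```

With multiplication, long sentences yield very small values; summing logarithms is a common alternative in practice, but it is not what the description states and is mentioned here only as a design note.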
Referring to Fig. 8, which is a flowchart of calculating the probability operation value: for example, the word "化" is pronounced "hua" in pinyin, and the two words with the largest probability of occurrence in its confusion word set are "化" and "华". Similarly, the candidate words for "共" (gong) are "共" and "工", and the candidate words for "合" (he) are "合" and "和". Taking the Cartesian product of the candidate words at these three positions yields the following candidate sentences:
Cartesian product | Probability operation value |
中化人民工合国 | 0.537909173025 |
中化人民工和国 | 1.02907576627 |
中化人民共合国 | 0.891207945057 |
中化人民共和国 | 4.13897197429 |
中华人民工合国 | 2.7150827029 |
中华人民工和国 | 3.08748058451 |
中华人民共合国 | 3.30365468572 |
中华人民共和国 | 6.82562798262 |
Here the first column is the result of the Cartesian product, and the second column is the probability operation value of each candidate sentence, calculated after the Cartesian product results are input into the spelling error correction model again. As described above, the value may be computed by adding the probabilities of the words in each candidate sentence, or by multiplying them (the values in the table exceed 1 and are therefore sums). The candidate sentence with the largest probability operation value is selected to replace the sentence to be corrected. In the table above, the sentence for the People's Republic of China ("中华人民共和国") has the largest probability operation value, 6.82562798262, so the sentence to be corrected is replaced with this correct sentence.
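The enumeration and selection step can be sketched with `itertools.product` from the Python standard library. In this sketch the trained model is replaced by a lookup of the eight probability operation values listed in the table; the Chinese character forms of the candidate sentences and their assignment to table rows are this sketch's reading of the example, and the function names are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence

# One candidate word set per position; positions judged correct hold one word.
candidate_sets = [["中"], ["化", "华"], ["人"], ["民"], ["工", "共"], ["合", "和"], ["国"]]

# Probability operation values from the table, standing in for the model.
table_scores = {
    "中化人民工合国": 0.537909173025,
    "中化人民工和国": 1.02907576627,
    "中化人民共合国": 0.891207945057,
    "中化人民共和国": 4.13897197429,
    "中华人民工合国": 2.7150827029,
    "中华人民工和国": 3.08748058451,
    "中华人民共合国": 3.30365468572,
    "中华人民共和国": 6.82562798262,
}


def enumerate_candidates(sets: Sequence[Sequence[str]]) -> List[str]:
    """Cartesian product of the per-position candidate word sets."""
    return ["".join(words) for words in product(*sets)]


def select_best(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Pick the candidate sentence with the largest probability operation value."""
    return max(candidates, key=score)


candidates = enumerate_candidates(candidate_sets)
best = select_best(candidates, table_scores.__getitem__)
print(len(candidates), best)  # prints "8 中华人民共和国"
```

The number of candidate sentences is the product of the sizes of the candidate word sets, so with K candidates at each of m misspelled positions it grows as K to the power m.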
In the scheme of the above embodiment, after the misspelled words have been identified, the k words with the largest probability of occurrence are selected from each corresponding confusion word set as candidate words. By taking the Cartesian product of the candidate word sets, the candidate sentence with the largest probability operation value can be selected from the multiple candidate sentences to replace the sentence to be corrected, so that misspellings in text input are corrected accurately and correction efficiency is improved.
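Selecting the k most probable words of a confusion word set can be sketched with `heapq.nlargest`. The function name `top_k_candidates` and the probabilities below are hypothetical; only the outcome for "hua" (the two candidates "华" and "化") follows the example in the description.

```python
import heapq
from typing import List, Mapping


def top_k_candidates(confusion_probs: Mapping[str, float], k: int = 2) -> List[str]:
    """Return the k words with the largest probability of occurrence."""
    return heapq.nlargest(k, confusion_probs, key=confusion_probs.get)


# Hypothetical probabilities for words pronounced "hua" at one position.
hua_probs = {"化": 8.42674562591e-05, "华": 0.01, "花": 2e-05, "话": 1e-05}
print(top_k_candidates(hua_probs, k=2))  # prints "['华', '化']"
```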
Referring to Fig. 9, which is a schematic structural diagram of the system for correcting word misspelling according to one embodiment, the system includes:
a selecting module 20, configured to obtain the misspelled word at each position of the sentence to be corrected, and to select confusion words from the confusion word set of the misspelled word to form the candidate word set of the corresponding position, wherein the confusion word set is a set of multiple words whose spelling is similar to that of the word;
a product module 30, configured to take the Cartesian product of the candidate word sets at the respective positions to obtain multiple candidate sentences;
a computing module 40, configured to input the candidate sentences respectively into the pre-trained spelling error correction model for detection and to calculate the probability operation value of each candidate sentence;
a correcting module 50, configured to select a candidate sentence according to the probability operation values to correct the sentence to be corrected.
The above system for correcting word misspelling obtains the misspelled word at each position of the sentence to be corrected and selects confusion words from the confusion word set to form the candidate word set of the corresponding position; it then takes the Cartesian product of the candidate word sets at the respective positions, inputs the resulting candidate sentences into the pre-trained spelling error correction model for detection, and calculates their probability operation values; finally, it selects a candidate sentence according to the probability operation values to correct the sentence to be corrected. This technical solution corrects misspellings in text input accurately and efficiently.
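The cooperation of modules 20 to 50 can be sketched as four small functions chained together. Everything named here is hypothetical: the function names, the confusion word sets keyed directly by word (rather than by pinyin), and the toy scorer that stands in for the trained spelling error correction model.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def selecting_module(sentence: str, wrong_positions: Sequence[int],
                     confusion_sets: Dict[str, List[str]]) -> List[List[str]]:
    """Module 20: build one candidate word set per position."""
    wrong = set(wrong_positions)
    return [confusion_sets.get(w, [w]) if i in wrong else [w]
            for i, w in enumerate(sentence)]


def product_module(candidate_sets: Sequence[Sequence[str]]) -> List[str]:
    """Module 30: Cartesian product of the candidate word sets."""
    return ["".join(words) for words in product(*candidate_sets)]


def computing_module(candidates: Sequence[str],
                     scorer: Callable[[str], float]) -> Dict[str, float]:
    """Module 40: probability operation value of every candidate sentence."""
    return {sentence: scorer(sentence) for sentence in candidates}


def correcting_module(scores: Dict[str, float]) -> str:
    """Module 50: choose the candidate with the largest value."""
    return max(scores, key=scores.get)


def correct(sentence: str, wrong_positions: Sequence[int],
            confusion_sets: Dict[str, List[str]],
            scorer: Callable[[str], float]) -> str:
    sets = selecting_module(sentence, wrong_positions, confusion_sets)
    return correcting_module(computing_module(product_module(sets), scorer))


# Toy scorer: counts positions that agree with a reference string.
toy_scorer = lambda s: sum(a == b for a, b in zip(s, "adc"))
print(correct("abc", [1], {"b": ["b", "d"]}, toy_scorer))  # prints "adc"
```

Passing the scorer as a callable keeps the pipeline independent of how the model is implemented.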
Further, referring to Fig. 10, which is a schematic structural diagram of the system for correcting word misspelling according to another embodiment, the system further includes a training module 10 for training the misspelling detection model. The training mainly includes: establishing a training model for misspelling detection using natural language corpus data; preprocessing the corpus data to obtain training corpus sentences; and training the training model with the training corpus sentences to obtain the misspelling detection model.
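The preprocessing step of the training module can be sketched as follows. The passage above does not fix the details, so this sketch makes several assumptions: whitespace is treated as the redundant content to delete, each run of digits is the non-text data and is replaced by the single letter "N", and the start and end tags are the strings "<s>" and "</s>".

```python
import re
from typing import List

START_TAG = "<s>"   # hypothetical sentence start tag
END_TAG = "</s>"    # hypothetical sentence end tag


def preprocess(sentence: str) -> List[str]:
    """Turn one raw corpus sentence into a training corpus sentence."""
    # Replace each run of digits with one placeholder letter (assumed choice).
    text = re.sub(r"\d+", "N", sentence)
    # Delete whitespace, treated here as redundant content (assumption).
    text = re.sub(r"\s+", "", text)
    # Split into single words/letters and add the start and end tags.
    return [START_TAG, *text, END_TAG]


print(preprocess("中华 2018 年"))
```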
In addition, an embodiment of the present invention also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above method for correcting word misspelling when executing the computer program.
Through the computer program running on the processor, the above computer device corrects misspellings in text input accurately and efficiently.
Furthermore, an embodiment of the present invention also provides a computer storage medium on which a computer program is stored, wherein the program implements the above method for correcting word misspelling when executed by a processor.
Through the computer program it stores, the above computer storage medium corrects misspellings in text input accurately and efficiently.
Referring to Fig. 11, which is a schematic diagram of the internal structure of the computer device in one embodiment, the computer device includes a processor, a non-volatile storage medium, an internal memory, a display screen, and a network interface connected by a system bus. The non-volatile storage medium of the computer device can store an operating system and a computer program for implementing a voice communication apparatus; when the computer program is executed, it can cause the processor to perform a voice communication method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The internal memory can store a computer program which, when executed by the processor, can cause the processor to perform the method for correcting word misspelling. The network interface of the computer device is used for network communication. The display screen is used to display application interfaces, for example an instant messaging chat interface or an operation interface for text correction. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, in which case the touch layer and the display screen together constitute a touch screen; it may also be a button, trackball, or trackpad provided on the housing of the computer device, or an external keyboard, trackpad, or mouse.
Those skilled in the art will understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the terminal to which the solution is applied; a specific terminal may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
The technical solution provided by the embodiments of the present invention combines language models based on RNN and Bi-LSTM neural networks with confusion word sets to automatically correct wrong words in a sentence, making full use of the contextual information of the sentence and improving the performance of spelling detection. It further takes the Cartesian product of the candidate words and selects the sentence with the largest probability operation value for the correction, so that deep learning is carried out autonomously and misspellings are corrected automatically.
The above technical solution can be applied to the detection of misspellings in various texts, for example checking for wrong words in student compositions and news manuscripts. For compositions, wrong words affect the quality of the composition; pointing them out is instructive for the student, and the presence or absence of wrong words can also serve as one dimension in scoring the composition. News releases have very strict requirements regarding wrong words; if the user has entered a wrong word, warning the author and offering the correctly spelled word can improve the author's writing efficiency.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification. Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes the steps described in the above methods. The storage medium is, for example, ROM/RAM, a magnetic disk, or an optical disc.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (14)
1. A method for correcting word misspelling, characterized by comprising the following steps:
obtaining the misspelled word at each position of a sentence to be corrected, and selecting confusion words from the confusion word set of the misspelled word to form a candidate word set for the corresponding position, wherein the confusion word set is a set of multiple words with similar spelling;
taking the Cartesian product of the candidate word sets at the respective positions to obtain multiple candidate sentences;
inputting the candidate sentences respectively into a pre-trained spelling error correction model for detection and calculating a probability operation value of each candidate sentence;
selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected.
2. The method for correcting word misspelling according to claim 1, characterized in that the step of selecting confusion words from the confusion word set of the misspelled word to form the candidate word set of the corresponding position comprises:
obtaining the K confusion words with the largest probability of occurrence in the confusion word set of the misspelled word to form the candidate word set of the corresponding position, wherein K >= 2 and the probability of occurrence is the probability of occurrence, at the current position, of each candidate word in the confusion word set corresponding to the misspelled word;
and the step of selecting a candidate sentence according to the probability operation values to correct the sentence to be corrected comprises:
replacing the sentence to be corrected with the candidate sentence having the largest probability operation value.
3. The method for correcting word misspelling according to claim 2, characterized by further comprising:
using the spelling error correction model to detect the probability of occurrence, at the current position, of each word in the sentence to be corrected and of each candidate word in its corresponding confusion word set; and identifying the misspelled words in the sentence to be corrected according to the probabilities of occurrence.
4. The method for correcting word misspelling according to claim 3, characterized in that the step of using the pre-trained spelling error correction model to detect the probability of occurrence, at the current position, of each word in the sentence to be corrected and of each candidate word in its corresponding confusion word set comprises:
inputting a word of the sentence to be corrected into the spelling error correction model for detection to obtain a probability vector over the words at the position following that word, and obtaining the probability of occurrence of the next word from the probability vector;
obtaining the confusion word set of the word, and using the spelling error correction model to detect the probability of occurrence, at the current position, of each candidate word in the confusion word set of the word.
5. The method for correcting word misspelling according to claim 3, characterized in that the step of identifying the misspelled words in the sentence to be corrected according to the probabilities of occurrence comprises:
if the probability of occurrence of the current word is greater than a first probability threshold, judging that the word is not misspelled;
if the probability of occurrence of the current word is less than the first probability threshold and greater than a second probability threshold, then judging that the word is not misspelled if its probability of occurrence is the largest in its corresponding confusion word set, and otherwise judging that the word is misspelled.
6. The method for correcting word misspelling according to claim 1, characterized in that the step of inputting the candidate sentences respectively into the pre-trained spelling error correction model for detection and calculating the probability operation value of each candidate sentence comprises:
inputting the candidate sentences respectively into the pre-trained spelling error correction model to detect the probability of occurrence of the word at each position;
adding together or multiplying together the probabilities of occurrence of the words at the respective positions to obtain the probability operation value of the candidate sentence.
7. The method for correcting word misspelling according to claim 1, characterized by further comprising:
establishing a training model for misspelling detection using natural language corpus data;
preprocessing the corpus data to obtain training corpus sentences;
training the training model with the training corpus sentences to obtain the misspelling detection model.
8. The method for correcting word misspelling according to claim 7, characterized in that the step of preprocessing the corpus data to obtain training corpus sentences comprises:
deleting redundant content from the corpus data of the training model, and replacing non-text data with letters;
splitting the sentences in the corpus data in units of words and the letters, and adding a sentence start tag and a sentence end tag at the beginning and end of each sentence, to generate the training corpus sentences.
9. The method for correcting word misspelling according to claim 8, characterized in that a training model for unidirectional misspelling detection is established based on recurrent neural network technology, and the training model is trained with forward-input training corpus sentences to obtain a unidirectional misspelling detection model.
10. The method for correcting word misspelling according to claim 7, characterized in that a training model for bidirectional misspelling detection is established based on a long short-term memory neural network and natural language corpus data, and the training model is trained with forward-input and backward-input training corpus sentences to obtain a bidirectional misspelling detection model.
11. The method for correcting word misspelling according to claim 1, characterized in that the confusion word set is stored in a file in key-value form, wherein the key is the pinyin of a Chinese character and the value is the set of words pronounced with that pinyin.
12. A system for correcting word misspelling, characterized by comprising:
a selecting module, configured to obtain the misspelled word at each position of a sentence to be corrected, and to select confusion words from the confusion word set of the misspelled word to form a candidate word set for the corresponding position, wherein the confusion word set is a set of multiple words with similar spelling;
a product module, configured to take the Cartesian product of the candidate word sets at the respective positions to obtain multiple candidate sentences;
a computing module, configured to input the candidate sentences respectively into a pre-trained spelling error correction model for detection and to calculate a probability operation value of each candidate sentence;
a correcting module, configured to select a candidate sentence according to the probability operation values to correct the sentence to be corrected.
13. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for correcting word misspelling according to any one of claims 1 to 11.
14. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for correcting word misspelling according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810271932.8A CN108563632A (en) | 2018-03-29 | 2018-03-29 | Modification method, system, computer equipment and the storage medium of word misspelling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108563632A true CN108563632A (en) | 2018-09-21 |
Family
ID=63533433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810271932.8A Pending CN108563632A (en) | 2018-03-29 | 2018-03-29 | Modification method, system, computer equipment and the storage medium of word misspelling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108563632A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180921 |