CN108563634A - Recognition methods, system, computer equipment and the storage medium of word misspelling - Google Patents
- Publication number: CN108563634A
- Application number: CN201810273968.XA
- Authority: CN (China)
- Prior art keywords: word, misspelling, probability, sentence, occurrence
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The present invention relates to a method, system, computer device and storage medium for recognizing misspelled characters (words). The method includes: obtaining an input sentence to be detected and extracting its characters; checking the characters of the sentence with a pre-trained misspelling detection model to obtain a first occurrence probability of each character at its current position in the sentence; and comparing the first occurrence probability of each character with a preset first probability threshold, a character whose first occurrence probability is less than the first probability threshold being determined to be a misspelling. By using a pre-trained misspelling detection model together with the occurrence probability of each character at its current position and a preset threshold, the technical solution detects the misspelled characters in the sentence to be detected and identifies misspellings accurately and efficiently.
Description
Technical field
The present invention relates to text processing technology, and in particular to a method, system, computer device and storage medium for recognizing misspelled characters.
Background
As text processing technology has developed, techniques for retrieving, extracting and translating text information have gradually matured, but there is still no accurate and efficient method for proofreading text.
Detecting misspellings is an important part of text proofreading, because wrongly written characters seriously affect the quality of a text. Press releases, for example, contain a very large number of characters and are subject to very strict requirements on spelling; only by finding the misspellings in a manuscript promptly and accurately can the spread of erroneous information be avoided.
Traditional methods for detecting input errors are mainly statistics-based. In implementing such schemes, however, the inventors found that they need a statistical language model built from features of the characters and words in the text, and that they depend on this model. While the statistical language model is being built, the sparsity of the statistical data seriously reduces the efficiency and precision of correction. In addition, detection cannot take the context into account, which also lowers its accuracy, so it is difficult to identify the misspellings in a text accurately and efficiently.
Summary of the Invention
In view of this, and to address the technical problem that misspellings in text are difficult to detect accurately and efficiently, a method, system, computer device and storage medium for recognizing misspelled characters are provided.
A method for recognizing misspelled characters includes the following steps:
obtaining an input sentence to be detected, and extracting the characters of the sentence to be detected;
checking the characters of the sentence to be detected with a pre-trained misspelling detection model to obtain a first occurrence probability of each character at its current position in the sentence;
comparing the first occurrence probability of each character with a preset first probability threshold, and determining a character whose first occurrence probability is less than the first probability threshold to be a misspelling.
In one embodiment, if the occurrence probability of the candidate character is less than or equal to a set second probability threshold, the candidate character is judged to be a misspelling.
In one embodiment, the method further includes: obtaining a candidate character whose first occurrence probability is less than or equal to the first probability threshold;
if the occurrence probability of the candidate character is greater than the set second probability threshold, obtaining a confusion subset of similar characters that are close to the candidate character;
using the misspelling detection model to detect a second occurrence probability, at the current position, of each similar character in the confusion subset;
if the first occurrence probability of the candidate character is less than the second occurrence probability of at least one similar character, judging the candidate character to be a misspelling.
In one embodiment, the method further includes: establishing a training model for misspelling detection using natural-language corpus data and recurrent neural network technology;
preprocessing the corpus data of the training model to obtain a corpus data package, and training the training model with the corpus data package to obtain the misspelling detection model.
In one embodiment, the step of preprocessing the corpus data of the training model to obtain the corpus data package includes:
deleting redundant content from the corpus data of the training model, and replacing non-character data with letters;
splitting the sentences in the corpus data into units of characters and the replacement letters, and adding a sentence-start marker and a sentence-end marker at the beginning and end of each sentence, to generate the corpus data package.
In one embodiment, the step of obtaining the first occurrence probability of each character of the sentence to be detected at its current position includes:
obtaining the characters of the input sentence to be detected one by one;
according to the current position of the character, obtaining the preceding-context information of the sentence to be detected before the current position;
obtaining, from the preceding-context information, the first occurrence probability of the character at the current position.
In one embodiment, the method further includes: outputting the position of the identified misspelling, and prompting the user with the position of the identified misspelling.
A system for recognizing misspelled characters includes:
a text acquisition module, configured to obtain an input sentence to be detected and extract the characters of the sentence to be detected;
a probability detection module, configured to check the characters of the sentence to be detected with a pre-trained misspelling detection model and obtain a first occurrence probability of each character at its current position in the sentence;
an error detection module, configured to compare the first occurrence probability of each character with a preset first probability threshold and determine a character whose first occurrence probability is less than the first probability threshold to be a misspelling.
The above method and system for recognizing misspelled characters use a pre-trained misspelling detection model, together with the occurrence probability of each character at its current position in the sentence to be detected and a preset threshold, to detect the misspelled characters in the sentence, and thereby identify misspellings accurately and efficiently.
A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the method for recognizing misspelled characters described above.
Through the computer program running on the processor, the above computer device detects the misspelled characters in the sentence to be detected according to the occurrence probability of each character at its current position and a preset threshold, and thereby identifies misspellings accurately and efficiently.
A computer storage medium stores a computer program which, when executed by a processor, implements the method for recognizing misspelled characters described above.
Through the computer program it stores, the above computer storage medium detects the misspelled characters in the sentence to be detected according to the occurrence probability of each character at its current position and a preset threshold, and thereby identifies misspellings accurately and efficiently.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method for recognizing misspelled characters in one embodiment;
Fig. 2 is a schematic diagram of the unidirectional training model;
Fig. 3 is a schematic diagram of the prediction result of the unidirectional training model;
Fig. 4 is a schematic diagram of the bidirectional training model;
Fig. 5 is a schematic diagram of the prediction result of the bidirectional training model;
Fig. 6 is a flow chart of the method for recognizing misspelled characters in another embodiment;
Fig. 7 is a structural schematic diagram of the system for recognizing misspelled characters in one embodiment;
Fig. 8 is a flow chart of the method for recognizing misspelled characters in one application example;
Fig. 9 is a flow chart of the method for recognizing misspelled characters in another application example;
Fig. 10 is a schematic diagram of the internal structure of the computer device in one embodiment.
Detailed Description
To facilitate understanding of the present invention, it is described more fully below with reference to the accompanying drawings, which show preferred embodiments of the invention. The invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure will be thorough and complete.
Referring to Fig. 1, which shows a flow chart of the method for recognizing misspelled characters in one embodiment, the method mainly includes the following steps:
Step S10: obtain the input sentence to be detected, and extract the characters of the sentence to be detected.
In this step, the sentence to be detected that is input into the misspelling detection model is obtained, and each character in the sentence is extracted.
Step S20: check the characters of the sentence to be detected with a pre-trained misspelling detection model to obtain the first occurrence probability of each character at its current position in the sentence.
In this step, the pre-trained misspelling detection model checks each character of the sentence to be detected and outputs the probability that the character occurs at the position it occupies, i.e. the first occurrence probability of each character at its current position in the sentence to be detected.
Step S30: compare the first occurrence probability of each character with a preset first probability threshold, and determine a character whose first occurrence probability is less than the first probability threshold to be a misspelling.
In this step, the first occurrence probability of each character of the sentence to be detected is compared with the preset first probability threshold. If the first occurrence probability of a character is less than the preset first probability threshold, the character is judged to be a misspelling; otherwise, the character is judged to be correctly spelled.
The technical solution of the above embodiment uses a pre-trained misspelling detection model, together with the occurrence probability of each character at its current position in the sentence to be detected and a preset threshold, to detect the misspelled characters in the sentence, and thereby identifies misspellings accurately and efficiently.
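Steps S10 to S30 can be sketched as a short loop. This is a minimal illustration, not the patent's implementation: the `ProbabilityModel` callable interface and the function name `detect_misspellings` are assumptions introduced here, and any trained model that returns an occurrence probability for a character given its preceding characters could stand behind that interface.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface (an assumption, not defined by the patent):
# given the preceding characters and the character at the current position,
# return that character's occurrence probability.
ProbabilityModel = Callable[[Sequence[str], str], float]


def detect_misspellings(sentence: str, model: ProbabilityModel,
                        first_threshold: float) -> List[int]:
    """Return the positions whose first occurrence probability is below the threshold."""
    chars = list(sentence)                # step S10: extract the characters
    flagged = []
    for pos, ch in enumerate(chars):
        p1 = model(chars[:pos], ch)       # step S20: first occurrence probability
        if p1 < first_threshold:          # step S30: compare with the threshold
            flagged.append(pos)
    return flagged
```

A stand-in model that returns a fixed low probability for one character is enough to exercise the function; the threshold value is a tuning parameter that the text leaves open.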
In one embodiment, before step S20 checks the characters of the sentence to be detected with the pre-trained misspelling detection model, the method may further include the following steps:
(1) establishing a training model for misspelling detection using natural-language corpus data and recurrent neural network technology;
(2) preprocessing the corpus data of the training model to obtain a corpus data package, and training the training model with the corpus data package to obtain the misspelling detection model.
In the scheme of this embodiment, a training model for misspelling detection is established based on recurrent neural network technology and trained with the corpus data package, which yields an accurate and efficient misspelling detection model.
Further, the step of being pre-processed to obtain corpus data packet to the corpus data in training pattern can wrap
It includes:
Redundant content in corpus data in the training pattern is deleted, and by non-legible data word
Mother is replaced;The sentence in corpus data is split as unit of word and the letter, and in sentence-initial and knot
Tail adds sentence-initial label and sentence closing tag, generates corpus data packet.
Wherein it is possible to by data cleansing, the sentence deleting the useless symbol in corpus data, include non-Chinese characters in common use
Or Chinese character number is less than 2 sentence etc. in repeat statement or a word;For example, continuous a string of Arabic numerals, English word
Or the unification such as english abbreviation is replaced with letter, for example, can select to replace continuous string number with capital N, is used
Capital C replaces continuous a string of English alphabets, be specifically replaced with which kind of letter can be modified as needed and
Setting, for example, as follows with the replaced table of comparisons before replacing:
Before replacement | After replacement |
On April 5th, 2017 | The N N months No. N |
XXXX secondary industry garden | C secondary industry garden |
After the replacement, a sentence-start marker and a sentence-end marker can be added at the beginning and end of each sentence; for example, "<s>" can be added at the beginning of the sentence and "</s>" at the end. The sentences in the corpus data are then split into units of characters and replacement letters, generating a corpus data package that can be used for model training. Part of the data in the generated corpus data package is as follows:
(The excerpt is not reproduced in this text.)
Organizing the natural-language corpus data into a corpus data package allows the training model to be trained in a targeted way, which can improve the efficiency of model training and thereby the accuracy of the probabilities output by the misspelling detection model.
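The cleaning and splitting steps can be sketched as follows. This is an illustration only: the function names, the in-memory list used as the "package", and the use of the basic CJK Unified Ideographs block to count Chinese characters are all assumptions made here. Only two of the cleaning rules from the text are shown (dropping repeated sentences and sentences with fewer than 2 Chinese characters).

```python
import re
from typing import Iterable, List

# Assumption: a "Chinese character" is any code point in the basic
# CJK Unified Ideographs block.
HAN = re.compile(r"[\u4e00-\u9fff]")


def to_training_tokens(sentence: str) -> List[str]:
    """Split a sentence into single-character units and add the two markers."""
    return ["<s>"] + list(sentence) + ["</s>"]


def build_corpus_package(sentences: Iterable[str]) -> List[List[str]]:
    """Drop repeated sentences and sentences with fewer than 2 Chinese characters."""
    seen = set()
    package = []
    for sentence in sentences:
        if sentence in seen or len(HAN.findall(sentence)) < 2:
            continue
        seen.add(sentence)
        package.append(to_training_tokens(sentence))
    return package
```

The input sentences are expected to have had digit and letter runs replaced already, so each replacement letter is a single unit after splitting.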
In one embodiment, when the training model for misspelling detection is established in step (1) above from natural-language corpus data and recurrent neural network technology, model parameters such as the number of layers of the neural network and the number of neurons can be set according to the required detection precision and actual demand. For example, a two-layer neural network can be established with dropout regularization added between the layers; the input layer uses 4000 neurons, corresponding to 4000 commonly used Chinese characters, and the hidden layer uses 400 neurons. The output layer uses the softmax classification function, and its output values are the predicted occurrence probabilities of each character.
As one implementation, refer to Fig. 2 and Fig. 3: Fig. 2 is a schematic diagram of the unidirectional training model, and Fig. 3 is a schematic diagram of its prediction result. A training model for unidirectional misspelling detection can be established based on recurrent neural network (RNN, Recurrent Neural Networks) technology; the training model is trained on training corpus sentences fed in the forward direction, giving a unidirectional misspelling detection model.
For example, the input of the training model is "<s>天生我材必有用" and the desired output is "天生我材必有用</s>"; that is, for the training sentence "天生我材必有用" ("everybody can do something"), the corresponding prediction result should be as shown in Fig. 3. The trained model that produces the expected output is used as the misspelling detection model to check the characters of a sentence to be detected and to output the occurrence probability of each character at its current position.
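The relationship between the input and the desired output of the unidirectional model is a one-position shift, which can be sketched as below. The function name `unidirectional_pairs` is an assumption introduced for illustration; the markers "<s>" and "</s>" are the ones described in the text.

```python
from typing import List, Tuple


def unidirectional_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Pair each input unit with the unit the model should predict next."""
    chars = list(sentence)
    inputs = ["<s>"] + chars      # what the model reads at each step
    targets = chars + ["</s>"]    # what it is expected to output at that step
    return list(zip(inputs, targets))
```

Each pair reads "after seeing this unit (and everything before it), predict that unit", so the first target is the first character of the sentence and the last target is the sentence-end marker.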
Here, the preceding-context information refers to the information of each character at the positions before the current position in the sentence to be detected. Taking into account the preceding context at the positions before the current character increases the accuracy of the detected occurrence probability of the character at the current position.
As another implementation, refer to Fig. 4 and Fig. 5: Fig. 4 is a schematic diagram of the bidirectional training model, and Fig. 5 is a schematic diagram of its prediction result. A training model for bidirectional misspelling detection can be established based on a bidirectional long short-term memory neural network (Bi-LSTM) and natural-language corpus data; the training model is trained on training corpus sentences fed in the forward and backward directions, giving a bidirectional misspelling detection model.
The input of the training model is divided into two kinds, forward input and backward input. For the sentence "天生我材必有用" ("everybody can do something"), to keep the predictions of the two directions consistent, the forward input is "<s>天生我材必有" and the backward input is "生我材必有用</s>", while the desired output in both cases is "天生我材必有用". That is, for the sentence "天生我材必有用", the corresponding prediction result should be as shown in Fig. 5. For example, the prediction for the character "天" is determined jointly by "<s>" and "生我材必有用</s>", which makes full use of the context information and improves efficiency.
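The two contexts that jointly determine each prediction can be sketched as follows. This is an illustration of the data layout only, not of the Bi-LSTM itself; the function name `bidirectional_contexts` is an assumption.

```python
from typing import List, Tuple


def bidirectional_contexts(sentence: str) -> List[Tuple[str, List[str], List[str]]]:
    """For each character, return (character, forward context, backward context)."""
    chars = list(sentence)
    result = []
    for i, ch in enumerate(chars):
        forward = ["<s>"] + chars[:i]        # everything before position i
        backward = chars[i + 1:] + ["</s>"]  # everything after position i
        result.append((ch, forward, backward))
    return result
```

For the first character the forward context is just the sentence-start marker and the backward context is the rest of the sentence plus the sentence-end marker, which matches the example given for the first character in the text.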
Once the model has been trained, it can be used to spell-check newly input sentences. In the RNN model, for example, each step outputs through softmax a probability vector over all characters for the next position, and the occurrence probability of the next character is read from this vector.
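Reading one character's probability out of the softmax output can be sketched in plain Python. The names `softmax`, `occurrence_probability` and the `vocab` mapping from character to vector index are assumptions made for illustration; the raw scores would come from the network's output layer.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw output-layer scores into a probability vector."""
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def occurrence_probability(scores: List[float], vocab: Dict[str, int], ch: str) -> float:
    """Look up the probability assigned to `ch` for the next position."""
    return softmax(scores)[vocab[ch]]
```

With a vocabulary of 4000 commonly used characters, as in the example configuration, `scores` would have 4000 entries and `vocab` would map each character to its index.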
For example, checking the sentence "天升我材必有用" with the unidirectional misspelling detection model above gives the following occurrence probabilities:
天 | 0.0267950482666 |
升 | 5.48984644411e-07 |
我 | 0.214276000857 |
材 | 0.0538657493889 |
必 | 0.0275610154495 |
有 | 0.038463984794 |
用 | 0.042061101339 |
In this sentence "生" has been miswritten as "升". The probability of "升" at its current position is 5.48984644411e-07, far smaller than the occurrence probabilities of the correctly written characters, so it can be used to detect the misspelling.
As a further example, checking "天升我材必有用" with the bidirectional misspelling detection model above gives the following probabilities:
天 | 0.0108770169318 |
升 | 1.73152820935e-05 |
我 | 0.919607996941 |
材 | 0.365396946669 |
必 | 0.999733150005 |
有 | 0.854933917522 |
用 | 0.988406062126 |
Here too "生" has been miswritten as "升"; the probability of "升" is 1.73152820935e-05, far smaller than the probabilities of the correctly written characters, so it can be used to detect the misspelling.
The method for recognizing misspelled characters of the above embodiment uses a pre-trained misspelling detection model, together with the occurrence probability of each character at its current position in the sentence to be detected and a preset threshold, to detect the misspelled characters in the sentence, and thereby identifies misspellings accurately and efficiently.
In one embodiment, the method for recognizing misspelled characters further includes:
Step S40: if the occurrence probability of the candidate character is less than or equal to the set second probability threshold, the candidate character is judged to be a misspelling. In this embodiment the second probability threshold is smaller than the first probability threshold. Setting a second probability threshold screens further the suspected misspellings that fell below the first threshold, which further avoids misjudgment and improves the precision of detection.
Referring to Fig. 6, which shows a flow chart of the method for recognizing misspelled characters in another embodiment: in this embodiment, after a character whose first occurrence probability is less than the first probability threshold is obtained, that character is taken as a misspelling candidate and checked further. The further check mainly includes the following steps:
Step S401: obtain the candidate character whose first occurrence probability is less than or equal to the first probability threshold;
A character whose occurrence probability is less than the first probability threshold is obtained and, as a misspelling candidate, is checked further.
Step S402: if the occurrence probability of the candidate character is greater than the set second probability threshold, obtain a confusion subset of similar characters close to the candidate character. The confusion subset is the set of characters that are confusable with the candidate character: characters with a similar pronunciation, characters with a similar shape, or characters commonly confused with it.
For example, the subsets can be stored in a file as key-value pairs, with the pronunciation of a Chinese character as the key and the Chinese characters having that pronunciation as the value; the Chinese characters with the same pronunciation that correspond to one key form one confusion subset. Further, polyphonic characters can also be taken into account: a polyphonic character is stored in the confusion subsets of each of its pronunciations. For example, the character "会" is considered under both pronunciations "hui" and "kuai", whereas for the character "为" only "wei" is considered, ignoring the difference between the second and fourth tones.
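The key-value organisation by pronunciation can be sketched with a dictionary. This is a minimal sketch: the tiny `PRONUNCIATIONS` table is a hypothetical sample written for illustration (toneless pinyin), and the function names are assumptions. Only pronunciation-based confusion is shown; shape-based and commonly-confused pairs would need additional sources.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical sample data: character -> toneless pinyin readings.
# A polyphonic character lists every reading.
PRONUNCIATIONS: Dict[str, List[str]] = {
    "生": ["sheng"], "升": ["sheng"], "声": ["sheng"],
    "会": ["hui", "kuai"], "回": ["hui"], "快": ["kuai"],
    "为": ["wei"], "位": ["wei"],
}


def build_confusion_index(pron: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Key: pronunciation. Value: all characters that share that pronunciation."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for ch, readings in pron.items():
        for reading in readings:
            index[reading].add(ch)
    return index


def confusion_subset(ch: str, pron: Dict[str, List[str]],
                     index: Dict[str, Set[str]]) -> Set[str]:
    """Union of the subsets for every reading of `ch`, excluding `ch` itself."""
    similar: Set[str] = set()
    for reading in pron.get(ch, []):
        similar |= index[reading]
    similar.discard(ch)
    return similar
```

Because tones are dropped from the keys, a character read with two tones of the same syllable has a single entry, while a character with two different syllables appears under both keys.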
Step S403: use the misspelling detection model to detect the second occurrence probability, at the current position, of each similar character in the confusion subset;
After the confusion subset of the candidate character has been retrieved, the misspelling detection model detects, for each similar character in the subset, the probability that it occurs at the current position of the candidate character, i.e. the second occurrence probability. The number of similar characters checked can be set as required: all similar characters in the confusion subset can be checked, or only those with a high degree of similarity.
Step S404: if the first occurrence probability of the candidate character is less than the second occurrence probability of at least one similar character, judge the candidate character to be a misspelling;
The first occurrence probability of the candidate character is compared with the second occurrence probability of each similar character. If the first occurrence probability of the candidate character is less than the second occurrence probability of at least one similar character, the candidate character is judged to be a misspelling; if the first occurrence probability of the candidate character is greater than or equal to the second occurrence probability of every similar character, the candidate character is judged to be correctly spelled.
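The combined decision across the first threshold, the second threshold and the confusion subset can be sketched as one function. The function name `judge_candidate` and its signature are assumptions; the comparison operators follow the wording of steps S401 to S404 and S40, with the second threshold assumed to be smaller than the first.

```python
from typing import Iterable


def judge_candidate(p1: float, first_threshold: float, second_threshold: float,
                    second_probs: Iterable[float]) -> bool:
    """Return True if the character is judged to be a misspelling.

    p1            first occurrence probability of the character
    second_probs  second occurrence probabilities of its similar characters
    """
    if p1 > first_threshold:
        return False                  # not a candidate: judged correctly spelled
    if p1 <= second_threshold:
        return True                   # step S40: very unlikely, judged a misspelling
    # steps S403 and S404: misspelling if some similar character is more likely here
    return any(p1 < p2 for p2 in second_probs)
```

Passing the second probabilities as an iterable means a caller can supply a generator, so the model is queried for the confusion subset only when the first two checks do not already decide the outcome.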
The method for recognizing misspelled characters of the above embodiment takes characters whose first occurrence probability is less than the first probability threshold as misspelling candidates and, by retrieving the confusion subset, compares the occurrence probability of each candidate character with those of its similar characters, which further eliminates misjudged characters and improves the accuracy of misspelling recognition.
In one embodiment, the method for recognizing misspelled characters further includes: outputting the position of the identified misspelling; and/or prompting the user with the position of the identified misspelling.
Specifically, after a misspelling is identified, its position can be output, and the misspelling can additionally be marked, for example by underlining or highlighting, to indicate where it is. Outputting and/or indicating the position of a misspelling improves detection efficiency and makes the wrongly written character easier to correct.
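A plain-text form of such marking can be sketched as below. The function name `mark_positions` and the square-bracket markers are assumptions; a user interface would more likely use underlining or highlighting, as the text suggests.

```python
from typing import Iterable


def mark_positions(sentence: str, positions: Iterable[int],
                   left: str = "[", right: str = "]") -> str:
    """Wrap the character at each flagged position in the given markers."""
    flagged = set(positions)
    return "".join(
        f"{left}{ch}{right}" if i in flagged else ch
        for i, ch in enumerate(sentence)
    )
```

The positions are zero-based indices into the sentence, so the list of positions can be output on its own or passed to this function to produce a marked copy of the sentence.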
The specific implementation of the system for recognizing misspelled characters according to the invention is described in detail below with reference to the drawings.
Referring to Fig. 7, which shows a structural schematic diagram of the system for recognizing misspelled characters in one embodiment, the system mainly includes a text acquisition module 10, a probability detection module 20 and an error detection module 30;
the text acquisition module 10 is configured to obtain the input sentence to be detected and extract the characters of the sentence to be detected;
the probability detection module 20 is configured to check the characters of the sentence to be detected with a pre-trained misspelling detection model and obtain the first occurrence probability of each character at its current position in the sentence;
the error detection module 30 is configured to compare the first occurrence probability of each character with a preset first probability threshold and determine a character whose first occurrence probability is less than the first probability threshold to be a misspelling.
The probability detection module 20 may be further configured to obtain the characters of the input sentence to be detected one by one; according to the current position of the character, obtain the preceding-context information of the sentence to be detected before the current position; and obtain, from the preceding-context information, the first occurrence probability of the character at the current position.
The error detection module 30 may also be configured to obtain a candidate character whose first occurrence probability is less than or equal to the first probability threshold; if the occurrence probability of the candidate character is greater than the set second probability threshold, obtain a confusion subset of similar characters close to the candidate character; use the misspelling detection model to detect the second occurrence probability, at the current position, of each similar character in the confusion subset; and, if the first occurrence probability of the candidate character is less than the second occurrence probability of at least one similar character, judge the candidate character to be a misspelling.
The error detection module 30 may also be configured to judge the candidate character to be a misspelling if its occurrence probability is less than or equal to the set second probability threshold.
The error detection module 30 may also be configured to output the position of the identified misspelling, and/or to prompt the user with the position of the identified misspelling.
In addition, the system may include a model establishment and training module 40, configured to establish a training model for misspelling detection using natural-language corpus data and recurrent neural network technology, preprocess the corpus data of the training model to obtain a corpus data package, and train the training model with the corpus data package to obtain the misspelling detection model.
The model establishment and training module 40 may be further configured to delete redundant content from the corpus data of the training model and replace non-character data with letters; split the sentences in the corpus data into units of characters and the replacement letters; and add a sentence-start marker and a sentence-end marker at the beginning and end of each sentence, to generate the corpus data package.
The above method and system for recognizing misspelled characters use a pre-trained misspelling detection model, together with the occurrence probability of each character at its current position in the sentence to be detected and a preset threshold, to detect the misspelled characters in the sentence, and thereby identify misspellings accurately and efficiently.
Specific reality of the application example to the present invention of word misspelling detection is carried out with reference to an application present invention
Mode is applied to be illustrated in more detail.
Referring to Figure 8, which shows a flow chart of the method for recognizing word misspellings in one application example, the method mainly includes the following steps:
Step a1: Establish a training model for misspelling detection and pre-process the corpus data of the training model;
Step a2: Train the training model;
Step a3: Obtain the sentence to be detected and input it into the trained misspelling detection model;
Step a4: Extract the words of the input sentence to be detected using the trained misspelling detection model;
Step a5: Obtain the first probability of occurrence of each word in the sentence to be detected at its current position;
Step a6: Judge whether the first probability of occurrence of each word is less than or equal to the preset first probability threshold; if so, execute step a7; otherwise, determine that the word is spelled correctly;
Step a7: Identify the word as a misspelling and output the position of the misspelling.
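Steps a5 to a7 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the trained recurrent neural network is replaced by a hypothetical callable `prob_fn(context, word)`, and `toy_prob` is a fixed lookup table invented purely so that the example runs.

```python
from typing import Callable, Sequence

# Hypothetical interface: given the words before the current position and the
# word at that position, return the word's probability of occurrence there.
ProbFn = Callable[[Sequence[str], str], float]


def detect_misspellings(words: Sequence[str], prob_fn: ProbFn,
                        first_threshold: float) -> list[int]:
    """Return the positions of words judged to be misspellings."""
    flagged = []
    for position, word in enumerate(words):
        context = words[:position]        # text preceding the current position
        p1 = prob_fn(context, word)       # step a5: first probability of occurrence
        if p1 <= first_threshold:         # step a6: compare with the first threshold
            flagged.append(position)      # step a7: record the misspelling position
    return flagged


def toy_prob(context: Sequence[str], word: str) -> float:
    # Toy stand-in for the trained model: ignores context, uses a fixed table.
    return {"我": 0.30, "们": 0.40, "门": 0.001}.get(word, 0.05)


print(detect_misspellings(list("我门"), toy_prob, first_threshold=0.01))
```

With the toy table above, only the word at position 1 falls at or below the threshold, so the call prints `[1]`.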
Referring to Figure 9, which shows a flow chart of the method for recognizing word misspellings in another application example, the method mainly includes the following steps:
Step b1: Establish a training model for misspelling detection and pre-process the corpus data of the training model;
Step b2: Train the training model;
Step b3: Obtain the sentence to be detected and input it into the trained misspelling detection model;
Step b4: Extract the words of the input sentence to be detected using the trained misspelling detection model;
Step b5: Obtain the first probability of occurrence of each word in the sentence to be detected at its current position;
Step b6: Judge whether the first probability of occurrence of each word is less than or equal to the preset first probability threshold; if so, execute step b7; otherwise, determine that the word is spelled correctly;
Step b7: Judge whether the probability of occurrence of the word is greater than the set second probability threshold; if so, execute step b8; otherwise, execute step b10;
Step b8: Use the misspelling detection model to detect the second probability of occurrence, at the current position, of each close word in the confusion word subset;
Step b9: Judge whether the first probability of occurrence of the word is less than the second probability of occurrence of at least one close word; if so, execute step b10; otherwise, determine that the word is spelled correctly;
Step b10: Identify the word as a misspelling and output the position of the misspelling.
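The decision logic of steps b5 to b10 can be sketched as below. The sketch rests on several assumptions: `prob_fn` is a hypothetical stand-in for the trained model, `confusion_sets` is a hypothetical mapping from a word to its close words, and the names `t1` and `t2` for the two thresholds are invented here. Note that step b8 is only reachable when the second threshold is lower than the first.

```python
from typing import Callable, Mapping, Sequence

ProbFn = Callable[[Sequence[str], str], float]  # hypothetical model interface


def is_misspelled(words: Sequence[str], position: int, prob_fn: ProbFn,
                  confusion_sets: Mapping[str, Sequence[str]],
                  t1: float, t2: float) -> bool:
    """Apply the two-threshold decision to the word at one position."""
    word = words[position]
    context = words[:position]
    p1 = prob_fn(context, word)           # step b5: first probability of occurrence
    if p1 > t1:                           # step b6: above the first threshold
        return False                      # spelled correctly
    if p1 <= t2:                          # step b7: not above the second threshold
        return True                       # step b10: misspelling
    # Step b8: second probability of occurrence of each close word at this position.
    close_words = confusion_sets.get(word, ())
    # Step b9: misspelling if at least one close word is more probable here.
    return any(p1 < prob_fn(context, close) for close in close_words)


def detect_misspellings(words: Sequence[str], prob_fn: ProbFn,
                        confusion_sets: Mapping[str, Sequence[str]],
                        t1: float, t2: float) -> list[int]:
    """Return the positions of all words judged to be misspellings."""
    return [i for i in range(len(words))
            if is_misspelled(words, i, prob_fn, confusion_sets, t1, t2)]
```

The confusion-set check in the middle band reduces false alarms: a word that is merely uncommon at its position is flagged only when a similar-looking or similar-sounding word would fit better there.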
With the above method for recognizing word misspellings, a pre-trained misspelling detection model is used to detect the misspelled words in the sentence to be detected, according to the probability of occurrence of each word at its current position in the sentence and preset thresholds, thereby achieving accurate and efficient recognition of misspellings.
In one embodiment, the present invention also provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for recognizing word misspellings of any one of the above embodiments.
In this computer device, when the processor executes the program, it implements the method for recognizing word misspellings of any one of the above embodiments, so that the misspelled words in the sentence to be detected are detected according to the probability of occurrence of each word at its current position in the sentence and preset thresholds, achieving accurate and efficient recognition of misspellings.
In addition, those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system, so as to implement the flows of the embodiments of the above methods for recognizing word misspellings.
In one embodiment, a storage medium is also provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the method for recognizing word misspellings of any one of the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
With this computer storage medium, the stored computer program implements the flows of the embodiments of the above methods for recognizing word misspellings, so that the misspelled words in the sentence to be detected can be detected according to the probability of occurrence of each word at its current position in the sentence and preset thresholds, achieving accurate and efficient recognition of misspellings.
Figure 10 is a schematic diagram of the internal structure of the computer device in one embodiment. Referring to Figure 10, the computer device includes a processor, a non-volatile storage medium, an internal memory, a display, and a network interface connected by a system bus. The non-volatile storage medium of the computer device can store an operating system and a computer program for implementing a voice communication apparatus; when executed, this computer program may cause the processor to perform a voice communication method. The processor of the computer device provides computing and control capability and supports the operation of the entire computer device. A computer program can be stored in the internal memory; when this computer program is executed by the processor, it may cause the processor to perform the method for recognizing word misspellings. The network interface of the computer device is used for network communication. The display screen is used to show application interfaces and the like, for example an instant messaging chat interface or an operation interface for text correction. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, a key, trackball, or trackpad provided on the housing of the computer device, or an external keyboard, trackpad, mouse, or the like. The touch layer and the display screen together form a touch screen.
Those skilled in the art will understand that the structure shown in Figure 10 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the terminal to which the solution is applied; a specific terminal may include more or fewer components than shown in the figure, may combine certain components, or may have a different arrangement of components.
The technical features of the embodiments described above can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments have been described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification. The embodiments described above express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.
Claims (10)
1. A method for recognizing word misspellings, characterized by comprising the following steps:
obtaining an input sentence to be detected, and extracting the words of the sentence to be detected;
detecting the words in the sentence to be detected using a pre-trained misspelling detection model, to obtain a first probability of occurrence of each word in the sentence to be detected at its current position;
judging the first probability of occurrence of each word against a preset first probability threshold, and determining a word whose first probability of occurrence is less than the first probability threshold to be a misspelling.
2. The method for recognizing word misspellings according to claim 1, characterized by further comprising:
if the probability of occurrence of the candidate word is less than or equal to a set second probability threshold, determining the candidate word to be a misspelling.
3. The method for recognizing word misspellings according to claim 1, characterized by further comprising:
obtaining a candidate word whose first probability of occurrence is less than or equal to the first probability threshold;
if the probability of occurrence of the candidate word is greater than a set second probability threshold, obtaining a confusion word subset of close words similar to the candidate word;
using the misspelling detection model to detect the second probability of occurrence, at the current position, of each close word in the confusion word subset;
if the first probability of occurrence of the candidate word is less than the second probability of occurrence of at least one close word, determining the candidate word to be a misspelling.
4. The method for recognizing word misspellings according to claim 1, characterized by further comprising:
establishing a training model for misspelling detection using natural language corpus data and based on recurrent neural network technology;
pre-processing the corpus data in the training model to obtain a corpus data packet, and training the training model with the corpus data packet to obtain the misspelling detection model.
5. The method for recognizing word misspellings according to claim 4, characterized in that the step of pre-processing the corpus data in the training model to obtain the corpus data packet comprises:
deleting redundant content from the corpus data in the training model, and replacing non-text data with letters;
splitting the sentences in the corpus data into units of words and said letters, and adding a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, to generate the corpus data packet.
6. The method for recognizing word misspellings according to claim 1, characterized in that the step of obtaining the first probability of occurrence of each word in the sentence to be detected at its current position comprises:
obtaining the words of the input sentence to be detected one by one;
according to the current position of the word, obtaining the preceding-text information of the sentence to be detected before the current position;
obtaining, according to the preceding-text information, the first probability of occurrence of the word at the current position.
7. The method for recognizing word misspellings according to claim 1, characterized by further comprising:
outputting the position of the identified misspelling, and giving a prompt at the position of the identified misspelling.
8. A system for recognizing word misspellings, characterized by comprising:
a text acquisition module, for obtaining an input sentence to be detected and extracting the words of the sentence to be detected;
a probability detection module, for detecting the words in the sentence to be detected using a pre-trained misspelling detection model, to obtain a first probability of occurrence of each word in the sentence to be detected at its current position;
an error detection module, for judging the first probability of occurrence of each word against a preset first probability threshold, and determining a word whose first probability of occurrence is less than the first probability threshold to be a misspelling.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method for recognizing word misspellings according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for recognizing word misspellings according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810273968.XA CN108563634A (en) | 2018-03-29 | 2018-03-29 | Recognition methods, system, computer equipment and the storage medium of word misspelling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108563634A true CN108563634A (en) | 2018-09-21 |
Family
ID=63533452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810273968.XA Pending CN108563634A (en) | 2018-03-29 | 2018-03-29 | Recognition methods, system, computer equipment and the storage medium of word misspelling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108563634A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408813A (en) * | 2018-09-30 | 2019-03-01 | 北京金山安全软件有限公司 | Text correction method and device |
CN109558600A (en) * | 2018-11-14 | 2019-04-02 | 北京字节跳动网络技术有限公司 | Translation processing method and device |
CN109766538A (en) * | 2018-11-21 | 2019-05-17 | 北京捷通华声科技股份有限公司 | A kind of text error correction method, device, electronic equipment and storage medium |
CN110232191A (en) * | 2019-06-17 | 2019-09-13 | 无码科技(杭州)有限公司 | Autotext error-checking method |
CN110866390A (en) * | 2019-10-15 | 2020-03-06 | 平安科技(深圳)有限公司 | Method and device for recognizing Chinese grammar error, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105045778A (en) * | 2015-06-24 | 2015-11-11 | 江苏科技大学 | Chinese homonym error auto-proofreading method |
CN105159871A (en) * | 2015-08-21 | 2015-12-16 | 小米科技有限责任公司 | Text information detection method and apparatus |
CN106776501A (en) * | 2016-12-13 | 2017-05-31 | 深圳爱拼信息科技有限公司 | A kind of automatic method for correcting of text wrong word and server |
CN107122346A (en) * | 2016-12-28 | 2017-09-01 | 平安科技(深圳)有限公司 | The error correction method and device of a kind of read statement |
CN107357775A (en) * | 2017-06-05 | 2017-11-17 | 百度在线网络技术(北京)有限公司 | The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence |
-
2018
- 2018-03-29 CN CN201810273968.XA patent/CN108563634A/en active Pending
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408813A (en) * | 2018-09-30 | 2019-03-01 | 北京金山安全软件有限公司 | Text correction method and device |
CN109408813B (en) * | 2018-09-30 | 2023-07-21 | 北京金山安全软件有限公司 | Text correction method and device |
CN109558600A (en) * | 2018-11-14 | 2019-04-02 | 北京字节跳动网络技术有限公司 | Translation processing method and device |
CN109766538A (en) * | 2018-11-21 | 2019-05-17 | 北京捷通华声科技股份有限公司 | A kind of text error correction method, device, electronic equipment and storage medium |
CN109766538B (en) * | 2018-11-21 | 2023-12-15 | 北京捷通华声科技股份有限公司 | Text error correction method and device, electronic equipment and storage medium |
CN110232191A (en) * | 2019-06-17 | 2019-09-13 | 无码科技(杭州)有限公司 | Autotext error-checking method |
CN110866390A (en) * | 2019-10-15 | 2020-03-06 | 平安科技(深圳)有限公司 | Method and device for recognizing Chinese grammar error, computer equipment and storage medium |
CN110866390B (en) * | 2019-10-15 | 2022-02-11 | 平安科技(深圳)有限公司 | Method and device for recognizing Chinese grammar error, computer equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110489760B (en) | Text automatic correction method and device based on deep neural network | |
CN108491392A (en) | Modification method, system, computer equipment and the storage medium of word misspelling | |
Stoica et al. | Re-tacred: Addressing shortcomings of the tacred dataset | |
CN108563634A (en) | Recognition methods, system, computer equipment and the storage medium of word misspelling | |
CN108563632A (en) | Modification method, system, computer equipment and the storage medium of word misspelling | |
CN114610515B (en) | Multi-feature log anomaly detection method and system based on log full semantics | |
CN107729309A (en) | A kind of method and device of the Chinese semantic analysis based on deep learning | |
CN107229610A (en) | The analysis method and device of a kind of affection data | |
CN106991085A (en) | The abbreviation generation method and device of a kind of entity | |
CN111310447A (en) | Grammar error correction method, grammar error correction device, electronic equipment and storage medium | |
CN106325596B (en) | A kind of written handwriting automatic error correction method and system | |
CN108519973A (en) | Detection method, system, computer equipment and the storage medium of word spelling | |
CN110888798A (en) | Software defect prediction method based on graph convolution neural network | |
CN111966944A (en) | Model construction method for multi-level user comment security audit | |
CN114970506A (en) | Grammar error correction method and system based on multi-granularity grammar error template learning fine tuning | |
Tashu et al. | Deep Learning Architecture for Automatic Essay Scoring | |
CN115358219A (en) | Chinese spelling error correction method integrating unsupervised learning and self-supervised learning | |
CN114580391A (en) | Chinese error detection model training method, device, equipment and storage medium | |
CN113515588A (en) | Form data detection method, computer device and storage medium | |
CN108664467A (en) | Candidate word appraisal procedure, device, computer equipment and storage medium | |
CN108681534A (en) | Candidate word appraisal procedure, device, computer equipment and storage medium | |
CN108628827A (en) | Candidate word appraisal procedure, device, computer equipment and storage medium | |
CN108595419A (en) | Candidate word appraisal procedure, candidate word sort method and device | |
CN108647202A (en) | Candidate word appraisal procedure, device, computer equipment and storage medium | |
CN108681533A (en) | Candidate word appraisal procedure, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180921 |