CN108563634A - Method, system, computer device and storage medium for identifying word misspellings - Google Patents

Info

Publication number
CN108563634A
Authority
CN
China
Prior art keywords
word
misspelling
probability
sentence
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810273968.XA
Other languages
Chinese (zh)
Inventor
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Technology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Technology Co Ltd
Priority to CN201810273968.XA
Publication of CN108563634A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Abstract

The present invention relates to a method, system, computer device and storage medium for identifying word misspellings. The method includes: obtaining an input sentence to be detected and extracting the words of that sentence; detecting the words in the sentence using a pre-trained misspelling detection model to obtain a first occurrence probability of each word at its current position in the sentence; and comparing the first occurrence probability of each word against a preset first probability threshold, determining any word whose first occurrence probability is less than the first probability threshold to be a misspelling. By using a pre-trained misspelling detection model and detecting misspelled words according to the occurrence probability of each word at its current position and a preset threshold, the technical solution of the invention identifies misspellings accurately and efficiently.

Description

Method, system, computer device and storage medium for identifying word misspellings
Technical field
The present invention relates to text processing technology, and in particular to a method, system, computer device and storage medium for identifying word misspellings.
Background art
With the continuous development of text processing technology, techniques for retrieving, extracting and translating text information have gradually matured; for text proofreading, however, there is still no accurate and efficient method.
Detecting misspellings is an important part of text proofreading, because wrongly written words seriously degrade the quality of a text. For example, press releases contain a very large number of words and are held to very strict requirements on misspellings; only by finding the misspellings in a manuscript promptly and accurately can the spread of erroneous information be avoided.
Traditional methods for detecting input errors are mainly statistics-based. In implementing such schemes, however, the inventor found that they need to build a statistical language model from features of the characters and words in the text, and that they depend on that model. While the statistical language model is being built, data sparsity can seriously affect the efficiency and precision of correction. In addition, such methods cannot take the context into account during detection, which also lowers detection accuracy, so it is difficult to identify the misspellings in a text accurately and efficiently.
Summary of the invention
In view of this, and to address the technical problem that misspellings in text are difficult to detect accurately and efficiently, it is necessary to provide a method, system, computer device and storage medium for identifying word misspellings.
A method for identifying word misspellings includes the following steps:
obtaining an input sentence to be detected, and extracting the words of the sentence to be detected;
detecting the words in the sentence to be detected using a pre-trained misspelling detection model, to obtain a first occurrence probability of each word in the sentence at its current position;
comparing the first occurrence probability of each word against a preset first probability threshold, and determining any word whose first occurrence probability is less than the first probability threshold to be a misspelling.
In one embodiment, if the occurrence probability of a candidate word (a word whose first occurrence probability is less than or equal to the first probability threshold) is less than or equal to a set second probability threshold, the candidate word is judged to be a misspelling.
In one embodiment, the method further includes: obtaining a candidate word whose first occurrence probability is less than or equal to the first probability threshold;
if the occurrence probability of the candidate word is greater than the set second probability threshold, obtaining a confusion set of similar words that are close to the candidate word;
using the misspelling detection model to detect a second occurrence probability, at the current position, of each similar word in the confusion set;
if the first occurrence probability of the candidate word is less than the second occurrence probability of at least one similar word, judging the candidate word to be a misspelling.
In one embodiment, the method further includes: establishing a training model for misspelling detection using natural-language corpus data and recurrent neural network technology;
preprocessing the corpus data of the training model to obtain a corpus data package, and training the training model with the corpus data package to obtain the misspelling detection model.
In one embodiment, the step of preprocessing the corpus data of the training model to obtain the corpus data package includes:
deleting redundant content from the corpus data of the training model, and replacing non-text data with letters;
splitting the sentences in the corpus data into units of single words and the replacement letters, and adding a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, to generate the corpus data package.
In one embodiment, the step of obtaining the first occurrence probability of each word in the sentence to be detected at its current position includes:
obtaining the words of the input sentence to be detected one by one;
obtaining, according to the current position of the word, the preceding context of the sentence to be detected before the current position;
obtaining, from the preceding context, the first occurrence probability of the word at the current position.
In one embodiment, the method further includes: outputting the position of each identified misspelling, and prompting the position of each identified misspelling.
A system for identifying word misspellings includes:
a text acquisition module, configured to obtain an input sentence to be detected and extract the words of the sentence to be detected;
a probability detection module, configured to detect the words in the sentence to be detected using a pre-trained misspelling detection model, and obtain a first occurrence probability of each word in the sentence at its current position;
an error detection module, configured to compare the first occurrence probability of each word against a preset first probability threshold, and determine any word whose first occurrence probability is less than the first probability threshold to be a misspelling.
The above method and system for identifying word misspellings use a pre-trained misspelling detection model and detect the misspelled words in the sentence to be detected according to the occurrence probability of each word at its current position and a preset threshold, thereby identifying misspellings accurately and efficiently.
A computer device includes a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method for identifying word misspellings described above is implemented.
The above computer device, through the computer program running on the processor, detects the misspelled words in the sentence to be detected according to the occurrence probability of each word at its current position and a preset threshold, thereby identifying misspellings accurately and efficiently.
A computer storage medium stores a computer program which, when executed by a processor, implements the method for identifying word misspellings described above.
The above computer storage medium, through the computer program it stores, detects the misspelled words in the sentence to be detected according to the occurrence probability of each word at its current position and a preset threshold, thereby identifying misspellings accurately and efficiently.
Brief description of the drawings
Fig. 1 is a flow chart of the method for identifying word misspellings according to one embodiment;
Fig. 2 is a schematic diagram of the unidirectional training model;
Fig. 3 is a schematic diagram of the prediction result of the unidirectional training model;
Fig. 4 is a schematic diagram of the bidirectional training model;
Fig. 5 is a schematic diagram of the prediction result of the bidirectional training model;
Fig. 6 is a flow chart of the method for identifying word misspellings according to another embodiment;
Fig. 7 is a structural schematic diagram of the system for identifying word misspellings according to one embodiment;
Fig. 8 is a flow chart of the method for identifying word misspellings according to one application example;
Fig. 9 is a flow chart of the method for identifying word misspellings according to another application example;
Fig. 10 is a schematic diagram of the internal structure of the computer device in one embodiment.
Detailed description of the embodiments
To facilitate understanding of the present invention, the invention is described more fully below with reference to the accompanying drawings, which show preferred embodiments of the invention. The invention can, however, be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure will be more thorough and complete.
Referring to Fig. 1, which shows a flow chart of the method for identifying word misspellings according to one embodiment, the method mainly includes the following steps:
Step S10: obtain the input sentence to be detected, and extract the words of the sentence to be detected.
In this step, the sentence to be detected that is input into the misspelling detection model is obtained, and each word in the sentence is extracted.
Step S20: detect the words in the sentence to be detected using a pre-trained misspelling detection model, and obtain a first occurrence probability of each word in the sentence at its current position.
In this step, the pre-trained misspelling detection model is used to detect each word in the sentence to be detected and to output the probability that each word occurs at the position it currently occupies, that is, the first occurrence probability of each word at its current position in the sentence to be detected.
Step S30: compare the first occurrence probability of each word against a preset first probability threshold, and determine any word whose first occurrence probability is less than the first probability threshold to be a misspelling.
In this step, the first occurrence probability of each word in the sentence to be detected is compared with the preset first probability threshold. If the first occurrence probability of a word is less than the preset first probability threshold, the word is judged to be a misspelling; otherwise, the word is judged to be spelled correctly.
The technical solution proposed in the above embodiment uses a pre-trained misspelling detection model and detects the misspelled words in the sentence to be detected according to the occurrence probability of each word at its current position and a preset threshold, thereby identifying misspellings accurately and efficiently.
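Steps S10 to S30 can be sketched roughly as follows. This is only an illustrative outline, not the patented implementation: the `Scorer` type, `toy_scorer` and the threshold value are hypothetical stand-ins for the trained detection model and its configuration.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer type: given the words of a sentence, return one
# occurrence probability per word. It stands in for the trained model.
Scorer = Callable[[Sequence[str]], List[float]]


def extract_words(sentence: str) -> List[str]:
    """Step S10: split the sentence into words (single characters here)."""
    return [ch for ch in sentence if not ch.isspace()]


def detect_misspellings(sentence: str, scorer: Scorer,
                        first_threshold: float) -> List[int]:
    """Steps S20 and S30: score every word, then flag low-probability ones."""
    words = extract_words(sentence)
    probs = scorer(words)  # first occurrence probability of each word
    if len(probs) != len(words):
        raise ValueError("scorer must return one probability per word")
    # A word is flagged when its probability is below the first threshold.
    return [i for i, p in enumerate(probs) if p < first_threshold]


def toy_scorer(words: Sequence[str]) -> List[float]:
    # Made-up probabilities for illustration only: "x" is treated as unlikely.
    return [1e-6 if w == "x" else 0.1 for w in words]


flagged = detect_misspellings("abxd", toy_scorer, first_threshold=1e-3)
print(flagged)
```

Passing the model in as a callable keeps the thresholding logic independent of whether a unidirectional or a bidirectional model produces the probabilities.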
In one embodiment, before the words in the sentence to be detected are detected in step S20 using the pre-trained misspelling detection model, the method can further include the following steps:
(1) establishing a training model for misspelling detection using natural-language corpus data and recurrent neural network technology;
(2) preprocessing the corpus data of the training model to obtain a corpus data package, and training the training model with the corpus data package to obtain the misspelling detection model.
In the scheme of the above embodiment, a training model for misspelling detection is established on the basis of recurrent neural network technology and trained with the corpus data package, so that an accurate and efficient misspelling detection model can be obtained.
Further, the step of preprocessing the corpus data of the training model to obtain the corpus data package can include:
deleting redundant content from the corpus data of the training model, and replacing non-text data with letters; splitting the sentences in the corpus data into units of single words and the replacement letters, and adding a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, to generate the corpus data package.
Here, data cleaning can be used to delete useless symbols from the corpus data, sentences that contain uncommon Chinese characters, repeated sentences, and sentences containing fewer than 2 Chinese characters. Continuous strings of Arabic numerals, English words, English abbreviations and the like are uniformly replaced with letters; for example, a continuous string of digits can be replaced with the capital letter N and a continuous string of English letters with the capital letter C. Which letter is used for each replacement can be changed and set as needed. A comparison of text before and after replacement is as follows:
Before replacement | After replacement
2017年4月5日 (April 5, 2017) | N年N月N日
XXXX第二产业园 (XXXX Second Industrial Park) | C第二产业园
After the replacement, a sentence-start tag and a sentence-end tag can also be added at the beginning and end of each sentence; for example, "<s>" can be added at the beginning of a sentence and "</s>" at its end. The sentences in the corpus data are then split into units of single words and the replacement letters, generating a corpus data package that can be used for model training. Part of the data in the generated corpus data package is as follows:
By organizing the natural-language corpus data into a corpus data package, the training model can be trained in a targeted way, which improves the efficiency of model training and thus the accuracy of the probabilities output by the misspelling detection model.
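The preprocessing described above might look like the following minimal sketch. The function names, the regular expressions and the CJK range are assumptions chosen for illustration; filtering out sentences that contain uncommon characters is omitted because the text does not specify the list of common characters.

```python
import re
from typing import List, Optional

START, END = "<s>", "</s>"
LATIN = re.compile(r"[A-Za-z]+")      # a run of English letters becomes "C"
DIGITS = re.compile(r"[0-9]+")        # a run of Arabic numerals becomes "N"
CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK ideograph range (assumed)


def preprocess(sentence: str, min_chars: int = 2) -> Optional[List[str]]:
    """Return the token list for one sentence, or None if it is dropped."""
    text = re.sub(r"\s+", "", sentence)
    # Letters are replaced first so the "N" placeholder is not rewritten to "C".
    text = LATIN.sub("C", text)
    text = DIGITS.sub("N", text)
    # Keep Chinese characters and the two placeholders; drop other symbols.
    tokens = [ch for ch in text if CJK.match(ch) or ch in "NC"]
    if sum(1 for ch in tokens if CJK.match(ch)) < min_chars:
        return None  # fewer than min_chars Chinese characters
    return [START] + tokens + [END]


def build_corpus_package(sentences: List[str]) -> List[List[str]]:
    """Drop repeated or unusable sentences and tokenize the rest."""
    seen, package = set(), []
    for raw in sentences:
        key = raw.strip()
        if key in seen:
            continue  # repeated sentence
        seen.add(key)
        tokens = preprocess(key)
        if tokens is not None:
            package.append(tokens)
    return package


package = build_corpus_package(["2017年4月5日", "2017年4月5日", "OK!"])
print(package)
```

Each entry of the resulting package is a list of single-character tokens wrapped in the start and end tags, which is the unit the training step consumes.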
In one embodiment, when the training model for misspelling detection is established in step (1) above using natural-language corpus data and recurrent neural network technology, model parameters such as the number of network layers and the number of neurons can be set according to the required detection accuracy and actual needs. For example, a two-layer neural network can be established with dropout regularization added between the layers; the input layer uses 4000 neurons, corresponding to 4000 commonly used Chinese characters, and the hidden layer uses 400 neurons; the output layer uses the softmax classification function, and its output is the predicted occurrence probability of each word.
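To give a feel for the size of the example configuration, the sketch below records the stated hyperparameters and counts the weights such a network would have. It assumes plain (Elman) recurrent cells with one bias vector each, which the text does not specify, and the dropout rate is an assumed placeholder; `ModelConfig` and `rnn_parameter_count` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    vocab_size: int = 4000   # input/output size: 4000 common Chinese characters
    hidden_size: int = 400   # neurons per hidden layer
    num_layers: int = 2      # two-layer network
    dropout: float = 0.5     # assumed value; the text gives no dropout rate


def rnn_parameter_count(cfg: ModelConfig) -> int:
    """Count weights of a plain RNN stack plus a softmax output layer."""
    total = 0
    in_size = cfg.vocab_size
    for _ in range(cfg.num_layers):
        # input-to-hidden weights + hidden-to-hidden weights + one bias vector
        total += in_size * cfg.hidden_size + cfg.hidden_size ** 2 + cfg.hidden_size
        in_size = cfg.hidden_size
    # hidden-to-output weights + output bias feeding the softmax
    total += cfg.hidden_size * cfg.vocab_size + cfg.vocab_size
    return total


config = ModelConfig()
print(rnn_parameter_count(config))
```

Under these assumptions the count is dominated by the two 4000 x 400 matrices at the input and output, so the vocabulary size matters more than the number of hidden layers.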
In one implementation, referring to Fig. 2 and Fig. 3, Fig. 2 is a schematic diagram of the unidirectional training model and Fig. 3 is a schematic diagram of the prediction result of the unidirectional training model. A training model for unidirectional misspelling detection can be established on the basis of recurrent neural network technology (RNN, Recurrent Neural Networks); the training model is trained on forward-input training corpus sentences to obtain a unidirectional misspelling detection model.
For example, the input of the training model is "<s>天生我材必有用" and the desired output is "天生我材必有用</s>" (the sentence 天生我材必有用 means roughly "Heaven gave me talents that must be of use"). That is, for the training sentence "天生我材必有用", the corresponding prediction result should be as shown in Fig. 3. The training model that, after training, produces the expected output is used as the misspelling detection model; it detects the words in the sentence to be detected and outputs the occurrence probability of each word at its current position.
Here, the preceding context refers to the information of each word located before the word at the current position in the sentence to be detected. Taking into account the preceding context before the current position increases the accuracy with which the occurrence probability of the word at the current position is detected.
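The shifted input/output pair used to train the unidirectional model, and the preceding context available at each position, can be illustrated with a small sketch. The helper names are hypothetical, and splitting the sentence into single characters is an assumption consistent with the preprocessing described earlier.

```python
from typing import List, Tuple

START, END = "<s>", "</s>"


def make_unidirectional_pair(words: List[str]) -> Tuple[List[str], List[str]]:
    """Input is the sentence shifted right by the start tag; the target is
    the sentence followed by the end tag, so step i predicts targets[i]."""
    inputs = [START] + words
    targets = words + [END]
    return inputs, targets


def preceding_context(words: List[str], position: int) -> List[str]:
    """Everything the model has read when it predicts words[position]."""
    return [START] + words[:position]


words = list("天生我材必有用")
inputs, targets = make_unidirectional_pair(words)
print(inputs)
print(targets)
print(preceding_context(words, 1))
```

Because the input and target have the same length, each time step has exactly one word to predict from the words before it.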
In another implementation, referring to Fig. 4 and Fig. 5, Fig. 4 is a schematic diagram of the bidirectional training model and Fig. 5 is a schematic diagram of the prediction result of the bidirectional training model. A training model for bidirectional misspelling detection can be established on the basis of a bidirectional long short-term memory network (Bi-LSTM) and natural-language corpus data; the training model is trained on forward-input and backward-input training corpus sentences to obtain a bidirectional misspelling detection model.
The input of the training model is of two kinds, a forward input and a backward input. For the sentence "天生我材必有用", to keep the predictions of the two directions consistent, the desired output in both directions is "天生我材必有用". The forward input is therefore "<s>天生我材必有" and the backward input is "生我材必有用</s>", while the desired output of both is "天生我材必有用". That is, for the sentence "天生我材必有用", the corresponding prediction result should be as shown in Fig. 5. For example, the prediction of the word "天" is determined jointly by "<s>" and "生我材必有用</s>", which makes full use of the context information and improves efficiency.
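The forward and backward inputs described above can be derived from a sentence as in the following sketch, which also shows the left and right context that jointly determine the prediction at one position. The function names are hypothetical and the sketch only prepares data; it does not implement the Bi-LSTM itself.

```python
from typing import List, Tuple

START, END = "<s>", "</s>"


def make_bidirectional_inputs(
    words: List[str],
) -> Tuple[List[str], List[str], List[str]]:
    """Forward input drops the last word, backward input drops the first,
    so both directions are aligned with the same target sequence."""
    forward_input = [START] + words[:-1]
    backward_input = words[1:] + [END]
    targets = list(words)
    return forward_input, backward_input, targets


def contexts_for(words: List[str], position: int) -> Tuple[List[str], List[str]]:
    """Left and right context used to predict words[position]."""
    forward_input, backward_input, _ = make_bidirectional_inputs(words)
    left = forward_input[: position + 1]   # read left to right
    right = backward_input[position:]      # read right to left by the model
    return left, right


words = list("天生我材必有用")
fwd, bwd, tgt = make_bidirectional_inputs(words)
left, right = contexts_for(words, 0)
print("".join(fwd), "".join(bwd), "".join(tgt))
print(left, right)
```

For position 0 the left context is only the start tag and the right context is the rest of the sentence plus the end tag, matching the example given for "天".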
Once the model has been trained, it can be used to check the spelling of newly input sentences. In an RNN model, for example, each step outputs, through softmax, a probability vector over all words for the next position, and the occurrence probability of the next word is read from this vector.
For example, using the unidirectional misspelling detection model described above to check the sentence "天升我材必有用", the occurrence probabilities obtained are:
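Reading one word's occurrence probability from the softmax output can be sketched as follows. The three-word vocabulary and the raw scores are made up for illustration; a real model would produce one score per word in its full vocabulary.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw output-layer scores into a probability vector."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def occurrence_probability(scores: List[float], vocab_index: Dict[str, int],
                           next_word: str) -> float:
    """Probability the model assigns to next_word at the next position."""
    return softmax(scores)[vocab_index[next_word]]


# Hypothetical vocabulary and scores for the position after "天".
vocab_index = {"生": 0, "升": 1, "声": 2}
scores = [4.0, -2.0, 1.0]
p_correct = occurrence_probability(scores, vocab_index, "生")
p_wrong = occurrence_probability(scores, vocab_index, "升")
print(p_correct, p_wrong)
```

The value looked up for the word that actually appears in the sentence is what the method calls that word's occurrence probability at its current position.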
天 (heaven) 0.0267950482666
升 (rise) 5.48984644411e-07
我 (I) 0.214276000857
材 (talent) 0.0538657493889
必 (must) 0.0275610154495
有 (have) 0.038463984794
用 (use) 0.042061101339
In this sentence, "生" (life) has been mistakenly written as "升" (rise). The probability of "升" at its current position is 5.48984644411e-07, far smaller than the occurrence probabilities of the correctly spelled words, so this probability can be used to detect the misspelling.
As another example, using the bidirectional misspelling detection model described above to check "天升我材必有用", the probabilities obtained are as follows:
天 (heaven) 0.0108770169318
升 (rise) 1.73152820935e-05
我 (I) 0.919607996941
材 (talent) 0.365396946669
必 (must) 0.999733150005
有 (have) 0.854933917522
用 (use) 0.988406062126
Here again "生" has been mistakenly written as "升". The probability of "升" is 1.73152820935e-05, far smaller than the probabilities of the correctly spelled words, and can therefore be used to detect the misspelling.
The method for identifying word misspellings in the above embodiment uses a pre-trained misspelling detection model and detects the misspelled words in the sentence to be detected according to the occurrence probability of each word at its current position and a preset threshold, thereby identifying misspellings accurately and efficiently.
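As a worked check of the two probability lists above, the sketch below applies a first probability threshold to each list and measures how far the erroneous word sits below the next-lowest word. The threshold value of 1e-3 is an assumption chosen for illustration; the text does not state a concrete threshold.

```python
from typing import List, Tuple

words = list("天升我材必有用")
unidirectional = [0.0267950482666, 5.48984644411e-07, 0.214276000857,
                  0.0538657493889, 0.0275610154495, 0.038463984794,
                  0.042061101339]
bidirectional = [0.0108770169318, 1.73152820935e-05, 0.919607996941,
                 0.365396946669, 0.999733150005, 0.854933917522,
                 0.988406062126]
FIRST_THRESHOLD = 1e-3  # assumed value, not given in the text


def flag(words: List[str], probs: List[float],
         threshold: float) -> List[Tuple[int, str]]:
    """Return (position, word) for every word below the threshold."""
    return [(i, w) for i, (w, p) in enumerate(zip(words, probs)) if p < threshold]


def gap(probs: List[float]) -> float:
    """Ratio between the second-lowest and the lowest probability."""
    lowest, second = sorted(probs)[:2]
    return second / lowest


print(flag(words, unidirectional, FIRST_THRESHOLD), gap(unidirectional))
print(flag(words, bidirectional, FIRST_THRESHOLD), gap(bidirectional))
```

With this threshold both models flag only the word at position 1, and in both lists that word is separated from every correctly spelled word by a factor of several hundred or more, which is what makes a fixed threshold workable.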
In one embodiment, the method for identifying word misspellings further includes:
Step S40: if the occurrence probability of the candidate word is less than or equal to a set second probability threshold, the candidate word is judged to be a misspelling. In this embodiment, the second probability threshold is lower than the first probability threshold. Applying the second probability threshold to words that already fall below the first threshold screens them further, which helps avoid misjudgment and improves detection precision.
Referring to Fig. 6, which shows a flow chart of the method for identifying word misspellings according to another embodiment: in this embodiment, after a word whose first occurrence probability is less than the first probability threshold has been obtained, that word is treated as a candidate misspelling and is checked further. The further check mainly includes the following steps:
Step S401: obtain a candidate word whose first occurrence probability is less than or equal to the first probability threshold;
A word whose occurrence probability is less than the first probability threshold is obtained and treated as a candidate misspelling, and the word is then checked further.
Step S402: if the occurrence probability of the candidate word is greater than the set second probability threshold, obtain a confusion set of similar words that are close to the candidate word. Here, the confusion set is a set of confusable words made up of words that are similar to the candidate word in pronunciation or in shape, or that are commonly confused with it.
For example, the confusion sets can be stored in a file as key-value pairs, with the pronunciation of a Chinese character as the key and the Chinese characters having that pronunciation as the value; the characters with the same pronunciation that correspond to one key form one confusion set. Polyphonic characters can also be taken into account: a polyphonic character is stored in the confusion sets of each of its pronunciations. For example, for the character "会" (meeting), both pronunciations "hui" and "kuai" are considered; for the character "为" (for), however, only "wei" is considered, ignoring the difference between the second and fourth tones.
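The key-value organization described above can be sketched with an in-memory dictionary. The `READINGS` table below is a tiny hand-written sample of toneless pronunciations used only for illustration; a real system would load a full pronunciation table from a file, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical toneless readings for a handful of characters.
READINGS: Dict[str, List[str]] = {
    "生": ["sheng"], "升": ["sheng"], "声": ["sheng"],
    "会": ["hui", "kuai"], "汇": ["hui"], "快": ["kuai"],
    "为": ["wei"], "位": ["wei"],
}


def build_confusion_index(readings: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Key: pronunciation. Value: all characters with that pronunciation."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word, pronunciations in readings.items():
        for pronunciation in pronunciations:
            index[pronunciation].add(word)  # polyphones land in several sets
    return dict(index)


def confusion_set(word: str, readings: Dict[str, List[str]],
                  index: Dict[str, Set[str]]) -> Set[str]:
    """Union of the sets for every pronunciation of word, minus word itself."""
    similar: Set[str] = set()
    for pronunciation in readings.get(word, []):
        similar |= index[pronunciation]
    similar.discard(word)
    return similar


INDEX = build_confusion_index(READINGS)
print(confusion_set("升", READINGS, INDEX))
print(confusion_set("会", READINGS, INDEX))
```

Dropping tones from the keys is what makes "为" map to a single "wei" set, while "会" contributes to both the "hui" and the "kuai" sets.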
Step S403: use the misspelling detection model to detect the second occurrence probability, at the current position, of each similar word in the confusion set;
After the confusion set of the candidate word has been retrieved, the misspelling detection model is used to detect the probability that each similar word in the confusion set occurs at the current position occupied by the candidate word, that is, the second occurrence probability. The number of similar words to be detected can be set as needed: all similar words in the confusion set can be detected, or only those with a high degree of similarity.
Step S404: if the first occurrence probability of the candidate word is less than the second occurrence probability of at least one similar word, the candidate word is judged to be a misspelling;
The first occurrence probability of the candidate word is compared with the second occurrence probability of each similar word in turn. If the first occurrence probability of the candidate word is less than the second occurrence probability of at least one similar word, the candidate word is judged to be a misspelling; if the first occurrence probability of the candidate word is greater than or equal to the second occurrence probability of every similar word, the candidate word is judged to be spelled correctly.
In the method for identifying word misspellings of the above embodiment, a word whose first occurrence probability is less than the first probability threshold is treated as a candidate misspelling; by retrieving the confusion set and comparing the occurrence probability of the candidate word with those of its similar words, misjudged words are further eliminated and the accuracy of misspelling identification is improved.
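The combined decision of steps S40 and S401 to S404 can be summarized in one function. This is a sketch under stated assumptions: the threshold values and the `score_at_position` callback (standing in for a call to the trained model) are hypothetical, and the candidate test uses "less than or equal to" as in step S401.

```python
from typing import Callable, Iterable


def is_misspelling(first_prob: float, first_threshold: float,
                   second_threshold: float, similar_words: Iterable[str],
                   score_at_position: Callable[[str], float]) -> bool:
    """Two-stage check for one word at one position."""
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be below the first")
    if first_prob > first_threshold:
        return False  # not even a candidate
    if first_prob <= second_threshold:
        return True   # step S40: so unlikely that no comparison is needed
    # Steps S402 to S404: flagged only if some similar word fits better here.
    return any(score_at_position(w) > first_prob for w in similar_words)


# Hypothetical second occurrence probabilities of similar words.
better_fit = {"生": 0.3, "声": 1e-5}.__getitem__
no_better_fit = {"生": 1e-5, "声": 1e-6}.__getitem__

print(is_misspelling(5e-7, 1e-3, 1e-6, ["生", "声"], better_fit))
print(is_misspelling(1e-4, 1e-3, 1e-6, ["生", "声"], better_fit))
print(is_misspelling(1e-4, 1e-3, 1e-6, ["生", "声"], no_better_fit))
```

The middle band between the two thresholds is where the confusion set matters: a rare but correct word survives as long as none of its similar words scores higher at the same position.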
In one embodiment, the method for identifying word misspellings further includes: outputting the position of each identified misspelling; and/or prompting the position of each identified misspelling.
Specifically, after a misspelling has been identified, its position can be output, and the misspelling can additionally be marked, for example by underlining or highlighting, so as to draw attention to where it is. Outputting and/or prompting the position of the misspelling improves detection efficiency and makes it easier to correct the wrongly written word.
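Outputting and marking the positions of identified misspellings could be as simple as the following sketch. The bracket markers and the function name are arbitrary choices for illustration; an editor would more likely underline or highlight the same positions.

```python
from typing import Iterable, List


def highlight(words: List[str], positions: Iterable[int],
              left: str = "[", right: str = "]") -> str:
    """Wrap the word at each flagged position in marker characters."""
    marked = set(positions)
    return "".join(
        f"{left}{word}{right}" if i in marked else word
        for i, word in enumerate(words)
    )


words = list("天升我材必有用")
flagged_positions = [1]  # positions reported by the detection step
print(flagged_positions)
print(highlight(words, flagged_positions))
```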
Embodiments of the system for identifying word misspellings of the present invention are described in detail below with reference to the accompanying drawings. Referring to Fig. 7, which shows a structural schematic diagram of the system for identifying word misspellings according to one embodiment, the system mainly includes: a text acquisition module 10, a probability detection module 20 and an error detection module 30;
the text acquisition module 10 is configured to obtain an input sentence to be detected and extract the words of the sentence to be detected;
the probability detection module 20 is configured to detect the words in the sentence to be detected using a pre-trained misspelling detection model, and obtain a first occurrence probability of each word in the sentence at its current position;
the error detection module 30 is configured to compare the first occurrence probability of each word against a preset first probability threshold, and determine any word whose first occurrence probability is less than the first probability threshold to be a misspelling.
The probability detection module 20 can further be configured to obtain the words of the input sentence to be detected one by one; obtain, according to the current position of the word, the preceding context of the sentence to be detected before the current position; and obtain, from the preceding context, the first occurrence probability of the word at the current position.
The error detection module 30 can also be configured to obtain a candidate word whose first occurrence probability is less than or equal to the first probability threshold; if the occurrence probability of the candidate word is greater than the set second probability threshold, obtain a confusion set of similar words that are close to the candidate word; use the misspelling detection model to detect the second occurrence probability, at the current position, of each similar word in the confusion set; and, if the first occurrence probability of the candidate word is less than the second occurrence probability of at least one similar word, judge the candidate word to be a misspelling.
The error detection module 30 can also be configured to judge the candidate word to be a misspelling if the occurrence probability of the candidate word is less than or equal to the set second probability threshold.
The error detection module 30 can also be configured to output the position of each identified misspelling; and/or prompt the position of each identified misspelling.
In addition, the system can further include a model establishment and training module 40, configured to establish a training model for misspelling detection using natural-language corpus data and recurrent neural network technology; and to preprocess the corpus data of the training model to obtain a corpus data package and train the training model with the corpus data package to obtain the misspelling detection model.
The model establishment and training module 40 can further be configured to delete redundant content from the corpus data of the training model and replace non-text data with letters; and to split the sentences in the corpus data into units of single words and the replacement letters and add a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, to generate the corpus data package.
The above method and system for identifying word misspellings use a pre-trained misspelling detection model and detect the misspelled words in the sentence to be detected according to the occurrence probability of each word at its current position and a preset threshold, thereby identifying misspellings accurately and efficiently.
Specific reality of the application example to the present invention of word misspelling detection is carried out with reference to an application present invention Mode is applied to be illustrated in more detail.
Referring to Fig. 8, Fig. 8 shows a flow chart of the recognition method for word misspelling of one application example, which mainly includes the following steps:
Step a1: Establish a training model for misspelling detection, and preprocess the corpus data of the training model;
Step a2: Train the training model;
Step a3: Obtain the sentence to be detected and input it into the trained misspelling detection model;
Step a4: Extract the words of the input sentence to be detected using the trained misspelling detection model;
Step a5: Obtain the first probability of occurrence of each word in the sentence to be detected at its current position;
Step a6: Judge whether the first probability of occurrence of each word is less than or equal to a preset first probability threshold; if so, execute step a7; otherwise, judge that the word is spelled correctly;
Step a7: Identify the word as a misspelling and output the position of the misspelling.
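Steps a5 to a7 can be sketched as follows. This is a hedged illustration only: the trained recurrent neural network is replaced by a hypothetical callable `prob(context, word)` that returns the probability of a word given the words before it, and `toy_prob` with its lookup table is an invented stand-in, not a trained model.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: P(word | words that precede it in the sentence).
ProbFn = Callable[[Sequence[str], str], float]


def detect_misspellings(words: Sequence[str], prob: ProbFn,
                        first_threshold: float) -> List[int]:
    """Return the positions of words judged to be misspellings."""
    positions = []
    for i, word in enumerate(words):
        # Step a5: first probability of occurrence at the current position.
        p_first = prob(words[:i], word)
        # Step a6: compare against the preset first probability threshold.
        if p_first <= first_threshold:
            # Step a7: record the position of the misspelling.
            positions.append(i)
    return positions


# Invented numbers standing in for a trained model's output.
_TOY_TABLE = {
    ("<s>", "I"): 0.4,
    ("I", "like"): 0.3,
    ("like", "apples"): 0.2,
    ("like", "aples"): 0.0001,
}


def toy_prob(context: Sequence[str], word: str) -> float:
    """Toy stand-in that looks only at the immediately preceding word."""
    previous = context[-1] if context else "<s>"
    return _TOY_TABLE.get((previous, word), 0.0)
```

With these invented values, `detect_misspellings(["I", "like", "aples"], toy_prob, 0.001)` flags position 2 only, because the other two words score above the threshold.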
Referring to Fig. 9, Fig. 9 shows a flow chart of the recognition method for word misspelling of another application example, which mainly includes the following steps:
Step b1: Establish a training model for misspelling detection, and preprocess the corpus data of the training model;
Step b2: Train the training model;
Step b3: Obtain the sentence to be detected and input it into the trained misspelling detection model;
Step b4: Extract the words of the input sentence to be detected using the trained misspelling detection model;
Step b5: Obtain the first probability of occurrence of each word in the sentence to be detected at its current position;
Step b6: Judge whether the first probability of occurrence of each word is less than or equal to a preset first probability threshold; if so, execute step b7; otherwise, judge that the word is spelled correctly;
Step b7: Judge whether the probability of occurrence of the word is greater than a set second probability threshold; if so, execute step b8; otherwise, execute step b10;
Step b8: Use the misspelling detection model to detect, for each close word in the confusion word subset, a second probability of occurrence at the current position;
Step b9: Judge whether the first probability of occurrence of the word is less than the second probability of occurrence of at least one close word; if so, execute step b10; otherwise, judge that the word is spelled correctly;
Step b10: Identify the word as a misspelling and output the position of the misspelling.
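The decision logic of steps b6 to b10 for a single word can be sketched as below. This is an illustrative sketch under assumptions: `prob` is a hypothetical callable standing in for the trained detection model, `confusion_subsets` is a hypothetical mapping from a word to its close words, and the probabilities in `_TOY` are invented. The branch through steps b8 and b9 is only reachable when the second threshold is lower than the first.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical interface: P(word | words that precede it in the sentence).
ProbFn = Callable[[Sequence[str], str], float]


def is_misspelled(context: Sequence[str], word: str, prob: ProbFn,
                  confusion_subsets: Mapping[str, Sequence[str]],
                  first_threshold: float, second_threshold: float) -> bool:
    """Apply the two-threshold decision to one word at its current position."""
    p_first = prob(context, word)
    # Step b6: above the first threshold means the spelling is judged correct.
    if p_first > first_threshold:
        return False
    # Step b7: at or below the second threshold goes straight to step b10.
    if p_first <= second_threshold:
        return True
    # Steps b8 and b9: compare with each close word at the same position.
    close_words = confusion_subsets.get(word, ())
    return any(p_first < prob(context, close) for close in close_words)


# Invented probabilities standing in for a trained model's output.
_TOY = {"apples": 0.2, "aples": 0.0005, "kiwis": 0.005,
        "kiwi": 0.001, "zzz": 0.00001}


def toy_prob(context: Sequence[str], word: str) -> float:
    """Toy stand-in that ignores the context entirely."""
    return _TOY.get(word, 0.0)
```

The middle band between the two thresholds is where the confusion word subset matters: a rare but valid word is kept as correct when none of its close words fits the position better.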
With the above recognition method for word misspelling, a misspelling detection model trained in advance is used and, according to the probability of occurrence of each word in the sentence to be detected at its current position and preset thresholds, the misspelled words in the sentence to be detected are detected, achieving accurate and efficient identification of misspellings.
In one embodiment, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the recognition method for word misspelling of any one of the above embodiments.
In this computer device, when the processor executes the program, the recognition method for word misspelling of any one of the above embodiments is implemented, so that the misspelled words in the sentence to be detected are detected according to the probability of occurrence of each word in the sentence to be detected at its current position and preset thresholds, achieving accurate and efficient identification of misspellings.
In addition, a person of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program may be stored in a storage medium of a computer system and executed by at least one processor in the computer system, so as to implement the flows of the embodiments of the recognition method for word misspelling described above.
In one embodiment, a storage medium is further provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the recognition method for word misspelling of any one of the above embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In this computer storage medium, the stored computer program implements the flows of the embodiments of the recognition method for word misspelling described above, so that the misspelled words in the sentence to be detected can be detected according to the probability of occurrence of each word in the sentence to be detected at its current position and preset thresholds, achieving accurate and efficient identification of misspellings.
Fig. 10 is a schematic diagram of the internal structure of a computer device in one embodiment. Referring to Fig. 10, the computer device includes a processor, a non-volatile storage medium, an internal memory, a display screen, and a network interface connected through a system bus. The non-volatile storage medium of the computer device can store an operating system and a computer program for realizing a voice communication assembly; when executed, this computer program causes the processor to perform a voice communication method. The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. The internal memory can store a computer program which, when executed by the processor, causes the processor to perform the recognition method for word misspelling. The network interface of the computer device is used for network communication. The display screen is used to display application interfaces and the like, for example an instant-messaging chat interface or an operation interface for text correction. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input device of the computer device may be a touch layer covering the display screen, or a button, trackball, or trackpad provided on the housing of the computer device, or an external keyboard, trackpad, or mouse. The touch layer and the display screen together form a touch screen.
Those skilled in the art will understand that the structure shown in Fig. 10 is only a block diagram of the part of the structure relevant to the solution of the present invention and does not limit the terminal to which the solution is applied; a specific terminal may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features is not contradictory, it should be considered within the scope of this specification. The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be determined by the appended claims.

Claims (10)

1. A recognition method for word misspelling, characterized by comprising the following steps:
obtaining an input sentence to be detected, and extracting the words of the sentence to be detected;
detecting the words in the sentence to be detected using a misspelling detection model trained in advance, to obtain a first probability of occurrence of each word in the sentence to be detected at its current position;
judging the first probability of occurrence of each word using a preset first probability threshold, and determining a word whose first probability of occurrence is less than the first probability threshold to be a misspelling.
2. The recognition method for word misspelling according to claim 1, characterized by further comprising:
if the probability of occurrence of the alternative word is less than or equal to a set second probability threshold, judging that the alternative word is a misspelling.
3. The recognition method for word misspelling according to claim 1, characterized by further comprising:
obtaining an alternative word whose first probability of occurrence is less than or equal to the first probability threshold;
if the probability of occurrence of the alternative word is greater than a set second probability threshold, obtaining a confusion word subset of close words similar to the alternative word;
using the misspelling detection model to detect, for each close word in the confusion word subset, a second probability of occurrence at the current position;
if the first probability of occurrence of the alternative word is less than the second probability of occurrence of at least one close word, judging that the alternative word is a misspelling.
4. The recognition method for word misspelling according to claim 1, characterized by further comprising:
establishing a training model for misspelling detection using natural-language corpus data and based on recurrent neural network technology;
preprocessing the corpus data in the training model to obtain a corpus data packet, and training the training model using the corpus data packet to obtain the misspelling detection model.
5. The recognition method for word misspelling according to claim 4, characterized in that the step of preprocessing the corpus data in the training model to obtain a corpus data packet comprises:
deleting redundant content from the corpus data in the training model, and replacing non-text data with a letter;
splitting the sentences in the corpus data in units of words and the letter, and adding a sentence-start tag and a sentence-end tag at the beginning and end of each sentence, to generate the corpus data packet.
6. The recognition method for word misspelling according to claim 1, characterized in that the step of obtaining the first probability of occurrence of each word in the sentence to be detected at its current position comprises:
obtaining the words of the input sentence to be detected one by one;
according to the current position of the word, obtaining the preceding-context information of the sentence to be detected before the current position;
obtaining, according to the preceding-context information, the first probability of occurrence of the word at the current position.
7. The recognition method for word misspelling according to claim 1, characterized by further comprising:
outputting the position of the identified misspelling, and prompting the position of the identified misspelling.
8. A recognition system for word misspelling, characterized by comprising:
a text acquisition module, used to obtain an input sentence to be detected and extract the words of the sentence to be detected;
a probability detection module, used to detect the words in the sentence to be detected using a misspelling detection model trained in advance, to obtain a first probability of occurrence of each word in the sentence to be detected at its current position;
an error detection module, used to judge the first probability of occurrence of each word using a preset first probability threshold, and to determine a word whose first probability of occurrence is less than the first probability threshold to be a misspelling.
9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the recognition method for word misspelling according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the recognition method for word misspelling according to any one of claims 1 to 7.
CN201810273968.XA 2018-03-29 2018-03-29 Recognition methods, system, computer equipment and the storage medium of word misspelling Pending CN108563634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810273968.XA CN108563634A (en) 2018-03-29 2018-03-29 Recognition methods, system, computer equipment and the storage medium of word misspelling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810273968.XA CN108563634A (en) 2018-03-29 2018-03-29 Recognition methods, system, computer equipment and the storage medium of word misspelling

Publications (1)

Publication Number Publication Date
CN108563634A true CN108563634A (en) 2018-09-21

Family

ID=63533452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810273968.XA Pending CN108563634A (en) 2018-03-29 2018-03-29 Recognition methods, system, computer equipment and the storage medium of word misspelling

Country Status (1)

Country Link
CN (1) CN108563634A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
CN105159871A (en) * 2015-08-21 2015-12-16 小米科技有限责任公司 Text information detection method and apparatus
CN106776501A (en) * 2016-12-13 2017-05-31 深圳爱拼信息科技有限公司 A kind of automatic method for correcting of text wrong word and server
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408813A (en) * 2018-09-30 2019-03-01 北京金山安全软件有限公司 Text correction method and device
CN109408813B (en) * 2018-09-30 2023-07-21 北京金山安全软件有限公司 Text correction method and device
CN109558600A (en) * 2018-11-14 2019-04-02 北京字节跳动网络技术有限公司 Translation processing method and device
CN109766538A (en) * 2018-11-21 2019-05-17 北京捷通华声科技股份有限公司 A kind of text error correction method, device, electronic equipment and storage medium
CN109766538B (en) * 2018-11-21 2023-12-15 北京捷通华声科技股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN110232191A (en) * 2019-06-17 2019-09-13 无码科技(杭州)有限公司 Autotext error-checking method
CN110866390A (en) * 2019-10-15 2020-03-06 平安科技(深圳)有限公司 Method and device for recognizing Chinese grammar error, computer equipment and storage medium
CN110866390B (en) * 2019-10-15 2022-02-11 平安科技(深圳)有限公司 Method and device for recognizing Chinese grammar error, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN108491392A (en) Modification method, system, computer equipment and the storage medium of word misspelling
Stoica et al. Re-tacred: Addressing shortcomings of the tacred dataset
CN108563634A (en) Recognition methods, system, computer equipment and the storage medium of word misspelling
CN108563632A (en) Modification method, system, computer equipment and the storage medium of word misspelling
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN107229610A (en) The analysis method and device of a kind of affection data
CN106991085A (en) The abbreviation generation method and device of a kind of entity
CN111310447A (en) Grammar error correction method, grammar error correction device, electronic equipment and storage medium
CN106325596B (en) A kind of written handwriting automatic error correction method and system
CN108519973A (en) Detection method, system, computer equipment and the storage medium of word spelling
CN110888798A (en) Software defect prediction method based on graph convolution neural network
CN111966944A (en) Model construction method for multi-level user comment security audit
CN114970506A (en) Grammar error correction method and system based on multi-granularity grammar error template learning fine tuning
Tashu et al. Deep Learning Architecture for Automatic Essay Scoring
CN115358219A (en) Chinese spelling error correction method integrating unsupervised learning and self-supervised learning
CN114580391A (en) Chinese error detection model training method, device, equipment and storage medium
CN113515588A (en) Form data detection method, computer device and storage medium
CN108664467A (en) Candidate word appraisal procedure, device, computer equipment and storage medium
CN108681534A (en) Candidate word appraisal procedure, device, computer equipment and storage medium
CN108628827A (en) Candidate word appraisal procedure, device, computer equipment and storage medium
CN108595419A (en) Candidate word appraisal procedure, candidate word sort method and device
CN108647202A (en) Candidate word appraisal procedure, device, computer equipment and storage medium
CN108681533A (en) Candidate word appraisal procedure, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180921