CN108681535B - Candidate word evaluation method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN108681535B
Authority
CN
China
Prior art keywords
word
candidate
candidate word
wrong
evaluation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810321706.6A
Other languages
Chinese (zh)
Other versions
CN108681535A (en
Inventor
李贤�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810321706.6A
Publication of CN108681535A
Application granted
Publication of CN108681535B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries

Abstract

The invention relates to a candidate word evaluation method, a candidate word evaluation device, computer equipment and a storage medium, which are applied to the field of data processing. The method comprises the following steps: detecting a wrong word, and acquiring a plurality of candidate words corresponding to the wrong word; determining the edit distance between each candidate word and the wrong word; determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word; acquiring error information of the wrong word relative to each candidate word; and determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information. The embodiment of the invention addresses the low reliability of candidate word evaluation and helps improve the reliability of the candidate word evaluation result.

Description

Candidate word evaluation method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a candidate word evaluation method, apparatus, computer device, and storage medium.
Background
Currently popular word processing software, such as Word, WPS and WordPerfect, has an embedded English spell check function; when a misspelled word is found, prompt information or a corresponding error correction suggestion is given.
In the process of implementing the invention, the inventor found the following problem in the prior art: existing error correction methods mainly use a dictionary for detection and, after a misspelling is found, evaluate the candidate words for the wrong word by edit distance alone. This approach is too simple and rigid, and the reliability of the candidate word evaluation result is not ideal.
Disclosure of Invention
Based on this, it is necessary to provide a candidate word evaluation method, apparatus, computer device and storage medium for solving the problem that the evaluation result of the candidate word in the existing manner is not accurate enough.
The scheme provided by the embodiment of the invention comprises the following steps:
A candidate word evaluation method comprising the steps of: detecting a wrong word, and acquiring a plurality of candidate words corresponding to the wrong word; determining the edit distance between each candidate word and the wrong word; determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word; acquiring error information of the wrong word relative to each candidate word; and determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information.
A candidate word evaluation apparatus comprising: a candidate word acquisition module, used for detecting a wrong word and acquiring a plurality of candidate words corresponding to the wrong word; a distance determining module, used for determining the edit distance between each candidate word and the wrong word; a similarity determining module, used for determining the similarity between each candidate word and the wrong word, the similarity being obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word; an error information acquisition module, used for acquiring error information of the wrong word relative to each candidate word; and an eighth evaluation module, used for determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information.
The candidate word evaluation method and apparatus determine the edit distance between each candidate word and the wrong word, calculate the similarity between each candidate word and the wrong word, and determine the evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information. The method considers not only the characteristics of how words are written but also information from the surrounding language context, which helps improve the accuracy of the candidate word evaluation result.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the candidate word evaluation method when executing the computer program.
According to the computer equipment, the accuracy of the candidate word evaluation result is improved through the computer program running on the processor.
A computer storage medium on which a computer program is stored which, when executed by a processor, implements the above candidate word evaluation method.
The computer storage medium is beneficial to improving the accuracy of the candidate word evaluation result through the stored computer program.
Drawings
FIG. 1 is a diagram of an exemplary application environment of the candidate word evaluation method;
FIG. 2 is a schematic flow chart of a candidate word evaluation method according to a first embodiment;
FIG. 3 is a schematic flow chart of a candidate word evaluation method according to a second embodiment;
FIG. 4 is a schematic flow chart of a candidate word evaluation method according to a third embodiment;
FIG. 5 is a schematic flow chart of a candidate word evaluation method according to a fourth embodiment;
FIG. 6 is a schematic flow chart of a candidate word evaluation method according to a fifth embodiment;
FIG. 7 is a schematic flow chart of a candidate word evaluation method according to a sixth embodiment;
FIG. 8 is a schematic flow chart of a candidate word evaluation method according to a seventh embodiment;
FIG. 9 is a schematic flow chart of a candidate word evaluation method according to an eighth embodiment;
FIG. 10 is a schematic flow chart of a candidate word evaluation method according to a ninth embodiment;
FIG. 11 is a schematic flow chart of a candidate word evaluation method according to a tenth embodiment;
FIG. 12 is a schematic flow chart of a candidate word evaluation method according to an eleventh embodiment;
FIG. 13 is a schematic flow chart of a candidate word evaluation method according to a twelfth embodiment;
FIG. 14 is a schematic flow chart of a candidate word evaluation method according to a thirteenth embodiment;
FIG. 15 is a schematic structural diagram of a candidate word evaluation apparatus according to a fourteenth embodiment;
FIG. 16 is a schematic structural diagram of a candidate word evaluation apparatus according to a fifteenth embodiment;
FIG. 17 is a schematic structural diagram of a candidate word evaluation apparatus according to a sixteenth embodiment;
FIG. 18 is a schematic structural diagram of a candidate word evaluation apparatus according to a seventeenth embodiment;
FIG. 19 is a schematic structural diagram of a candidate word evaluation apparatus according to an eighteenth embodiment;
FIG. 20 is a schematic structural diagram of a candidate word evaluation apparatus according to a nineteenth embodiment;
FIG. 21 is a schematic structural diagram of a candidate word evaluation apparatus according to a twentieth embodiment;
FIG. 22 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-first embodiment;
FIG. 23 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-second embodiment;
FIG. 24 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-third embodiment;
FIG. 25 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-fourth embodiment;
FIG. 26 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-fifth embodiment;
FIG. 27 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-sixth embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The candidate word evaluation method provided by the application can be applied to the application environment shown in fig. 1. The internal structure of the computer apparatus may be as shown in fig. 1, and includes a processor, a memory, a display screen, and an input device connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The computer program is executed by a processor to implement a candidate word evaluation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The computer device can be a terminal, including but not limited to various personal computers, laptops, smart phones and intelligent interactive tablets. When the terminal is an intelligent interactive tablet, it can detect and recognize the user's writing operation, detect the written content, and even automatically correct wrongly written words.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
As shown in fig. 2, in a first embodiment, a candidate word evaluation method is provided, which includes the following steps:
Step S11, detecting the wrong word and acquiring a plurality of candidate words corresponding to the wrong word.
The wrong words include misspelled words or miswritten words, that is, words not present in the corresponding lexicon.
The wrong words can be text items such as symbols, characters, words, English words and the like. A candidate word is a word similar to the wrong word that is determined from the wrong word, that is, a possible correct word (error-correcting word) for the wrong word. The number of candidate words may be one, two, three, etc., and the number of candidate words is not limited in the embodiment of the present invention. If only one candidate word is determined, that candidate word is directly taken as the error-correcting word.
The wrong word can be detected in various ways, such as dictionary detection, and the embodiment of the invention does not limit the method for detecting the wrong word.
Step S12, determining the edit distance between each candidate word and the wrong word.
The edit distance between the candidate word and the wrong word measures the degree of difference between them; it may refer to the difference in the number of letters between the candidate word and the wrong word, the stroke difference, etc. For a character string or an English word, the edit distance is the minimum number of editing operations required to change one string into another, where the permitted editing operations are replacing a character, inserting a character, and deleting a character.
In one embodiment, a two-dimensional array d[i, j] is used to store edit distance values, where d[i, j] is the minimum number of steps required to convert the string s[1…i] into the string t[1…j]. When i equals 0, that is, the string s is empty, j characters must be inserted to convert s into t, so d[0, j] = j; when j equals 0, that is, the string t is empty, i characters must be deleted to convert s into t, so d[i, 0] = i. For example, the edit distance between kitten and sitting is 3, because at least the following 3 steps are required: kitten → sitten (k → s), sitten → sittin (e → i), sittin → sitting (insert g).
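The array-based computation described above can be sketched as follows. This is a minimal illustration and not the patent's own implementation; the function name `edit_distance` is an assumption introduced here, and the recurrence used is the standard dynamic-programming one for the three permitted operations (replace, insert, delete).

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character replacements, insertions and
    deletions needed to turn s into t."""
    m, n = len(s), len(t)
    # d[i][j] holds the edit distance between s[:i] and t[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # t is empty: delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # s is empty: insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete s[i-1]
                d[i][j - 1] + 1,         # insert t[j-1]
                d[i - 1][j - 1] + cost,  # replace, or keep on a match
            )
    return d[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

The boundary rows d[i][0] = i and d[0][j] = j are exactly the empty-string cases stated in the text.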
For example, for a wrong word, all known words in the lexicon whose edit distance from the wrong word is 1 or 2 are selected to construct the candidate word set of the wrong word. The candidate word set may also be determined based on other edit distance conditions, depending on the actual situation. Screening the candidate words corresponding to the wrong word improves the reliability of the candidate word evaluation result and also helps reduce subsequent computation time.
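One possible way to build such a candidate word set is to enumerate every string reachable by one or two edits and keep those found in the lexicon. The sketch below assumes lowercase English words; the names `edits1` and `candidate_set` and the tiny lexicon are hypothetical, and the text does not prescribe this particular enumeration strategy.

```python
import string


def edits1(word: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """All strings one replacement, insertion or deletion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + c + right[1:]
                for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return (deletes | replaces | inserts) - {word}


def candidate_set(wrong_word: str, lexicon: set[str]) -> set[str]:
    """Lexicon words whose edit distance from wrong_word is 1 or 2."""
    one_away = edits1(wrong_word)
    two_away = {e2 for e1 in one_away for e2 in edits1(e1)}
    return ((one_away | two_away) & lexicon) - {wrong_word}


# Hypothetical lexicon, for illustration only.
lexicon = {"some", "same", "one", "apple", "banana"}
print(sorted(candidate_set("sone", lexicon)))  # ['one', 'same', 'some']
```

For a large lexicon it may be cheaper to compute the edit distance against each lexicon entry instead; which approach is faster depends on word length and lexicon size.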
Step S13, acquiring error information of the wrong word relative to each candidate word.
The error information of the wrong word relative to each candidate word characterizes how the wrong word differs from that candidate word. Optionally, the error information may be determined according to users' general habits when writing words, for example which characters users are particularly prone to miswrite, or which parts of a word are most prone to errors.
In an embodiment, the error information of the wrong word relative to each candidate word may refer to whether the wrong word and the candidate word have the same first letter, whether the wrong word and the candidate word have the same number of characters, whether the wrong word and the candidate word have the same components, whether the wrong word contains an illegal symbol, or the like.
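As an illustration, two of the kinds of error information listed above (same first letter, same number of characters) can be collected in a small record. The `ErrorInfo` class and `error_info` function are hypothetical names introduced for this sketch; the text does not define a concrete data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    same_first_letter: bool  # wrong word and candidate start with the same letter
    same_length: bool        # wrong word and candidate have the same character count


def error_info(wrong_word: str, candidate: str) -> ErrorInfo:
    """Collect simple difference features of wrong_word relative to candidate."""
    same_first = (
        bool(wrong_word) and bool(candidate) and wrong_word[0] == candidate[0]
    )
    return ErrorInfo(
        same_first_letter=same_first,
        same_length=len(wrong_word) == len(candidate),
    )


print(error_info("sone", "some"))
print(error_info("sone", "one"))
```

Other features named in the text, such as component comparison or illegal-symbol checks, would need definitions that the text leaves open.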
Step S14, determining the evaluation score corresponding to each candidate word according to the edit distance and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word. The step is equivalent to effectively evaluating the possibility that the candidate word is the error-correcting word corresponding to the error word according to the editing distance information and the error information of the candidate word and the error word.
When a wrong word is detected, the candidate word evaluation method of the embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance and the error information; the method not only considers the habit characteristics of the user when writing the words, but also considers the editing distance information between the words, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In an embodiment, determining the evaluation score corresponding to each candidate word according to the edit distance and the error information includes: determining the evaluation score corresponding to each candidate word according to the reciprocal of the edit distance and the error information.
Generally, the larger the edit distance, the greater the difference between the candidate word and the wrong word, and the less likely the candidate word is to be the error-correcting word of the wrong word. Using the reciprocal of the edit distance together with the error information therefore helps ensure the reliability of the evaluation score of the candidate word.
In one embodiment, the error information of the wrong word relative to each candidate word includes whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the edit distance and the error information includes: if the wrong word and the candidate word have the same first letter, calculating the evaluation score corresponding to the candidate word according to the reciprocal of the edit distance and a first coefficient; and if the wrong word and the candidate word have different first letters, calculating the evaluation score corresponding to the candidate word according to the reciprocal of the edit distance and a second coefficient.
For example, suppose that the error information of the wrong word relative to each candidate word is represented by $K$, the edit distance is represented by $D_{edit}$, and the evaluation score corresponding to the candidate word is represented by $\mathrm{score}_{word}$. The formula for calculating the score of each candidate word may then be:
$$\mathrm{score}_{word} = K \times \frac{1}{D_{edit}}$$
The value of $K$ is selected according to whether the first letters of the candidate word and the wrong word are the same: if they are the same, $K$ takes the value $K_1$; otherwise, $K$ takes the value $K_2$, where $K_1$ and $K_2$ are preset values. For example, $K = 1$ when the first letters are the same and $K = 0.5$ when they are not.
For an English word or a character string, users generally do not get the first letter wrong when writing, so evaluating the possibility that each candidate word is the error-correcting word of the wrong word in this way allows the evaluation score of the candidate word to be determined more reasonably. Other conditions being equal, a candidate word with the same first letter as the wrong word is more likely to be the error-correcting word, and a candidate word with a different first letter is less likely to be. It is understood that the values of $K$ include, but are not limited to, the values in the above example.
For example, suppose that a user writing English sentences on the intelligent interactive tablet miswrites some as sone, and the context is "I have sone apples". The candidate words corresponding to sone are evaluated as follows:
1. Through English dictionary detection, sone is found not to be in the English dictionary and is determined to be a wrong word.
2. Acquire the candidate words of sone; assume that some, same and one are the candidate words.
3. Determine the edit distances between sone and each of some, same and one, which are 1, 2 and 1 respectively.
4. Calculate the score of each candidate word according to the formula $\mathrm{score}_{word} = K \times \frac{1}{D_{edit}}$:
$\mathrm{score}_{some} = 1 \times 1 = 1$;
$\mathrm{score}_{same} = 1 \times 1/2 = 0.5$;
$\mathrm{score}_{one} = 0.5 \times 1 = 0.5$.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Therefore, the finally obtained evaluation score corresponding to each candidate word is obtained by comprehensively considering the habit characteristics of the user when writing the word and the size of the editing distance between the words, and compared with the traditional candidate word evaluation method based on the editing distance, the method is beneficial to improving the reliability of the evaluation result of the candidate word and ensuring the accuracy of the error-correcting word.
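The scoring of the first embodiment can be reproduced with a few lines of code. This is a sketch under the example values given above (K = 1 for the same first letter, K = 0.5 otherwise, and edit distances 1, 2 and 1 for some, same and one); the function name `evaluation_score` and its parameter names are assumptions made for illustration.

```python
def evaluation_score(wrong_word: str, candidate: str, distance: int,
                     k_same: float = 1.0, k_diff: float = 0.5) -> float:
    """Score = K * 1/D_edit, with K chosen by first-letter agreement."""
    if distance <= 0:
        raise ValueError("edit distance must be positive for a wrong word")
    k = k_same if wrong_word[:1] == candidate[:1] else k_diff
    return k / distance


# Edit distances between "sone" and each candidate, taken from the example.
distances = {"some": 1, "same": 2, "one": 1}
scores = {cand: evaluation_score("sone", cand, d) for cand, d in distances.items()}
best = max(scores, key=scores.get)

print(scores)  # {'some': 1.0, 'same': 0.5, 'one': 0.5}
print(best)    # some
```

Note that same and one tie at 0.5 here, so this scoring alone does not rank them against each other.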
As shown in fig. 3, in a second embodiment, a candidate word evaluation method includes the following steps:
step S21, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the corresponding step S11 implemented above, which is not described in detail.
Step S22, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
The longest common subsequence problem is that of finding the longest subsequence common to all sequences in a set of sequences (typically two sequences). A sequence is called the longest common subsequence of the known sequences if it is a subsequence of each of two or more known sequences and is the longest of all sequences that meet this condition.
The longest common substring problem is that of finding the longest substring common to all sequences in a set of sequences (typically two sequences). A string is called the longest common substring of the known sequences if it is a substring of each of two or more known sequences and is the longest of all substrings that meet this condition.
A subsequence is taken from a sequence; its elements keep their order but need not be contiguous. For example, the sequence X = <B, C, D, B> is a subsequence of the sequence Y = <A, B, C, B, D, A, B>, and the corresponding index sequence is <2, 3, 5, 7>. A substring is taken from a character string; its characters keep their order and must be contiguous. For example, the string a = abcd is a substring of the string c = aaabcddd, but the string b = acdddd is not a substring of c.
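The difference between the two notions can be checked directly. The helper names `is_subsequence` and `is_substring` below are introduced only for this illustration; they test membership and are not part of the similarity calculation itself.

```python
def is_subsequence(x: str, y: str) -> bool:
    """True if the characters of x appear in y in order (gaps allowed)."""
    remaining = iter(y)
    # Each `in` consumes the iterator up to the match, which enforces ordering.
    return all(ch in remaining for ch in x)


def is_substring(a: str, c: str) -> bool:
    """True if a appears in c as one contiguous run."""
    return a in c


print(is_subsequence("BCDB", "ABCBDAB"))   # True
print(is_substring("abcd", "aaabcddd"))    # True
print(is_substring("acdddd", "aaabcddd"))  # False
```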
The longest common subsequence and/or the longest common substring reflects, to a certain extent, the number of characters shared by the candidate word and the wrong word, so determining the similarity between them on this basis is meaningful in practice.
Step S23, acquiring error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
Step S24, determining the evaluation score corresponding to each candidate word according to the similarity and the error information.
The candidate word evaluation score obtained in this step reflects the possibility that each candidate word is the error-correcting word corresponding to the wrong word.
When a wrong word is detected, the candidate word evaluation method of this embodiment first acquires a plurality of corresponding candidate words, determines the similarity between each candidate word and the wrong word, the similarity being obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; it then determines the evaluation score corresponding to each candidate word according to the similarity and the error information. The habits of users when writing words and the similarity between the words are both taken into account, which helps improve the reliability of the candidate word evaluation result.
In an embodiment, the step of determining the similarity between each candidate word and the wrong word may specifically be: calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word.
The longest common subsequence rate, denoted $D_{lcs1}$, is the length of the longest common subsequence divided by the average length of the sequences in the sequence set. The longest common substring rate, denoted $D_{lcs2}$, is the length of the longest common substring divided by the average length of the sequences in the sequence set. For example, for the sequence set {dbcabca, cbaba}, the longest common subsequence is baba, its length is 4, and the longest common subsequence rate is 4/((7+5)/2) = 0.67; for the sequence set {babca, cbaba}, the longest common substring is bab, its length is 3, and the longest common substring rate is 3/((5+5)/2) = 0.6.
The longest common subsequence rate and the longest common substring rate consider not only the number of characters shared by the candidate word and the wrong word but also the proportion of shared characters, which helps further improve the accuracy of the similarity between the candidate word and the wrong word.
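The two rates can be computed with standard dynamic programming, as in the sketch below. The function names (`lcs_length`, `longest_common_substring_length`, `rate`) are assumptions for illustration; the rate follows the definition above, dividing the common length by the average length of the two strings.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (in order, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            run = prev[j - 1] + 1 if ch_a == ch_b else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def rate(common_length: int, a: str, b: str) -> float:
    """Common length divided by the average length of the two strings."""
    average = (len(a) + len(b)) / 2
    return common_length / average if average else 0.0


print(round(rate(lcs_length("dbcabca", "cbaba"), "dbcabca", "cbaba"), 2))       # 0.67
print(rate(longest_common_substring_length("babca", "cbaba"), "babca", "cbaba"))  # 0.6
```

Both functions keep only two rows of the table, so memory use grows with the length of the second string only.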
Optionally, if the longest common subsequence rate is denoted by $D_{lcs1}$, the longest common substring rate by $D_{lcs2}$, and the similarity between each candidate word and the wrong word by $S$, the similarity may be calculated in any of the following ways:
$$S = w_1 D_{lcs1}$$
$$S = w_2 D_{lcs2}$$
$$S = w_1 D_{lcs1} + w_2 D_{lcs2}$$
where $w_1$ and $w_2$ are preset weight coefficients whose sum is 1.
Determining the similarity between the candidate word and the wrong word through the weighted sum of the longest common subsequence rate and the longest common substring rate yields a more accurate similarity. Typically, the longest common subsequence rate $D_{lcs1}$ has a larger influence on the similarity of two words, so its weight coefficient is larger than that of the longest common substring rate $D_{lcs2}$, for example:
$$S = 0.7 D_{lcs1} + 0.3 D_{lcs2}$$
In another embodiment, the step of determining the similarity between each candidate word and the wrong word may further include: calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, together with the edit distance between each candidate word and the wrong word.
Optionally, the similarity between each candidate word and the wrong word may be calculated in any of the following ways:
$$S = \frac{1}{D_{edit}} + w_1 D_{lcs1}$$
or
$$S = \frac{1}{D_{edit}} + w_2 D_{lcs2}$$
or
$$S = \frac{1}{D_{edit}} + w_1 D_{lcs1} + w_2 D_{lcs2}$$
Specific examples include:
$$S = \frac{1}{D_{edit}} + 0.7 D_{lcs1}$$
or
$$S = \frac{1}{D_{edit}} + 0.3 D_{lcs2}$$
or
$$S = \frac{1}{D_{edit}} + 0.7 D_{lcs1} + 0.3 D_{lcs2}$$
Here $D_{edit}$ denotes the edit distance, that is, the minimum number of editing operations required to convert one string into another, where the permitted editing operations are replacing one character, inserting one character and deleting one character. Generally, the smaller the edit distance, the closer the two strings are.
In this way, the similarity between two words is determined by combining the edit distance between the words with the number of characters they share; adding the edit distance as an evaluation dimension helps improve the effectiveness of the similarity calculation.
It can be understood that the weight coefficients applied to the longest common subsequence rate $D_{lcs1}$ and the longest common substring rate $D_{lcs2}$ in the above formulas may also take other values, including but not limited to those in the above examples.
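The similarity variants above differ only in which terms are present, so one function with optional terms can cover them. The sketch below is an assumption-laden illustration: the name `similarity`, the keyword parameters, and the convention of passing a weight of 0 to drop a term are all introduced here, and the input rates 0.75 and 0.5 are illustrative values.

```python
from typing import Optional


def similarity(d_lcs1: float, d_lcs2: float,
               w1: float = 0.7, w2: float = 0.3,
               d_edit: Optional[int] = None) -> float:
    """S = w1*D_lcs1 + w2*D_lcs2, plus 1/D_edit when an edit distance is given.

    Pass w1=0.0 or w2=0.0 to obtain the single-rate variants.
    """
    s = w1 * d_lcs1 + w2 * d_lcs2
    if d_edit is not None:
        if d_edit <= 0:
            raise ValueError("edit distance must be positive")
        s += 1 / d_edit
    return s


# Illustrative rates: D_lcs1 = 0.75, D_lcs2 = 0.5.
print(round(similarity(0.75, 0.5), 3))                    # 0.675
print(round(similarity(0.75, 0.5, d_edit=1), 3))          # 1.675
print(round(similarity(0.75, 0.5, w2=0.0, d_edit=2), 3))  # 1.025
```

Because the reciprocal of the edit distance is added unweighted, it dominates the score for candidates at distance 1; whether that balance is desirable depends on the chosen weights.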
In one embodiment, the error information of the error word relative to each candidate word includes: and whether the wrong word and the candidate word have the same initial letter or not. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the edit distance and the error information includes: if the wrong word is the same as the first letter of the candidate word, calculating an evaluation score corresponding to the candidate word according to the similarity and the first coefficient; and if the wrong word is different from the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the similarity and the second coefficient.
Optionally, let S denote the similarity and K denote the error information of the wrong word relative to each candidate word; the evaluation score of each candidate word may then be calculated as:
score_word = K × S;
where K is selected according to whether the candidate word and the wrong word have the same initial letter: K takes the value K1 if they are the same and K2 otherwise, K1 and K2 being preset values. For example, K is 1 when the initial letters are the same and 0.5 when they differ.
For an English word or character string, the user's writing habits mean that the initial letter is generally not written incorrectly. With the above embodiment, the possibility that each candidate word is the error-correcting word of the wrong word can therefore be evaluated, and the evaluation score of the candidate word determined, more reasonably: other conditions being equal, a candidate word with the same initial letter as the wrong word is more likely to be the error-correcting word, and a candidate word with a different initial letter is less likely to be. It is understood that the values of K include, but are not limited to, the values in the above examples.
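The coefficient selection can be illustrated with a minimal sketch. The function name evaluation_score, its parameter names and the default values 1.0 and 0.5 (taken from the example values of K) are assumptions made for illustration; a case-sensitive comparison of the first characters is also assumed.

```python
def evaluation_score(wrong_word: str, candidate: str, similarity: float,
                     k_same: float = 1.0, k_diff: float = 0.5) -> float:
    """score_word = K x S, with K chosen by comparing the initial letters."""
    # Assumption: the initial letters are compared exactly (case-sensitive).
    same_initial = wrong_word[:1] == candidate[:1]
    k = k_same if same_initial else k_diff
    return k * similarity


print(evaluation_score("sone", "some", 2.0))  # same initial letter, K = 1
print(evaluation_score("sone", "one", 2.0))   # different initial letter, K = 0.5
```

Keeping K1 and K2 as parameters reflects the statement that they are preset values which are not limited to 1 and 0.5.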
For example, suppose that when a user writes an English sentence on the intelligent interactive tablet, some is mistakenly written as sone, and the context is "I have sone applets". The candidate words corresponding to sone are evaluated as follows:
1. Through English dictionary detection, it is found that sone is not in the English dictionary, and sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same and one are found and taken as candidate words.
3. Determine the longest common subsequence rate and the longest common substring rate of some, same and one with respect to sone, and calculate the similarity corresponding to each candidate word, as follows:
the longest common subsequence of the candidate word some and the wrong word sone is soe, with length 3, so the longest common subsequence rate is 3/((4+4)/2) = 0.75; the longest common substring is so, with length 2, so the longest common substring rate is 2/((4+4)/2) = 0.5;
the longest common subsequence of the candidate word same and the wrong word sone is se, with length 2, so the longest common subsequence rate is 2/((4+4)/2) = 0.5; the longest common substring is s or e, with length 1, so the longest common substring rate is 1/((4+4)/2) = 0.25;
the longest common subsequence of the candidate word one and the wrong word sone is one, with length 3, so the longest common subsequence rate is 3/((3+4)/2) ≈ 0.86; the longest common substring is one, with length 3, so the longest common substring rate is 3/((3+4)/2) ≈ 0.86;
According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity S between each candidate word and the wrong word is calculated as follows:
Some:1+0.7×0.75+0.3×0.5=1.675;
Same:1/2+0.7×0.5+0.3×0.25=0.925;
One:1+0.7×0.86+0.3×0.86=1.86。
4. According to the formula score_word = K × S, calculate the evaluation score of each candidate word:
score_some = 1 × 1.675 = 1.675;
score_same = 1 × 0.925 = 0.925;
score_one = 0.5 × 1.86 = 0.93.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
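The worked example above can be reproduced with the following sketch. The helper names are hypothetical, the edit distances are taken as given from step 2, and each rate is computed as the common length divided by the average length of the two words, as in step 3. Because the sketch does not round the rate 3/3.5 to 0.86, the values for one come out as about 1.857 and 0.929 rather than 1.86 and 0.93.

```python
def lcs_subsequence_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_substring_len(a: str, b: str) -> int:
    """Length of the longest common substring (characters must be contiguous)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def rate(common_len: int, a: str, b: str) -> float:
    """Common length divided by the average length of the two words."""
    return common_len / ((len(a) + len(b)) / 2)


def similarity(wrong: str, cand: str, d_edit: int,
               w1: float = 0.7, w2: float = 0.3) -> float:
    """S = 1/D_edit + w1 * D_lcs1 + w2 * D_lcs2."""
    d_lcs1 = rate(lcs_subsequence_len(wrong, cand), wrong, cand)
    d_lcs2 = rate(lcs_substring_len(wrong, cand), wrong, cand)
    return 1 / d_edit + w1 * d_lcs1 + w2 * d_lcs2


wrong = "sone"
edit_distances = {"some": 1, "same": 2, "one": 1}  # taken from step 2 of the example
for cand, d_edit in edit_distances.items():
    s = similarity(wrong, cand, d_edit)
    k = 1.0 if cand[0] == wrong[0] else 0.5  # K = 1 for same initial, else 0.5
    print(cand, round(s, 3), round(k * s, 3))
```

The ranking is the same as in the example: some scores highest, followed by one and same.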
With the candidate word evaluation method of this embodiment, the evaluation score corresponding to each candidate word is positively correlated with the similarity between the candidate word and the wrong word, and is also influenced by whether the initial letter of the candidate word differs from that of the wrong word. The resulting evaluation score thus takes into account both the user's habits when writing words and the similarity between the words, which helps improve the reliability of the candidate word evaluation result compared with the traditional candidate word evaluation method based on the edit distance.
As shown in fig. 4, in a third embodiment, a candidate word evaluation method is provided, which includes the following steps:
Step S31: when a wrong word is detected, obtain a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the corresponding step S11 of the above embodiment, which is not described again here.
Step S32: determine the language environment probability of each candidate word at the wrong word position.
The language environment probability of a candidate word at the wrong word position indicates how reasonable the candidate word is in its context when it replaces the wrong word; the more reasonable the candidate word is in context, the higher its language environment probability.
In an embodiment, the probability of each candidate word at the wrong word position is calculated according to a preset language model, and the log value of the probability is used as the language environment probability of the candidate word.
Step S33: acquire error information of the wrong word relative to each candidate word.
With regard to the implementation of this step, reference may be made to the description of the first embodiment above.
Step S34: determine the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
In the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of corresponding candidate words are obtained, the language environment probability of each candidate word at the wrong word position is determined, and error information of the wrong word relative to each candidate word is determined; the evaluation score corresponding to each candidate word is then determined according to the language environment probability and the error information. The method takes into account both the user's habits when writing words and the context information, which helps improve the reliability of the candidate word evaluation result.
In one embodiment, the language model used to determine the language environment probability of each candidate word at the wrong word position includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
The N-Gram model is a statistical language model that predicts the Nth item from the preceding (N-1) items. At the application level, these items can be phonemes (speech recognition applications), characters (input method applications), words (word segmentation applications) or base pairs (genetic information). The idea of the N-Gram model is: given a string of letters, such as "for ex", what is the most likely next letter? From training corpus data, the probability distribution of the next letter can be obtained by maximum likelihood estimation: the probability of a is 0.4, the probability of b is 0.0001, the probability of c is …, subject to the constraint that all of these probabilities sum to 1.
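The maximum likelihood idea can be sketched with a character-level N-Gram counter. The toy corpus, the function names and the choice of N = 3 are assumptions made purely for illustration; a practical model would be trained on a large corpus and would normally apply smoothing, which is omitted here.

```python
from collections import Counter, defaultdict


def train_char_ngram(corpus: str, n: int):
    """Count how often each character follows each (n-1)-character context."""
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        context = corpus[i:i + n - 1]
        next_char = corpus[i + n - 1]
        counts[context][next_char] += 1
    return counts


def next_char_distribution(counts, context: str) -> dict:
    """Maximum likelihood estimate: P(next | context) = c(context, next) / c(context)."""
    followers = counts.get(context, Counter())
    total = sum(followers.values())
    return {ch: c / total for ch, c in followers.items()}


# Hypothetical toy corpus, used only to make the sketch runnable.
corpus = "for example, for exercise, for extra credit, for everyone"
model = train_char_ngram(corpus, n=3)
dist = next_char_distribution(model, "ex")
print(dist)
print(sum(dist.values()))
```

For any context that occurs in the corpus, the estimated probabilities of the possible next characters sum to 1, which is the constraint mentioned above.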
The long short-term memory network model, commonly called the LSTM model, is a special recurrent neural network; it can predict the next likely character from an input character-level sequence.
The bidirectional long short-term memory network model, commonly called the BiLSTM model, is structurally identical to the LSTM model, except that the BiLSTM model is connected not only to past states but also to future states. For example, a unidirectional LSTM is trained to predict "fish" by being fed the letters one by one (the recurrent connections along the time axis remember the past state values), whereas a BiLSTM is also fed the next letter in the sequence on its backward pass, which gives it access to future information. This form of training allows the network to fill in gaps between pieces of information rather than only predict what comes next.
In one embodiment, when the language environment probability is represented by the log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the reciprocal of the language environment probability and the error information. The evaluation score corresponding to each candidate word therefore varies with the reciprocal of the corresponding log probability value, and is also influenced by the difference information between the candidate word and the wrong word. The evaluation score of each candidate word thus takes into account both the user's habits when writing words and the language environment information, which improves the reliability of candidate word evaluation compared with the traditional candidate word evaluation method.
In one embodiment, the error information of the wrong word relative to each candidate word includes whether the wrong word and the candidate word have the same initial letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score corresponding to the candidate word according to the language environment probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score corresponding to the candidate word according to the language environment probability and a second coefficient.
Based on any of the above embodiments, if the language environment probability of the candidate word at the wrong word position is denoted by
Figure BDA0001625447080000111
where word denotes a candidate word and mx denotes a language model, the evaluation score of each candidate word is calculated according to the following formula:
Figure BDA0001625447080000112
where K takes the value K1 if the candidate word and the wrong word have the same initial letter, and K2 otherwise; K1 and K2 are preset values.
For example, assume that the language environment probability determined by the N-Gram language model is
Figure BDA0001625447080000113
The formula for calculating the score of each candidate word may be:
Figure BDA0001625447080000114
Assume that the language environment probabilities determined by the BiLSTM model and the LSTM model are respectively denoted by
Figure BDA0001625447080000115
Figure BDA0001625447080000116
The formula for calculating the score of each candidate word may be:
Figure BDA0001625447080000121
Figure BDA0001625447080000122
The value of K is selected according to whether the candidate word and the wrong word have the same initial letter: K is 1 when they are the same and 0.5 when they differ. It is understood that the values of K include, but are not limited to, the values in the above examples.
For an English word or character string, the user's writing habits mean that the initial letter is generally not written incorrectly, so evaluating with the above embodiment the possibility that each candidate word is the error-correcting word of the wrong word allows the evaluation score of the candidate word to be determined more reasonably: a candidate word with the same initial letter as the wrong word is more likely to be the error-correcting word, and a candidate word with a different initial letter is less likely to be.
For example, suppose that when a user writes an English sentence on the smart interactive tablet, some is mistakenly written as sone, and the context is "I have sone applets".
For the N-Gram model, with N taking the value 3:
1. Through English dictionary detection, it is found that sone is not in the English dictionary, and sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one are found and taken as candidate words.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, and calculate for each candidate word, according to the N-Gram model,
Figure BDA0001625447080000123
The language environment probability for each candidate word is obtained as follows:
Figure BDA0001625447080000124
Figure BDA0001625447080000125
Figure BDA0001625447080000126
Figure BDA0001625447080000127
where P(I, have, some) = c(I, have, some)/c(3-Gram), c denoting a count; that is, the number of occurrences of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
where P(have | I) = c(have, I)/c(I), that is, the count of the 2-tuple (have, I) in the corpus divided by the count of all I;
where P(I) = c(I)/c(1-Gram), that is, the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula
Figure BDA0001625447080000128
Calculating an evaluation score for each candidate word:
score_some = -1 × 1/(-5.7) = 0.175;
score_same = -1 × 1/(-6) = 0.167;
score_son = -1 × 1/(-6) = 0.167;
score_one = -0.5 × 1/(-4.8) = 0.104.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
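The arithmetic of step 4 can be checked with a short sketch. The scoring form, minus K times the reciprocal of the log probability, is inferred from the numbers in the worked example, and the log values -5.7, -6, -6 and -4.8 are read off the same example; the function name context_score and its parameters are hypothetical.

```python
def context_score(wrong_word: str, candidate: str, log_prob: float,
                  k_same: float = 1.0, k_diff: float = 0.5) -> float:
    """score_word = -K * (1 / log_prob); log_prob is a negative log value."""
    k = k_same if candidate[0] == wrong_word[0] else k_diff
    return -k * (1.0 / log_prob)


# Log values of the language environment probability, as used in the example.
log_probs = {"some": -5.7, "same": -6.0, "son": -6.0, "one": -4.8}
scores = {word: round(context_score("sone", word, lp), 3)
          for word, lp in log_probs.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Because the log value is negative, the leading minus sign makes the score positive, and a log value closer to zero (a more probable candidate) yields a larger score.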
In another embodiment, for the BiLSTM model, steps 1 and 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word
Figure BDA0001625447080000131
Each candidate word is calculated as follows:
Figure BDA0001625447080000132
Figure BDA0001625447080000133
Figure BDA0001625447080000134
Figure BDA0001625447080000135
step 4 is replaced by:
according to the formula
Figure BDA0001625447080000136
Calculating an evaluation score for each candidate word:
score_some = -1 × 1/(-4.9) = 0.204;
score_same = -1 × 1/(-5.07) = 0.197;
score_son = -1 × 1/(-7.6) = 0.132;
score_one = -0.5 × 1/(-4.66) = 0.107.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, steps 1 and 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word
Figure BDA0001625447080000137
Each candidate word is calculated as follows:
Figure BDA0001625447080000138
Figure BDA0001625447080000139
Figure BDA0001625447080000141
Figure BDA0001625447080000142
step 4 is replaced by:
according to the formula
Figure BDA0001625447080000143
Calculating an evaluation score for each candidate word:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/(-6.3) = 0.159;
score_son = -1 × 1/(-6) = 0.116;
score_one = -0.5 × 1/(-5.1) = 0.098.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Calculating the candidate word evaluation scores under the above three language models takes into account both the user's habits when writing words and the language environment information, so the error-correcting word for a wrong word can be determined more effectively and the error correction accuracy is improved.
As shown in fig. 5, in a fourth embodiment, there is provided a candidate word evaluation method including the steps of:
Step S41: when a wrong word is detected, obtain a plurality of candidate words corresponding to the wrong word.
The specific implementation manner of this step refers to step S11 in the above embodiment, which is not described in detail.
Step S42, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
The wrong word is replaced with each candidate word in turn to obtain the corresponding candidate sentences; a candidate sentence may contain a plurality of words, one of which is the candidate word. The language environment probability of a word at its position in the sentence indicates how reasonable the word is relative to its context; the more reasonable the word is in context, the higher its language environment probability.
In one embodiment, a candidate word may have one or more neighboring words, which may include both the words immediately adjacent to the candidate word and words separated from it by one position. The probability of the candidate word and of each of its neighboring words at their respective positions in the candidate sentence can be calculated according to a preset language model, and the log value of each probability is used as the language environment probability of the corresponding word; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its neighboring words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. This average may be a simple (unweighted) average or a weighted average.
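The averaging step can be sketched as follows. The function name evaluation_probability, the use of the natural logarithm (the base of the log is not specified above) and the sample probabilities are all assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def evaluation_probability(word_probs: Sequence[float],
                           weights: Optional[Sequence[float]] = None) -> float:
    """Average the log probabilities of the candidate word and its neighboring words.

    word_probs holds the probability of the candidate word and of each neighboring
    word at its position in the candidate sentence (hypothetical model outputs).
    """
    # Assumption: natural logarithm.
    log_probs = [math.log(p) for p in word_probs]
    if weights is None:
        return sum(log_probs) / len(log_probs)  # simple average
    weighted_sum = sum(w * lp for w, lp in zip(weights, log_probs))
    return weighted_sum / sum(weights)          # weighted average


# Hypothetical probabilities for a candidate word and one neighboring word.
probs = [math.exp(-2.0), math.exp(-4.0)]
print(evaluation_probability(probs))
print(evaluation_probability(probs, weights=[3, 1]))
```

A weighted average lets the candidate word itself count for more than its neighboring words, while the simple average treats all positions equally.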
Step S43: acquire error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
Step S44: determine the evaluation score corresponding to each candidate word according to the evaluation probability and the error information.
The evaluation score obtained by the step can reflect the possibility that each candidate word is the error-correcting word corresponding to the error word.
In the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of corresponding candidate words are first obtained, the evaluation probability corresponding to each candidate word is determined, and error information of the wrong word relative to each candidate word is determined; the evaluation score corresponding to each candidate word is then determined according to the evaluation probability and the error information. The method takes into account both the user's habits when writing words and the context information, which helps improve the reliability of the candidate word evaluation result.
In one embodiment, the language model used to calculate the probabilities of the candidate words and their neighboring words at their respective positions in the candidate sentence includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For an explanation of each language model, reference may be made to the description of the third embodiment.
In one embodiment, the language environment probability of each word is represented by the log value of the probability of the word at its position, the evaluation probability of a candidate word is obtained on this basis, and the evaluation score corresponding to each candidate word may then be determined according to the reciprocal of the evaluation probability of the candidate word and the error information.
For example, if the evaluation probability of the candidate word is denoted by
Figure BDA0001625447080000151
the error information of the wrong word relative to each candidate word is denoted by K, and the evaluation score corresponding to the candidate word is denoted by score_word, then the evaluation score of each candidate word can be calculated according to the following formula:
Figure BDA0001625447080000152
where K takes the value K1 if the candidate word and the wrong word have the same initial letter, and K2 otherwise; K1 and K2 are preset values.
The evaluation score corresponding to each candidate word therefore varies with the reciprocal of the corresponding evaluation probability, and is also influenced by whether the candidate word and the wrong word have the same initial letter. The resulting evaluation score takes into account both the user's habits when writing words and the context information, which improves the reliability of the candidate word evaluation result compared with the traditional method of evaluating candidate words by edit distance.
In one embodiment, the error information of the wrong word relative to each candidate word includes whether the wrong word and the candidate word have the same initial letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the evaluation probability and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score corresponding to the candidate word according to the evaluation probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score corresponding to the candidate word according to the evaluation probability and a second coefficient.
Optionally, assume that the evaluation probability determined by the N-Gram language model is denoted by
Figure BDA0001625447080000161
The formula for calculating the score of each candidate word may be:
Figure BDA0001625447080000162
The value of K is selected according to whether the candidate word and the wrong word have the same initial letter: K is 1 when they are the same and 0.5 when they differ. It is understood that the values of K include, but are not limited to, the values in the above examples.
For an English word or character string, the user's writing habits mean that the initial letter is generally not written incorrectly, so evaluating with the above embodiment the possibility that each candidate word is the error-correcting word of the wrong word allows the evaluation score of the candidate word to be determined more reasonably: a candidate word with the same initial letter as the wrong word is more likely to be the error-correcting word, and a candidate word with a different initial letter is less likely to be.
For example, suppose that when a user writes an English sentence on the smart interactive tablet, some is mistakenly written as sone, and the context is "I have sone applets".
1. Through English dictionary detection, it is found that sone is not in the English dictionary, and sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one are found and taken as candidate words.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, to obtain:
candidate sentence one: I have some applets;
candidate sentence two: I have same applets;
candidate sentence three: I have son applets;
candidate sentence four: I have one applets.
Based on the N-Gram model, with N taking the value 3, the neighboring word of the candidate word some in candidate sentence one is applets, the neighboring word of the candidate word same in candidate sentence two is applets, the neighboring word of the candidate word son in candidate sentence three is applets, and the neighboring word of the candidate word one in candidate sentence four is applets. On this basis, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
Figure BDA0001625447080000163
Figure BDA0001625447080000171
Figure BDA0001625447080000172
Figure BDA0001625447080000173
4. according to the formula
Figure BDA0001625447080000174
Calculating an evaluation score for each candidate word:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/(-6.35) = 0.157;
score_son = -1 × 1/(-7.58) = 0.132;
score_one = -0.5 × 1/(-5.26) = 0.095.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the BiLSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and applets, the neighboring words of the candidate word same in candidate sentence two are I, have and applets, the neighboring words of the candidate word son in candidate sentence three are I, have and applets, and the neighboring words of the candidate word one in candidate sentence four are I, have and applets. On this basis, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
Figure BDA0001625447080000175
Figure BDA0001625447080000176
Figure BDA0001625447080000181
Figure BDA0001625447080000182
step 4 is replaced by:
according to the formula
Figure BDA0001625447080000183
Calculating an evaluation score for each candidate word:
score_some = -1 × 1/(-2.88) = 0.347;
score_same = -1 × 1/(-3.62) = 0.276;
score_son = -1 × 1/(-4.1) = 0.244;
score_one = -0.5 × 1/(-2.62) = 0.191.
It can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and applets, the neighboring words of the candidate word same in candidate sentence two are I, have and applets, the neighboring words of the candidate word son in candidate sentence three are I, have and applets, and the neighboring words of the candidate word one in candidate sentence four are I, have and applets. On this basis, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
Figure BDA0001625447080000184
Figure BDA0001625447080000185
Figure BDA0001625447080000186
Figure BDA0001625447080000187
step 4 is replaced by:
according to the formula
Figure BDA0001625447080000191
Calculating an evaluation score for each candidate word:
score_some = -1 × 1/(-3.175) = 0.315;
score_same = -1 × 1/(-3.75) = 0.267;
score_son = -1 × 1/(-4.6) = 0.217;
score_one = -0.5 × 1/(-3.45) = 0.145.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Candidate word evaluation under the above three language models takes into account the user's habits when writing words and also adds evaluation information from the language environment, which improves the reliability of candidate word evaluation.
As shown in fig. 6, in a fifth embodiment, there is provided a candidate word evaluation method including the steps of:
Step S51: when a wrong word is detected, obtain a plurality of candidate words corresponding to the wrong word.
Regarding the implementation manner of this step, refer to step S11 of the first embodiment, which is not described in detail.
Step S52: determine the edit distance between each candidate word and the wrong word.
As to the implementation of this step, reference may be made to the description of the first embodiment.
Step S53, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
With regard to the implementation of this step, reference may be made to the description of the fourth embodiment.
Step S54: determine the evaluation score of each candidate word according to the edit distance and the evaluation probability.
The evaluation score obtained by this step can represent the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
With the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; the possibility that each candidate word is the error-correcting word corresponding to the wrong word is then comprehensively evaluated according to the edit distance between the candidate word and the wrong word and the context information. Compared with a traditional candidate word evaluation method that depends only on the edit distance, this improves the reliability of candidate word evaluation.
In one embodiment, the probabilities of the candidate words and the adjacent words of the candidate words in the candidate sentences at the positions of the candidate words and the adjacent words of the candidate words are calculated according to a preset language model, and the log values of the probabilities are used as the language environment probabilities of the words; further, the language environment probabilities of the candidate words in the candidate sentences and the language environment probabilities of the adjacent words of the candidate words are averaged to obtain the evaluation probabilities of the candidate words in the candidate sentences.
The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For each language model, reference may be made to the description of the third embodiment above.
In one embodiment, the language environment probability of each word is represented by the log value of the probability of the word at its position, the evaluation probability of a candidate word is obtained on this basis, and the evaluation score corresponding to each candidate word may then be determined according to the reciprocal of the edit distance and the reciprocal of the evaluation probability.
For example: the evaluation score for each candidate word may be calculated according to the following formula:
Figure BDA0001625447080000201
wherein D iseditRepresenting the edit distance of the candidate word from the wrong word, word representing the candidate word,
Figure BDA0001625447080000202
represents the evaluation probability of the candidate word, scorewordRepresenting the evaluation score corresponding to the candidate word.
Therefore, the evaluation score corresponding to each candidate word is inversely related to the corresponding evaluation probability and is also influenced by the edit distance between the candidate word and the wrong word. The resulting evaluation score takes both the edit distance and the context information into account; compared with the traditional method of evaluating candidate words by edit distance alone, this improves the reliability of candidate word evaluation.
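As a minimal sketch, the score calculation of this embodiment might be written as below. The formula itself is an image that is not reproduced in the text, so the form used here (a leading minus sign, the reciprocal of the edit distance and the reciprocal of the evaluation probability) is inferred from the worked arithmetic in the example that follows; the function and parameter names are hypothetical.

```python
def evaluation_score(edit_distance, evaluation_probability):
    """Score = -(1 / D_edit) * (1 / evaluation probability).

    The evaluation probability is an averaged log value and is therefore
    negative, so the leading minus sign makes the score positive.
    """
    if edit_distance <= 0:
        raise ValueError("edit distance must be positive")
    if evaluation_probability >= 0:
        raise ValueError("evaluation probability is expected to be a negative log value")
    return -(1.0 / edit_distance) * (1.0 / evaluation_probability)


# Values taken from the N-Gram example of this embodiment.
print(round(evaluation_score(1, -5.6), 3))
print(round(evaluation_score(2, -6.35), 4))
```

A smaller edit distance and an evaluation probability closer to zero (that is, a more probable sentence) both raise the score.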
For example, suppose that when a user writes an English sentence on the smart interactive tablet, some is wrongly written as sone, and the context language environment is I have sone applets. The candidate words corresponding to sone are evaluated as follows:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, so sone is determined to be a wrong word.
2. Acquire the candidate words of sone; assume that some, same, son and one exist as candidate words.
3. Determine the edit distances between sone and some, same, son and one, which are 1, 2, 1 and 1, respectively.
4. Replace sone in I have sone applets with some, same, son and one respectively, obtaining:
Candidate sentence one: I have some applets;
Candidate sentence two: I have same applets;
Candidate sentence three: I have son applets;
Candidate sentence four: I have one applets;
Based on the N-Gram model, with N taking a value of 3, the adjacent word of the candidate word some in candidate sentence one is applets, the adjacent word of the candidate word same in candidate sentence two is applets, the adjacent word of the candidate word son in candidate sentence three is applets, and the adjacent word of the candidate word one in candidate sentence four is applets; on this basis, the evaluation probabilities of the candidate words in the candidate sentences are calculated as follows:
Figure BDA0001625447080000211
Figure BDA0001625447080000212
Figure BDA0001625447080000213
Figure BDA0001625447080000214
5. According to the formula
Figure BDA0001625447080000215
calculate the evaluation score of each candidate word:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1/2 × 1/(-6.35) = 0.0787;
score_son = -1 × 1/(-7.58) = 0.132;
score_one = -1 × 1/(-5.26) = 0.19.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
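The edit distances used in step 3 of this example can be checked with a short Python sketch. It assumes the standard Levenshtein distance (insertion, deletion and substitution each cost 1); the precise definition used by the method is given in the first embodiment, and the function name here is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


distances = {w: edit_distance("sone", w) for w in ("some", "same", "son", "one")}
print(distances)
```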
In another embodiment, for the BiLSTM model, the adjacent words of the candidate word some in candidate sentence one are I, have and applets, the adjacent words of the candidate word same in candidate sentence two are I, have and applets, the adjacent words of the candidate word son in candidate sentence three are I, have and applets, and the adjacent words of the candidate word one in candidate sentence four are I, have and applets; on this basis, the evaluation probabilities of the candidate words in the candidate sentences are calculated as follows:
Figure BDA0001625447080000221
Figure BDA0001625447080000222
Figure BDA0001625447080000223
Figure BDA0001625447080000224
Step 5 is replaced by:
According to the formula
Figure BDA0001625447080000225
calculate the evaluation score of each candidate word:
score_some = -1 × 1/(-2.88) = 0.347;
score_same = -1/2 × 1/(-3.62) = 0.138;
score_son = -1 × 1/(-4.1) = 0.244;
score_one = -1 × 1/(-2.62) = 0.382.
Similarly, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
In another embodiment, for the LSTM model, the neighboring words corresponding to the candidate word some in the first candidate sentence are I, have and applets, the neighboring words corresponding to the candidate word same in the second candidate sentence are I, have and applets, the neighboring words corresponding to the candidate word son in the third candidate sentence are I, have and applets, and the neighboring words corresponding to the candidate word one in the fourth candidate sentence are I, have and applets; based on the calculation, the evaluation probability corresponding to the candidate words in each candidate sentence is as follows:
Figure BDA0001625447080000226
Figure BDA0001625447080000227
Figure BDA0001625447080000231
Figure BDA0001625447080000232
Step 5 is replaced by:
According to the formula
Figure BDA0001625447080000233
calculate the evaluation score of each candidate word:
score_some = -1 × 1/(-3.175) = 0.315;
score_same = -1/2 × 1/(-3.75) = 0.1335;
score_son = -1 × 1/(-4.6) = 0.217;
score_one = -1 × 1/(-3.45) = 0.29.
Again, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
Therefore, the evaluation scores of the candidate words obtained through the three language models take both the context language environment information and the edit distance information into account, which improves the reliability of candidate word evaluation.
As shown in FIG. 7, in a sixth embodiment, there is provided a candidate word evaluation method including the following steps:
Step S61, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation manner of this step, refer to step S11 of the first embodiment, which is not described in detail.
Step S62, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
The implementation of this step is described with reference to the above description of the second embodiment, and is not repeated.
Step S63, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
The implementation of this step is described with reference to the fourth embodiment, and is not repeated.
Step S64, determining the evaluation score of each candidate word according to the similarity and the evaluation probability.
The evaluation score obtained by this step can represent the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
With the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; the possibility that each candidate word is the error-correcting word corresponding to the wrong word is then evaluated comprehensively according to the similarity between the candidate word and the wrong word and the context language environment information. Compared with a traditional candidate word evaluation method that relies only on the edit distance, this improves the reliability of the candidate word evaluation result.
In one embodiment, a preset language model is used to calculate, for the candidate word and each of its adjacent words in the candidate sentence, the probability of that word at its position, and the log value of each probability is used as the language environment probability of the word; the language environment probabilities of the candidate word and of its adjacent words in the candidate sentence are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. The case of each language model can be seen in the above-described embodiments.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of the candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the reciprocal of the evaluation probability and the similarity.
Optionally, the evaluation score of each candidate word is calculated according to the following formula:
Figure BDA0001625447080000241
where word represents the candidate word, score_word represents the evaluation score corresponding to the candidate word,
Figure BDA0001625447080000242
represents the evaluation probability of the candidate word, mx represents the language model, and S represents the similarity between the candidate word and the wrong word.
Therefore, the evaluation score corresponding to each candidate word is in an inverse relation with the corresponding evaluation probability, and is also influenced by the similarity of the candidate word and the wrong word. The evaluation scores corresponding to the candidate words are obtained after the similarity and the context information are integrated, and compared with a traditional method for evaluating the candidate words by means of editing distance, the evaluation reliability of the candidate words is improved.
For example, suppose that when a user writes an English sentence on the smart interactive tablet, some is wrongly written as sone, and the context language environment is I have sone applets. The candidate words corresponding to sone are evaluated as follows:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, so sone is determined to be a wrong word.
2. Acquire the candidate words of sone; assume that some, same and son exist as candidate words.
3. Determine the longest common subsequence rate and the longest common substring rate of some, same and son with respect to sone, and calculate the similarity corresponding to each candidate word as follows:
The longest common subsequence of the candidate word some and the wrong word sone is soe, with length 3, so the longest common subsequence rate is 3/((4+4)/2) = 0.75; the longest common substring is so, with length 2, so the longest common substring rate is 2/((4+4)/2) = 0.5;
The longest common subsequence of the candidate word same and the wrong word sone is se, with length 2, so the longest common subsequence rate is 2/((4+4)/2) = 0.5; the longest common substring is s or e, with length 1, so the longest common substring rate is 1/((4+4)/2) = 0.25;
The longest common subsequence of the candidate word son and the wrong word sone is son, with length 3, so the longest common subsequence rate is 3/((3+4)/2) = 0.86; the longest common substring is son, with length 3, so the longest common substring rate is 3/((3+4)/2) = 0.86.
According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, where D_lcs1 is the longest common subsequence rate and D_lcs2 is the longest common substring rate, the similarity S between each candidate word and the wrong word is calculated as follows:
some: 1 + 0.7 × 0.75 + 0.3 × 0.5 = 1.675;
same: 1/2 + 0.7 × 0.5 + 0.3 × 0.25 = 0.925;
son: 1 + 0.7 × 0.86 + 0.3 × 0.86 = 1.86.
4. Replace sone in I have sone applets with some, same and son respectively, obtaining:
Candidate sentence one: I have some applets;
Candidate sentence two: I have same applets;
Candidate sentence three: I have son applets.
Based on the N-Gram model, with N taking a value of 3, the adjacent word of the candidate word some in candidate sentence one is applets, the adjacent word of the candidate word same in candidate sentence two is applets, and the adjacent word of the candidate word son in candidate sentence three is applets; on this basis, the evaluation probabilities of the candidate words in the candidate sentences are calculated as follows:
Figure BDA0001625447080000251
Figure BDA0001625447080000252
Figure BDA0001625447080000253
where P(I, have, some) = c(I, have, some)/c(3-Gram), and c represents a count; that is, the count (number) of the 3-tuple I, have, some in the corpus divided by the count of all 3-tuples;
where P(have | I) = c(have, I)/c(I); that is, the count of the 2-tuple have, I in the corpus divided by the count of I;
where P(I) = c(I)/c(1-Gram); that is, the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
5. According to the formula
Figure BDA0001625447080000261
calculate the evaluation score of each candidate word:
score_some = -1.675 × 1/(-5.6) = 0.299;
score_same = -0.925 × 1/(-6.35) = 0.146;
score_son = -1.86 × 1/(-7.58) = 0.245.
Therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
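The similarity calculation of step 3 and the score calculation of step 5 can be sketched in Python as follows. The weights 0.7 and 0.3 and the normalization by the mean word length come from the example above; the score form (minus the similarity times the reciprocal of the evaluation probability) is inferred from the worked arithmetic rather than read from the formula image, and all function and parameter names are hypothetical. The edit distance is passed in as an argument.

```python
def lcs_subsequence_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_substring_length(a, b):
    """Length of the longest common (contiguous) substring of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            run = prev[j - 1] + 1 if ca == cb else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def similarity(candidate, wrong, edit_distance, w_subseq=0.7, w_substr=0.3):
    """S = 1/D_edit + 0.7 * subsequence rate + 0.3 * substring rate."""
    mean_len = (len(candidate) + len(wrong)) / 2
    subseq_rate = lcs_subsequence_length(candidate, wrong) / mean_len
    substr_rate = lcs_substring_length(candidate, wrong) / mean_len
    return 1 / edit_distance + w_subseq * subseq_rate + w_substr * substr_rate


def evaluation_score(sim, evaluation_probability):
    """Score = -S * (1 / evaluation probability); the probability is a negative log value."""
    return -sim * (1 / evaluation_probability)


for word, dist in (("some", 1), ("same", 2), ("son", 1)):
    print(word, round(similarity(word, "sone", dist), 3))
```

For son the sketch yields approximately 1.857 rather than 1.86, because the example rounds each rate to 0.86 before weighting.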
In another embodiment, for the BiLSTM model, the adjacent words of the candidate word some in candidate sentence one are I, have and applets, the adjacent words of the candidate word same in candidate sentence two are I, have and applets, and the adjacent words of the candidate word son in candidate sentence three are I, have and applets; on this basis, the evaluation probabilities of the candidate words in the candidate sentences are calculated as follows:
Figure BDA0001625447080000262
Figure BDA0001625447080000263
Figure BDA0001625447080000264
Step 5 is replaced by:
According to the formula
Figure BDA0001625447080000265
calculate the evaluation score of each candidate word:
score_some = -1.675 × 1/(-2.88) = 0.582;
score_same = -0.925 × 1/(-3.62) = 0.256;
score_son = -1.86 × 1/(-4.1) = 0.454.
Again, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
In another embodiment, for the LSTM model, the neighbor word corresponding to the candidate word some in the first candidate sentence is I, have and applets, the neighbor word corresponding to the candidate word same in the second candidate sentence is I, have and applets, and the neighbor word corresponding to the candidate word son in the third candidate sentence is I, have and applets; based on the calculation, the evaluation probability corresponding to the candidate words in each candidate sentence is as follows:
Figure BDA0001625447080000271
Figure BDA0001625447080000272
Figure BDA0001625447080000273
Step 5 is replaced by:
According to the formula
Figure BDA0001625447080000274
calculate the evaluation score of each candidate word:
score_some = -1.675 × 1/(-3.175) = 0.528;
score_same = -0.925 × 1/(-3.75) = 0.247;
score_son = -1.86 × 1/(-4.6) = 0.404.
Again, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
Therefore, the evaluation score corresponding to each candidate word finally obtained through the three language models comprehensively considers the context language environment information and the similarity information between the words, and therefore the reliability of candidate word evaluation is improved.
As shown in FIG. 8, in a seventh embodiment, there is provided a candidate word evaluation method including the following steps:
Step S71, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation of this step, reference may be made to the description of step S11 of the first embodiment.
Step S72, determining the edit distance between each candidate word and the wrong word.
As to the implementation of this step, reference may be made to the description of the first embodiment.
Step S73, determining the language environment probability of each candidate word at the wrong word position.
With regard to the implementation of this step, reference may be made to the description of the third embodiment.
Step S74, acquiring error information of the wrong word relative to each candidate word.
For the specific implementation of this step, reference may be made to the description of the first embodiment above.
Step S75, determining the evaluation score corresponding to each candidate word according to the edit distance, the language environment probability and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
When a wrong word is detected, the candidate word evaluation method of this embodiment first acquires a plurality of corresponding candidate words and determines the edit distance between each candidate word and the wrong word, the language environment probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word; it then determines the evaluation score corresponding to each candidate word according to the edit distance, the language environment probability and the error information. This takes into account not only the edit distance and the user's habits when writing words but also the context language environment information, which helps improve the reliability of the candidate word evaluation result.
In one embodiment, the language model that determines the language environment probability of each candidate word at the wrong word position includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For the description of each model, refer to the related embodiments described above.
In one embodiment, when the language environment probability is represented by the log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the reciprocal of the edit distance, the reciprocal of the language environment probability, and the error information.
Therefore, the evaluation score corresponding to each candidate word is inversely related to its edit distance from the wrong word and to the corresponding language environment probability, and is also influenced by the error information of the wrong word. The evaluation score of each candidate word is obtained by jointly considering the user's habits when writing words, the edit distance and the language environment information; compared with a traditional candidate word evaluation method, this helps improve the reliability of candidate word evaluation.
In one embodiment, the error information of the wrong word relative to each candidate word includes information on whether the wrong word and the candidate word have the same initial letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the edit distance, the language environment probability and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score corresponding to the candidate word according to the edit distance, the language environment probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score corresponding to the candidate word according to the edit distance, the language environment probability and a second coefficient.
Optionally, the evaluation score of each candidate word is calculated according to the following formula:
Figure BDA0001625447080000281
where word represents the candidate word, D_edit represents the edit distance between the candidate word and the wrong word,
Figure BDA0001625447080000282
represents the language environment probability of the candidate word, mx represents the language model, score_word represents the evaluation score corresponding to the candidate word, and K represents the error information of the wrong word relative to each candidate word: if the candidate word and the wrong word have the same initial letter, the value of K is K1; otherwise, the value of K is K2. K1 and K2 are both preset values.
For example, assume that the language environment probability determined by the N-Gram language model is
Figure BDA0001625447080000291
The formula for calculating the score of each candidate word may be:
Figure BDA0001625447080000292
The value of K is selected according to whether the candidate word and the wrong word have the same initial letter: K is 1 when the initial letters are the same and 0.5 when they differ. It is understood that the values of K include, but are not limited to, the values in the above example.
For an English word or character string, the user's writing habits mean that the initial letter is generally not written wrongly. Evaluating the possibility that each candidate word is the error-correcting word of the wrong word on the basis of the above embodiment therefore allows the evaluation score of the candidate word to be determined more reasonably: if the candidate word has the same initial letter as the wrong word, the candidate word is more likely to be the error-correcting word; otherwise, it is less likely to be the error-correcting word.
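As a minimal sketch of the first-letter coefficient described above, the following Python functions select K according to whether the initial letters match and combine it with the reciprocal of the edit distance and the reciprocal of the language environment probability. The multiplicative form with a leading minus sign is inferred from the worked arithmetic in this embodiment rather than read from the formula image; the function and parameter names are hypothetical, and the defaults K1 = 1 and K2 = 0.5 are the example values given above.

```python
def first_letter_coefficient(candidate, wrong, k_same=1.0, k_diff=0.5):
    """Return K1 if both words start with the same letter, otherwise K2."""
    return k_same if candidate[:1] == wrong[:1] else k_diff


def evaluation_score(candidate, wrong, edit_distance, lang_env_prob,
                     k_same=1.0, k_diff=0.5):
    """Score = -K * (1 / D_edit) * (1 / language environment probability).

    lang_env_prob is a log value and therefore negative.
    """
    k = first_letter_coefficient(candidate, wrong, k_same, k_diff)
    return -k * (1.0 / edit_distance) * (1.0 / lang_env_prob)


# Edit distances and N-Gram log values taken from the example in this embodiment.
for word, dist, log_p in (("some", 1, -5.7), ("same", 2, -6.0),
                          ("son", 1, -6.0), ("one", 1, -4.8)):
    print(word, round(evaluation_score(word, "sone", dist, log_p), 3))
```

With these inputs, the candidate one is penalized by the coefficient 0.5 because its initial letter differs from that of sone, even though its language environment probability is the closest to zero.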
For example, suppose that when a user writes an English sentence on the smart interactive tablet, some is wrongly written as sone, and the context language environment is I have sone applets.
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, so sone is determined to be a wrong word.
2. Acquire words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one exist as candidate words.
3. Replace sone in I have sone applets with some, same, son and one respectively, and calculate for each candidate word, according to the N-Gram model, the corresponding
Figure BDA0001625447080000293
With N in the N-Gram model taking a value of 3, the language environment probability of each candidate word is obtained as follows:
Figure BDA0001625447080000294
Figure BDA0001625447080000295
Figure BDA0001625447080000296
Figure BDA0001625447080000297
where P(I, have, some) = c(I, have, some)/c(3-Gram), and c represents a count; that is, the count (number) of the 3-tuple I, have, some in the corpus divided by the count of all 3-tuples;
where P(have | I) = c(have, I)/c(I); that is, the count of the 2-tuple have, I in the corpus divided by the count of I;
where P(I) = c(I)/c(1-Gram); that is, the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula
Figure BDA0001625447080000301
calculate the evaluation score of each candidate word:
score_some = -1 × 1 × 1/(-5.7) = 0.175;
score_same = -1 × 0.5 × 1/(-6) = 0.083;
score_son = -1 × 1 × 1/(-6) = 0.167;
score_one = -0.5 × 1 × 1/(-4.8) = 0.104.
Therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
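The count-based probabilities used in step 3 (a tuple count divided by a total or by the count of the preceding word) can be illustrated with a small Python sketch. The toy corpus, the function names and the absence of smoothing are assumptions made for illustration only; c(have, I) is interpreted here as the count of the adjacent pair I have.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical toy corpus; a real system would use a large text collection.
corpus = "i have some applets and i have some pens".split()
uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))


def p_trigram(w1, w2, w3):
    """P(w1, w2, w3) = c(w1, w2, w3) / count of all 3-tuples."""
    total = sum(tri.values())
    return tri[(w1, w2, w3)] / total if total else 0.0


def p_conditional(w2, w1):
    """P(w2 | w1) = c(w1 followed by w2) / c(w1)."""
    count_w1 = uni[(w1,)]
    return bi[(w1, w2)] / count_w1 if count_w1 else 0.0


def p_unigram(w):
    """P(w) = c(w) / count of all 1-tuples."""
    total = sum(uni.values())
    return uni[(w,)] / total if total else 0.0


print(p_trigram("i", "have", "some"), p_conditional("have", "i"), p_unigram("i"))
```

Unseen tuples receive probability 0 in this sketch, whose log value is undefined; a practical language model would normally apply some form of smoothing, which the text does not describe.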
In another embodiment, for the BiLSTM model, step 1 and step 2 are the same as above;
Step 3 is replaced by: replace sone in I have sone applets with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word as follows:
Figure BDA0001625447080000302
Figure BDA0001625447080000303
Figure BDA0001625447080000304
Figure BDA0001625447080000305
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000306
calculate the evaluation score of each candidate word:
score_some = -1 × 1 × 1/(-4.9) = 0.204;
score_same = -1 × 1/2 × 1/(-5.07) = 0.098;
score_son = -1 × 1 × 1/(-7.6) = 0.132;
score_one = -0.5 × 1 × 1/(-4.66) = 0.107.
Again, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
In another embodiment, for the LSTM model, step 1 and step 2 are the same as above;
Step 3 is replaced by: replace sone in I have sone applets with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word as follows:
Figure BDA0001625447080000311
Figure BDA0001625447080000312
Figure BDA0001625447080000313
Figure BDA0001625447080000314
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000315
calculate the evaluation score of each candidate word:
score_some = -1 × 1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/2 × 1/(-6.3) = 0.079;
score_son = -1 × 1 × 1/(-6) = 0.116;
score_one = -0.5 × 1 × 1/(-5.1) = 0.098.
Again, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the wrong word sone than other candidate words.
By the embodiment, the editing distance information and the context language environment are considered, and the habit problem that the initial letter is not easy to make mistakes when the user writes English words is also considered, so that more accurate error-correcting words can be determined.
As shown in FIG. 9, in an eighth embodiment, there is provided a candidate word evaluation method including the following steps:
Step S81, detecting a wrong word and obtaining a plurality of candidate words corresponding to the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S82, determining the edit distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S83, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
Step S84, acquiring error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S85, determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information.
Specifically, this step is equivalent to scoring each candidate word according to the edit distance between the candidate word and the wrong word, the error information, and the similarity between the candidate word and the wrong word, and combining these three aspects to evaluate the possibility that the candidate word is the error-correcting word corresponding to the wrong word.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance and the similarity between the candidate word and the wrong word and the error information. Because the edit distance, the similarity between the candidate word and the wrong word, and the user's writing habits are all taken into account, the reliability of the candidate word evaluation result can be improved.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the editing distance, the similarity and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score corresponding to the candidate word according to the editing distance, the similarity and the second coefficient.
In one embodiment, the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the first coefficient includes: and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and the first coefficient.
In one embodiment, the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the second coefficient includes: and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and a second coefficient.
In one embodiment, the evaluation score for each candidate word may be calculated according to the following formula:
score_word = K × S × 1/D_edit
The specific calculation process of the evaluation score is exemplified as follows:
Suppose that when a user writes an English sentence on the smart interactive tablet, some is wrongly written as sone, and the context language environment is I have sone applets. Through English dictionary detection, it is found that sone does not exist in the English dictionary, so sone is determined to be a wrong word. The candidate words of sone are obtained: some, same and one.
The edit distances between some, same and one and the wrong word sone are 1, 2 and 1, respectively.
1. According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity between each candidate word and the wrong word is calculated as follows:
S_some = 1 + 0.7 × 0.75 + 0.3 × 0.5 = 1.675;
S_same = 1/2 + 0.7 × 0.5 + 0.3 × 0.25 = 0.925;
S_one = 1 + 0.7 × 0.86 + 0.3 × 0.86 = 1.86.
2. According to the formula score_word = K × S × 1/D_edit, the evaluation score of each candidate word is calculated as follows:
score_some = 1 × 1.675 × 1/1 = 1.675;
score_same = 1 × 0.925 × 1/2 = 0.463;
score_one = 0.5 × 1.86 × 1/1 = 0.93.
As can be seen from the above evaluation scores, the evaluation score of the candidate word some is higher than the evaluation scores of same and one; this evaluation result has a certain reference value for determining the error-correcting word.
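The calculation in this example follows directly from the formula score_word = K × S × 1/D_edit and can be reproduced with a short Python sketch. The function and parameter names are hypothetical; the similarity values and edit distances are those listed in the example, and K takes the example values 1 (same initial letter) and 0.5 (different initial letter).

```python
def evaluation_score(candidate, wrong, similarity, edit_distance,
                     k_same=1.0, k_diff=0.5):
    """score_word = K * S * (1 / D_edit), with K chosen by the initial letter."""
    k = k_same if candidate[:1] == wrong[:1] else k_diff
    return k * similarity * (1.0 / edit_distance)


# (candidate word, similarity S, edit distance) as given in the example.
examples = (("some", 1.675, 1), ("same", 0.925, 2), ("one", 1.86, 1))
scores = {w: evaluation_score(w, "sone", s, d) for w, s, d in examples}
best = max(scores, key=scores.get)
print(scores, best)
```

Although one has the highest similarity, its different initial letter halves its score, so some ranks first.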
As shown in FIG. 10, in a ninth embodiment, there is provided a candidate word evaluation method including the following steps:
Step S91, detecting a wrong word and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
Step S92, determining the edit distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S93, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, and details are not repeated.
Step S94, acquiring error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S95, determining the evaluation score corresponding to each candidate word according to the edit distance, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance between the candidate word and the wrong word, the evaluation probability and the error information. This takes into account not only the edit distance between the candidate word and the wrong word and the user's writing habits but also the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and helps improve the efficiency and accuracy of text editing.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word is determined according to the reciprocal of the edit distance, the reciprocal of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the evaluation probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
Figure BDA0001625447080000341
The specific calculation process of the evaluation score is exemplified as follows:
It is known that when a user writes an English word or sentence on the smart interactive tablet, some is miswritten as sone, and the context is "I have sone applets".
For the N-Gram model, with N taking a value of 3:
1. By checking the English dictionary, sone is found not to exist in it, so sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one are found as candidate words. The edit distances corresponding to the candidate words some, same, son and one are, respectively: 1, 2, 1, 1.
3. Replace sone in "I have sone applets" with some, same, son and one respectively to obtain the corresponding candidate sentences; calculate the evaluation probability corresponding to each candidate word according to the N-Gram model and the candidate sentences. The obtained evaluation probabilities are as follows:
Figure BDA0001625447080000342
Figure BDA0001625447080000343
Figure BDA0001625447080000344
Figure BDA0001625447080000345
4. According to the formula
Figure BDA0001625447080000346
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1/1 \times 1/(-5.6) = 0.179$;
$\mathrm{score}_{same} = -1 \times 1/2 \times 1/(-6.35) = 0.079$;
$\mathrm{score}_{son} = -1 \times 1/1 \times 1/(-7.58) = 0.132$;
$\mathrm{score}_{one} = -0.5 \times 1/1 \times 1/(-5.26) = 0.095$.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
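The formula image is not reproduced in this text, but the arithmetic in step 4 is consistent with score = -K × (1/D_edit) × (1/P), where P is the (negative) logarithmic evaluation probability and K is 1 when the candidate shares the wrong word's initial letter and 0.5 otherwise. The sketch below reproduces the step-4 numbers under that assumption; the function names are hypothetical.

```python
def coefficient(wrong_word, candidate, first=1.0, second=0.5):
    # First coefficient when the initial letters match, second otherwise.
    # The values 1 and 0.5 are the ones used in the worked example.
    return first if wrong_word[0] == candidate[0] else second

def evaluation_score(wrong_word, candidate, d_edit, log_prob):
    # Assumed form: score = -K * (1 / D_edit) * (1 / log_prob).
    # log_prob is negative, so the leading minus sign keeps the score positive.
    k = coefficient(wrong_word, candidate)
    return -k * (1.0 / d_edit) * (1.0 / log_prob)

# (edit distance, log evaluation probability) taken from the N-Gram example.
ngram_example = {
    "some": (1, -5.6),
    "same": (2, -6.35),
    "son":  (1, -7.58),
    "one":  (1, -5.26),
}

scores = {
    word: evaluation_score("sone", word, d_edit, log_p)
    for word, (d_edit, log_p) in ngram_example.items()
}
# Candidates ordered from most to least likely correction.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Taking the reciprocal of a negative log probability means that a candidate whose log probability is closer to zero (a more probable sentence) receives a larger score.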
In another embodiment, for the BiLSTM model, steps 1 and 2 are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively to obtain the corresponding candidate sentences; calculate the evaluation probability corresponding to each candidate word according to the BiLSTM model and the candidate sentences. The obtained evaluation probabilities are as follows:
Figure BDA0001625447080000351
Figure BDA0001625447080000352
Figure BDA0001625447080000353
Figure BDA0001625447080000354
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000355
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1/1 \times 1/(-2.88) = 0.347$;
$\mathrm{score}_{same} = -1 \times 1/2 \times 1/(-3.62) = 0.138$;
$\mathrm{score}_{son} = -1 \times 1/1 \times 1/(-4.1) = 0.244$;
$\mathrm{score}_{one} = -0.5 \times 1/1 \times 1/(-2.62) = 0.191$.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, steps 1 and 2 above are the same; step 3 is replaced by: calculate the evaluation probability corresponding to each candidate word according to the LSTM model and the candidate sentences. The obtained evaluation probabilities are as follows:
Figure BDA0001625447080000361
Figure BDA0001625447080000362
Figure BDA0001625447080000363
Figure BDA0001625447080000364
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000365
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1/1 \times 1/(-3.175) = 0.315$;
$\mathrm{score}_{same} = -1 \times 1/2 \times 1/(-3.75) = 0.133$;
$\mathrm{score}_{son} = -1 \times 1/1 \times 1/(-4.6) = 0.217$;
$\mathrm{score}_{one} = -0.5 \times 1/1 \times 1/(-3.45) = 0.145$.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
The evaluation score calculations under the three language models above take into account the user's habits when writing words as well as the edit distance and the language environment information, so that the candidate words for a wrong word can be evaluated more effectively.
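Step 2 of the example above collects dictionary words whose edit distance from the wrong word is 1 or 2. A minimal sketch of that lookup follows, assuming the standard Levenshtein definition of edit distance (the first embodiment, which defines it, is not reproduced here) and a hypothetical miniature dictionary.

```python
def edit_distance(a, b):
    # Dynamic-programming Levenshtein distance:
    # insertions, deletions and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def find_candidates(wrong_word, dictionary, max_distance=2):
    # Keep dictionary words whose distance lies between 1 and max_distance.
    result = {}
    for word in dictionary:
        d = edit_distance(wrong_word, word)
        if 1 <= d <= max_distance:
            result[word] = d
    return result

# Hypothetical miniature dictionary, for illustration only.
toy_dictionary = ["some", "same", "son", "one", "apple", "have"]
candidates = find_candidates("sone", toy_dictionary)
print(candidates)
```

A linear scan like this is only practical for small dictionaries; the patent text does not say how the lookup is indexed.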
As shown in Fig. 11, in a tenth embodiment, there is provided a candidate word evaluation method including the following steps:
Step S101, detecting a wrong word and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the corresponding steps of the first embodiment, which are not described in detail.
Step S102, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the corresponding steps of the second embodiment, which are not described in detail.
Step S103, determining the language environment probability of each candidate word at the position of the wrong word.
For the implementation of this step, reference may be made to the description of the third embodiment, which is not repeated.
Step S104, obtaining error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, which is not repeated.
Step S105, determining the evaluation score corresponding to each candidate word according to the similarity, the language environment probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the similarity between each candidate word and the wrong word, the language environment probability, and the error information. This takes into account not only the similarity between the candidate words and the wrong word and the user's writing habits, but also the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and helps improve the efficiency and accuracy of text editing.
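The similarity of step S102 can be sketched as follows. The weights 0.7 and 0.3 and the 1/D_edit term come from the similarity formula used in the worked examples of this document; dividing each LCS length by the mean length of the two words is an assumption inferred from the example values (for instance 0.75 and 0.5 for some, 0.86 for one), not something stated in this section. Function names are illustrative.

```python
def lcs_subsequence_len(a, b):
    # Length of the longest common subsequence (characters need not be adjacent).
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_substring_len(a, b):
    # Length of the longest common substring (characters must be contiguous).
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[j])
        prev = curr
    return best

def similarity(candidate, wrong_word, d_edit):
    # Assumption: each LCS length is normalised by the mean of the two word lengths.
    mean_len = (len(candidate) + len(wrong_word)) / 2
    d_lcs1 = lcs_subsequence_len(candidate, wrong_word) / mean_len
    d_lcs2 = lcs_substring_len(candidate, wrong_word) / mean_len
    # S = 1/D_edit + 0.7*D_lcs1 + 0.3*D_lcs2
    return 1 / d_edit + 0.7 * d_lcs1 + 0.3 * d_lcs2

for word, d_edit in [("some", 1), ("same", 2), ("son", 1), ("one", 1)]:
    print(word, round(similarity(word, "sone", d_edit), 3))
```

Weighting the subsequence term more heavily than the substring term rewards candidates that preserve the letter order of the wrong word even when one letter in the middle differs.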
In one embodiment, when the language environment probability is represented by the logarithm of the probability of each candidate word at the position of the wrong word, the evaluation score corresponding to each candidate word may be determined according to the similarity, the reciprocal of the language environment probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the similarity, the language environment probability, and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score of the candidate word according to the similarity, the language environment probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score of the candidate word according to the similarity, the language environment probability and a second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
Figure BDA0001625447080000371
In one embodiment, the language model is an N-Gram model, a BiLSTM model, or an LSTM model.
Optionally, the language environment probability corresponding to a candidate word may be calculated by combining a plurality of these language models.
Optionally, the evaluation score corresponding to each candidate word under the N-Gram model may be calculated by the following formula:
Figure BDA0001625447080000372
where
Figure BDA0001625447080000381
is the language environment probability corresponding to a given candidate word, calculated through the N-Gram model.
For the N-Gram model, with N taking a value of 3:
A specific example is as follows: it is known that when a user writes an English word or sentence on the smart interactive tablet, some is miswritten as sone, and the context is "I have sone applets".
1. By checking the English dictionary, sone is found not to exist in it, so sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one are found as candidate words. The similarity between each candidate word and the wrong word is determined as: $S_{some} = 1.675$; $S_{same} = 0.925$; $S_{son} = 1.86$; $S_{one} = 1.86$.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, and calculate from the resulting sentences the language environment probability corresponding to each candidate word
Figure BDA0001625447080000382
The language environment probability of each candidate word at the position of sone, calculated according to the N-Gram model, is as follows:
Figure BDA0001625447080000383
Figure BDA0001625447080000384
Figure BDA0001625447080000385
Figure BDA0001625447080000386
where P(I, have, some) = c(I, have, some)/c(3-Gram), with c denoting a count; that is, the number of occurrences of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
P(have | I) = c(I, have)/c(I); that is, the number of occurrences of the 2-tuple (I, have) in the corpus divided by the count of I;
P(I) = c(I)/c(1-Gram); that is, the number of occurrences of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula
Figure BDA0001625447080000387
the evaluation score corresponding to each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/(-5.7) = 0.294$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/(-6) = 0.154$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/(-8) = 0.233$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/(-4.8) = 0.194$.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
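The count-based probabilities described in step 3 can be illustrated with a small Python sketch. The three-sentence corpus is hypothetical, no smoothing is applied, and the helper names are not from the patent; a real model would be estimated from a large corpus and its log probabilities used in the score.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus, for illustration only.
corpus = [
    "i have some apples".split(),
    "i have some books".split(),
    "i have one apple".split(),
]

unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
for sentence in corpus:
    unigrams.update(ngrams(sentence, 1))
    bigrams.update(ngrams(sentence, 2))
    trigrams.update(ngrams(sentence, 3))

def p_trigram(w1, w2, w3):
    # P(w1, w2, w3) = c(w1, w2, w3) / c(all 3-grams)
    return trigrams[(w1, w2, w3)] / sum(trigrams.values())

def p_conditional(w2, w1):
    # P(w2 | w1) = c(w1, w2) / c(w1)
    return bigrams[(w1, w2)] / unigrams[(w1,)]

def p_unigram(w):
    # P(w) = c(w) / c(all 1-grams)
    return unigrams[(w,)] / sum(unigrams.values())

print(p_trigram("i", "have", "some"), p_conditional("have", "i"), p_unigram("i"))
```

With unsmoothed counts, any n-gram absent from the corpus gets probability zero, whose logarithm is undefined, so a practical implementation would need some form of smoothing or back-off.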
In another embodiment, for the BiLSTM model, steps 1 and 2 are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word
Figure BDA0001625447080000391
The value calculated for each candidate word is as follows:
Figure BDA0001625447080000392
Figure BDA0001625447080000393
Figure BDA0001625447080000394
Figure BDA0001625447080000395
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000396
the evaluation score corresponding to each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/(-5.6) = 0.299$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/(-6.3) = 0.147$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/(-8.6) = 0.216$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/(-5.1) = 0.182$.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, steps 1 and 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word
Figure BDA0001625447080000397
The value calculated for each candidate word is as follows:
Figure BDA0001625447080000398
Figure BDA0001625447080000399
Figure BDA00016254470800003910
Figure BDA00016254470800003911
Step 4 is replaced by:
According to the formula
Figure BDA00016254470800003912
the evaluation score corresponding to each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/(-5.6) = 0.299$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/(-6.3) = 0.147$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/(-8.6) = 0.216$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/(-5.1) = 0.182$.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
The candidate word evaluation under the three language models above takes into account the user's habits when writing words and adds the similarity information and the evaluation information of the language environment, thereby improving the reliability of candidate word evaluation.
As shown in Fig. 12, in an eleventh embodiment, there is provided a candidate word evaluation method including the following steps:
Step S111, detecting a wrong word and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the description of the first embodiment, which is not repeated.
Step S112, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
Step S113, replacing the wrong word with each candidate word respectively to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate word according to each candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
Step S114, obtaining error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, which is not repeated.
Step S115, determining the evaluation score corresponding to each candidate word according to the similarity, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the similarity between each candidate word and the wrong word, the evaluation probability, and the error information. This takes into account not only the similarity between the candidate words and the wrong word and the user's writing habits, but also the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result.
In one embodiment, the language environment probability of each word is represented by the logarithm of the probability of that word at its position, and the evaluation probability of a candidate word is obtained on that basis; the evaluation score corresponding to each candidate word may then be determined according to the similarity, the reciprocal of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability, and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score of the candidate word according to the similarity, the evaluation probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score of the candidate word according to the similarity, the evaluation probability and a second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
Figure BDA0001625447080000411
The specific calculation process of the evaluation score is exemplified as follows:
It is known that when a user writes English words and sentences on the smart interactive tablet, some is miswritten as sone, and the context is "I have sone applets".
1. By checking the English dictionary, sone is found not to exist in it, so sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one are found as candidate words. The similarity between each candidate word and the wrong word is determined as: $S_{some} = 1.675$; $S_{same} = 0.925$; $S_{son} = 1.86$; $S_{one} = 1.86$.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, obtaining:
Candidate sentence one: I have some applets;
Candidate sentence two: I have same applets;
Candidate sentence three: I have son applets;
Candidate sentence four: I have one applets.
Based on the N-Gram model with N taking a value of 3, the neighboring word of the candidate word some in candidate sentence one is applets, the neighboring word of the candidate word same in candidate sentence two is applets, the neighboring word of the candidate word son in candidate sentence three is applets, and the neighboring word of the candidate word one in candidate sentence four is applets. The evaluation probability corresponding to each candidate word is calculated according to the N-Gram model, and the obtained evaluation probabilities are as follows:
Figure BDA0001625447080000412
Figure BDA0001625447080000413
Figure BDA0001625447080000421
Figure BDA0001625447080000422
4. According to the formula
Figure BDA0001625447080000423
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/(-5.6) = 0.299$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/(-6.35) = 0.146$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/(-7.58) = 0.245$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/(-5.26) = 0.177$.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the BiLSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and applets; the neighboring words of the candidate word same in candidate sentence two are I, have and applets; the neighboring words of the candidate word son in candidate sentence three are I, have and applets; and the neighboring words of the candidate word one in candidate sentence four are I, have and applets. On this basis, the evaluation probabilities calculated for the candidate words in each candidate sentence are as follows:
Figure BDA0001625447080000424
Figure BDA0001625447080000425
Figure BDA0001625447080000431
Figure BDA0001625447080000432
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000433
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/(-2.88) = 0.582$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/(-3.62) = 0.256$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/(-4.1) = 0.454$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/(-2.62) = 0.355$.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and applets; the neighboring words of the candidate word same in candidate sentence two are I, have and applets; the neighboring words of the candidate word son in candidate sentence three are I, have and applets; and the neighboring words of the candidate word one in candidate sentence four are I, have and applets. On this basis, the evaluation probabilities calculated for the candidate words in each candidate sentence are as follows:
Figure BDA0001625447080000434
Figure BDA0001625447080000435
Figure BDA0001625447080000436
Figure BDA0001625447080000441
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000442
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/(-3.175) = 0.528$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/(-3.75) = 0.247$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/(-4.6) = 0.404$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/(-3.45) = 0.27$.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
The candidate word evaluation under the three language models above takes into account the user's habits when writing words and adds the similarity information and the evaluation information of the language environment, thereby improving the reliability of candidate word evaluation.
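To compare the three language models side by side, the following sketch recomputes the eleventh-embodiment scores and picks the top candidate under each model. It assumes the score has the form -K × S × (1/P), which matches the step-4 arithmetic although the formula image itself is not reproduced here; the logarithmic evaluation probabilities are the operands that appear in those calculations, and all names are illustrative.

```python
# Similarity S and coefficient K per candidate, from step 2 of the example
# (K is 1 when the initial letter matches "sone", 0.5 otherwise).
similarity = {"some": 1.675, "same": 0.925, "son": 1.86, "one": 1.86}
coefficient = {"some": 1.0, "same": 1.0, "son": 1.0, "one": 0.5}

# Log evaluation probabilities as they appear in the three step-4 calculations.
log_eval_prob = {
    "N-Gram": {"some": -5.6, "same": -6.35, "son": -7.58, "one": -5.26},
    "BiLSTM": {"some": -2.88, "same": -3.62, "son": -4.1, "one": -2.62},
    "LSTM":   {"some": -3.175, "same": -3.75, "son": -4.6, "one": -3.45},
}

def evaluation_score(k, s, log_p):
    # Assumed form: score = -K * S * (1 / log_p)
    return -k * s / log_p

all_scores = {}
best_by_model = {}
for model, probs in log_eval_prob.items():
    scores = {
        word: evaluation_score(coefficient[word], similarity[word], log_p)
        for word, log_p in probs.items()
    }
    all_scores[model] = scores
    best_by_model[model] = max(scores, key=scores.get)

print(best_by_model)
```

Note that under all three models the candidate one has the log probability closest to zero, yet it never ranks first, because its different initial letter halves its score through the second coefficient.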
As shown in Fig. 13, in a twelfth embodiment, there is provided a candidate word evaluation method including the following steps:
Step S121, detecting a wrong word and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the description of the first embodiment, which is not repeated.
Step S122, determining the edit distance between each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the first embodiment, which is not repeated.
Step S123, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, and details are not repeated.
Step S124, determining the language environment probability of each candidate word at the position of the wrong word.
For the implementation of this step, reference may be made to the description of the third embodiment, which is not repeated.
Step S125, obtaining error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, which is not repeated.
Step S126, determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity, the language environment probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance and the similarity between each candidate word and the wrong word, the language environment probability, and the error information. This takes into account the edit distance and similarity between the candidate words and the wrong word and the user's writing habits, and also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and helps improve the efficiency and accuracy of text editing.
In one embodiment, when the language environment probability is represented by the logarithm of the probability of each candidate word at the position of the wrong word, the evaluation score corresponding to each candidate word may be determined according to the reciprocal of the edit distance, the similarity, the reciprocal of the language environment probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the language environment probability, and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score of the candidate word according to the edit distance, the similarity, the language environment probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score of the candidate word according to the edit distance, the similarity, the language environment probability and a second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
Figure BDA0001625447080000451
The specific calculation process of the evaluation score is exemplified as follows:
It is known that when a user writes an English word or sentence on the smart interactive tablet, some is miswritten as sone, and the context is "I have sone applets".
1. By checking the English dictionary, sone is found not to exist in it, so sone is determined to be a wrong word.
2. Obtain the words in the English dictionary whose edit distance from sone is 1 or 2; assume that some, same, son and one are found as candidate words. The similarity between each candidate word and the wrong word is determined as: $S_{some} = 1.675$; $S_{same} = 0.925$; $S_{son} = 1.86$; $S_{one} = 1.86$. The edit distances corresponding to the candidate words some, same, son and one are, respectively: 1, 2, 1, 1.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word according to the N-Gram model
Figure BDA0001625447080000452
With N in the N-Gram model taking a value of 3, the language environment probability of each candidate word is obtained as follows:
Figure BDA0001625447080000453
Figure BDA0001625447080000454
Figure BDA0001625447080000455
Figure BDA0001625447080000456
4. According to the formula
Figure BDA0001625447080000461
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/1 \times 1/(-5.7) = 0.294$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/2 \times 1/(-6) = 0.077$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/1 \times 1/(-8) = 0.233$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/1 \times 1/(-4.8) = 0.194$.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the BiLSTM model, steps 1 and 2 are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word
Figure BDA0001625447080000462
The value calculated for each candidate word is as follows:
Figure BDA0001625447080000463
Figure BDA0001625447080000464
Figure BDA0001625447080000465
Figure BDA0001625447080000466
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000467
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/1 \times 1/(-4.9) = 0.342$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/2 \times 1/(-5.07) = 0.091$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/1 \times 1/(-7.6) = 0.245$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/1 \times 1/(-4.66) = 0.200$.
Similarly, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, steps 1 and 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word
Figure BDA0001625447080000471
The value calculated for each candidate word is as follows:
Figure BDA0001625447080000472
Figure BDA0001625447080000473
Figure BDA0001625447080000474
Figure BDA0001625447080000475
Step 4 is replaced by:
According to the formula
Figure BDA0001625447080000476
the evaluation score of each candidate word is calculated as follows:
$\mathrm{score}_{some} = -1 \times 1.675 \times 1/1 \times 1/(-5.6) = 0.299$;
$\mathrm{score}_{same} = -1 \times 0.925 \times 1/2 \times 1/(-6.3) = 0.073$;
$\mathrm{score}_{son} = -1 \times 1.86 \times 1/1 \times 1/(-8.6) = 0.216$;
$\mathrm{score}_{one} = -0.5 \times 1.86 \times 1/1 \times 1/(-5.1) = 0.182$.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
The candidate word evaluation under the three language models above takes into account the user's habits when writing words and adds the edit distance, the similarity, and the evaluation information of the language environment, thereby improving the reliability of candidate word evaluation.
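The twelfth embodiment combines all four factors. A minimal sketch follows, assuming the score has the form -K × S × (1/D_edit) × (1/P), which is the pattern visible in the step-4 arithmetic of the N-Gram example (the formula image itself is not reproduced here). The function name and data layout are hypothetical.

```python
def evaluation_score(k, similarity, d_edit, log_prob):
    # Assumed form: score = -K * S * (1 / D_edit) * (1 / log_prob).
    # log_prob is a negative logarithm, so the result is positive.
    return -k * similarity * (1.0 / d_edit) * (1.0 / log_prob)

# (K, S, D_edit, log language environment probability) from the N-Gram example.
ngram_example = {
    "some": (1.0, 1.675, 1, -5.7),
    "same": (1.0, 0.925, 2, -6.0),
    "son":  (1.0, 1.86, 1, -8.0),
    "one":  (0.5, 1.86, 1, -4.8),
}

scores = {word: evaluation_score(*args) for word, args in ngram_example.items()}
# Candidates ordered from most to least likely correction.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the factors are multiplied, a weak value in any one of them (a larger edit distance, a lower similarity, a less probable context, or a different initial letter) lowers the final score proportionally.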
As shown in fig. 14, in the thirteenth embodiment, there is provided a candidate word evaluation method including the steps of:
step S131, a wrong word is detected, and a plurality of candidate words corresponding to the wrong word are obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S132, determining the editing distance between each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
Step S133, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
Step S134, replacing the wrong word with each candidate word respectively to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate word according to each candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
Step S135, obtaining error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S136, determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance and the similarity between the candidate word and the wrong word, the evaluation probability, and the error information. Because the edit distance and similarity between the candidate words and the wrong word and the user's writing habits are all considered, and the evaluation information of the language model is added, the reliability of the candidate word evaluation results can be improved, which in turn improves the efficiency and accuracy of text editing.
In one embodiment, the language environment probability of each word is represented by the log value of the probability of that word at its position, and the evaluation probability of a candidate word is obtained on this basis. The evaluation score corresponding to each candidate word may then be determined according to the reciprocal of the edit distance, the similarity, the reciprocal of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the evaluation probability and the error information includes: if the wrong word and the candidate word have the same initial letter, calculating the evaluation score of the candidate word according to the edit distance, the similarity, the evaluation probability and a first coefficient; and if the wrong word and the candidate word have different initial letters, calculating the evaluation score of the candidate word according to the edit distance, the similarity, the evaluation probability and a second coefficient.
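In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
score_word = -K × S_word × 1/D_word × 1/P_word
where S_word is the similarity between the candidate word and the wrong word, D_word is the edit distance between them, P_word is the evaluation probability of the candidate word (a log value, hence negative), and K is the first coefficient if the wrong word and the candidate word have the same initial letter and the second coefficient otherwise (1 and 0.5 respectively in the example below).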
The specific calculation process of the evaluation score is exemplified as follows:
Suppose that, when writing an English word or sentence on the smart interactive tablet, a user mistakenly writes some as sone, and the contextual language environment is "I have sone apples".
1. Through English dictionary lookup, sone is found not to be in the English dictionary, and sone is therefore determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are acquired from the English dictionary; assume that some, same, son and one are obtained as candidate words. The similarity between each candidate word and the wrong word is calculated as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86. The edit distances corresponding to the candidate words some, same, son and one are 1, 2, 1 and 1, respectively.
3. The wrong word sone in the sentence is replaced with some, same, son and one respectively, and the evaluation probability corresponding to each candidate word is calculated according to the N-Gram model. The obtained evaluation probabilities are as follows:
[Formula images not reproduced: evaluation probabilities of some, same, son and one under the N-Gram model; the resulting values used in step 4 are -5.6, -6.35, -7.58 and -5.26 respectively.]
4. According to the above formula, the evaluation score of each candidate word is calculated as:
score_some = -1×1.675×1/1×1/(-5.6) = 0.299;
score_same = -1×0.925×1/2×1/(-6.35) = 0.073;
score_son = -1×1.86×1/1×1/(-7.58) = 0.245;
score_one = -0.5×1.86×1/1×1/(-5.26) = 0.177.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
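The arithmetic of the worked example can be checked with a short Python sketch. It assumes the score is the product -K × S × 1/D × 1/P, as the four calculations above suggest; the function and variable names are illustrative, and the coefficient values 1 and 0.5 are read off the example rather than from a general definition.

```python
def evaluation_score(similarity, edit_distance, eval_probability, same_initial,
                     first_coeff=1.0, second_coeff=0.5):
    """score = -K * S * (1/D) * (1/P), where P is a (negative) log-probability value."""
    k = first_coeff if same_initial else second_coeff
    return -k * similarity * (1.0 / edit_distance) * (1.0 / eval_probability)


wrong_word = "sone"
# candidate -> (similarity, edit distance, N-Gram evaluation probability), from the example
candidates = {
    "some": (1.675, 1, -5.6),
    "same": (0.925, 2, -6.35),
    "son": (1.86, 1, -7.58),
    "one": (1.86, 1, -5.26),
}

scores = {
    word: evaluation_score(s, d, p, same_initial=(word[0] == wrong_word[0]))
    for word, (s, d, p) in candidates.items()
}

for word, score in scores.items():
    print(f"{word}: {score:.3f}")
```

Rounded to three decimals, the printed values agree with the four scores listed in step 4.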
In another embodiment, for the BiLSTM model, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
[Formula images not reproduced: evaluation probabilities of some, same, son and one under the BiLSTM model; the resulting values used below are -2.88, -3.62, -4.1 and -2.62 respectively.]
Step 4 is replaced by:
According to the above formula, the evaluation score of each candidate word is calculated as:
score_some = -1×1.675×1/1×1/(-2.88) = 0.582;
score_same = -1×0.925×1/2×1/(-3.62) = 0.128;
score_son = -1×1.86×1/1×1/(-4.1) = 0.454;
score_one = -0.5×1.86×1/1×1/(-2.62) = 0.355.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
[Formula images not reproduced: evaluation probabilities of some, same, son and one under the LSTM model; the resulting values used below are -3.175, -3.75, -4.6 and -3.45 respectively.]
Step 4 is replaced by:
According to the above formula, the evaluation score of each candidate word is calculated as:
score_some = -1×1.675×1/1×1/(-3.175) = 0.528;
score_same = -1×0.925×1/2×1/(-3.75) = 0.123;
score_son = -1×1.86×1/1×1/(-4.6) = 0.404;
score_one = -0.5×1.86×1/1×1/(-3.45) = 0.27.
It can likewise be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
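To compare the three language models side by side, the following Python sketch recomputes the ranking of the four candidates from the values quoted in the examples above. The product form of the score and the coefficient values 1 and 0.5 are taken from those calculations; all names are illustrative.

```python
WRONG_WORD = "sone"
SIMILARITY = {"some": 1.675, "same": 0.925, "son": 1.86, "one": 1.86}
EDIT_DISTANCE = {"some": 1, "same": 2, "son": 1, "one": 1}
# Evaluation probabilities (log values) quoted for each language model
EVAL_PROBABILITY = {
    "N-Gram": {"some": -5.6, "same": -6.35, "son": -7.58, "one": -5.26},
    "BiLSTM": {"some": -2.88, "same": -3.62, "son": -4.1, "one": -2.62},
    "LSTM": {"some": -3.175, "same": -3.75, "son": -4.6, "one": -3.45},
}


def coefficient(candidate, first=1.0, second=0.5):
    """First coefficient for a matching initial letter, second coefficient otherwise."""
    return first if candidate[0] == WRONG_WORD[0] else second


def ranked_scores(model):
    """Return (candidate, score) pairs for one model, highest score first."""
    probs = EVAL_PROBABILITY[model]
    scores = {
        w: -coefficient(w) * SIMILARITY[w] / EDIT_DISTANCE[w] / probs[w]
        for w in probs
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


rankings = {model: [w for w, _ in ranked_scores(model)] for model in EVAL_PROBABILITY}
for model, order in rankings.items():
    print(model, order)
```

With these numbers, all three models give the same order: some, son, one, same. Under the N-Gram and BiLSTM values, one has the least negative evaluation probability, so it is the second coefficient for its differing initial letter that places it below some.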
The candidate word evaluation under each of the three language models takes the user's word-writing habits into account and adds evaluation information on edit distance, similarity and language environment, thereby improving the reliability of the candidate word evaluation results.
On the basis of the evaluation scores obtained in any of the above embodiments, in an embodiment, the candidate word evaluation method further includes the step of: determining the error correction word corresponding to the wrong word from the candidate words according to the evaluation scores, and correcting the wrong word with the error correction word. In this way, the correction for the wrong word can be determined more accurately.
Optionally, a candidate word with the highest evaluation score is selected from the multiple candidate words of the wrong word, and is used as the error correction word corresponding to the wrong word. Further, the wrong words can be replaced by the error correction words, so that the effect of automatically correcting the wrong words is achieved.
In addition, in an embodiment, on the basis of the evaluation scores obtained in any of the above embodiments, the candidate word evaluation method further includes the step of: sorting the candidate words according to their evaluation scores and displaying the candidate words in the sorted order, so that candidate words with higher evaluation scores are displayed nearer the front and the user is better prompted.
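Both uses of the evaluation scores, automatic correction with the top-scoring candidate and display in score order, can be sketched in a few lines of Python. The scores are the N-Gram values from the worked example; the token list and the function names are illustrative assumptions.

```python
def rank_candidates(scores):
    """Candidate words ordered from highest to lowest evaluation score."""
    return sorted(scores, key=scores.get, reverse=True)


def correct(tokens, wrong_index, scores):
    """Return a copy of tokens with the wrong word replaced by the top-scoring candidate."""
    corrected = list(tokens)
    corrected[wrong_index] = max(scores, key=scores.get)
    return corrected


scores = {"some": 0.299, "same": 0.073, "son": 0.245, "one": 0.177}
tokens = ["I", "have", "sone", "apples"]  # assumed tokenization of the example sentence

display_order = rank_candidates(scores)
corrected_tokens = correct(tokens, 2, scores)
print(display_order)
print(" ".join(corrected_tokens))
```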
In an embodiment, on the basis of any of the above embodiments, the candidate word evaluation method further includes: detecting whether a word to be detected is in a preset word bank, and if not, determining that the word to be detected is a wrong word. For example, each word is scanned and checked against the dictionary, and a word that is not in the dictionary is determined to be a wrong word.
It can be understood that the preset word stock may be a general english dictionary, a chinese dictionary, or the like, or may be other specific word stocks, and the word stock may be selected according to actual situations.
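A minimal Python sketch of this lookup is given below. The small word set stands in for the preset word bank, and the tokenizing regular expression and case-insensitive comparison are assumptions made for illustration.

```python
import re


def find_wrong_words(text, lexicon):
    """Return (index, word) pairs for words of text that are not in the lexicon."""
    words = re.findall(r"[A-Za-z']+", text)
    return [(i, w) for i, w in enumerate(words) if w.lower() not in lexicon]


# Hypothetical stand-in for a preset word bank such as an English dictionary
lexicon = {"i", "have", "some", "same", "son", "one", "apples"}

wrong_words = find_wrong_words("I have sone apples", lexicon)
print(wrong_words)
```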
In an embodiment, optionally, after detecting the wrong word, the candidate word evaluation method further includes a step of determining a candidate word set corresponding to the wrong word. The steps can be as follows: and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word. For example, a known word with an edit distance less than 3 is selected as a candidate word, thereby improving the effectiveness of candidate word evaluation.
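Candidate selection by edit distance can be sketched as follows. The sketch uses the standard Levenshtein distance (insertions, deletions and substitutions each cost 1), which is an assumption about the edit distance intended here; the word bank is a hypothetical toy set and the names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_words(wrong_word, lexicon, max_distance=2):
    """Map each known word within max_distance of wrong_word to its edit distance."""
    result = {}
    for word in lexicon:
        distance = edit_distance(wrong_word, word)
        if 0 < distance <= max_distance:
            result[word] = distance
    return result


lexicon = {"i", "have", "some", "same", "son", "one", "apples"}  # hypothetical word bank
candidates = candidate_words("sone", lexicon)
print(sorted(candidates.items()))
```

With `max_distance=2`, which corresponds to an edit distance less than 3, this toy word bank yields some, same, son and one with distances 1, 2, 1 and 1.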
It should be understood that, for the foregoing method embodiments, although the steps in the flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the method embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the ideas of the candidate word evaluation methods of the first to thirteenth embodiments, the embodiments of the present application further provide corresponding candidate word evaluation apparatuses.
As shown in fig. 15, in the fourteenth embodiment, the candidate word evaluation apparatus includes:
the candidate word acquisition module 101 is configured to detect a wrong word and acquire a plurality of candidate words corresponding to the wrong word;
a distance determining module 102, configured to determine an editing distance between each candidate word and the wrong word;
an error information obtaining module 103, configured to obtain error information of the wrong word relative to each candidate word;
and the first evaluation module 104 is configured to determine an evaluation score corresponding to each candidate word according to the edit distance and the error information.
When a wrong word is detected, the candidate word evaluation apparatus of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; it then determines an evaluation score corresponding to each candidate word according to the edit distance and the error information. Both the user's word-writing habits and the edit distance between words are taken into account, thereby improving the reliability of the candidate word evaluation result.
In an embodiment, the first evaluation module 104 is configured to determine an evaluation score corresponding to each candidate word according to an inverse of the edit distance and error information.
In an embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial;
the first evaluation module 104 includes: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the editing distance and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the editing distance and the second coefficient if the wrong word is different from the initial of the candidate word.
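The two scoring submodules differ only in the coefficient they apply. A minimal Python sketch follows, assuming the score is the product of the coefficient and the reciprocal of the edit distance; the product form is an assumption, the coefficient values 1 and 0.5 are borrowed from the worked example of the thirteenth embodiment, and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FirstEvaluationModule:
    first_coefficient: float = 1.0   # same initial letter (value assumed from the worked example)
    second_coefficient: float = 0.5  # different initial letter (value assumed likewise)

    def score(self, wrong_word, candidate, edit_distance):
        """Assumed form: coefficient multiplied by the reciprocal of the edit distance."""
        same_initial = wrong_word[:1] == candidate[:1]
        k = self.first_coefficient if same_initial else self.second_coefficient
        return k * (1.0 / edit_distance)


module = FirstEvaluationModule()
for candidate, distance in [("some", 1), ("same", 2), ("son", 1), ("one", 1)]:
    print(candidate, module.score("sone", candidate, distance))
```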
As shown in fig. 16, in the fifteenth embodiment, the candidate word evaluation apparatus includes:
a candidate word obtaining module 201, configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word;
a similarity determining module 202, configured to determine a similarity between each candidate word and a wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word;
an error information obtaining module 203, configured to obtain error information of the wrong word relative to each candidate word;
and the second evaluation module 204 is configured to determine an evaluation score corresponding to each candidate word according to the similarity and the error information.
When a wrong word is detected, the candidate word evaluation apparatus of this embodiment first acquires a plurality of corresponding candidate words, determines the similarity between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; it then determines an evaluation score corresponding to each candidate word according to the similarity and the error information. Both the user's word-writing habits and the similarity between words are taken into account, so that the reliability of the candidate word evaluation result can be improved.
In one embodiment, the similarity determination module 202 includes a first similarity calculation submodule or a second similarity calculation submodule.
The first similarity calculation submodule is used for calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word. The second similarity calculation submodule is used for calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, together with the edit distance between each candidate word and the wrong word.
In an embodiment, the second similarity calculation submodule is configured to calculate the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, and the reciprocal of the edit distance between each candidate word and the wrong word.
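This passage does not restate how the two rates and the reciprocal of the edit distance are combined. The Python sketch below uses one combination that happens to reproduce the four similarity values quoted in the worked example of the thirteenth embodiment (1.675, 0.925, 1.86, 1.86). The additive form, the weights 0.7 and 0.3, and the normalization by the mean word length are all assumptions inferred from those numbers, not taken from the text. The edit distance is passed in as an argument.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest run of adjacent characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(candidate, wrong_word, edit_distance, w_subseq=0.7, w_substr=0.3):
    """Assumed form: 1/D + w_subseq * subsequence rate + w_substr * substring rate."""
    mean_len = (len(candidate) + len(wrong_word)) / 2  # assumed normalization
    subseq_rate = lcs_length(candidate, wrong_word) / mean_len
    substr_rate = longest_common_substring_length(candidate, wrong_word) / mean_len
    return 1 / edit_distance + w_subseq * subseq_rate + w_substr * substr_rate


for word, distance in [("some", 1), ("same", 2), ("son", 1), ("one", 1)]:
    print(word, round(similarity(word, "sone", distance), 3))
```

Other weightings or normalizations may fit the same four values, so this should be read as a consistency check rather than as the formula of the embodiment.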
In an embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial; optionally, the second evaluation module 204 includes: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the similarity and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the similarity and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 17, in the sixteenth embodiment, the candidate word evaluation apparatus includes:
a candidate word obtaining module 301, configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word;
a first probability determination module 302, configured to determine a language environment probability of each candidate word at the wrong word position;
an error information obtaining module 303, configured to obtain error information of the wrong word relative to each candidate word;
and a third evaluation module 304, configured to determine an evaluation score corresponding to each candidate word according to the language environment probability and the error information.
When a wrong word is detected, the candidate word evaluation apparatus of this embodiment first acquires a plurality of corresponding candidate words, determines the language environment probability of each candidate word at the position of the wrong word, and determines the error information of the wrong word relative to each candidate word; it then determines an evaluation score corresponding to each candidate word according to the language environment probability and the error information. Both the user's word-writing habits and the contextual language environment are taken into account, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 302 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
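As an illustration of a language environment probability, the following Python sketch uses a bigram model, one simple instance of an N-Gram model. The toy corpus, the add-one smoothing and the base-10 logarithm are assumptions chosen for the example; the text fixes neither the training data nor the log base.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real model would be trained on a large text collection
corpus = [
    "i have some apples",
    "i have some books",
    "i have one son",
    "we have the same books",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)


def language_environment_probability(previous_word, candidate):
    """log10 of the add-one-smoothed bigram probability P(candidate | previous_word)."""
    count_pair = bigrams[(previous_word, candidate)]
    probability = (count_pair + 1) / (unigrams[previous_word] + vocab_size)
    return math.log10(probability)


for candidate in ["some", "same", "son", "one"]:
    value = language_environment_probability("have", candidate)
    print(candidate, round(value, 3))
```

Because the probability is at most 1, the log value is never positive, which is why the scoring in the worked examples uses its reciprocal together with a leading minus sign.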
In an embodiment, the third evaluation module 304 is configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the language environment probability and the error information; wherein the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial; correspondingly, the third evaluation module 304 includes: the first scoring submodule is used for calculating the evaluation score of the candidate word according to the reciprocal of the language environment probability and a first coefficient if the wrong word is the same as the initial of the candidate word; and the second scoring submodule is used for calculating the evaluation score of the candidate word according to the inverse of the language environment probability and a second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 18, in the seventeenth embodiment, the candidate word evaluation apparatus includes:
a candidate word obtaining module 401, configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word;
a second probability determining module 402, configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
an error information obtaining module 403, configured to obtain error information of the wrong word relative to each candidate word;
and a fourth evaluation module 404, configured to determine an evaluation score corresponding to each candidate word according to the evaluation probability and the error information.
When a wrong word is detected, the candidate word evaluation apparatus of this embodiment first acquires a plurality of corresponding candidate words, determines the evaluation probability corresponding to each candidate word, and determines the error information of the wrong word relative to each candidate word; it then determines an evaluation score corresponding to each candidate word according to the evaluation probability and the error information. Both the user's word-writing habits and the contextual language environment are taken into account, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 402 is further configured to calculate probabilities of respective positions of a candidate word and neighboring words of the candidate word in the candidate sentence according to a preset language model, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
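The averaging step can be sketched in Python as follows. The sketch assumes that "adjacent words" means the immediate left and right neighbours of the candidate word and that the log is base 10; the per-position probabilities are hypothetical inputs that a language model would supply.

```python
import math


def evaluation_probability(position_probabilities, candidate_index):
    """Mean of the log10 probabilities of the candidate word and its immediate neighbours."""
    start = max(0, candidate_index - 1)
    end = min(len(position_probabilities), candidate_index + 2)
    logs = [math.log10(p) for p in position_probabilities[start:end]]
    return sum(logs) / len(logs)


# Hypothetical per-position probabilities for a four-word candidate sentence
probabilities = [1e-2, 1e-3, 1e-4, 1e-5]

middle = evaluation_probability(probabilities, 2)  # averages positions 1, 2 and 3
edge = evaluation_probability(probabilities, 0)    # only one neighbour exists
print(middle, edge)
```

At a sentence boundary only the neighbours that exist are averaged, which is one possible convention; the text does not say how boundaries are handled.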
In an embodiment, the fourth evaluation module 404 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the evaluation probability and the error information; wherein the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In an embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial; the fourth evaluation module 404 includes: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the evaluation probability and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the evaluation probability and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 19, in the eighteenth embodiment, the candidate word evaluation apparatus includes:
the candidate word obtaining module 501 is configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 502, configured to determine an editing distance between each candidate word and the wrong word;
a second probability determining module 503, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine, according to the candidate sentence, an evaluation probability of a corresponding candidate word, where the evaluation probability is obtained according to a language environment probability of a candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
and a fifth evaluation module 504, configured to determine an evaluation score of each candidate word according to the edit distance and the evaluation probability.
In an embodiment, the second probability determining module 503 is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate the probabilities of the candidate word and of the words adjacent to the candidate word at their respective positions in the candidate sentence according to a preset language model, and use the log value of each probability as the language environment probability of the corresponding word; and to average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
Based on the foregoing embodiment, the fifth evaluation module 504 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the edit distance and the reciprocal of the evaluation probability.
As shown in fig. 20, in the nineteenth embodiment, the candidate word evaluation apparatus includes:
the candidate word obtaining module 601 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A similarity determining module 602, configured to determine a similarity between each candidate word and a wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word;
a second probability determining module 603, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
and a sixth evaluation module 604, configured to determine an evaluation score of each candidate word according to the similarity and the evaluation probability.
In an embodiment, the second probability determining module 603 is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate the probabilities of the candidate word and of the words adjacent to the candidate word at their respective positions in the candidate sentence according to a preset language model, and use the log value of each probability as the language environment probability of the corresponding word; and to average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
Based on the above embodiment, the sixth evaluation module 604 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the evaluation probability and the similarity.
As shown in fig. 21, in the twentieth embodiment, the candidate word evaluation apparatus includes:
a candidate word obtaining module 701, configured to obtain multiple candidate words corresponding to a wrong word when the wrong word is detected;
a distance determining module 702, configured to determine an editing distance between each candidate word and the wrong word;
a first probability determining module 703, configured to determine a language environment probability of each candidate word at the wrong word position;
an error information obtaining module 704, configured to obtain error information of the wrong word relative to each candidate word;
and a seventh evaluation module 705, configured to determine an evaluation score corresponding to each candidate word according to the edit distance, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation apparatus of this embodiment first acquires a plurality of corresponding candidate words, and determines the edit distance between each candidate word and the wrong word, the language environment probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word; it then determines an evaluation score corresponding to each candidate word according to the edit distance, the language environment probability and the error information. The edit distance, the contextual language environment and the user's word-writing habits are all taken into account, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 703 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In an embodiment, the seventh evaluation module 705 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the edit distance, the reciprocal of the language environment probability, and the error information; the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial; correspondingly, the seventh evaluation module 705 comprises: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 22, in the twenty-first embodiment, the candidate word evaluation apparatus includes:
the candidate word obtaining module 801 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 802, configured to determine an editing distance between each candidate word and the wrong word.
And a similarity determining module 803, configured to determine similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
An error information obtaining module 804, configured to obtain error information of the wrong word relative to each candidate word.
And an eighth evaluation module 805, configured to determine, according to the edit distance, the similarity, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation apparatus of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and the similarity between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; it then determines an evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information. The edit distance, the similarity and the user's word-writing habits are all taken into account, so that the reliability of the candidate word evaluation result can be improved.
In one embodiment, the eighth evaluation module 805 comprises: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the edit distance, the similarity and the first coefficient if the wrong word and the candidate word have the same initial letter; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 23, in the twenty-second embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 901 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 902, configured to determine an editing distance between each candidate word and the wrong word.
A second probability determining module 903, configured to replace the wrong word with each candidate word to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 904, configured to obtain error information of the wrong word relative to each candidate word.
And a ninth evaluation module 905, configured to determine, according to the edit distance, the evaluation probability, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, and determines the edit distance between each candidate word and the wrong word, the evaluation probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word. It then determines an evaluation score corresponding to each candidate word according to the edit distance, the evaluation probability and the error information. Because the score takes into account the edit distance, the surrounding language context and the user's habits when writing words, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 903 is specifically configured to calculate, according to a preset language model, the probability of the candidate word and of each word adjacent to the candidate word at their respective positions in the candidate sentence, and to take the logarithm of each probability as the language environment probability of that word; and to average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word to obtain the evaluation probability of the candidate word in the candidate sentence.
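As a rough illustration of this averaging, the following Python sketch uses a toy bigram table in place of the preset language model. The table `BIGRAM`, the floor value `UNSEEN`, the sentence-start marker `"<s>"` and both function names are hypothetical; the embodiment does not specify the type of language model.

```python
import math

# Hypothetical bigram probabilities P(word | previous word).
BIGRAM = {
    ("<s>", "i"): 0.2,
    ("i", "have"): 0.1,
    ("have", "some"): 0.05,
    ("some", "apples"): 0.02,
}
UNSEEN = 1e-6  # assumed floor for word pairs missing from the table


def language_env_prob(words, i):
    """Log-probability of words[i] given its left neighbour."""
    prev = words[i - 1] if i > 0 else "<s>"
    return math.log(BIGRAM.get((prev, words[i]), UNSEEN))


def evaluation_prob(sentence, position, candidate):
    """Average log-probability of the candidate and its adjacent words."""
    words = list(sentence)
    words[position] = candidate  # candidate sentence: wrong word replaced
    lo = max(0, position - 1)
    hi = min(len(words) - 1, position + 1)
    scores = [language_env_prob(words, i) for i in range(lo, hi + 1)]
    return sum(scores) / len(scores)
```

Averaging over the neighbours as well as the candidate means that a candidate which makes the following word unlikely is penalised, even if the candidate itself fits its left context.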
In one embodiment, the ninth evaluation module 905 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the evaluation probability and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the evaluation probability and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 24, in a twenty-third embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 1001 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
The similarity determining module 1002 is configured to determine a similarity between each candidate word and the wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word.
A first probability determining module 1003, configured to determine a language environment probability of each candidate word at the wrong-word position.
An error information obtaining module 1004, configured to obtain error information of the wrong word relative to each candidate word.
A tenth evaluation module 1005, configured to determine an evaluation score corresponding to each candidate word according to the similarity, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, and determines the similarity between each candidate word and the wrong word, the language environment probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word. It then determines an evaluation score corresponding to each candidate word according to the similarity, the language environment probability and the error information. Because the score takes into account the similarity, the surrounding language context and the user's habits when writing words, the reliability of the candidate word evaluation result can be improved.
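The similarity used by this embodiment can be sketched in Python as follows, using the weighted-sum form given in claim 1. Two points are assumptions: each rate is taken to be the common length divided by the mean length of the two words (one reading of the claim wording), and the equal weights `w_seq` and `w_str` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of characters that appears contiguously in both."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            run = prev[j - 1] + 1 if ca == cb else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def similarity(wrong: str, candidate: str,
               w_seq: float = 0.5, w_str: float = 0.5) -> float:
    """Weighted sum of the two rates; rate = common length / mean word length (assumed)."""
    mean_len = (len(wrong) + len(candidate)) / 2
    if mean_len == 0:
        return 0.0
    seq_rate = lcs_length(wrong, candidate) / mean_len
    str_rate = longest_common_substring_length(wrong, candidate) / mean_len
    return w_seq * seq_rate + w_str * str_rate
```

For the wrong word "sone" and the candidate "some", the longest common subsequence is "soe" and the longest common substring is "so", so under these assumptions the similarity is 0.5 × 3/4 + 0.5 × 2/4 = 0.625.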
In an embodiment, the first probability determining module 1003 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In one embodiment, the tenth evaluation module 1005 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the language environment probability and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the language environment probability and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 25, in a twenty-fourth embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 1101 is configured to detect a wrong word, and obtain a plurality of candidate words corresponding to the wrong word.
A similarity determining module 1102, configured to determine the similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A second probability determining module 1103, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 1104, configured to obtain error information of the wrong word relative to each candidate word.
An eleventh evaluation module 1105, configured to determine an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the similarity between each candidate word and the wrong word and the evaluation probability of each candidate word at the position of the wrong word, and determines error information of the wrong word relative to each candidate word. It then determines an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability and the error information. Because the score takes into account the similarity, the surrounding language context and the user's habits when writing words, the reliability of the candidate word evaluation result can be improved.
In one embodiment, the second probability determining module 1103 is specifically configured to calculate, according to a preset language model, the probability of the candidate word and of each word adjacent to the candidate word at their respective positions in the candidate sentence, and to take the logarithm of each probability as the language environment probability of that word; and to average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word to obtain the evaluation probability of the candidate word in the candidate sentence.
In one embodiment, the eleventh evaluation module 1105 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the evaluation probability and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the evaluation probability and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 26, in a twenty-fifth embodiment, a candidate word evaluation device is provided, and the candidate word evaluation device of this embodiment includes:
a candidate word obtaining module 1201, configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 1202, configured to determine an editing distance between each candidate word and the wrong word.
A similarity determining module 1203, configured to determine a similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A first probability determination module 1204, configured to determine a language environment probability of each candidate word at the wrong-word position.
An error information obtaining module 1205 is configured to obtain error information of the wrong word relative to each candidate word.
A twelfth evaluation module 1206, configured to determine, according to the edit distance, the similarity, the language environment probability, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and the similarity between each candidate word and the wrong word and the language environment probability of each candidate word at the position of the wrong word, and determines error information of the wrong word relative to each candidate word. It then determines an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the language environment probability and the error information. Because the score takes into account the edit distance, the similarity, the surrounding language context and the user's habits when writing words, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 1204 is specifically configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In one embodiment, the twelfth evaluation module 1206 comprises: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the similarity, the language environment probability and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the similarity, the language environment probability and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 27, in a twenty-sixth embodiment, a candidate word evaluation device is provided, and the candidate word evaluation device of this embodiment includes:
the candidate word obtaining module 1301 is configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 1302, configured to determine an editing distance between each candidate word and the wrong word.
A similarity determining module 1303, configured to determine the similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A second probability determining module 1304, configured to replace the wrong word with each candidate word to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 1305, configured to obtain error information of the wrong word relative to each candidate word.
A thirteenth evaluation module 1306, configured to determine, according to the edit distance, the similarity, the evaluation probability, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and the similarity between each candidate word and the wrong word and the evaluation probability of each candidate word at the position of the wrong word, and determines error information of the wrong word relative to each candidate word. It then determines an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the evaluation probability and the error information. Because the score takes into account the edit distance, the similarity, the surrounding language context and the user's habits when writing words, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 1304 is specifically configured to calculate, according to a preset language model, the probability of the candidate word and of each word adjacent to the candidate word at their respective positions in the candidate sentence, and to take the logarithm of each probability as the language environment probability of that word; and to average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word to obtain the evaluation probability of the candidate word in the candidate sentence.
In one embodiment, the thirteenth evaluating module 1306 is further configured to determine an evaluation score corresponding to each candidate word according to the inverse of the edit distance, the similarity, the inverse of the evaluation probability, and the error information.
In one embodiment, the thirteenth evaluation module 1306 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the similarity, the evaluation probability and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the similarity, the evaluation probability and a second coefficient if the wrong word and the candidate word have different first letters.
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a candidate word determining module, configured to calculate the edit distance between the wrong word and the known words in a preset word bank, and to select the known words whose edit distance is within a set range, thereby obtaining a plurality of candidate words corresponding to the wrong word.
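Candidate selection of this kind can be sketched in Python as below. The function names, the sample word bank and the range of 1 to `max_distance` edits are assumptions for illustration; the embodiment only requires that the edit distance fall within a set range.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_wrong_word(word: str, word_bank) -> bool:
    """A word that is absent from the preset word bank is treated as wrong."""
    return word not in word_bank


def find_candidates(wrong: str, word_bank, max_distance: int = 2):
    """Known words whose edit distance to the wrong word is in 1..max_distance."""
    scored = ((edit_distance(wrong, known), known) for known in word_bank)
    return sorted(known for dist, known in scored if 0 < dist <= max_distance)


# Hypothetical word bank used only for this example.
WORD_BANK = {"some", "same", "done", "apple"}
candidates = find_candidates("sone", WORD_BANK)
```

A linear scan over the word bank is the simplest approach; for a large word bank, an index over the known words would avoid computing the distance to every entry.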
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a correction module, configured to determine, from the plurality of candidate words and according to the evaluation scores, an error correction word corresponding to the wrong word, and to correct the wrong word by using the error correction word. Optionally, the correction module is configured to determine, from the plurality of candidate words, the candidate word with the highest evaluation score as the error correction word corresponding to the wrong word.
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a sorting module, configured to sort the plurality of candidate words according to the evaluation scores and to display the sorted candidate words.
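Sorting and correction from a set of evaluation scores can be sketched in Python as follows. The function names, the alphabetical tie-break and the sample scores are hypothetical and serve only to show the two uses of the scores.

```python
def rank_candidates(scores):
    """Candidate words ordered from highest to lowest evaluation score.

    Ties are broken alphabetically so that the order is deterministic.
    """
    return sorted(scores, key=lambda word: (-scores[word], word))


def correct_sentence(sentence, position, scores):
    """Replace the wrong word at `position` with the top-scoring candidate."""
    ranked = rank_candidates(scores)
    corrected = list(sentence)
    if ranked:  # leave the sentence unchanged when there are no candidates
        corrected[position] = ranked[0]
    return corrected


# Hypothetical evaluation scores for the wrong word "sone".
SCORES = {"some": 0.8, "done": 0.4, "same": 0.3}
ranking = rank_candidates(SCORES)
fixed = correct_sentence(["i", "have", "sone", "apples"], 2, SCORES)
```

The ranked list serves the display case, where the user picks a word, while `correct_sentence` corresponds to automatic correction with the highest-scoring candidate.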
In an embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the candidate word evaluation method in any one of the above embodiments when executing the computer program.
Compared with the traditional approach of evaluating candidate words by edit distance alone, the computer device can evaluate the candidate words for a wrong word more effectively.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the candidate word evaluation method of any one of the above embodiments.
Compared with the traditional approach of evaluating candidate words by edit distance alone, the computer-readable storage medium enables the candidate words for a wrong word to be evaluated more effectively.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The terms "comprises" and "comprising," as well as any variations thereof, of the embodiments herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
References to "first/second" herein merely distinguish between similar objects and do not denote a particular ordering of the objects. It should be understood that, where permissible, "first/second" objects may be interchanged in order or sequence, so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein.
The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (14)

1. A candidate word evaluation method, comprising:
detecting wrong words, and acquiring a plurality of candidate words corresponding to the wrong words;
determining the editing distance between each candidate word and the wrong word;
determining the similarity of each candidate word and the wrong word through the weighted summation of the longest common subsequence rate and the longest common substring rate; or calculating the similarity of each candidate word and the wrong word through the weighted summation of the longest common subsequence rate and the longest common substring rate and the editing distance between each candidate word and the wrong word; wherein, the longest common subsequence rate is the average of the length of the longest common subsequence divided by the total length of the sequences in the sequence set; the longest common substring rate is the average of the length of the longest common substring divided by the total length of the sequences in the sequence set;
acquiring error information of the wrong word relative to each candidate word;
and determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity and the error information.
2. The candidate word evaluation method of claim 1, wherein the step of calculating the similarity between each candidate word and the wrong word through the weighted summation of the longest common subsequence rate and the longest common substring rate and the edit distance between each candidate word and the wrong word comprises:
and calculating the similarity of each candidate word and the wrong word through the weighted summation of the longest common subsequence rate and the longest common substring rate and the reciprocal of the editing distance between each candidate word and the wrong word.
3. The candidate word evaluation method according to any one of claims 1 to 2, wherein the error information of the wrong word with respect to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter;
the step of determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity and the error information comprises the following steps:
if the wrong word and the candidate word have the same initial letter, calculating an evaluation score corresponding to the candidate word according to the editing distance, the similarity and a first coefficient;
and if the wrong word and the candidate word have different initial letters, calculating the evaluation score corresponding to the candidate word according to the editing distance, the similarity and a second coefficient.
4. The candidate word evaluation method of claim 3,
the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the first coefficient includes:
calculating an evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and a first coefficient;
and/or,
the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the second coefficient includes:
and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and a second coefficient.
5. The candidate word evaluation method of claim 1, further comprising the steps of:
and determining that the word to be detected is not in a preset word bank, and determining that the word to be detected is a wrong word.
6. The candidate word evaluation method of claim 5, further comprising, after detecting a wrong word, the steps of:
and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word.
7. The candidate word evaluation method of any one of claims 1, 2, 4, 5, and 6, further comprising the steps of:
determining an error correction word corresponding to the wrong word from the plurality of candidate words according to the evaluation score, and correcting the wrong word by using the error correction word;
and/or,
and sorting the candidate words according to the evaluation scores, and displaying the sorted candidate words.
8. The candidate word evaluation method according to claim 3, wherein the evaluation score of each candidate word is calculated according to the following formula:
[formula image: DEST_PATH_IMAGE001]
wherein the symbol shown in image DEST_PATH_IMAGE002 denotes a candidate word, the symbol shown in image DEST_PATH_IMAGE003 denotes the edit distance between the candidate word and the wrong word, the symbol shown in image DEST_PATH_IMAGE004 denotes the evaluation score corresponding to the candidate word, S denotes the similarity between the candidate word and the wrong word, and K denotes the error information of the wrong word relative to each candidate word; if the candidate word and the wrong word have the same initial letter, K takes the value K1; otherwise, K takes the value K2; K1 and K2 are preset values.
9. A candidate word evaluation apparatus, comprising:
the candidate word acquisition module is used for detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words;
the distance determining module is used for determining the editing distance between each candidate word and the wrong word;
the similarity determining module is used for determining the similarity between each candidate word and the wrong word through the weighted summation of the longest common subsequence rate and the longest common substring rate; or calculating the similarity of each candidate word and the wrong word through the weighted summation of the longest common subsequence rate and the longest common substring rate and the editing distance between each candidate word and the wrong word; wherein, the longest common subsequence rate is the average of the length of the longest common subsequence divided by the total length of the sequences in the sequence set; the longest common substring rate is the average of the length of the longest common substring divided by the total length of the sequences in the sequence set;
the error information acquisition module is used for acquiring error information of the wrong words relative to the candidate words;
and the eighth evaluation module is used for determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity and the error information.
10. The candidate word evaluation device of claim 9, wherein the similarity determination module is further configured to calculate the similarity between each candidate word and the wrong word by weighted summation of the longest common subsequence rate and the longest common substring rate, and an inverse of an edit distance between each candidate word and the wrong word.
11. The candidate word evaluation apparatus according to any one of claims 9 to 10, wherein the error information of the wrong word with respect to each candidate word comprises: information on whether the wrong word and the candidate word have the same initial letter;
the eighth evaluation module comprises:
the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the editing distance, the similarity and a first coefficient if the wrong word and the candidate word have the same initial letter;
and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the editing distance, the similarity and a second coefficient if the wrong word and the candidate word have different initial letters.
12. The candidate word evaluation apparatus of claim 9, further comprising:
the candidate word determining module is used for calculating the editing distance between the wrong words and the known words in the preset word bank, selecting the known words with the editing distance within a set range, and obtaining a plurality of candidate words corresponding to the wrong words;
and/or,
the error correction word determining module is used for determining an error correction word corresponding to the wrong word from the candidate words according to the evaluation scores, and correcting the wrong word by using the error correction word;
and/or,
and the sorting module is used for sorting the candidate words according to the evaluation scores and displaying the sorted candidate words.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 1 to 8 are performed when the program is executed by the processor.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN201810321706.6A 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium Active CN108681535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810321706.6A CN108681535B (en) 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810321706.6A CN108681535B (en) 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108681535A CN108681535A (en) 2018-10-19
CN108681535B true CN108681535B (en) 2022-07-08

Family

ID=63800966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810321706.6A Active CN108681535B (en) 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108681535B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2529774A (en) * 2013-04-15 2016-03-02 Contextual Systems Pty Ltd Methods and systems for improved document comparison
CN106528532B (en) * 2016-11-07 2019-03-12 上海智臻智能网络科技股份有限公司 Text error correction method, device and terminal
CN106650803B (en) * 2016-12-09 2019-06-18 北京锐安科技有限公司 Method and device for calculating similarity between character strings
CN106844331A (en) * 2016-12-13 2017-06-13 苏州大学 Sentence similarity calculation method and system
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 Text similarity determination method
CN107562824B (en) * 2017-08-21 2020-10-27 昆明理工大学 Text similarity detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
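Application of spelling correction technology in information retrieval and text processing; Zhang Yang; China Master's Theses Full-text Database, Information Science and Technology; 20090415; pp. I138-1141 *
Algorithm design and system implementation of automatic spelling proofreading; Zheng Wenxi et al.; Science Technology and Industry; 20130228; p. 145 *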

Also Published As

Publication number Publication date
CN108681535A (en) 2018-10-19

Similar Documents

Publication Publication Date Title
US9535896B2 (en) Systems and methods for language detection
US9424246B2 (en) System and method for inputting text into electronic devices
CN109117480B (en) Word prediction method, word prediction device, computer equipment and storage medium
CN110334179B (en) Question-answer processing method, device, computer equipment and storage medium
CN108628826B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108694167B (en) Candidate word evaluation method, candidate word ordering method and device
US11409374B2 (en) Method and device for input prediction
CN105302882A (en) Keyword obtaining method and apparatus
CN111950280A (en) Address matching method and device
CN104281275A (en) Method and device for inputting English
CN106570196B (en) Video program searching method and device
CN112182337B (en) Method for identifying similar news from massive short news and related equipment
CN108681533B (en) Candidate word evaluation method and device, computer equipment and storage medium
AU2014409115A1 (en) System and method for language detection
CN108664466B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108595419B (en) Candidate word evaluation method, candidate word sorting method and device
CN108647202B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108664467B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN113657098A (en) Text error correction method, device, equipment and storage medium
CN108681535B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108694166B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108733646B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108628827A (en) Candidate word appraisal procedure, device, computer equipment and storage medium
CN108681534A (en) Candidate word appraisal procedure, device, computer equipment and storage medium
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant