CN108694167B - Candidate word evaluation method, candidate word ordering method and device - Google Patents


Info

Publication number
CN108694167B
CN108694167B (application CN201810321032.XA)
Authority
CN
China
Prior art keywords
word
candidate
evaluation
words
wrong
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810321032.XA
Other languages
Chinese (zh)
Other versions
CN108694167A (en)
Inventor
Li Xian (李贤)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810321032.XA
Publication of CN108694167A
Application granted
Publication of CN108694167B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a candidate word evaluation method, a candidate word ordering method, and corresponding apparatuses, applied in the field of data processing. The method comprises the following steps: detecting a wrong word and acquiring a plurality of candidate words corresponding to the wrong word; determining the edit distance between each candidate word and the wrong word; replacing the wrong word with each candidate word to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate word according to each candidate sentence, wherein the evaluation probability is obtained from the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word; and determining the evaluation score of each candidate word according to the edit distance and the evaluation probability. Embodiments of the invention help improve the reliability of candidate word evaluation.

Description

Candidate word evaluation method, candidate word sorting method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a candidate word evaluation method, a candidate word ranking method, and an apparatus thereof.
Background
Currently popular word processing software, such as Word, WPS and WordPerfect, embeds an English spell-check function: when a misspelled word is detected, a prompt or a corresponding error-correction suggestion is given.
In the process of implementing the invention, the inventor found the following problem in the prior art: existing error-correction methods mainly rely on dictionary detection and, after a misspelling is found, evaluate the candidate words of the wrong word only by edit distance. This approach is too simple and rigid, and the reliability of the candidate word evaluation results is not ideal.
Disclosure of Invention
Therefore, it is necessary to provide a candidate word evaluation method, a candidate word ranking method and a candidate word ranking device for solving the problem that the reliability of the evaluation result of the candidate word in the conventional method is not ideal.
The scheme provided by the embodiment of the invention comprises the following steps:
in one aspect, a candidate word evaluation method includes:
detecting wrong words, and acquiring a plurality of candidate words corresponding to the wrong words;
determining the editing distance between each candidate word and the wrong word;
respectively replacing the wrong words with the candidate words to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate words according to the candidate sentences, wherein the evaluation probability is obtained according to the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words;
and determining the evaluation score of each candidate word according to the editing distance and the evaluation probability.
In one embodiment, the determining the evaluation probability of the corresponding candidate word according to the candidate sentence includes:
calculating, according to a preset language model, the probability of the candidate word and of each adjacent word of the candidate word at their respective positions in the candidate sentence, and taking the log value of each probability as the language environment probability of the corresponding word;
and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
In one embodiment, the determining an evaluation score of each candidate word according to the edit distance and the evaluation probability includes:
determining an evaluation score corresponding to each candidate word according to the reciprocal of the editing distance and the reciprocal of the evaluation probability;
and/or,
the language model includes: an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the method further comprises the following steps:
and determining that the word to be detected is not in a preset word bank, and determining that the word to be detected is a wrong word.
In one embodiment, after the wrong word is detected, the method further comprises the following steps:
and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word.
In one embodiment, the method further comprises the following steps:
determining an error-correction word corresponding to the wrong word from the plurality of candidate words according to the evaluation score, and correcting the wrong word by using the error-correction word;
and/or,
and sorting the candidate words according to the evaluation scores, and displaying the sorted candidate words.
In one embodiment, the determining, according to the evaluation score, an error-correction word corresponding to the wrong word from the multiple candidate words includes:
and determining the candidate word with the highest evaluation score from the plurality of candidate words as the error correction word corresponding to the error word.
In one embodiment, the evaluation score of each candidate word is calculated according to the following formula:
score_word = -(1/D_edit) × (1/P̄(word))
wherein D_edit denotes the edit distance between the candidate word and the wrong word, word denotes the candidate word, P̄(word) denotes the evaluation probability of the candidate word, and score_word denotes the evaluation score corresponding to the candidate word.
Another aspect provides a candidate word evaluation apparatus, including:
the candidate word acquisition module is used for detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words;
the distance determining module is used for determining the editing distance between each candidate word and the wrong word;
the second probability determining module is used for replacing the wrong words with the candidate words respectively to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate words according to the candidate sentences, wherein the evaluation probability is obtained according to the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words;
and the fifth evaluation module is used for determining the evaluation score of each candidate word according to the editing distance and the evaluation probability.
In one embodiment, the second probability determining module is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate probabilities of the candidate word and neighboring words of the candidate word in the candidate sentence at respective positions according to a preset language model, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
According to the candidate word evaluation method and device, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; determining the editing distance between each candidate word and the wrong word; respectively replacing the wrong words with the candidate words to obtain candidate sentences, determining the evaluation probability of the corresponding candidate words according to the candidate sentences, and determining the evaluation score of each candidate word according to the editing distance and the evaluation probability; the evaluation result of the candidate words obtained in the above way not only considers the editing distance, but also comprehensively considers the context language environment information, and improves the reliability of the evaluation result of the candidate words.
A further aspect provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when executing the program.
According to the computer equipment, the accuracy of the candidate word evaluation result can be improved through the computer program running on the processor.
A further aspect provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of any of the methods described above.
The computer storage medium can improve the accuracy of the candidate word evaluation result through the stored computer program.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a candidate word evaluation method;
FIG. 2 is a schematic flow chart diagram of a candidate word evaluation method according to a first embodiment;
FIG. 3 is a schematic flow chart diagram of a candidate word evaluation method according to a second embodiment;
FIG. 4 is a schematic flow chart diagram of a candidate word evaluation method according to a third embodiment;
FIG. 5 is a schematic flow chart diagram of a candidate word evaluation method according to a fourth embodiment;
FIG. 6 is a schematic flow chart diagram of a candidate word evaluation method according to a fifth embodiment;
FIG. 7 is a schematic flow chart diagram illustrating a candidate word evaluation method according to a sixth embodiment;
FIG. 8 is a schematic flow chart diagram of a candidate word evaluation method according to a seventh embodiment;
FIG. 9 is a schematic flow chart diagram illustrating a candidate word evaluation method according to an eighth embodiment;
FIG. 10 is a schematic flow chart diagram illustrating a candidate word evaluation method according to a ninth embodiment;
FIG. 11 is a schematic flow chart diagram illustrating a candidate word evaluation method according to a tenth embodiment;
FIG. 12 is a schematic flow chart diagram of a candidate word evaluation method according to an eleventh embodiment;
FIG. 13 is a schematic flow chart diagram illustrating a candidate word evaluation method according to a twelfth embodiment;
FIG. 14 is a schematic flow chart diagram of a candidate word evaluation method according to a thirteenth embodiment;
FIG. 15 is a schematic structural diagram of a candidate word evaluation apparatus according to a fourteenth embodiment;
FIG. 16 is a schematic structural diagram of a candidate word evaluation apparatus according to a fifteenth embodiment;
FIG. 17 is a schematic structural diagram of a candidate word evaluation apparatus according to a sixteenth embodiment;
FIG. 18 is a schematic structural diagram of a candidate word evaluation apparatus according to a seventeenth embodiment;
FIG. 19 is a schematic structural diagram of a candidate word evaluation apparatus according to an eighteenth embodiment;
FIG. 20 is a schematic structural diagram of a candidate word evaluation apparatus according to a nineteenth embodiment;
FIG. 21 is a schematic structural diagram of a candidate word evaluation apparatus according to a twentieth embodiment;
FIG. 22 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-first embodiment;
FIG. 23 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-second embodiment;
FIG. 24 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-third embodiment;
FIG. 25 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-fourth embodiment;
FIG. 26 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-fifth embodiment;
FIG. 27 is a schematic structural diagram of a candidate word evaluation apparatus according to a twenty-sixth embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
The candidate word evaluation method provided by the application can be applied to the application environment shown in fig. 1. The internal structure of the computer apparatus may be as shown in fig. 1, and includes a processor, a memory, a display screen, and an input device connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The computer program is executed by a processor to implement a candidate word evaluation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on a shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The computer device can be a terminal, including but not limited to various personal computers, laptops, smart phones and smart interactive tablets; when the intelligent interactive tablet is used, the intelligent interactive tablet can detect and recognize the writing operation of a user, can detect the content of the writing operation, and even can automatically correct the wrongly written words.
It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
As shown in fig. 2, in a first embodiment, a candidate word evaluation method is provided, which includes the following steps:
step S11, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The wrong words include misspelled words or miswritten words, which are words not present in the corresponding lexicon.
The wrong word can be a text unit such as a symbol, a character, a word, or an English word. A candidate word is a word similar to the wrong word, determined from the wrong word, i.e. a possible correct word (error-correction word) corresponding to it. The number of candidate words may be one, two, three or more; the embodiment of the present invention does not limit the number of candidate words. If only one candidate word is determined, that candidate word is directly taken as the error-correction word.
The method for detecting the wrong word can be implemented in various ways, such as a dictionary detection way, and the embodiment of the invention does not limit the method for detecting the wrong word.
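As a minimal illustration of dictionary detection, the sketch below assumes Python and a lexicon represented as a plain set of known words; the function name and the toy lexicon are illustrative assumptions, not the patent's reference implementation.

```python
# Minimal sketch of dictionary-based wrong-word detection.
# The lexicon and function name are illustrative assumptions.

def is_wrong_word(word: str, lexicon: set) -> bool:
    """A word is treated as wrong if it does not appear in the lexicon."""
    return word.lower() not in lexicon

lexicon = {"i", "have", "some", "same", "one", "son", "apples"}
print(is_wrong_word("sone", lexicon))   # True  -> flagged as a wrong word
print(is_wrong_word("some", lexicon))   # False
```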
And step S12, determining the edit distance between each candidate word and the wrong word.
The edit distance between the candidate word and the wrong word measures the degree of difference between them; it may refer to the difference in the number of letters between the candidate word and the wrong word, the difference in strokes, and so on. For a character string or an English word, the edit distance is the minimum number of edit operations required to convert one string into the other, where the permitted edit operations are replacing a character, inserting a character and deleting a character.
In one embodiment, a two-dimensional array d[i, j] is used to store edit distance values, where d[i, j] represents the minimum number of steps required to convert the string s[1…i] into the string t[1…j]. When i equals 0, i.e. string s is empty, d[0, j] = j, since j characters must be inserted to convert s into t; when j equals 0, i.e. string t is empty, d[i, 0] = i, since i characters must be deleted to convert s into t. For example, the edit distance between kitten and sitting is 3, because at least the following 3 steps are needed: sitten (k → s), sittin (e → i), sitting (insert g).
For example: and for a wrong word, selecting all known words with edit distances of 1 and 2 in the word stock, and constructing a candidate word set of the wrong word. The determination of the set of candidate words may also be based on other edit distance conditions, depending on the actual situation. By screening the candidate words corresponding to the wrong words, the reliability of the candidate word evaluation result is improved, and meanwhile, the subsequent calculation time is favorably reduced.
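A minimal Python sketch of this step is given below. The d[i][j] dynamic program follows the description above; the toy lexicon, the distance threshold of 2 and the function names are illustrative assumptions.

```python
# Sketch of the edit-distance computation and candidate-set construction
# described above; lexicon, threshold and names are illustrative.

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the d[i][j] dynamic program: d[i][j] is the
    minimum number of edits turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete i characters when t is empty
    for j in range(n + 1):
        d[0][j] = j          # insert j characters when s is empty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[m][n]

def candidates(wrong: str, lexicon, max_dist: int = 2):
    """Known words whose edit distance to the wrong word is within max_dist."""
    dists = {w: edit_distance(wrong, w) for w in lexicon}
    return {w: d for w, d in dists.items() if 0 < d <= max_dist}

print(edit_distance("kitten", "sitting"))   # 3, as in the example above
lexicon = {"some", "same", "one", "son", "apples", "have"}
print(candidates("sone", lexicon))          # e.g. {'some': 1, 'same': 2, 'one': 1, 'son': 1} (order may vary)
```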
And step S13, acquiring error information of the wrong word relative to each candidate word.
The error information of the wrong word relative to each candidate word represents the distinguishing information between the wrong word and that candidate word. Optionally, the error information may be determined according to users' general habits when writing words, for example that users are particularly prone to omitting characters, or that a particular part of a word is most prone to errors.
In an embodiment, the error information of the wrong word relative to each candidate word may be information on whether the wrong word and the candidate word have the same first letter, whether they have the same number of characters, whether they contain the same components, whether the wrong word contains an illegal symbol, or the like.
And step S14, determining the evaluation score corresponding to each candidate word according to the editing distance and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word. This step is equivalent to effectively evaluating the possibility that the candidate word is the error-correcting word corresponding to the error word according to the edit distance information and the error information between the candidate word and the error word.
When a wrong word is detected, the candidate word evaluation method of the embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance and the error information; the method not only considers the habit characteristics of the user when writing the words, but also considers the editing distance information between the words, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In an embodiment, determining an evaluation score corresponding to each candidate word according to the edit distance and the error information includes: and determining the evaluation score corresponding to each candidate word according to the reciprocal of the editing distance and the error information.
Generally, the larger the edit distance, the larger the degree of difference between the candidate word and the error word, and the smaller the possibility that the candidate word is an error correction word of the error word, and therefore, the reliability of the evaluation score of the candidate word can be ensured by the reciprocal of the edit distance and the error information.
In one embodiment, the error information of the wrong word relative to each candidate word includes whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the edit distance and the error information includes: if the wrong word and the candidate word have the same first letter, calculating the evaluation score corresponding to the candidate word according to the reciprocal of the edit distance and a first coefficient; if the first letters differ, calculating the evaluation score corresponding to the candidate word according to the reciprocal of the edit distance and a second coefficient.
For example, suppose the error information of the wrong word relative to each candidate word is represented by K, the edit distance by D_edit, and the evaluation score corresponding to the candidate word by score_word; then the formula for calculating the score of each candidate word may be:
score_word = K × 1/D_edit
The value of K is selected according to whether the candidate word and the wrong word have the same first letter: if they are the same, K takes the value K1, otherwise K takes the value K2, where K1 and K2 are preset values. For example: K = 1 when the first letters are the same, and K = 0.5 when they are not.
For an English word or character string, based on users' writing habits, the first letter is generally not written incorrectly. Evaluating, on this basis, the possibility that each candidate word is the error-correction word of the wrong word allows the evaluation score of the candidate word to be determined more reasonably: if a candidate word has the same first letter as the wrong word, it is more likely to be the error-correction word; otherwise it is less likely. It is understood that the values of K include, but are not limited to, those in the above example.
For example: it is known that, when writing English words and sentences on the intelligent interactive tablet, a user wrote some incorrectly as sone, and the contextual language environment is I have sone apples; the method for evaluating the candidate words corresponding to sone comprises the following steps:
1. Through English dictionary detection, sone is not found in the English dictionary and is determined to be a wrong word.
2. The candidate words of sone are acquired; assume that some, same and one exist as candidate words.
3. The edit distances of some, same and one from sone are determined, which are 1, 2 and 1 respectively.
4. According to the formula score_word = K × 1/D_edit, the score of each candidate word is calculated:
score_some = 1 × 1 = 1;
score_same = 1 × 1/2 = 0.5;
score_one = 0.5 × 1 = 0.5.
Therefore, compared with the other candidate words, the candidate word some is more likely to be the correct word (error-correction word) corresponding to the wrong word sone.
Therefore, the finally obtained evaluation score corresponding to each candidate word comprehensively considers both the user's habits when writing words and the edit distance between words; compared with the traditional candidate word evaluation method based on edit distance alone, this helps improve the reliability of the candidate word evaluation result and ensures the accuracy of the error-correction word.
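The worked example above can be reproduced with a short sketch. This is an assumed Python illustration: the K values of 1 and 0.5, the function names, and passing the edit distance in directly are all illustrative choices, not the patent's reference code.

```python
# Hedged sketch of the first embodiment's score: score_word = K * 1/D_edit,
# with K = 1 when the candidate shares the wrong word's first letter and
# K = 0.5 otherwise (the example values used above). All names are illustrative.

def first_letter_factor(wrong: str, cand: str, k_same: float = 1.0, k_diff: float = 0.5) -> float:
    return k_same if wrong[:1] == cand[:1] else k_diff

def score_edit(wrong: str, cand: str, d_edit: int) -> float:
    return first_letter_factor(wrong, cand) * (1.0 / d_edit)

for cand, dist in [("some", 1), ("same", 2), ("one", 1)]:
    print(cand, score_edit("sone", cand, dist))
# some 1.0, same 0.5, one 0.5 -> "some" ranks highest
```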
As shown in fig. 3, in a second embodiment, a candidate word evaluation method includes the following steps:
step S21, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the corresponding step S11 implemented above, which is not described in detail.
And step S22, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
The longest common subsequence problem is to find, in a set of sequences (usually two), the longest subsequence common to all of them. A sequence is called the longest common subsequence of the known sequences if it is a subsequence of each of two or more known sequences and is the longest of all sequences meeting this condition.
The longest common substring problem is to find, in a set of sequences (usually two), the longest substring common to all of them. A substring is called the longest common substring of the known sequences if it is a substring of each of two or more known sequences and is the longest of all substrings meeting this condition.
A subsequence behaves like a sequence: its elements are ordered but not necessarily contiguous. For example, the sequence X = <B, C, D, B> is a subsequence of the sequence Y = <A, B, C, B, D, A, B>, with the corresponding index sequence <2, 3, 5, 7>. A substring behaves like a character string: its characters are ordered and contiguous. For example, the string a = abcd is a substring of the string c = aaabcddd, but the string b = acdddd is not a substring of the string c.
The longest common subsequence and/or the longest common substring reflect, to a certain extent, the number of identical characters shared by the candidate word and the wrong word, so determining the similarity between them on this basis has practical significance.
And step S23, acquiring error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
And step S24, determining the corresponding evaluation score of each candidate word according to the similarity and the error information.
The candidate word evaluation score obtained through this step may reflect the possibility that each candidate word is the error correction word corresponding to the error word.
According to the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of corresponding candidate words are first obtained; the similarity between each candidate word and the wrong word is determined, the similarity being obtained from the longest common subsequence and/or the longest common substring of each candidate word and the wrong word; error information of the wrong word relative to each candidate word is determined; and an evaluation score corresponding to each candidate word is determined according to the similarity and the error information. The user's habits when writing words and the similarity between words are thus comprehensively considered, which helps improve the reliability of the candidate word evaluation result.
In an embodiment, the step of determining the similarity between each candidate word and the wrong word in the above implementation may specifically be: and calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word.
The longest common subsequence rate, denoted D_lcs1, is the length of the longest common subsequence divided by the average length of the sequences in the sequence set. The longest common substring rate, denoted D_lcs2, is the length of the longest common substring divided by the average length of the sequences in the sequence set. For example, the longest common subsequence rate and the longest common substring rate may be calculated as follows: for the sequence set {dbcabca, cbaba}, the longest common subsequence is baba with length 4, so the longest common subsequence rate is 4/((7+5)/2) = 0.67; for the sequence set {babca, cbaba}, the longest common substring is bab with length 3, so the longest common substring rate is 3/((5+5)/2) = 0.6.
The longest common subsequence rate and the longest common substring rate not only consider the number of the same characters between the candidate words and the wrong words, but also further consider the proportion of the same characters, which is beneficial to further improving the accuracy of the similarity between the candidate words and the wrong words.
Alternatively, if the longest common subsequence rate is denoted D_lcs1, the longest common substring rate is denoted D_lcs2, and the similarity between each candidate word and the wrong word is denoted S, the similarity may be calculated in any of the following ways:
S = w1 × D_lcs1;
S = w2 × D_lcs2;
S = w1 × D_lcs1 + w2 × D_lcs2;
wherein w1 and w2 are preset weight coefficients whose sum is 1.
Determining the similarity between the candidate word and the wrong word by a weighted sum of the longest common subsequence rate and the longest common substring rate yields a more accurate similarity. Typically, the longest common subsequence rate D_lcs1 has a larger influence on the similarity of two words, so its weight coefficient is larger than that of the longest common substring rate D_lcs2, for example:
S = 0.7 × D_lcs1 + 0.3 × D_lcs2
in another embodiment, the step of determining the similarity between each candidate word and the wrong word may further include: and calculating the similarity of each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word and the editing distance between each candidate word and the wrong word.
Optionally, the similarity between each candidate word and the wrong word may be calculated as:
S = 1/D_edit + w1 × D_lcs1;
or S = 1/D_edit + w2 × D_lcs2;
or S = 1/D_edit + w1 × D_lcs1 + w2 × D_lcs2.
Specific examples include:
S = 1/D_edit + 0.7 × D_lcs1;
or S = 1/D_edit + 0.3 × D_lcs2;
or S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2.
Here D_edit is the edit distance, i.e. the minimum number of edit operations required to convert one string into another, where the permitted edit operations are replacing a character, inserting a character and deleting a character. Generally, the smaller the edit distance, the closer the two strings are.
Therefore, the method can determine the similarity between two words by combining the edit distance between them with the number of identical characters; adding the edit distance as an evaluation dimension helps improve the effectiveness of the similarity calculation.
It can be understood that the weight coefficients before the longest common subsequence rate D_lcs1 and the longest common substring rate D_lcs2 in the above formulas may also take other values, including but not limited to those in the above examples.
In one embodiment, the error information of the wrong word relative to each candidate word includes whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the similarity and the error information includes: if the wrong word and the candidate word have the same first letter, calculating the evaluation score corresponding to the candidate word according to the similarity and a first coefficient; if the first letters differ, calculating the evaluation score corresponding to the candidate word according to the similarity and a second coefficient.
Alternatively, assuming that the similarity is denoted S and the error information of the wrong word relative to each candidate word is represented by K, the formula for calculating the score of each candidate word may be:
score_word = K × S;
where the value of K is selected according to whether the candidate word and the wrong word have the same first letter: if they are the same, K takes the value K1, otherwise K takes the value K2, where K1 and K2 are preset values. For example: K = 1 when the first letters are the same, and K = 0.5 when they are not.
For an English word or character string, based on users' writing habits, the first letter is generally not written incorrectly, so evaluating the possibility that each candidate word is the error-correction word of the wrong word in this way allows the evaluation score to be determined more reasonably: other things being equal, a candidate word with the same first letter as the wrong word is more likely to be the error-correction word, and one with a different first letter is less likely. It is understood that the values of K include, but are not limited to, those in the above example.
For example: it is known that, when writing English words and sentences on the intelligent interactive tablet, a user wrote some incorrectly as sone, and the contextual language environment is I have sone apples; the method for evaluating the candidate words corresponding to sone comprises the following steps:
1. Through English dictionary detection, sone is not found in the English dictionary and is determined to be a wrong word.
2. Words with edit distance 1 or 2 from sone are acquired from the English dictionary; assume that some, same and one exist as candidate words.
3. The longest common subsequence rate and the longest common substring rate of some, same and one with respect to sone are determined, and the similarity corresponding to each candidate word is calculated as follows:
The longest common subsequence of the candidate word some and the wrong word sone is soe, with length 3, so the longest common subsequence rate is 3/((4+4)/2) = 0.75; the longest common substring is so, with length 2, so the longest common substring rate is 2/((4+4)/2) = 0.5.
The longest common subsequence of the candidate word same and the wrong word sone is se, with length 2, so the longest common subsequence rate is 2/((4+4)/2) = 0.5; the longest common substring is s or e, with length 1, so the longest common substring rate is 1/((4+4)/2) = 0.25.
The longest common subsequence of the candidate word one and the wrong word sone is one, with length 3, so the longest common subsequence rate is 3/((3+4)/2) = 0.86; the longest common substring is one, with length 3, so the longest common substring rate is 3/((3+4)/2) = 0.86.
According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity S between each candidate word and the wrong word is:
Some: 1 + 0.7 × 0.75 + 0.3 × 0.5 = 1.675;
Same: 1/2 + 0.7 × 0.5 + 0.3 × 0.25 = 0.925;
One: 1 + 0.7 × 0.86 + 0.3 × 0.86 = 1.86.
4. According to the formula score_word = K × S, the evaluation score of each candidate word is calculated:
score_some = 1 × 1.675 = 1.675;
score_same = 1 × 0.925 = 0.925;
score_one = 0.5 × 1.86 = 0.93.
Therefore, compared with the other candidate words, the candidate word some is more likely to be the correct word (error-correction word) corresponding to the wrong word sone.
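The calculation in this example can be reproduced with the following sketch. It assumes Python; the weights 0.7/0.3 and the K values of 1 and 0.5 follow the example above, while the function names and structure are illustrative assumptions.

```python
# Sketch reproducing the worked example above: similarity from the longest
# common subsequence rate and longest common substring rate plus the
# reciprocal edit distance, scaled by the first-letter factor K.

def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence (ordered, not contiguous)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lc_substring_len(a: str, b: str) -> int:
    """Length of the longest common substring (ordered and contiguous)."""
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def similarity(wrong: str, cand: str, d_edit: int, w1: float = 0.7, w2: float = 0.3) -> float:
    avg_len = (len(wrong) + len(cand)) / 2
    d_lcs1 = lcs_len(wrong, cand) / avg_len           # longest common subsequence rate
    d_lcs2 = lc_substring_len(wrong, cand) / avg_len  # longest common substring rate
    return 1 / d_edit + w1 * d_lcs1 + w2 * d_lcs2

def score_similarity(wrong: str, cand: str, d_edit: int) -> float:
    k = 1.0 if wrong[:1] == cand[:1] else 0.5         # first-letter factor
    return k * similarity(wrong, cand, d_edit)

for cand, dist in [("some", 1), ("same", 2), ("one", 1)]:
    print(cand, round(score_similarity("sone", cand, dist), 3))
# some 1.675, same 0.925, one 0.929 (the text's 0.93 comes from the rounded rate 0.86)
```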
Based on the candidate word evaluation method of this embodiment, the evaluation score corresponding to each candidate word is positively correlated with the similarity of the candidate word to the wrong word and is also influenced by whether the candidate word and the wrong word share the same first letter; the finally obtained evaluation score therefore comprehensively considers both the user's habits when writing words and the similarity information between words.
As shown in fig. 4, in a third embodiment, a candidate word evaluation method is provided, which includes the following steps:
step S31, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the corresponding step S11 implemented above, which is not described in detail.
And step S32, determining the language environment probability of each candidate word at the wrong word position.
The language environment probability of a candidate word at the wrong-word position reflects how reasonable the candidate word is in the corresponding context when it is used to replace the wrong word; the more reasonable the candidate word is in its context, the higher its language environment probability.
In an embodiment, the probability of each candidate word at the wrong word position is calculated according to a preset language model, and the log value of the probability is used as the language environment probability of the candidate word.
And step S33, acquiring error information of the error word relative to each candidate word.
With regard to the implementation of this step, reference may be made to the description of the first embodiment above.
And step S34, determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
In the candidate word evaluation method of the embodiment, when a wrong word is detected, a plurality of corresponding candidate words are obtained, the probability of each candidate word at the position of the wrong word is respectively determined, and error information of the wrong word relative to each candidate word is determined; determining an evaluation score corresponding to each candidate word according to the language environment probability and the error information; the method not only considers the habitual problem of the user when writing the word, but also takes the information of the context language environment into account, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In one embodiment, the language model used to determine the language environment probability of each candidate word at the wrong-word position includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
The N-Gram model is a statistical language model that predicts the Nth item from the preceding (N-1) items. At the application level, these items can be phonemes (speech recognition), characters (input methods), words (word segmentation) or base pairs (genetic information). The idea of the N-Gram model is: given a string of letters such as "for ex", what is the most likely next letter? From the training corpus, a probability distribution over the next item can be obtained by maximum likelihood estimation: the probability of a is 0.4, the probability of b is 0.0001, the probability of c is ..., subject to the constraint that all of these probabilities sum to 1.
The long short-term memory neural network model, commonly called the LSTM model, is a special recurrent neural network; it can predict the next likely character from the character-level sequence entered so far.
The bidirectional long short-term memory network model, commonly referred to as the BiLSTM model, is structurally the same as the LSTM model, except that the BiLSTM model is connected not only to past states but also to future states. For example, one may train a unidirectional LSTM to predict "fish" by entering letters one by one (the recurrent connections along the time axis remember past state values), while the BiLSTM also receives the next letter in the sequence through its backward path, which lets it use future information. This form of training allows the network to fill gaps in the information rather than only predict the next item.
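As a toy illustration of the maximum-likelihood idea described for the N-Gram model, the following assumed Python sketch estimates the distribution of the next character given the previous two characters; the tiny corpus and all names are illustrative, not from the patent.

```python
# Toy illustration of the N-Gram idea: estimate, by maximum likelihood from a
# tiny corpus, the distribution of the next character given the previous
# (N-1) = 2 characters. Corpus and names are illustrative.
from collections import Counter, defaultdict

corpus = "for example for every forest"
counts = defaultdict(Counter)
for i in range(len(corpus) - 2):
    history, nxt = corpus[i:i + 2], corpus[i + 2]
    counts[history][nxt] += 1

def next_char_distribution(history: str):
    """P(next char | last two chars), normalised so the probabilities sum to 1."""
    c = counts[history]
    total = sum(c.values())
    return {ch: n / total for ch, n in c.items()}

print(next_char_distribution("fo"))   # {'r': 1.0} -> after "fo", 'r' is most likely here
```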
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the inverse of the language environment probability and the error information. Therefore, the evaluation score corresponding to each candidate word is in an inverse relationship with the corresponding probability log value, and is also influenced by the distinguishing information of the candidate word and the wrong word. The evaluation scores of the candidate words are obtained by comprehensively considering the habit problems of the user when writing the words and the language environment information, and the reliability of candidate word evaluation is improved compared with the traditional candidate word evaluation method.
In one embodiment, the error information of the wrong word relative to each candidate word includes information on whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information comprises: if the wrong word and the candidate word have the same first letter, calculating the evaluation score corresponding to the candidate word according to the language environment probability and a first coefficient; if the first letters differ, calculating the evaluation score corresponding to the candidate word according to the language environment probability and a second coefficient.
Based on any of the above embodiments, if the language environment probability of the candidate word at the wrong-word position is denoted P_mx(word), where word represents the candidate word and mx represents the language model, the evaluation score of each candidate word is calculated according to the following formula:
score_word = -K × 1/P_mx(word)
where K takes the value K1 if the candidate word and the wrong word have the same first letter, and K2 otherwise; K1 and K2 are preset values.
For example, if the language environment probability determined by the N-Gram language model is denoted P_3-gram(word), the formula for calculating the score of each candidate word may be:
score_word = -K × 1/P_3-gram(word)
Similarly, if the language environment probabilities determined by the BiLSTM model and the LSTM model are denoted P_bilstm(word) and P_lstm(word) respectively, the formulas for calculating the score of each candidate word may be:
score_word = -K × 1/P_bilstm(word)
score_word = -K × 1/P_lstm(word)
The value of K is selected according to whether the candidate word and the wrong word have the same first letter: K = 1 when they are the same and K = 0.5 when they are not. It is understood that the values of K include, but are not limited to, those in the above examples.
For an English word or character string, based on users' writing habits, the first letter is generally not written incorrectly, so evaluating in this way the possibility that each candidate word is the error-correction word of the wrong word allows the evaluation score to be determined more reasonably: a candidate word with the same first letter as the wrong word is more likely to be the error-correction word, and one with a different first letter is less likely.
For example, it is known that, when writing English words and sentences on the intelligent interactive tablet, a user wrote some incorrectly as sone, and the contextual language environment is I have sone apples.
For the N-Gram model, with N taking the value 3:
1. Through English dictionary detection, sone is not found in the English dictionary and is determined to be a wrong word.
2. Words with edit distance 1 or 2 from sone are acquired from the English dictionary; assume that some, same, son and one exist as candidate words.
3. sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability P_3-gram(word) corresponding to each candidate word is calculated according to the N-Gram model, giving the following language environment probabilities (log values):
P_3-gram(some) = -5.7;
P_3-gram(same) = -6;
P_3-gram(son) = -6;
P_3-gram(one) = -4.8;
where P(I, have, some) = c(I, have, some)/c(3-Gram), c denoting a count, i.e. the count of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
where P(have | I) = c(have, I)/c(I), i.e. the count of the 2-tuple (have, I) in the corpus divided by the count of all occurrences of I;
where P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula score_word = -K × 1/P_3-gram(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/-5.7 = 0.175;
score_same = -1 × 1/-6 = 0.167;
score_son = -1 × 1/-6 = 0.167;
score_one = -0.5 × 1/-4.8 = 0.104.
Therefore, compared with the other candidate words, the candidate word some is more likely to be the correct word (error-correction word) corresponding to the wrong word sone.
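For reference, the assumed Python sketch below shows two pieces of this example: a toy relative-frequency trigram count in the spirit of the c(I, have, some)/c(3-Gram) expression, and the scoring step score_word = -K × 1/P_3-gram(word) applied to the log-probability values quoted above. The corpus, function names and K values are illustrative assumptions, not taken from the patent.

```python
# Hedged sketch: toy trigram relative frequency and the scoring step
# score_word = -K * 1/P_3gram(word). Corpus, names and K values are illustrative.
import math
from collections import Counter

corpus = ["i have some apples", "i have some books", "i have one apple"]
tokens = [s.split() for s in corpus]
trigrams = Counter(tuple(t[i:i + 3]) for t in tokens for i in range(len(t) - 2))

def trigram_logp(w1: str, w2: str, w3: str) -> float:
    """log of c(w1, w2, w3) / c(3-Gram); -inf if the trigram was never seen."""
    c = trigrams[(w1, w2, w3)]
    return math.log(c / sum(trigrams.values())) if c else float("-inf")

print(round(trigram_logp("i", "have", "some"), 2))   # log(2/6) ~ -1.1 on this toy corpus

def score_from_logp(wrong: str, cand: str, logp: float) -> float:
    """score_word = -K / P(word); K = 1 if the first letters match, else 0.5."""
    k = 1.0 if wrong[:1] == cand[:1] else 0.5
    return -k / logp

# Log-probability values quoted in the 3-gram example above:
for cand, p in {"some": -5.7, "same": -6.0, "son": -6.0, "one": -4.8}.items():
    print(cand, round(score_from_logp("sone", cand, p), 3))
# some 0.175, same 0.167, son 0.167, one 0.104 -> "some" ranks highest
```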
In another embodiment, for the BiLSTM model, steps 1 and 2 are the same as above; step 3 is replaced by: sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability P_bilstm(word) corresponding to each candidate word is calculated, giving:
P_bilstm(some) = -4.9;
P_bilstm(same) = -5.07;
P_bilstm(son) = -7.6;
P_bilstm(one) = -4.66;
and step 4 is replaced by: according to the formula score_word = -K × 1/P_bilstm(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/-4.9 = 0.204;
score_same = -1 × 1/-5.07 = 0.197;
score_son = -1 × 1/-7.6 = 0.132;
score_one = -0.5 × 1/-4.66 = 0.107.
It can likewise be seen that, compared with the other candidate words, the candidate word some is more likely to be the correct word (error-correction word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, steps 1 and 2 above are the same; step 3 is replaced by: sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability P_lstm(word) corresponding to each candidate word is calculated, giving:
P_lstm(some) = -5.6;
P_lstm(same) = -6.3;
P_lstm(son) = -8.6;
P_lstm(one) = -5.1;
and step 4 is replaced by: according to the formula score_word = -K × 1/P_lstm(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/-5.6 = 0.179;
score_same = -1 × 1/-6.3 = 0.159;
score_son = -1 × 1/-8.6 = 0.116;
score_one = -0.5 × 1/-5.1 = 0.098.
It can likewise be seen that, compared with the other candidate words, the candidate word some is more likely to be the correct word (error-correction word) corresponding to the wrong word sone.
Through the candidate word evaluation score calculation under the three language models, the habit problem of a user when writing words is considered, and language environment information is also considered, so that candidate words of wrong words can be determined more effectively, and the error correction accuracy is improved.
As shown in fig. 5, in a fourth embodiment, there is provided a candidate word evaluation method including the steps of:
and step S41, detecting the wrong word and acquiring a plurality of candidate words corresponding to the wrong word.
The specific implementation manner of this step refers to step S11 in the above embodiment, which is not described in detail.
Step S42, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
Each candidate word is used to replace the wrong word, giving a corresponding candidate sentence; a candidate sentence may contain a plurality of words, one of which is the candidate word. The language environment probability of a word at its position in the sentence reflects how reasonable that word is in the corresponding context; the more reasonable the word is in its context, the higher its language environment probability.
In one embodiment, the neighboring words of the candidate word may be one or more, and may include both words immediately adjacent to the candidate word and words adjacent with one word in between. The probability of the candidate word and of each of its neighboring words at their respective positions in the candidate sentence can be calculated according to a preset language model, and the log value of each probability is taken as the language environment probability of the corresponding word; the language environment probability of the candidate word and the language environment probabilities of its neighboring words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. This average may be an arithmetic mean or a weighted mean.
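As a minimal sketch of this averaging step, the assumed Python code below uses an unweighted mean over a window of immediately adjacent words and a stand-in for the language model; the window size, function names and the toy log-probability table are illustrative assumptions.

```python
# Hedged sketch of the fourth embodiment's evaluation probability: average the
# log probabilities of the candidate word and its neighbouring words at their
# positions in the candidate sentence. The language model is abstracted as a
# callable; all names and values here are illustrative.
from statistics import mean
from typing import Callable, List

def evaluation_probability(sentence: List[str], cand_idx: int,
                           logp_at: Callable[[List[str], int], float],
                           window: int = 1) -> float:
    """Mean of the log probabilities of the candidate word and its neighbours
    within `window` positions (a simple, unweighted average)."""
    lo = max(0, cand_idx - window)
    hi = min(len(sentence), cand_idx + window + 1)
    return mean(logp_at(sentence, i) for i in range(lo, hi))

# Usage with a stand-in scorer; a real system would plug in an N-Gram,
# LSTM or BiLSTM language model here.
fake_logp = {"have": -2.0, "some": -5.0, "apples": -3.5}
sentence = ["i", "have", "some", "apples"]
p_bar = evaluation_probability(sentence, 2, lambda s, i: fake_logp.get(s[i], -10.0))
print(p_bar)   # (-2.0 + -5.0 + -3.5) / 3 = -3.5
```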
And step S43, acquiring error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
And step S44, determining the evaluation score corresponding to each candidate word according to the evaluation probability and the error information.
The evaluation score obtained by this step may reflect the possibility that each candidate word is the error correction word corresponding to the error word.
In the candidate word evaluation method of the embodiment, when a wrong word is detected, a plurality of corresponding candidate words are first obtained, evaluation probabilities corresponding to the candidate words are respectively determined, and error information of the wrong word relative to the candidate words is determined; determining an evaluation score corresponding to each candidate word according to the evaluation probability and the error information; the method not only considers the habitual problem of the user when writing the word, but also takes the information of the context language environment into consideration, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In one embodiment, the language model used to calculate the probabilities of the candidate word and its adjacent words at their respective positions in the candidate sentence includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For an explanation of each language model, refer to the description of the third embodiment.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to an inverse of the evaluation probability of the candidate word and the error information.
For example, if the evaluation probability of the candidate word is denoted by P_mx(word), the error information of the wrong word relative to the candidate word is denoted by K, and the evaluation score corresponding to the candidate word is denoted by score_word, the evaluation score of each candidate word can be calculated according to the following formula:
score_word = -K × 1/P_mx(word)
If the candidate word and the wrong word have the same initial letter, K takes the value K1; otherwise, K takes the value K2, where K1 and K2 are preset values.
Therefore, the evaluation score corresponding to each candidate word is inversely related to the corresponding evaluation probability and is also influenced by the initial-letter information of the wrong word. The finally obtained evaluation score of each candidate word comprehensively considers the user's writing habits and the context information, so that compared with the traditional method of evaluating candidate words only by edit distance, the reliability of the candidate word evaluation result is improved.
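A minimal sketch of this scoring rule, assuming the K1/K2 coefficients and the initial-letter comparison described above (illustrative only, not the claimed implementation):

```python
def evaluation_score(candidate: str, wrong_word: str, eval_prob: float,
                     k1: float = 1.0, k2: float = 0.5) -> float:
    """score_word = -K * 1/P, with K chosen by the initial-letter rule (K1/K2 preset)."""
    k = k1 if candidate[:1] == wrong_word[:1] else k2
    return -k / eval_prob

# evaluation_score("some", "sone", -5.6) -> approx. 0.179,
# matching the worked example later in this section
```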
In one embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the evaluation probability and the error information includes: if the initial letter of the wrong word is the same as that of the candidate word, calculating the evaluation score corresponding to the candidate word according to the evaluation probability and a first coefficient; and if the initial letter of the wrong word is different from that of the candidate word, calculating the evaluation score corresponding to the candidate word according to the evaluation probability and a second coefficient.
Optionally, assuming the evaluation probability determined by the N-Gram language model is denoted by P_N-Gram(word), the formula for calculating the score of each candidate word may be:
score_word = -K × 1/P_N-Gram(word)
and selecting the K value according to whether the candidate word and the wrong word have the same initial, wherein the K value is 1 when the candidate word and the wrong word are the same, and the K value is 0.5 when the candidate word and the wrong word are not the same. It is understood that the values of K include, but are not limited to, the values of the above examples.
For an English word or character string, based on the user's writing habits, the initial letter is generally not written incorrectly. Therefore, evaluating the possibility that each candidate word is the error-correcting word of the wrong word on the basis of the above embodiment allows the evaluation score of the candidate word to be determined more reasonably: if the candidate word has the same initial letter as the wrong word, it is more likely to be the error-correcting word; otherwise, it is less likely to be the error-correcting word.
For example: it is known that when a user writes English words and sentences on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is I have sone apples.
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are acquired from the English dictionary; assume that some, same, son and one exist as candidate words.
3. The sone in I have sone apples is replaced with some, same, son and one respectively, obtaining:
Candidate sentence one: I have some apples;
Candidate sentence two: I have same apples;
Candidate sentence three: I have son apples;
Candidate sentence four: I have one apples.
Based on the N-Gram model with N taking a value of 3, the neighboring word corresponding to the candidate word some in candidate sentence one is apples, the neighboring word corresponding to the candidate word same in candidate sentence two is apples, the neighboring word corresponding to the candidate word son in candidate sentence three is apples, and the neighboring word corresponding to the candidate word one in candidate sentence four is apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_N-Gram(some) = -5.6;
P_N-Gram(same) = -6.35;
P_N-Gram(son) = -7.58;
P_N-Gram(one) = -5.26.
4. According to the formula
score_word = -K × 1/P_N-Gram(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/(-6.35) = 0.157;
score_son = -1 × 1/(-7.58) = 0.132;
score_one = -0.5 × 1/(-5.26) = 0.095.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the BiLSTM model, the neighboring words corresponding to the candidate word some in candidate sentence one are I, have and apples, the neighboring words corresponding to the candidate word same in candidate sentence two are I, have and apples, the neighboring words corresponding to the candidate word son in candidate sentence three are I, have and apples, and the neighboring words corresponding to the candidate word one in candidate sentence four are I, have and apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_BiLSTM(some) = -2.88;
P_BiLSTM(same) = -3.62;
P_BiLSTM(son) = -4.1;
P_BiLSTM(one) = -2.62.
Step 4 is replaced by: according to the formula
score_word = -K × 1/P_BiLSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-2.88) = 0.347;
score_same = -1 × 1/(-3.62) = 0.276;
score_son = -1 × 1/(-4.1) = 0.244;
score_one = -0.5 × 1/(-2.62) = 0.191.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, the neighboring words corresponding to the candidate word some in candidate sentence one are I, have and apples, the neighboring words corresponding to the candidate word same in candidate sentence two are I, have and apples, the neighboring words corresponding to the candidate word son in candidate sentence three are I, have and apples, and the neighboring words corresponding to the candidate word one in candidate sentence four are I, have and apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_LSTM(some) = -3.175;
P_LSTM(same) = -3.75;
P_LSTM(son) = -4.6;
P_LSTM(one) = -3.45.
Step 4 is replaced by: according to the formula
score_word = -K × 1/P_LSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-3.175) = 0.315;
score_same = -1 × 1/(-3.75) = 0.267;
score_son = -1 × 1/(-4.6) = 0.217;
score_one = -0.5 × 1/(-3.45) = 0.145.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through candidate word evaluation under the three language models, the habit problem of a user when writing words is considered, and evaluation information of the language environment is added, so that the reliability of candidate word evaluation is improved.
As shown in fig. 6, in a fifth embodiment, there is provided a candidate word evaluation method including the steps of:
step S51, when the wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation manner of this step, refer to step S11 of the first embodiment, which is not described in detail.
And step S52, determining the edit distance between each candidate word and the wrong word.
As to the implementation of this step, reference may be made to the description of the first embodiment.
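The edit distance itself is only referenced here; a standard Levenshtein-distance sketch such as the following could be assumed for illustration (the exact computation of the first embodiment is not repeated in this section):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

# edit_distance("sone", "some") == 1; edit_distance("sone", "same") == 2
```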
Step S53, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
With regard to the implementation of this step, reference may be made to the description of the fourth embodiment.
And step S54, determining the evaluation score of each candidate word according to the editing distance and the evaluation probability.
The evaluation score obtained by this step can represent the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
With the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; the possibility that each candidate word is the error-correcting word corresponding to the wrong word is then comprehensively evaluated according to the edit distance between the candidate word and the wrong word and the contextual language environment information. Compared with a traditional candidate word evaluation method that relies only on edit distance, the reliability of candidate word evaluation is improved.
In one embodiment, the probabilities of the candidate word and of its neighboring words at their respective positions in the candidate sentence are calculated according to a preset language model, and the log value of each probability is taken as the language environment probability of the corresponding word; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its neighboring words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence.
The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For each language model, see the description of the third embodiment above.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to an inverse of the edit distance and an inverse of the evaluation probability.
For example, the evaluation score of each candidate word may be calculated according to the following formula:
score_word = -(1/D_edit) × 1/P_mx(word)
where D_edit denotes the edit distance between the candidate word and the wrong word, word denotes the candidate word, P_mx(word) denotes the evaluation probability of the candidate word, and score_word denotes the evaluation score corresponding to the candidate word.
Therefore, the evaluation score corresponding to each candidate word is inversely related to the corresponding evaluation probability and is also influenced by the edit distance between the candidate word and the wrong word. The finally obtained evaluation score of each candidate word comprehensively considers the writing characteristics of the word and the context information, so that compared with the traditional method of evaluating candidate words only by edit distance, the evaluation reliability of the candidate words is improved.
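A hedged sketch of this scoring rule, assuming the evaluation probability is the (negative) mean log value described above (illustrative only):

```python
def score_by_edit_distance(edit_dist: int, eval_prob: float) -> float:
    """score_word = -(1/D_edit) * (1/P), where P is the (negative) evaluation probability."""
    return -(1.0 / edit_dist) * (1.0 / eval_prob)

# score_by_edit_distance(2, -6.35) -> approx. 0.0787, as in the worked example below
```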
For example: it is known that when a user writes English words and sentences on the smart interactive tablet, some is wrongly written as sone, and the contextual language environment is I have sone apples; the method for evaluating the candidate words corresponding to sone comprises the following steps:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Candidate words of sone are acquired; assume that some, same, son and one exist as candidate words.
3. The edit distances between some, same, son, one and the wrong word sone are determined respectively as: 1, 2, 1 and 1.
4. The sone in I have sone apples is replaced with some, same, son and one respectively, obtaining:
Candidate sentence one: I have some apples;
Candidate sentence two: I have same apples;
Candidate sentence three: I have son apples;
Candidate sentence four: I have one apples;
based on the N-Gram model with N taking a value of 3, the neighboring word corresponding to the candidate word some in candidate sentence one is apples, the neighboring word corresponding to the candidate word same in candidate sentence two is apples, the neighboring word corresponding to the candidate word son in candidate sentence three is apples, and the neighboring word corresponding to the candidate word one in candidate sentence four is apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_N-Gram(some) = -5.6;
P_N-Gram(same) = -6.35;
P_N-Gram(son) = -7.58;
P_N-Gram(one) = -5.26.
5. According to the formula
score_word = -(1/D_edit) × 1/P_N-Gram(word)
the evaluation score of each candidate word is calculated:
score_some = -1/1 × 1/(-5.6) = 0.179;
score_same = -1/2 × 1/(-6.35) = 0.0787;
score_son = -1/1 × 1/(-7.58) = 0.132;
score_one = -1/1 × 1/(-5.26) = 0.19.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the BiLSTM model, the neighboring words corresponding to the candidate word some in candidate sentence one are I, have and apples, the neighboring words corresponding to the candidate word same in candidate sentence two are I, have and apples, the neighboring words corresponding to the candidate word son in candidate sentence three are I, have and apples, and the neighboring words corresponding to the candidate word one in candidate sentence four are I, have and apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_BiLSTM(some) = -2.88;
P_BiLSTM(same) = -3.62;
P_BiLSTM(son) = -4.1;
P_BiLSTM(one) = -2.62.
Step 4 is replaced by: according to the formula
score_word = -(1/D_edit) × 1/P_BiLSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-2.88) = 0.347;
score_same = -1/2 × 1/(-3.62) = 0.138;
score_son = -1 × 1/(-4.1) = 0.244;
score_one = -1 × 1/(-2.62) = 0.382.
Similarly, it is found that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the wrong word sone than the other candidate words.
In another embodiment, for the LSTM model, the neighboring words corresponding to the candidate word some in candidate sentence one are I, have and apples, the neighboring words corresponding to the candidate word same in candidate sentence two are I, have and apples, the neighboring words corresponding to the candidate word son in candidate sentence three are I, have and apples, and the neighboring words corresponding to the candidate word one in candidate sentence four are I, have and apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_LSTM(some) = -3.175;
P_LSTM(same) = -3.75;
P_LSTM(son) = -4.6;
P_LSTM(one) = -3.45.
Step 4 is replaced by: according to the formula
score_word = -(1/D_edit) × 1/P_LSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-3.175) = 0.315;
score_same = -1/2 × 1/(-3.75) = 0.1335;
score_son = -1 × 1/(-4.6) = 0.217;
score_one = -1 × 1/(-3.45) = 0.29.
It is also found that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the wrong word sone than the other candidate words.
Therefore, the evaluation score corresponding to each candidate word finally obtained through the three language models comprehensively considers the context language environment information and the editing distance information, and therefore the reliability of candidate word evaluation is improved.
As shown in fig. 7, in a sixth embodiment, there is provided a candidate word evaluation method including the steps of:
step S61, when the wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation manner of this step, refer to step S11 of the first embodiment, which is not described in detail.
And step S62, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, refer to the description of the second embodiment, and no further description is given.
Step S63, each candidate word is used for replacing the wrong word to obtain a candidate sentence, and the evaluation probability of the corresponding candidate word is determined according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
The implementation of this step is described with reference to the fourth embodiment, and is not repeated.
And step S64, determining the evaluation score of each candidate word according to the similarity and the evaluation probability.
The evaluation score obtained in this step can represent the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
With the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; the possibility that each candidate word is the error-correcting word corresponding to the wrong word is then comprehensively evaluated according to the similarity between the candidate word and the wrong word and the contextual language environment information. Compared with a traditional candidate word evaluation method that relies only on edit distance, the reliability of the candidate word evaluation result is improved.
In one embodiment, the probabilities of the candidate word and of its neighboring words at their respective positions in the candidate sentence are calculated according to a preset language model, and the log value of each probability is taken as the language environment probability of the corresponding word; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its neighboring words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. The details of each language model can be seen in the above-described embodiments.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the reciprocal of the evaluation probability and the similarity.
Optionally, the evaluation score of each candidate word is calculated according to the following formula:
score_word = -S × 1/P_mx(word)
where word represents a candidate word, score_word represents the evaluation score corresponding to the candidate word, P_mx(word) represents the evaluation probability of the candidate word, mx represents the language model, and S represents the similarity between the candidate word and the wrong word.
Therefore, the evaluation score corresponding to each candidate word is in an inverse relation with the corresponding evaluation probability, and is also influenced by the similarity of the candidate word and the wrong word. The evaluation scores corresponding to the candidate words are obtained after the similarity and the context information are integrated, and compared with a traditional method for evaluating the candidate words by means of editing distance, the evaluation reliability of the candidate words is improved.
For example: it is known that when a user writes English words and sentences on the smart interactive tablet, some is wrongly written as sone, and the contextual language environment is I have sone apples; the method for evaluating the candidate words corresponding to sone comprises the following steps:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Candidate words of sone are acquired; assume that some, same and son exist as candidate words.
3. The longest common subsequence rate and the longest common substring rate of some, same and son relative to the wrong word sone are determined respectively, and the similarity corresponding to each candidate word is calculated as follows:
The longest common subsequence of the candidate word some and the wrong word sone is soe, whose length is 3, so the longest common subsequence rate is 3/((4+4)/2) = 0.75; the longest common substring is so, whose length is 2, so the longest common substring rate is 2/((4+4)/2) = 0.5;
The longest common subsequence of the candidate word same and the wrong word sone is se, whose length is 2, so the longest common subsequence rate is 2/((4+4)/2) = 0.5; the longest common substring is s or e, whose length is 1, so the longest common substring rate is 1/((4+4)/2) = 0.25;
The longest common subsequence of the candidate word son and the wrong word sone is son, whose length is 3, so the longest common subsequence rate is 3/((3+4)/2) = 0.86; the longest common substring is son, whose length is 3, so the longest common substring rate is 3/((3+4)/2) = 0.86.
According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity S between each candidate word and the wrong word is calculated as follows:
S_some: 1 + 0.7×0.75 + 0.3×0.5 = 1.675;
S_same: 1/2 + 0.7×0.5 + 0.3×0.25 = 0.925;
S_son: 1 + 0.7×0.86 + 0.3×0.86 = 1.86.
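The similarity computation of step 3 could be sketched roughly as follows before continuing to step 4; the helper names are illustrative, while the 0.7/0.3 weights and the mean-length normalisation follow the formula and numbers above:

```python
def lcs_subsequence_len(a: str, b: str) -> int:
    """Length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_substring_len(a: str, b: str) -> int:
    """Length of the longest common (contiguous) substring."""
    best, dp = 0, [0] * (len(b) + 1)
    for ca in a:
        new = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                new[j] = dp[j-1] + 1
                best = max(best, new[j])
        dp = new
    return best

def similarity(candidate: str, wrong: str, edit_dist: int) -> float:
    """S = 1/D_edit + 0.7*D_lcs1 + 0.3*D_lcs2, rates normalised by the mean length."""
    mean_len = (len(candidate) + len(wrong)) / 2.0
    d_lcs1 = lcs_subsequence_len(candidate, wrong) / mean_len
    d_lcs2 = lcs_substring_len(candidate, wrong) / mean_len
    return 1.0 / edit_dist + 0.7 * d_lcs1 + 0.3 * d_lcs2

# similarity("some", "sone", 1) -> 1.675; similarity("same", "sone", 2) -> 0.925
```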
4. The sone in I have sone apples is replaced with some, same and son respectively, obtaining:
Candidate sentence one: I have some apples;
Candidate sentence two: I have same apples;
Candidate sentence three: I have son apples.
Based on the N-Gram model with N taking a value of 3, the neighboring word corresponding to the candidate word some in candidate sentence one is apples, the neighboring word corresponding to the candidate word same in candidate sentence two is apples, and the neighboring word corresponding to the candidate word son in candidate sentence three is apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_N-Gram(some) = -5.6;
P_N-Gram(same) = -6.35;
P_N-Gram(son) = -7.58.
Here, P(I, have, some) = c(I, have, some)/c(3-Gram), where c represents a count, i.e. the count of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
P(have | I) = c(I, have)/c(I), i.e. the count of the 2-tuple (I, have) in the corpus divided by the count of I;
P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
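These count ratios could be estimated from a tokenised corpus roughly as shown below; the corpus handling is an assumption, and no smoothing or zero-count protection is included:

```python
from collections import Counter
from typing import List, Tuple

def ngram_counts(corpus: List[List[str]], n: int) -> Counter:
    """Count all n-grams (as tuples) in a tokenised corpus."""
    counts: Counter = Counter()
    for sent in corpus:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

def count_ratios(corpus: List[List[str]], trigram: Tuple[str, str, str]):
    """Mirror the count ratios described above for a trigram such as (I, have, some)."""
    uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))
    w1, w2, _ = trigram
    p_trigram = tri[trigram] / sum(tri.values())   # c(I, have, some) / c(3-Gram)
    p_w2_given_w1 = bi[(w1, w2)] / uni[(w1,)]      # c(I, have) / c(I)
    p_w1 = uni[(w1,)] / sum(uni.values())          # c(I) / c(1-Gram)
    return p_trigram, p_w2_given_w1, p_w1
```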
5. According to the formula
score_word = -S × 1/P_N-Gram(word)
the evaluation score of each candidate word is calculated:
score_some = -1.675 × 1/(-5.6) = 0.299;
score_same = -0.925 × 1/(-6.35) = 0.146;
score_son = -1.86 × 1/(-7.58) = 0.245.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the BiLSTM model, the neighboring words corresponding to the candidate word some in candidate sentence one are I, have and apples, the neighboring words corresponding to the candidate word same in candidate sentence two are I, have and apples, and the neighboring words corresponding to the candidate word son in candidate sentence three are I, have and apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_BiLSTM(some) = -2.88;
P_BiLSTM(same) = -3.62;
P_BiLSTM(son) = -4.1.
Step 5 is replaced by: according to the formula
score_word = -S × 1/P_BiLSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1.675 × 1/(-2.88) = 0.582;
score_same = -0.925 × 1/(-3.62) = 0.256;
score_son = -1.86 × 1/(-4.1) = 0.454.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, the neighboring words corresponding to the candidate word some in candidate sentence one are I, have and apples, the neighboring words corresponding to the candidate word same in candidate sentence two are I, have and apples, and the neighboring words corresponding to the candidate word son in candidate sentence three are I, have and apples. Based on this, the evaluation probabilities corresponding to the candidate words in the candidate sentences are calculated as:
P_LSTM(some) = -3.175;
P_LSTM(same) = -3.75;
P_LSTM(son) = -4.6.
Step 5 is replaced by: according to the formula
score_word = -S × 1/P_LSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1.675 × 1/(-3.175) = 0.528;
score_same = -0.925 × 1/(-3.75) = 0.247;
score_son = -1.86 × 1/(-4.6) = 0.404.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Therefore, the evaluation score corresponding to each candidate word finally obtained through the three language models comprehensively considers the context language environment information and the similarity information between the words, and therefore the reliability of candidate word evaluation is improved.
As shown in fig. 8, in a seventh embodiment, there is provided a candidate word evaluation method including the steps of:
step S71, when the wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation of this step, reference may be made to the description of step S11 of the first embodiment.
And step S72, determining the edit distance between each candidate word and the wrong word.
As to the implementation of this step, reference may be made to the description of the first embodiment.
And step S73, determining the language environment probability of each candidate word at the wrong word position.
With regard to the implementation of this step, reference may be made to the description of the third embodiment.
And step S74, acquiring error information of the error word relative to each candidate word.
The specific implementation of this step can be referred to the description of the first embodiment above.
And step S75, determining the evaluation score corresponding to each candidate word according to the editing distance, the language environment probability and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
When a wrong word is detected, the candidate word evaluation method of the embodiment first acquires a plurality of corresponding candidate words, determines an edit distance between each candidate word and the wrong word, a probability of each candidate word at the position of the wrong word, and error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance, the language environment probability and the error information; not only the editing distance and the habit problem of the user when writing the word are considered, but also the information of the context language environment is taken into account, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In one embodiment, the language model used to determine the language environment probability of each candidate word at the position of the wrong word includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For the description of each model, refer to the related embodiments above.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the inverse of the edit distance, the inverse of the language environment probability, and the error information.
Therefore, the evaluation score corresponding to each candidate word is in a reverse relation with the editing distance of the corresponding wrong word and in a reverse relation with the corresponding language environment probability, and is also influenced by the error information of the wrong word. The evaluation scores of the candidate words are obtained by comprehensively considering habit problems, editing distance and language environment information when the user writes the words, and compared with a traditional candidate word evaluation method, the evaluation method is beneficial to improving the evaluation reliability of the candidate words.
In one embodiment, the error information of the error word relative to each candidate word includes: and the wrong word and the candidate word have the same information of the initial letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the editing distance, the language environment probability and the error information comprises the following steps: if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the first coefficient; and if the wrong word is different from the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the second coefficient.
Optionally, the evaluation score of each candidate word is calculated according to the following formula:
score_word = -K × (1/D_edit) × 1/P_mx(word)
where word represents a candidate word, D_edit represents the edit distance between the candidate word and the wrong word, P_mx(word) represents the language environment probability of the candidate word, mx represents the language model, score_word represents the evaluation score corresponding to the candidate word, and K represents the error information of the wrong word relative to each candidate word: if the candidate word and the wrong word have the same initial letter, K takes the value K1, otherwise K takes the value K2, where K1 and K2 are preset values.
For example, assuming the language environment probability determined by the N-Gram language model is denoted by P_N-Gram(word), the formula for calculating the score of each candidate word may be:
score_word = -K × (1/D_edit) × 1/P_N-Gram(word)
and selecting the K value according to whether the candidate word and the wrong word have the same initial, wherein the K value is 1 when the candidate word and the wrong word are the same, and the K value is 0.5 when the candidate word and the wrong word are not the same. It is understood that the values of K include, but are not limited to, the values of the above examples.
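Combining the pieces of this embodiment, an illustrative sketch (with the example coefficients 1 and 0.5 above, not the claimed implementation) might look like:

```python
def score_with_initial_rule(candidate: str, wrong_word: str,
                            edit_dist: int, lang_env_prob: float,
                            k_same: float = 1.0, k_diff: float = 0.5) -> float:
    """score_word = -K * (1/D_edit) * (1/P), K chosen by the initial-letter rule."""
    k = k_same if candidate[:1] == wrong_word[:1] else k_diff
    return -k * (1.0 / edit_dist) * (1.0 / lang_env_prob)

# score_with_initial_rule("some", "sone", 1, -5.7) -> approx. 0.175,
# matching the N-Gram example below
```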
For an English word or character string, based on the user's writing habits, the initial letter is generally not written incorrectly. Therefore, evaluating the possibility that each candidate word is the error-correcting word of the wrong word on the basis of the above embodiment allows the evaluation score of the candidate word to be determined more reasonably: if the candidate word has the same initial letter as the wrong word, it is more likely to be the error-correcting word; otherwise, it is less likely to be the error-correcting word.
For example, it is known that when a user writes English words on the smart interactive tablet, some is wrongly written as sone, and the contextual language environment is I have sone apples.
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are acquired from the English dictionary; assume that some, same, son and one exist as candidate words.
3. The sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability of each candidate word at the position of the wrong word, P_N-Gram(word), is calculated according to the N-Gram model with N taking a value of 3, giving:
P_N-Gram(some) = -5.7;
P_N-Gram(same) = -6;
P_N-Gram(son) = -6;
P_N-Gram(one) = -4.8.
Here, P(I, have, some) = c(I, have, some)/c(3-Gram), where c represents a count, i.e. the count of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
P(have | I) = c(I, have)/c(I), i.e. the count of the 2-tuple (I, have) in the corpus divided by the count of I;
P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula
score_word = -K × (1/D_edit) × 1/P_N-Gram(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1 × 1/(-5.7) = 0.175;
score_same = -1 × 0.5 × 1/(-6) = 0.083;
score_son = -1 × 1 × 1/(-6) = 0.167;
score_one = -0.5 × 1 × 1/(-4.8) = 0.104.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the BiLSTM model, step 1 and step 2 are the same;
Step 3 is replaced by: the sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability of each candidate word is calculated as:
P_BiLSTM(some) = -4.9;
P_BiLSTM(same) = -5.07;
P_BiLSTM(son) = -7.6;
P_BiLSTM(one) = -4.66.
Step 4 is replaced by: according to the formula
score_word = -K × (1/D_edit) × 1/P_BiLSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1 × 1/(-4.9) = 0.204;
score_same = -1 × 1/2 × 1/(-5.07) = 0.098;
score_son = -1 × 1 × 1/(-7.6) = 0.132;
score_one = -0.5 × 1 × 1/(-4.66) = 0.107.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same;
Step 3 is replaced by: the sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability of each candidate word at the position of the wrong word is calculated according to the LSTM model.
Step 4 is replaced by: according to the formula
score_word = -K × (1/D_edit) × 1/P_LSTM(word)
the evaluation score of each candidate word is calculated:
score_some = -1 × 1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/2 × 1/(-6.3) = 0.079;
score_son = -1 × 1 × 1/(-6) = 0.116;
score_one = -0.5 × 1 × 1/(-5.1) = 0.098.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
With this embodiment, the edit distance information and the contextual language environment are both taken into account, and the habitual fact that the initial letter is unlikely to be written incorrectly when the user writes English words is also considered, so that a more accurate error-correcting word can be determined.
As shown in fig. 9, in an eighth embodiment, a candidate word evaluation method is provided, which includes the following steps:
step S81, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S82, determining the edit distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S83, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
And step S84, acquiring error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
And step S85, determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity and the error information.
Specifically, this step is equivalent to scoring the candidate words according to the edit distance between each candidate word and the wrong word, the error information, and the similarity between each candidate word and the wrong word, integrating these three aspects to evaluate the possibility that each candidate word is the error-correcting word corresponding to the wrong word.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance, the similarity and the error information of each candidate word and the error word. The problems of editing distance and similarity between the candidate words and wrong words and phenomena of writing habits of users are considered, and the reliability of the candidate word evaluation result can be improved.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the editing distance, the similarity and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score corresponding to the candidate word according to the editing distance, the similarity and the second coefficient.
In one embodiment, the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the first coefficient includes: and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and the first coefficient.
In one embodiment, the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the second coefficient includes: and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and a second coefficient.
In one embodiment, the evaluation score for each candidate word may be calculated according to the following formula:
score_word = K × S × 1/D_edit
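A minimal sketch of this formula, with the initial-letter coefficients of the earlier embodiments assumed for K (illustrative only, not the claimed implementation):

```python
def score_by_similarity(candidate: str, wrong_word: str,
                        similarity: float, edit_dist: int,
                        k_same: float = 1.0, k_diff: float = 0.5) -> float:
    """score_word = K * S * (1/D_edit), K chosen by the initial-letter rule."""
    k = k_same if candidate[:1] == wrong_word[:1] else k_diff
    return k * similarity / edit_dist

# score_by_similarity("one", "sone", 1.86, 1) -> 0.93, as in the example below
```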
the specific calculation process of the evaluation score is exemplified as follows:
It is known that when a user writes English words and sentences on the smart interactive tablet, some is wrongly written as sone, and the contextual language environment is I have sone apples. After English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word. The candidate words of sone are obtained: some, same, one.
The edit distances between some, same, one and the wrong word sone are respectively: 1, 2, 1.
1. According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity between each candidate word and the wrong word is calculated as:
S_some: 1 + 0.7×0.75 + 0.3×0.5 = 1.675;
S_same: 1/2 + 0.7×0.5 + 0.3×0.25 = 0.925;
S_one: 1 + 0.7×0.86 + 0.3×0.86 = 1.86.
2. According to the formula score_word = K × S × 1/D_edit, the evaluation scores of the candidate words are calculated as follows:
score_some = 1 × 1.675 × 1/1 = 1.675;
score_same = 1 × 0.925 × 1/2 = 0.463;
score_one = 0.5 × 1.86 × 1/1 = 0.93.
as can be seen from the above evaluation scores, the evaluation score of the candidate word some is higher than the evaluation scores of same and one; the evaluation result has a certain reference value for determining the error correction words.
As shown in fig. 10, in a ninth embodiment, there is provided a candidate word evaluation method including the steps of:
step S91, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S92, determining the edit distance between each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
Step S93, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
And step S94, acquiring error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
And step S95, determining the evaluation score corresponding to each candidate word according to the editing distance, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance between the candidate word and the wrong word, the evaluation probability and the error information. The method considers not only the edit distance between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and helps improve the efficiency and accuracy of text editing.
In one embodiment, the language environment probability of each word is represented by a log value of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word is determined according to the reciprocal of the edit distance, the reciprocal of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the evaluation probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
score_word = -K × (1/D_edit) × 1/P_mx(word)
the specific calculation process of the evaluation score is exemplified as follows:
It is known that when a user writes English words and sentences on the smart interactive tablet, some is wrongly written as sone, and the contextual language environment is I have sone apples.
For the N-Gram model, N in the N-Gram model takes a value of 3:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are acquired from the English dictionary; assume that some, same, son and one exist as candidate words. The edit distances corresponding to the candidate words some, same, son and one are respectively: 1, 2, 1, 1.
3. The sone in I have sone apples is replaced with some, same, son and one respectively to obtain corresponding candidate sentences; the evaluation probability corresponding to each candidate word is calculated according to the N-Gram model and the candidate sentence, and the obtained evaluation probabilities are as follows:
P_N-Gram(some) = -5.6;
P_N-Gram(same) = -6.35;
P_N-Gram(son) = -7.58;
P_N-Gram(one) = -5.26.
4. According to the formula
score_word = -K × (1/D_edit) × 1/P_N-Gram(word)
the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1/1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/2 × 1/(-6.35) = 0.079;
score_son = -1 × 1/1 × 1/(-7.58) = 0.132;
score_one = -0.5 × 1/1 × 1/(-5.26) = 0.095.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, step 1 and step 2 are the same for the BiLSTM model; step 3 is replaced by: the sone in I have sone apples is replaced with some, same, son and one respectively, and the evaluation probability corresponding to each candidate word is calculated according to the BiLSTM model and the candidate sentence, giving:
P_BiLSTM(some) = -2.88;
P_BiLSTM(same) = -3.62;
P_BiLSTM(son) = -4.1;
P_BiLSTM(one) = -2.62.
Step 4 is replaced by: according to the formula
score_word = -K × (1/D_edit) × 1/P_BiLSTM(word)
the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1/1 × 1/(-2.88) = 0.347;
score_same = -1 × 1/2 × 1/(-3.62) = 0.138;
score_son = -1 × 1/1 × 1/(-4.1) = 0.244;
score_one = -0.5 × 1/1 × 1/(-2.62) = 0.191.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, step 1 and step 2 above are the same for the LSTM model; step 3 is replaced by: respectively calculating the evaluation probability corresponding to each candidate word according to the LSTM model and the evaluation statement, wherein the obtained evaluation probability is as follows:
P_LSTM(some) = -3.175;
P_LSTM(same) = -3.75;
P_LSTM(son) = -4.6;
P_LSTM(one) = -3.45.
Step 4 is replaced by: according to the formula
score_word = -K × (1/D_edit) × 1/P_LSTM(word)
the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1/1 × 1/(-3.175) = 0.315;
score_same = -1 × 1/2 × 1/(-3.75) = 0.133;
score_son = -1 × 1/1 × 1/(-4.6) = 0.217;
score_one = -0.5 × 1/1 × 1/(-3.45) = 0.145.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through the candidate word evaluation score calculation under the above three language models, the user's habits when writing words are considered, and the edit distance and the language environment information are also taken into account, so that the error-correcting word of the wrong word can be determined more effectively.
As shown in fig. 11, in a tenth embodiment, there is provided a candidate word evaluation method including the steps of:
and S101, detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words.
For the implementation of this step, reference may be made to the corresponding steps in the first embodiment, which are not repeated herein.
And S102, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the corresponding steps of the second embodiment, which are not described in detail.
And step S103, determining the language environment probability of each candidate word at the wrong word position.
For the implementation of this step, reference may be made to the description of the third embodiment, which is not repeated.
And step S104, acquiring error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
And S105, determining the corresponding evaluation score of each candidate word according to the similarity, the language environment probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the similarity between the candidate word and the wrong word, the language environment probability and the error information. The method considers not only the similarity between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and helps improve the efficiency and accuracy of text editing.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the similarity, the inverse of the language environment probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the similarity, the language environment probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the language environment probability and the first coefficient; and if the wrong word is different from the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the language environment probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
score_word = -K × S × 1/P_mx(word)
in one embodiment, the language model is an N-Gram model, a BilSTM model, or an LSTM model.
Optionally, the language environment probability corresponding to the candidate word may be calculated by combining the plurality of language models.
Optionally, the evaluation score corresponding to each candidate word determined through the N-Gram model may be calculated by the following formula:
score_word = -K × S × 1/P_N-Gram(word)
where P_N-Gram(word) is the language environment probability of the candidate word calculated through the N-Gram model.
For the N-Gram model, N in the N-Gram model takes a value of 3:
Specific examples are as follows: it is known that when a user writes English words and sentences on the smart interactive tablet, some is wrongly written as sone, and the contextual language environment is I have sone apples.
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are acquired from the English dictionary; assume that some, same, son and one exist as candidate words. The similarity between each candidate word and the wrong word is determined as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86.
3. The sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability of each candidate word at the position of the wrong word, P_N-Gram(word), is calculated from the replaced sentences according to the N-Gram model.
Here, P(I, have, some) = c(I, have, some)/c(3-Gram), where c represents a count, i.e. the count of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
P(have | I) = c(I, have)/c(I), i.e. the count of the 2-tuple (I, have) in the corpus divided by the count of I;
P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula
score_word = -K × S × 1/P_N-Gram(word)
the evaluation score corresponding to each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.7) = 0.294;
score_same = -1 × 0.925 × 1/(-6) = 0.154;
score_son = -1 × 1.86 × 1/(-8) = 0.219;
score_one = -0.5 × 1.86 × 1/(-4.8) = 0.182.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, step 1 and step 2 are the same for the BiLSTM model; step 3 is replaced by: the sone in I have sone apples is replaced with some, same, son and one respectively, and the language environment probability of each candidate word, P_BiLSTM(word), is calculated according to the BiLSTM model.
Step 4 is replaced by: according to the formula
score_word = -K × S × 1/P_BiLSTM(word)
the evaluation score corresponding to each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/(-6.3) = 0.147;
score_son = -1 × 1.86 × 1/(-8.6) = 0.203;
score_one = -0.5 × 1.86 × 1/(-5.1) = 0.172.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same, and step 3 is replaced by: the sone in "I have sone apples" is replaced with some, same, son and one respectively, and the language environment probability P_i of each candidate word is calculated. The language environment probability of each candidate word is as follows:
P_some = -5.6; P_same = -6.3; P_son = -8.6; P_one = -5.1.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × 1/P_i, the evaluation score corresponding to each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/(-6.3) = 0.147;
score_son = -1 × 1.86 × 1/(-8.6) = 0.203;
score_one = -0.5 × 1.86 × 1/(-5.1) = 0.172.
Similarly, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Candidate word evaluation under the above three language models takes the user's word-writing habits into account and adds similarity information and language environment evaluation information, thereby improving the reliability of candidate word evaluation.
As shown in fig. 12, in an eleventh embodiment, there is provided a candidate word evaluation method including the steps of:
and step S111, detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S112, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, and details are not repeated.
Step S113, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, and details are not repeated.
Step S114, obtaining error information of the wrong word relative to each candidate word.
For the implementation of this step, reference may be made to the description of the first embodiment, and details are not repeated.
And step S115, determining the evaluation score corresponding to each candidate word according to the similarity, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the similarity between each candidate word and the wrong word, the evaluation probability, and the error information. This not only considers the similarity between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the similarity, the inverse of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the evaluation probability and the first coefficient; and if the wrong word is different from the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
score_i = k_i × S_i × 1/E_i
wherein k_i is a coefficient that takes the value -1 when the candidate word has the same initial letter as the wrong word and -0.5 otherwise; S_i is the similarity between candidate word i and the wrong word; and E_i is the evaluation probability of candidate word i, obtained from the language environment probabilities of the candidate word and its neighboring words.
The specific calculation process of the evaluation score is exemplified as follows: suppose that, when writing an English word or sentence on the smart interactive tablet, the user intends to write the word some but mistakenly writes it as sone, and the contextual sentence is "I have sone apples".
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are acquired from the English dictionary; assume the candidate words some, same, son and one are obtained. The similarity between each candidate word and the wrong word is determined as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86.
3. The sone in "I have sone apples" is replaced with each candidate word respectively, obtaining:
Candidate sentence one: I have some apples;
Candidate sentence two: I have same apples;
Candidate sentence three: I have son apples;
Candidate sentence four: I have one apples.
Based on the N-Gram model with N taking a value of 3, the neighboring word of the candidate word some in candidate sentence one is apples, the neighboring word of the candidate word same in candidate sentence two is apples, the neighboring word of the candidate word son in candidate sentence three is apples, and the neighboring word of the candidate word one in candidate sentence four is apples; the evaluation probability corresponding to each candidate word is calculated according to the N-Gram model as follows:
E_some = -5.6; E_same = -6.35; E_son = -7.58; E_one = -5.26.
4. According to the formula score_i = k_i × S_i × 1/E_i, the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/(-6.35) = 0.146;
score_son = -1 × 1.86 × 1/(-7.58) = 0.245;
score_one = -0.5 × 1.86 × 1/(-5.26) = 0.177.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
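To make the averaging step concrete, the following Python sketch computes the evaluation probability of a candidate word as the mean of the log language environment probabilities of the candidate word and its neighboring words in the candidate sentence, and then applies the same scoring rule as before. Here log_prob_at is an assumed callback supplied by whichever language model (N-Gram, BiLSTM or LSTM) is in use; all names are illustrative.

from statistics import mean

def evaluation_probability(candidate_sentence, candidate_index, neighbor_indices, log_prob_at):
    """Average the log language environment probability of the candidate word and of
    its neighboring words; log_prob_at(sentence, i) is assumed to be provided by the
    underlying language model."""
    positions = [candidate_index] + list(neighbor_indices)
    return mean(log_prob_at(candidate_sentence, i) for i in positions)

def evaluation_score(wrong_word, candidate, similarity, eval_prob):
    """Coefficient (-1 for a shared initial letter, -0.5 otherwise) * similarity * 1 / evaluation probability."""
    coefficient = -1.0 if candidate[0] == wrong_word[0] else -0.5
    return coefficient * similarity * (1.0 / eval_prob)

# Example with the N-Gram evaluation probability reported above for "some".
print(round(evaluation_score("sone", "some", 1.675, -5.6), 3))  # 0.299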
In another embodiment, for the BiLSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and apples, the neighboring words of the candidate word same in candidate sentence two are I, have and apples, the neighboring words of the candidate word son in candidate sentence three are I, have and apples, and the neighboring words of the candidate word one in candidate sentence four are I, have and apples; on this basis, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
E_some = -2.88; E_same = -3.62; E_son = -4.1; E_one = -2.62.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × 1/E_i, the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-2.88) = 0.582;
score_same = -1 × 0.925 × 1/(-3.62) = 0.256;
score_son = -1 × 1.86 × 1/(-4.1) = 0.454;
score_one = -0.5 × 1.86 × 1/(-2.62) = 0.355.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and apples, the neighboring words of the candidate word same in candidate sentence two are I, have and apples, the neighboring words of the candidate word son in candidate sentence three are I, have and apples, and the neighboring words of the candidate word one in candidate sentence four are I, have and apples; on this basis, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
E_some = -3.175; E_same = -3.75; E_son = -4.6; E_one = -3.45.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × 1/E_i, the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-3.175) = 0.528;
score_same = -1 × 0.925 × 1/(-3.75) = 0.247;
score_son = -1 × 1.86 × 1/(-4.6) = 0.404;
score_one = -0.5 × 1.86 × 1/(-3.45) = 0.27.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Candidate word evaluation under the above three language models takes the user's word-writing habits into account and adds similarity information and language environment evaluation information, thereby improving the reliability of candidate word evaluation.
As shown in fig. 13, in a twelfth embodiment, there is provided a candidate word evaluation method including the steps of:
step S121, a wrong word is detected, and a plurality of candidate words corresponding to the wrong word are obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S122, determining the editing distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And S123, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, and details are not repeated.
And step S124, determining the language environment probability of each candidate word at the wrong word position.
For the implementation of this step, reference may be made to the description of the third embodiment, which is not repeated.
Step S125, error information of the wrong word relative to each candidate word is obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S126, determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity, the language environment probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance, the similarity, the language environment probability and the error information of each candidate word relative to the wrong word. This considers the edit distance and similarity between the candidate words and the wrong word as well as the user's writing habits, and adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and further helps to improve the efficiency and accuracy of text editing.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the inverse of the edit distance, the similarity, the inverse of the language environment probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the language environment probability, and the error information includes: if the wrong word is the same as the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the language environment probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the language environment probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
score_i = k_i × S_i × (1/D_i) × (1/P_i)
wherein k_i is a coefficient that takes the value -1 when the candidate word has the same initial letter as the wrong word and -0.5 otherwise; S_i is the similarity; D_i is the edit distance; and P_i is the language environment probability (log value) of candidate word i.
The specific calculation process of the evaluation score is exemplified as follows: suppose that, when writing an English word or sentence on the smart interactive tablet, the user intends to write the word some but mistakenly writes it as sone, and the contextual sentence is "I have sone apples".
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are acquired from the English dictionary; assume the candidate words some, same, son and one are obtained. The similarity between each candidate word and the wrong word is determined as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86. The edit distances of the candidate words some, same, son and one are 1, 2, 1 and 1, respectively.
3. The sone in "I have sone apples" is replaced with some, same, son and one respectively, and the language environment probability P_i of each candidate word is calculated according to the N-Gram model, where N takes a value of 3. The language environment probability of each candidate word is as follows:
P_some = -5.7; P_same = -6; P_son = -8; P_one = -4.8.
4. According to the formula score_i = k_i × S_i × (1/D_i) × (1/P_i), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/1 × 1/(-5.7) = 0.294;
score_same = -1 × 0.925 × 1/2 × 1/(-6) = 0.077;
score_son = -1 × 1.86 × 1/1 × 1/(-8) = 0.233;
score_one = -0.5 × 1.86 × 1/1 × 1/(-4.8) = 0.194.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
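A minimal Python sketch of the scoring rule of this embodiment, in which the inverse of the edit distance is an additional factor; the inputs are the values from steps 2 and 3 of the N-Gram example above, and the helper names are illustrative.

def evaluation_score(wrong_word, candidate, edit_distance, similarity, log_env_prob):
    """Coefficient * similarity * (1 / edit distance) * (1 / log language environment probability)."""
    coefficient = -1.0 if candidate[0] == wrong_word[0] else -0.5
    return coefficient * similarity * (1.0 / edit_distance) * (1.0 / log_env_prob)

candidates = {
    # word: (edit distance, similarity, log language environment probability)
    "some": (1, 1.675, -5.7),
    "same": (2, 0.925, -6.0),
    "son":  (1, 1.86,  -8.0),
    "one":  (1, 1.86,  -4.8),
}

scores = {w: evaluation_score("sone", w, *v) for w, v in candidates.items()}
print(max(scores, key=scores.get))  # "some" ranks highest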
In another embodiment, for the BiLSTM model, step 1 and step 2 above are the same, and step 3 is replaced by: the sone in "I have sone apples" is replaced with some, same, son and one respectively, and the language environment probability P_i of each candidate word is calculated. The language environment probability of each candidate word is as follows:
P_some = -4.9; P_same = -5.07; P_son = -7.6; P_one = -4.66.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × (1/D_i) × (1/P_i), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/1 × 1/(-4.9) = 0.342;
score_same = -1 × 0.925 × 1/2 × 1/(-5.07) = 0.182;
score_son = -1 × 1.86 × 1/1 × 1/(-7.6) = 0.246;
score_one = -0.5 × 1.86 × 1/1 × 1/(-4.66) = 0.199.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same, and step 3 is replaced by: the sone in "I have sone apples" is replaced with some, same, son and one respectively, and the language environment probability P_i of each candidate word is calculated. The language environment probability of each candidate word is as follows:
P_some = -5.6; P_same = -6.3; P_son = -6; P_one = -5.1.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × (1/D_i) × (1/P_i), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/1 × 1/(-5.6) = 0.3;
score_same = -1 × 0.925 × 1/2 × 1/(-6.3) = 0.147;
score_son = -1 × 1.86 × 1/1 × 1/(-6) = 0.216;
score_one = -0.5 × 1.86 × 1/1 × 1/(-5.1) = 0.182.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Candidate word evaluation under the above three language models takes the user's word-writing habits into account and adds edit distance, similarity and language environment evaluation information, thereby improving the reliability of candidate word evaluation.
As shown in fig. 14, in the thirteenth embodiment, there is provided a candidate word evaluation method including the steps of:
step S131, a wrong word is detected, and a plurality of candidate words corresponding to the wrong word are obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S132, determining the editing distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S133, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, and details are not repeated.
And S134, replacing the wrong words with the candidate words respectively to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate words according to the candidate sentences, wherein the evaluation probability is obtained according to the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
Step S135, obtain error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And S136, determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance, the similarity, the evaluation probability and the error information of each candidate word relative to the wrong word. This considers the edit distance and similarity between the candidate words and the wrong word as well as the user's writing habits, and adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and helps to improve the efficiency and accuracy of text editing.
In one embodiment, the language environment probability of each word is represented by a log value of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the reciprocal of the edit distance, the similarity, the reciprocal of the evaluation probability, and error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the evaluation probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the evaluation probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula:
score_i = k_i × S_i × (1/D_i) × (1/E_i)
wherein k_i is a coefficient that takes the value -1 when the candidate word has the same initial letter as the wrong word and -0.5 otherwise; S_i is the similarity; D_i is the edit distance; and E_i is the evaluation probability of candidate word i.
The specific calculation process of the evaluation score is exemplified as follows: suppose that, when writing an English word or sentence on the smart interactive tablet, the user intends to write the word some but mistakenly writes it as sone, and the contextual sentence is "I have sone apples".
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are acquired from the English dictionary; assume the candidate words some, same, son and one are obtained. The similarity between each candidate word and the wrong word is calculated as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86. The edit distances of the candidate words some, same, son and one are 1, 2, 1 and 1, respectively.
3. The sone in "I have sone apples" is replaced with some, same, son and one respectively, and the evaluation probability corresponding to each candidate word is calculated according to the N-Gram model, obtaining:
E_some = -5.6; E_same = -6.35; E_son = -7.58; E_one = -5.26.
4. According to the formula score_i = k_i × S_i × (1/D_i) × (1/E_i), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/1 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/2 × 1/(-6.35) = 0.073;
score_son = -1 × 1.86 × 1/1 × 1/(-7.58) = 0.245;
score_one = -0.5 × 1.86 × 1/1 × 1/(-5.26) = 0.177.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
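The full combination of this embodiment (coefficient, similarity, inverse edit distance and inverse evaluation probability) can be sketched as follows in Python; the inputs are the values from steps 2 and 3 of the N-Gram example above and reproduce the listed scores. The function and variable names are illustrative.

def evaluation_score(wrong_word, candidate, edit_distance, similarity, eval_prob):
    """Coefficient (-1 / -0.5 by initial letter) * similarity * 1/edit distance * 1/evaluation probability."""
    coefficient = -1.0 if candidate[0] == wrong_word[0] else -0.5
    return coefficient * similarity * (1.0 / edit_distance) * (1.0 / eval_prob)

# Edit distance, similarity and N-Gram evaluation probability from steps 2 and 3 above.
data = {
    "some": (1, 1.675, -5.6),
    "same": (2, 0.925, -6.35),
    "son":  (1, 1.86,  -7.58),
    "one":  (1, 1.86,  -5.26),
}
for word, (d, s, p) in data.items():
    print(word, round(evaluation_score("sone", word, d, s, p), 3))
# some 0.299, same 0.073, son 0.245, one 0.177 -> "some" ranks first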
In another embodiment, for the BiLSTM model, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
E_some = -2.88; E_same = -3.62; E_son = -4.1; E_one = -2.62.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × (1/D_i) × (1/E_i), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/1 × 1/(-2.88) = 0.582;
score_same = -1 × 0.925 × 1/2 × 1/(-3.62) = 0.128;
score_son = -1 × 1.86 × 1/1 × 1/(-4.1) = 0.454;
score_one = -0.5 × 1.86 × 1/1 × 1/(-2.62) = 0.355.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as follows:
E_some = -3.175; E_same = -3.75; E_son = -4.6; E_one = -3.45.
Step 4 is replaced by: according to the formula score_i = k_i × S_i × (1/D_i) × (1/E_i), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/1 × 1/(-3.175) = 0.528;
score_same = -1 × 0.925 × 1/2 × 1/(-3.75) = 0.124;
score_son = -1 × 1.86 × 1/1 × 1/(-4.6) = 0.404;
score_one = -0.5 × 1.86 × 1/1 × 1/(-3.45) = 0.27.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Candidate word evaluation under the above three language models takes the user's word-writing habits into account and adds edit distance, similarity and language environment evaluation information, thereby improving the reliability of the candidate word evaluation result.
On the basis of obtaining the evaluation score corresponding to each candidate word in any of the above embodiments, in an embodiment, the candidate word evaluation method further includes the steps of: determining the error-correcting word corresponding to the wrong word from the candidate words according to the evaluation scores, and correcting the wrong word with the error-correcting word. This enables the correction of the wrong word to be determined more accurately.
Optionally, a candidate word with the highest evaluation score is selected from the multiple candidate words of the wrong word, and is used as the error correction word corresponding to the wrong word. Furthermore, the error correction words can be adopted to replace the wrong words, so that the effect of automatically correcting the wrong words is achieved.
In addition, in an embodiment, on the basis of obtaining the evaluation score corresponding to each candidate word in any of the above embodiments, the candidate word evaluation method further includes the steps of: sorting the candidate words according to the evaluation scores and displaying the candidate words in that order, so that candidate words with higher evaluation scores are displayed earlier, which better prompts the user.
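As an illustration of how the evaluation scores could drive both automatic correction and the display order, consider the following Python sketch; the function names are illustrative and the score values are those reported for the N-Gram example of the preceding embodiment.

def rank_candidates(scores):
    """Sort candidate words by evaluation score, highest first, for display."""
    return sorted(scores, key=scores.get, reverse=True)

def correct(wrong_word, scores):
    """Pick the highest-scoring candidate as the error-correcting word for the wrong word."""
    ranked = rank_candidates(scores)
    return ranked[0] if ranked else wrong_word

scores = {"some": 0.299, "same": 0.073, "son": 0.245, "one": 0.177}
print(rank_candidates(scores))  # ['some', 'son', 'one', 'same']
print(correct("sone", scores))  # 'some' replaces 'sone'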
In an embodiment, on the basis of any of the above embodiments, the candidate word evaluation method further includes: detecting whether the words to be detected are in a preset word bank or not, and if not, determining that the words to be detected are wrong words. For example: scanning each word, detecting whether each word is in the dictionary, and determining the word is wrong if not in the dictionary.
It can be understood that the preset word stock may be a general english dictionary, a chinese dictionary, or the like, or may be other specific word stocks, and the word stock may be selected according to actual situations.
In an embodiment, optionally, after detecting the wrong word, the candidate word evaluation method further includes a step of determining a candidate word set corresponding to the wrong word. The steps can be as follows: and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word. For example, a known word with an edit distance less than 3 is selected as a candidate word, thereby improving the effectiveness of candidate word evaluation.
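The detection and candidate-generation steps just described can be sketched as follows: a word not found in the preset word bank is flagged as a wrong word, and known words within a small edit distance are collected as candidates. The toy lexicon and the threshold of 3 mirror the example above; the Levenshtein routine is a standard dynamic-programming implementation added for completeness.

def edit_distance(a, b):
    """Standard Levenshtein (insert / delete / substitute) distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def detect_wrong_word(word, lexicon):
    """A word not present in the preset word bank is treated as a wrong word."""
    return word not in lexicon

def candidate_words(wrong_word, lexicon, max_distance=3):
    """Known words whose edit distance to the wrong word is below the threshold."""
    return [w for w in lexicon if edit_distance(wrong_word, w) < max_distance]

lexicon = {"some", "same", "son", "one", "song", "sun", "apple", "have"}
print(detect_wrong_word("sone", lexicon))  # True
print(candidate_words("sone", lexicon))    # e.g. some, same, son, one, song, sun (in set order)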
It should be understood that, for the foregoing method embodiments, although the steps in the flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the method embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the ideas of the candidate word evaluation methods of the first to thirteenth embodiments, the embodiments of the present application further provide corresponding candidate word evaluation devices.
As shown in fig. 15, in the fourteenth embodiment, the candidate word evaluation means includes:
the candidate word acquisition module 101 is configured to detect a wrong word and acquire a plurality of candidate words corresponding to the wrong word;
a distance determining module 102, configured to determine an editing distance between each candidate word and the wrong word;
an error information obtaining module 103, configured to obtain error information of the wrong word relative to each candidate word;
and the first evaluation module 104 is configured to determine an evaluation score corresponding to each candidate word according to the edit distance and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, and determines error information of the wrong word relative to each candidate word; the evaluation score corresponding to each candidate word is then determined according to the edit distance and the error information. This not only considers the user's word-writing habits but also takes the edit distance information between words into account, thereby improving the reliability of the candidate word evaluation result.
In an embodiment, the first evaluation module 104 is configured to determine an evaluation score corresponding to each candidate word according to an inverse of the edit distance and error information.
In an embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial;
the first evaluation module 104 includes: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the editing distance and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the editing distance and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 16, in the fifteenth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 201, configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word;
a similarity determining module 202, configured to determine a similarity between each candidate word and a wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word;
an error information obtaining module 203, configured to obtain error information of the wrong word relative to each candidate word;
and the second evaluation module 204 is configured to determine an evaluation score corresponding to each candidate word according to the similarity and the error information.
When a wrong word is detected, the candidate word evaluation device of the embodiment first acquires a plurality of corresponding candidate words, respectively determines the similarity between each candidate word and the wrong word, and determines error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the similarity and the error information; the problem of word writing is considered, and the similarity information among the words is also considered, so that the reliability of the candidate word evaluation result can be improved.
In one embodiment, the similarity determination module 202 includes: and the first similarity calculation submodule or the second similarity calculation submodule.
The first similarity operator module is used for calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word; or the second similarity calculation operator module is used for calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word and the editing distance between each candidate word and the wrong word.
In an embodiment, the second similarity operator module is configured to calculate a similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, and an inverse of the edit distance between each candidate word and the wrong word.
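To illustrate the kind of combination these submodules describe, the Python sketch below computes a longest-common-subsequence rate and a longest-common-substring rate and combines them with the inverse of the edit distance. The normalization by the wrong word's length and the equal weighting are assumptions for illustration only; they are not the patent's exact similarity formula and are not expected to reproduce the example values (such as S_some = 1.675) used above.

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcsubstr_length(a, b):
    """Length of the longest common (contiguous) substring of a and b."""
    best, dp = 0, [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def similarity(candidate, wrong_word, edit_distance):
    """Illustrative combination: LCS rate + LCSubstr rate + 1 / edit distance.

    The normalization and the equal weights are assumptions for demonstration,
    not the exact formula of the patent."""
    n = len(wrong_word)
    return (lcs_length(candidate, wrong_word) / n
            + lcsubstr_length(candidate, wrong_word) / n
            + 1.0 / edit_distance)

print(round(similarity("some", "sone", 1), 3))  # 2.25 with these illustrative weights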
In an embodiment, the error information of the error word relative to each candidate word includes: the wrong word and the candidate word have the same information of the initial letter; optionally, the second evaluation module 204 includes: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the similarity and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the similarity and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 17, in the sixteenth embodiment, a candidate word evaluation means includes:
a candidate word obtaining module 301, configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word;
a first probability determining module 302, configured to determine a language environment probability of each candidate word at the wrong word position;
an error information obtaining module 303, configured to obtain error information of the wrong word relative to each candidate word;
and a third evaluation module 304, configured to determine an evaluation score corresponding to each candidate word according to the language environment probability and the error information.
When a wrong word is detected, the candidate word evaluation device of the embodiment first acquires a plurality of corresponding candidate words, respectively determines the probability of each candidate word at the position of the wrong word, and determines error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the language environment probability and the error information; the habit problem of the user when writing the word is considered, and the information of the context language environment is also considered, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 302 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In an embodiment, the third evaluation module 304 is configured to determine an evaluation score corresponding to each candidate word according to the inverse of the language environment probability and error information; wherein the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial; correspondingly, the third evaluation module 304 includes: the first scoring submodule is used for calculating the evaluation score of the candidate word according to the reciprocal of the language environment probability and a first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score of the candidate word according to the inverse of the language environment probability and a second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 18, in the seventeenth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 401, configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word;
a second probability determining module 402, configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
an error information obtaining module 403, configured to obtain error information of the wrong word relative to each candidate word;
and a fourth evaluation module 404, configured to determine an evaluation score corresponding to each candidate word according to the evaluation probability and the error information.
When a wrong word is detected, the candidate word evaluation device of the embodiment first acquires a plurality of corresponding candidate words, determines the evaluation probability corresponding to each candidate word, and determines the error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the evaluation probability and the error information; the habit problem of the user when writing the word is considered, and the information of the context language environment is also considered, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 402 is further configured to calculate probabilities of respective positions of a candidate word and neighboring words of the candidate word in the candidate sentence according to a preset language model, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
In an embodiment, the fourth evaluation module 404 is specifically configured to determine an evaluation score corresponding to each candidate word according to an inverse of the evaluation probability and error information; wherein the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In an embodiment, the error information of the error word relative to each candidate word includes: the wrong word and the candidate word have the same information of the initial letter; the fourth evaluation module 404 includes: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the evaluation probability and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the evaluation probability and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 19, in the eighteenth embodiment, a candidate word evaluation means includes:
the candidate word obtaining module 501 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 502, configured to determine an editing distance between each candidate word and the wrong word;
a second probability determining module 503, configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
and a fifth evaluation module 504, configured to determine an evaluation score of each candidate word according to the edit distance and the evaluation probability.
In an embodiment, the second probability determining module 503 is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate the probabilities of the candidate word and of its neighboring words at their respective positions in the candidate sentence according to a preset language model, use the log value of each probability as the language environment probability of each word, and average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its neighboring words to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes but is not limited to: an N-Gram model, a BiLSTM model, or an LSTM model.
Based on the above embodiment, the fifth evaluation module 504 is specifically configured to determine an evaluation score corresponding to each candidate word according to an inverse of the edit distance and an inverse of the evaluation probability.
As shown in fig. 20, in the nineteenth embodiment, the candidate word evaluation device includes:
the candidate word obtaining module 601 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A similarity determining module 602, configured to determine a similarity between each candidate word and a wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word;
a second probability determining module 603, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine, according to the candidate sentence, an evaluation probability of a corresponding candidate word, where the evaluation probability is obtained according to a language environment probability of a candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
and a sixth evaluation module 604, configured to determine an evaluation score of each candidate word according to the similarity and the evaluation probability.
In an embodiment, the second probability determining module 603 is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate the probabilities of the candidate word and of its neighboring words at their respective positions in the candidate sentence according to a preset language model, use the log value of each probability as the language environment probability of each word, and average the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its neighboring words to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes but is not limited to: an N-Gram model, a BiLSTM model, or an LSTM model.
Based on the above embodiment, the sixth evaluation module 604 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the evaluation probability and the similarity.
As shown in fig. 21, in the twentieth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 701, configured to obtain multiple candidate words corresponding to a wrong word when the wrong word is detected;
a distance determining module 702, configured to determine an editing distance between each candidate word and the wrong word;
a first probability determining module 703, configured to determine a language environment probability of each candidate word at the wrong word position;
an error information obtaining module 704, configured to obtain error information of the wrong word relative to each candidate word;
and a seventh evaluation module 705, configured to determine an evaluation score corresponding to each candidate word according to the edit distance, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of the embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, the probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance, the language environment probability and the error information; the editing distance information and the context language environment are considered, and the habit problem of the user when writing the word is also considered, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 703 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In an embodiment, the seventh evaluation module 705 is specifically configured to determine an evaluation score corresponding to each candidate word according to the inverse of the editing distance, the inverse of the language environment probability, and the error information; the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the error information of the error word relative to each candidate word includes: the information whether the wrong word and the candidate word have the same initial; correspondingly, the seventh evaluation module 705 comprises: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 22, in the twenty-first embodiment, the candidate word evaluation means includes:
the candidate word obtaining module 801 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 802, configured to determine an editing distance between each candidate word and the wrong word.
And a similarity determining module 803, configured to determine similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
An error information obtaining module 804, configured to obtain error information of the wrong word relative to each candidate word.
And an eighth evaluation module 805, configured to determine, according to the edit distance, the similarity, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of the embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and similarity between each candidate word and the wrong word, and determines error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance, the similarity and the error information; the editing distance information and the similarity are considered, and the habit problem of the user when writing the word is also considered, so that the reliability of the candidate word evaluation result can be improved.
In one embodiment, the eighth evaluation module 805 comprises: the first scoring submodule is used for calculating an evaluation score corresponding to the candidate word according to the distance, the similarity and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score corresponding to the candidate word according to the distance, the similarity and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 23, in twenty-two embodiments, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 901 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 902, configured to determine an editing distance between each candidate word and the wrong word.
A second probability determining module 903, configured to replace the wrong word with each candidate word to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 904, configured to obtain error information of the wrong word relative to each candidate word.
And a ninth evaluation module 905, configured to determine, according to the edit distance, the evaluation probability, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device in the embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, the evaluation probability of each candidate word at the position of the wrong word, and determines error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance, the evaluation probability and the error information; the editing distance information and the context language environment are considered, and the habit problem of the user when writing the word is also considered, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 903 is specifically configured to calculate probabilities of respective positions of a candidate word and a neighboring word of the candidate word in a candidate sentence according to a preset language model, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
In one embodiment, the ninth evaluation module 905 comprises: the first scoring submodule is used for calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and a first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 24, in twenty-third embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 1001 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
The similarity determining module 1002 is configured to determine a similarity between each candidate word and the wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word.
A first probability determining module 1003, configured to determine a language environment probability of each candidate word at the wrong-word position.
An error information obtaining module 1004, configured to obtain error information of the error word relative to each candidate word.
A tenth evaluation module 1005, configured to determine an evaluation score corresponding to each candidate word according to the similarity, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, and determines the similarity between each candidate word and the wrong word, the language environment probability of each candidate word at the position of the wrong word, and error information of the wrong word relative to each candidate word; the evaluation score corresponding to each candidate word is then determined according to the similarity, the language environment probability and the error information. The similarity information and the contextual language environment are considered, as well as the user's word-writing habits, so that the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 1003 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In one embodiment, the tenth evaluation module 1005 includes: the first scoring submodule is used for calculating the evaluation score of the candidate word according to the similarity, the language environment probability and the first coefficient if the wrong word is the same as the first letter of the candidate word; and the second scoring submodule is used for calculating the evaluation score of the candidate word according to the similarity, the language environment probability and the second coefficient if the wrong word is different from the initial of the candidate word.
As shown in fig. 25, in a twenty-fourth embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 1101 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A similarity determining module 1102, configured to determine a similarity between each candidate word and the wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word.
The second probability determining module 1103 is configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine, according to the candidate sentence, an evaluation probability of a corresponding candidate word, where the evaluation probability is obtained according to a language environment probability of a candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word.
An error information obtaining module 1104, configured to obtain error information of the wrong word relative to each candidate word.
And an eleventh evaluating module 1105, configured to determine an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the similarity between each candidate word and the wrong word and the evaluation probability of each candidate word at the wrong-word position, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the similarity, the evaluation probability, and the error information. Because the similarity and the contextual language environment are considered, as well as the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
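The error information referred to here is detailed later (see claim 1): whether the two words have the same number of characters, whether their components are the same, and whether the wrong word contains illegal symbols. A minimal sketch of collecting such flags follows; treating "components" as the multiset of letters and "illegal symbols" as non-alphabetic characters are simplifying assumptions.

```python
import re

def error_information(wrong_word: str, candidate: str) -> dict:
    """Collect simple error flags for a wrong word relative to one candidate."""
    return {
        # Do the two words have the same number of characters?
        "same_length": len(wrong_word) == len(candidate),
        # Same components? Approximated here as the same multiset of letters.
        "same_components": sorted(wrong_word.lower()) == sorted(candidate.lower()),
        # Does the wrong word contain illegal (non-alphabetic) symbols?
        "has_illegal_symbols": re.search(r"[^A-Za-z]", wrong_word) is not None,
    }
```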
In one embodiment, the second probability determining module 1103 is specifically configured to calculate, according to a preset language model, probabilities of respective positions of a candidate word and neighboring words of the candidate word in a candidate sentence, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
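A sketch of this averaging step, reusing the bigram stand-in model from the earlier sketch; restricting the "adjacent words" to the immediate left and right neighbours of the candidate is an assumption.

```python
import math

def evaluation_probability(model, sentence_tokens, position, candidate):
    """Replace the wrong word with the candidate to form the candidate sentence,
    take the log probability of the candidate and of its immediate neighbours,
    and average them (model is the BigramLanguageModel sketched earlier)."""
    tokens = list(sentence_tokens)
    tokens[position] = candidate              # build the candidate sentence
    log_probs = []
    for idx in (position - 1, position, position + 1):
        if 0 <= idx < len(tokens):
            prev = tokens[idx - 1] if idx > 0 else "<s>"
            log_probs.append(math.log(model.prob(tokens[idx], prev)))
    return sum(log_probs) / len(log_probs)
```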
In one embodiment, the eleventh evaluation module 1105 includes: a first scoring submodule, configured to calculate the evaluation score of a candidate word according to the similarity, the evaluation probability, and a first coefficient if the first letter of the wrong word is the same as that of the candidate word; and a second scoring submodule, configured to calculate the evaluation score of a candidate word according to the similarity, the evaluation probability, and a second coefficient if the first letter of the wrong word is different from that of the candidate word.
As shown in fig. 26, in a twenty-fifth embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
a candidate word obtaining module 1201, configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 1202, configured to determine an editing distance between each candidate word and the wrong word.
A similarity determining module 1203, configured to determine a similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A first probability determining module 1204, configured to determine a language environment probability of each candidate word at the wrong word position.
An error information obtaining module 1205 is configured to obtain error information of the wrong word relative to each candidate word.
And a twelfth evaluation module 1206, configured to determine an evaluation score corresponding to each candidate word according to the editing distance, the similarity, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and the similarity between each candidate word and the wrong word, determines the language environment probability of each candidate word at the wrong-word position, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the edit distance, the similarity, the language environment probability, and the error information. Because the edit distance, the similarity, and the contextual language environment are considered, as well as the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 1204 is specifically configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In one embodiment, the twelfth evaluation module 1206 includes: a first scoring submodule, configured to calculate the evaluation score of a candidate word according to the edit distance, the similarity, the language environment probability, and a first coefficient if the first letter of the wrong word is the same as that of the candidate word; and a second scoring submodule, configured to calculate the evaluation score of a candidate word according to the edit distance, the similarity, the language environment probability, and a second coefficient if the first letter of the wrong word is different from that of the candidate word.
As shown in fig. 27, in a twenty-sixth embodiment, a candidate word evaluation device is provided, and the candidate word evaluation device of this embodiment includes:
the candidate word obtaining module 1301 is configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word.
The distance determining module 1302 is configured to determine an editing distance between each candidate word and the wrong word.
And the similarity determining module 1303 is configured to determine the similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
And a second probability determining module 1304, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine, according to the candidate sentence, an evaluation probability of a corresponding candidate word, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 1305, configured to obtain error information of the error word relative to each candidate word.
And a thirteenth evaluation module 1306, configured to determine, according to the edit distance, the similarity, the evaluation probability, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and the similarity between each candidate word and the wrong word and the evaluation probability of each candidate word at the wrong-word position, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the edit distance, the similarity, the evaluation probability, and the error information. Because the edit distance, the similarity, and the contextual language environment are considered, as well as the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 1304 is specifically configured to calculate, according to a preset language model, probabilities of respective positions of a candidate word and a neighboring word of the candidate word in a candidate sentence, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
In one embodiment, the thirteenth evaluating module 1306 is further configured to determine an evaluation score corresponding to each candidate word according to the inverse of the edit distance, the similarity, the inverse of the evaluation probability, and the error information.
In one embodiment, the thirteenth evaluation module 1306 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to a candidate word according to the edit distance, the similarity, the evaluation probability, and a first coefficient if the first letter of the wrong word is the same as that of the candidate word; and a second scoring submodule, configured to calculate the evaluation score corresponding to a candidate word according to the edit distance, the similarity, the evaluation probability, and a second coefficient if the first letter of the wrong word is different from that of the candidate word.
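A sketch of such a combined score, following the reciprocal-of-edit-distance and reciprocal-of-evaluation-probability reading described above together with the K1/K2 coefficients; the multiplicative form and the example coefficient values are assumptions for illustration.

```python
def combined_score(edit_distance, similarity, eval_log_prob,
                   wrong_word, candidate, k1=1.2, k2=1.0):
    """Combine the reciprocal of the edit distance, the similarity, the reciprocal
    of the (log-valued) evaluation probability's magnitude, and the first/second
    coefficient chosen by the first-letter comparison."""
    k = k1 if wrong_word[:1].lower() == candidate[:1].lower() else k2
    d = max(edit_distance, 1)                 # avoid division by zero
    return k * similarity * (1.0 / d) / max(abs(eval_log_prob), 1e-9)
```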
On the basis of the candidate word evaluation apparatus in any of the above embodiments, in an embodiment, the candidate word evaluation apparatus further includes: a candidate word determining module, configured to calculate the edit distance between the wrong word and known words in a preset word bank, and to select known words whose edit distance is within a set range, so as to obtain a plurality of candidate words corresponding to the wrong word.
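A minimal sketch of this candidate generation step: a standard Levenshtein edit distance plus a filter over a known-word lexicon. The distance threshold of 2 is an assumed value; the embodiment only requires the distance to be within a set range.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words (insert/delete/substitute, cost 1)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[len(b)]

def candidate_words(wrong_word, lexicon, max_distance=2):
    """Select known words whose edit distance to the wrong word is within the set range."""
    return [w for w in lexicon if edit_distance(wrong_word, w) <= max_distance]
```

For example, candidate_words("raeding", ["reading", "leading", "heading", "banana"]) keeps the first three words and discards "banana".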
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a correcting module, configured to determine, from the candidate words according to the evaluation scores, an error correction word corresponding to the wrong word, and to correct the wrong word by using the error correction word. Optionally, the correcting module is configured to determine, from the multiple candidate words, the candidate word with the highest evaluation score, and to use that candidate word as the error correction word corresponding to the wrong word.
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a sorting module, configured to sort the candidate words according to the evaluation scores and to display the sorted candidate words.
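A short sketch of sorting candidates by score and taking the top one as the error correction word; the function name and the dictionary-of-scores interface are assumptions, not part of the described apparatus.

```python
def rank_candidates(candidates, scores):
    """Sort candidate words by their evaluation scores, highest first, and return
    both the ranked list (for display) and the top candidate (the error correction word)."""
    ranked = sorted(candidates, key=lambda w: scores[w], reverse=True)
    return ranked, ranked[0] if ranked else None
```

For example, rank_candidates(["leading", "reading"], {"reading": 0.8, "leading": 0.3}) returns the ranked list with "reading" first and "reading" as the correction.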
In an embodiment, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the candidate word evaluation method in any one of the above embodiments when executing the computer program.
Compared with the traditional candidate word evaluation approach based only on the edit distance, this computer device can evaluate the candidate words of a wrong word more effectively.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the candidate word evaluation method of any one of the above embodiments.
With this computer-readable storage medium, compared with the traditional candidate word evaluation approach based only on the edit distance, the candidate words of a wrong word can be evaluated more effectively.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The terms "comprises" and "comprising," as well as any variations thereof, of the embodiments herein are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) elements is not limited to only those steps or elements but may alternatively include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
References to "first \ second" herein merely distinguish between similar objects and do not denote a particular ordering for the objects, and it is to be understood that "first \ second" may, where permissible, interchange a particular order or sequence. It should be understood that "first \ second" distinct objects may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced in sequences other than those illustrated or described herein.
The above embodiments express only several implementations of the present application, and although their descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and such variations and modifications fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (10)

1. A candidate word evaluation method, comprising:
detecting wrong words, and acquiring a plurality of candidate words corresponding to the wrong words; the wrong words comprise words obtained based on writing operation of a user;
determining the editing distance between each candidate word and the wrong word;
determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence rate and/or the longest common substring rate of each candidate word and the wrong word; the longest common subsequence rate and/or the longest common substring rate are/is used for representing the number of the same characters between each candidate word and the wrong word and the proportion of the same characters;
respectively replacing the wrong words with the candidate words to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate words according to the candidate sentences, wherein the evaluation probability is obtained according to the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words;
acquiring error information of the wrong word relative to each candidate word;
determining the evaluation score of each candidate word according to the editing distance, the similarity, the evaluation probability and the error information;
wherein the determining an evaluation probability of a corresponding candidate word from the candidate sentence comprises:
calculating the probability of each position of a candidate word and a word adjacent to the candidate word in the candidate sentence according to a preset language model, and taking the log value of the probability as the language environment probability of each word;
averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences;
determining an evaluation score of each candidate word according to the editing distance, the similarity, the evaluation probability and the error information, including:
determining an evaluation score corresponding to each candidate word according to the reciprocal of the editing distance, the similarity, the reciprocal of the evaluation probability and the error information;
the error information includes any one or more of:
information on whether the wrong word and the candidate word have the same number of characters, information on whether the components of the wrong word and the candidate word are the same, and information on whether the wrong word contains illegal symbols.
2. The candidate word evaluation method of claim 1, wherein the language model comprises: an N-Gram model, a BiLSTM model, or an LSTM model.
3. The candidate word evaluation method of claim 1, further comprising:
determining that a word to be detected is not in a preset word bank, and determining that the word to be detected is a wrong word.
4. The candidate word evaluation method of claim 3, further comprising, after detecting a wrong word, the steps of:
and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word.
5. The candidate word evaluation method of claim 1, further comprising:
determining an error correction word corresponding to the wrong word from the plurality of candidate words according to the evaluation score, and correcting the wrong word by using the error correction word;
and/or,
and sorting the candidate words according to the evaluation scores, and displaying the sorted candidate words.
6. The candidate word evaluation method according to claim 5, wherein the determining an error correction word corresponding to the wrong word from the plurality of candidate words according to the evaluation score includes:
determining, from the plurality of candidate words, the candidate word with the highest evaluation score as the error correction word corresponding to the wrong word.
7. The candidate word evaluation method according to claim 2, wherein the evaluation score of each candidate word is calculated according to the following formula:
score_word = K × S × (1 / D_edit) × (1 / P_mx(word))
wherein D_edit indicates the edit distance between the candidate word and the wrong word, word indicates the candidate word, mx indicates the language model, P_mx(word) represents the evaluation probability of the candidate word, score_word represents the evaluation score corresponding to the candidate word, and K represents the error information of the wrong word relative to each candidate word; if the first letter of the candidate word is the same as that of the wrong word, K takes the value K1, otherwise K takes the value K2, where K1 and K2 are preset values, and S represents the similarity between the candidate word and the wrong word.
8. A candidate word evaluation apparatus, comprising:
the candidate word acquisition module is used for detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words; the wrong words comprise words obtained based on writing operation of a user;
the distance determining module is used for determining the editing distance between each candidate word and the wrong word;
the similarity determining module is used for determining the similarity between each candidate word and the wrong word, and the similarity is obtained according to the longest common subsequence rate and/or the longest common substring rate of each candidate word and the wrong word; the longest common subsequence rate and/or the longest common substring rate are/is used for representing the number of the same characters between each candidate word and the wrong word and the proportion of the same characters;
the second probability determination module is used for replacing the wrong words with the candidate words respectively to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate words according to the candidate sentences, wherein the evaluation probability is obtained according to the language environment probability of the candidate words in the candidate sentences and the language environment probability of adjacent words of the candidate words;
the error information acquisition module is used for acquiring error information of the wrong words relative to the candidate words;
the ninth evaluation module is used for determining the evaluation score of each candidate word according to the editing distance, the similarity, the evaluation probability and the error information;
the second probability determining module is specifically configured to calculate probabilities of candidate words and neighboring words of the candidate words at respective positions in the candidate sentences according to a preset language model, and use log values of the probabilities as language environment probabilities of the words; averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences;
the ninth evaluation module is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the edit distance, the similarity, the reciprocal of the evaluation probability, and the error information;
the error information includes any one or more of:
information on whether the wrong word and the candidate word have the same number of characters, information on whether the components of the wrong word and the candidate word are the same, and information on whether the wrong word contains illegal symbols.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN201810321032.XA 2018-04-11 2018-04-11 Candidate word evaluation method, candidate word ordering method and device Active CN108694167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810321032.XA CN108694167B (en) 2018-04-11 2018-04-11 Candidate word evaluation method, candidate word ordering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810321032.XA CN108694167B (en) 2018-04-11 2018-04-11 Candidate word evaluation method, candidate word ordering method and device

Publications (2)

Publication Number Publication Date
CN108694167A CN108694167A (en) 2018-10-23
CN108694167B true CN108694167B (en) 2022-09-06

Family

ID=63845594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810321032.XA Active CN108694167B (en) 2018-04-11 2018-04-11 Candidate word evaluation method, candidate word ordering method and device

Country Status (1)

Country Link
CN (1) CN108694167B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110303A (en) * 2019-03-28 2019-08-09 苏州八叉树智能科技有限公司 Newsletter archive generation method, device, electronic equipment and computer-readable medium
CN110502754B (en) * 2019-08-26 2021-05-28 腾讯科技(深圳)有限公司 Text processing method and device
CN117034917B (en) * 2023-10-08 2023-12-22 中国医学科学院医学信息研究所 English text word segmentation method, device and computer readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776863A (en) * 2016-11-28 2017-05-31 合网络技术(北京)有限公司 The determination method of the text degree of correlation, the method for pushing and device of Query Result
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 A kind of text similarity determines method
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325536A1 (en) * 2012-05-30 2013-12-05 International Business Machines Corporation Measuring short-term cognitive aptitudes of workers for use in recommending specific tasks
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN106708893B (en) * 2015-11-17 2018-09-28 华为技术有限公司 Search query word error correction method and device
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
CN106202153B (en) * 2016-06-21 2019-09-17 广州智索信息科技有限公司 A kind of the spelling error correction method and system of ES search engine

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN106776863A (en) * 2016-11-28 2017-05-31 合网络技术(北京)有限公司 The determination method of the text degree of correlation, the method for pushing and device of Query Result
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 A kind of text similarity determines method

Also Published As

Publication number Publication date
CN108694167A (en) 2018-10-23

Similar Documents

Publication Publication Date Title
US11614862B2 (en) System and method for inputting text into electronic devices
US10402493B2 (en) System and method for inputting text into electronic devices
US9460066B2 (en) Systems and methods for character correction in communication devices
US9659002B2 (en) System and method for inputting text into electronic devices
CN110334179B (en) Question-answer processing method, device, computer equipment and storage medium
US20060206313A1 (en) Dictionary learning method and device using the same, input method and user terminal device using the same
CN108694167B (en) Candidate word evaluation method, candidate word ordering method and device
CN108628826B (en) Candidate word evaluation method and device, computer equipment and storage medium
US20210405765A1 (en) Method and Device for Input Prediction
Habib et al. An exploratory approach to find a novel metric based optimum language model for automatic bangla word prediction
CN108681533B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN106570196B (en) Video program searching method and device
US20100125725A1 (en) Method and system for automatically detecting keyboard layout in order to improve the quality of spelling suggestions and to recognize a keyboard mapping mismatch between a server and a remote user
CN112182337B (en) Method for identifying similar news from massive short news and related equipment
CN108595419B (en) Candidate word evaluation method, candidate word sorting method and device
CN108664466B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108647202B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108664467B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108733646B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108681535B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108694166B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108628827A (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108681534A (en) Candidate word evaluation method and device, computer equipment and storage medium
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program
CN108733645A (en) Candidate word evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant