CN108628826A - Candidate word evaluation method and device, computer equipment and storage medium - Google Patents

Info

Publication number: CN108628826A (application CN201810320358.0A; granted as CN108628826B)
Other languages: Chinese (zh)
Inventor: 李贤
Assignee (original and current): Guangzhou Shiyuan Electronics Technology Co., Ltd.
Legal status: Granted; Active
Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a candidate word evaluation method, a candidate word evaluation device, computer equipment, and a storage medium, applied in the field of data processing. The method comprises the following steps: detecting a wrong word and acquiring a plurality of candidate words corresponding to the wrong word; determining the language environment probability of each candidate word at the position of the wrong word; acquiring error information of the wrong word relative to each candidate word; and determining the evaluation score of each candidate word according to the language environment probability and the error information. The method, device, computer equipment, and storage medium of the embodiments of the invention help improve the reliability of candidate word evaluation results.

Description

Candidate word evaluation method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a candidate word evaluation method, apparatus, computer device, and storage medium.
Background
Currently popular word-processing software, such as Word, WPS, and WordPerfect, embeds an English spell-check function for checking English spelling; when a misspelled word is detected, a prompt or a corresponding error correction suggestion is given.
In the process of implementing the invention, the inventor found the following problem in the prior art: the existing error correction method mainly uses a dictionary for detection and, after finding a misspelling, evaluates the candidate words for a wrong word by edit distance alone. This approach is too simplistic and rigid, and the reliability of the candidate word evaluation results is not ideal.
Disclosure of Invention
Based on this, it is necessary to provide a candidate word evaluation method, apparatus, computer device, and storage medium to solve the problem that the reliability of candidate word evaluation results obtained in the existing manner is not ideal.
The scheme provided by the embodiment of the invention comprises the following steps:
in one aspect, a candidate word evaluation method includes:
detecting wrong words, and acquiring a plurality of candidate words corresponding to the wrong words;
determining the language environment probability of each candidate word at the wrong word position;
acquiring error information of the wrong word relative to each candidate word;
and determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
In one embodiment, the determining the language environment probability of each candidate word at the wrong word position includes:
and calculating the probability of each candidate word at the wrong word position according to a preset language model, and taking the log value of the probability as the language environment probability of the candidate word.
In one embodiment, the determining an evaluation score corresponding to each candidate word according to the language environment probability and the error information includes:
determining an evaluation score corresponding to each candidate word according to the reciprocal of the language environment probability and the error information;
and/or,
the language model comprises an N-Gram model, a BiLSTM model, or an LSTM model.
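As an illustration of how an N-Gram model could yield the language environment probability (the log of the candidate's probability at the wrong-word position), consider the following minimal bigram sketch; the toy corpus, the add-one smoothing choice, and the function names are illustrative assumptions, not part of the disclosure:

```python
import math
from collections import Counter

# Illustrative toy corpus; the patent does not specify training data.
corpus = "i have some apples i have some books i have one apple".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def bigram_logprob(prev: str, word: str) -> float:
    """log P(word | prev) with add-one smoothing; the 'language
    environment probability' is the log value of the probability."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

def language_env_logprob(left: str, candidate: str, right: str) -> float:
    """Score a candidate placed at the wrong-word position, given its
    left and right context words."""
    return bigram_logprob(left, candidate) + bigram_logprob(candidate, right)
```

In this toy setting, a candidate that fits the surrounding context (e.g. "some" between "have" and "apples") receives a higher language environment probability than one that does not.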
In one embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter;
determining an evaluation score corresponding to each candidate word according to the language environment probability and the error information, wherein the evaluation score comprises the following steps:
if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the language environment probability and the first coefficient;
and if the wrong word and the candidate word have different initial letters, calculating the evaluation score corresponding to the candidate word according to the language environment probability and the second coefficient.
In one embodiment, the method further comprises the following steps:
detecting whether the words to be detected are in a preset word bank or not, and if not, determining that the words to be detected are wrong words.
In one embodiment, after the wrong word is detected, the method further comprises the following steps:
and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word.
In one embodiment, the method further comprises the following steps:
determining error correction words corresponding to the error words from the candidate words according to the evaluation scores, and correcting the error words by using the error correction words;
and/or,
and sorting the candidate words according to the evaluation scores, and displaying the sorted candidate words.
In one embodiment, the determining, according to the evaluation score, an error correction word corresponding to the wrong word from the multiple candidate words includes:
and determining the candidate word with the highest evaluation score from the plurality of candidate words as the error correction word corresponding to the error word.
In one embodiment, the evaluation score of each candidate word is calculated according to the following formula:
score_word = K × 1/score
wherein word represents a candidate word, score represents the language environment probability of the candidate word, score_word represents the evaluation score corresponding to the candidate word, and K represents the error information of the wrong word relative to the candidate word; if the candidate word and the wrong word have the same initial letter, K takes the value K1, otherwise K2, where K1 and K2 are preset values.
Yet another aspect provides a candidate word evaluation apparatus, including:
the candidate word acquisition module is used for detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words;
the first probability determination module is used for determining the language environment probability of each candidate word at the wrong word position;
the error information acquisition module is used for acquiring error information of the wrong words relative to the candidate words;
and the third evaluation module is used for determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
In one embodiment, the first probability determination module is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In one embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter;
the third evaluation module comprises:
the first scoring submodule is used for calculating the evaluation score of the candidate word according to the reciprocal of the language environment probability and a first coefficient if the wrong word is the same as the initial of the candidate word;
and the second scoring submodule is used for calculating the evaluation score of the candidate word according to the inverse of the language environment probability and a second coefficient if the wrong word is different from the initial of the candidate word.
According to the above candidate word evaluation method and device, when a wrong word is detected, a plurality of corresponding candidate words are first acquired; the language environment probability of each candidate word at the position of the wrong word is determined, and the error information of the wrong word relative to each candidate word is determined; an evaluation score for each candidate word is then determined according to the language environment probability and the error information. The method considers not only how the word itself was written but also the information of the context language environment, which helps improve the accuracy of the candidate word evaluation results.
Yet another aspect provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the candidate word evaluation method as described above when executing the computer program.
According to the computer equipment, the accuracy of the candidate word evaluation result is improved through the computer program running on the processor.
Yet another aspect provides a computer storage medium having a computer program stored thereon, which when executed by a processor implements the candidate word evaluation method as described above.
The computer storage medium is beneficial to improving the accuracy of the candidate word evaluation result through the stored computer program.
Drawings
FIG. 1 is a diagram of an exemplary application environment of a candidate word evaluation method;
FIG. 2 is a schematic flow chart of the candidate word evaluation method of the first embodiment;
FIG. 3 is a schematic flow chart of the candidate word evaluation method of the second embodiment;
FIG. 4 is a schematic flow chart of the candidate word evaluation method of the third embodiment;
FIG. 5 is a schematic flow chart of the candidate word evaluation method of the fourth embodiment;
FIG. 6 is a schematic flow chart of the candidate word evaluation method of the fifth embodiment;
FIG. 7 is a schematic flow chart of the candidate word evaluation method of the sixth embodiment;
FIG. 8 is a schematic flow chart of the candidate word evaluation method of the seventh embodiment;
FIG. 9 is a schematic flow chart of the candidate word evaluation method of the eighth embodiment;
FIG. 10 is a schematic flow chart of the candidate word evaluation method of the ninth embodiment;
FIG. 11 is a schematic flow chart of the candidate word evaluation method of the tenth embodiment;
FIG. 12 is a schematic flow chart of the candidate word evaluation method of the eleventh embodiment;
FIG. 13 is a schematic flow chart of the candidate word evaluation method of the twelfth embodiment;
FIG. 14 is a schematic flow chart of the candidate word evaluation method of the thirteenth embodiment;
FIG. 15 is a schematic structural diagram of the candidate word evaluation apparatus of the fourteenth embodiment;
FIG. 16 is a schematic structural diagram of the candidate word evaluation apparatus of the fifteenth embodiment;
FIG. 17 is a schematic structural diagram of the candidate word evaluation apparatus of the sixteenth embodiment;
FIG. 18 is a schematic structural diagram of the candidate word evaluation apparatus of the seventeenth embodiment;
FIG. 19 is a schematic structural diagram of the candidate word evaluation apparatus of the eighteenth embodiment;
FIG. 20 is a schematic structural diagram of the candidate word evaluation apparatus of the nineteenth embodiment;
FIG. 21 is a schematic structural diagram of the candidate word evaluation apparatus of the twentieth embodiment;
FIG. 22 is a schematic structural diagram of the candidate word evaluation apparatus of the twenty-first embodiment;
FIG. 23 is a schematic structural diagram of the candidate word evaluation apparatus of the twenty-second embodiment;
FIG. 24 is a schematic structural diagram of the candidate word evaluation apparatus of the twenty-third embodiment;
FIG. 25 is a schematic structural diagram of the candidate word evaluation apparatus of the twenty-fourth embodiment;
FIG. 26 is a schematic structural diagram of the candidate word evaluation apparatus of the twenty-fifth embodiment;
FIG. 27 is a schematic structural diagram of the candidate word evaluation apparatus of the twenty-sixth embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The candidate word evaluation method provided by the application can be applied to the application environment shown in fig. 1. The internal structure of the computer apparatus may be as shown in fig. 1, and includes a processor, a memory, a display screen, and an input device connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The computer program is executed by a processor to implement a candidate word evaluation method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
The computer device can be a terminal, including but not limited to personal computers, laptops, smartphones, and smart interactive tablets. A smart interactive tablet can detect and recognize a user's writing operation, detect the written content, and even automatically correct wrongly written words.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
As shown in fig. 2, in a first embodiment, a candidate word evaluation method is provided, which includes the following steps:
step S11, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The wrong words include misspelled words or miswritten words, which are words not present in the corresponding lexicon.
The wrong word may be text such as a symbol, a character, a word, or an English word. A candidate word is a word similar to the wrong word determined from it, that is, a possible correct word (error correction word) corresponding to the wrong word. The number of candidate words may be one, two, three, or more; embodiments of the invention do not limit this number. If only one candidate word is determined, it is directly taken as the error correction word.
The method for detecting the wrong word can be implemented in various ways, such as a dictionary detection way, and the embodiment of the invention does not limit the method for detecting the wrong word.
And step S12, determining the edit distance between each candidate word and the wrong word.
The editing distance between the candidate word and the wrong word is used for measuring the difference degree between the candidate word and the wrong word; it may refer to the number difference of letters between the candidate word and the wrong word, stroke difference, etc. For a character string or an english word, the editing distance refers to the minimum number of editing operations required to change one string into another string, and the allowable editing operations include replacing a character, inserting a character, and deleting a character.
In one embodiment, a two-dimensional array d[i, j] stores the edit distance values, where d[i, j] represents the minimum number of steps required to convert the string s[1…i] into the string t[1…j]. When i equals 0 (string s is empty), d[0, j] = j, since j characters must be inserted to convert s into t; when j equals 0 (string t is empty), d[i, 0] = i, since i characters must be deleted. For example, the edit distance between kitten and sitting is 3, because at least the following three steps are needed: sitten (k → s), sittin (e → i), sitting (+ g).
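The d[i, j] recurrence above can be sketched as follows (a minimal illustration, not the patent's implementation):

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string s into string t."""
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters of s
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a character
                          d[i][j - 1] + 1,         # insert a character
                          d[i - 1][j - 1] + cost)  # substitute a character
    return d[m][n]
```

With this function, edit_distance("kitten", "sitting") returns 3, matching the worked example above.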
For example: for a wrong word, all known words in the word bank with an edit distance of 1 or 2 are selected to construct the candidate word set of the wrong word. The candidate word set may also be determined based on other edit distance conditions, depending on the actual situation. Screening the candidate words corresponding to the wrong word in this way improves the reliability of the candidate word evaluation results and helps reduce subsequent computation time.
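One way to construct such a candidate set without computing the edit distance against every word in the bank is to enumerate all strings within edit distance 1 and 2 of the wrong word and intersect them with the word bank, in the style of Norvig's well-known spell corrector; this is a sketch under that assumption, and the toy lexicon below is illustrative, not from the patent:

```python
import string

# Illustrative word bank; the patent's lexicon is not specified.
lexicon = {"some", "same", "one", "game", "gone", "stone"}

def edits1(word: str) -> set:
    """All strings reachable from word by one deletion, insertion,
    or substitution (the edit operations named in the text)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    substitutes = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | substitutes | inserts

def candidates(wrong: str) -> set:
    """Known words whose edit distance to the wrong word is 1 or 2."""
    d1 = edits1(wrong)
    d2 = {e2 for e1 in d1 for e2 in edits1(e1)}
    return (d1 | d2) & lexicon
```

For the wrong word "sone", this yields candidates such as "some" and "one" (distance 1) and "same" (distance 2).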
And step S13, acquiring error information of the wrong word relative to each candidate word.
The error information of the wrong word relative to each candidate word represents the information distinguishing the wrong word from each candidate word. Optionally, the error information may be determined according to users' general habits when writing words, such as characters users are particularly prone to miswrite, or the parts of a word where errors most often occur.
In an embodiment, the error information of the error word relative to each candidate word may refer to information on whether the error word is the same as the first letter of each candidate word, information on whether the number of characters of the error word and the candidate word is the same, information on whether the error word is the same as the component of each candidate word, information on whether the error word contains an illegal symbol, or the like.
And step S14, determining the evaluation score corresponding to each candidate word according to the editing distance and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word. The step is equivalent to effectively evaluating the possibility that the candidate word is the error-correcting word corresponding to the error word according to the editing distance information and the error information of the candidate word and the error word.
According to the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of corresponding candidate words are first acquired, the edit distance between each candidate word and the wrong word is determined, and the error information of the wrong word relative to each candidate word is determined; an evaluation score for each candidate word is then determined according to the edit distance and the error information. The method considers not only the user's habits when writing words but also the edit distance between words, which helps improve the reliability of the candidate word evaluation results.
In an embodiment, determining an evaluation score corresponding to each candidate word according to the edit distance and the error information includes: and determining the evaluation score corresponding to each candidate word according to the reciprocal of the editing distance and the error information.
Generally, the larger the edit distance, the greater the difference between the candidate word and the wrong word, and the less likely the candidate word is the error correction word of the wrong word; therefore, determining the score from the reciprocal of the edit distance together with the error information helps ensure the reliability of the candidate word's evaluation score.
In one embodiment, the error information of the error word relative to each candidate word includes: and whether the wrong word and the candidate word have the same initial letter or not. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the editing distance and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the reciprocal of the editing distance and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance and the second coefficient.
For example, suppose that the error information of the wrong word relative to each candidate word is represented by K, the edit distance by D_edit, and the evaluation score corresponding to the candidate word by score_word; then the formula for calculating the score of each candidate word may be:
score_word = K × 1/D_edit
The value of K is selected according to whether the candidate word and the wrong word have the same initial letter: if they are the same, K takes the value K1, otherwise K2, where K1 and K2 are preset values. For example: K = 1 when the initial letters are the same, and K = 0.5 when they are not.
For an English word or character string, based on users' writing habits, the first letter is generally not mistaken; evaluating on this basis the possibility that each candidate word is the error correction word of the wrong word therefore makes the evaluation score more reasonable. A candidate word with the same first letter as the wrong word is more likely to be its error correction word, and one with a different first letter less likely. It is understood that the values of K include, but are not limited to, those in the above example.
For example: the known situation is that when a user writes English words and sentences on the intelligent interactive flat plate, the game is wrongly written into the game, and the context language environment is I-wave game applets; the method for evaluating the candidate word corresponding to the tone comprises the following steps:
1. and detecting the tone in the English dictionary through English dictionary detection, and determining the tone as a wrong word.
2. And acquiring a candidate word of the song, and assuming that the song, the same and the one exist as the candidate word.
3. Respectively determining the editing distances of the game, the same, the one and the same, wherein the editing distances are respectively as follows: 1. 2 and 1.
4. According to the formula score_word = K × 1/D_edit, the score of each candidate word is calculated:
score_some = 1 × 1 = 1;
score_same = 1 × 1/2 = 0.5;
score_one = 0.5 × 1 = 0.5.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error correction word) corresponding to the wrong word sone.
Therefore, the final evaluation score of each candidate word comprehensively considers both the user's habits when writing words and the size of the edit distance between words; compared with the traditional candidate word evaluation method based on edit distance alone, this helps improve the reliability of the candidate word evaluation results and ensure the accuracy of the error correction word.
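The first-letter rule and the reciprocal of the edit distance can be combined in a small scoring function (a sketch; the defaults k1 = 1 and k2 = 0.5 follow the example values above, and the edit distance is passed in as computed elsewhere):

```python
def evaluation_score(wrong: str, candidate: str, edit_dist: int,
                     k1: float = 1.0, k2: float = 0.5) -> float:
    """score_word = K × 1/D_edit, where K = k1 when the first letters
    of the wrong word and the candidate match, and K = k2 otherwise."""
    k = k1 if wrong[0] == candidate[0] else k2
    return k / edit_dist
```

Applied to the worked example, the scores for some, same, and one (edit distances 1, 2, 1 from sone) are 1, 0.5, and 0.5, reproducing the ranking in steps 1 to 4 above.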
As shown in fig. 3, in a second embodiment, a candidate word evaluation method includes the following steps:
step S21, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, reference may be made to the corresponding step S11 implemented above, which is not described in detail.
And step S22, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
The longest common subsequence problem is to find the longest subsequence common to all sequences in a set of sequences, usually two. A sequence is called a longest common subsequence of the known sequences if it is a subsequence of each of them and is the longest of all sequences meeting this condition.
The longest common substring problem is to find, in a set of sequences (usually two), the longest substring common to all of them. A substring is called a longest common substring of the known sequences if it is a substring of each of them and is the longest of all substrings meeting this condition.
A subsequence is ordered but not necessarily contiguous. For example, the sequence X = <B, C, D, B> is a subsequence of the sequence Y = <A, B, C, B, D, A, B>, with corresponding index sequence <2, 3, 5, 7>. A substring, like a subsequence, is ordered, but must also be contiguous. For example, the character string a = abcd is a substring of the character string c = aaabcddd, but the character string b = acdddd is not a substring of c.
The longest common subsequence and/or longest common substring reflect, to a certain extent, the number of identical characters between a candidate word and the wrong word, so determining the similarity between them on this basis has actual physical meaning.
And step S23, acquiring error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
And step S24, determining the evaluation score corresponding to each candidate word according to the similarity and the error information.
The candidate word evaluation score obtained by the step can reflect the possibility that each candidate word is the error-correcting word corresponding to the error word.
According to the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of corresponding candidate words are first acquired; the similarity between each candidate word and the wrong word is determined, the similarity being obtained from the longest common subsequence and/or longest common substring of each candidate word and the wrong word; the error information of the wrong word relative to each candidate word is determined; and an evaluation score for each candidate word is determined according to the similarity and the error information. The method comprehensively considers users' habits when writing words and the similarity between words, which helps improve the reliability of the candidate word evaluation results.
In an embodiment, the step of determining the similarity between each candidate word and the wrong word in the above implementation may specifically be: and calculating the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word.
The longest common subsequence rate, denoted D_lcs1, is the length of the longest common subsequence divided by the average of the total sequence lengths in the set. The longest common substring rate, denoted D_lcs2, is the length of the longest common substring divided by the average of the total sequence lengths in the set. For example: for the sequence set {dbcabca, cbaba}, the longest common subsequence is baba, with length 4, so the longest common subsequence rate is 4/((7+5)/2) = 0.67; for the sequence set {babca, cbaba}, the longest common substring is bab, with length 3, so the longest common substring rate is 3/((5+5)/2) = 0.6.
The longest common subsequence rate and the longest common substring rate consider not only the number of characters shared by a candidate word and the wrong word, but also the proportion those shared characters occupy, which helps further improve the accuracy of the similarity between the candidate word and the wrong word.
Alternatively, if the longest common subsequence rate is denoted Dlcs1, the longest common substring rate is denoted Dlcs2, and the similarity between each candidate word and the wrong word is denoted S, the similarity between each candidate word and the wrong word may be calculated in any of the following ways:
S = w1Dlcs1;
S = w2Dlcs2;
S = w1Dlcs1 + w2Dlcs2;
where w1 and w2 are preset weight coefficients, and w1 + w2 = 1.
Determining the similarity between the candidate word and the wrong word through a weighted sum of the longest common subsequence rate and the longest common substring rate yields a more accurate similarity. Typically, the longest common subsequence rate Dlcs1 has a greater influence on the similarity of two words, so its weight coefficient is set larger than that of the longest common substring rate Dlcs2, for example:
S = 0.7Dlcs1 + 0.3Dlcs2.
In another embodiment, the step of determining the similarity between each candidate word and the wrong word may further be: calculating the similarity of each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, together with the edit distance between each candidate word and the wrong word.
Optionally, the manner of calculating the similarity between each candidate word and the wrong word may include:
S = 1/Dedit + w1Dlcs1;
or, S = 1/Dedit + w2Dlcs2;
or, S = 1/Dedit + w1Dlcs1 + w2Dlcs2.
Specific examples thereof include:
S = 1/Dedit + 0.7Dlcs1;
or, S = 1/Dedit + 0.3Dlcs2;
or, S = 1/Dedit + 0.7Dlcs1 + 0.3Dlcs2.
where Dedit is the edit distance, i.e. the minimum number of editing operations required to convert one string into the other; the permitted editing operations comprise replacing one character, inserting one character and deleting one character. Generally, the smaller the edit distance, the closer the two strings.
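The edit distance and the combined similarity S = 1/Dedit + 0.7Dlcs1 + 0.3Dlcs2 can be sketched as follows (a minimal illustration; the helper names and the assumption that the candidate differs from the wrong word, so Dedit >= 1, are ours):

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance: minimum number of single-character
    # replacements, insertions and deletions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # replace
        prev = cur
    return prev[len(b)]

def _common_len(a: str, b: str, substring: bool) -> int:
    # Longest common subsequence (substring=False) or substring (substring=True).
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            elif not substring:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            best = max(best, dp[i][j])
    return best

def similarity(candidate: str, wrong: str, w1: float = 0.7, w2: float = 0.3) -> float:
    # S = 1/Dedit + w1*Dlcs1 + w2*Dlcs2; assumes candidate != wrong (Dedit >= 1).
    avg = (len(candidate) + len(wrong)) / 2
    d1 = _common_len(candidate, wrong, substring=False) / avg  # Dlcs1
    d2 = _common_len(candidate, wrong, substring=True) / avg   # Dlcs2
    return 1 / edit_distance(candidate, wrong) + w1 * d1 + w2 * d2

print(round(similarity("some", "sone"), 3))  # 1.675
```

The printed value matches the similarity computed for the candidate word some in the worked example below.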
Therefore, the similarity between two words can be determined by combining the edit distance between the words with the number of identical characters; adding the edit distance as an evaluation dimension helps improve the effectiveness of the similarity calculation.
It can be understood that the weight coefficients before the longest common subsequence rate Dlcs1 and the longest common substring rate Dlcs2 in the above formulas may also take other values, including but not limited to the above examples.
In one embodiment, the error information of the wrong word relative to each candidate word includes: whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the similarity and the error information includes: if the first letter of the wrong word is the same as that of the candidate word, calculating the evaluation score corresponding to the candidate word according to the similarity and a first coefficient; if the first letters of the wrong word and the candidate word differ, calculating the evaluation score corresponding to the candidate word according to the similarity and a second coefficient.
Alternatively, assuming that the similarity is S, the error information of the wrong word relative to each candidate word is represented by K, and the formula for calculating the score of each candidate word may be:
score_word = K × S;
The value of K is selected according to whether the first letters of the candidate word and the wrong word are the same: if they are the same, K takes the value K1; otherwise, K takes the value K2. K1 and K2 are preset values, for example K = 1 when the first letters are the same and K = 0.5 when they differ.
For an English word or character string, based on users' writing habits, the first letter is generally not miswritten. Evaluating the possibility that each candidate word is the corrected word of the wrong word through the above embodiment therefore determines the evaluation score of the candidate word more reasonably: if the candidate word has the same first letter as the wrong word, it is more likely to be the intended word; otherwise, it is less likely. It is understood that the values of K include, but are not limited to, the above examples.
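Combining the similarity S with the first-letter coefficient K, the scoring step can be sketched as follows (the function name is illustrative; the S values are taken from the worked example in the text):

```python
def evaluation_score(candidate: str, wrong: str, s: float,
                     k1: float = 1.0, k2: float = 0.5) -> float:
    # score_word = K * S; K = k1 when the first letters match
    # (users rarely miswrite the first letter), K = k2 otherwise.
    k = k1 if candidate[:1] == wrong[:1] else k2
    return k * s

# Similarities S for the candidates of the wrong word "sone":
for word, s in [("some", 1.675), ("same", 0.925), ("one", 1.86)]:
    print(word, round(evaluation_score(word, "sone", s), 3))
```

The loop prints 1.675, 0.925 and 0.93, matching the evaluation scores in the example.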
For example: it is known that, when a user writes English words and sentences on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is I have sone applets; the process of evaluating the candidate words corresponding to sone is as follows:
1. Through English dictionary detection, sone is not found in the English dictionary, so sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are obtained from the English dictionary; assume the candidate words are some, same and one.
3. The longest common subsequence rate and the longest common substring rate of some, same and one with respect to sone are determined respectively, and the similarity corresponding to each candidate word is calculated as follows:
The longest common subsequence of the candidate word some and the wrong word sone is soe, whose length is 3, so the longest common subsequence rate is 3/((4+4)/2) = 0.75; the longest common substring is so, whose length is 2, so the longest common substring rate is 2/((4+4)/2) = 0.5;
The longest common subsequence of the candidate word same and the wrong word sone is se, whose length is 2, so the longest common subsequence rate is 2/((4+4)/2) = 0.5; the longest common substring is s or e, whose length is 1, so the longest common substring rate is 1/((4+4)/2) = 0.25;
The longest common subsequence of the candidate word one and the wrong word sone is one, whose length is 3, so the longest common subsequence rate is 3/((3+4)/2) = 0.86; the longest common substring is one, whose length is 3, so the longest common substring rate is 3/((3+4)/2) = 0.86;
According to the formula S = 1/Dedit + 0.7Dlcs1 + 0.3Dlcs2, the similarity S between each candidate word and the wrong word is calculated as follows:
Some: 1 + 0.7 × 0.75 + 0.3 × 0.5 = 1.675;
Same: 1/2 + 0.7 × 0.5 + 0.3 × 0.25 = 0.925;
One: 1 + 0.7 × 0.86 + 0.3 × 0.86 = 1.86.
4. According to the formula score_word = K × S, the evaluation score of each candidate word is calculated:
score_some = 1 × 1.675 = 1.675;
score_same = 1 × 0.925 = 0.925;
score_one = 0.5 × 1.86 = 0.93.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Based on the candidate word evaluation method of this embodiment, the evaluation score of each candidate word is positively correlated with the similarity of the candidate word to the wrong word and is also influenced by whether the candidate word and the wrong word share the same first letter; the final evaluation score of each candidate word thus comprehensively considers the user's habit characteristics when writing words and the similarity information between words.
As shown in fig. 4, in a third embodiment, a candidate word evaluation method is provided, which includes the following steps:
step S31, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
For the implementation of this step, refer to the corresponding step S11 of the first embodiment; details are not repeated.
And step S32, determining the language environment probability of each candidate word at the wrong word position.
The language environment probability of a candidate word at the wrong-word position reflects the contextual reasonableness of the candidate word when it replaces the wrong word: the more reasonable the candidate word is in context, the higher its language environment probability.
In an embodiment, the probability of each candidate word at the wrong word position is calculated according to a preset language model, and the log value of the probability is used as the language environment probability of the candidate word.
And step S33, acquiring error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
And step S34, determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
In the candidate word evaluation method of the embodiment, when a wrong word is detected, a plurality of corresponding candidate words are obtained, the probability of each candidate word at the position of the wrong word is respectively determined, and error information of the wrong word relative to each candidate word is determined; determining an evaluation score corresponding to each candidate word according to the language environment probability and the error information; the method not only considers the habitual problem of the user when writing the word, but also takes the information of the context language environment into account, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In one embodiment, the language model that determines the language environment probability of each candidate word at the wrong-word position includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
The N-Gram model is a statistical language model that predicts the Nth item from the preceding (N-1) items. At the application level, these items can be phonemes (speech recognition), characters (input methods), words (word segmentation) or base pairs (genetic information). The idea of the N-Gram model: given a string of letters such as "for ex", what is the most likely next letter? From corpus data, a probability distribution over the next letter can be obtained by maximum likelihood estimation: the probability of a is 0.4, the probability of b is 0.0001, the probability of c is ..., subject to the constraint that the probabilities over all possible next letters sum to 1.
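As an illustration of the maximum likelihood estimation described above, a character-level N-Gram distribution can be built directly from counts (the toy corpus and function names are our assumptions, for illustration only):

```python
from collections import Counter, defaultdict

def train_char_ngram(corpus: str, n: int = 3):
    # MLE counts: for each (n-1)-character history, count each next character.
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        history, nxt = corpus[i:i + n - 1], corpus[i + n - 1]
        counts[history][nxt] += 1
    return counts

def next_char_distribution(counts, history: str) -> dict:
    # Normalized distribution over the next character; probabilities sum to 1.
    c = counts[history]
    total = sum(c.values())
    return {ch: k / total for ch, k in c.items()}

model = train_char_ngram("for example for everyone for ever", n=3)
print(next_char_distribution(model, "fo"))  # {'r': 1.0}
```

In this toy corpus the history "fo" is always followed by "r", so the distribution is degenerate; richer histories (e.g. " e") yield a proper distribution whose probabilities sum to 1.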
The long short-term memory neural network model, commonly called the LSTM model, is a special recurrent neural network; it can predict the next likely character from the character-level sequence entered.
The bidirectional long short-term memory network model, commonly referred to as the BiLSTM model, is structurally identical to the LSTM model, except that the BiLSTM model is linked not only to past states but also to future states. For example, a one-way LSTM can be trained to predict "fish" by entering letters one by one (the recurrent connection along the time axis remembers past state values), while the BiLSTM additionally receives the next letter of the sequence on its backward path, making future information available. This form of training allows the network to fill gaps between pieces of information rather than merely predict the next item.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the inverse of the language environment probability and the error information. Therefore, the evaluation score corresponding to each candidate word is in an inverse relation with the corresponding probability log value, and is also influenced by the distinguishing information of the candidate word and the wrong word. The evaluation scores of the candidate words are obtained by comprehensively considering the habit problems of the user when writing the words and the language environment information, and the reliability of candidate word evaluation is improved compared with the traditional candidate word evaluation method.
In one embodiment, the error information of the wrong word relative to each candidate word includes: whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information includes: if the first letter of the wrong word is the same as that of the candidate word, calculating the evaluation score corresponding to the candidate word according to the language environment probability and a first coefficient; if the first letters differ, calculating the evaluation score corresponding to the candidate word according to the language environment probability and a second coefficient.
Based on any of the above embodiments, let the language environment probability of the candidate word at the wrong-word position be denoted P_mx(word), where word denotes the candidate word and mx denotes the language model; the evaluation score of each candidate word is calculated according to the following formula:
score_word = -K × 1/P_mx(word);
If the candidate word and the wrong word have the same first letter, K takes the value K1; otherwise, K takes the value K2. K1 and K2 are preset values.
For example, assume the language environment probability determined by the N-Gram language model is denoted P_ngram(word); the formula for calculating the score of each candidate word may be:
score_word = -K × 1/P_ngram(word);
Assume the language environment probabilities determined by the BiLSTM model and the LSTM model are denoted P_bilstm(word) and P_lstm(word) respectively; the formulas for calculating the score of each candidate word may be:
score_word = -K × 1/P_bilstm(word);
score_word = -K × 1/P_lstm(word);
The value of K is selected according to whether the candidate word and the wrong word have the same first letter: K = 1 when they are the same, and K = 0.5 when they are not. It is understood that the values of K include, but are not limited to, the above examples.
For an English word or character string, based on users' writing habits, the first letter is generally not miswritten. Evaluating the possibility that each candidate word is the corrected word of the wrong word through the above embodiment therefore determines the evaluation score of the candidate word more reasonably: if the candidate word has the same first letter as the wrong word, it is more likely to be the intended word; otherwise, it is less likely.
For example, it is known that when a user writes an English word on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is I have sone applets.
For the N-Gram model, N takes the value 3:
1. Through English dictionary detection, sone is not found in the English dictionary, so sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are obtained from the English dictionary; assume the candidate words are some, same, son and one.
3. sone in I have sone applets is replaced with some, same, son and one respectively, and the language environment probability P_ngram(word) corresponding to each candidate word is calculated according to the N-Gram model, obtaining:
P_ngram(some) = -5.7; P_ngram(same) = -6; P_ngram(son) = -6; P_ngram(one) = -4.8;
where P(I, have, some) = c(I, have, some)/c(3-Gram), and c denotes a count, i.e. the count of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
where P(have | I) = c(I, have)/c(I), i.e. the count of the 2-tuple (I, have) in the corpus divided by the count of all occurrences of I;
where P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
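These count-based terms can be combined by the chain rule to score a whole sentence. A sketch over a hypothetical mini-corpus (the patent's corpus is not disclosed; the conditional form P(some | I, have) = c(I, have, some)/c(I, have) is the standard MLE and is our assumption beyond the joint counts shown above):

```python
import math
from collections import Counter

def ngram_counts(sentences, n):
    # Count all n-grams (as token tuples) across the corpus sentences.
    c = Counter()
    for s in sentences:
        toks = s.split()
        for i in range(len(toks) - n + 1):
            c[tuple(toks[i:i + n])] += 1
    return c

def trigram_logprob(sentence, uni, bi, tri):
    # log P(w1) + log P(w2 | w1) + sum over i>=3 of log P(wi | wi-2, wi-1),
    # with every probability estimated by maximum likelihood from the counts.
    w = sentence.split()
    lp = math.log(uni[(w[0],)] / sum(uni.values()))
    lp += math.log(bi[(w[0], w[1])] / uni[(w[0],)])
    for i in range(2, len(w)):
        lp += math.log(tri[tuple(w[i - 2:i + 1])] / bi[tuple(w[i - 2:i])])
    return lp

# Hypothetical mini-corpus, for illustration only:
corpus = ["I have some applets", "I have some books", "you have one applet"]
uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))
print(trigram_logprob("I have some applets", uni, bi, tri))
```

On a realistic corpus the counts would be far larger and unseen n-grams would need smoothing, which this sketch omits.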
4. According to the formula score_word = -K × 1/P_ngram(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-5.7) = 0.175;
score_same = -1 × 1/(-6) = 0.167;
score_son = -1 × 1/(-6) = 0.167;
score_one = -0.5 × 1/(-4.8) = 0.104.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the BiLSTM model, steps 1 and 2 are the same; step 3 is replaced by: sone in I have sone applets is replaced with some, same, son and one respectively, and the language environment probability P_bilstm(word) corresponding to each candidate word is calculated, obtaining:
P_bilstm(some) = -4.9; P_bilstm(same) = -5.07; P_bilstm(son) = -7.6; P_bilstm(one) = -4.66;
Step 4 is replaced by: according to the formula score_word = -K × 1/P_bilstm(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-4.9) = 0.204;
score_same = -1 × 1/(-5.07) = 0.197;
score_son = -1 × 1/(-7.6) = 0.132;
score_one = -0.5 × 1/(-4.66) = 0.107.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, steps 1 and 2 above are the same; step 3 is replaced by: sone in I have sone applets is replaced with some, same, son and one respectively, and the language environment probability P_lstm(word) corresponding to each candidate word is calculated, obtaining:
P_lstm(some) = -5.6; P_lstm(same) = -6.3; P_lstm(son) = -8.6; P_lstm(one) = -5.1;
Step 4 is replaced by: according to the formula score_word = -K × 1/P_lstm(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/(-6.3) = 0.159;
score_son = -1 × 1/(-8.6) = 0.116;
score_one = -0.5 × 1/(-5.1) = 0.098.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through the candidate word evaluation score calculation under the above three language models, the user's habits when writing words are considered and language environment information is also taken into account, so the candidate words of a wrong word can be determined more effectively and the error correction accuracy is improved.
As shown in fig. 5, in a fourth embodiment, there is provided a candidate word evaluation method including the steps of:
step S41, detecting the wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The specific implementation manner of this step refers to step S11 in the above embodiment, which is not described in detail.
Step S42, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
Each candidate word replaces the wrong word to obtain a corresponding candidate sentence; a candidate sentence may contain a plurality of words, one of which is the candidate word. The language environment probability of a word at its position in the sentence reflects how reasonable the word is relative to its context: the more reasonable the word is in context, the higher its language environment probability.
In one embodiment, the neighboring words of the candidate word may be one or more, and may include both words immediately adjacent to the candidate word and words adjacent with one word in between. The probability of the candidate word and of each neighboring word at its position in the candidate sentence can be calculated according to a preset language model, with the log value of the probability used as the language environment probability of the corresponding word; the language environment probability of the candidate word and those of its neighboring words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. The average may be a plain mean or a weighted mean.
And step S43, acquiring error information of the wrong word relative to each candidate word.
As to the implementation of this step, reference may be made to the description of the first embodiment above.
And step S44, determining the evaluation score corresponding to each candidate word according to the evaluation probability and the error information.
The evaluation score obtained by the step can reflect the possibility that each candidate word is the error-correcting word corresponding to the error word.
In the candidate word evaluation method of the embodiment, when a wrong word is detected, a plurality of corresponding candidate words are first obtained, evaluation probabilities corresponding to the candidate words are respectively determined, and error information of the wrong word relative to the candidate words is determined; determining an evaluation score corresponding to each candidate word according to the evaluation probability and the error information; the method not only considers the habitual problem of the user when writing the word, but also takes the information of the context language environment into account, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In one embodiment, the language model that calculates the probability of the candidate word and its neighboring words at their respective positions in the candidate sentence includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For the explanation of each language model, refer to the description of the third embodiment.
In one embodiment, the language environment probability of each word is represented by the log value of the probability of the word at its position, the evaluation probability of a candidate word is obtained on that basis, and the evaluation score corresponding to each candidate word may then be determined according to the reciprocal of the evaluation probability of the candidate word and the error information.
For example: if the evaluation probability of the candidate word is denoted P_avg(word), the error information of the wrong word relative to each candidate word is denoted K, and the evaluation score corresponding to the candidate word is denoted score_word, the evaluation score of each candidate word can be calculated according to the following formula:
score_word = -K × 1/P_avg(word);
If the candidate word and the wrong word have the same first letter, K takes the value K1; otherwise, K takes the value K2; K1 and K2 are preset values.
Therefore, the evaluation score corresponding to each candidate word is inversely related to the corresponding evaluation probability and is also influenced by the first-letter information of the wrong word. The final evaluation score of each candidate word comprehensively considers the user's writing habits and the context information; compared with the traditional method of evaluating candidate words by edit distance alone, the reliability of the candidate word evaluation result is improved.
In one embodiment, the error information of the wrong word relative to each candidate word includes: whether the wrong word and the candidate word have the same first letter. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the evaluation probability and the error information includes: if the first letter of the wrong word is the same as that of the candidate word, calculating the evaluation score corresponding to the candidate word according to the evaluation probability and a first coefficient; if the first letters differ, calculating the evaluation score corresponding to the candidate word according to the evaluation probability and a second coefficient.
Optionally, assume the evaluation probability determined by the N-Gram language model is denoted P_ngram(word); the formula for calculating the score of each candidate word may be:
score_word = -K × 1/P_ngram(word);
The value of K is selected according to whether the candidate word and the wrong word have the same first letter: K = 1 when they are the same, and K = 0.5 when they are not. It is understood that the values of K include, but are not limited to, the above examples.
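The score -K × 1/P for this embodiment can be sketched as follows (the function name is ours; the evaluation probabilities are the N-Gram values matching the scores in the worked example below):

```python
def score_from_logprob(candidate: str, wrong: str, p: float,
                       k1: float = 1.0, k2: float = 0.5) -> float:
    # score_word = -K * 1/p; p is a log probability (negative), so the
    # score is positive and grows as p approaches zero (more likely word).
    k = k1 if candidate[:1] == wrong[:1] else k2
    return -k / p

# Evaluation probabilities matching the N-Gram scores in the example below:
for word, p in [("some", -5.6), ("same", -6.35), ("son", -7.58), ("one", -5.26)]:
    print(word, round(score_from_logprob(word, "sone", p), 3))
```

The printed scores (0.179, 0.157, 0.132, 0.095) match those in the N-Gram example of this embodiment.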
For an English word or character string, based on users' writing habits, the first letter is generally not miswritten. Evaluating the possibility that each candidate word is the corrected word of the wrong word based on the above embodiment therefore determines the evaluation score of the candidate word more reasonably: if the candidate word has the same first letter as the wrong word, it is more likely to be the intended word; otherwise, it is less likely.
For example: it is known that when a user writes English words and sentences on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is I have sone applets.
1. Through English dictionary detection, sone is not found in the English dictionary, so sone is determined to be a wrong word.
2. Words with edit distances of 1 and 2 from sone are obtained from the English dictionary; assume the candidate words are some, same, son and one.
3. sone in I have sone applets is replaced with some, same, son and one respectively, obtaining:
candidate sentence one: I have some applets;
candidate sentence two: I have same applets;
candidate sentence three: I have son applets;
candidate sentence four: I have one applets.
Based on the N-Gram model with N = 3, the neighboring word of the candidate word some in candidate sentence one is applets, the neighboring word of the candidate word same in candidate sentence two is applets, the neighboring word of the candidate word son in candidate sentence three is applets, and the neighboring word of the candidate word one in candidate sentence four is applets. Based on this, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as:
P_ngram(some) = -5.6; P_ngram(same) = -6.35; P_ngram(son) = -7.58; P_ngram(one) = -5.26;
4. According to the formula score_word = -K × 1/P_ngram(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/(-6.35) = 0.157;
score_son = -1 × 1/(-7.58) = 0.132;
score_one = -0.5 × 1/(-5.26) = 0.095.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the BiLSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and applets; the neighboring words of the candidate word same in candidate sentence two are I, have and applets; the neighboring words of the candidate word son in candidate sentence three are I, have and applets; and the neighboring words of the candidate word one in candidate sentence four are I, have and applets. Based on this, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as:
P_bilstm(some) = -2.88; P_bilstm(same) = -3.62; P_bilstm(son) = -4.1; P_bilstm(one) = -2.62;
Step 4 is replaced by: according to the formula score_word = -K × 1/P_bilstm(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-2.88) = 0.347;
score_same = -1 × 1/(-3.62) = 0.276;
score_son = -1 × 1/(-4.1) = 0.244;
score_one = -0.5 × 1/(-2.62) = 0.191.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, the neighboring words of the candidate word some in candidate sentence one are I, have and applets; the neighboring words of the candidate word same in candidate sentence two are I, have and applets; the neighboring words of the candidate word son in candidate sentence three are I, have and applets; and the neighboring words of the candidate word one in candidate sentence four are I, have and applets. Based on this, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as:
P_lstm(some) = -3.175; P_lstm(same) = -3.75; P_lstm(son) = -4.6; P_lstm(one) = -3.45;
Step 4 is replaced by: according to the formula score_word = -K × 1/P_lstm(word), the evaluation score of each candidate word is calculated:
score_some = -1 × 1/(-3.175) = 0.315;
score_same = -1 × 1/(-3.75) = 0.267;
score_son = -1 × 1/(-4.6) = 0.217;
score_one = -0.5 × 1/(-3.45) = 0.145.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through candidate word evaluation under the above three language models, the user's habits when writing words are considered and evaluation information from the language environment is added, so the reliability of candidate word evaluation is improved.
As shown in fig. 6, in a fifth embodiment, there is provided a candidate word evaluation method including the steps of:
step S51, when the wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation manner of this step, refer to step S11 of the first embodiment, which is not described in detail.
And step S52, determining the edit distance between each candidate word and the wrong word.
As to the implementation of this step, reference may be made to the description of the first embodiment.
Step S53, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
With regard to the implementation of this step, reference may be made to the description of the fourth embodiment.
And step S54, determining the evaluation score of each candidate word according to the editing distance and the evaluation probability.
The evaluation score obtained by this step can represent the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
With the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; the possibility that each candidate word is the error-correcting word corresponding to the wrong word is then comprehensively evaluated according to the edit distance between the candidate word and the wrong word and the contextual language environment information. Compared with a traditional candidate word evaluation method that depends only on the edit distance, the reliability of candidate word evaluation is improved.
In one embodiment, the probability of the candidate word, and of each adjacent word of the candidate word, at its position in the candidate sentence is calculated according to a preset language model, and the log value of each probability is used as the language environment probability of that word; further, the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the adjacent words of the candidate word are averaged to obtain the evaluation probability of the candidate word in the candidate sentence.
The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. For each language model, reference may be made to the description of the third embodiment above.
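As a non-authoritative illustration of this averaging, the following Python sketch converts the per-word probabilities of a candidate word and its adjacent words into log values and averages them into the evaluation probability (the natural log is assumed here; the embodiments do not fix the base):

```python
import math

def evaluation_probability(word_probs):
    """Average the log values (language environment probabilities) of the
    candidate word and its adjacent words at their positions in the
    candidate sentence. `word_probs` are model probabilities in (0, 1]."""
    log_probs = [math.log(p) for p in word_probs]
    return sum(log_probs) / len(log_probs)
```

Since each probability is at most 1, the evaluation probability is never positive, which is why the score formulas in the examples carry a leading minus sign.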
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to an inverse of the edit distance and an inverse of the evaluation probability.
For example, the evaluation score of each candidate word may be calculated according to the following formula:
score_word = -1/(D_edit × P_word^mx)
wherein D_edit indicates the edit distance between the candidate word and the wrong word, word indicates the candidate word, P_word^mx represents the evaluation probability of the candidate word under language model mx, and score_word represents the evaluation score corresponding to the candidate word.
Therefore, the evaluation score corresponding to each candidate word is in an inverse relation with the corresponding evaluation probability, and is also in an inverse relation with the edit distance from the wrong word. The finally obtained evaluation score of each candidate word comprehensively considers the characteristics of word writing and the context information; compared with the traditional method of evaluating candidate words by edit distance alone, the evaluation reliability of the candidate words is improved.
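The score computation can be sketched in Python as follows (an illustrative assumption, not the patented implementation: the Levenshtein distance stands in for the edit distance, and the evaluation probabilities are taken as given by some language model):

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def evaluation_score(candidate, wrong_word, eval_prob):
    """score_word = -1 / (D_edit x P): eval_prob is the (negative) averaged
    log probability, so a smaller edit distance and a less negative
    probability both raise the score."""
    return -1.0 / (edit_distance(candidate, wrong_word) * eval_prob)
```

For instance, evaluation_score("some", "sone", -5.6) is about 0.179 and evaluation_score("same", "sone", -6.35) is about 0.0787.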
For example, it is known that when a user writes English words and sentences on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is "I have sone apples"; the method for evaluating the candidate words corresponding to sone comprises the following steps:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined as a wrong word.
2. Acquiring candidate words of sone; assume that some, same, one and son exist as candidate words.
3. Respectively determining the edit distances between some, same, one, son and the wrong word sone, which are respectively: 1, 2, 1 and 1.
4. Respectively replacing sone in "I have sone apples" with some, same, son and one, obtaining:
candidate sentence one: I have some apples;
candidate sentence two: I have same apples;
candidate sentence three: I have son apples;
candidate sentence four: I have one apples;
based on the N-GRAM model, N in the N-GRAM model takes a value of 3, a neighboring word corresponding to a candidate word some in a candidate sentence I is an applet, a neighboring word corresponding to a candidate word same in a candidate sentence II is an applet, a neighboring word corresponding to a candidate word son in a candidate sentence III is an applet, and a neighboring word corresponding to a candidate word one in a candidate sentence IV is an applet; based on the calculation, the evaluation probability corresponding to the candidate words in each candidate sentence is as follows:
5. According to the formula score_word = -1/(D_edit × P_word^3-gram), calculating the evaluation score of each candidate word:
score_some = -1 × 1/(-5.6) = 0.179;
score_same = -1/2 × 1/(-6.35) = 0.0787;
score_son = -1 × 1/(-7.58) = 0.132;
score_one = -1 × 1/(-5.26) = 0.19.
Therefore, the candidate word some has a high probability of being the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the BiLSTM model, the adjacent words corresponding to the candidate word some in candidate sentence one are I, have and apples, the adjacent words corresponding to the candidate word same in candidate sentence two are I, have and apples, the adjacent words corresponding to the candidate word son in candidate sentence three are I, have and apples, and the adjacent words corresponding to the candidate word one in candidate sentence four are I, have and apples; by calculation, the evaluation probabilities corresponding to the candidate words in each candidate sentence are: P_some^BiLSTM = -2.88; P_same^BiLSTM = -3.62; P_son^BiLSTM = -4.1; P_one^BiLSTM = -2.62.
Step 5 is replaced by:
According to the formula score_word = -1/(D_edit × P_word^BiLSTM), calculating the evaluation score of each candidate word:
score_some = -1 × 1/(-2.88) = 0.347;
score_same = -1/2 × 1/(-3.62) = 0.138;
score_son = -1 × 1/(-4.1) = 0.244;
score_one = -1 × 1/(-2.62) = 0.382.
Similarly, the candidate word some is found to have a high probability of being the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, the adjacent words corresponding to the candidate word some in candidate sentence one are I, have and apples, the adjacent words corresponding to the candidate word same in candidate sentence two are I, have and apples, the adjacent words corresponding to the candidate word son in candidate sentence three are I, have and apples, and the adjacent words corresponding to the candidate word one in candidate sentence four are I, have and apples; by calculation, the evaluation probabilities corresponding to the candidate words in each candidate sentence are: P_some^LSTM = -3.175; P_same^LSTM = -3.75; P_son^LSTM = -4.6; P_one^LSTM = -3.45.
Step 5 is replaced by:
According to the formula score_word = -1/(D_edit × P_word^LSTM), calculating the evaluation score of each candidate word:
score_some = -1 × 1/(-3.175) = 0.315;
score_same = -1/2 × 1/(-3.75) = 0.1335;
score_son = -1 × 1/(-4.6) = 0.217;
score_one = -1 × 1/(-3.45) = 0.29.
It is also found that the candidate word some has a higher probability than the other candidate words of being the correct word (error-correcting word) corresponding to the wrong word sone.
Therefore, the evaluation scores of the candidate words finally obtained through the three language models comprehensively consider the contextual language environment information and the edit distance information, so the reliability of candidate word evaluation is improved.
As shown in fig. 7, in a sixth embodiment, there is provided a candidate word evaluation method including the steps of:
Step S61, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation manner of this step, refer to step S11 of the first embodiment, which is not described in detail.
Step S62, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
The implementation of this step is described with reference to the above description of the second embodiment, and is not repeated.
Step S63, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
The implementation of this step is described with reference to the fourth embodiment, and is not repeated.
Step S64, determining the evaluation score of each candidate word according to the similarity and the evaluation probability.
The evaluation score obtained by this step can represent the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
With the candidate word evaluation method of this embodiment, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained; the possibility that each candidate word is the error-correcting word corresponding to the wrong word is then comprehensively evaluated according to the similarity between the candidate word and the wrong word and the contextual language environment information. Compared with a traditional candidate word evaluation method that depends only on the edit distance, the reliability of the candidate word evaluation result is improved.
In one embodiment, the probability of the candidate word, and of each adjacent word of the candidate word, at its position in the candidate sentence is calculated according to a preset language model, and the log value of each probability is used as the language environment probability of that word; further, the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the adjacent words of the candidate word are averaged to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. The case of each language model can be seen in the above-described embodiments.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of the candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the reciprocal of the evaluation probability and the similarity.
Optionally, the evaluation score of each candidate word is calculated according to the following formula:
score_word = -S × 1/P_word^mx
where word represents the candidate word, score_word represents the evaluation score corresponding to the candidate word, P_word^mx represents the evaluation probability of the candidate word, mx represents the language model, and S represents the similarity between the candidate word and the wrong word.
Therefore, the evaluation score corresponding to each candidate word is in an inverse relation with the corresponding evaluation probability, and is also influenced by the similarity of the candidate word and the wrong word. The evaluation scores corresponding to the candidate words are obtained after the similarity and the context information are integrated, and compared with a traditional method for evaluating the candidate words by means of editing distance, the evaluation reliability of the candidate words is improved.
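As an illustrative sketch (an assumption, not the patented implementation), the similarity-weighted score of this embodiment reduces to one line of Python, where eval_prob is the negative, log-domain evaluation probability:

```python
def evaluation_score(similarity, eval_prob):
    """score_word = -S x 1/P: eval_prob is the (negative) evaluation
    probability, so a larger similarity and a less negative probability
    both raise the score."""
    return -similarity / eval_prob
```

With the example values used below, evaluation_score(1.675, -5.6) gives about 0.299 and evaluation_score(0.925, -6.35) about 0.146.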
For example, it is known that when a user writes English words and sentences on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is "I have sone apples"; the method for evaluating the candidate words corresponding to sone comprises the following steps:
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined as a wrong word.
2. Acquiring candidate words of sone; assume that some, same and son exist as candidate words.
3. Respectively determining the longest common subsequence rate and the longest common substring rate of some, same and son with respect to sone, and calculating the similarity corresponding to each candidate word as follows:
The longest common subsequence of the candidate word some and the wrong word sone is soe, its length is 3, and the longest common subsequence rate is 3/((4+4)/2) = 0.75; the longest common substring is so, its length is 2, and the longest common substring rate is 2/((4+4)/2) = 0.5;
the longest common subsequence of the candidate word same and the wrong word sone is se, its length is 2, and the longest common subsequence rate is 2/((4+4)/2) = 0.5; the longest common substring is s or e, its length is 1, and the longest common substring rate is 1/((4+4)/2) = 0.25;
the longest common subsequence of the candidate word son and the wrong word sone is son, its length is 3, and the longest common subsequence rate is 3/((3+4)/2) = 0.86; the longest common substring is son, its length is 3, and the longest common substring rate is 3/((3+4)/2) = 0.86.
According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity S between each candidate word and the wrong word is calculated as follows:
S_some: 1 + 0.7 × 0.75 + 0.3 × 0.5 = 1.675;
S_same: 1/2 + 0.7 × 0.5 + 0.3 × 0.25 = 0.925;
S_son: 1 + 0.7 × 0.86 + 0.3 × 0.86 = 1.86.
4. Respectively replacing sone in "I have sone apples" with some, same and son, obtaining:
candidate sentence one: I have some apples;
candidate sentence two: I have same apples;
candidate sentence three: I have son apples.
Based on the N-Gram model, with N taking the value 3, the adjacent word corresponding to the candidate word some in candidate sentence one is apples, the adjacent word corresponding to the candidate word same in candidate sentence two is apples, and the adjacent word corresponding to the candidate word son in candidate sentence three is apples; by calculation, the evaluation probabilities corresponding to the candidate words in each candidate sentence are: P_some^3-gram = -5.6; P_same^3-gram = -6.35; P_son^3-gram = -7.58.
wherein P(I, have, some) = c(I, have, some)/c(3-Gram), and c represents the count, i.e. the count (number) of the 3-tuple I, have, some in the corpus divided by the count of all 3-tuples;
P(have | I) = c(I, have)/c(I), i.e. the count of the 2-tuple I, have in the corpus divided by the count of I;
P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
5. According to the formula score_word = -S × 1/P_word^3-gram, calculating the evaluation score of each candidate word:
score_some = -1.675 × 1/(-5.6) = 0.299;
score_same = -0.925 × 1/(-6.35) = 0.146;
score_son = -1.86 × 1/(-7.58) = 0.245.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
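The similarity computation of step 3 can be checked with the following Python sketch (illustrative only; the text rounds the common-subsequence and common-substring rates to two decimals before summing, so results here may differ in the third decimal):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lc_substring_length(a, b):
    """Length of the longest common (contiguous) substring."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def similarity(candidate, wrong_word, d_edit):
    """S = 1/D_edit + 0.7*D_lcs1 + 0.3*D_lcs2, where the two rates divide
    the common-subsequence/substring lengths by the average word length."""
    avg_len = (len(candidate) + len(wrong_word)) / 2
    d_lcs1 = lcs_length(candidate, wrong_word) / avg_len
    d_lcs2 = lc_substring_length(candidate, wrong_word) / avg_len
    return 1 / d_edit + 0.7 * d_lcs1 + 0.3 * d_lcs2
```

For example, similarity("some", "sone", 1) returns 1.675, and similarity("son", "sone", 1) returns about 1.857, which the example rounds to 1.86.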
In another embodiment, for the BiLSTM model, the adjacent words corresponding to the candidate word some in candidate sentence one are I, have and apples, the adjacent words corresponding to the candidate word same in candidate sentence two are I, have and apples, and the adjacent words corresponding to the candidate word son in candidate sentence three are I, have and apples; by calculation, the evaluation probabilities corresponding to the candidate words in each candidate sentence are: P_some^BiLSTM = -2.88; P_same^BiLSTM = -3.62; P_son^BiLSTM = -4.1.
Step 5 is replaced by:
According to the formula score_word = -S × 1/P_word^BiLSTM, calculating the evaluation score of each candidate word:
score_some = -1.675 × 1/(-2.88) = 0.582;
score_same = -0.925 × 1/(-3.62) = 0.256;
score_son = -1.86 × 1/(-4.1) = 0.454.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, the adjacent words corresponding to the candidate word some in candidate sentence one are I, have and apples, the adjacent words corresponding to the candidate word same in candidate sentence two are I, have and apples, and the adjacent words corresponding to the candidate word son in candidate sentence three are I, have and apples; by calculation, the evaluation probabilities corresponding to the candidate words in each candidate sentence are: P_some^LSTM = -3.175; P_same^LSTM = -3.75; P_son^LSTM = -4.6.
Step 5 is replaced by:
According to the formula score_word = -S × 1/P_word^LSTM, calculating the evaluation score of each candidate word:
score_some = -1.675 × 1/(-3.175) = 0.528;
score_same = -0.925 × 1/(-3.75) = 0.247;
score_son = -1.86 × 1/(-4.6) = 0.404.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
Therefore, the evaluation scores corresponding to the candidate words finally obtained through the three language models comprehensively consider the context language environment information and the similarity information between the words, and therefore the reliability of candidate word evaluation is improved.
As shown in fig. 8, in a seventh embodiment, there is provided a candidate word evaluation method including the steps of:
Step S71, when a wrong word is detected, a plurality of candidate words corresponding to the wrong word are obtained.
Regarding the implementation of this step, reference may be made to the description of step S11 of the first embodiment.
Step S72, determining the edit distance between each candidate word and the wrong word.
As to the implementation of this step, reference may be made to the description of the first embodiment.
Step S73, determining the language environment probability of each candidate word at the wrong word position.
With regard to the implementation of this step, reference may be made to the description of the third embodiment.
Step S74, acquiring error information of the wrong word relative to each candidate word.
The specific implementation of this step can be referred to the description of the first embodiment above.
Step S75, determining the evaluation score corresponding to each candidate word according to the edit distance, the language environment probability and the error information.
The evaluation score obtained in this step can reflect the possibility that each candidate word is the error-correcting word (correct word) corresponding to the error word.
When a wrong word is detected, the candidate word evaluation method of the embodiment first acquires a plurality of corresponding candidate words, determines an edit distance between each candidate word and the wrong word, a probability of each candidate word at the position of the wrong word, and error information of the wrong word relative to each candidate word; determining an evaluation score corresponding to each candidate word according to the editing distance, the language environment probability and the error information; not only the editing distance and the habit problem of the user when writing the word are considered, but also the information of the context language environment is taken into account, thereby being beneficial to improving the reliability of the candidate word evaluation result.
In one embodiment, the language model that determines the language environment probability of each candidate word at the wrong word position includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model. The description of each model refers to the descriptions of the related embodiments above.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the inverse of the edit distance, the inverse of the language environment probability, and the error information.
Therefore, the evaluation score corresponding to each candidate word is in a reverse relation with the editing distance of the corresponding wrong word and in a reverse relation with the corresponding language environment probability, and is also influenced by the error information of the wrong word. The evaluation scores of the candidate words are obtained by comprehensively considering habit problems, editing distance and language environment information when the user writes the words, and compared with a traditional candidate word evaluation method, the evaluation method is beneficial to improving the evaluation reliability of the candidate words.
In one embodiment, the error information of the error word relative to each candidate word includes: and the wrong word and the candidate word have the same initial letter information. Correspondingly, the step of determining the evaluation score corresponding to each candidate word according to the editing distance, the language environment probability and the error information comprises the following steps: if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score corresponding to the candidate word according to the editing distance, the language environment probability and the second coefficient.
Optionally, the evaluation score of each candidate word is calculated according to the following formula:
score_word = -K/(D_edit × P_word^mx)
wherein word represents the candidate word, D_edit indicates the edit distance between the candidate word and the wrong word, P_word^mx represents the language environment probability of the candidate word under language model mx, score_word represents the evaluation score corresponding to the candidate word, and K represents the error information of the wrong word relative to each candidate word: if the candidate word and the wrong word have the same first letter, K takes the value K1, otherwise K takes the value K2; K1 and K2 are preset numerical values.
For example, assume that the language environment probability determined by the N-Gram language model is P_word^3-gram; the formula for calculating the score of each candidate word may then be:
score_word = -K/(D_edit × P_word^3-gram)
The value of K is selected according to whether the candidate word and the wrong word have the same first letter: K takes the value 1 when they are the same, and 0.5 when they are different. It is understood that the values of K include, but are not limited to, the values in the above example.
For an English word or a character string, based on the writing habits of users, the first letter is generally not written wrongly. Therefore, evaluating the possibility that each candidate word is the error-correcting word of the wrong word on this basis allows the evaluation score of each candidate word to be determined more reasonably: if the candidate word has the same first letter as the wrong word, it is more likely to be the error-correcting word; otherwise, it is less likely.
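The first-letter rule can be sketched in Python as follows (the coefficients 1 and 0.5 are only the example values given above, and lang_prob is assumed to be the negative log probability of the candidate at the wrong-word position under some language model):

```python
def evaluation_score(candidate, wrong_word, d_edit, lang_prob, k1=1.0, k2=0.5):
    """score_word = -K / (D_edit x P): K = k1 when the candidate and the
    wrong word share a first letter (first letters are rarely miswritten),
    k2 otherwise."""
    k = k1 if candidate[:1] == wrong_word[:1] else k2
    return -k / (d_edit * lang_prob)
```

For instance, evaluation_score("one", "sone", 1, -4.8) is penalized by K = 0.5 because one does not start with s.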
For example, it is known that when a user writes English words on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is "I have sone apples".
1. Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined as a wrong word.
2. Acquiring words with edit distances of 1 and 2 from sone in the English dictionary; assume that some, same, son and one exist as candidate words.
3. Respectively replacing sone in "I have sone apples" with some, same, son and one, and calculating P_word^3-gram corresponding to each candidate word according to the N-Gram model, with N taking the value 3; the language environment probability of each candidate word is obtained as follows: P_some^3-gram = -5.7; P_same^3-gram = -6; P_son^3-gram = -6; P_one^3-gram = -4.8.
wherein P(I, have, some) = c(I, have, some)/c(3-Gram), and c represents the count, i.e. the count (number) of the 3-tuple I, have, some in the corpus divided by the count of all 3-tuples;
P(have | I) = c(I, have)/c(I), i.e. the count of the 2-tuple I, have in the corpus divided by the count of I;
P(I) = c(I)/c(1-Gram), i.e. the count of the 1-tuple I in the corpus divided by the count of all 1-tuples.
4. According to the formula score_word = -K/(D_edit × P_word^3-gram), calculating the evaluation score of each candidate word:
score_some = -1 × 1 × 1/(-5.7) = 0.175;
score_same = -1 × 0.5 × 1/(-6) = 0.083;
score_son = -1 × 1 × 1/(-6) = 0.167;
score_one = -0.5 × 1 × 1/(-4.8) = 0.104.
Therefore, the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
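The count-based estimates c(·)/c(·) in step 3 can be sketched over a toy corpus as follows (the corpus is hypothetical and for illustration only; a real implementation would train on a large corpus):

```python
from collections import Counter

def ngram_counts(corpus_sentences, n):
    """Count all n-grams in a whitespace-tokenized corpus."""
    counts = Counter()
    for sent in corpus_sentences:
        words = sent.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def joint_trigram_prob(corpus, w1, w2, w3):
    """P(w1, w2, w3) = c(w1, w2, w3) / c(3-Gram): the trigram count divided
    by the total number of trigrams, as described in the embodiment."""
    tri = ngram_counts(corpus, 3)
    return tri[(w1, w2, w3)] / sum(tri.values())

def cond_bigram_prob(corpus, prev, word):
    """P(word | prev) = c(prev, word) / c(prev)."""
    bi = ngram_counts(corpus, 2)
    uni = ngram_counts(corpus, 1)
    return bi[(prev, word)] / uni[(prev,)]
```

On the two-sentence corpus ["I have some apples", "I have one dog"], joint_trigram_prob(corpus, "I", "have", "some") is 1/4 and cond_bigram_prob(corpus, "I", "have") is 1.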
In another embodiment, for the BiLSTM model, step 1 and step 2 above are the same;
Step 3 is replaced by: respectively replacing sone in "I have sone apples" with some, same, son and one, and calculating the language environment probability corresponding to each candidate word as follows: P_some^BiLSTM = -4.9; P_same^BiLSTM = -5.07; P_son^BiLSTM = -7.6; P_one^BiLSTM = -4.66.
Step 4 is replaced by:
According to the formula score_word = -K/(D_edit × P_word^BiLSTM), calculating the evaluation score of each candidate word:
score_some = -1 × 1 × 1/(-4.9) = 0.204;
score_same = -1 × 1/2 × 1/(-5.07) = 0.098;
score_son = -1 × 1 × 1/(-7.6) = 0.132;
score_one = -0.5 × 1 × 1/(-4.66) = 0.107.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same;
Step 3 is replaced by: respectively replacing sone in "I have sone apples" with some, same, son and one, and calculating the language environment probability corresponding to each candidate word as follows: P_some^LSTM = -5.6; P_same^LSTM = -6.3; P_son^LSTM = -8.6; P_one^LSTM = -5.1.
Step 4 is replaced by:
According to the formula score_word = -K/(D_edit × P_word^LSTM), calculating the evaluation score of each candidate word:
score_some = -1 × 1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/2 × 1/(-6.3) = 0.079;
score_son = -1 × 1 × 1/(-8.6) = 0.116;
score_one = -0.5 × 1 × 1/(-5.1) = 0.098.
It can also be seen that the candidate word some is more likely than the other candidate words to be the correct word (error-correcting word) corresponding to the wrong word sone.
With this embodiment, the edit distance information and the contextual language environment are considered, and the habit that the first letter is not easily written wrongly when a user writes English words is also taken into account, so that a more accurate error-correcting word can be determined.
As shown in fig. 9, in an eighth embodiment, there is provided a candidate word evaluation method including the steps of:
Step S81, detecting a wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S82, determining the edit distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S83, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
Step S84, acquiring error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S85, determining the evaluation score corresponding to each candidate word according to the edit distance, the similarity and the error information.
Specifically, this step is equivalent to scoring the candidate words according to the edit distance between the candidate words and the wrong word, the error information, and the degree of closeness between the candidate words and the wrong word, and integrating these three aspects to evaluate the possibility that each candidate word is the error-correcting word corresponding to the wrong word.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance, the similarity and the error information between each candidate word and the wrong word. The edit distance and the similarity between the candidate words and the wrong word, as well as the user's writing habits, are all considered, which can improve the reliability of the candidate word evaluation result.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating an evaluation score corresponding to the candidate word according to the editing distance, the similarity and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score corresponding to the candidate word according to the editing distance, the similarity and the second coefficient.
In one embodiment, the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the first coefficient includes: and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and the first coefficient.
In one embodiment, the step of calculating the evaluation score corresponding to the candidate word according to the edit distance, the similarity and the second coefficient includes: and calculating the evaluation score corresponding to the candidate word according to the reciprocal of the editing distance, the similarity and a second coefficient.
In one embodiment, the evaluation score for each candidate word may be calculated according to the following formula:
score_word = K × S × 1/D_edit
the specific calculation process of the evaluation score is exemplified as follows:
It is known that when a user writes English words and sentences on the smart interactive tablet, the word some is wrongly written as sone, and the contextual language environment is "I have sone apples". Through English dictionary detection, it is found that sone does not exist in the English dictionary, and sone is determined as a wrong word. Candidate words of sone are obtained: some, same, one.
The edit distances between some, same, one and the wrong word sone are respectively: 1, 2, 1.
1. According to the formula S = 1/D_edit + 0.7 × D_lcs1 + 0.3 × D_lcs2, the similarity between each candidate word and the wrong word is calculated as follows:
S_some: 1 + 0.7 × 0.75 + 0.3 × 0.5 = 1.675;
S_same: 1/2 + 0.7 × 0.5 + 0.3 × 0.25 = 0.925;
S_one: 1 + 0.7 × 0.86 + 0.3 × 0.86 = 1.86.
2. According to the formula score_word = K × S × 1/D_edit, the evaluation score of each candidate word is calculated as follows:
score_some = 1 × 1.675 × 1/1 = 1.675;
score_same = 1 × 0.925 × 1/2 = 0.463;
score_one = 0.5 × 1.86 × 1/1 = 0.93.
As can be seen from the above evaluation scores, the evaluation score of the candidate word some is higher than the evaluation scores of same and one; the evaluation result has a certain reference value for determining the error-correcting word.
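The whole eighth-embodiment score can be sketched in Python (the K values 1 and 0.5 are assumed from the earlier examples, and the similarity S is taken as computed in step 1):

```python
def evaluation_score(candidate, wrong_word, similarity, d_edit, k1=1.0, k2=0.5):
    """score_word = K x S x 1/D_edit: K = k1 when the first letters agree
    and k2 otherwise (coefficient values assumed from the example)."""
    k = k1 if candidate[:1] == wrong_word[:1] else k2
    return k * similarity / d_edit
```

With the example values, the scores reproduce 1.675 for some, 0.4625 (printed as 0.463) for same and 0.93 for one.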
As shown in fig. 10, in a ninth embodiment, there is provided a candidate word evaluation method including the steps of:
Step S91, detecting a wrong word, and obtaining a plurality of candidate words corresponding to the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S92, determining the edit distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S93, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
And step S94, acquiring error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S95, determining the evaluation score corresponding to each candidate word according to the editing distance, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance between each candidate word and the wrong word, the evaluation probability and the error information. The method considers not only the edit distance between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and help improve the efficiency and accuracy of text editing.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word is determined according to the reciprocal of the edit distance, the reciprocal of the evaluation probability, and the error information.
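As a quick check of why the reciprocal of the log value preserves ranking: the log of a probability in (0, 1) is negative, so negating its reciprocal gives a positive score that is larger for the more probable word. A minimal sketch with arbitrary probability values:

```python
import math

p_likely, p_unlikely = 0.01, 0.001                 # arbitrary example probabilities
l1, l2 = math.log(p_likely), math.log(p_unlikely)  # both logs are negative
s1, s2 = -1.0 / l1, -1.0 / l2                      # negated reciprocals, as in the scores
# The more probable word keeps the larger score.
assert p_likely > p_unlikely and s1 > s2
```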
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the evaluation probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula: score_word = -K × (1/D_edit) × (1/P), where P is the log-valued evaluation probability of the candidate word and K is the first or second coefficient.
the specific calculation process of the evaluation score is exemplified as follows:
It is known that, when writing an English word or sentence on the smart interactive tablet, a user intends to write some but mistakenly writes it as sone, and the contextual language environment is "I have sone applets".
For the N-Gram model, N in the N-Gram model takes a value of 3:
1. and detecting the tone in the English dictionary through English dictionary detection, and determining the tone as a wrong word.
2. And acquiring words with edit distances of 1 and 2 from the song in the English dictionary, and assuming that the song, the same, the son and the one exist as candidate words. The editing distances corresponding to the candidate words some, same, son and one are respectively as follows: 1,2,1,1.
3. Replace sone in "I have sone applets" with some, same, son and one respectively to obtain the corresponding evaluation sentences; calculate the evaluation probability corresponding to each candidate word according to the N-Gram model and the evaluation sentences. The obtained evaluation probabilities are: P_some = -5.6; P_same = -6.35; P_son = -7.58; P_one = -5.26.
4. According to the formula score_word = -K × (1/D_edit) × (1/P), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1/1 × 1/(-5.6) = 0.179;
score_same = -1 × 1/2 × 1/(-6.35) = 0.079;
score_son = -1 × 1/1 × 1/(-7.58) = 0.132;
score_one = -0.5 × 1/1 × 1/(-5.26) = 0.095.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
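The scoring of step 4 can be sketched as follows, assuming the log evaluation probabilities quoted above are given as model outputs; eval_score is a hypothetical helper name, not from the patent:

```python
def eval_score(k, d_edit, log_p):
    # score_word = -K * (1/D_edit) * (1/P); P is a (negative) log
    # evaluation probability, so a less negative P yields a larger score.
    return -k * (1.0 / d_edit) * (1.0 / log_p)

# Candidate: (K, edit distance to "sone", assumed log evaluation probability)
candidates = {
    "some": (1.0, 1, -5.6),
    "same": (1.0, 2, -6.35),
    "son":  (1.0, 1, -7.58),
    "one":  (0.5, 1, -5.26),
}
scores = {w: round(eval_score(*args), 3) for w, args in candidates.items()}
```

Ranking the dictionary by value reproduces the conclusion: some scores highest.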
In another embodiment, for the BiLSTM model, step 1 and step 2 are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively; calculate the evaluation probability corresponding to each candidate word according to the BiLSTM model and the evaluation sentences. The obtained evaluation probabilities are: P_some = -2.88; P_same = -3.62; P_son = -4.1; P_one = -2.62.
step 4 is replaced by:
According to the formula score_word = -K × (1/D_edit) × (1/P), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1/1 × 1/(-2.88) = 0.347;
score_same = -1 × 1/2 × 1/(-3.62) = 0.138;
score_son = -1 × 1/1 × 1/(-4.1) = 0.244;
score_one = -0.5 × 1/1 × 1/(-2.62) = 0.191.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively; calculate the evaluation probability corresponding to each candidate word according to the LSTM model and the evaluation sentences. The obtained evaluation probabilities are: P_some = -3.175; P_same = -3.75; P_son = -4.6; P_one = -3.45.
step 4 is replaced by:
According to the formula score_word = -K × (1/D_edit) × (1/P), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1/1 × 1/(-3.175) = 0.315;
score_same = -1 × 1/2 × 1/(-3.75) = 0.133;
score_son = -1 × 1/1 × 1/(-4.6) = 0.217;
score_one = -0.5 × 1/1 × 1/(-3.45) = 0.145.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through the candidate word evaluation score calculation under the above three language models, the user's habits when writing words are considered, and the edit distance and the language environment information are also considered, so that the candidate word for correcting the wrong word can be determined more effectively.
As shown in fig. 11, in a tenth embodiment, there is provided a candidate word evaluation method including the steps of:
and S101, detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words.
For the implementation of this step, reference may be made to the corresponding steps of the first embodiment, which are not described in detail.
And step S102, determining the similarity of each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the corresponding steps of the second embodiment, which are not described in detail.
And step S103, determining the language environment probability of each candidate word at the wrong word position.
For the implementation of this step, reference may be made to the description of the third embodiment, which is not repeated.
And step S104, acquiring error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S105, determining the evaluation score corresponding to each candidate word according to the similarity, the language environment probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the similarity between each candidate word and the wrong word, the language environment probability and the error information. The method considers not only the similarity between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and help improve the efficiency and accuracy of text editing.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the similarity, the inverse of the language environment probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the similarity, the language environment probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the language environment probability and the first coefficient; and if the wrong word is different from the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the language environment probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula: score_word = -K × S × (1/P), where S is the similarity, P is the log-valued language environment probability of the candidate word at the wrong-word position, and K is the first or second coefficient.
In one embodiment, the language model is an N-Gram model, a BiLSTM model, or an LSTM model.
Optionally, the language environment probability corresponding to the candidate word may be calculated by combining multiple language models.
Optionally, the evaluation score corresponding to each candidate word determined through the N-Gram model may be calculated by the formula score_word = -K × S × (1/P_3-gram), where P_3-gram denotes the language environment probability corresponding to a certain candidate word calculated through the N-Gram model.
For the N-Gram model, N in the N-Gram model takes a value of 3:
A specific example is as follows: it is known that, when writing an English word or sentence on the smart interactive tablet, a user intends to write some but mistakenly writes it as sone, and the contextual language environment is "I have sone applets".
1. Through English-dictionary detection, sone is found to be absent from the English dictionary and is determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are obtained from the English dictionary; assume that some, same, son and one exist as candidate words. The similarity between each candidate word and the wrong word is determined as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word according to the replaced sentences.
The language environment probability of each candidate word at the position of sone, calculated according to the N-Gram model, is: P_some = -5.7; P_same = -6; P_son = -8; P_one = -4.8.
where P(I, have, some) = c(I, have, some)/c(3-Gram), c denoting a count; that is, the count of the 3-tuple (I, have, some) in the corpus divided by the count of all 3-tuples;
where P(have | I) = c(I, have)/c(I); that is, the count of the 2-tuple (I, have) in the corpus divided by the count of I;
where P(I) = c(I)/c(1-Gram); that is, the count of I in the corpus divided by the count of all 1-tuples.
4. According to the formula score_word = -K × S × (1/P), the evaluation score corresponding to each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.7) = 0.294;
score_same = -1 × 0.925 × 1/(-6) = 0.154;
score_son = -1 × 1.86 × 1/(-8) = 0.233;
score_one = -0.5 × 1.86 × 1/(-4.8) = 0.194.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
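The count-based estimates of step 3 can be sketched on a tiny invented corpus; the corpus and variable names below are ours, purely to illustrate the counting scheme:

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tiny invented corpus, for illustration only.
corpus = [
    "I have some apples".split(),
    "I have some time".split(),
    "you have some time".split(),
]

tri = Counter(g for s in corpus for g in ngrams(s, 3))
bi = Counter(g for s in corpus for g in ngrams(s, 2))
uni = Counter(g for s in corpus for g in ngrams(s, 1))

# c(I, have, some) / c(3-Gram): joint estimate over all 3-tuples.
p_joint = tri[("I", "have", "some")] / sum(tri.values())
# P(some | I, have) = c(I, have, some) / c(I, have): conditional estimate.
p_cond = tri[("I", "have", "some")] / bi[("I", "have")]
# P(I) = c(I) / c(1-Gram): unigram estimate.
p_unigram = uni[("I",)] / sum(uni.values())
```

In this toy corpus the 3-tuple (I, have, some) occurs twice out of six trigrams, so the joint estimate is 1/3, while the conditional estimate P(some | I, have) is 1.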
In another embodiment, for the BiLSTM model, step 1 and step 2 are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word. The obtained probabilities are: P_some = -5.6; P_same = -6.3; P_son = -8.6; P_one = -5.1.
step 4 is replaced by:
According to the formula score_word = -K × S × (1/P), the evaluation score corresponding to each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/(-6.3) = 0.147;
score_son = -1 × 1.86 × 1/(-8.6) = 0.216;
score_one = -0.5 × 1.86 × 1/(-5.1) = 0.182.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word. The obtained probabilities are: P_some = -5.6; P_same = -6.3; P_son = -8.6; P_one = -5.1.
step 4 is replaced by:
According to the formula score_word = -K × S × (1/P), the evaluation score corresponding to each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/(-6.3) = 0.147;
score_son = -1 × 1.86 × 1/(-8.6) = 0.216;
score_one = -0.5 × 1.86 × 1/(-5.1) = 0.182.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through candidate word evaluation under the above three language models, the user's habits when writing words are considered, and similarity information and language environment evaluation information are added, thereby improving the reliability of candidate word evaluation.
As shown in fig. 12, in an eleventh embodiment, there is provided a candidate word evaluation method including the steps of:
and step S111, detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S112, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
Step S113, replacing the wrong word with each candidate word respectively to obtain a candidate sentence, and determining the evaluation probability of the corresponding candidate word according to the candidate sentence, wherein the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of the adjacent word of the candidate word.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
Step S114, obtaining error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S115, determining the evaluation score corresponding to each candidate word according to the similarity, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the similarity between each candidate word and the wrong word, the evaluation probability and the error information. The method considers not only the similarity between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the similarity, the inverse of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability, and the error information includes: if the wrong word is the same as the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the evaluation probability and the first coefficient; and if the wrong word is different from the initial of the candidate word, calculating the evaluation score of the candidate word according to the similarity, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula: score_word = -K × S × (1/P), where S is the similarity and P is the log-valued evaluation probability of the candidate word.
the specific calculation process of the evaluation score is exemplified as follows:
It is known that, when writing an English word or sentence on the smart interactive tablet, a user intends to write some but mistakenly writes it as sone, and the contextual language environment is "I have sone applets".
1. Through English-dictionary detection, sone is found to be absent from the English dictionary and is determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are obtained from the English dictionary; assume that some, same, son and one exist as candidate words. The similarity between each candidate word and the wrong word is determined as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, obtaining:
candidate statement one: I have some applets;
candidate statement two: i have same applets;
candidate statement three: i have son applets;
candidate sentence four: i have one applets.
Based on the N-Gram model with N taking a value of 3, the neighboring word corresponding to the candidate word some in candidate statement one is applets, the neighboring word corresponding to same in candidate statement two is applets, the neighboring word corresponding to son in candidate statement three is applets, and the neighboring word corresponding to one in candidate statement four is applets. The evaluation probability corresponding to each candidate word is calculated according to the N-Gram model; the obtained evaluation probabilities are: P_some = -5.6; P_same = -6.35; P_son = -7.58; P_one = -5.26.
4. According to the formula score_word = -K × S × (1/P), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/(-6.35) = 0.146;
score_son = -1 × 1.86 × 1/(-7.58) = 0.245;
score_one = -0.5 × 1.86 × 1/(-5.26) = 0.177.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
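This embodiment's scoring can be sketched as below, assuming the similarity values and log evaluation probabilities quoted above; eval_score is a hypothetical helper name:

```python
def eval_score(k, s, log_p):
    # score_word = -K * S * (1/P); P is the (negative) log-valued
    # evaluation probability of the candidate in its candidate sentence.
    return -k * s * (1.0 / log_p)

# Candidate: (K, similarity S, assumed log evaluation probability)
candidates = {
    "some": (1.0, 1.675, -5.6),
    "same": (1.0, 0.925, -6.35),
    "son":  (1.0, 1.86, -7.58),
    "one":  (0.5, 1.86, -5.26),
}
scores = {w: round(eval_score(*args), 3) for w, args in candidates.items()}
```

Sorting the scores again puts some first, matching the conclusion above.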
In another embodiment, for the BiLSTM model, the neighboring words corresponding to the candidate word some in candidate statement one are I, have and applets; those corresponding to same in candidate statement two are I, have and applets; those corresponding to son in candidate statement three are I, have and applets; and those corresponding to one in candidate statement four are I, have and applets. Based on this, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as: P_some = -2.88; P_same = -3.62; P_son = -4.1; P_one = -2.62.
step 4 is replaced by:
According to the formula score_word = -K × S × (1/P), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-2.88) = 0.582;
score_same = -1 × 0.925 × 1/(-3.62) = 0.256;
score_son = -1 × 1.86 × 1/(-4.1) = 0.454;
score_one = -0.5 × 1.86 × 1/(-2.62) = 0.355.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, the neighboring words corresponding to the candidate word some in candidate statement one are I, have and applets; those corresponding to same in candidate statement two are I, have and applets; those corresponding to son in candidate statement three are I, have and applets; and those corresponding to one in candidate statement four are I, have and applets. Based on this, the evaluation probability corresponding to the candidate word in each candidate sentence is calculated as: P_some = -3.175; P_same = -3.75; P_son = -4.6; P_one = -3.45.
step 4 is replaced by:
According to the formula score_word = -K × S × (1/P), the evaluation score of each candidate word is calculated as follows:
score_some = -1 × 1.675 × 1/(-3.175) = 0.528;
score_same = -1 × 0.925 × 1/(-3.75) = 0.247;
score_son = -1 × 1.86 × 1/(-4.6) = 0.404;
score_one = -0.5 × 1.86 × 1/(-3.45) = 0.27.
it can also be seen that the candidate word some has a higher probability of being the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through candidate word evaluation under the above three language models, the user's habits when writing words are considered, and similarity information and language environment evaluation information are added, thereby improving the reliability of candidate word evaluation.
As shown in fig. 13, in a twelfth embodiment, there is provided a candidate word evaluation method including the steps of:
step S121, a wrong word is detected, and a plurality of candidate words corresponding to the wrong word are obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S122, determining the editing distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And S123, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
And step S124, determining the language environment probability of each candidate word at the wrong word position.
For the implementation of this step, reference may be made to the description of the third embodiment, which is not repeated.
Step S125, error information of the wrong word relative to each candidate word is obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S126, determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity, the language environment probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance and similarity between each candidate word and the wrong word, the language environment probability and the error information. The method considers not only the edit distance and similarity between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and further help improve the efficiency and accuracy of text editing.
In one embodiment, when the language environment probability is represented by a log value of the probability of each candidate word at the wrong word position, the evaluation score corresponding to each candidate word may be determined according to the inverse of the edit distance, the similarity, the inverse of the language environment probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the language environment probability, and the error information includes: if the wrong word is the same as the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the language environment probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the language environment probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula: score_word = -K × S × (1/D_edit) × (1/P), where P is the log-valued language environment probability of the candidate word at the wrong-word position.
the specific calculation process of the evaluation score is exemplified as follows:
It is known that, when writing an English word or sentence on the smart interactive tablet, a user intends to write some but mistakenly writes it as sone, and the contextual language environment is "I have sone applets".
1. Through English-dictionary detection, sone is found to be absent from the English dictionary and is determined to be a wrong word.
2. Words whose edit distance from sone is 1 or 2 are obtained from the English dictionary; assume that some, same, son and one exist as candidate words. The similarity between each candidate word and the wrong word is determined as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86. The edit distances corresponding to the candidate words some, same, son and one are 1, 2, 1 and 1, respectively.
3. Replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word according to the N-Gram model, with N in the N-Gram model taking a value of 3. The obtained language environment probabilities are: P_some = -5.7; P_same = -6; P_son = -8; P_one = -4.8.
4. According to the formula score_word = -K × S × (1/D_edit) × (1/P), the evaluation score of each candidate word is calculated as:
score_some = -1 × 1.675 × 1/1 × 1/(-5.7) = 0.294;
score_same = -1 × 0.925 × 1/2 × 1/(-6) = 0.077;
score_son = -1 × 1.86 × 1/1 × 1/(-8) = 0.233;
score_one = -0.5 × 1.86 × 1/1 × 1/(-4.8) = 0.194.
therefore, the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
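The combined score of this embodiment can be sketched in the same way, assuming the similarity, edit-distance and log-probability values of the N-Gram example above; eval_score is a hypothetical helper name:

```python
def eval_score(k, s, d_edit, log_p):
    # score_word = -K * S * (1/D_edit) * (1/P); P is the (negative)
    # log-valued language environment probability of the candidate.
    return -k * s * (1.0 / d_edit) * (1.0 / log_p)

# Candidate: (K, similarity S, edit distance, assumed log probability)
candidates = {
    "some": (1.0, 1.675, 1, -5.7),
    "same": (1.0, 0.925, 2, -6.0),
    "son":  (1.0, 1.86, 1, -8.0),
    "one":  (0.5, 1.86, 1, -4.8),
}
scores = {w: round(eval_score(*args), 3) for w, args in candidates.items()}
```

As in the worked example, some receives the highest combined score.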
In another embodiment, for the BiLSTM model, step 1 and step 2 are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word. The obtained probabilities are: P_some = -4.9; P_same = -5.07; P_son = -7.6; P_one = -4.66.
step 4 is replaced by:
According to the formula score_word = -K × S × (1/D_edit) × (1/P), the evaluation score of each candidate word is calculated as:
score_some = -1 × 1.675 × 1/1 × 1/(-4.9) = 0.342;
score_same = -1 × 0.925 × 1/2 × 1/(-5.07) = 0.091;
score_son = -1 × 1.86 × 1/1 × 1/(-7.6) = 0.245;
score_one = -0.5 × 1.86 × 1/1 × 1/(-4.66) = 0.2.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
In another embodiment, for the LSTM model, step 1 and step 2 above are the same; step 3 is replaced by: replace sone in "I have sone applets" with some, same, son and one respectively, and calculate the language environment probability corresponding to each candidate word. The obtained probabilities are: P_some = -5.6; P_same = -6.3; P_son = -8.6; P_one = -5.1.
step 4 is replaced by:
According to the formula score_word = -K × S × (1/D_edit) × (1/P), the evaluation score of each candidate word is calculated as:
score_some = -1 × 1.675 × 1/1 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/2 × 1/(-6.3) = 0.073;
score_son = -1 × 1.86 × 1/1 × 1/(-8.6) = 0.216;
score_one = -0.5 × 1.86 × 1/1 × 1/(-5.1) = 0.182.
it can also be known that the candidate word some is more likely to be the correct word (error-correcting word) corresponding to the error word sone than other candidate words.
Through candidate word evaluation under the above three language models, the user's habits when writing words are considered, and edit distance, similarity and language environment evaluation information are added, thereby improving the reliability of candidate word evaluation.
As shown in fig. 14, in the thirteenth embodiment, there is provided a candidate word evaluation method including the steps of:
step S131, a wrong word is detected, and a plurality of candidate words corresponding to the wrong word are obtained.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
Step S132, determining the editing distance between each candidate word and the wrong word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And step S133, determining the similarity between each candidate word and the wrong word, wherein the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
For the implementation of this step, reference may be made to the description of the second embodiment, which is not repeated.
And S134, replacing the wrong words with the candidate words respectively to obtain candidate sentences, and determining the evaluation probability of the corresponding candidate words according to the candidate sentences, wherein the evaluation probability is obtained according to the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words.
For the implementation of this step, reference may be made to the description of the fourth embodiment, which is not repeated.
Step S135, obtain error information of the wrong word relative to each candidate word.
The implementation of this step can refer to the description of the first embodiment, and is not repeated.
And S136, determining the evaluation score corresponding to each candidate word according to the editing distance, the similarity, the evaluation probability and the error information.
In this embodiment, the evaluation score corresponding to each candidate word is determined according to the edit distance and similarity between each candidate word and the wrong word, the evaluation probability and the error information. The method considers not only the edit distance and similarity between the candidate words and the wrong word and the user's writing habits, but also adds the evaluation information of the language model, which can improve the reliability of the candidate word evaluation result and help improve the efficiency and accuracy of text editing.
In one embodiment, the language environment probability of each word is represented by a log of the probability of each word at its position, based on which an evaluation probability of a candidate word is obtained, and further, an evaluation score corresponding to each candidate word may be determined according to the reciprocal of the edit distance, the similarity, the reciprocal of the evaluation probability, and the error information.
In one embodiment, the step of determining an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the evaluation probability, and the error information includes: if the wrong word is the same as the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the evaluation probability and the first coefficient; and if the wrong word is different from the initial letter of the candidate word, calculating the evaluation score of the candidate word according to the editing distance, the similarity, the evaluation probability and the second coefficient.
In one embodiment, the evaluation score for each candidate word is calculated according to the following formula: score = k × S × (1/d) × (1/P), where k is the first coefficient or the second coefficient, S is the similarity between the candidate word and the wrong word, d is the edit distance, and P is the evaluation probability.
The specific calculation process of the evaluation score is exemplified as follows:
For example, when a user writes an English word or sentence on the smart interactive tablet, the intended word "some" is wrongly written as "sone", and the contextual language environment is "I have sone apples".
1. Through English dictionary detection, it is found that "sone" is not in the English dictionary, and "sone" is determined to be a wrong word.
2. Obtain words with an edit distance of 1 or 2 from "sone" in the English dictionary, and assume that "some", "same", "son" and "one" are obtained as candidate words. The similarity between each candidate word and the wrong word is calculated as: S_some = 1.675; S_same = 0.925; S_son = 1.86; S_one = 1.86. The edit distances corresponding to the candidate words some, same, son and one are 1, 2, 1 and 1, respectively.
3. Replace "sone" in "I have sone apples" with "some", "same", "son" and "one" respectively, and calculate the evaluation probability corresponding to each candidate word according to the N-Gram model. The obtained evaluation probabilities are: P_some = -5.6; P_same = -6.35; P_son = -7.58; P_one = -5.26.
4. According to the formula score = k × S × (1/d) × (1/P), the evaluation score of each candidate word is calculated as:
score_some = -1 × 1.675 × 1/1 × 1/(-5.6) = 0.299;
score_same = -1 × 0.925 × 1/2 × 1/(-6.35) = 0.073;
score_son = -1 × 1.86 × 1/1 × 1/(-7.58) = 0.245;
score_one = -0.5 × 1.86 × 1/1 × 1/(-5.26) = 0.177.
Therefore, compared with the other candidate words, the candidate word "some" is more likely to be the correct word (error-correcting word) corresponding to the wrong word "sone".
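The N-Gram worked example above can be sketched in Python. The coefficient k, similarity S, edit distance d, and evaluation probability P for each candidate are taken directly from steps 2 and 3; the score follows the formula score = k × S × (1/d) × (1/P) that the calculations instantiate, and the function and variable names are illustrative only.

```python
# Score formula from the worked example: score = k * S * (1/d) * (1/P),
# where P is a (negative) log-probability from the language model.
def evaluation_score(k, similarity, edit_distance, eval_probability):
    return k * similarity * (1.0 / edit_distance) * (1.0 / eval_probability)

# word: (coefficient k, similarity S, edit distance d, evaluation probability P)
# k = -1 when the candidate shares the wrong word's initial letter, -0.5 otherwise.
candidates = {
    "some": (-1.0, 1.675, 1, -5.6),
    "same": (-1.0, 0.925, 2, -6.35),
    "son":  (-1.0, 1.86,  1, -7.58),
    "one":  (-0.5, 1.86,  1, -5.26),  # initial 'o' differs from 's' in "sone"
}

scores = {word: round(evaluation_score(*args), 3) for word, args in candidates.items()}
best = max(scores, key=scores.get)  # highest-scored candidate
```

Running this reproduces the scores 0.299, 0.073, 0.245 and 0.177 above and selects "some" as the most likely error-correcting word.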
In another embodiment, for the BiLSTM model, the evaluation probabilities corresponding to the candidate words in each candidate sentence are calculated as: P_some = -2.88; P_same = -3.62; P_son = -4.1; P_one = -2.62.
Step 4 is replaced by:
According to the formula score = k × S × (1/d) × (1/P), the evaluation score of each candidate word is calculated as:
score_some = -1 × 1.675 × 1/1 × 1/(-2.88) = 0.582;
score_same = -1 × 0.925 × 1/2 × 1/(-3.62) = 0.128;
score_son = -1 × 1.86 × 1/1 × 1/(-4.1) = 0.454;
score_one = -0.5 × 1.86 × 1/1 × 1/(-2.62) = 0.355.
It can likewise be seen that, compared with the other candidate words, the candidate word "some" is more likely to be the correct word (error-correcting word) corresponding to the wrong word "sone".
In another embodiment, for the LSTM model, the evaluation probabilities corresponding to the candidate words in each candidate sentence are calculated as: P_some = -3.175; P_same = -3.75; P_son = -4.6; P_one = -3.45.
Step 4 is replaced by:
According to the formula score = k × S × (1/d) × (1/P), the evaluation score of each candidate word is calculated as:
score_some = -1 × 1.675 × 1/1 × 1/(-3.175) = 0.528;
score_same = -1 × 0.925 × 1/2 × 1/(-3.75) = 0.124;
score_son = -1 × 1.86 × 1/1 × 1/(-4.6) = 0.404;
score_one = -0.5 × 1.86 × 1/1 × 1/(-3.45) = 0.27.
It can again be seen that, compared with the other candidate words, the candidate word "some" has a higher probability of being the correct word (error-correcting word) corresponding to the wrong word "sone".
Through candidate word evaluation under the three language models, the user's word-writing habits are taken into account, and evaluation information on the edit distance, the similarity, and the language environment is added, so that the reliability of the candidate word evaluation results is improved.
On the basis of obtaining the evaluation score corresponding to the candidate word in any of the above embodiments, in an embodiment, the candidate word evaluation method further includes the steps of: determining, from the candidate words according to the evaluation scores, the error-correcting word corresponding to the wrong word, and correcting the wrong word by using the error-correcting word; in this way, the correction of the wrong word can be determined more accurately.
Optionally, a candidate word with the highest evaluation score is selected from the multiple candidate words of the wrong word, and is used as the error correction word corresponding to the wrong word. Further, the wrong words can be replaced by the error correction words, so that the effect of automatically correcting the wrong words is achieved.
In addition, in an embodiment, on the basis of obtaining the evaluation score corresponding to the candidate word in any of the above embodiments, the candidate word evaluation method further includes the steps of: sorting the candidate words according to the evaluation scores, and displaying the candidate words according to the sorting, so that candidate words with higher evaluation scores are displayed earlier and the user is prompted more effectively.
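A minimal sketch of this ranking step, reusing the scores from the earlier N-Gram worked example (the dictionary literal is illustrative):

```python
# Sort candidate words by evaluation score, highest first, so that
# better-scored candidates are displayed earlier to the user.
scores = {"some": 0.299, "same": 0.073, "son": 0.245, "one": 0.177}
ranked = sorted(scores, key=scores.get, reverse=True)

# The top-ranked candidate can also serve as the error-correcting word.
correction = ranked[0]
```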
In an embodiment, on the basis of any of the above embodiments, the candidate word evaluation method further includes: detecting whether a word to be detected is in a preset word bank, and if not, determining that the word to be detected is a wrong word. For example: scan each word, detect whether each word is in the dictionary, and determine that a word is a wrong word if it is not in the dictionary.
It can be understood that the preset word stock may be a general english dictionary, a chinese dictionary, or the like, or may be other specific word stocks, and the word stock may be selected according to actual situations.
In an embodiment, optionally, after detecting the wrong word, the candidate word evaluation method further includes a step of determining a candidate word set corresponding to the wrong word. The steps can be as follows: and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word. For example, a known word with an edit distance less than 3 is selected as a candidate word, thereby improving the effectiveness of candidate word evaluation.
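The candidate-set construction described above can be sketched as follows; the Levenshtein distance is one standard realization of the edit distance, and the small lexicon here is only an illustrative stand-in for the preset word bank.

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution / match
            prev = cur
    return dp[len(b)]

# Illustrative stand-in for the preset word bank.
lexicon = ["some", "same", "son", "one", "song", "apples", "have"]
wrong = "sone"
# Select known words whose edit distance falls within the set range (< 3 here).
candidates = [w for w in lexicon if edit_distance(wrong, w) < 3]
```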
It should be understood that, for the foregoing method embodiments, although the steps in the flowcharts are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the method embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the ideas of the candidate word evaluation methods of the first to thirteenth embodiments, the embodiments of the present application further provide corresponding candidate word evaluation devices.
As shown in fig. 15, in the fourteenth embodiment, the candidate word evaluation means includes:
the candidate word acquisition module 101 is configured to detect a wrong word and acquire a plurality of candidate words corresponding to the wrong word;
a distance determining module 102, configured to determine an editing distance between each candidate word and the wrong word;
an error information obtaining module 103, configured to obtain error information of the wrong word relative to each candidate word;
and the first evaluation module 104 is configured to determine an evaluation score corresponding to each candidate word according to the edit distance and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the edit distance and the error information. The method not only considers the user's word-writing habits but also takes into account the edit distance between words, thereby improving the reliability of the candidate word evaluation results.
In an embodiment, the first evaluation module 104 is configured to determine an evaluation score corresponding to each candidate word according to an inverse of the edit distance and error information.
In an embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter;
the first evaluation module 104 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 16, in the fifteenth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 201, configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word;
a similarity determining module 202, configured to determine a similarity between each candidate word and a wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word;
an error information obtaining module 203, configured to obtain error information of the wrong word relative to each candidate word;
and the second evaluation module 204 is configured to determine an evaluation score corresponding to each candidate word according to the similarity and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, respectively determines the similarity between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the similarity and the error information. The user's writing habits are considered, and similarity information between the words is also taken into account, so that the reliability of the candidate word evaluation results can be improved.
In one embodiment, the similarity determination module 202 includes: and the first similarity calculation submodule or the second similarity calculation submodule.
The first similarity calculation submodule is configured to calculate the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word; the second similarity calculation submodule is configured to calculate the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, together with the edit distance between each candidate word and the wrong word.
In an embodiment, the second similarity calculation submodule is configured to calculate the similarity between each candidate word and the wrong word according to at least one of the longest common subsequence rate and the longest common substring rate of each candidate word and the wrong word, and the reciprocal of the edit distance between each candidate word and the wrong word.
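The longest common subsequence rate and longest common substring rate that these submodules rely on can be computed as below. Note that the particular combination used in `similarity` (normalizing each length by the longer word's length and summing) is only an illustrative assumption; the exact weighting is defined elsewhere in the embodiments.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lc_substring_length(a, b):
    """Length of the longest common (contiguous) substring of a and b."""
    best = 0
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def similarity(candidate, wrong):
    """Illustrative combination: sum of the subsequence rate and substring
    rate, each normalized by the length of the longer word."""
    n = max(len(candidate), len(wrong))
    return lcs_length(candidate, wrong) / n + lc_substring_length(candidate, wrong) / n
```

For "sone" vs "some", the longest common subsequence is "soe" (length 3) and the longest common substring is "so" (length 2).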
In an embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter; optionally, the second evaluation module 204 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the similarity and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the similarity and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 17, in the sixteenth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 301, configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word;
a first probability determination module 302, configured to determine a language environment probability of each candidate word at the wrong word position;
an error information obtaining module 303, configured to obtain error information of the wrong word relative to each candidate word;
and a third evaluation module 304, configured to determine an evaluation score corresponding to each candidate word according to the language environment probability and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, respectively determines the language environment probability of each candidate word at the position of the wrong word, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the language environment probability and the error information. The user's writing habits are considered, and the information of the context language environment is also taken into account, so that the reliability of the candidate word evaluation results can be improved.
In an embodiment, the first probability determining module 302 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In an embodiment, the third evaluation module 304 is configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the language environment probability and the error information; the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter; correspondingly, the third evaluation module 304 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the reciprocal of the language environment probability and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the reciprocal of the language environment probability and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 18, in the seventeenth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 401, configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word;
a second probability determining module 402, configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
an error information obtaining module 403, configured to obtain error information of the wrong word relative to each candidate word;
and a fourth evaluation module 404, configured to determine an evaluation score corresponding to each candidate word according to the evaluation probability and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the evaluation probability corresponding to each candidate word, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the evaluation probability and the error information. The user's writing habits are considered, and the information of the context language environment is also taken into account, so that the reliability of the candidate word evaluation results can be improved.
In an embodiment, the second probability determining module 402 is further configured to calculate probabilities of respective positions of a candidate word and neighboring words of the candidate word in the candidate sentence according to a preset language model, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
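A minimal sketch of this averaging step follows. The toy unigram model is only a stand-in assumption for the preset N-Gram, BiLSTM, or LSTM language model of the specification, and its probabilities are invented for illustration.

```python
import math

# Toy stand-in "language model" (invented probabilities, illustration only).
toy_model = {"i": 0.05, "have": 0.02, "some": 0.01, "apples": 0.001}

def language_env_prob(word):
    """Log value of the model's probability for a word at its position."""
    return math.log(toy_model.get(word, 1e-6))

def evaluation_probability(words, idx, window=1):
    """Average the language environment probabilities of the candidate word
    at position idx and of its neighboring words within the window."""
    lo, hi = max(0, idx - window), min(len(words), idx + window + 1)
    probs = [language_env_prob(w) for w in words[lo:hi]]
    return sum(probs) / len(probs)

# Candidate sentence with "some" substituted at the wrong word's position.
sentence = ["i", "have", "some", "apples"]
p = evaluation_probability(sentence, 2)  # averages "have", "some", "apples"
```

Because each term is a log probability, the result is negative, matching the sign of the evaluation probabilities in the worked examples.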
In an embodiment, the fourth evaluation module 404 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the evaluation probability and the error information; the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In an embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter; the fourth evaluation module 404 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the evaluation probability and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the evaluation probability and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 19, in the eighteenth embodiment, the candidate word evaluation means includes:
the candidate word obtaining module 501 is configured to detect a wrong word, and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 502, configured to determine an editing distance between each candidate word and the wrong word;
a second probability determining module 503, configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
and a fifth evaluation module 504, configured to determine an evaluation score of each candidate word according to the edit distance and the evaluation probability.
In an embodiment, the second probability determining module 503 is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate, according to a preset language model, the probabilities of the candidate word and the words adjacent to the candidate word in the candidate sentence at their respective positions, and use the log value of each probability as the language environment probability of the corresponding word; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to: an N-Gram model, a BiLSTM model, or an LSTM model.
Based on the above embodiment, the fifth evaluation module 504 is specifically configured to determine an evaluation score corresponding to each candidate word according to an inverse of the edit distance and an inverse of the evaluation probability.
As shown in fig. 20, in the nineteenth embodiment, the candidate word evaluation device includes:
the candidate word obtaining module 601 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A similarity determining module 602, configured to determine a similarity between each candidate word and a wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word;
a second probability determining module 603, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to a language environment probability of the candidate word in the candidate sentence and a language environment probability of a word adjacent to the candidate word;
and a sixth evaluation module 604, configured to determine an evaluation score of each candidate word according to the similarity and the evaluation probability.
In an embodiment, the second probability determining module 603 is configured to replace the wrong word with each candidate word respectively to obtain a candidate sentence, calculate, according to a preset language model, the probabilities of the candidate word and the words adjacent to the candidate word in the candidate sentence at their respective positions, and use the log value of each probability as the language environment probability of the corresponding word; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence. The language model includes, but is not limited to: an N-Gram model, a BiLSTM model, or an LSTM model.
Based on the above embodiment, the sixth evaluation module 604 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the evaluation probability and the similarity.
As shown in fig. 21, in the twentieth embodiment, the candidate word evaluation means includes:
a candidate word obtaining module 701, configured to obtain multiple candidate words corresponding to a wrong word when the wrong word is detected;
a distance determining module 702, configured to determine an editing distance between each candidate word and the wrong word;
a first probability determining module 703, configured to determine a language environment probability of each candidate word at the wrong word position;
an error information obtaining module 704, configured to obtain error information of the wrong word relative to each candidate word;
and a seventh evaluation module 705, configured to determine an evaluation score corresponding to each candidate word according to the edit distance, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, and determines the edit distance between each candidate word and the wrong word, the language environment probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the edit distance, the language environment probability, and the error information. The edit distance information and the context language environment are considered, and the user's writing habits are also taken into account, so that the reliability of the candidate word evaluation results can be improved.
In an embodiment, the first probability determining module 703 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In an embodiment, the seventh evaluation module 705 is specifically configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the edit distance, the reciprocal of the language environment probability, and the error information; the language model includes, but is not limited to, an N-Gram model, a BiLSTM model, or an LSTM model.
In one embodiment, the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same initial letter; correspondingly, the seventh evaluation module 705 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the language environment probability, and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the language environment probability, and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 22, in the twenty-first embodiment, the candidate word evaluation device includes:
the candidate word obtaining module 801 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 802, configured to determine an editing distance between each candidate word and the wrong word.
And a similarity determining module 803, configured to determine similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
An error information obtaining module 804, configured to obtain error information of the wrong word relative to each candidate word.
And an eighth evaluation module 805, configured to determine, according to the edit distance, the similarity, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance and the similarity between each candidate word and the wrong word, and determines the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the edit distance, the similarity, and the error information. The edit distance information and the similarity are considered, and the user's writing habits are also taken into account, so that the reliability of the candidate word evaluation results can be improved.
In one embodiment, the eighth evaluation module 805 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the similarity, and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the similarity, and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 23, in twenty-two embodiments, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 901 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
A distance determining module 902, configured to determine an editing distance between each candidate word and the wrong word.
A second probability determining module 903, configured to replace the wrong word with each candidate word to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 904, configured to obtain error information of the wrong word relative to each candidate word.
And a ninth evaluation module 905, configured to determine, according to the edit distance, the evaluation probability, and the error information, an evaluation score corresponding to each candidate word.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, determines the edit distance between each candidate word and the wrong word, the evaluation probability of each candidate word at the position of the wrong word, and the error information of the wrong word relative to each candidate word; an evaluation score corresponding to each candidate word is then determined according to the edit distance, the evaluation probability, and the error information. The edit distance information and the context language environment are considered, and the user's writing habits are also taken into account, so that the reliability of the candidate word evaluation results can be improved.
In an embodiment, the second probability determining module 903 is specifically configured to calculate probabilities of respective positions of a candidate word and a neighboring word of the candidate word in a candidate sentence according to a preset language model, and use a log value of the probabilities as a language environment probability of each word; and averaging the language environment probability of the candidate words in the candidate sentences and the language environment probability of the adjacent words of the candidate words to obtain the evaluation probability of the candidate words in the candidate sentences.
In one embodiment, the ninth evaluation module 905 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the evaluation probability, and the first coefficient if the wrong word and the candidate word have the same initial letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the evaluation probability, and the second coefficient if the wrong word and the candidate word have different initial letters.
As shown in fig. 24, in twenty-third embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
the candidate word obtaining module 1001 is configured to detect a wrong word and obtain multiple candidate words corresponding to the wrong word.
The similarity determining module 1002 is configured to determine a similarity between each candidate word and the wrong word, where the similarity is obtained according to a longest common subsequence and/or a longest common substring of each candidate word and the wrong word.
A first probability determining module 1003, configured to determine a language environment probability of each candidate word at the wrong-word position.
An error information obtaining module 1004, configured to obtain error information of the wrong word relative to each candidate word.
A tenth evaluation module 1005, configured to determine an evaluation score corresponding to each candidate word according to the similarity, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, then determines the similarity between each candidate word and the wrong word, the language environment probability of each candidate word at the wrong-word position, and the error information of the wrong word relative to each candidate word, and finally determines an evaluation score for each candidate word from the similarity, the language environment probability, and the error information. Because the device considers not only the similarity and the contextual language environment but also the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 1003 is configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
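For illustration, the "preset language model" could be as simple as a smoothed bigram model (the embodiments elsewhere also permit N-Gram, LSTM, or BiLSTM models); the tiny corpus below is a made-up stand-in:

```python
import math
from collections import Counter

# Toy corpus standing in for the training data of the preset language model.
corpus = [["i", "will", "receive", "the", "letter"],
          ["we", "receive", "mail", "daily"]]

bigrams = Counter()
unigrams = Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))

def language_environment_prob(candidate: str, prev_word: str) -> float:
    """log P(candidate | prev_word) with add-one smoothing; the log value is
    used as the candidate word's language environment probability."""
    v = len(unigrams)  # vocabulary size
    p = (bigrams[(prev_word, candidate)] + 1) / (unigrams[prev_word] + v)
    return math.log(p)
```

A candidate that fits the context ("receive" after "will") gets a higher language environment probability than one that does not.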
In one embodiment, the tenth evaluation module 1005 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the language environment probability, and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the language environment probability, and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 25, in a twenty-fourth embodiment, a candidate word evaluation apparatus is provided, and the candidate word evaluation apparatus of this embodiment includes:
A candidate word obtaining module 1101, configured to detect a wrong word and obtain a plurality of candidate words corresponding to the wrong word.
A similarity determining module 1102, configured to determine the similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A second probability determining module 1103, configured to replace the wrong word with each candidate word, respectively, to obtain a candidate sentence, and determine an evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probability of a word adjacent to the candidate word.
An error information obtaining module 1104, configured to obtain error information of the wrong word relative to each candidate word.
An eleventh evaluation module 1105, configured to determine an evaluation score corresponding to each candidate word according to the similarity, the evaluation probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, then determines the similarity between each candidate word and the wrong word, the evaluation probability of each candidate word at the wrong-word position, and the error information of the wrong word relative to each candidate word, and finally determines an evaluation score for each candidate word from the similarity, the evaluation probability, and the error information. Because the device considers not only the similarity and the contextual language environment but also the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
In one embodiment, the second probability determining module 1103 is specifically configured to calculate, according to a preset language model, the probability of the candidate word and of each word adjacent to it at their respective positions in the candidate sentence, and to use the log value of each probability as that word's language environment probability; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its adjacent words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence.
In one embodiment, the eleventh evaluation module 1105 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the evaluation probability, and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the similarity, the evaluation probability, and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 26, in a twenty-fifth embodiment, a candidate word evaluation device is provided, and the candidate word evaluation device of this embodiment includes:
A candidate word obtaining module 1201, configured to detect a wrong word and obtain a plurality of candidate words corresponding to the wrong word.
A distance determining module 1202, configured to determine the edit distance between each candidate word and the wrong word.
A similarity determining module 1203, configured to determine the similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A first probability determination module 1204, configured to determine a language environment probability of each candidate word at the wrong-word position.
An error information obtaining module 1205 is configured to obtain error information of the wrong word relative to each candidate word.
A twelfth evaluation module 1206, configured to determine an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the language environment probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, then determines the edit distance and the similarity between each candidate word and the wrong word, the language environment probability of each candidate word at the wrong-word position, and the error information of the wrong word relative to each candidate word, and finally determines an evaluation score for each candidate word from the edit distance, the similarity, the language environment probability, and the error information. Because the device considers the edit distance, the similarity, and the contextual language environment, as well as the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the first probability determining module 1204 is specifically configured to calculate, according to a preset language model, a probability of each candidate word at the wrong word position, and use a log value of the probability as a language environment probability of the candidate word.
In one embodiment, the twelfth evaluation module 1206 includes: a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the similarity, the language environment probability, and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the edit distance, the similarity, the language environment probability, and a second coefficient if the wrong word and the candidate word have different first letters.
As shown in fig. 27, in a twenty-sixth embodiment, a candidate word evaluation device is provided, and the candidate word evaluation device of this embodiment includes:
A candidate word obtaining module 1301, configured to detect a wrong word and obtain a plurality of candidate words corresponding to the wrong word.
A distance determining module 1302, configured to determine the edit distance between each candidate word and the wrong word.
A similarity determining module 1303, configured to determine the similarity between each candidate word and the wrong word, where the similarity is obtained according to the longest common subsequence and/or the longest common substring of each candidate word and the wrong word.
A second probability determining module 1304, configured to replace the wrong word with each candidate word to obtain a candidate sentence, and determine the evaluation probability of the corresponding candidate word according to the candidate sentence, where the evaluation probability is obtained according to the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of the words adjacent to the candidate word.
An error information obtaining module 1305, configured to obtain error information of the wrong word relative to each candidate word.
A thirteenth evaluation module 1306, configured to determine an evaluation score corresponding to each candidate word according to the edit distance, the similarity, the evaluation probability, and the error information.
When a wrong word is detected, the candidate word evaluation device of this embodiment first acquires a plurality of corresponding candidate words, then determines the edit distance and the similarity between each candidate word and the wrong word, the evaluation probability of each candidate word at the wrong-word position, and the error information of the wrong word relative to each candidate word, and finally determines an evaluation score for each candidate word from the edit distance, the similarity, the evaluation probability, and the error information. Because the device considers the edit distance, the similarity, and the contextual language environment, as well as the user's habits when writing the word, the reliability of the candidate word evaluation result can be improved.
In an embodiment, the second probability determining module 1304 is specifically configured to calculate, according to a preset language model, the probability of the candidate word and of each word adjacent to it at their respective positions in the candidate sentence, and to use the log value of each probability as that word's language environment probability; the language environment probability of the candidate word in the candidate sentence and the language environment probabilities of its adjacent words are then averaged to obtain the evaluation probability of the candidate word in the candidate sentence.
In one embodiment, the thirteenth evaluation module 1306 is further configured to determine an evaluation score corresponding to each candidate word according to the reciprocal of the edit distance, the similarity, the reciprocal of the evaluation probability, and the error information.
In one embodiment, the thirteenth evaluation module 1306 includes: a first scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the similarity, the evaluation probability, and a first coefficient if the wrong word and the candidate word have the same first letter; and a second scoring submodule, configured to calculate the evaluation score corresponding to the candidate word according to the edit distance, the similarity, the evaluation probability, and a second coefficient if the wrong word and the candidate word have different first letters.
On the basis of the candidate word evaluation apparatus in any of the above embodiments, in an embodiment, the candidate word evaluation apparatus further includes: a candidate word determining module, configured to calculate the edit distance between the wrong word and each known word in a preset word bank, and select the known words whose edit distance is within a set range to obtain a plurality of candidate words corresponding to the wrong word.
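The candidate word determining module described above amounts to a Levenshtein-distance filter over the word bank. A minimal sketch follows; the threshold of 2 is an assumed value for the "set range":

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming (rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def candidates_for(wrong: str, word_bank, max_distance: int = 2):
    """Keep known words whose edit distance to the wrong word is in range."""
    return [w for w in word_bank if edit_distance(wrong, w) <= max_distance]
```

For example, "recieve" is within distance 2 of "receive" (one transposed pair) but distance 3 from "deceive".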
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a correction module, configured to determine an error correction word corresponding to the wrong word from the candidate words according to the evaluation scores, and correct the wrong word by using the error correction word. Optionally, the correction module is configured to determine, from the plurality of candidate words, the candidate word with the highest evaluation score as the error correction word corresponding to the wrong word.
In an embodiment, on the basis of the candidate word evaluation apparatus in any of the above embodiments, the candidate word evaluation apparatus further includes: a sorting module, configured to sort the candidate words according to the evaluation scores and display the sorted candidate words.
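The sorting module is a straightforward descending sort on the evaluation score; for example (the scores shown are arbitrary illustrative values):

```python
def ranked_candidates(scores):
    """Sort candidate words by evaluation score, highest first, for display."""
    return sorted(scores, key=scores.get, reverse=True)

order = ranked_candidates({"receive": 0.9, "deceive": 0.4, "relieve": 0.6})
```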
In an embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the steps of the candidate word evaluation method in any one of the above embodiments.
Compared with the traditional approach of evaluating candidate words only by edit distance, this computer device can evaluate the candidate words of a wrong word more effectively.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the candidate word evaluation method of any one of the above embodiments.
Compared with the traditional approach of evaluating candidate words only by edit distance, this computer-readable storage medium enables the candidate words of a wrong word to be evaluated more effectively.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The terms "comprises" and "comprising," as well as any variations thereof, in the embodiments herein are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or (module) units is not limited to those steps or units, but may include other steps or units not expressly listed or inherent to such a process, method, article, or apparatus.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
References herein to "first/second" merely distinguish similar objects and do not denote a particular ordering of the objects. It should be understood that "first/second" objects may be interchanged where permissible, so that the embodiments described herein can be practiced in sequences other than those illustrated or described herein.
The above-described embodiments express only several implementations of the present application, and although their description is specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (14)

1. A candidate word evaluation method, comprising:
detecting wrong words, and acquiring a plurality of candidate words corresponding to the wrong words;
determining the language environment probability of each candidate word at the wrong word position;
acquiring error information of the wrong word relative to each candidate word;
and determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
2. The candidate word evaluation method of claim 1, wherein the determining the language environment probability of each candidate word at the wrong word position comprises:
and calculating the probability of each candidate word at the wrong word position according to a preset language model, and taking the log value of the probability as the language environment probability of the candidate word.
3. The candidate word evaluation method according to claim 2, wherein determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information comprises:
determining an evaluation score corresponding to each candidate word according to the reciprocal of the language environment probability and the error information;
and/or,
the language model includes: an N-Gram model, a BiLSTM model, or an LSTM model.
4. The candidate word evaluation method according to any one of claims 1 to 3, wherein the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same first letter;
determining an evaluation score corresponding to each candidate word according to the language environment probability and the error information includes:
if the wrong word and the candidate word have the same first letter, calculating the evaluation score of the candidate word according to the language environment probability and a first coefficient;
and if the wrong word and the candidate word have different first letters, calculating the evaluation score of the candidate word according to the language environment probability and a second coefficient.
5. The candidate word evaluation method of claim 4, further comprising the steps of:
and determining that the word to be detected is not in a preset word bank, and determining that the word to be detected is a wrong word.
6. The candidate word evaluation method of claim 5, further comprising, after detecting the wrong word:
and calculating the editing distance between the wrong word and the known word in the word bank, and selecting the known word with the editing distance within a set range to obtain a plurality of candidate words corresponding to the wrong word.
7. The candidate word evaluation method according to any one of claims 1, 2, 3, 5, and 6, further comprising:
determining an error correction word corresponding to the wrong word from the candidate words according to the evaluation scores, and correcting the wrong word by using the error correction word;
and/or,
sorting the candidate words according to the evaluation scores, and displaying the sorted candidate words.
8. The candidate word evaluation method according to claim 7, wherein the determining an error correction word corresponding to the wrong word from the plurality of candidate words according to the evaluation scores includes:
and determining the candidate word with the highest evaluation score from the plurality of candidate words as the error correction word corresponding to the error word.
9. The candidate word evaluation method according to claim 2, wherein the evaluation score of each candidate word is calculated according to the following formula:
wherein word represents a candidate word, mx represents the language model, score represents the language environment probability of the candidate word, score_word represents the evaluation score corresponding to the candidate word, and K represents the error information of the wrong word relative to each candidate word; if the candidate word and the wrong word have the same first letter, K takes the value K1; otherwise, K takes the value K2, where K1 and K2 are preset values.
10. A candidate word evaluation apparatus, comprising:
the candidate word acquisition module is used for detecting wrong words and acquiring a plurality of candidate words corresponding to the wrong words;
the first probability determination module is used for determining the language environment probability of each candidate word at the wrong word position;
the error information acquisition module is used for acquiring error information of the wrong words relative to the candidate words;
and the third evaluation module is used for determining the evaluation score corresponding to each candidate word according to the language environment probability and the error information.
11. The candidate word evaluation apparatus of claim 10, wherein the first probability determination module is configured to calculate a probability of each candidate word at the wrong word position according to a preset language model, and use a log value of the probability as a language environment probability of the candidate word.
12. The candidate word evaluation apparatus of claim 11, wherein the error information of the wrong word relative to each candidate word includes: information on whether the wrong word and the candidate word have the same first letter;
the third evaluation module includes:
a first scoring submodule, configured to calculate the evaluation score of the candidate word according to the reciprocal of the language environment probability and a first coefficient if the wrong word and the candidate word have the same first letter;
and a second scoring submodule, configured to calculate the evaluation score of the candidate word according to the reciprocal of the language environment probability and a second coefficient if the wrong word and the candidate word have different first letters.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 1 to 9 are implemented when the program is executed by the processor.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 9.
CN201810320358.0A 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium Active CN108628826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810320358.0A CN108628826B (en) 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN108628826A true CN108628826A (en) 2018-10-09
CN108628826B CN108628826B (en) 2022-09-06

Family

ID=63705092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810320358.0A Active CN108628826B (en) 2018-04-11 2018-04-11 Candidate word evaluation method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108628826B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN105550173A (en) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 Text correction method and device
CN106202153A (en) * 2016-06-21 2016-12-07 广州智索信息科技有限公司 The spelling error correction method of a kind of ES search engine and system
WO2017067364A1 (en) * 2015-10-21 2017-04-27 北京瀚思安信科技有限公司 Method and equipment for determining common subsequence of text strings
CN106708893A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Error correction method and device for search query term
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text
CN107180084A (en) * 2017-05-05 2017-09-19 上海木爷机器人技术有限公司 Word library updating method and device
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zhang Yang: "Application of Spelling Correction Technology in Information Retrieval and Text Processing", China Master's Theses Full-text Database, Information Science and Technology Series *
Luo Gang, Zhang Zixian: "Principles and Technical Implementation of Natural Language Processing", 31 May 2016, Publishing House of Electronics Industry *
Zheng Wenxi et al.: "Algorithm Design and System Implementation of Automatic Spelling Correction", Science Technology and Industry *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651230A (en) * 2019-09-25 2021-04-13 亿度慧达教育科技(北京)有限公司 Fusion language model generation method and device, word error correction method and electronic equipment
CN110991173A (en) * 2019-11-29 2020-04-10 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN110991173B (en) * 2019-11-29 2023-09-29 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN112489655A (en) * 2020-11-18 2021-03-12 元梦人文智能国际有限公司 Method, system and storage medium for correcting error of speech recognition text in specific field
CN112489655B (en) * 2020-11-18 2024-04-19 上海元梦智能科技有限公司 Method, system and storage medium for correcting voice recognition text error in specific field

Also Published As

Publication number Publication date
CN108628826B (en) 2022-09-06

Similar Documents

Publication Publication Date Title
US9535896B2 (en) Systems and methods for language detection
US10699073B2 (en) Systems and methods for language detection
EP2206058A2 (en) Systems and methods for character correction in communication devices
CN108694167B (en) Candidate word evaluation method, candidate word ordering method and device
CN108628826B (en) Candidate word evaluation method and device, computer equipment and storage medium
US20210405765A1 (en) Method and Device for Input Prediction
Reffle et al. Unsupervised profiling of OCRed historical documents
Habib et al. An exploratory approach to find a novel metric based optimum language model for automatic bangla word prediction
CN108681533B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN113657098A (en) Text error correction method, device, equipment and storage medium
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN108595419B (en) Candidate word evaluation method, candidate word sorting method and device
CN108664466B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN112182337B (en) Method for identifying similar news from massive short news and related equipment
CN108647202B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108664467B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN117422064A (en) Search text error correction method, apparatus, computer device and storage medium
CN108628827A (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108681534A (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108681535B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108694166B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108733646B (en) Candidate word evaluation method and device, computer equipment and storage medium
CN108733645A (en) Candidate word evaluation method and device, computer equipment and storage medium
JP6303508B2 (en) Document analysis apparatus, document analysis system, document analysis method, and program
JP5450276B2 (en) Reading estimation device, reading estimation method, and reading estimation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant