CN108984515B - Wrongly written character detection method and device, computer readable storage medium and terminal equipment - Google Patents

Info

Publication number
CN108984515B
Authority
CN
China
Prior art keywords
characters
character
detected
probability
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810500257.1A
Other languages
Chinese (zh)
Other versions
CN108984515A (en)
Inventor
李贤�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810500257.1A
Publication of CN108984515A
Application granted
Publication of CN108984515B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Abstract

The invention discloses a method and a device for detecting wrongly written characters, a computer-readable storage medium and terminal equipment, wherein the method comprises the following steps: receiving a sentence to be detected; acquiring, according to each character in the sentence to be detected, the character probability of the character to be detected in the sentence and the character probabilities of K replacement characters of the character to be detected; arranging the character to be detected and each replacement character in descending order of character probability to obtain a character sequence; obtaining, according to the character sequence, a first threshold value K1, a second threshold value K2 and the arrangement order k of the character to be detected; and judging whether the character to be detected is a wrongly written character according to the first threshold value K1, the second threshold value K2 and the arrangement order k. By adopting the technical scheme of the invention, the accuracy of wrongly written character detection can be improved.

Description

Wrongly-written character detection method and device, computer-readable storage medium and terminal equipment
Technical Field
The invention relates to the technical field of text processing, in particular to a method and a device for detecting wrongly written characters, a computer readable storage medium and terminal equipment.
Background
Text is an important carrier for recording information. With the rapid development of information processing technology and internet technology, traditional text work is gradually being taken over by electronic equipment such as computers, and text is correspondingly produced by manual editing rather than manual writing. Misoperation can occur during manual editing, so that wrongly written characters appear in the edited text and lower its quality. Detecting wrongly written characters is therefore of great importance.
Most wrongly written character detection methods in the prior art process the characters input by a user after receiving them, calculate the probability of the processed characters, and compare that probability with a preset probability threshold. When the calculated probability is smaller than the preset probability threshold, the input character is determined to be a wrongly written character; when the calculated probability is not smaller than the preset probability threshold, the input character is determined to be a correct character. The prior art therefore usually judges the correctness of a character against a single probability threshold. This judging mode is simple and does not consider the context of the sentence in which the character is located, so the detection result may not be consistent with the actual situation and the detection accuracy is low.
Disclosure of Invention
The technical problem to be solved in the embodiments of the present invention is to provide a method and an apparatus for detecting a wrongly written word, a computer-readable storage medium, and a terminal device, which can improve the accuracy of detecting a wrongly written word.
In order to solve the above technical problem, an embodiment of the present invention provides a method for detecting a wrongly written word, including:
receiving a sentence to be detected;
acquiring, according to each character in the sentence to be detected, the character probability of the character to be detected in the sentence and the character probabilities of K replacement characters of the character to be detected; wherein K is more than or equal to 1;
arranging the character to be detected and each replacement character in descending order of character probability to obtain a character sequence;
obtaining, according to the character sequence, a first threshold value K1, a second threshold value K2 and the arrangement order k of the character to be detected; the first threshold value K1 meets a first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold value K2 meets a second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition;
and judging whether the character to be detected is a wrongly written character according to the first threshold value K1, the second threshold value K2 and the arrangement order k.
Compared with the prior art, the wrongly written character detection method provided by the embodiment of the invention obtains, according to each character in the sentence to be detected, the character probability of the character to be detected and the character probabilities of its replacement characters; sorts the character to be detected and the replacement characters by character probability; obtains, from the sorted character sequence, a first threshold value and a second threshold value that meet certain conditions together with the arrangement order of the character to be detected; and judges whether the character to be detected is a wrongly written character according to the first threshold value, the second threshold value and the arrangement order. Because the sentence context is taken into account when the character probabilities are calculated and the judgment is based on the rank of the character rather than on a single probability threshold, the accuracy of wrongly written character detection can be improved.
Further, the first threshold value K1 meeting the first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition specifically includes:
the first threshold value K1 is such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold value.
Further, the first probability threshold is 0.95.
Further, the second threshold value K2 meeting the second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition specifically includes:
the second threshold value K2 is such that the sum of the character probabilities of the first K2-1 characters in the character sequence is less than a preset second probability threshold value, and the sum of the character probabilities of the first K2 characters in the character sequence is greater than the second probability threshold value.
Further, the second probability threshold is 0.99.
Further, the judging whether the character to be detected is a wrongly written character according to the first threshold value K1, the second threshold value K2 and the arrangement order k specifically includes:
when it is determined that k is not less than K1 and not more than K2, obtaining a confusion set of the character to be detected; wherein the confusion set comprises M confusing characters, and M is more than or equal to 1;
and judging whether the character to be detected is a wrongly written character according to the sentence to be detected and each confusing character.
Further, the method further comprises:
when k is less than K1, judging the character to be detected to be a correct character;
and when k is more than K2, judging the character to be detected to be a wrongly written character.
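The three-way decision described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name and the returned labels are hypothetical, and the rank of the character to be detected is written as a lowercase `k` to keep it apart from the number K of replacement characters.

```python
def classify_by_rank(k: int, k1: int, k2: int) -> str:
    """Three-way decision on the 1-based rank k of the character to be detected.

    k1 and k2 are the first and second threshold values (K1 <= K2).
    The returned labels are illustrative names, not terms from the source.
    """
    if k < k1:
        # Rank falls inside the high-probability head of the sequence.
        return "correct"
    if k > k2:
        # Rank falls in the low-probability tail of the sequence.
        return "wrong"
    # K1 <= k <= K2: undecided, defer to the confusion-set check.
    return "check_confusion_set"
```

With K1 = 10 and K2 = 100, a rank of 50 is deferred to the confusion-set check, a rank of 3 is accepted as correct, and a rank of 150 is flagged as wrongly written.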
The embodiment of the invention also provides a device for detecting wrongly written characters, which comprises:
the sentence receiving module is used for receiving the sentences to be detected;
the character probability obtaining module is used for obtaining, according to each character in the sentence to be detected, the character probability of the character to be detected in the sentence and the character probabilities of K replacement characters of the character to be detected; wherein K is more than or equal to 1;
the character sequence obtaining module is used for arranging the character to be detected and each replacement character in descending order of character probability to obtain a character sequence;
a comparison threshold value obtaining module, configured to obtain a first threshold value K1, a second threshold value K2, and the arrangement order k of the character to be detected according to the character sequence; the first threshold value K1 meets a first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold value K2 meets a second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition; and
the character judging module is used for judging whether the character to be detected is a wrongly written character according to the first threshold value K1, the second threshold value K2 and the arrangement order k.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program; wherein the computer program, when running, controls the device on which the computer readable storage medium is located to execute any of the above-mentioned methods for detecting wrongly written words.
The embodiment of the invention also provides terminal equipment, which comprises at least one memory and at least one processor; wherein,
the memory includes at least one stored computer program;
the processor is configured to execute the computer program, and when executing the computer program, the processor implements the method for detecting a wrongly written word according to any one of the above descriptions.
Compared with the prior art, the embodiment of the invention provides a method and a device for detecting wrongly written characters, a computer-readable storage medium and a terminal device. The character probability of the character to be detected and the character probabilities of its replacement characters are obtained according to each character in the sentence to be detected; the character to be detected and the replacement characters are sorted by character probability; a first threshold value and a second threshold value that meet certain conditions, together with the arrangement order of the character to be detected, are obtained from the sorted character sequence; and whether the character to be detected is a wrongly written character is judged according to the first threshold value, the second threshold value and the arrangement order. Because the sentence context is taken into account when the character probabilities are calculated and the judgment is based on the rank of the character rather than on a single probability threshold, the accuracy of wrongly written character detection can be improved.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a method for detecting wrongly written words provided by the present invention;
FIG. 2 is a flowchart illustrating the step S15 of the method for detecting a wrongly written word according to a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a preferred embodiment of the device for detecting wrongly written words according to the present invention;
FIG. 4 is a block diagram of a preferred embodiment of the terminal device provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The embodiment of the invention provides a wrongly written character detection method.
Referring to fig. 1, it is a flowchart of a preferred embodiment of the method for detecting a wrongly written word according to the present invention, and includes steps S11 to S15:
step S11, receiving a sentence to be detected;
step S12, obtaining, according to each character in the sentence to be detected, the character probability of the character to be detected in the sentence and the character probabilities of K replacement characters of the character to be detected; wherein K is more than or equal to 1;
step S13, arranging the character to be detected and each replacement character in descending order of character probability to obtain a character sequence;
step S14, obtaining a first threshold value K1, a second threshold value K2 and the arrangement order k of the character to be detected according to the character sequence; the first threshold value K1 meets a first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold value K2 meets a second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition;
step S15, judging whether the character to be detected is a wrongly written character according to the first threshold value K1, the second threshold value K2 and the arrangement order k.
In this embodiment, when it is required to determine whether the character to be detected is a wrongly written character, the sentence in which the character is located, that is, the sentence to be detected, is obtained first. Then the character probability of the character to be detected and the character probabilities of its K replacement characters are calculated according to each character in the sentence to be detected; a replacement character is any character in a preset dictionary set other than the character to be detected. The character to be detected and the K replacement characters are arranged in descending order of the calculated character probability to obtain a sorted character sequence. Finally, a first threshold K1 meeting a preset first condition, a second threshold K2 meeting a preset second condition, and the arrangement order k of the character to be detected in the sorted character sequence are obtained from the sorted character sequence, and whether the character to be detected is a wrongly written character is judged according to K1, K2 and k.
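The sorting step and the arrangement order of the character to be detected can be sketched as follows. This is only an illustration under assumptions: the function name is hypothetical, the probabilities are made-up values, and the rank is taken to be 1-based.

```python
from typing import Dict, List, Tuple


def rank_characters(probs: Dict[str, float], target: str) -> Tuple[List[str], int]:
    """Sort characters by descending probability and locate the target.

    probs maps the character to be detected and each replacement character
    to its character probability. Returns the sorted character sequence and
    the 1-based arrangement order of the target within it.
    """
    sequence = sorted(probs, key=probs.get, reverse=True)
    k = sequence.index(target) + 1  # 1-based rank (assumption)
    return sequence, k


# Made-up probabilities for four candidate characters at one position.
example = {"华": 0.90, "化": 0.02, "花": 0.05, "话": 0.03}
sequence, k = rank_characters(example, "化")
```

Here the character "化" has the lowest of the four illustrative probabilities, so its arrangement order is 4.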
It should be noted that the character probability of the character to be detected and the character probabilities of the replacement characters are calculated by a Bi-directional Long Short-Term Memory (Bi-LSTM) model followed by a softmax algorithm, the Bi-LSTM model being implemented and trained with a deep learning framework (e.g., TensorFlow).
Specifically, the last layer of the Bi-LSTM model is connected with a softmax layer, the sentence to be detected is input into the Bi-LSTM model, the softmax layer calls a softmax algorithm to calculate, and the character probability of the character to be detected can be calculated according to each character in the sentence to be detected; when the character probability of the replacement characters is calculated, the characters to be detected in the sentences to be detected need to be replaced by the replacement characters in sequence, so that corresponding replacement sentences are obtained, then each replacement sentence is input into the trained Bi-LSTM model, and the character probability of each replacement character can be calculated according to each character in each replacement sentence by calling a softmax algorithm through a softmax layer.
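The replace-and-rescore procedure for replacement characters can be sketched as follows. The sketch does not implement the Bi-LSTM model: `char_prob` is a hypothetical callable standing in for "run the trained model with its softmax layer on a sentence and return the probability of the character at a given position", and the toy scorer and sample sentence are made up for illustration.

```python
from typing import Callable, Dict, Iterable


def replacement_probabilities(
    sentence: str,
    pos: int,
    candidates: Iterable[str],
    char_prob: Callable[[str, int], float],
) -> Dict[str, float]:
    """Score each candidate by substituting it at position pos.

    char_prob(sentence, pos) is a placeholder for the trained model;
    it must return the probability of the character at pos in context.
    """
    probs = {}
    for ch in candidates:
        # Build the replacement sentence for this candidate.
        replaced = sentence[:pos] + ch + sentence[pos + 1:]
        probs[ch] = char_prob(replaced, pos)
    return probs


# Toy stand-in for the model: a fixed lookup table (illustrative only).
_TOY_TABLE = {"华": 0.90, "化": 0.02}


def toy_char_prob(sentence: str, pos: int) -> float:
    return _TOY_TABLE.get(sentence[pos], 0.0)


scores = replacement_probabilities("中化人民", 1, ["华", "化"], toy_char_prob)
```

Passing the scorer in as a parameter keeps the substitution logic independent of how the model is built or loaded.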
According to the wrongly written character detection method provided by the embodiment of the invention, the character probability of the character to be detected and the character probabilities of its replacement characters are obtained according to each character in the sentence to be detected, and the first threshold value and the second threshold value that meet certain conditions, together with the arrangement order of the character to be detected, are obtained from the character sequence sorted by character probability. Whether the character to be detected is a wrongly written character is then judged according to the first threshold value, the second threshold value and the arrangement order. The sentence context is thus taken into account when the character probabilities are calculated, and detection relies on a rank-based judgment over the sorted character probabilities. This addresses the problems of the prior art, in which the judging mode is simple, the sentence context is not considered and the detection accuracy is low, and thereby improves the accuracy of wrongly written character detection.
In addition, when the trained Bi-LSTM model is adopted to calculate the character probability of the characters to be detected and the character probability of the replacement characters, calculation is carried out according to the information of each character in the sentence where the characters to be detected and the replacement characters are located, not only can forward information of the characters be utilized, but also backward information of the characters can be utilized, and therefore the calculated character probability is more accurate.
In another preferred embodiment, the first threshold K1 meeting the first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition specifically includes:
the first threshold value K1 is such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold value.
This embodiment describes how the first threshold K1 is obtained from the sorted character sequence. Specifically, the sorted character sequence is traversed; when a position K1 is found such that the sum of the character probabilities of the first K1-1 characters is smaller than the preset first probability threshold and the sum of the character probabilities of the first K1 characters is greater than the first probability threshold, that value of K1 is taken as the first threshold K1.
In yet another preferred embodiment, the first probability threshold is 0.95.
It should be noted that 0.95 in this embodiment is a preferred value of the first probability threshold, that is, when the sum of the probabilities of the top K1-1 characters in the sorted character sequence is less than 0.95 and the sum of the probabilities of the top K1 characters is greater than 0.95, K1 at this time is the first threshold K1 meeting the preset first condition.
In another preferred embodiment, the second threshold K2 meeting the second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition specifically includes:
the second threshold value K2 is such that the sum of the character probabilities of the first K2-1 characters in the character sequence is less than a preset second probability threshold value, and the sum of the character probabilities of the first K2 characters in the character sequence is greater than the second probability threshold value.
This embodiment describes how the second threshold K2 is obtained from the sorted character sequence. Specifically, the sorted character sequence is traversed; when a position K2 is found such that the sum of the character probabilities of the first K2-1 characters is smaller than the preset second probability threshold and the sum of the character probabilities of the first K2 characters is greater than the second probability threshold, that value of K2 is taken as the second threshold K2.
In yet another preferred embodiment, the second probability threshold is 0.99.
It should be noted that 0.99 in this embodiment is a preferred value of the second probability threshold, that is, when the sum of the probabilities of the top K2-1 characters in the sorted character sequence is less than 0.99 and the sum of the probabilities of the top K2 characters is greater than 0.99, K2 at this time is the second threshold K2 meeting the preset second condition.
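The traversal that yields K1 and K2 can be sketched as a cumulative sum over the sorted character probabilities. The function name and the sample probabilities are hypothetical. The fallback of returning the sequence length when the cumulative sum never exceeds the threshold is an assumption, since the text does not cover that case, as is treating a sum exactly equal to the threshold as not yet exceeding it.

```python
from itertools import accumulate
from typing import Sequence


def rank_threshold(sorted_probs: Sequence[float], prob_threshold: float) -> int:
    """Return the smallest 1-based rank whose cumulative probability
    exceeds prob_threshold.

    sorted_probs must already be in descending order. If the cumulative
    sum never exceeds the threshold (for example because of rounding),
    the sequence length is returned; this fallback is an assumption.
    """
    for rank, cumulative in enumerate(accumulate(sorted_probs), start=1):
        if cumulative > prob_threshold:
            return rank
    return len(sorted_probs)


# Made-up sorted probabilities; cumulative sums are 0.6, 0.9, 0.96, 0.995, 1.0.
probs = [0.6, 0.3, 0.06, 0.035, 0.005]
k1 = rank_threshold(probs, 0.95)  # first probability threshold
k2 = rank_threshold(probs, 0.99)  # second probability threshold
```

For these values K1 is 3 and K2 is 4. Because the second probability threshold is the larger one, K2 is never smaller than K1.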
Referring to FIG. 2, which is a detailed flowchart of a preferred embodiment of step S15 of the method for detecting a wrongly written character provided by the present invention, the judging whether the character to be detected is a wrongly written character according to the first threshold K1, the second threshold K2 and the arrangement order k specifically includes steps S1501 to S1502:
step S1501, when it is determined that k is not less than K1 and not more than K2, obtaining a confusion set of the character to be detected; the confusion set comprises M confusing characters, wherein M is more than or equal to 1;
step S1502, judging whether the character to be detected is a wrongly written character according to the sentence to be detected and each confusing character.
In this embodiment, when judging whether the character to be detected is a wrongly written character according to the first threshold K1, the second threshold K2 and the arrangement order k of the character to be detected, the values of K1, K2 and k are compared first. When k is not less than K1 and not more than K2, a confusion set of the character to be detected is obtained, and whether the character to be detected is a wrongly written character is further judged according to the sentence to be detected and the M confusing characters in the confusion set.
It should be noted that a confusing character is a character in the preset dictionary set that has the same pinyin as the character to be detected, and the confusion set is the set composed of all such confusing characters.
Specifically, the character probability of each confusing character is first calculated according to each character in the sentence to be detected, and the confusing characters are arranged in descending order of character probability; the N confusing characters with the largest character probabilities are selected as target confusing characters (N can be chosen according to actual needs). The character to be detected in the sentence to be detected is then replaced with each target confusing character in turn to obtain N corresponding target confusing sentences. The character probability of each character in the sentence to be detected and in each of the N target confusing sentences is obtained, from which the sentence probability of the sentence to be detected and the sentence probabilities of the N target confusing sentences are derived. Finally, whether the character to be detected is a wrongly written character is further judged according to the sentence probability of the sentence to be detected and the sentence probabilities of the N target confusing sentences.
When the character probability of a confusing character is calculated, the character to be detected in the sentence to be detected is replaced with each confusing character in turn to obtain the corresponding confusing sentences; each confusing sentence is then input into the trained Bi-LSTM model, and the softmax layer calls the softmax algorithm to calculate the character probability of each confusing character according to each character in the corresponding confusing sentence.
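Selecting the N target confusing characters and building the target confusing sentences can be sketched as follows. The function name is hypothetical, and the confusion set and its probabilities are made up for illustration; in practice the probabilities would come from the trained model.

```python
from typing import Dict, List


def target_confusing_sentences(
    sentence: str,
    pos: int,
    confusion_probs: Dict[str, float],
    n: int,
) -> List[str]:
    """Keep the n most probable confusing characters and substitute each
    one at position pos, returning the sentences in descending order of
    the confusing character's probability.
    """
    top = sorted(confusion_probs, key=confusion_probs.get, reverse=True)[:n]
    return [sentence[:pos] + ch + sentence[pos + 1:] for ch in top]


# Illustrative confusion set for the character at index 1 (made-up values).
confusion = {"华": 0.70, "话": 0.10, "画": 0.15, "划": 0.05}
candidates = target_confusing_sentences("中化人民", 1, confusion, n=2)
```

With N = 2, only the two most probable confusing characters are kept, which limits how many full sentences have to be scored in the next step.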
When calculating the sentence probability of any sentence A, sentence A is first input into the trained Bi-LSTM model, and the softmax layer calls the softmax algorithm to calculate the character probabilities $P_{ai}$ ($i = 1, \dots, B$) of the B characters in sentence A; the calculation is then performed according to the preset sentence probability formula
$$P_A = \sum_{i=1}^{B} \log P_{ai}$$
so as to obtain the sentence probability of sentence A.
It should be noted that taking the sum of the logarithms of the character probabilities of all characters in a sentence as the sentence probability reduces the amount of calculation and avoids the loss of precision (underflow) that occurs when very small numbers are multiplied.
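The effect of summing logarithms instead of multiplying probabilities can be illustrated with a short sketch. The function name is hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math
from typing import Iterable


def sentence_log_prob(char_probs: Iterable[float]) -> float:
    """Sentence probability as the sum of log character probabilities.

    The natural logarithm is used here; the base is an assumption.
    """
    return sum(math.log(p) for p in char_probs)


# 400 characters, each with probability 0.001 (made-up values).
tiny = [1e-3] * 400
product = math.prod(tiny)            # underflows to 0.0 in double precision
log_score = sentence_log_prob(tiny)  # stays finite
```

The direct product of these values is smaller than the smallest positive double and collapses to zero, whereas the log-domain score remains a finite negative number that can still be compared across sentences.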
Take the sentence to be detected "中化人民" as an example, in which "化" is the character to be detected. Assuming that the preset dictionary set contains 7000 characters (including the character to be detected), the character probabilities of the 7000 characters are calculated respectively, and the 7000 characters are arranged in descending order of character probability to obtain the corresponding character sequence.
A first threshold value K1, a second threshold value K2 and the arrangement order k of the character "化" are obtained from the character sequence and compared. Assuming that K1 is 10, K2 is 100 and the arrangement order k of "化" is 50, so that K1 ≤ k ≤ K2, a confusion set of the character "化" is obtained. Assuming that 15 characters in the preset dictionary set have the same pinyin as "化", the character probabilities of these 15 confusing characters are calculated respectively and the 15 confusing characters are arranged in descending order of character probability. The confusing characters ranked in the top 5 are selected as target confusing characters, and the character "化" in the sentence to be detected "中化人民" is replaced with each of the 5 target confusing characters in turn to obtain the corresponding 5 target confusing sentences.
The character probability of each character in the sentence to be detected "中化人民" is calculated respectively: P_a1 = P_中 = p1, P_a2 = P_化 = p2, P_a3 = P_人 = p3 and P_a4 = P_民 = p4. The sentence probability of the sentence to be detected "中化人民" is then P_中化人民 = log p1 + log p2 + log p3 + log p4. The sentence probabilities of the 5 target confusing sentences are obtained in the same way, and further processing is finally performed according to the sentence probability of the sentence to be detected and the sentence probabilities of the 5 target confusing sentences.
One method for further judging whether the words to be detected are wrongly written words or not according to the sentence probability of the sentences to be detected and the sentence probabilities of the N target confusing sentences is specifically as follows:
arranging the sentence to be detected and the N target confusing sentences in descending order of sentence probability, thereby obtaining the arrangement order of the sentence to be detected, recorded as n; calculating the value of N1 according to the formula N1 = (N + 1) × 7%; when n is less than N1, the sentence probability of the sentence to be detected ranks within the top 7% of all the sentence probabilities, so the character to be detected is judged to be a correct character; when n is greater than or equal to N1, the sentence probability of the sentence to be detected ranks within the last 93% of all the sentence probabilities, so the character to be detected is judged to be a wrongly written character.
When N1 is calculated according to the formula and the result is not an integer, the smallest integer greater than the result is taken as the value of N1.
For example, the sentence probabilities of the 6 sentences consisting of the sentence to be detected "中化人民" and the 5 target confusing sentences are arranged in descending order. If the sentence probability of the sentence to be detected "中化人民" ranks within the top 7% of the 6 sentence probabilities, the character "化" is considered a correct character; if it ranks within the last 93% of the 6 sentence probabilities, the character "化" is considered a wrongly written character.
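The percentile rule can be sketched as follows, under a literal reading of the formula and of the rounding rule. The function names are hypothetical, and integer arithmetic is used for the ceiling to avoid floating-point rounding surprises.

```python
def percentile_cutoff(num_confusing: int, percent: int = 7) -> int:
    """N1 = (N + 1) * percent / 100, rounded up when not an integer.

    num_confusing is N, the number of target confusing sentences.
    Integer ceiling division keeps exact multiples unchanged.
    """
    total = num_confusing + 1  # sentence to be detected plus N candidates
    return -(-(total * percent) // 100)


def is_correct_by_percentile(n: int, num_confusing: int, percent: int = 7) -> bool:
    """True when the 1-based sentence rank n is strictly below N1."""
    return n < percentile_cutoff(num_confusing, percent)
```

Read literally, N = 5 gives N1 = 1, so no rank can satisfy n < N1 and the sentence to be detected is never accepted. Under this reading the rule first accepts a top-ranked sentence when N is at least 14, where N1 becomes 2. Whether that is the intended behaviour is not stated in the text.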
Another method for further judging whether the character to be detected is a wrongly written character, according to the sentence probability of the sentence to be detected and the sentence probabilities of the N target confusing sentences, is specifically as follows:
arranging the sentence to be detected and the N target confusing sentences in descending order of sentence probability, thereby obtaining the rank of the sentence to be detected, recorded as n; when n = 1, the sentence probability of the sentence to be detected is the largest of all the sentence probabilities, so the character to be detected is judged to be a correct character; when n ≠ 1, the sentence probability of the sentence to be detected is not the largest of all the sentence probabilities, so the character to be detected is judged to be a wrongly written character.
For example, the sentence probabilities of 6 sentences, namely the sentence to be detected "the Chinese style people" and 5 target confusing sentences, are arranged in descending order; if the sentence probability of the sentence to be detected "the Chinese style people" is the largest of the 6 sentence probabilities, the Chinese style character is considered a correct character; if it is not the largest of the 6 sentence probabilities, the Chinese style character is considered a wrongly written character.
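A minimal Python sketch of this second rule is given below. The function names are hypothetical, and treating an exact tie for the largest probability as rank 1 is an assumption the text does not address.

```python
def sentence_rank(p_detected, p_confusing):
    """Rank n of the sentence to be detected (1 = largest probability)."""
    # Assumption: only strictly larger probabilities push the rank down.
    return 1 + sum(1 for p in p_confusing if p > p_detected)


def judge_by_top_rank(p_detected, p_confusing):
    """Return True (correct character) only when n == 1."""
    return sentence_rank(p_detected, p_confusing) == 1
```

In the 6-sentence example, the character is judged correct only if the sentence to be detected outranks all 5 target confusing sentences.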
In the method for detecting wrongly written characters provided by this embodiment of the invention, when the first threshold, the second threshold and the arrangement order of the character to be detected satisfy a certain condition, a confusion set of the character to be detected is obtained; target confusing sentences are then obtained from the confusing characters in the confusion set; and whether the character to be detected is a wrongly written character is judged according to the sentence probability of the sentence to be detected and the sentence probabilities of the target confusing sentences. The sentence context of the character is thus taken into account when the character probability is calculated, and the confusion-set judgment is applied as a further check, which further improves the accuracy of wrongly written character detection.
In yet another preferred embodiment, the method further comprises:
when K is less than K1, judging the characters to be detected as correct characters;
and when K is more than K2, judging the characters to be detected as wrongly written characters.
This embodiment covers the two remaining cases of judging whether the character to be detected is a wrongly written character according to the first threshold K1, the second threshold K2 and the arrangement order K of the character to be detected. As before, K1, K2 and the arrangement order K are compared first: if K < K1, the character to be detected is judged to be a correct character; if K > K2, the character to be detected is judged to be a wrongly written character.
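Taken together with the confusion-set case (K1 ≤ K ≤ K2, as recited in claim 4), the three-way decision can be sketched as follows. The function name, the argument names and the returned labels are illustrative assumptions, not part of the described method.

```python
def classify_by_rank(rank, k1, k2):
    """Three-way decision on the arrangement order of the character.

    rank -- arrangement order K of the character to be detected (1-based)
    k1   -- first threshold K1
    k2   -- second threshold K2
    Labels are hypothetical names chosen for this sketch.
    """
    if rank < k1:
        return "correct"
    if rank > k2:
        return "wrong"
    # K1 <= K <= K2: defer to the confusion-set judgment.
    return "check_confusion_set"
```

For instance, with K1 = 4 and K2 = 5, a character ranked 2nd is accepted outright, one ranked 6th is flagged outright, and one ranked 4th or 5th goes on to the confusion-set check.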
An embodiment of the present invention further provides a device for detecting wrongly written characters, which can implement all the processes of the method for detecting wrongly written characters in any of the above embodiments; the functions and technical effects of the modules and units in the device are the same as those of the method for detecting wrongly written characters in the above embodiments and are not described again here.
Referring to fig. 3, a block diagram of a preferred embodiment of the device for detecting a wrongly written word according to the present invention is shown, including:
a sentence receiving module 11, configured to receive a sentence to be detected;
a character probability obtaining module 12, configured to obtain, for each character in the sentence to be detected, the character probability of the character to be detected in the sentence to be detected and the character probabilities of K replacement characters of the character to be detected; wherein K is more than or equal to 1;
the character sequence acquisition module 13 is configured to arrange the characters to be detected and each of the replacement characters in a sequence from a large probability to a small probability to obtain a character sequence;
a comparison threshold obtaining module 14, configured to obtain a first threshold K1, a second threshold K2, and an arrangement order K of the character to be detected according to the character sequence; wherein the first threshold K1 is such that the character probabilities of the first K1 characters in the character sequence satisfy a preset first condition, and the second threshold K2 is such that the character probabilities of the first K2 characters in the character sequence satisfy a preset second condition;
and the character judging module 15 is configured to judge whether the characters to be detected are wrongly written characters according to the first threshold K1, the second threshold K2 and the arrangement order K.
Preferably, the condition that the character probabilities of the first K1 characters in the character sequence satisfy a preset first condition specifically includes:
the first threshold K1 is such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold.
Preferably, the first probability threshold is 0.95.
Preferably, the condition that the character probabilities of the first K2 characters in the character sequence satisfy a preset second condition specifically includes:
the second threshold K2 is such that the sum of the character probabilities of the first K2-1 characters in the character sequence is less than a preset second probability threshold, and the sum of the character probabilities of the first K2 characters in the character sequence is greater than the second probability threshold.
Preferably, the second probability threshold is 0.99.
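The two threshold conditions amount to finding the point at which the running sum of the sorted character probabilities first exceeds 0.95 and 0.99 respectively. A minimal Python sketch follows; the function name is hypothetical, the example probabilities are made up for illustration, and the fallback when the sum never exceeds the threshold is an assumption not covered by the text.

```python
from itertools import accumulate


def cumulative_threshold(sorted_probs, prob_threshold):
    """Smallest count K whose first-K probability sum exceeds prob_threshold.

    sorted_probs must already be in descending order (the character sequence).
    """
    for count, running_sum in enumerate(accumulate(sorted_probs), start=1):
        if running_sum > prob_threshold:
            return count
    # Assumption: if the threshold is never exceeded, use the whole sequence.
    return len(sorted_probs)


# Made-up character probabilities, sorted from large to small.
probs = [0.5, 0.3, 0.1, 0.06, 0.035, 0.005]
k1 = cumulative_threshold(probs, 0.95)  # first probability threshold
k2 = cumulative_threshold(probs, 0.99)  # second probability threshold
```

Because the sequence is scanned from the largest probability downward, the first count at which the sum exceeds the threshold automatically satisfies the "first K-1 below, first K above" condition.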
Preferably, the text determination module 15 specifically includes:
the confusion set acquisition unit is used for acquiring the confusion set of the character to be detected when it is judged that K1 ≤ K ≤ K2; wherein the confusion set comprises M confusing characters, and M is more than or equal to 1;
and the character judging unit is used for judging whether the character to be detected is a wrongly written character according to the sentence to be detected and each confusing character.
Preferably, the apparatus further comprises:
the first character judgment module is used for judging the characters to be detected to be correct characters when K is less than K1;
and the second character judgment module is used for judging that the characters to be detected are wrongly written characters when K is larger than K2.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program; wherein, when running, the computer program controls the device on which the computer-readable storage medium is located to execute the method for detecting wrongly written words according to any of the above embodiments.
The embodiment of the invention also provides the terminal equipment.
Referring to fig. 4, which is a block diagram of a preferred embodiment of the terminal device provided in the present invention, the terminal device includes at least one memory 10 and at least one processor 20; wherein
the memory 10 includes at least one stored computer program;
the processor 20 is configured to execute the computer program, and the processor 20 implements the method for detecting a wrongly written word according to any of the above embodiments when executing the computer program.
It should be noted that fig. 4 only illustrates an example in which one memory and one processor in the terminal device are connected, in some specific embodiments, the terminal device may further include a plurality of memories and/or a plurality of processors, and the specific number and the connection mode of the memories and the processors may be set and adaptively adjusted according to actual needs.
To sum up, the method, the device, the computer-readable storage medium and the terminal device for detecting wrongly written characters provided in the embodiments of the present invention obtain, for each character in a sentence to be detected, the character probability of the character to be detected and the character probabilities of its replacement characters; obtain, from the character sequence sorted by character probability, a first threshold and a second threshold that satisfy certain conditions, together with the arrangement order of the character to be detected; and judge whether the character to be detected is a wrongly written character according to the first threshold, the second threshold and the arrangement order. When the first threshold, the second threshold and the arrangement order satisfy a certain condition, a confusion set of the character to be detected is obtained, and whether the character is a wrongly written character is judged according to the sentence to be detected and the confusion set. The sentence context of the character is thus taken into account when the character probability is calculated, and a strategy combining probability-ranking judgment with confusion-set judgment replaces the simple strategy of judging against a single probability threshold, so that the judgment of wrongly written characters is more flexible and effective and the accuracy of wrongly written character detection is greatly improved.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for detecting a wrongly written word, comprising:
receiving a sentence to be detected;
acquiring the character probability of the characters to be detected in the sentences to be detected and the character probability of K replacement characters of the characters to be detected according to each character in the sentences to be detected; wherein K is more than or equal to 1;
arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence;
according to the character sequence, obtaining a first threshold value K1, a second threshold value K2 and an arrangement order K of the characters to be detected; wherein the first threshold value K1 is such that the character probabilities of the first K1 characters in the character sequence satisfy a preset first condition, and the second threshold value K2 is such that the character probabilities of the first K2 characters in the character sequence satisfy a preset second condition;
judging whether the characters to be detected are wrongly written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K;
the character probability of the characters to be detected and the character probability of each replaced character are calculated by adopting a Bi-LSTM model and calling a softmax algorithm;
the first threshold value K1 is such that the sum of the character probabilities of the first K1-1 characters in the character sequence is smaller than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is larger than the first probability threshold value;
the second threshold value K2 is such that the sum of the character probabilities of the first K2-1 characters in the character sequence is smaller than a preset second probability threshold value, and the sum of the character probabilities of the first K2 characters in the character sequence is larger than the second probability threshold value.
2. The method of detecting wrongly written words as recited in claim 1, wherein the first probability threshold is 0.95.
3. The method of detecting wrongly written words as recited in claim 1, wherein the second probability threshold is 0.99.
4. The method of detecting wrongly written words as recited in claim 1, wherein the judging whether the characters to be detected are wrongly written characters according to the first threshold value K1, the second threshold value K2 and the arrangement order K specifically comprises:
when it is judged that K is larger than or equal to K1 and smaller than or equal to K2, obtaining a confusion set of the characters to be detected; wherein the confusion set comprises M confusing characters, and M is more than or equal to 1;
and judging whether the characters to be detected are wrongly written characters according to the sentence to be detected and each confusing character.
5. The method of detecting wrongly written words as recited in claim 1, wherein the method further comprises:
when K is less than K1, judging the characters to be detected as correct characters;
and when K is larger than K2, judging the characters to be detected as wrongly written characters.
6. A wrongly written character detecting apparatus, comprising:
the sentence receiving module is used for receiving the sentences to be detected;
the character probability obtaining module is used for obtaining the character probability of the characters to be detected in the sentences to be detected and the character probability of K replacement characters of the characters to be detected according to each character in the sentences to be detected; wherein K is more than or equal to 1;
the character sequence acquisition module is used for arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence;
a comparison threshold value obtaining module, configured to obtain a first threshold value K1, a second threshold value K2, and an arrangement order K of the characters to be detected according to the character sequence; wherein the first threshold value K1 is such that the character probabilities of the first K1 characters in the character sequence satisfy a preset first condition, and the second threshold value K2 is such that the character probabilities of the first K2 characters in the character sequence satisfy a preset second condition; and
the character judging module is used for judging whether the characters to be detected are wrongly-written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K;
the character probability of the characters to be detected and the character probability of each replaced character are calculated by adopting a Bi-LSTM model and calling a softmax algorithm;
the first threshold value K1 is such that the sum of the character probabilities of the first K1-1 characters in the character sequence is smaller than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is larger than the first probability threshold value;
the second threshold value K2 is such that the sum of the character probabilities of the first K2-1 characters in the character sequence is smaller than a preset second probability threshold value, and the sum of the character probabilities of the first K2 characters in the character sequence is larger than the second probability threshold value.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program; wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method of detecting wrongly-written words as claimed in any one of claims 1 to 5.
8. A terminal device comprising at least one memory and at least one processor; wherein
the memory includes at least one stored computer program;
the processor is configured to execute the computer program, and when executing the computer program, the processor implements the method of detecting wrongly-written words as claimed in any of claims 1 to 5.
CN201810500257.1A 2018-05-22 2018-05-22 Wrongly written character detection method and device, computer readable storage medium and terminal equipment Active CN108984515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810500257.1A CN108984515B (en) 2018-05-22 2018-05-22 Wrongly written character detection method and device, computer readable storage medium and terminal equipment

Publications (2)

Publication Number Publication Date
CN108984515A CN108984515A (en) 2018-12-11
CN108984515B true CN108984515B (en) 2022-09-06

Family

ID=64542579

Country Status (1)

Country Link
CN (1) CN108984515B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380839A (en) * 2020-11-13 2021-02-19 沈阳东软智能医疗科技研究院有限公司 Wrongly written character detection method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11134335A (en) * 1997-10-30 1999-05-21 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Character error calibrating device
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
CN105159871A (en) * 2015-08-21 2015-12-16 小米科技有限责任公司 Text information detection method and apparatus
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics
CN106484132A (en) * 2015-09-02 2017-03-08 北京搜狗科技发展有限公司 A kind of input error correction method and input subtraction unit
CN106815191A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Correct the determination method and apparatus of word

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2388940A (en) * 2002-05-22 2003-11-26 Sharp Kk Method and apparatus for the correction or improvement of word usage
CN104899450A (en) * 2015-06-11 2015-09-09 福州大学 Method for judging tuple linguistic model through comprehensive evaluation of target threat level
US11222253B2 (en) * 2016-11-03 2022-01-11 Salesforce.Com, Inc. Deep neural network model for processing data through multiple linguistic task hierarchies
CN106649269A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Extraction method and device of colloquial sentence
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN107609098B (en) * 2017-09-11 2019-02-01 北京金堤科技有限公司 Searching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Alibaba at IJCNLP-2017 Task 1:Embedding grammatical features into LSTMs for Chinese grammatical error diagnosis task;Yang Y. 等;《Proceedings of the 8th International Joint Conference on Natural Language Processing 2017》;20171201;41-46 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant