CN108984515A - Wrongly written character detection method and device, computer readable storage medium and terminal equipment - Google Patents

Info

Publication number
CN108984515A
Authority
CN
China
Prior art keywords
characters
detected
character
probability
threshold value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810500257.1A
Other languages
Chinese (zh)
Other versions
CN108984515B (en
Inventor
李贤�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810500257.1A
Publication of CN108984515A
Application granted
Publication of CN108984515B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method and a device for detecting wrongly written characters, a computer readable storage medium and terminal equipment, wherein the method comprises the following steps: receiving a sentence to be detected; acquiring the character probability of the characters to be detected in the sentences to be detected and the character probability of K replacement characters of the characters to be detected according to each character in the sentences to be detected; arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence; according to the character sequence, obtaining a first threshold value K1, a second threshold value K2 and an arrangement order K of the characters to be detected; and judging whether the characters to be detected are wrongly written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K. By adopting the technical scheme of the invention, the accuracy of wrongly-written character detection can be improved.

Description

Wrongly written character detection method and device, computer readable storage medium and terminal equipment
Technical Field
The invention relates to the technical field of text processing, in particular to a method and a device for detecting wrongly written characters, a computer readable storage medium and terminal equipment.
Background
Text is an important carrier for recording information. With the rapid development of information processing technology and internet technology, traditional text work has gradually moved to electronic equipment such as computers, and text is correspondingly produced by manual editing rather than manual writing. Manual editing is prone to misoperation, so wrongly written characters can appear in the edited text and degrade its quality. Detecting wrongly written characters is therefore of great significance.
Most wrongly written character detection methods in the prior art process the characters input by a user after receiving them, calculate the probability of the processed characters, and compare that probability with a preset probability threshold. The characters are judged to be wrongly written when the calculated probability is smaller than the preset probability threshold, and correct when it is not smaller than the threshold. In other words, once the character probability has been calculated, correctness is usually judged against a single probability threshold. This judgment is simplistic and does not consider the context of the sentence in which the characters are located, so the detection result may not match the actual situation and the detection accuracy is low.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a method and an apparatus for detecting wrongly written characters, a computer-readable storage medium, and a terminal device, which can improve the accuracy of detecting wrongly written characters.
In order to solve the above technical problem, an embodiment of the present invention provides a method for detecting a wrongly written word, including:
receiving a sentence to be detected;
acquiring the character probability of the characters to be detected in the sentences to be detected and the character probability of K replacement characters of the characters to be detected according to each character in the sentences to be detected; wherein K is more than or equal to 1;
arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence;
according to the character sequence, obtaining a first threshold value K1, a second threshold value K2 and an arrangement order K of the characters to be detected; the first threshold value K1 satisfies that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold value K2 satisfies that the character probabilities of the first K2 characters in the character sequence meet a preset second condition;
and judging whether the characters to be detected are wrongly written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K.
Compared with the prior art, the embodiment of the invention provides a wrongly-written character detection method, which includes the steps of obtaining character probability of characters to be detected and character probability of replacement characters of the characters to be detected according to each character in sentences to be detected, sequencing the characters to be detected and the replacement characters according to the character probabilities, obtaining a first threshold value and a second threshold value which meet certain conditions and a sequence of the characters to be detected according to the sequenced character sequences, and judging whether the characters to be detected are wrongly-written characters according to the first threshold value, the second threshold value and the sequence of the characters to be detected.
Further, the first threshold K1 satisfies a first condition that the probability of the first K1 words in the word sequence meets a preset first condition, and specifically includes:
the first threshold value K1 is satisfied such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold value.
Further, the first probability threshold is 0.95.
Further, the second threshold K2 satisfies a second condition that the probability of the first K2 words in the word sequence meets a preset second condition, and specifically includes:
the second threshold value K2 is satisfied that the sum of the probabilities of the first K2-1 characters in the character sequence is smaller than a preset second probability threshold value, and the sum of the probabilities of the first K2 characters in the character sequence is larger than the second probability threshold value.
Further, the second probability threshold is 0.99.
Further, the determining whether the characters to be detected are wrongly written characters according to the first threshold K1, the second threshold K2 and the arrangement order K specifically includes:
when K is judged to be not less than K1 and not more than K2, obtaining a confusion set of the characters to be detected; wherein the confusion set comprises M confusion words, and M is more than or equal to 1;
and judging whether the words to be detected are wrongly written or not according to the sentences to be detected and each confusion word.
Further, the method further comprises:
when K is less than K1, judging the characters to be detected as correct characters;
and when K is larger than K2, judging the characters to be detected as wrongly written characters.
The embodiment of the invention also provides a device for detecting wrongly written characters, which comprises:
the sentence receiving module is used for receiving the sentences to be detected;
the word probability obtaining module is used for obtaining the word probability of the words to be detected in the sentences to be detected and the word probability of K replaced words of the words to be detected according to each word in the sentences to be detected; wherein K is more than or equal to 1;
the character sequence acquisition module is used for arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence;
a comparison threshold value obtaining module, configured to obtain a first threshold value K1, a second threshold value K2, and an arrangement order K of the characters to be detected according to the character sequence; the first threshold value K1 satisfies that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold value K2 satisfies that the character probabilities of the first K2 characters in the character sequence meet a preset second condition;
and the character judging module is used for judging whether the characters to be detected are wrongly-written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program; wherein the computer program, when running, controls the device on which the computer-readable storage medium is located to execute any one of the above-mentioned methods for detecting wrongly written words.
The embodiment of the invention also provides terminal equipment, which comprises at least one memory and at least one processor; wherein,
the memory includes at least one stored computer program;
the processor is configured to execute the computer program, and when executing the computer program, the processor implements the method for detecting a wrongly written word according to any one of the above descriptions.
Compared with the prior art, the embodiment of the invention provides a method and a device for detecting wrongly-written characters, a computer-readable storage medium and a terminal device, wherein the method comprises the steps of obtaining the character probability of characters to be detected and the character probability of replacement characters of the characters to be detected according to each character in sentences to be detected, sequencing the characters to be detected and the replacement characters according to the character probabilities, obtaining a first threshold value and a second threshold value which meet certain conditions and the arrangement sequence of the characters to be detected according to the sequenced character sequence, and judging whether the characters to be detected are wrongly-written characters according to the first threshold value, the second threshold value and the arrangement sequence of the characters to be detected.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a method for detecting wrongly written words provided by the present invention;
FIG. 2 is a detailed flowchart of step S15 of the method for detecting wrongly written words provided by the present invention;
FIG. 3 is a block diagram of a preferred embodiment of the device for detecting wrongly written words according to the present invention;
FIG. 4 is a block diagram of a preferred embodiment of the terminal device provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The embodiment of the invention provides a wrongly written character detection method.
Referring to fig. 1, it is a flowchart of a preferred embodiment of the method for detecting a wrongly written word according to the present invention, and the method includes steps S11 to S15:
step S11, receiving a sentence to be detected;
step S12, obtaining the character probability of the characters to be detected in the sentences to be detected and the character probability of K replaced characters of the characters to be detected according to each character in the sentences to be detected; wherein K is more than or equal to 1;
step S13, arranging the characters to be detected and each replaced character according to the sequence of character probability from large to small to obtain a character sequence;
step S14, obtaining a first threshold value K1, a second threshold value K2 and an arrangement order K of the characters to be detected according to the character sequence; the first threshold value K1 satisfies that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold value K2 satisfies that the character probabilities of the first K2 characters in the character sequence meet a preset second condition;
and step S15, judging whether the characters to be detected are wrongly written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K.
In this embodiment, when it is required to determine whether the characters to be detected are wrongly-typed characters, a sentence where the characters to be detected are located, that is, the sentence to be detected, is obtained first; then, respectively calculating and obtaining the character probability of the characters to be detected and the character probability of K replaced characters of the characters to be detected according to each character in the sentence to be detected; the replacing characters are any characters except the characters to be detected in a preset dictionary set; arranging the characters to be detected and the K replacing characters according to the sequence of the character probability obtained by calculation from large to small so as to obtain a sequenced character sequence; and finally, respectively obtaining a first threshold K1 meeting a preset first condition, a second threshold K2 meeting a preset second condition and the arrangement order K of the characters to be detected in the sorted character sequence according to the sorted character sequence, and judging whether the characters to be detected are wrongly-written characters or not according to the obtained first threshold K1, the obtained second threshold K2 and the arrangement order K of the characters to be detected.
It should be noted that the character probability of the characters to be detected and the character probability of the replaced characters are calculated by a Bi-directional Long Short-Term Memory (Bi-LSTM) model that invokes a softmax algorithm, where the Bi-LSTM model has been trained using a deep learning framework (e.g., TensorFlow).
Specifically, the last layer of the Bi-LSTM model is connected with a softmax layer, the sentence to be detected is input into the Bi-LSTM model, the softmax layer calls a softmax algorithm to calculate, and the character probability of the character to be detected can be calculated according to each character in the sentence to be detected; when the character probability of the replacement characters is calculated, the characters to be detected in the sentences to be detected need to be replaced by the replacement characters in sequence, so that corresponding replacement sentences are obtained, then each replacement sentence is input into the trained Bi-LSTM model, and the character probability of each replacement character can be calculated according to each character in each replacement sentence by calling a softmax algorithm through a softmax layer.
According to the wrongly-written character detection method provided by the embodiment of the invention, the character probability of the characters to be detected and the character probability of the replaced characters of the characters to be detected are respectively obtained according to each character in the sentences to be detected. The first threshold value and the second threshold value which meet certain conditions, and the arrangement sequence of the characters to be detected, are then obtained from the character sequence sorted by character probability, and whether the characters to be detected are wrongly-written characters is judged according to the first threshold value, the second threshold value and the arrangement sequence of the characters to be detected. Because the sentence context where the characters are located is taken into account when the character probability is calculated, and wrongly-written character detection uses a strategy of judging by character probability ranking, the problems of the prior art (a simplistic judgment mode that ignores the sentence context, and low detection accuracy) are solved, and the accuracy of wrongly-written character detection is improved.
In addition, when the trained Bi-LSTM model is adopted to calculate the character probability of the characters to be detected and the character probability of the replaced characters, calculation is carried out according to the information of each character in the sentence where the characters to be detected and the replaced characters are located, and not only can forward information of the characters be utilized, but also backward information of the characters can be utilized, so that the character probability obtained through calculation is more accurate.
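Steps S12 and S13 can be illustrated with a minimal Python sketch. It assumes, as a simplification, that the model yields one raw score per dictionary character for the position under test, whereas the text runs the model once per replacement sentence. The names `softmax`, `rank_characters`, `rank_of`, the four-character dictionary and the scores are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_characters(vocab, scores):
    """Return (character, probability) pairs sorted from largest to smallest probability."""
    probs = softmax(scores)
    return sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)

def rank_of(sequence, character):
    """1-based arrangement order of `character` in the sorted sequence."""
    return [ch for ch, _ in sequence].index(character) + 1

# Hypothetical dictionary and scores for one position in a sentence.
vocab = ["化", "华", "花", "话"]
scores = [1.0, 3.0, -1.0, 0.5]
sequence = rank_characters(vocab, scores)
order_of_detected = rank_of(sequence, "化")
```

Here `sequence` is the sorted character sequence and `order_of_detected` plays the role of the arrangement order of the character to be detected.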
In another preferred embodiment, the first threshold K1 satisfies a first condition that the probability of a first K1 words in the word sequence meets a preset first condition, and specifically includes:
the first threshold value K1 is satisfied such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold value.
This embodiment is a method for obtaining the first threshold K1 from the sorted character sequence. Specifically, the sorted character sequence is traversed, and when a value K1 is found for which the sum of the character probabilities of the first K1-1 characters is smaller than a preset first probability threshold and the sum of the character probabilities of the first K1 characters is greater than the first probability threshold, that value is taken as the first threshold K1.
In yet another preferred embodiment, the first probability threshold is 0.95.
It should be noted that 0.95 in this embodiment is a preferred value of the first probability threshold, that is, when the sum of the probabilities of the top K1-1 characters in the sorted character sequence is less than 0.95 and the sum of the probabilities of the top K1 characters is greater than 0.95, K1 at this time is the first threshold K1 meeting the preset first condition.
In another preferred embodiment, the second threshold K2 satisfies a second condition that the probability of the first K2 words in the word sequence meets a preset second condition, and specifically includes:
the second threshold value K2 is satisfied that the sum of the probabilities of the first K2-1 characters in the character sequence is smaller than a preset second probability threshold value, and the sum of the probabilities of the first K2 characters in the character sequence is larger than the second probability threshold value.
This embodiment is a method for obtaining the second threshold K2 from the sorted character sequence. Specifically, the sorted character sequence is traversed, and when a value K2 is found for which the sum of the character probabilities of the first K2-1 characters is smaller than a preset second probability threshold and the sum of the character probabilities of the first K2 characters is greater than the second probability threshold, that value is taken as the second threshold K2.
In yet another preferred embodiment, the second probability threshold is 0.99.
It should be noted that 0.99 in this embodiment is a preferred value of the second probability threshold, that is, when the sum of the probabilities of the top K2-1 characters in the sorted character sequence is less than 0.99 and the sum of the probabilities of the top K2 characters is greater than 0.99, K2 at this time is the second threshold K2 meeting the preset second condition.
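The traversal that yields K1 and K2 can be sketched as a running sum over the sorted probabilities. This is only an illustration: the function name `cumulative_threshold`, the sample probabilities, and the fallback when the sum never exceeds the threshold are assumptions, not part of the described method.

```python
def cumulative_threshold(sorted_probs, prob_threshold):
    """Smallest count k for which the sum of the top-k probabilities exceeds prob_threshold.

    Falls back to len(sorted_probs) if the running sum never exceeds the
    threshold (an assumption of this sketch; the text does not cover that case).
    """
    running = 0.0
    for k, p in enumerate(sorted_probs, start=1):
        running += p
        if running > prob_threshold:
            return k
    return len(sorted_probs)

# Hypothetical character probabilities, already sorted from large to small.
sorted_probs = [0.60, 0.30, 0.06, 0.035, 0.005]
k1 = cumulative_threshold(sorted_probs, 0.95)  # first probability threshold
k2 = cumulative_threshold(sorted_probs, 0.99)  # second probability threshold
```

Because the second probability threshold is larger than the first, K2 computed this way is never smaller than K1.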
Referring to fig. 2, which is a detailed flowchart of a preferred embodiment of step S15 of the method for detecting a wrongly written word provided by the present invention, the determining whether the text to be detected is a wrongly written word according to the first threshold K1, the second threshold K2 and the arrangement order K specifically includes steps S1501 to S1502:
step S1501, when K is judged to be not less than K1 and not more than K2 (K1 ≤ K ≤ K2), obtaining a confusion set of the characters to be detected; wherein the confusion set comprises M confusion words, and M is more than or equal to 1;
step S1502, according to the sentence to be detected and each of the confusing characters, determining whether the character to be detected is a wrongly written character.
In this embodiment, when determining whether the characters to be detected are wrongly written characters according to the first threshold K1, the second threshold K2, and the arrangement order K of the characters to be detected, the first threshold K1, the second threshold K2, and the arrangement order K are compared first. When K is not less than K1 and not greater than K2, a confusion set of the characters to be detected is obtained, so that whether the characters to be detected are wrongly written characters is further determined according to the sentences to be detected and the M confusion words in the confusion set.
It should be noted that the confusing words are the characters in a preset dictionary set that have the same pinyin as the character to be detected, and the confusion set is the set composed of all the confusing words.
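Combined with the rules stated in the disclosure (an order below K1 means a correct character, an order above K2 means a wrongly written character), the comparison amounts to a three-way decision. The sketch below is illustrative; the function name `classify_by_order` and the returned labels are hypothetical.

```python
def classify_by_order(order, k1, k2):
    """Three-way decision on the arrangement order of the character to be detected."""
    if order < k1:
        return "correct"            # ranked among the most probable characters
    if order > k2:
        return "wrong"              # ranked beyond the second threshold
    return "check_confusion_set"    # k1 <= order <= k2: needs the confusion-set check

decision = classify_by_order(order=50, k1=10, k2=100)
```

Only the middle band triggers the more expensive confusion-set comparison; the two outer bands are decided from the ranking alone.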
Specifically, firstly, calculating and obtaining the character probability of each confusing character according to each character in the sentence to be detected, and arranging the character probabilities of all the confusing characters from large to small, thereby obtaining the arrangement sequence of all the confusing characters according to the character probabilities, and selecting N confusing characters with the maximum character probability as target confusing characters (N can be selected according to actual needs); sequentially replacing the characters to be detected in the sentences to be detected with each target confusing character so as to obtain N corresponding target confusing sentences, and respectively obtaining the character probability of each character in the sentences to be detected and the character probability of each character in the N target confusing sentences so as to obtain the sentence probability of the sentences to be detected and the sentence probability of the N target confusing sentences; and finally, further judging whether the characters to be detected are wrongly written characters or not according to the sentence probability of the sentences to be detected and the sentence probabilities of the N target confusing sentences.
When the character probability of the confusion characters is calculated, the characters to be detected in the sentences to be detected need to be replaced by the confusion characters in sequence, so that the corresponding confusion sentences are obtained. Each confusion sentence is then input into the trained Bi-LSTM model, and the softmax layer calls the softmax algorithm to calculate the character probability of each confusion character according to each character in the corresponding confusion sentence.
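Selecting the N target confusing characters and building the target confusing sentences can be sketched as follows. The character probabilities are taken as given (in the text they come from the Bi-LSTM model), and the function name `build_target_sentences`, the example sentence and the probability values are illustrative assumptions.

```python
def build_target_sentences(sentence, position, confusion_probs, n):
    """Replace the character at `position` with each of the n most probable confusing characters.

    confusion_probs maps each confusing character to its character probability.
    """
    top = sorted(confusion_probs.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [sentence[:position] + ch + sentence[position + 1:] for ch, _ in top]

# Illustrative input: the second character is the one under test.
confusion_probs = {"华": 0.70, "画": 0.05, "话": 0.20}
targets = build_target_sentences("中化人民", 1, confusion_probs, n=2)
```

Each returned sentence differs from the sentence to be detected in exactly one position, so the later sentence-probability comparison isolates the effect of that character.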
When calculating the sentence probability of any sentence A, the sentence A is first input into the trained Bi-LSTM model, and the softmax layer calls the softmax algorithm to calculate the character probabilities $P_{ai}$ ($i = 1, \dots, B$) of the B characters in the sentence A. The calculation is then performed according to the preset sentence probability formula $P_A = \sum_{i=1}^{B} \log P_{ai}$, so as to obtain the sentence probability of the sentence A.
It should be noted that taking the sum of the logarithms of the character probabilities of all characters in a sentence as the sentence probability reduces the amount of calculation and avoids the truncation (underflow) caused by numbers that are too small.
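A short sketch shows why the sum of logarithms is preferred over a direct product of character probabilities. The function name `sentence_log_prob` and the sample values are hypothetical; `math.prod` requires Python 3.8 or later.

```python
import math

def sentence_log_prob(char_probs):
    """Sentence probability as the sum of the logarithms of its character probabilities."""
    return sum(math.log(p) for p in char_probs)

# A long sentence with small per-character probabilities (made-up values).
char_probs = [1e-3] * 400

direct_product = math.prod(char_probs)      # underflows to 0.0 in double precision
log_score = sentence_log_prob(char_probs)   # remains a finite, comparable number
```

Since the logarithm is monotonic, ranking sentences by `log_score` gives the same order as ranking by the true product would, without the loss of precision.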
Take the sentence to be detected "中化人民" as an example, where "化" is the character to be detected. If the preset dictionary set contains 7000 characters in total (including the character to be detected), the character probabilities of the 7000 characters are calculated respectively, and the 7000 characters are arranged in descending order of character probability to obtain the corresponding character sequence.
A first threshold value K1, a second threshold value K2 and the arrangement order K of the character "化" are obtained from the character sequence and compared. Assume that the first threshold value K1 is 10, the second threshold value K2 is 100 and the arrangement order K of "化" is 50, so that K1 ≤ K ≤ K2. A confusion set of the character "化" is then obtained. Assuming that 15 characters in the preset dictionary set have the same pinyin as "化", the character probabilities of the 15 confusing characters are calculated respectively, the 15 confusing characters are arranged in descending order of character probability, and the confusing characters ranked in the top 5 are selected as the target confusing characters. The character "化" in the sentence to be detected "中化人民" is replaced with each of the 5 target confusing characters in turn, so as to obtain the corresponding 5 target confusing sentences.
The character probability of each character in the sentence to be detected "中化人民" is calculated respectively: $P_{a1} = P_{中} = p_1$, $P_{a2} = P_{化} = p_2$, $P_{a3} = P_{人} = p_3$ and $P_{a4} = P_{民} = p_4$. The sentence probability of the sentence to be detected is then $P_{中化人民} = \log p_1 + \log p_2 + \log p_3 + \log p_4$. Finally, further processing is carried out according to the sentence probability of the sentence to be detected and the sentence probabilities of the 5 target confusing sentences.
One method for further judging whether the words to be detected are wrongly written words or not according to the sentence probability of the sentences to be detected and the sentence probabilities of the N target confusing sentences is specifically as follows:
arranging the sentence to be detected and the N target confusing sentences in descending order of sentence probability, thereby obtaining the arrangement order of the sentence to be detected, recorded as n; calculating the value of N1 according to the formula N1 = (N + 1) × 7%; when n is less than N1, the sentence probability of the sentence to be detected ranks within the first 7% of all the sentence probabilities, so the character to be detected is judged to be a correct character; when n is greater than or equal to N1, the sentence probability of the sentence to be detected ranks within the last 93% of all the sentence probabilities, so the character to be detected is judged to be a wrongly written character.
When N1 is calculated according to the formula, if the result is not an integer, the nearest integer greater than the result is taken as the value of N1.
For example, the sentence probabilities of the 6 sentences consisting of the sentence to be detected "中化人民" and the 5 target confusing sentences are arranged in descending order. If the sentence probability of the sentence to be detected "中化人民" ranks within the first 7% of the sentence probabilities of the 6 sentences, the character "化" is considered a correct character; if it ranks within the last 93% of the sentence probabilities of the 6 sentences, the character "化" is considered a wrongly written character.
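The cutoff N1 and the rounding rule can be sketched with integer arithmetic, which avoids floating-point surprises such as 100 × 0.07 evaluating to slightly more than 7. The function names and the `percent` parameter are hypothetical, and the strict comparison follows the rule as worded in this translation.

```python
def percentile_cutoff(num_targets, percent=7):
    """N1 = (N + 1) * percent%, rounded up to the next integer when fractional."""
    scaled = (num_targets + 1) * percent
    return -(-scaled // 100)  # integer ceiling division

def is_correct_by_percentile(order, num_targets, percent=7):
    """Judge the character correct when the sentence's order n is below N1."""
    return order < percentile_cutoff(num_targets, percent)

cutoff_for_99 = percentile_cutoff(99)   # 100 sentences in total
cutoff_for_5 = percentile_cutoff(5)     # 6 sentences in total
```

With the strict comparison as worded here, a small N such as 5 gives N1 = 1, so no order falls below it; whether the original intends n ≤ N1 cannot be determined from the text.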
Another method for further judging whether the characters to be detected are wrongly written characters according to the sentence probability of the sentence to be detected and the sentence probabilities of the N target confusing sentences is specifically as follows:
arranging the sentence to be detected and the N target confusing sentences in descending order of sentence probability, thereby obtaining the rank of the sentence to be detected, which is recorded as n; when n = 1, the sentence probability of the sentence to be detected is the largest of all the sentence probabilities, so the character to be detected is judged to be a correct character; when n ≠ 1, the sentence probability of the sentence to be detected is not the largest of all the sentence probabilities, so the character to be detected is judged to be a wrongly written character.
For example, the sentence probabilities of the 6 sentences, comprising the sentence to be detected "the Chinese style people" and the 5 target confusing sentences, are arranged in descending order. If the sentence probability of the sentence to be detected is the largest of the 6 sentence probabilities, the character to be detected is considered a correct character; if it is not the largest of the 6 sentence probabilities, the character to be detected is considered a wrongly written character.
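The top-rank rule reduces to a single comparison against every target confusing sentence. A minimal sketch, with a hypothetical function name and with ties treated as n = 1 (an assumption, since the text does not address ties):

```python
def is_correct_by_top_rank(detected_score, confusing_scores):
    """True when the sentence to be detected has the largest sentence probability."""
    return all(detected_score >= s for s in confusing_scores)
```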
In the method for detecting wrongly written characters provided by this embodiment of the invention, when the first threshold, the second threshold and the arrangement order of the characters to be detected meet a certain condition, a confusion set of the characters to be detected is obtained, and target confusing sentences are correspondingly obtained according to the confusing characters in the confusion set. Whether the characters to be detected are wrongly written characters is then judged according to the sentence probability of the sentence to be detected and the sentence probabilities of the target confusing sentences. Because the sentence context in which a character is located is taken into account when the character probability is calculated, and a confusion-set judgment strategy is additionally applied, the accuracy of wrongly written character detection is further improved.
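One plausible way to obtain the target confusing sentences is to substitute each confusing character at the position of the character to be detected. The sketch below assumes exactly that; the function name, the 0-based position argument and the sample strings are hypothetical and are not taken from the embodiment.

```python
def build_confusing_sentences(sentence, position, confusion_set):
    """Replace the character at `position` (0-based) with each confusing character."""
    return [
        sentence[:position] + confusing_char + sentence[position + 1:]
        for confusing_char in confusion_set
    ]

# Hypothetical input: second character is the one to be detected.
candidates = build_confusing_sentences("abcd", 1, ["x", "y"])
```

Each resulting sentence would then be scored in the same way as the sentence to be detected, so that the scores can be ranked together.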
In yet another preferred embodiment, the method further comprises:
when K is less than K1, judging the characters to be detected as correct characters;
and when K is larger than K2, judging the characters to be detected as wrongly written characters.
In this embodiment, the other two cases of judging whether the characters to be detected are wrongly written characters according to the first threshold K1, the second threshold K2 and the arrangement order K of the characters to be detected are also covered. Likewise, K is first compared with K1 and K2: if K < K1, the characters to be detected are judged to be correct characters; if K > K2, the characters to be detected are judged to be wrongly written characters.
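Taken together with the confusion-set case, the comparison of K against K1 and K2 gives a three-way decision. The sketch below uses hypothetical names and return labels, and assumes K1 ≤ K2.

```python
def classify_by_rank(k, k1, k2):
    """Three-way decision on the arrangement order k of the character to be detected.

    Assumes k1 <= k2. The string labels are illustrative only.
    """
    if k < k1:
        return "correct"
    if k > k2:
        return "wrong"
    # k1 <= k <= k2: undecided by rank alone, defer to the confusion-set judgment.
    return "check_confusion_set"
```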
The embodiment of the present invention further provides a device for detecting a wrongly written word, which can implement all the processes of the method for detecting a wrongly written word in any one of the above embodiments, and the functions and technical effects of the modules and units in the device are respectively the same as those of the method for detecting a wrongly written word in the above embodiment, and are not described herein again.
Referring to fig. 3, a block diagram of a preferred embodiment of the device for detecting a wrongly written word according to the present invention is shown, including:
a sentence receiving module 11, configured to receive a sentence to be detected;
a text probability obtaining module 12, configured to obtain, according to each text in the sentence to be detected, a text probability of the text to be detected in the sentence to be detected and text probabilities of K replacement texts of the text to be detected; wherein K is more than or equal to 1;
the character sequence acquisition module 13 is configured to arrange the characters to be detected and each of the replacement characters in a sequence from a large probability to a small probability to obtain a character sequence;
a comparison threshold obtaining module 14, configured to obtain a first threshold K1, a second threshold K2, and an arrangement order K of the characters to be detected according to the character sequence; the first threshold K1 satisfies a first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold K2 satisfies a second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition; and
and the character judging module 15 is configured to judge whether the characters to be detected are wrongly written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K.
Preferably, the first threshold K1 satisfies a first condition that the probability of the first K1 words in the word sequence meets a preset first condition, and specifically includes:
the first threshold K1 is such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold.
Preferably, the first probability threshold is 0.95.
Preferably, the second threshold K2 satisfies a second condition that the probability of the first K2 words in the word sequence meets a preset second condition, and specifically includes:
the second threshold K2 is such that the sum of the character probabilities of the first K2-1 characters in the character sequence is less than a preset second probability threshold, and the sum of the character probabilities of the first K2 characters in the character sequence is greater than the second probability threshold.
Preferably, the second probability threshold is 0.99.
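The two thresholds can be found with a single cumulative-sum scan over the character sequence sorted by descending probability. The sketch below uses a hypothetical function name and sample probabilities; returning the full length when the threshold is never exceeded, and treating a cumulative sum exactly equal to the threshold as not yet exceeding it, are assumptions the text does not settle.

```python
from itertools import accumulate

def cumulative_threshold_index(sorted_probs, threshold):
    """Smallest 1-based count whose cumulative probability exceeds `threshold`.

    `sorted_probs` must already be in descending order.
    Assumption: if the threshold is never exceeded, the full length is returned.
    """
    for count, total in enumerate(accumulate(sorted_probs), start=1):
        if total > threshold:
            return count
    return len(sorted_probs)

# Hypothetical character probabilities, sorted from large to small.
probs = [0.6, 0.25, 0.12, 0.025, 0.005]
k1 = cumulative_threshold_index(probs, 0.95)  # first probability threshold
k2 = cumulative_threshold_index(probs, 0.99)  # second probability threshold
```

With these sample values the cumulative sums are 0.6, 0.85, 0.97, 0.995 and 1.0, so K1 is 3 and K2 is 4.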
Preferably, the text determination module 15 specifically includes:
the confusion set acquisition unit is used for acquiring the confusion set of the characters to be detected when it is judged that K1 ≤ K ≤ K2; wherein the confusion set comprises M confusing characters, and M ≥ 1; and
the character judging unit is used for judging whether the characters to be detected are wrongly written characters or not according to the sentence to be detected and each confusing character.
Preferably, the apparatus further comprises:
the first character judging module is used for judging the characters to be detected to be correct characters when K is less than K1; and
the second character judging module is used for judging the characters to be detected to be wrongly written characters when K is greater than K2.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program; wherein, when running, the computer program controls the device on which the computer-readable storage medium is located to execute the method for detecting wrongly written words according to any of the above embodiments.
The embodiment of the invention also provides the terminal equipment.
Referring to fig. 4, it is a block diagram of a preferred embodiment of the terminal device provided in the present invention, which includes at least one memory 10 and at least one processor 20; wherein,
the memory 10 includes at least one stored computer program;
the processor 20 is configured to execute the computer program, and the processor 20 implements the method for detecting a wrongly written word according to any of the above embodiments when executing the computer program.
It should be noted that fig. 4 only illustrates an example in which one memory and one processor in the terminal device are connected, in some specific embodiments, the terminal device may further include a plurality of memories and/or a plurality of processors, and the specific number and the connection mode of the memories and the processors may be set and adaptively adjusted according to actual needs.
To sum up, the method, the apparatus, the computer-readable storage medium, and the terminal device for detecting wrongly written characters provided in the embodiments of the present invention obtain, according to each character in a sentence to be detected, the character probability of the character to be detected and the character probabilities of its replacement characters. A first threshold and a second threshold that satisfy certain conditions, together with the arrangement order of the character to be detected, are then obtained from the character sequence sorted by character probability, and whether the character to be detected is a wrongly written character is judged according to the first threshold, the second threshold and the arrangement order. When the first threshold, the second threshold and the arrangement order satisfy a certain condition, a confusion set of the character to be detected is obtained, and whether the character to be detected is a wrongly written character is judged according to the sentence to be detected and the confusion set. The sentence context in which a character is located is thus taken into account when the character probability is calculated, and a strategy combining character-probability ranking with confusion-set judgment replaces the simple strategy of judging against a single probability threshold. As a result, the strategy for judging wrongly written characters is more flexible and effective, and the accuracy of wrongly written character detection is improved.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for detecting wrongly written characters, comprising:
receiving a sentence to be detected;
acquiring the character probability of the characters to be detected in the sentences to be detected and the character probability of K replacement characters of the characters to be detected according to each character in the sentences to be detected; wherein K is more than or equal to 1;
arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence;
according to the character sequence, obtaining a first threshold K1, a second threshold K2 and an arrangement order K of the characters to be detected; the first threshold K1 satisfies a first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold K2 satisfies a second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition;
and judging whether the characters to be detected are wrongly written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K.
2. The method of claim 1, wherein the first threshold K1 satisfies a first condition that the probability of the first K1 words in the word sequence meets a preset first condition, and specifically comprises:
the first threshold value K1 is satisfied such that the sum of the character probabilities of the first K1-1 characters in the character sequence is less than a preset first probability threshold value, and the sum of the character probabilities of the first K1 characters in the character sequence is greater than the first probability threshold value.
3. The method of claim 2, wherein the first probability threshold is 0.95.
4. The method according to claim 1, wherein the second threshold K2 satisfies a second condition that the probability of the first K2 words in the word sequence meets a preset second condition, and specifically includes:
the second threshold value K2 is satisfied that the sum of the probabilities of the first K2-1 characters in the character sequence is smaller than a preset second probability threshold value, and the sum of the probabilities of the first K2 characters in the character sequence is larger than the second probability threshold value.
5. The method of claim 4, wherein the second probability threshold is 0.99.
6. The method according to claim 1, wherein the determining whether the text to be detected is a wrongly written text according to the first threshold K1, the second threshold K2 and the arrangement order K specifically comprises:
when K is judged to be not less than K1 and not more than K2, obtaining a confusion set of the characters to be detected; wherein the confusion set comprises M confusion words, and M is more than or equal to 1;
and judging whether the words to be detected are wrongly written or not according to the sentences to be detected and each confusion word.
7. The method of detecting wrongly written words as recited in claim 1, wherein the method further comprises:
when K is less than K1, judging the characters to be detected as correct characters;
and when K is larger than K2, judging the characters to be detected as wrongly written characters.
8. A wrongly written character detecting apparatus, comprising:
the sentence receiving module is used for receiving the sentences to be detected;
the word probability obtaining module is used for obtaining the word probability of the words to be detected in the sentences to be detected and the word probability of K replaced words of the words to be detected according to each word in the sentences to be detected; wherein K is more than or equal to 1;
the character sequence acquisition module is used for arranging the characters to be detected and each replaced character according to the sequence of the probability of the characters from large to small to obtain a character sequence;
a comparison threshold obtaining module, configured to obtain a first threshold K1, a second threshold K2, and an arrangement order K of the characters to be detected according to the character sequence; the first threshold K1 satisfies a first condition that the character probabilities of the first K1 characters in the character sequence meet a preset first condition; the second threshold K2 satisfies a second condition that the character probabilities of the first K2 characters in the character sequence meet a preset second condition; and
and the character judging module is used for judging whether the characters to be detected are wrongly-written characters or not according to the first threshold value K1, the second threshold value K2 and the arrangement order K.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program; wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method of detecting wrongly-written words as claimed in any one of claims 1 to 7.
10. A terminal device comprising at least one memory and at least one processor; wherein,
the memory includes at least one stored computer program;
the processor is configured to execute the computer program, and when executing the computer program, the processor implements the method of detecting wrongly-written words as claimed in any of claims 1 to 7.
CN201810500257.1A 2018-05-22 2018-05-22 Wrongly written character detection method and device, computer readable storage medium and terminal equipment Active CN108984515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810500257.1A CN108984515B (en) 2018-05-22 2018-05-22 Wrongly written character detection method and device, computer readable storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810500257.1A CN108984515B (en) 2018-05-22 2018-05-22 Wrongly written character detection method and device, computer readable storage medium and terminal equipment

Publications (2)

Publication Number Publication Date
CN108984515A true CN108984515A (en) 2018-12-11
CN108984515B CN108984515B (en) 2022-09-06

Family

ID=64542579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810500257.1A Active CN108984515B (en) 2018-05-22 2018-05-22 Wrongly written character detection method and device, computer readable storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN108984515B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380839A (en) * 2020-11-13 2021-02-19 沈阳东软智能医疗科技研究院有限公司 Wrongly written character detection method, device and equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11134335A (en) * 1997-10-30 1999-05-21 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Character error calibrating device
CN1460948A (en) * 2002-05-22 2003-12-10 夏普株式会社 Method and device for amending or improving words application
CN104899450A (en) * 2015-06-11 2015-09-09 福州大学 Method for judging tuple linguistic model through comprehensive evaluation of target threat level
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
CN105159871A (en) * 2015-08-21 2015-12-16 小米科技有限责任公司 Text information detection method and apparatus
CN105512110A (en) * 2015-12-15 2016-04-20 江苏科技大学 Wrong word knowledge base construction method based on fuzzy matching and statistics
CN106484132A (en) * 2015-09-02 2017-03-08 北京搜狗科技发展有限公司 A kind of input error correction method and input subtraction unit
CN106649269A (en) * 2016-12-16 2017-05-10 广州视源电子科技股份有限公司 Method and device for extracting spoken sentences
CN106815191A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Correct the determination method and apparatus of word
CN106845499A (en) * 2017-01-19 2017-06-13 清华大学 A kind of image object detection method semantic based on natural language
CN107609098A (en) * 2017-09-11 2018-01-19 北京金堤科技有限公司 Searching method and device
US20180121787A1 (en) * 2016-11-03 2018-05-03 Salesforce.Com, Inc. Joint Many-Task Neural Network Model for Multiple Natural Language Processing (NLP) Tasks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LI GUANPENG 等: "Understanding error propagation in deep learning neural network (DNN) accelerators and applications", 《PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS》 *
YANG Y. 等: "Alibaba at IJCNLP-2017 Task 1:Embedding grammatical features into LSTMs for Chinese grammatical error diagnosis task", 《PROCEEDINGS OF THE 8TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING 2017》 *
朱丽莉: "中文错别字自动检测初探", 《中国优秀博硕士学位论文全文数据库(硕士)哲学与人文科学辑》 *
钱涛: "中文微博文本规范化方法及关键技术研究", 《中国优秀博硕士学位论文全文数据库(博士)信息科技辑》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant