CN112863516A - Text error correction method and system and electronic equipment - Google Patents

Text error correction method and system and electronic equipment

Info

Publication number
CN112863516A
Authority
CN
China
Prior art keywords
pinyin
text
user
score
pronunciation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011641951.9A
Other languages
Chinese (zh)
Inventor
简仁贤
佘昌宪
李佳纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Emotibot Technologies Ltd
Original Assignee
Emotibot Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Emotibot Technologies Ltd filed Critical Emotibot Technologies Ltd
Priority to CN202011641951.9A priority Critical patent/CN112863516A/en
Publication of CN112863516A publication Critical patent/CN112863516A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/28: Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text error correction method, a system, and electronic equipment. The text error correction method comprises the following steps: receiving a text to be corrected and acquiring its pinyin, and acquiring user vocabulary items and their pinyin from a user lexicon; directly comparing the pinyin of the text to be corrected with the pinyin of each user vocabulary item, and selecting an error-correcting word from the user lexicon according to a preset algorithm; and deducing in reverse, from the scoring path of the selected error-correcting word, the replacement word in the text to be corrected, replacing the replacement word with the error-correcting word, and obtaining the corrected text. The disclosed method and system do not depend on an entity word model, and can quickly compare every position of the text to find the position to be replaced.

Description

Text error correction method and system and electronic equipment
Technical Field
The invention relates to the technical field of intelligent recognition, in particular to a natural language processing technology.
Background
With the popularization of deep learning, great breakthroughs have been made in computer vision, speech recognition, natural language processing, and other fields. Taking speech recognition as an example, recognition accuracy has now reached 97%. These breakthroughs keep widening the range of applications of speech recognition. Compared with other modes of human-computer interaction, voice interaction better matches people's daily habits and is more efficient. Speech recognition can therefore be expected to be widely applied in fields such as smart home, industrial production, communication, medical treatment, and automatic driving. In actual voice interaction, however, the recognition error rate is high because of factors such as non-standard pronunciation by the user and noise. The prior art focuses on improving recognition accuracy but lacks a means of correcting errors in the recognition result. This greatly hinders the popularization of voice interaction products.
In the prior art, to correct errors in the recognition result, the entity words of the speech recognition text are typically located with a pre-trained entity word model and then compared with the user lexicon for pinyin similarity. Such a method generally comprises the following steps: analyzing and preprocessing the text data obtained from speech conversion to obtain a sample data set; training an entity recognition model with the sample data; constructing an entity correction data set; and, for text data obtained from speech recognition, using the entity recognition model to predict and verify entities, and so on.
The technical scheme of the type has the following defects: before the speech recognition text is corrected, entity word models need to be labeled and trained in advance, and entity words of different types may need to be trained additionally, so that the operation time and accuracy are greatly limited by the entity word models.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a text error correction method, a text error correction system and electronic equipment.
In a first aspect, the present invention provides a text error correction method, including the following steps: acquiring a text to be corrected, identifying the pinyin of the text to be corrected, and extracting user vocabulary items and their pinyin from a user lexicon; directly comparing the pinyin of the text to be corrected with the pinyin of each user vocabulary item, and selecting an error-correcting word from the user lexicon according to a preset algorithm; and deducing in reverse, from the scoring path of the selected error-correcting word, the replacement word in the text to be corrected, replacing the replacement word with the error-correcting word, and obtaining the corrected text.
With reference to the embodiment of the first aspect, in a possible implementation manner, the text to be corrected is a speech recognition text.
With reference to the embodiment of the first aspect, in a possible implementation manner, the preset algorithm includes a pronunciation score algorithm, and the pronunciation score algorithm includes: calculating a pronunciation similarity score for every pair of pinyin syllables, one taken from the pinyin of the text to be corrected and one from the pinyin of a user vocabulary item; accumulating these pronunciation similarity scores with a Longest Common Subsequence (LCS) algorithm to obtain an accumulated pronunciation score for each user vocabulary item, and recording the scoring path of the pronunciation score; and dividing the accumulated pronunciation score of each user vocabulary item by the number of pinyin syllables in that item to obtain its pronunciation score, and selecting the user vocabulary item with the highest pronunciation score as the error-correcting word.
With reference to the embodiment of the first aspect, in a possible implementation manner, the preset algorithm includes a pinyin score algorithm and a pronunciation score algorithm. The pinyin score algorithm includes: calculating the LCS of the pinyin of the text to be corrected and the pinyin of each user vocabulary item to obtain the accumulated pinyin score of each user vocabulary item; dividing that accumulated pinyin score by the number of pinyin syllables in the item to obtain its pinyin score; and screening the user vocabulary items whose pinyin scores rank in the top K as candidate words, where K is less than or equal to the total number of user vocabulary items. The pronunciation score algorithm includes: calculating a pronunciation similarity score for every pair of pinyin syllables, one taken from the pinyin of the text to be corrected and one from the pinyin of a candidate word; accumulating these pronunciation similarity scores with a Longest Common Subsequence (LCS) algorithm to obtain an accumulated pronunciation score for each candidate word, and recording the scoring path of the pronunciation score; and dividing the accumulated pronunciation score of each candidate word by the number of pinyin syllables in that candidate word to obtain its pronunciation score, and selecting the candidate word with the highest pronunciation score as the error-correcting word.
With reference to the embodiment of the first aspect, in a possible implementation manner, if multiple candidate words have the same pronunciation score, the error-correcting word is picked by applying the following criteria in order: the candidate word with the highest pinyin score; the candidate word sharing the largest number of identical characters with the text to be corrected; the candidate word whose length is closest to the character count of the text to be corrected; and, failing all of these, a candidate word selected at random.
With reference to the first aspect, in a possible implementation manner, the pronunciation similarity score of the pinyin is obtained directly from a pronunciation similarity algorithm, where the pronunciation similarity algorithm includes the DIMSIM algorithm.
With reference to the first aspect, in a possible implementation manner, the pronunciation score of the error-correcting word is compared with a preset threshold: if the pronunciation score of the error-correcting word is lower than the preset threshold, the step of replacing the replacement word with the error-correcting word is not executed, and the text to be corrected is output directly; if the pronunciation score of the error-correcting word is higher than or equal to the preset threshold, the replacement word is replaced with the error-correcting word and the corrected text is output.
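The accumulation described above can be written as a recurrence. This is one plausible formalization, offered as a sketch rather than the patent's own formula. The symbols are assumptions introduced here: $p_1,\dots,p_n$ are the pinyin syllables of the text to be corrected, $q_1,\dots,q_m$ are those of one user vocabulary item, and $\mathrm{sim}(\cdot,\cdot)\in[0,1]$ is the pronunciation similarity of two syllables.

```latex
\begin{aligned}
S_{i,0} &= S_{0,j} = 0,\\
S_{i,j} &= \max\bigl(S_{i-1,j},\; S_{i,j-1},\; S_{i-1,j-1} + \mathrm{sim}(p_i, q_j)\bigr),\\
\text{score} &= \frac{S_{n,m}}{m}.
\end{aligned}
```

If $\mathrm{sim}$ is 1 for identical syllables and 0 otherwise, $S_{n,m}$ reduces to the ordinary LCS length, which corresponds to the pinyin score used for candidate screening.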
With reference to the embodiment of the first aspect, in a possible implementation manner, a value range of the preset threshold is 0.5 to 1.
In a second aspect, the present invention provides a text error correction system, comprising the following modules: a voice recognition module, used for receiving the user's voice input and acquiring the speech recognition text and the pinyin of the text to be corrected; a user lexicon storage module, used for storing and extracting user vocabulary items and their pinyin, and for calculating the number of pinyin syllables of the user vocabulary items and the number of user vocabulary items; a pronunciation score calculation module, used for calculating the pronunciation score of the pinyin of each candidate word and the scoring path of the accumulated pronunciation score; and a text replacement module, used for text replacement.
With reference to the second aspect, in a possible implementation manner, the system further includes a pinyin score calculation module, configured to calculate the pinyin score of each user vocabulary item.
In a third aspect, the present invention provides an electronic device comprising a memory and a processor, the memory and the processor being connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment.
The beneficial effects of the invention include: the method is not limited by the entity word model, and the entity word model does not need to be labeled and trained in advance, so that the workload is reduced, and the working efficiency and the accuracy are improved; each position of the text can be quickly compared, and the position to be replaced and the most matched error-correcting word can be found out at the same time, so that the operation time is shortened and the accuracy is improved; the invention not only considers the similarity of the pinyin literal, but also considers the similarity of pronunciation, further improving the accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention; they are not intended to limit the invention. In the drawings:
FIG. 1 is a flow chart of a text error correction method according to the present invention;
FIG. 2 is a chart of the accumulated pronunciation scores of "the license plate number is Ku ABC966" and "Hu ADD966";
FIG. 3 is a chart of the scoring path of the accumulated pronunciation scores of "the license plate number is Ku ABC966" and "Hu ADD966";
FIG. 4 is a chart of the accumulated pronunciation scores of "call a car to the tailstock hospital, okay?" and "micro-nail hospital";
FIG. 5 is a chart of the scoring path of the accumulated pronunciation scores of "call a car to the tailstock hospital, okay?" and "micro-nail hospital";
FIG. 6 is a diagram of a text correction system according to the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions, and the technical effects of the present invention clearer, the technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art based on these embodiments without inventive effort also fall within the scope of the present invention.
The embodiment of the invention overcomes the defect that in the prior art, the entity word model needs to be labeled and trained in advance before the error correction is carried out on the voice recognition text, and different types of entity words possibly need to be trained additionally, so that the operation time and the accuracy are greatly limited by the entity word model.
The text error correction method provided by the invention comprises the following steps: acquiring a text to be corrected, identifying the pinyin of the text to be corrected, and extracting user vocabulary items and their pinyin from a user lexicon; directly comparing the pinyin of the text to be corrected with the pinyin of each user vocabulary item, and selecting an error-correcting word from the user lexicon according to a preset algorithm; and deducing in reverse, from the scoring path of the selected error-correcting word, the replacement word in the text to be corrected, replacing the replacement word with the error-correcting word, and obtaining the corrected text.
Further, the text to be corrected is a speech recognition text.
Further, the preset algorithm may adopt any one of the following two schemes:
First, the preset algorithm includes a pronunciation score algorithm, which includes: calculating a pronunciation similarity score for every pair of pinyin syllables, one taken from the pinyin of the text to be corrected and one from the pinyin of a user vocabulary item; accumulating these pronunciation similarity scores with a Longest Common Subsequence (LCS) algorithm to obtain an accumulated pronunciation score for each user vocabulary item, and recording the scoring path of the pronunciation score; and dividing the accumulated pronunciation score of each user vocabulary item by the number of pinyin syllables in that item to obtain its pronunciation score, and selecting the user vocabulary item with the highest pronunciation score as the error-correcting word.
Second, in order to further improve error correction efficiency, the preset algorithm comprises a pinyin score algorithm and a pronunciation score algorithm. The pinyin score algorithm includes: calculating the LCS of the pinyin of the text to be corrected and the pinyin of each user vocabulary item to obtain the accumulated pinyin score of each user vocabulary item; dividing that accumulated pinyin score by the number of pinyin syllables in the item to obtain its pinyin score; and screening the user vocabulary items whose pinyin scores rank in the top K as candidate words, where K is less than or equal to the total number of user vocabulary items. The pronunciation score algorithm includes: calculating a pronunciation similarity score for every pair of pinyin syllables, one taken from the pinyin of the text to be corrected and one from the pinyin of a candidate word; accumulating these pronunciation similarity scores with a Longest Common Subsequence (LCS) algorithm to obtain an accumulated pronunciation score for each candidate word, and recording the scoring path of the pronunciation score; and dividing the accumulated pronunciation score of each candidate word by the number of pinyin syllables in that candidate word to obtain its pronunciation score, and selecting the candidate word with the highest pronunciation score as the error-correcting word.
Further, if multiple candidate words have the same pronunciation score, the error-correcting word is picked by applying the following criteria in order: the candidate word with the highest pinyin score; the candidate word sharing the largest number of identical characters with the text to be corrected; the candidate word whose length is closest to the character count of the text to be corrected; and, failing all of these, a candidate word selected at random.
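The pronunciation score algorithm of the first scheme can be sketched as a similarity-weighted LCS with back-tracing. This is a minimal illustration under stated assumptions, not the patent's implementation: `pronunciation_score` and `toy_sim` are hypothetical names, and `toy_sim` is a stand-in for a real pronunciation similarity algorithm such as DIMSIM (it knows only exact matches plus the one "ku"/"hu" value quoted in Example one).

```python
from typing import Callable, List, Tuple

# Toy stand-in for a pronunciation similarity algorithm (assumption, not DIMSIM).
_TOY_TABLE = {frozenset(("ku", "hu")): 0.583}

def toy_sim(a: str, b: str) -> float:
    """Return 1.0 for identical syllables, a table value if known, else 0.0."""
    if a == b:
        return 1.0
    return _TOY_TABLE.get(frozenset((a, b)), 0.0)

def pronunciation_score(
    text: List[str],
    word: List[str],
    sim: Callable[[str, str], float] = toy_sim,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (pronunciation score, scoring path) for one user vocabulary item.

    The path lists (text_index, word_index) pairs of the aligned syllables.
    """
    n, m = len(text), len(word)
    # dp[i][j]: best accumulated similarity aligning text[:i] with word[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + sim(text[i - 1], word[j - 1]),
            )
    # Walk back from the corner to recover which syllable pairs were aligned.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim(text[i - 1], word[j - 1])
        if s > 0 and dp[i][j] == dp[i - 1][j - 1] + s:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    # Normalize by the number of pinyin syllables in the user vocabulary item.
    return dp[n][m] / m, path
```

With the pinyin of Example two, `pronunciation_score(['jiao','che','dao','wei','jia','yi','yuan','hao','ma'], ['wei','jia','yi','yuan'])` gives a score of 1.0 and a path covering text positions 3 to 6. Scores for Example one will differ from the patent's figures because the toy table lacks the real similarity values.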
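The tie-breaking order can be expressed as a lexicographic comparison. The sketch below is illustrative only: the `Candidate` class and the function names are hypothetical, and the second criterion is read here as "shares the most characters with the text to be corrected", which is an interpretation of the translated wording rather than a confirmed definition.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    word: str
    pinyin_score: float
    pronunciation_score: float

def shared_chars(a: str, b: str) -> int:
    """Count characters common to both strings (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())

def pick_correction(cands: List[Candidate], text: str, rng=random) -> Candidate:
    """Pick the error-correcting word, breaking pronunciation-score ties in order."""
    best = max(c.pronunciation_score for c in cands)
    # Exact float equality is used for simplicity; a tolerance may suit real scores.
    tied = [c for c in cands if c.pronunciation_score == best]

    def key(c: Candidate):
        return (
            c.pinyin_score,                    # 1. highest pinyin score
            shared_chars(c.word, text),        # 2. most shared characters (assumed reading)
            -abs(len(c.word) - len(text)),     # 3. length closest to the text
        )

    top = max(key(c) for c in tied)
    finalists = [c for c in tied if key(c) == top]
    return rng.choice(finalists)               # 4. random choice among what remains
```

Only candidates that are still tied after the first three criteria reach the random choice, so the result is deterministic whenever any criterion separates them.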
Furthermore, the pronunciation similarity score of the pinyin is directly obtained by the calculation of a pronunciation similarity algorithm such as a DIMSIM algorithm.
Further, in order to improve error correction precision, a threshold is preset and the pronunciation score of the error-correcting word is compared with it: if the pronunciation score of the error-correcting word is lower than the preset threshold, the step of replacing the replacement word with the error-correcting word is not executed, and the text to be corrected is output directly; if the pronunciation score of the error-correcting word is higher than or equal to the preset threshold, the replacement word is replaced with the error-correcting word and the corrected text is output.
Further, according to the requirements of different error correction precisions, the value range of the preset threshold is 0.5-1.
In addition, the invention also provides a text error correction system, which comprises the following modules: a voice recognition module, used for receiving the user's voice input and acquiring the speech recognition text and the pinyin of the text to be corrected; a user lexicon storage module, used for storing and extracting user vocabulary items and their pinyin, and for calculating the number of pinyin syllables of the user vocabulary items and the number of user vocabulary items; a pronunciation score calculation module, used for calculating the pronunciation score of the pinyin of each candidate word and the scoring path of the accumulated pronunciation score; and a text replacement module, used for text replacement.
Further, the system also comprises a pinyin score calculation module which is used for calculating the pinyin score of each user vocabulary.
Furthermore, the invention provides an electronic device comprising a memory and a processor, the memory and the processor being connected; the memory is used for storing programs; the processor calls a program stored in the memory to perform the method of the first aspect embodiment and/or any possible implementation manner of the first aspect embodiment.
The technical solution of the present application is further illustrated by the following examples.
Example one
Step 1:
acquiring a text to be corrected by speech recognition: "the license plate number is Ku ABC966", whose corresponding pinyin is: ['che', 'pai', 'hao', 'ma', 'shi', 'ku', 'ei', 'bi', 'xi', 'jiu', 'liu', 'liu'];
extracting the user vocabulary items in the user lexicon: "Gan A7C976", "Hu ADD966", "Wan DPC966", "Zhe EY7966" ……, whose corresponding pinyin are respectively: ['gan', 'ei', 'qi', 'xi', 'jiu', 'qi', 'liu'], ['hu', 'ei', 'di', 'di', 'jiu', 'liu', 'liu'], ['wan', 'di', 'pi', 'xi', 'jiu', 'liu', 'liu'], ['zhe', 'yi', 'wai', 'qi', 'jiu', 'liu', 'liu'] ……
Step 2:
calculating LCS of the text pinyin to be corrected and the user vocabulary pinyin and dividing the LCS by the number of the user vocabulary pinyin to obtain the pinyin score of each user vocabulary, wherein the specific calculation mode is as follows:
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Gan A7C976") / 7 = LCS(['che', 'pai', 'hao', 'ma', 'shi', 'ku', 'ei', 'bi', 'xi', 'jiu', 'liu', 'liu'], ['gan', 'ei', 'qi', 'xi', 'jiu', 'qi', 'liu']) / 7 = 4/7 = 0.57
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Hu ADD966") / 7 = LCS(['che', 'pai', 'hao', 'ma', 'shi', 'ku', 'ei', 'bi', 'xi', 'jiu', 'liu', 'liu'], ['hu', 'ei', 'di', 'di', 'jiu', 'liu', 'liu']) / 7 = 4/7 = 0.57
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Wan DPC966") / 7 = LCS(['che', 'pai', 'hao', 'ma', 'shi', 'ku', 'ei', 'bi', 'xi', 'jiu', 'liu', 'liu'], ['wan', 'di', 'pi', 'xi', 'jiu', 'liu', 'liu']) / 7 = 4/7 = 0.57
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Zhe EY7966") / 7 = LCS(['che', 'pai', 'hao', 'ma', 'shi', 'ku', 'ei', 'bi', 'xi', 'jiu', 'liu', 'liu'], ['zhe', 'yi', 'wai', 'qi', 'jiu', 'liu', 'liu']) / 7 = 3/7 = 0.43
……
When K is set to 3, the user vocabulary items whose pinyin scores rank in the top 3 are screened out as candidate words: "Gan A7C976", "Hu ADD966", and "Wan DPC966".
Step 3:
calculating, with a pronunciation similarity algorithm such as the DIMSIM algorithm, the pronunciation similarity score of every pair of pinyin syllables between the text to be corrected and each candidate word; for example, the pronunciation similarity of "Ku (ku)" and "Hu (hu)" is 0.583, that of "Ku (ku)" and "Gan (gan)" is 0.013, that of 9 (jiu) and 9 (jiu) is 1 ……; accumulating the pronunciation similarity scores of the pinyin of the text to be corrected and the pinyin of each candidate word with the Longest Common Subsequence (LCS) algorithm to obtain the accumulated pronunciation score of each candidate word, and recording the scoring path of the pronunciation score; and dividing the accumulated pronunciation score of each candidate word by the number of pinyin syllables of that candidate word to obtain its pronunciation score. The specific calculation is as follows:
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Gan A7C976" | DIMSIM pronunciation similarity) / 7 = 0.59
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Hu ADD966" | DIMSIM pronunciation similarity) / 7 = 0.69
LCS(pinyin of "the license plate number is Ku ABC966", pinyin of "Wan DPC966" | DIMSIM pronunciation similarity) / 7 = 0.67
FIG. 2 shows the accumulated pronunciation score chart for "Hu ADD966", and FIG. 3 shows the corresponding scoring path chart.
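The pinyin scores and the top-K screening of Step 2 can be reproduced with an ordinary LCS. The following is a small sketch under assumptions: the function names and the `LEXICON` dictionary are illustrative, and ties at the cut-off are resolved by lexicon order, which the source does not specify.

```python
from typing import Dict, List

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two syllable lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def pinyin_score(text: List[str], word: List[str]) -> float:
    """LCS length divided by the number of pinyin syllables in the user word."""
    return lcs_length(text, word) / len(word)

def top_k(text: List[str], lexicon: Dict[str, List[str]], k: int) -> List[str]:
    """Return the k user words with the highest pinyin score (stable on ties)."""
    ranked = sorted(lexicon, key=lambda w: pinyin_score(text, lexicon[w]), reverse=True)
    return ranked[:k]

TEXT = ['che', 'pai', 'hao', 'ma', 'shi', 'ku', 'ei', 'bi', 'xi', 'jiu', 'liu', 'liu']
LEXICON = {
    "Gan A7C976": ['gan', 'ei', 'qi', 'xi', 'jiu', 'qi', 'liu'],
    "Hu ADD966": ['hu', 'ei', 'di', 'di', 'jiu', 'liu', 'liu'],
    "Wan DPC966": ['wan', 'di', 'pi', 'xi', 'jiu', 'liu', 'liu'],
    "Zhe EY7966": ['zhe', 'yi', 'wai', 'qi', 'jiu', 'liu', 'liu'],
}

for name, pinyin in LEXICON.items():
    print(name, round(pinyin_score(TEXT, pinyin), 2))  # 0.57, 0.57, 0.57, 0.43
print(top_k(TEXT, LEXICON, 3))  # ['Gan A7C976', 'Hu ADD966', 'Wan DPC966']
```

Because this stage uses only exact syllable matches, it is cheap to run over the whole lexicon before the costlier pronunciation scoring is applied to the K survivors.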
Step 4:
selecting "Hu ADD966", which has the highest pronunciation score, as the error-correcting word, and deducing in reverse, from the scoring path of its accumulated pronunciation score shown in FIG. 3, that the replacement word in the text is "Ku ABC966".
Step 5-1:
when the preset threshold is set to 0.8, the pronunciation score of the error-correcting word "Hu ADD966", 0.69, is lower than the preset threshold, so the step of replacing the replacement word with the error-correcting word is not executed and the text to be corrected is output directly; that is, the output text is still "the license plate number is Ku ABC966".
Step 5-2:
when the preset threshold is set to 0.6, the pronunciation score of the error-correcting word "Hu ADD966", 0.69, is higher than the preset threshold, so the error-correcting word "Hu ADD966" replaces the replacement word "Ku ABC966"; that is, the text to be corrected, "the license plate number is Ku ABC966", is changed to "the license plate number is Hu ADD966".
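Steps 4 and 5 can be illustrated together: derive the span to replace from a scoring path, then apply the threshold. This is a hedged sketch with several assumptions. The name `apply_correction` is hypothetical, the span is taken to run from the first to the last matched text position (the source says only that the replacement word is deduced in reverse from the path), the tokens are one per syllable with the Chinese characters shown in romanized form, and `PATH` is an assumed path rather than the one in FIG. 3.

```python
from typing import List, Tuple

def apply_correction(
    tokens: List[str],
    path: List[Tuple[int, int]],
    correction: str,
    score: float,
    threshold: float,
) -> List[str]:
    """Replace the span covered by the scoring path if the score passes the threshold."""
    if score < threshold or not path:
        # Below the threshold: output the text to be corrected unchanged.
        return list(tokens)
    start = path[0][0]   # first matched position in the text
    end = path[-1][0]    # last matched position in the text
    return tokens[:start] + [correction] + tokens[end + 1:]

# One token per pinyin syllable of "the license plate number is Ku ABC966".
TOKENS = ['che', 'pai', 'hao', 'ma', 'shi', 'Ku', 'A', 'B', 'C', '9', '6', '6']
# Assumed scoring path: (text index, user-word index) pairs for "Hu ADD966".
PATH = [(5, 0), (6, 1), (7, 2), (8, 3), (9, 4), (10, 5), (11, 6)]

print(apply_correction(TOKENS, PATH, "Hu ADD966", 0.69, 0.8))  # unchanged (Step 5-1)
print(apply_correction(TOKENS, PATH, "Hu ADD966", 0.69, 0.6))  # span replaced (Step 5-2)
```

A higher threshold trades recall for precision: at 0.8 the 0.69 match is rejected, while at 0.6 it is accepted.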
Example two
Step 1:
acquiring a text to be corrected by speech recognition: "call a car to the tailstock hospital, okay?", whose corresponding pinyin is: ['jiao', 'che', 'dao', 'wei', 'jia', 'yi', 'yuan', 'hao', 'ma'];
extracting the user vocabulary items in the user lexicon: "micro-nail hospital", "rock hospital", "Minghao International Hotel" ……, whose corresponding pinyin are respectively: ['wei', 'jia', 'yi', 'yuan'], ['shi', 'yan', 'yi', 'yuan'], ['ming', 'hao', 'guo', 'ji', 'jiu', 'dian'] ……
Step 2:
calculating LCS of the text pinyin to be corrected and the user vocabulary pinyin and dividing the LCS by the number of the user vocabulary pinyin to obtain the pinyin score of each user vocabulary, wherein the specific calculation mode is as follows:
LCS(pinyin of "call a car to the tailstock hospital, okay?", pinyin of "micro-nail hospital") / 4 = LCS(['jiao', 'che', 'dao', 'wei', 'jia', 'yi', 'yuan', 'hao', 'ma'], ['wei', 'jia', 'yi', 'yuan']) / 4 = 4/4 = 1
LCS(pinyin of "call a car to the tailstock hospital, okay?", pinyin of "rock hospital") / 4 = LCS(['jiao', 'che', 'dao', 'wei', 'jia', 'yi', 'yuan', 'hao', 'ma'], ['shi', 'yan', 'yi', 'yuan']) / 4 = 2/4 = 0.5
LCS(pinyin of "call a car to the tailstock hospital, okay?", pinyin of "Minghao International Hotel") / 6 = LCS(['jiao', 'che', 'dao', 'wei', 'jia', 'yi', 'yuan', 'hao', 'ma'], ['ming', 'hao', 'guo', 'ji', 'jiu', 'dian']) / 6 = 1/6 = 0.17
……
When K is set to 2, the user vocabulary items whose pinyin scores rank in the top 2 are screened out as candidate words: "micro-nail hospital" and "rock hospital".
Step 3:
calculating, with a pronunciation similarity algorithm such as the DIMSIM algorithm, the pronunciation similarity score of every pair of pinyin syllables between the text to be corrected and each candidate word; for example, the pronunciation similarity of "tail (wei)" and "micro (wei)" is 1; accumulating the pronunciation similarity scores of the pinyin of the text to be corrected and the pinyin of each candidate word with the Longest Common Subsequence (LCS) algorithm to obtain the accumulated pronunciation score of each candidate word, and recording the scoring path of the pronunciation score; and dividing the accumulated pronunciation score of each candidate word by the number of pinyin syllables of that candidate word to obtain its pronunciation score. The specific calculation is as follows:
LCS(pinyin of "call a car to the tailstock hospital, okay?", pinyin of "micro-nail hospital" | DIMSIM pronunciation similarity) / 4 = 1
LCS(pinyin of "call a car to the tailstock hospital, okay?", pinyin of "rock hospital" | DIMSIM pronunciation similarity) / 4 = 0.52
FIG. 4 shows the accumulated pronunciation score chart for "micro-nail hospital", and FIG. 5 shows the corresponding scoring path chart.
Step 4:
selecting "micro-nail hospital", which has the highest pronunciation score, as the error-correcting word, and deducing in reverse, from the scoring path of its accumulated pronunciation score shown in FIG. 5, that the replacement word in the text is "tailstock hospital".
Step 5:
when the threshold is set to 0.8, the pronunciation score of the error-correcting word "micro-nail hospital", 1, is higher than the threshold, so the error-correcting word "micro-nail hospital" replaces the replacement word "tailstock hospital"; that is, the text to be corrected, "call a car to the tailstock hospital, okay?", is changed to "call a car to the micro-nail hospital, okay?".
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the examples, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and variations can be made by persons skilled in the art without departing from the principles of the invention and should be considered as within the scope of the invention.

Claims (10)

1. A text error correction method, comprising the steps of:
acquiring a text to be corrected, identifying pinyin of the text to be corrected, and extracting user words and user word pinyin from a user word bank;
comparing the pinyin of the text to be corrected with the pinyin of the vocabulary of each user, and selecting error-correcting words from a user word bank according to a preset algorithm;
and reversely deducing the replacement words in the text to be corrected according to the selected path of the error correction words, replacing the replacement words with the error correction words, and obtaining the text after error correction.
2. The text correction method of claim 1 wherein the preset algorithm comprises a pronunciation score algorithm comprising:
calculating pronunciation similarity scores of any two pinyins between the text pinyin to be corrected and the pinyin of the vocabulary of each user, accumulating the pronunciation similarity scores of the text pinyin to be corrected and the pinyin of the vocabulary of the user by using a Longest Common Subsequence (LCS) algorithm to obtain an accumulated pronunciation score of each vocabulary of the user, and recording a scoring path of the pronunciation scores;
dividing the accumulated pronunciation score of the user vocabulary by the pinyin number of the user vocabulary to obtain the pronunciation score of the user vocabulary, and selecting the user vocabulary with the highest pronunciation score as an error correction word.
3. The text correction method of claim 1, wherein the preset algorithm comprises a pinyin score algorithm and a pronunciation score algorithm, the pinyin score algorithm comprising:
calculating LCS of the text pinyin to be corrected and the pinyin of each user vocabulary to obtain the accumulated pinyin score of each user vocabulary, dividing the accumulated pinyin score of the user vocabulary by the number of the pinyins of the user vocabulary to obtain the pinyin score of the user vocabulary, and screening the user vocabulary with the pinyin score arranged at the front K position as a candidate word, wherein K is less than or equal to the total number of the user vocabulary.
The pronunciation score algorithm comprises:
calculating pronunciation similarity scores of any two pinyins between the text pinyin to be corrected and the pinyin of each candidate word, accumulating the pronunciation similarity scores of the text pinyin to be corrected and the pinyin of the candidate word by using a Longest Common Subsequence (LCS) algorithm to obtain an accumulated pronunciation score of each candidate word, and recording a scoring path of the pronunciation scores;
and dividing the accumulated pronunciation score of the candidate word by the number of the pinyin of the candidate word to obtain the pronunciation score of the candidate word, and selecting the candidate word with the highest pronunciation score as the error correction word.
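The pinyin score stage of claim 3 acts as an inexpensive pre-filter before the pronunciation score is computed. The sketch below illustrates only that screening stage, using an exact-match LCS over pinyin syllables. The function names `lcs_length`, `pinyin_score` and `top_k_candidates` are hypothetical, and clipping K to the lexicon size is an assumption made to satisfy the "K is less than or equal to the total number" condition.

```python
def lcs_length(a, b):
    """Classic LCS length over two sequences of pinyin syllables."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def pinyin_score(text_py, word_py):
    """Accumulated pinyin score (LCS length) divided by the word's syllable count."""
    return lcs_length(text_py, word_py) / len(word_py)


def top_k_candidates(text_py, lexicon, k):
    """Keep the k user words with the highest pinyin score as candidate words."""
    k = min(k, len(lexicon))  # assumed handling of K > lexicon size
    ranked = sorted(lexicon, key=lambda w: -pinyin_score(text_py, lexicon[w]))
    return ranked[:k]


text_py = ["wo", "yao", "zhao", "zhang", "san"]
lexicon = {
    "李四": ["li", "si"],
    "张山": ["zhang", "shan"],
    "张三": ["zhang", "san"],
}
candidates = top_k_candidates(text_py, lexicon, k=2)
```

Only the surviving candidates would then be passed to the more expensive pronunciation score algorithm, which is the design motive one can reasonably infer from the two-stage wording of the claim.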
4. The text error correction method according to claim 3, wherein if a plurality of candidate words have the same pronunciation score, the error correction word is selected according to the following order of priority:
the candidate word with the highest pinyin score is the error correction word;
the candidate word having the largest number of characters in common with the text to be corrected is the error correction word;
the candidate word whose length is closest to the number of characters of the text to be corrected is the error correction word;
and a candidate word selected at random is the error correction word.
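The priority order in claim 4 can be sketched as a cascade of filters, each applied only while more than one candidate remains. This is an illustrative reading rather than the claimed implementation: the second criterion is interpreted here as "shares the most characters with the text to be corrected", which is an assumption about the translated wording, and the names `Candidate`, `shared_chars`, `keep_best` and `break_tie` are hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    pinyin_score: float


def shared_chars(word: str, text: str) -> int:
    """Number of characters the word and the text have in common (multiset overlap)."""
    return sum((Counter(word) & Counter(text)).values())


def keep_best(items, key):
    """Keep only the items that attain the maximum value of key."""
    best = max(key(c) for c in items)
    return [c for c in items if key(c) == best]


def break_tie(tied, text, rng=random):
    """Apply the four tie-breaking criteria in order to candidates with equal pronunciation score."""
    tied = keep_best(tied, lambda c: c.pinyin_score)
    if len(tied) > 1:
        tied = keep_best(tied, lambda c: shared_chars(c.word, text))
    if len(tied) > 1:
        tied = keep_best(tied, lambda c: -abs(len(c.word) - len(text)))
    # Last resort: random choice among whatever is still tied.
    return tied[0] if len(tied) == 1 else rng.choice(tied)


tied = [Candidate("张三", 1.0), Candidate("张山", 1.0), Candidate("章三", 0.5)]
winner = break_tie(tied, text="找张山")
```

In this example the first criterion removes "章三", and the second selects "张山" because it shares two characters with the text while "张三" shares only one.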
5. The text correction method of claim 2 or 3, wherein the pronunciation similarity score of the pinyin is directly calculated by a pronunciation similarity algorithm, the pronunciation similarity algorithm comprising a DIMSIM algorithm.
6. The text error correction method according to claim 2 or 3, wherein the pronunciation score of the error correction word is compared with a preset threshold:
if the pronunciation score of the error correction word is lower than the preset threshold, the step of replacing the replacement word with the error correction word is not executed, and the text to be corrected is output directly;
if the pronunciation score of the error correction word is higher than or equal to the preset threshold, the replacement word is replaced with the error correction word and the error-corrected text is output.
7. The text error correction method according to claim 6, wherein the preset threshold value ranges from 0.5 to 1.
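The threshold gate of claims 6 and 7 amounts to a single comparison before replacement. A minimal sketch follows, assuming the replacement word has already been located in the text. The function name `apply_correction`, the default threshold of 0.8 and the choice to replace only the first occurrence are illustrative assumptions; only the 0.5 to 1 range comes from claim 7.

```python
def apply_correction(text, replacement_word, correction_word, score, threshold=0.8):
    """Replace replacement_word with correction_word only if score meets the threshold."""
    if not 0.5 <= threshold <= 1.0:
        # Claim 7 restricts the preset threshold to the range 0.5 to 1.
        raise ValueError("threshold must lie between 0.5 and 1")
    if score < threshold:
        # Below the threshold: output the text to be corrected unchanged.
        return text
    # At or above the threshold: output the error-corrected text.
    return text.replace(replacement_word, correction_word, 1)


corrected = apply_correction("我要找章三", "章三", "张三", score=0.9)
unchanged = apply_correction("我要找章三", "章三", "张三", score=0.6)
```

A higher threshold makes the system more conservative, trading missed corrections for fewer wrong replacements.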
8. A text correction system, comprising the following modules:
the voice recognition module is used for receiving the user's voice input and acquiring the speech recognition text and the pinyin of the text to be corrected;
the user word bank storage module is used for storing and extracting user words and user word pinyin, calculating the number of the user word pinyin and calculating the number of the user words;
the pronunciation score calculation module is used for calculating the pronunciation score of the pinyin of each candidate word and recording the scoring path of the accumulated pronunciation score;
and the text replacement module is used for text replacement.
9. The text correction system of claim 8, further comprising a pinyin-score computation module for computing a pinyin score for each user vocabulary.
10. An electronic device comprising a memory and a processor, the memory and the processor being connected;
the memory is used for storing programs;
the processor calls a program stored in the memory to perform the method of any of claims 1-4.
CN202011641951.9A 2020-12-31 2020-12-31 Text error correction method and system and electronic equipment Pending CN112863516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011641951.9A CN112863516A (en) 2020-12-31 2020-12-31 Text error correction method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011641951.9A CN112863516A (en) 2020-12-31 2020-12-31 Text error correction method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN112863516A true CN112863516A (en) 2021-05-28

Family

ID=76000837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011641951.9A Pending CN112863516A (en) 2020-12-31 2020-12-31 Text error correction method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN112863516A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571051A (en) * 2021-06-11 2021-10-29 天津大学 Voice recognition system and method for lip voice activity detection and result error correction
CN113779972A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Speech recognition error correction method, system, device and storage medium
CN115221866A (en) * 2022-06-23 2022-10-21 平安科技(深圳)有限公司 Method and system for correcting spelling of entity word

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750672A (en) * 2013-12-27 2015-07-01 重庆新媒农信科技有限公司 Chinese word error correction method used in search and device thereof
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 A kind of text similarity determines method
CN107301865A (en) * 2017-06-22 2017-10-27 海信集团有限公司 A kind of method and apparatus for being used in phonetic entry determine interaction text
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
US20180349380A1 (en) * 2015-09-22 2018-12-06 Nuance Communications, Inc. Systems and methods for point-of-interest recognition
CN109920431A (en) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN110032722A (en) * 2018-01-12 2019-07-19 北京京东尚科信息技术有限公司 Text error correction method and device
KR20200030354A (en) * 2018-09-12 2020-03-20 주식회사 한글과컴퓨터 Voice recognition processing device for performing a correction process of the voice recognition result based on the user-defined words and operating method thereof
CN111444705A (en) * 2020-03-10 2020-07-24 中国平安人寿保险股份有限公司 Error correction method, device, equipment and readable storage medium
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750672A (en) * 2013-12-27 2015-07-01 重庆新媒农信科技有限公司 Chinese word error correction method used in search and device thereof
US20180349380A1 (en) * 2015-09-22 2018-12-06 Nuance Communications, Inc. Systems and methods for point-of-interest recognition
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text
CN107273359A (en) * 2017-06-20 2017-10-20 北京四海心通科技有限公司 A kind of text similarity determines method
CN107301865A (en) * 2017-06-22 2017-10-27 海信集团有限公司 A kind of method and apparatus for being used in phonetic entry determine interaction text
CN110032722A (en) * 2018-01-12 2019-07-19 北京京东尚科信息技术有限公司 Text error correction method and device
KR20200030354A (en) * 2018-09-12 2020-03-20 주식회사 한글과컴퓨터 Voice recognition processing device for performing a correction process of the voice recognition result based on the user-defined words and operating method thereof
CN109920431A (en) * 2019-03-05 2019-06-21 百度在线网络技术(北京)有限公司 Method and apparatus for output information
CN111444705A (en) * 2020-03-10 2020-07-24 中国平安人寿保险股份有限公司 Error correction method, device, equipment and readable storage medium
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
段建勇;关晓龙;: "基于统计和特征相结合的查询纠错方法研究", 现代图书情报技术, no. 02 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571051A (en) * 2021-06-11 2021-10-29 天津大学 Voice recognition system and method for lip voice activity detection and result error correction
CN113779972A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Speech recognition error correction method, system, device and storage medium
CN113779972B (en) * 2021-09-10 2023-09-15 平安科技(深圳)有限公司 Speech recognition error correction method, system, device and storage medium
CN115221866A (en) * 2022-06-23 2022-10-21 平安科技(深圳)有限公司 Method and system for correcting spelling of entity word
CN115221866B (en) * 2022-06-23 2023-07-18 平安科技(深圳)有限公司 Entity word spelling error correction method and system

Similar Documents

Publication Publication Date Title
CN112863516A (en) Text error correction method and system and electronic equipment
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
JP3581401B2 (en) Voice recognition method
US8050909B2 (en) Apparatus and method for post-processing dialogue error in speech dialogue system using multilevel verification
US9201862B2 (en) Method for symbolic correction in human-machine interfaces
US9564127B2 (en) Speech recognition method and system based on user personalized information
US8346553B2 (en) Speech recognition system and method for speech recognition
CN109522397B (en) Information processing method and device
CN112287680B (en) Entity extraction method, device and equipment of inquiry information and storage medium
CN110717021B (en) Input text acquisition and related device in artificial intelligence interview
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN112231451B (en) Reference word recovery method and device, conversation robot and storage medium
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN111613214A (en) Language model error correction method for improving voice recognition capability
CN111274785A (en) Text error correction method, device, equipment and medium
Jyothi et al. Transcribing continuous speech using mismatched crowdsourcing.
CN112101032A (en) Named entity identification and error correction method based on self-distillation
CN111883137A (en) Text processing method and device based on voice recognition
CN110674276A (en) Robot self-learning method, robot terminal, device and readable storage medium
CN110880317A (en) Intelligent punctuation method and device in voice recognition system
US20210158804A1 (en) System and method to improve performance of a speech recognition system by measuring amount of confusion between words
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
CN110413750B (en) Method and device for recalling standard questions according to user questions
CN112151021A (en) Language model training method, speech recognition device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination