CN113012705A - Error correction method and device for voice text - Google Patents
Error correction method and device for voice text
- Publication number
- CN113012705A (application CN202110206015.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- correcting
- voice text
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the application provides a method and a device for correcting a voice text. The method judges, by using a word error detector, whether an abnormal unit exists in the voice text extracted from voice data; if so, a candidate error-correcting word whose edit distance to the abnormal unit is smaller than an edit distance threshold is selected from an error-correcting word reference library, and the abnormal unit is replaced with the candidate error-correcting word. If no abnormal unit exists, the voice text is determined to be correct. Because the method and the device are based on the created word error detector, they can avoid the situation in which a speech recognition product cannot recognize a user's speech due to the user's personal pronunciation habits, improving the user experience.
Description
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for correcting a speech text.
Background
With the continuous development of speech recognition technology and the smart home, speech recognition is now widely applied: users can control air conditioners, washing machines, and other appliances by voice. The implementation of a voice function can be summarized as follows: a speech recognition module converts the audio input by the user into text, a semantic analysis module then performs intention classification and content understanding on the text, and finally the text is converted into machine code that the corresponding device hardware can execute, achieving the goal of controlling the device.
During use of the voice function, the voice data input by the user may have non-standard pronunciation, which can cause text recognition errors and ultimately leave the device uncontrollable by voice. For example, some users confuse flat-tongue and retroflex (curled-tongue) consonants, and some users speak with a strong nasal sound.
Although current speech recognition engines are optimized for dialects and similar pronunciations, their error correction relies heavily on observing user operation data and user complaints. Traditional speech recognition technology therefore still cannot avoid situations in which a speech recognition product fails to recognize a user's speech because of the user's personal pronunciation habits, resulting in a poor user experience.
Disclosure of Invention
In order to solve the problem that a speech recognition product cannot recognize a user's speech because of the user's personal pronunciation habits, the application provides an error correction method and an error correction device for a voice text.
In a first aspect, an embodiment of the present application provides a method for correcting a speech text, where the method includes:
extracting a voice text from voice data input by a user; detecting, by using a word error detector, whether an abnormal unit exists in the voice text; when an abnormal unit exists in the voice text, selecting from an error-correcting word reference library a candidate error-correcting word whose edit distance to the abnormal unit is smaller than an edit distance threshold, and replacing the abnormal unit with the candidate error-correcting word, wherein the word error detector is created based on an N-Gram algorithm;

and when no abnormal unit exists in the voice text, determining that the voice text is correct.

In a second aspect, an embodiment of the present application provides a speech text error correction apparatus, including:
a speech text extraction unit for performing: extracting a voice text from voice data input by a user;
an abnormality unit determination unit configured to execute: detecting whether an abnormal unit exists in the voice text by utilizing a word error detector, wherein the word error detector is established based on an N-Gram algorithm;
the candidate error-correcting word selecting unit is used for executing the following steps: when an abnormal unit exists in the voice text, selecting a candidate error-correcting word with an editing distance smaller than an editing distance threshold value from an error-correcting word reference library;
a replacement unit to perform: and replacing the abnormal unit with the candidate error-correcting word.
The technical scheme provided by the application has the following beneficial effects: whether an abnormal unit exists in the voice text extracted from the voice data is judged by using a word error detector; if so, a candidate error-correcting word whose edit distance to the abnormal unit is smaller than an edit distance threshold is selected from an error-correcting word reference library, and the abnormal unit is replaced with the candidate error-correcting word; if not, the voice text is determined to be correct. Based on the created word error detector, the error correction method and device for a voice text provided by the application can avoid the situation in which a speech recognition product cannot recognize a user's speech due to the user's personal pronunciation habits, improving the user experience.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It will be obvious to those skilled in the art that other drawings can be derived from these drawings without any creative effort.
Fig. 1 is a schematic flowchart illustrating a method for correcting a speech text according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating abnormal cell detection provided by an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an edit distance comparison provided by an embodiment of the present application;
FIG. 4 is a diagram illustrating a syntax tree provided by an embodiment of the present application;
FIG. 5 is a diagram illustrating another syntax tree provided by an embodiment of the present application;
fig. 6 shows a schematic diagram of an apparatus for correcting a phonetic text according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Reference throughout this specification to "embodiments," "some embodiments," "one embodiment," or "an embodiment," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, appearances of the phrases "in various embodiments," "in some embodiments," "in at least one other embodiment," or "in an embodiment" or the like throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Thus, the particular features, structures, or characteristics shown or described in connection with one embodiment may be combined, in whole or in part, with the features, structures, or characteristics of one or more other embodiments, without limitation. Such modifications and variations are intended to be included within the scope of the present application.
China is a country of many spoken language varieties. Although the country vigorously promotes Standard Mandarin, its vast territory, numerous ethnic groups, and regional dialects mean that there are a large number of speakers of non-standard Mandarin. Non-standard Mandarin causes little trouble in everyday communication, but in the field of speech recognition the non-standard pronunciation leads to recognition errors, or to situations where the user's speech cannot be recognized at all. For example, dialect influence leaves some users with inaccurate tones, confusion between flat-tongue and retroflex consonants, heavy nasal sounds, and similar problems, which can cause speech recognition errors or even complete failure to recognize the speech.
Illustratively, the wrong text may involve homophones that differ only in tone. For example, when the user says "infrared intrusion detector disarming", the smart device may actually recognize the text as "infrared intrusion detector vehicle room", because "disarming" (chèfáng) and "vehicle room" (chēfáng) are homophones in Chinese that differ only in tone; the wrong voice text is recognized and the related operation cannot be performed.
Although current speech recognition engines include optimization measures for dialects and similar pronunciations, the above problems cannot be effectively avoided, and error correction for speech recognition depends on observing user operation data and user complaints, so the user experience is poor. A method is therefore needed that effectively corrects text mis-recognized because of similar pronunciations.
To solve the above problems, the application provides an error correction method for a voice text that, based on an N-Gram algorithm and an edit distance algorithm, replaces an abnormal unit with a candidate error-correcting word, thereby correcting the voice text and improving the user experience.
Fig. 1 is a flow chart of a method for correcting a speech text, the method comprising the following steps:
step S101 extracts a speech text from speech data input by a user.
Step S102, detecting whether the extracted voice text has abnormal units by using a word error detector, namely whether words which do not accord with the language law exist. The word error detector of the present application is created based on the N-Gram algorithm.
The N-Gram algorithm is based on a statistical language model. Its basic idea is to slide a window of size N over the voice text and intercept fragment sequences of length N. The language model underlying the N-Gram rests on the Markov assumption: the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word. The occurrence probability of an intercepted fragment sequence is the product of the occurrence probabilities of its words, and the probability of each word can be obtained directly from the training corpus.
Specifically, assume a sequence (sentence) composed of m words. A ternary Tri-Gram algorithm can be selected to calculate the sequence probability p(ω_1, ω_2, ω_3, ..., ω_m), where ω_1, ω_2, ω_3, ..., ω_m denote the m words of the sequence. According to the chain rule:

p(ω_1, ω_2, ω_3, ..., ω_m) = p(ω_1) · p(ω_2 | ω_1) · p(ω_3 | ω_1, ω_2) · ... · p(ω_m | ω_1, ..., ω_{m-1})

where p(ω_1) is the probability that ω_1 occurs, p(ω_2 | ω_1) is the probability that ω_2 occurs given that ω_1 occurred, and so on; p(ω_m | ω_1, ..., ω_{m-1}) is the probability that ω_m occurs given that the preceding m-1 words occurred.
Under the Markov assumption, the current word is related only to a limited number of preceding words, which yields:

p(ω_1, ω_2, ω_3, ..., ω_m) = ∏_{i=1}^{m} p(ω_i | ω_{i-n+1}, ..., ω_{i-1})
the embodiment of the application utilizes a ternary Tri-Gram model, namely n is 3. Then, in a given training corpus, converting the probability of the whole sentence into the product of all conditional probabilities by using Bayesian theorem to obtain the following formula:
the conditional probability values can be counted by training corpora, and the probability formula of each word is calculated as follows:
the conversion to the probability formula associated with the statistical result is:
wherein, count (ω)i-2,ωi-1,ωi) Indicates the number of times of simultaneous occurrence of three words, count (ω)i-2,ωi-1) The number of times the first two words representing the ith word occur simultaneously is the number of times all words occur. Obtained finallyAnd (3) the probability of the ith character under the condition that the first two characters of the ith character exist simultaneously, namely the probability of the ith character appearing simultaneously is shown, and whether the situation of the ith character appearing simultaneously accords with the language rule or not is judged according to the obtained probability.
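As a concrete illustration, the following is a minimal Python sketch of this Tri-Gram estimation; the function names and the handling of unseen contexts are illustrative assumptions rather than part of the patent:

```python
from collections import Counter

def train_trigram_counts(corpus):
    """Count trigrams and their two-word contexts over a tokenized corpus;
    `corpus` is assumed to be an iterable of sentences, each a list of words."""
    trigrams, contexts = Counter(), Counter()
    for sentence in corpus:
        for i in range(2, len(sentence)):
            trigrams[tuple(sentence[i - 2:i + 1])] += 1
            contexts[tuple(sentence[i - 2:i])] += 1
    return trigrams, contexts

def trigram_prob(trigrams, contexts, w1, w2, w3):
    """p(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2); a context never
    seen in the training corpus is given probability 0 (an assumption; a
    real system would apply smoothing)."""
    denom = contexts[(w1, w2)]
    return trigrams[(w1, w2, w3)] / denom if denom else 0.0
```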
The word error detector obtained from this algorithm slides over the voice text, framing N words at a time, and judges whether the conditional probability of the N framed words occurring together is greater than or equal to an empirical probability threshold (the probability with which such N-word combinations occur in a large corpus). If the conditional probability is greater than or equal to the empirical probability threshold, the framed N words are determined not to be an abnormal unit; if it is smaller than the threshold, the framed N words are determined to be an abnormal unit.
Illustratively, as shown in the abnormal unit detection diagram of fig. 2, the voice text recognized from the user's voice data is "three-tub washing machine set to the plum detective mode". A word error detector created with the above algorithm slides over this text. Since this embodiment also adopts the Tri-Gram model, the detector frames three words of the voice text at a time, moving from left to right. For example, when "three-tub wash" is framed, the probability of the three words occurring together is calculated; because this probability is above the empirical probability threshold, "three-tub wash" is judged to conform to linguistic regularities and passes normally.

The word error detector then slides to the right. When it reaches "set to plum", the calculated probability of the segment is below the empirical probability threshold, so "set to plum" is judged not to conform to the language model; it does not pass, and the segment is recorded. Continuing to slide right under the same rule, "plum detective" is obtained as the local range in which an error may have occurred, i.e., the abnormal unit.
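A sketch of this sliding detection step is shown below, reusing trigram_prob from the previous sketch; the threshold value and the rule for merging failing windows into one abnormal unit are illustrative assumptions:

```python
def find_abnormal_units(words, trigrams, contexts, threshold=1e-4):
    """Slide a width-3 window over the recognized text and merge the
    windows whose conditional probability falls below the empirical
    threshold into maximal suspect spans (the abnormal units)."""
    flagged = []
    for i in range(2, len(words)):
        p = trigram_prob(trigrams, contexts, words[i - 2], words[i - 1], words[i])
        if p < threshold:
            flagged.append((i - 2, i + 1))  # half-open index range of the window
    merged = []
    for start, end in flagged:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], end)  # overlapping window: extend the span
        else:
            merged.append((start, end))
    return merged  # e.g. the span covering "plum detective" in the example
```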
Step S103: if an abnormal unit exists in the voice text, select from the error-correcting word reference library a candidate error-correcting word whose edit distance to the abnormal unit is smaller than the edit distance threshold, and replace the abnormal unit with the candidate error-correcting word.
The edit distance is a quantitative measure of the difference between two character strings: the minimum number of single-character deletion, insertion, and replacement operations needed to make the two strings equal. The embodiment of the application mainly corrects voice text in Mandarin Chinese. Because character recognition errors are caused by factors such as inaccurate user pronunciation (in particular, confusion of flat-tongue and retroflex consonants and heavy nasal sounds), a pinyin-based edit distance calculation is used.
In some embodiments, the edit distance is calculated as follows: the pinyin syllable of a Chinese character consists of three elements, the initial, the final, and the tone. The initial and the final are each treated as an independent alphabetic string, and their edit distances are obtained in the ordinary string edit distance manner. For example, in line with actual language regularities, the edit distance between the easily confused initials z and zh is set to 1, the finals eng and er, which are not easily confused, are given an edit distance of 2, and a difference in tone contributes an edit distance of 1. Since a pinyin syllable can only vary in these three dimensions, the edit distance of the whole syllable is obtained by summing the edit distances over the three dimensions.
In addition, the embodiment of the application groups the initials and finals whose pronunciations are easily confused, obtaining the initial pairs and final pairs most often confused in practice. For example, because of dialect differences some users do not distinguish l and n, z and zh, or in and ing. The edit distance between such easily confused initial or final pairs is set smaller, for example to 0.5.
When the pinyin syllables of two characters are compared, variations in the three dimensions of initial, final, and tone affect the pinyin similarity differently, and when two or all three dimensions vary at once the similarity difference grows. Therefore, a positive penalty mechanism can be added when the total edit distance is calculated. In practice a tone difference has little influence on similarity, so a small positive penalty value is preset when the tones differ; if the initial or the final changes, a larger positive penalty value is preset; and when the initial and the final both differ, a large positive penalty value needs to be added. Setting positive penalty values allows a more appropriate edit distance threshold to be chosen, and in turn a more appropriate candidate error-correcting word to be selected.
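A minimal sketch of this pinyin-based edit distance follows; the confusion set and every concrete numeric value are illustrative assumptions (the text fixes only a few, such as 0.5 for confusable pairs and 1 for a tone difference):

```python
# Initial/final pairs treated as easily confused; the exact set is an assumption.
CONFUSABLE_PAIRS = {frozenset(p) for p in [
    ("l", "n"), ("z", "zh"), ("c", "ch"), ("s", "sh"),
    ("in", "ing"), ("en", "eng"),
]}

def part_distance(a, b):
    """Distance between two initials or between two finals."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE_PAIRS:
        return 0.5  # easily confused pair, e.g. flat vs. retroflex tongue
    return 2.0      # clearly different sounds, e.g. eng vs. er

def syllable_distance(s1, s2, penalize=True):
    """Edit distance between two pinyin syllables, each an
    (initial, final, tone) triple: the per-dimension distances are summed,
    optionally plus a positive penalty when dimensions co-vary."""
    d_initial = part_distance(s1[0], s2[0])
    d_final = part_distance(s1[1], s2[1])
    d_tone = 0.0 if s1[2] == s2[2] else 1.0
    total = d_initial + d_final + d_tone
    if penalize:
        if d_initial > 0 and d_final > 0:
            total += 1.0    # initial and final both differ: large penalty
        elif d_initial > 0 or d_final > 0:
            total += 0.5    # initial or final differs: larger penalty
        elif d_tone > 0:
            total += 0.2    # tone-only difference: small penalty
    return total
```

With penalize=False the function reproduces the per-character values in the worked example below (a tone-only difference gives 1, the en/eng pair gives 0.5).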
For example, the length of the abnormal unit can be judged first, and error-correcting words of the same length selected from the error-correcting word reference library. The screened error-correcting words are then each compared with the abnormal unit and the edit distance is calculated; when the calculated edit distance is smaller than the edit distance threshold, the error-correcting word is determined to be a candidate error-correcting word. As shown in the edit distance comparison diagram of fig. 3, when the loop reaches "ion steaming", each character of "ion steaming" is compared with the corresponding character of the abnormal unit "plum detective" and the edit distance is calculated.
"Li" and "li" have the same initial and final and differ only in tone, so the edit distance is 1. "Child" and "child" are identical; the edit distance is 0. "Detect" and "steam" have the same initial and the same tone but different finals, and because the finals en and eng are set as an easily confused final pair, the edit distance of "detect" and "steam" is 0.5. The initials of "sound" and "hot" are the same, the finals differ, and the tones are the same, giving an edit distance of 1. In the embodiment of the application, an error-correcting word can be determined as a candidate for the abnormal unit only when the pinyin edit distance of every corresponding character pair is smaller than the edit distance threshold.
After the candidate error-correcting word corresponding to the abnormal unit has been determined from the error-correcting word reference library, the abnormal unit in the voice text is replaced with it. Illustratively, the abnormal unit "plum detective" in the original voice text "three-tub washing machine set to plum detective mode" is replaced with the candidate error-correcting word "ion steaming", yielding the corrected voice text "three-tub washing machine set to ion steaming mode".
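Tying the pieces together, the following sketch shows candidate selection and replacement under the per-character rule; to_syllable is an assumed helper that maps one Chinese character to its (initial, final, tone) triple (a pinyin library could supply it), and the threshold value is illustrative:

```python
def select_candidates(abnormal, reference_library, to_syllable, threshold=1.5):
    """Return the reference-library words that have the same length as the
    abnormal unit and whose every aligned character pair stays under the
    pinyin edit-distance threshold."""
    candidates = []
    for word in reference_library:
        if len(word) != len(abnormal):
            continue  # first screening: lengths must match
        pairwise = [syllable_distance(to_syllable(a), to_syllable(b))
                    for a, b in zip(abnormal, word)]
        if all(d < threshold for d in pairwise):
            candidates.append(word)
    return candidates

def replace_span(text, span, candidate):
    """Substitute a chosen candidate for the abnormal span (start, end)."""
    start, end = span
    return text[:start] + candidate + text[end:]
```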
Step S104: if no abnormal unit exists in the voice text, determine that the voice text is correct.
In some embodiments, the error correction process may itself introduce errors, and multiple candidate error-correcting words may be obtained from the error-correcting word reference library. To handle both situations, the embodiments of the application can check the corrected voice text by using a probabilistic context-free grammar.
The corrected voice text is checked with the probabilistic context-free grammar through the following specific steps:

A statistical syntactic analysis model of the probabilistic context-free grammar, i.e., a grammar tree model, is trained using corpora from the home domain together with self-defined vocabulary labels and grammar rules. The corrected voice text is input into the grammar tree model; if the corrected voice text can generate a complete grammar tree according to the model, it is determined to be correct, and if it cannot generate a complete grammar tree, it is determined to be incorrect.
For example, the corrected voice text "three-tub washing machine set to ion steaming mode" can generate the complete grammar tree shown in fig. 4 and can therefore be determined to be correct. By contrast, the pre-correction voice text "three-tub washing machine set to plum detective mode" generates only the incomplete grammar tree shown in fig. 5, so that voice text can be determined to be incorrect.
In some embodiments, if a plurality of candidate error-correcting words are selected from the error-correcting word reference library, a plurality of corrected voice texts are generated, one for each candidate. If several of the corrected voice texts can each generate a complete grammar tree according to the trained grammar tree model, the probabilities of the corresponding grammar trees are calculated, and the voice text whose grammar tree has the highest probability is determined to be the final corrected voice text. In addition, if several grammar trees share the highest probability, the voice text whose grammar tree has fewer levels and a simpler structure is selected as the final corrected voice text.
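A minimal sketch of this grammar tree check using NLTK's PCFG support follows; the toy grammar, its vocabulary, and the example token sequences are illustrative assumptions standing in for the patent's trained home-domain model:

```python
import nltk

# Toy home-appliance grammar; rules and probabilities are assumptions.
grammar = nltk.PCFG.fromstring("""
    S -> DEV VP [1.0]
    DEV -> 'washer' [1.0]
    VP -> V MODE [1.0]
    V -> 'set' [1.0]
    MODE -> 'steam' [0.7] | 'rinse' [0.3]
""")
parser = nltk.ViterbiParser(grammar)

def best_correction(candidate_texts):
    """Keep only the candidate texts (token lists) that yield a complete
    parse tree and return the one whose most probable tree is likeliest."""
    best, best_prob = None, 0.0
    for tokens in candidate_texts:
        try:
            trees = list(parser.parse(tokens))
        except ValueError:   # a token falls outside the grammar's vocabulary
            continue
        if not trees:        # no complete grammar tree: reject this text
            continue
        prob = trees[0].prob()  # ViterbiParser yields the most probable tree
        if prob > best_prob:
            best, best_prob = tokens, prob
    return best

print(best_correction([["washer", "set", "steam"], ["washer", "set", "rinse"]]))
# -> ['washer', 'set', 'steam']
```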
An embodiment of the present application provides an apparatus for correcting a speech text, which is used to execute the embodiment corresponding to fig. 1, and as shown in fig. 6, the apparatus for correcting a speech text includes:
a speech text extraction unit 201 for performing: extracting a voice text from voice data input by a user;
an abnormal unit judgment unit 202 for performing: judging whether an abnormal unit exists in the voice text by utilizing a word error detector, wherein the word error detector is established based on an N-Gram algorithm;
a candidate error correction word selecting unit 203 for performing: when an abnormal unit exists in the voice text, selecting a candidate error-correcting word with an editing distance smaller than an editing distance threshold value from an error-correcting word reference library;
a replacement unit 204 for performing: and replacing the abnormal unit with the candidate error-correcting word.
In some embodiments, the word error detector may select N words of the speech text, and the abnormal unit determining unit is specifically configured to perform:
sliding the word error detector on the voice text, and determining that the N words in the word error detector are not abnormal units when the conditional probability that the N words in the word error detector occur simultaneously is greater than or equal to an empirical probability threshold;
and when the conditional probability of the simultaneous occurrence of the N words in the word error detector is smaller than the empirical probability threshold, determining the N words in the word error detector as abnormal units.
In some embodiments, the apparatus for correcting a voice text of the present application further comprises:

a verification unit 205, configured to perform: after the abnormal unit is replaced with the candidate error-correcting word, checking the corrected voice text by using a probabilistic context-free grammar.
In some embodiments, the verification unit is specifically configured to perform: executing grammar tree generation processing on the corrected voice text according to the trained grammar tree model, and determining that the corrected voice text is correct when the corrected voice text can generate a complete grammar tree according to the trained grammar tree model;

and determining that the corrected voice text is incorrect when the corrected voice text cannot generate a complete grammar tree according to the trained grammar tree model.
What has been described above includes examples of implementations of the invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the claimed subject matter, but it is to be appreciated that many further combinations and permutations of the subject innovation are possible. Accordingly, the claimed subject matter is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Moreover, the foregoing description of illustrated implementations of the present application, including what is described in the "abstract," is not intended to be exhaustive or to limit the disclosed implementations to the precise forms disclosed. While specific implementations and examples are described herein for illustrative purposes, various modifications are possible which are considered within the scope of such implementations and examples, as those skilled in the relevant art will recognize.
In particular and in regard to the various functions performed by the above described components, devices, circuits, systems and the like, the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary aspects of the claimed subject matter. In this regard, it will also be recognized that the innovation includes a system as well as a computer-readable storage medium having computer-executable instructions for performing the acts and/or events of the various methods of the claimed subject matter.
The above-described systems/circuits/modules have been described with respect to interaction between several components/blocks. It can be appreciated that such systems/circuits and components/blocks can include those components or specified sub-components, some of the specified components or sub-components, and/or additional components, and in various permutations and combinations of the above. Sub-components can also be implemented as components communicatively coupled to other components rather than included within parent components (hierarchical). Additionally, it should be noted that one or more components may be combined into a single component providing aggregate functionality or divided into several separate sub-components, and any one or more middle layers (e.g., a management layer) may be provided to communicatively couple to such sub-components in order to provide comprehensive functionality. Any components described herein may also interact with one or more other components not specifically described herein but known to those of skill in the art.
Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Moreover, all ranges disclosed herein are to be understood to encompass any and all subranges subsumed therein. For example, a range of "less than or equal to 11" can include any and all subranges between (and including) the minimum value of zero and the maximum value of 11, i.e., any and all subranges have a minimum value equal to or greater than zero and a maximum value of equal to or less than 11 (e.g., 1 to 5). In some cases, the values as described for the parameters can have negative values.
In addition, while a particular feature of the subject innovation may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "including," "has," "incorporates," variants thereof, and other similar words are used in either the detailed description or the claims, these terms are intended to be inclusive in a manner similar to the term "comprising" as an open transition word without precluding any additional or other elements.
Reference throughout this specification to "one implementation" or "an implementation" means that a particular feature, structure, or characteristic described in connection with the implementation is included in at least one implementation. Thus, the appearances of the phrases "in one implementation" or "in an implementation" in various places throughout this specification are not necessarily all referring to the same implementation. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more implementations.
Furthermore, reference throughout this specification to "an item" or "a file" means that a particular structure, feature, or object described in connection with the implementation is not necessarily the same object. Further, "file" or "item" can refer to objects in various formats.
The terms "node," "component," "module," "system," and the like as used herein are generally intended to refer to a computer-related entity, either hardware (e.g., circuitry), a combination of hardware and software, or an entity associated with an operating machine having one or more specific functionalities. For example, a component may be, but is not limited to being, a process running on a processor (e.g., a digital signal processor), a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. Although individual components are depicted in various implementations, it is to be appreciated that the components can be represented using one or more common components. Further, the design of each implementation can include different component placements, component selections, etc. to achieve optimal performance. Furthermore, "means" can take the form of specially designed hardware; generalized hardware specialized by the execution of software thereon (which enables the hardware to perform specific functions); software stored on a computer readable medium; or a combination thereof.
Moreover, the word "exemplary" or "exemplary" is used herein to mean "serving as an example, instance, or illustration". Any aspect or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word "exemplary" or "exemplary" is intended to present concepts in a concrete fashion. As used herein, the term "or" is intended to mean including "or" rather than exclusive "or". That is, unless specified otherwise, or clear from context, "X employs A or B" is intended to mean that it naturally includes either of the substitutions. That is, if X employs A; x is B; or X employs both A and B, then "X employs A or B" is satisfied under any of the above examples. In addition, the articles "a" and "an" as used in this application and the appended claims should generally be construed to mean "one or more" unless specified otherwise or clear from context to be directed to a singular form.
Claims (10)
1. A method for correcting a speech text, comprising:
extracting a voice text from voice data input by a user; detecting, by using a word error detector, whether an abnormal unit exists in the voice text; when an abnormal unit exists in the voice text, selecting from an error-correcting word reference library a candidate error-correcting word whose edit distance to the abnormal unit is smaller than an edit distance threshold, and replacing the abnormal unit with the candidate error-correcting word, wherein the word error detector is created based on an N-Gram algorithm;
and when the abnormal unit does not exist in the voice text, determining that the voice text is correct.
2. The method for correcting a voice text according to claim 1, wherein the word error detector is capable of framing N words of the voice text, and the step of detecting whether an abnormal unit exists in the voice text by using the word error detector comprises:
sliding the word error detector on the voice text, and determining that the N words in the word error detector are not abnormal units when the conditional probability that the N words in the word error detector occur simultaneously is greater than or equal to an empirical probability threshold;
and when the conditional probability of the simultaneous occurrence of the N words in the word error detector is smaller than the empirical probability threshold, determining the N words in the word error detector as abnormal units.
3. The method for correcting a speech text according to claim 1, wherein the edit distance between each word in the candidate error-correcting word and each corresponding word in the abnormal unit is smaller than the edit distance threshold.
4. The method of claim 3, wherein the phonetic text is Chinese, and the step of calculating the edit distance comprises:

comparing, in each of the three dimensions of initial, final, and tone, the pinyin syllables of the two characters whose edit distance is to be calculated, calculating the edit distance in each of the three dimensions, and summing the edit distances of the three dimensions to obtain the edit distance of the two characters.
5. The method for correcting a speech text according to claim 1, wherein after replacing the abnormal unit with the candidate error-correcting word, the method further comprises:
and checking the corrected voice text by utilizing a probability context-free grammar.
6. The method for correcting the voice text according to claim 5, wherein the step of checking the corrected voice text by using the probabilistic context-free grammar comprises the following steps:
executing grammar tree generation processing on the corrected voice text according to the trained grammar tree model, and determining that the corrected voice text is correct when the corrected voice text can generate a complete grammar tree according to the trained grammar tree model;
and when the corrected voice text can not generate a complete grammar tree according to the trained grammar tree model, determining that the corrected voice text is incorrect.
7. The method according to claim 6, wherein a plurality of candidate error-correcting words selected from the error-correcting word reference library and having an editing distance with respect to the abnormal unit smaller than an editing distance threshold are obtained, and the abnormal unit is replaced with the plurality of candidate error-correcting words, so as to obtain a plurality of corrected voice texts;
and when the plurality of corrected voice texts can generate a complete grammar tree according to the trained grammar tree model, calculating the probabilities of the plurality of generated grammar trees, and determining the voice text corresponding to the grammar tree with the highest probability as the voice text after final error correction.
8. An apparatus for correcting a speech text, comprising:
a speech text extraction unit for performing: extracting a voice text from voice data input by a user;
an abnormality unit determination unit configured to execute: detecting whether an abnormal unit exists in the voice text by utilizing a word error detector, wherein the word error detector is established based on an N-Gram algorithm;
the candidate error-correcting word selecting unit is used for executing the following steps: when an abnormal unit exists in the voice text, selecting a candidate error-correcting word with an editing distance smaller than an editing distance threshold value from an error-correcting word reference library;
a replacement unit to perform: and replacing the abnormal unit with the candidate error-correcting word.
9. The apparatus according to claim 8, wherein the word error detector is capable of selecting N words in the phonetic text, and the abnormal unit determining unit is specifically configured to perform:
sliding the word error detector on the voice text, and determining that the N words in the word error detector are not abnormal units when the conditional probability that the N words in the word error detector occur simultaneously is greater than or equal to an empirical probability threshold;
and when the conditional probability of the simultaneous occurrence of the N words in the word error detector is smaller than the empirical probability threshold, determining the N words in the word error detector as abnormal units.
10. The apparatus for correcting an error of a phonetic text according to claim 8, wherein the verification unit is specifically configured to perform:

executing grammar tree generation processing on the corrected voice text according to the trained grammar tree model, and determining that the corrected voice text is correct when the corrected voice text can generate a complete grammar tree according to the trained grammar tree model; and determining that the corrected voice text is incorrect when the corrected voice text cannot generate a complete grammar tree according to the trained grammar tree model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110206015.3A CN113012705B (en) | 2021-02-24 | 2021-02-24 | Error correction method and device for voice text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110206015.3A CN113012705B (en) | 2021-02-24 | 2021-02-24 | Error correction method and device for voice text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113012705A (en) | 2021-06-22 |
CN113012705B (en) | 2022-12-09 |
Family
ID=76385595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110206015.3A (granted as CN113012705B, active) | Error correction method and device for voice text | 2021-02-24 | 2021-02-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113012705B (en) |
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014077865A (en) * | 2012-10-10 | 2014-05-01 | Nippon Hoso Kyokai <Nhk> | Speech recognition device, error correction model learning method and program |
CN104252484A (en) * | 2013-06-28 | 2014-12-31 | 重庆新媒农信科技有限公司 | Pinyin error correction method and system |
CN104485106A (en) * | 2014-12-08 | 2015-04-01 | 畅捷通信息技术股份有限公司 | Voice recognition method, voice recognition system and voice recognition equipment |
WO2017084506A1 (en) * | 2015-11-17 | 2017-05-26 | 华为技术有限公司 | Method and device for correcting search query term |
CN105869642A (en) * | 2016-03-25 | 2016-08-17 | 海信集团有限公司 | Voice text error correction method and device |
CN106528532A (en) * | 2016-11-07 | 2017-03-22 | 上海智臻智能网络科技股份有限公司 | Text error correction method and device and terminal |
CN107633250A (en) * | 2017-09-11 | 2018-01-26 | 畅捷通信息技术股份有限公司 | A kind of Text region error correction method, error correction system and computer installation |
CN108304385A (en) * | 2018-02-09 | 2018-07-20 | 叶伟 | A kind of speech recognition text error correction method and device |
CN109065054A (en) * | 2018-08-31 | 2018-12-21 | 出门问问信息科技有限公司 | Speech recognition error correction method, device, electronic equipment and readable storage medium storing program for executing |
CN110008471A (en) * | 2019-03-26 | 2019-07-12 | 北京博瑞彤芸文化传播股份有限公司 | A kind of intelligent semantic matching process based on phonetic conversion |
CN110442870A (en) * | 2019-08-02 | 2019-11-12 | 深圳市珍爱捷云信息技术有限公司 | Text error correction method, device, computer equipment and storage medium |
CN110765763A (en) * | 2019-09-24 | 2020-02-07 | 金蝶软件(中国)有限公司 | Error correction method and device for speech recognition text, computer equipment and storage medium |
CN111062376A (en) * | 2019-12-18 | 2020-04-24 | 厦门商集网络科技有限责任公司 | Text recognition method based on optical character recognition and error correction tight coupling processing |
CN111274785A (en) * | 2020-01-21 | 2020-06-12 | 北京字节跳动网络技术有限公司 | Text error correction method, device, equipment and medium |
CN111312209A (en) * | 2020-02-21 | 2020-06-19 | 北京声智科技有限公司 | Text-to-speech conversion processing method and device and electronic equipment |
CN111814455A (en) * | 2020-06-29 | 2020-10-23 | 平安国际智慧城市科技股份有限公司 | Search term error correction pair construction method, terminal and storage medium |
CN112183073A (en) * | 2020-11-27 | 2021-01-05 | 北京擎盾信息科技有限公司 | Text error correction and completion method suitable for legal hot-line speech recognition |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516966A (en) * | 2021-06-24 | 2021-10-19 | 肇庆小鹏新能源投资有限公司 | Voice recognition defect detection method and device |
WO2023205132A1 (en) * | 2022-04-21 | 2023-10-26 | Google Llc | Machine learning based context aware correction for user input recognition |
KR102517661B1 (en) * | 2022-07-15 | 2023-04-04 | 주식회사 액션파워 | Method for identify a word corresponding to a target word in text information |
CN116052657A (en) * | 2022-08-01 | 2023-05-02 | 荣耀终端有限公司 | Character error correction method and device for voice recognition |
CN116052657B (en) * | 2022-08-01 | 2023-10-20 | 荣耀终端有限公司 | Character error correction method and device for voice recognition |
Also Published As
Publication number | Publication date |
---|---|
CN113012705B (en) | 2022-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113012705B (en) | Error correction method and device for voice text | |
CN112149406B (en) | Chinese text error correction method and system | |
CN111369996B (en) | Speech recognition text error correction method in specific field | |
US8185376B2 (en) | Identifying language origin of words | |
Klejch et al. | Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features | |
KR101590724B1 (en) | Method for modifying error of speech recognition and apparatus for performing the method | |
US20040039570A1 (en) | Method and system for multilingual voice recognition | |
CN112199945A (en) | Text error correction method and device | |
US8738378B2 (en) | Speech recognizer, speech recognition method, and speech recognition program | |
CN112489626B (en) | Information identification method, device and storage medium | |
US11417322B2 (en) | Transliteration for speech recognition training and scoring | |
Lee et al. | Corrective and reinforcement learning for speaker-independent continuous speech recognition | |
US20150179169A1 (en) | Speech Recognition By Post Processing Using Phonetic and Semantic Information | |
CN112489655B (en) | Method, system and storage medium for correcting voice recognition text error in specific field | |
CN115965009A (en) | Training and text error correction method and device for text error correction model | |
CN114817465A (en) | Entity error correction method and intelligent device for multi-language semantic understanding | |
EP1887562B1 (en) | Speech recognition by statistical language model using square-root smoothing | |
CN115457938A (en) | Method, device, storage medium and electronic device for identifying awakening words | |
Marin et al. | Using syntactic and confusion network structure for out-of-vocabulary word detection | |
KR102204395B1 (en) | Method and system for automatic word spacing of voice recognition using named entity recognition | |
Arslan et al. | Detecting and correcting automatic speech recognition errors with a new model | |
CN110929514A (en) | Text proofreading method and device, computer readable storage medium and electronic equipment | |
KR20240137615A (en) | Detecting incorrect suggestions for user-provided content | |
EP3877973A1 (en) | Transliteration for speech recognition training and scoring | |
CN113536776B (en) | Method for generating confusion statement, terminal device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |