CN114530145A - Speech recognition result error correction method and device, and computer readable storage medium - Google Patents


Info

Publication number
CN114530145A
Authority
CN
China
Prior art keywords
syllable
pinyin sequence
pinyin
confusion
user
Prior art date
Legal status
Granted
Application number
CN202011322395.9A
Other languages
Chinese (zh)
Other versions
CN114530145B (en)
Inventor
胡洪涛
徐景成
胡珉
朱耀磷
李想
彭成高
黄毅
李赫男
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Internet Co Ltd
Priority to CN202011322395.9A
Publication of CN114530145A
Application granted
Publication of CN114530145B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/10 — Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiment of the application discloses a speech recognition result error correction method and device, and a computer-readable storage medium. The method comprises the following steps: converting the text of a speech recognition result corresponding to the speech output by a user into a corresponding first pinyin sequence; determining the similarity between each of a plurality of second pinyin sequences in a predetermined corpus and the first pinyin sequence, according to the edit distance between the first pinyin sequence and each second pinyin sequence and the confusion probability that the user confuses each first syllable in the first pinyin sequence into the corresponding second syllable in the second pinyin sequence; and correcting the speech recognition result of the user based on the second pinyin sequence with the highest similarity. The scheme of the embodiment of the application can improve the error correction capability of speech recognition.

Description

Speech recognition result error correction method and device, and computer readable storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a method and an apparatus for correcting a speech recognition result, and a computer-readable storage medium.
Background
With the development of communication technology, the high-speed, low-latency 5G era has arrived, social demand for various artificial intelligence skills keeps growing, and speech recognition is ever more widely applied, in robots, smart speakers, voice assistants, and the like. The accuracy of speech recognition has therefore become an important factor affecting intelligent systems: the accuracy of the speech recognition result affects not only the naturalness and fluency of a conversation but also the accuracy of downstream natural language processing.
Because the speech recognition link generally uses general-purpose language corpora, common expressions and proper nouns in a specific scenario are difficult to recognize. Many systems therefore use known information to correct the speech recognition result before natural language processing, so as to improve the natural language processing effect. At present, the common speech recognition error correction approach is to manually build an error correction lexicon and maintain a dictionary of common speech recognition errors: when a word in the lexicon is encountered, the input is modified, according to user-defined rules, into the presumed-correct result mapped to it in the error correction lexicon.
However, existing speech recognition error correction schemes ignore the differences between users or user groups. Due to region, educational background, physiological characteristics, and other factors, each person has certain pronunciation habits: some people cannot distinguish "l" and "n", some cannot distinguish flat and retroflex tongue sounds, some cannot distinguish front and back nasal finals, some add retroflex (erhua) endings, some do not clearly distinguish "r" and "n", and so on. These personalized pronunciation habits are important information in the error correction process, yet existing schemes do not distinguish them, so their speech recognition error correction capability is low.
Disclosure of Invention
An embodiment of the present application provides a method and an apparatus for correcting a speech recognition result, and a computer-readable storage medium, so as to solve the problem of low error correction capability of the existing speech recognition result.
In order to solve the above technical problem, the embodiments of the present application are implemented as follows:
in a first aspect, a method for correcting errors of speech recognition results is provided, including: converting the text of a speech recognition result corresponding to the speech output by a user into a corresponding first pinyin sequence; determining the similarity between each of a plurality of second pinyin sequences in a predetermined corpus and the first pinyin sequence, according to the edit distance between the first pinyin sequence and each second pinyin sequence and the confusion probability that the user confuses each first syllable in the first pinyin sequence into the corresponding second syllable in the second pinyin sequence; and correcting the speech recognition result of the user based on the second pinyin sequence with the highest similarity.
Optionally, determining the similarity between the plurality of second pinyin sequences and the first pinyin sequence according to the edit distance between the first pinyin sequence and the plurality of second pinyin sequences in the predetermined corpus and the confusion probability that the user confuses each first syllable in the first pinyin sequence into the corresponding second syllable in the second pinyin sequence includes: respectively calculating the edit distance between the first pinyin sequence and each of the plurality of second pinyin sequences, the edit distance being the number of operation steps for editing the first pinyin sequence into the corresponding second pinyin sequence; determining the similarity between each second pinyin sequence and the first pinyin sequence according to the difference obtained by subtracting, from the edit distance corresponding to that second pinyin sequence, the confusion probabilities of the first syllables in the first pinyin sequence with respect to the corresponding second syllables; and determining the second pinyin sequence with the highest similarity as the second pinyin sequence corresponding to the smallest difference.
Optionally, the method further includes: recording the corresponding editing relation between each first syllable in the first pinyin sequence edited in the operation step and a second syllable corresponding to the first syllable in the plurality of second pinyin sequences; and when the highest similarity in the plurality of second pinyin sequences is larger than a first preset threshold value, acquiring the frequency of confusing each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence with the highest similarity according to the corresponding editing relation so as to determine the confusing probability of each first syllable in the first pinyin sequence.
Optionally, determining the confusion probability of each first syllable in the first pinyin sequence includes: initializing a confusion matrix for the user, wherein the confusion matrix is a (k+1)-dimensional square matrix whose rows and columns are formed by a predetermined number k of pinyin syllables plus 1 null syllable, and an element value m_ij in the confusion matrix represents the confusion probability of identifying the ith syllable as the jth syllable, the confusion matrix being used in determining the highest similarity among the plurality of second pinyin sequences; determining the current confusion probability of each first syllable in the first pinyin sequence according to the frequency with which that first syllable is confused into the corresponding second syllable in the second pinyin sequence with the highest similarity and the number of times the user's corresponding editing relation has been recorded; and iteratively updating the confusion probabilities of the corresponding element values in the user's confusion matrix with the current confusion probability of each first syllable in the first pinyin sequence.
Optionally, the method further includes: judging whether the confusion probability of the confusion matrix of the user is converged after each iteration update; in the case that the confusion probabilities converge, combining the confusion matrix of the user with the confusion matrices of other users having converging confusion probabilities into a common confusion matrix.
Optionally, merging the confusion matrix of the user and the confusion matrix of other users with the converged confusion probability into a common confusion matrix, including: calculating the similarity between the confusion matrix of the user and the confusion matrices of the other users; and when the similarity is greater than a second preset threshold value, carrying out weighted average calculation on each element included by the confusion matrix of the user and the confusion matrices of the other users to obtain the common confusion matrix.
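The merge described in this optional step can be sketched as follows. The text does not fix a particular matrix similarity metric, so the mean-absolute-difference metric, the threshold value, and the equal weights below are all illustrative assumptions:

```python
def matrix_similarity(a, b):
    # One illustrative similarity measure: 1 minus the mean absolute
    # element-wise difference between two same-sized square matrices.
    n = len(a)
    total = sum(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))
    return 1.0 - total / (n * n)

def merge_confusion_matrices(user_m, other_m, threshold=0.9, w_user=0.5):
    # Element-wise weighted average, performed only when the similarity
    # exceeds the (assumed) second predetermined threshold.
    if matrix_similarity(user_m, other_m) <= threshold:
        return None  # too dissimilar to merge into a common matrix
    n = len(user_m)
    return [[w_user * user_m[i][j] + (1 - w_user) * other_m[i][j]
             for j in range(n)] for i in range(n)]
```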
Optionally, the confusion probability for the user to confuse each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence includes at least one of the following: a probability that a non-null syllable in the first pinyin sequence is confused as a null syllable in the second pinyin sequence; a probability that a null syllable in the first pinyin sequence is confused with a non-null syllable in the second pinyin sequence; a probability that a non-null syllable in the first pinyin sequence is confused with a non-null syllable in the second pinyin sequence.
Optionally, the error correction of the voice recognition result of the user based on the second pinyin sequence with the highest similarity includes: and when the highest similarity in the plurality of second pinyin sequences is larger than a first preset threshold value, determining the second pinyin sequence with the highest similarity as a correct pinyin sequence corresponding to the output voice of the user.
In a second aspect, there is provided a speech recognition result error correction apparatus comprising a memory and a processor electrically connected to the memory, the memory storing a computer program executable by the processor, the computer program, when executed by the processor, implementing the steps of the method according to the first aspect.
In a third aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to the first aspect.
In the embodiment of the application, the text of the speech recognition result corresponding to the speech output by the user is converted into a corresponding first pinyin sequence; the similarity between each of a plurality of second pinyin sequences in the predetermined corpus and the first pinyin sequence is determined according to the edit distance between the first pinyin sequence and each second pinyin sequence and the confusion probability that the user confuses each first syllable in the first pinyin sequence into the corresponding second syllable in the second pinyin sequence; and the speech recognition result of the user is corrected based on the second pinyin sequence with the highest similarity. A personalized confusion matrix is designed for each user to distinguish pronunciation habits, which avoids the large number of erroneous corrections that arise when the distinctive pronunciations of minority groups are assimilated into mainstream pronunciation habits.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart illustrating a speech recognition result error correction method according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of the pinyin sequence similarity determination procedure in the embodiment of the present application.
Fig. 3 is a schematic diagram illustrating an overall example of a speech recognition result error correction method according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a speech recognition result error correction apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The reference numbers in the present application are only used for distinguishing the steps in the scheme and are not used for limiting the execution sequence of the steps, and the specific execution sequence is described in the specification.
In order to solve the problems in the prior art, an embodiment of the present application provides a method for correcting a speech recognition result, as shown in fig. 1, where fig. 1 is a schematic flow diagram of the method for correcting the speech recognition result according to the embodiment of the present application. The method comprises the following steps:
step 102, converting a text of a voice recognition result corresponding to the voice output by the user into a corresponding first pinyin sequence.
Speech recognition takes the user's speech as its object and converts the user's input into corresponding text through speech signal processing and pattern recognition. It mainly comprises preprocessing, feature extraction, an acoustic model, a dictionary, a language model, and decoding: effective features are extracted from the speech to be recognized to form a speech pattern, which is compared with sample patterns stored in the computer's memory and recognized by a pattern classification method. The embodiment of the present application can adopt existing conventional speech recognition technology, which is not described again here.
After the recognized text corresponding to the user's speech is obtained, it is converted into a corresponding pinyin sequence based on a dictionary model. The recognized text is replaced by its pinyin sequence because, in order to correct errors caused by the general language model failing to recognize personalized expressions, the precision of this part can be reduced in exchange for a higher recall rate. The effect of tone is removed (if the data volume is large enough, toned syllables can instead be used as the basic unit when computing the confusion degree).
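As an illustrative sketch (not part of the claimed embodiment), the conversion step can be written as follows. The character table here is a toy stand-in; a real dictionary model (for example, the pypinyin library) would cover the full character set:

```python
# Toy character-to-pinyin table; a real system would use a dictionary model
# covering the full character set. Unknown characters are simply skipped here.
CHAR_TO_PINYIN = {"你": "ni", "好": "hao", "流": "liu", "年": "nian"}

def text_to_pinyin_sequence(text):
    # Tones are dropped, matching the toneless syllable units of this step.
    return [CHAR_TO_PINYIN[ch] for ch in text if ch in CHAR_TO_PINYIN]
```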
And 104, respectively determining the similarity between the plurality of second pinyin sequences and the first pinyin sequence according to the editing distance between the first pinyin sequence and the plurality of second pinyin sequences in the predetermined corpus and the confusion probability of the user for confusing each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence.
The predetermined corpus may include a general corpus and/or a scenario-specific proprietary corpus, and for recognition of expressions of a user in a specific scenario, a small scenario-related proprietary corpus may be used for each scenario in a scenario-based dialog system for more efficiently and correctly recognizing some common expressions and proper nouns in the specific scenario.
In order to calculate the similarity between each pinyin sequence in the corpus and the pinyin sequence corresponding to the user input voice, the absolute difference between the two sequences needs to be calculated by solving the editing distance between the two sequences.
The edit distance between the first pinyin sequence and each second pinyin sequence is the number of operation steps for editing the first pinyin sequence with possibly identified incorrect syllables into the corresponding second pinyin sequence.
For example, suppose a sequence A of 14 syllables is edited into a sequence B of 16 syllables by the edit distance algorithm in 4 steps: adding a syllable "h" after the 2nd syllable "o" of sequence A, then adding "en", modifying the 10th syllable "l" into "n", and modifying the 13th syllable "l" into "n". The edit distance from sequence A to sequence B is then calculated to be 4.
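The plain edit distance used at this point can be sketched as the standard Levenshtein distance over syllable lists (a minimal illustration, not the modified distance introduced later):

```python
def edit_distance(a, b):
    # Classic Levenshtein distance over syllable lists: each insertion,
    # deletion, or substitution of one syllable counts as one operation step.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i syllables
    for j in range(n + 1):
        d[0][j] = j  # insert all j syllables
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete a[i-1]
                          d[i][j - 1] + 1,        # insert b[j-1]
                          d[i - 1][j - 1] + cost) # substitute
    return d[m][n]
```

For instance, editing ["liu", "lian"] into ["niu", "nian"] takes two substitutions (the l/n confusion from the background section), so the distance is 2.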
The confusion probability represents the probability that the user misreads one syllable as another. If, owing to the user's pronunciation habits, the syllable "h" spoken by the user is dropped by the speech recognition system 40% of the time, and the syllable "n" spoken by the user is recognized as "l" as often as 95% of the time, then a similarity measure that fails to reflect this personalized accent will be distorted by the resulting misrecognitions.
In order to correct misrecognitions caused by accent-induced misreading, the method introduces the concept of confusion probability: the probability that the user confuses a syllable in the pinyin sequence corresponding to the output speech into the corresponding syllable in the pinyin sequence of the correctly expressed speech.
The confusion probability of the user confusing each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence comprises at least one of the following: a probability that a non-null syllable in the first pinyin sequence is confused as a null syllable in the second pinyin sequence; a probability that a null syllable in the first pinyin sequence is confused with a non-null syllable in the second pinyin sequence; a probability that a non-null syllable in the first pinyin sequence is confused with a non-null syllable in the second pinyin sequence.
Each first syllable has a probability of being confused into each different second syllable, but these probabilities differ according to the user's pronunciation habits for that first syllable.
Optionally, in step 104, determining the similarity between the plurality of second pinyin sequences and the first pinyin sequence, according to the edit distances between the first pinyin sequence and the second pinyin sequences in the predetermined corpus and the confusion probability that the user confuses each first syllable into the corresponding second syllable, includes the following steps, as shown in fig. 2, which is a schematic flowchart of the pinyin sequence similarity determination step in the embodiment of the present application.
Step 202, respectively calculating the editing distance between the first pinyin sequence and the plurality of second pinyin sequences, wherein the editing distance is the number of operation steps for editing the first pinyin sequence into each corresponding second pinyin sequence;
step 204, determining the similarity between the second pinyin sequence and the first pinyin sequence according to the difference obtained by subtracting the confusion probability of each first syllable in the first pinyin sequence with respect to the second syllable in the second pinyin sequence from the editing distance corresponding to each second pinyin sequence;
and step 206, determining the second pinyin sequence with the highest similarity according to the second pinyin sequence corresponding to the minimum difference.
In the embodiment of the present application, in order to obtain the similarity between two pinyin sequences, the edit distance between the two pinyin sequences and the corresponding confusion probability between two syllables that are misrecognized need to be determined.
As described above, the edit distance is the number of operation steps for editing the first pinyin sequence into each corresponding second pinyin sequence. To determine the confusion probability, a confusion matrix M with syllables as its units is first initialized for the user. Here, a syllable may be any whole syllable formed from the initials and finals in the pinyin table, or a subset of syllables chosen for the user's personalized pronunciation. For a pinyin table with k syllables, the confusion matrix can be set as a (k+1)-dimensional square matrix whose rows and columns are the k syllables plus 1 null syllable, i.e. the abscissa and ordinate of the confusion matrix are all the syllables and 1 null syllable, and the element m_ij is the confusion probability from the ith to the jth syllable. Initially, the confusion probabilities on the diagonal of M are set to 1 and those of the remaining elements to 0. A diagonal element means syllable i is the same as syllable j; its confusion probability of 1 indicates the syllable is recognized as itself, without misreading, in speech recognition. The off-diagonal elements relate two different syllables i and j, and the probability that syllable i is misread as syllable j is initially set to 0, i.e. it is initially assumed with probability 100% that the two syllables are not confused with each other. It should be noted that the recognition probability and the confusion probability are different, inversely related concepts.
In addition, the mapping relation between the null syllable and all non-null syllables is also defined: the element m_pΦ represents the probability that, due to a speech recognition error, a non-null syllable p is misrecognized as the null syllable Φ (the syllable is dropped), while m_Φp indicates that, due to a speech recognition error, the null syllable Φ is recognized as a non-null syllable p, i.e. an extra syllable p appears in the recognition result. The confusion probabilities initially set in the matrix are continuously and iteratively updated as recognition results of the user's output speech are subsequently corrected, and the updated confusion probabilities are in turn used to correct later recognition results, until the confusion probability values no longer change. The following steps describe this in detail.
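The initialization described above can be sketched as follows; the "<null>" marker for the null syllable Φ is an illustrative choice:

```python
def init_confusion_matrix(syllables, null="<null>"):
    # (k+1) x (k+1) matrix over k pinyin syllables plus one null syllable.
    # Diagonal elements start at 1 (a syllable recognized as itself);
    # all other confusion probabilities start at 0.
    labels = list(syllables) + [null]
    index = {s: i for i, s in enumerate(labels)}
    k1 = len(labels)
    m = [[1.0 if i == j else 0.0 for j in range(k1)] for i in range(k1)]
    return m, index
```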
After the initialized confusion matrix is obtained, it can be understood that in step 204 the similarity between the first pinyin sequence corresponding to the user's output speech and each second pinyin sequence in the corpus is calculated using the edit distance and the initially set confusion probabilities. This yields a difference value between the corresponding edit distance and confusion probability: the smaller the difference, the closer and more similar the two sequences. The second pinyin sequence with the highest similarity to the first pinyin sequence can then be determined as the one corresponding to the smallest of the calculated differences.
And 106, correcting the voice recognition result of the user based on the second pinyin sequence with the highest similarity.
Optionally, in step 106, performing error correction on the speech recognition result of the user based on the second pinyin sequence with the highest similarity, including: and when the highest similarity in the plurality of second pinyin sequences is larger than a first preset threshold value, determining the second pinyin sequence with the highest similarity as a correct pinyin sequence corresponding to the output voice of the user.
That is, if the highest similarity between a second pinyin sequence and the first pinyin sequence exceeds the predetermined threshold, that second pinyin sequence is used to recognize and correct the speech output by the user: the first pinyin sequence is modified into it as the correct pinyin sequence corresponding to the user's output speech. If the similarity does not exceed the predetermined threshold, the similarity is considered too low and the error correction operation is not executed.
In an embodiment of the present application, the method for correcting the speech recognition result may further include: recording the corresponding editing relation between each first syllable in the first pinyin sequence edited in the operation step and a second syllable corresponding to the first syllable in the plurality of second pinyin sequences; and when the highest similarity in the plurality of second pinyin sequences is larger than a first preset threshold value, acquiring the frequency of confusing each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence with the highest similarity according to the corresponding editing relation so as to determine the confusing probability of each first syllable in the first pinyin sequence.
As described above, editing the first pinyin sequence into the syllables of each second pinyin sequence requires corresponding editing operation steps (adding, deleting, or modifying one syllable). Each time an operation step is completed, the corresponding editing relation between the first syllable and the second syllable involved in that step can be recorded; for example, when the current operation step modifies the first syllable "l" into the second syllable "n", that editing relation is recorded. The records thus reflect the element-wise correspondence between the two sequences.
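One way to recover these editing relations is to backtrace the standard edit-distance table; the sketch below is illustrative (operation names and the use of None for the null syllable are assumptions), and ties between equal-cost operations are broken in favor of substitution:

```python
def edit_operations(a, b):
    # Fill the standard edit-distance table, then backtrace it to record
    # each (first syllable, second syllable) editing relation.
    # None stands for the null syllable in deletions/insertions.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)):
            if a[i - 1] != b[j - 1]:
                ops.append((a[i - 1], b[j - 1]))  # modify a[i-1] -> b[j-1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((a[i - 1], None))          # delete a[i-1]
            i -= 1
        else:
            ops.append((None, b[j - 1]))          # insert b[j-1]
            j -= 1
    return list(reversed(ops))
```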
The purpose of recording the corresponding editing relations in the sequence operation steps is to obtain the syllable correspondence between the two sequences with the fewest change steps, i.e. the highest similarity, and then, when the highest similarity is greater than the first predetermined threshold, to obtain and update the confusion frequency between syllables through the recorded editing relations. For example, when the first pinyin sequence corresponding to the user's output speech is edited into the second pinyin sequence with the highest similarity, the number of times the syllable "l" is modified into the syllable "n" is the frequency with which that syllable is confused, under the edit distance and confusion probability. The confusion frequency of each first syllable, collected whenever the highest similarity among the second pinyin sequences exceeds the first predetermined threshold and error correction is performed, can be used to calculate the confusion probability between syllables and thus update the confusion matrix.
For example, suppose the syllable "l" in the first pinyin sequences of the user's many utterances is recorded as modified into the syllable "n" 100 times, but error correction according to the second pinyin sequence with the highest similarity was actually performed in only 80 of those cases. The confusion probability may then be (100 − 20)/100 = 80%.
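The worked figure above reduces to a simple ratio; a minimal helper (an illustrative name, not from the original) might look like:

```python
def confusion_probability(times_recorded, times_corrected):
    # Share of recorded l->n edits that were confirmed by an actual error
    # correction, mirroring the (100 - 20)/100 = 80% example above.
    if times_recorded == 0:
        return 0.0
    return times_corrected / times_recorded
```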
As described above, when the highest similarity does not exceed the predetermined threshold, the similarity is considered too low and no error correction is performed, so no syllable confusion can be concluded in that case. That is, although the corresponding editing relationships between the first syllables and the second syllables of the two pinyin sequences are still recorded, the frequency with which the corresponding first syllables are corrected is not collected, because no error correction takes place.
Next, the update of the confusion probability in the confusion matrix will be explained.
Each element of the confusion matrix M is a confusion probability m_ij, the probability of the ith syllable being confused as the jth syllable. The edit distance and the confusion probabilities are combined in a modified text edit distance function D(i, j), where D(i, j) represents the degree of recognition error in the process by which a pinyin sequence A of length i is recognized as a pinyin sequence B of length j.
The dynamic programming recurrence is set as follows:

if i = 0 and j = 0, then D(i, j) = 0;

if i >= 1 and j >= 1, then D(i, j) = min{ D(i-1, j) + (1 - m_iΦ), D(i, j-1) + (1 - m_Φj), D(i-1, j-1) + (1 - m_ij) }

(the boundary cases D(i, 0) and D(0, j) accumulate the corresponding null-syllable costs).
Here, m_iΦ in the confusion matrix represents the probability that the ith syllable is identified as a null syllable, which is not zero, so the distance D(i, j) between the two strings increases by (1 - m_iΦ) when an operation of adding one syllable is performed. Similarly, when a delete operation is performed on a syllable, D(i, j) increases by (1 - m_Φj). When one syllable is modified into another, for example changing the ith pinyin in the pinyin list into the jth, D(i, j) increases by (1 - m_ij). In particular, in the initial state, before the syllable confusion matrix has been updated, m_ij = 0 (i not equal to j) or m_ij = 1 (i equal to j), which reduces D(i, j) to the classical edit distance.
As described above, what is recorded during the loop is the set of changes that keeps D(i, j) minimal, that is, the syllable editing relationships of the operation steps at which the similarity between the two sequences is highest; these relationships can be recorded in the operation sequence list.
The specific process of the algorithm is given as a code listing in the figures of the original publication (the listing survives extraction only as image references and is not reproduced here).
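Since the original listing is unavailable, the following is a reconstructed sketch of the recurrence and of the operation sequence list recording. The index convention of the sparse matrix m, keyed by (reference syllable, recognized syllable) with "" standing for the null syllable Φ, is an assumption inferred from the worked example that follows in the text; matched syllables are recorded as zero-cost "modify" operations.

```python
NULL = ""  # the null (empty) syllable, written Φ in the text

def weighted_edit_distance(recognized, reference, m):
    """Modified edit distance D(i, j) between a recognized pinyin sequence
    and a reference one, with costs (1 - confusion probability).

    m[(ref, rec)] is the confusion probability that reference syllable
    `ref` is recognized as syllable `rec`; missing pairs fall back to the
    initial state (1 on the diagonal, 0 elsewhere), i.e. classical costs.
    Returns the distance and the operation sequence list.
    """
    def cost(ref, rec):
        default = 1.0 if ref == rec else 0.0
        return 1.0 - m.get((ref, rec), default)

    n, k = len(recognized), len(reference)
    D = [[0.0] * (k + 1) for _ in range(n + 1)]
    back = [[None] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # delete recognized[i-1]
        D[i][0] = D[i - 1][0] + cost(NULL, recognized[i - 1])
        back[i][0] = ("delete", NULL, recognized[i - 1])
    for j in range(1, k + 1):                      # add reference[j-1]
        D[0][j] = D[0][j - 1] + cost(reference[j - 1], NULL)
        back[0][j] = ("add", reference[j - 1], NULL)
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            choices = [
                (D[i - 1][j] + cost(NULL, recognized[i - 1]),
                 ("delete", NULL, recognized[i - 1])),
                (D[i][j - 1] + cost(reference[j - 1], NULL),
                 ("add", reference[j - 1], NULL)),
                (D[i - 1][j - 1] + cost(reference[j - 1], recognized[i - 1]),
                 ("modify", reference[j - 1], recognized[i - 1])),
            ]
            D[i][j], back[i][j] = min(choices, key=lambda c: c[0])
    # Backtrack to recover the operation sequence list.
    ops, i, j = [], n, k
    while i > 0 or j > 0:
        kind, ref, rec = back[i][j]
        ops.append((kind, ref, rec))
        i -= kind != "add"
        j -= kind != "delete"
    return D[n][k], ops[::-1]
```

With an empty matrix this reduces to the classical edit distance: for the text's example, editing "n in h ao a" into "n i h ao" costs 2 (one modification, one deletion), and the operation sequence list contains ("modify", "i", "in") and ("delete", "", "a").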
The confusion probabilities lie in [0, 1], and by adding marks corresponding to the different operations, the syllable editing relationships of the pinyin sequences can be recorded in the operation sequence list.
The confusion frequency of each syllable can be kept in a change matrix of the same size as the confusion matrix, each element of which counts the confusion frequency of a syllable rather than a probability. When the highest similarity exceeds the predetermined threshold, so that the corresponding second pinyin sequence can be used to correct the first pinyin sequence, the syllable editing relationships of the two sequences are read from the records in the operation sequence list and the change matrix is updated. For example, suppose that in editing the first pinyin sequence "n in h ao a" into the second pinyin sequence "n i h ao" with the highest similarity, the operations recorded in the operation sequence list are to modify the syllable "in" into the syllable "i" and to delete the syllable "a". This means the user's output speech was misrecognized: "i" was recognized as "in", and a null syllable was recognized as "a". The elements m_s(i)s(in) and m_Φs(a) can then each be incremented by 1, where s(x) denotes the index of x in the pinyin table, yielding an updated change matrix.
Once the confusion frequencies of the syllables have been collected, the confusion probabilities can be updated. After the change matrix has been updated a predetermined number of times, the posterior probabilities in the confusion matrix can be iteratively updated from the frequencies recorded in the change matrix. In this way, whenever the edit distance between the user input and the corpus is calculated, the posterior probabilities of confusion between syllables are re-estimated and the confusion matrix is updated, so that the matrix is both applied and continuously refined during error correction, improving the correction of speech recognition results.
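A minimal sketch of this change matrix bookkeeping, assuming sparse dictionaries and a row-wise relative-frequency estimate for the posterior probabilities (both assumptions; the text does not fix the exact estimator or the update period):

```python
from collections import defaultdict

class ConfusionModel:
    """Per-user change matrix (raw frequencies) and confusion matrix
    (probabilities), both stored sparsely as {(reference, recognized): v}.
    """
    def __init__(self, update_period=10):
        self.changes = defaultdict(int)   # the change matrix
        self.m = {}                       # the confusion matrix
        self.pending = 0
        self.update_period = update_period  # "predetermined number of times"

    def record_correction(self, ops):
        # ops: operation sequence list, e.g. ("modify", "i", "in"),
        # produced when a correction was actually performed.
        for kind, ref, rec in ops:
            if ref != rec:                # only genuine confusions count
                self.changes[(ref, rec)] += 1
        self.pending += 1
        if self.pending >= self.update_period:
            self.refresh_probabilities()
            self.pending = 0

    def refresh_probabilities(self):
        # Re-estimate the probability that reference syllable `ref` is
        # recognized as `rec` from the accumulated row-wise frequencies.
        totals = defaultdict(int)
        for (ref, rec), c in self.changes.items():
            totals[ref] += c
        for (ref, rec), c in self.changes.items():
            self.m[(ref, rec)] = c / totals[ref]
```

For instance, feeding in the operation list from the "n in h ao a" example above increments the counts for ("i", "in") and ("", "a") and, after the update period elapses, refreshes the corresponding probabilities.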
After a predetermined number of error correction processes, the user's behavior exhibits a certain regularity and no longer appears random. Updating the confusion matrix with the confusion frequencies in the change matrix at that point makes the confusion matrix reflect the user's pronunciation patterns ever more closely as the user keeps using the system.
In one embodiment, further comprising: judging whether the confusion probability of the confusion matrix of the user is converged after each iteration update; in case the confusion probability converges, combining the confusion matrix of the user with the confusion matrices of other users having converging confusion probabilities into a common confusion matrix.
Combining the confusion matrix of the user with the confusion matrices of other users having converged confusion probabilities into a common confusion matrix, comprising: calculating the similarity between the confusion matrix of the user and the confusion matrices of the other users; and when the similarity is greater than a second preset threshold value, carrying out weighted average calculation on each element included by the confusion matrix of the user and the confusion matrices of the other users to obtain the common confusion matrix.
After the number of updates of the confusion matrix reaches a certain value, so that the confusion probabilities converge and no longer change, or change only slightly, the confusion matrix is marked as combinable. The similarity between combinable confusion matrices is then calculated by accumulating the absolute values of the differences between corresponding matrix elements. When the similarity of n (n > 1) confusion matrices is found to exceed a threshold, the n matrices are weighted-averaged element by element. The weights of different users may differ: for example, a user who has taken part in speech recognition result error correction many times, with many conversations, has a confusion matrix whose probabilities more accurately reflect the confusion of the corresponding syllables, so that user can be assigned a larger weight while others receive smaller weights. The resulting new confusion matrix is shared by the original users of the n matrices, and the merged matrix together with those users forms a larger user group. Periodically merging users' personalized confusion matrices in this way reduces maintenance cost, and the information common to users in the same group can be mined as a reference for data analysis.
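The merging step can be sketched as follows. The concrete weights (e.g. proportional to each user's amount of corrected speech) are an assumption; the text only requires that heavier users receive larger weights:

```python
def matrix_distance(a, b):
    """Accumulated absolute difference of elements: the text's measure of
    how (dis)similar two combinable confusion matrices are.
    Matrices are sparse dicts {(syllable_i, syllable_j): probability}."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

def merge_confusion_matrices(matrices, weights):
    """Element-wise weighted average of n converged per-user matrices."""
    total = sum(weights)
    keys = set().union(*matrices)
    return {
        k: sum(w * m.get(k, 0.0) for m, w in zip(matrices, weights)) / total
        for k in keys
    }

m1 = {("i", "in"): 0.8}
m2 = {("i", "in"): 0.4, ("l", "n"): 0.2}
print(round(matrix_distance(m1, m2), 6))            # -> 0.6
merged = merge_confusion_matrices([m1, m2], [3, 1])  # user 1 weighted 3x
```

The merged matrix here would give ("i", "in") the probability (3*0.8 + 1*0.4)/4 = 0.7, illustrating how the heavier user dominates the shared estimate.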
In the above scheme, the text of the speech recognition result corresponding to the user's output speech is converted into a corresponding first pinyin sequence; the similarity between a plurality of second pinyin sequences in a predetermined corpus and the first pinyin sequence is determined according to the edit distances between them and the confusion probability of each first syllable of the first pinyin sequence being confused into the corresponding second syllable of a second pinyin sequence; and the speech recognition result of the user is corrected based on the second pinyin sequence with the highest similarity. A personalized confusion matrix is designed for each user to distinguish different users' pronunciation habits, which avoids the large number of correction errors that would arise if small groups with distinctive pronunciation were assimilated to the mainstream pronunciation habits.
In addition, a null syllable is introduced when setting the confusion probabilities: the confusion probabilities from each syllable to the null syllable and from the null syllable to each syllable are recorded, so that the matrix reflects, to some extent, the syllables that the speech recognition system tends to drop or insert for a specific user, thereby improving the recall rate.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating an overall example of a speech recognition result error correction method according to an embodiment of the present application.
As shown in fig. 3, the method is mainly divided into two parts: pinyin retrieval and matrix updating.
The pinyin retrieval comprises the following steps:
Step 302: for each user access, convert the text of the speech recognition result of the user's output speech into pinyin to obtain the corresponding pinyin sequence;

Step 304: calculate a plurality of edit distances from this pinyin sequence, the pinyin sequences in the general or scene-specific corpus, and the confusion probabilities between the corresponding syllables; the probabilities of the initialized confusion matrix are used at first and are later replaced as the confusion matrix is modified in step 312;

Step 306: find the minimum edit distance and judge whether the corresponding highest similarity exceeds the predetermined threshold; if so, the most similar sequence is obtained;
the matrix updating comprises the following steps:
Step 308: compute the syllable confusion frequencies from the editing relationships recorded during the edit distance calculation of step 306, and update the frequency values in the change matrix corresponding to the confusion matrix;

Step 310: judge whether the updates of the change matrix have reached the specified condition;

Step 312: modify the confusion matrix using the confusion frequencies in the change matrix to obtain an updated confusion matrix, which is then used in the edit distance calculation of step 304 at the user's next access;

Step 314: judge whether the confusion matrices meet the merging condition;

Step 316: merge the confusion matrices of the several users meeting the merging condition into a new confusion matrix.
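The retrieval half of the flow (steps 304 and 306) can be sketched as follows; the threshold value and helper names are illustrative, and `distance_fn` stands in for the confusion-weighted edit distance described earlier:

```python
def retrieve_best(first_seq, corpus, distance_fn, threshold=1.5):
    """Steps 304-306 in miniature: score every second pinyin sequence in
    the corpus against the first pinyin sequence and accept the closest
    one only if it is close enough (threshold value is illustrative)."""
    best = min(corpus, key=lambda seq: distance_fn(first_seq, seq))
    if distance_fn(first_seq, best) <= threshold:
        return best   # step 306: the most similar sequence is obtained
    return None       # similarity too low: no error correction

# Toy stand-in distance: length difference plus position-wise mismatches.
def toy_dist(a, b):
    return abs(len(a) - len(b)) + sum(x != y for x, y in zip(a, b))

corpus = [["n", "i", "h", "ao"], ["z", "ai", "j", "ian"]]
print(retrieve_best(["n", "in", "h", "ao"], corpus, toy_dist))
# -> ['n', 'i', 'h', 'ao']
```

When no corpus sequence is close enough, the function returns None, matching the behavior described above where low similarity suppresses error correction.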
Optionally, an embodiment of the present application further provides a speech recognition result error correction device, and fig. 4 is a schematic structural diagram of the speech recognition result error correction device according to the embodiment of the present application. As shown in fig. 4, the speech recognition result error correction apparatus of this embodiment includes a memory 2200 and a processor 2400 electrically connected to the memory 2200, where the memory 2200 stores a computer program that can be executed by the processor 2400, and the computer program, when executed by the processor, implements each process of any one of the above speech recognition result error correction method embodiments, and can achieve the same technical effect, and is not described herein again to avoid repetition.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of any one of the above embodiments of the speech recognition result error correction method, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for correcting errors in speech recognition results, comprising:
converting a text of a voice recognition result corresponding to the voice output by the user into a corresponding first pinyin sequence;
determining the similarity between the plurality of second pinyin sequences and the first pinyin sequence according to the editing distance between the first pinyin sequence and the plurality of second pinyin sequences in the preset corpus and the confusion probability of the user for confusing each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence;
and correcting the voice recognition result of the user based on the second pinyin sequence with the highest similarity.
2. The method of claim 1, wherein determining the similarity of the plurality of second pinyin sequences to the first pinyin sequence based on edit distances between the first pinyin sequence and the plurality of second pinyin sequences in a predetermined corpus and confusion probabilities of the user confusing respective first syllables in the first pinyin sequence into second syllables in the second pinyin sequence that correspond to the first syllables comprises:
respectively calculating the editing distance between the first pinyin sequence and the plurality of second pinyin sequences, wherein the editing distance is the number of operation steps for editing the first pinyin sequence into each corresponding second pinyin sequence;
determining the similarity between the second pinyin sequence and the first pinyin sequence according to the difference obtained by subtracting the confusion probability of each first syllable in the first pinyin sequence with respect to the second syllable in the second pinyin sequence from the editing distance corresponding to each second pinyin sequence;
and determining the second pinyin sequence with the highest similarity according to the second pinyin sequence corresponding to the minimum difference.
3. The method of claim 2, further comprising:
recording the corresponding editing relation between each first syllable in the first pinyin sequence edited in the operation step and a second syllable corresponding to the first syllable in the plurality of second pinyin sequences;
and when the highest similarity in the plurality of second pinyin sequences is larger than a first preset threshold value, acquiring the frequency of confusing each first syllable in the first pinyin sequence into a second syllable corresponding to the first syllable in the second pinyin sequence with the highest similarity according to the corresponding editing relation so as to determine the confusing probability of each first syllable in the first pinyin sequence.
4. The method of claim 3, wherein determining a probability of confusion for each first syllable in the first pinyin sequence includes:
initializing a confusion matrix for the user, wherein the confusion matrix is a (k+1)-dimensional square matrix whose rows and columns are each formed by a predetermined number k of pinyin syllables plus 1 null syllable, and an element value m_ij in the confusion matrix represents a confusion probability of identifying the ith syllable as the jth syllable, the confusion matrix being used for determining the highest similarity among the plurality of second pinyin sequences;
determining the current confusion probability of each first syllable in the first pinyin sequence according to the frequency of each first syllable in the corresponding first pinyin sequence being confused into a second syllable corresponding to the first syllable in a second pinyin sequence with the highest similarity and the recording times of the corresponding editing relation by the user;
and iteratively updating the confusion probability of the corresponding element value in the confusion matrix of the user by using the current confusion probability of each first syllable in the first pinyin sequence.
5. The method of claim 4, further comprising:
judging whether the confusion probability of the confusion matrix of the user is converged after each iteration update;
in case the confusion probability converges, combining the confusion matrix of the user with the confusion matrices of other users having converging confusion probabilities into a common confusion matrix.
6. The method of claim 5, wherein merging the confusion matrix for the user with the confusion matrices for other users having converged confusion probabilities into a common confusion matrix comprises:
calculating the similarity between the confusion matrix of the user and the confusion matrices of the other users;
and when the similarity is greater than a second preset threshold value, performing weighted average calculation on each element included in the confusion matrix of the user and the confusion matrices of the other users to obtain the common confusion matrix.
7. The method of claim 1, wherein the confusion probability for the user to confuse each first syllable in the first pinyin sequence to a second syllable in the second pinyin sequence that corresponds to the first syllable comprises at least one of:
a probability that a non-null syllable in the first pinyin sequence is confused as a null syllable in the second pinyin sequence;
a probability that a null syllable in the first pinyin sequence is confused with a non-null syllable in the second pinyin sequence;
a probability that a non-null syllable in the first pinyin sequence is confused with a non-null syllable in the second pinyin sequence.
8. The method of claim 1 or 3, wherein correcting the voice recognition result of the user based on the second pinyin sequence having the highest similarity comprises:
and when the highest similarity in the plurality of second pinyin sequences is larger than a first preset threshold value, determining the second pinyin sequence with the highest similarity as a correct pinyin sequence corresponding to the output voice of the user.
9. An apparatus for correcting a speech recognition result, comprising: a memory and a processor electrically connected to the memory, the memory storing a computer program executable on the processor, the computer program, when executed by the processor, implementing the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011322395.9A 2020-11-23 2020-11-23 Speech recognition result error correction method and device and computer readable storage medium Active CN114530145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011322395.9A CN114530145B (en) 2020-11-23 2020-11-23 Speech recognition result error correction method and device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN114530145A true CN114530145A (en) 2022-05-24
CN114530145B CN114530145B (en) 2023-08-15

Family

ID=81619467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011322395.9A Active CN114530145B (en) 2020-11-23 2020-11-23 Speech recognition result error correction method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114530145B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221866A (en) * 2022-06-23 2022-10-21 平安科技(深圳)有限公司 Method and system for correcting spelling of entity word
CN115713934A (en) * 2022-11-30 2023-02-24 中移互联网有限公司 Error correction method, device, equipment and medium for converting voice into text

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2999768B1 (en) * 1999-03-04 2000-01-17 株式会社エイ・ティ・アール音声翻訳通信研究所 Speech recognition error correction device
CN1979638A (en) * 2005-12-02 2007-06-13 中国科学院自动化研究所 Method for correcting error of voice identification result
CN105206274A (en) * 2015-10-30 2015-12-30 北京奇艺世纪科技有限公司 Voice recognition post-processing method and device as well as voice recognition system
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN109800407A (en) * 2017-11-15 2019-05-24 腾讯科技(深圳)有限公司 Intension recognizing method, device, computer equipment and storage medium
CN110164435A (en) * 2019-04-26 2019-08-23 平安科技(深圳)有限公司 Audio recognition method, device, equipment and computer readable storage medium
CN110516248A (en) * 2019-08-27 2019-11-29 出门问问(苏州)信息科技有限公司 Method for correcting error of voice identification result, device, storage medium and electronic equipment
CN110765763A (en) * 2019-09-24 2020-02-07 金蝶软件(中国)有限公司 Error correction method and device for speech recognition text, computer equipment and storage medium
US20200082808A1 (en) * 2018-09-12 2020-03-12 Kika Tech (Cayman) Holdings Co., Limited Speech recognition error correction method and apparatus


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115221866A (en) * 2022-06-23 2022-10-21 平安科技(深圳)有限公司 Method and system for correcting spelling of entity word
CN115221866B (en) * 2022-06-23 2023-07-18 平安科技(深圳)有限公司 Entity word spelling error correction method and system
CN115713934A (en) * 2022-11-30 2023-02-24 中移互联网有限公司 Error correction method, device, equipment and medium for converting voice into text
CN115713934B (en) * 2022-11-30 2023-08-15 中移互联网有限公司 Error correction method, device, equipment and medium for converting voice into text

Also Published As

Publication number Publication date
CN114530145B (en) 2023-08-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant