CN115455946A - Voice recognition error correction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115455946A
CN115455946A
Authority
CN
China
Prior art keywords
text
recognition
characters
character
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211080639.6A
Other languages
Chinese (zh)
Inventor
张文辉
万根顺
高建清
潘嘉
刘聪
胡国平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211080639.6A priority Critical patent/CN115455946A/en
Publication of CN115455946A publication Critical patent/CN115455946A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Abstract

The invention provides a speech recognition error correction method and apparatus, an electronic device, and a storage medium. The method comprises: determining a recognition text of speech data to be corrected; determining the acoustic feature corresponding to each character in the recognition text based on the alignment position of that character in the speech data; and correcting the recognition text based on the acoustic features corresponding to the characters and the semantic features of the characters. The method, apparatus, device, and medium use not only the semantic feature of each character in the recognized text but also its corresponding acoustic feature. Compared with related techniques that consider only semantic features, they capture both the acoustic and the semantic properties of each character and fully exploit multiple kinds of features to strengthen the representation of the text to be corrected, thereby improving the accuracy of error localization and error correction.

Description

Voice recognition error correction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for speech recognition error correction, an electronic device, and a storage medium.
Background
The accuracy of speech recognition is critical for speech-based products and scenarios, such as voice input methods, conference transcription, and speech emotion recognition and translation systems, yet current speech recognition systems inevitably produce some recognition errors. A sound error correction system is therefore of great significance for any speech-based application.
However, most existing speech recognition error correction methods model text alone, using erroneous texts as training data. The available information is thus relatively limited, which makes accurate error localization and correction difficult and leads to poor error correction results.
Disclosure of Invention
The invention provides a speech recognition error correction method and apparatus, an electronic device, and a storage medium, which address the defects of the prior art, where text-only modeling makes accurate error localization and correction difficult and yields poor error correction results.
The invention provides a voice recognition error correction method, which comprises the following steps:
determining a recognition text of the voice data to be corrected;
determining acoustic features corresponding to the characters in the recognition text based on the alignment positions of the characters in the recognition text in the voice data;
and correcting the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
According to the speech recognition error correction method provided by the invention, the error correction is performed on the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text, and the method comprises the following steps:
determining the position characteristics of each character in the recognition text based on the alignment position of each character in the recognition text in the voice data;
and correcting the recognition text based on the acoustic features, the position features and the semantic features corresponding to the characters in the recognition text.
According to the speech recognition error correction method provided by the invention, the error correction is performed on the recognition text based on the acoustic features, the position features and the semantic features corresponding to the characters in the recognition text, and the method comprises the following steps:
adding the position feature and the semantic feature of each character in the recognition text to obtain the position-semantic feature of each character in the recognition text;
concatenating the position-semantic feature of each character in the recognition text with its acoustic feature to obtain the concatenated feature of each character in the recognition text;
and correcting the recognition text based on the concatenated features of the characters in the recognition text.
According to the voice recognition error correction method provided by the invention, the determining of the recognition text of the voice data to be corrected comprises the following steps:
determining an initial recognition text of voice data, and displaying the initial recognition text;
aligning the initial recognition text with the candidate recognition text corresponding to the voice data, determining the aligned initial recognition text as the recognition text of the voice data to be corrected, and displaying the recognition text;
the correcting the recognized text based on the acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text comprises:
responding to the selection operation of the user on the characters in the recognition text, and determining characters to be corrected from the recognition text;
and correcting the character to be corrected based on the acoustic characteristic corresponding to the character to be corrected and the semantic characteristic of the character to be corrected.
The speech recognition error correction method provided by the invention further comprises the following steps:
and under the condition that the character to be corrected is a special symbol without semantics, correcting the character to be corrected based on the alignment position of the character to be corrected in the candidate recognition text.
According to the speech recognition error correction method provided by the present invention, the determining the acoustic characteristics corresponding to each character in the recognition text based on the alignment position of each character in the recognition text in the speech data includes:
extracting acoustic features of the voice data to obtain acoustic features of each voice frame of the voice data;
aligning the recognition text with the predicted text of each voice frame of the voice data, and determining the alignment position of each character in the recognition text in the voice data;
and selecting the acoustic features of the characters in the recognition text at the alignment positions in the voice data from the acoustic features of the voice frames of the voice data as the acoustic features corresponding to the characters in the recognition text.
According to the speech recognition error correction method provided by the invention, the correction of the recognized text based on the acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text comprises the following steps:
based on a voice recognition error correction model, applying acoustic features corresponding to all characters in the recognition text and semantic features of all characters in the recognition text to correct the recognition text;
the speech recognition error correction model is trained on sample speech data, standard recognition texts of the sample speech data and candidate sample recognition texts.
According to the speech recognition error correction method provided by the invention, the sample speech data is obtained by performing speech synthesis on the standard recognition text, and the candidate sample recognition text is obtained by performing speech recognition on the sample speech data.
According to the speech recognition error correction method provided by the invention, the candidate sample recognition text is obtained by adding disturbance to the standard recognition text, wherein the disturbance comprises at least one of character replacement, insertion or deletion;
wherein the replacement character is determined based on a similar pronunciation of each character in the standard recognition text.
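As a purely illustrative, hypothetical sketch of the perturbation described above (the confusion table, probabilities, and function name are assumptions for illustration, not part of the invention), a candidate sample text can be generated from a standard recognition text as follows:

```python
import random

# Hypothetical confusion table mapping a character to similar-sounding
# candidates; a real system would derive this from a pronunciation lexicon.
SIMILAR_PRONUNCIATION = {"a": ["e"], "s": ["c"], "t": ["d"]}

def perturb(text: str, rng: random.Random, p: float = 0.2) -> str:
    """Randomly apply replacement, deletion, or insertion to characters,
    drawing replacement characters from similar-pronunciation candidates."""
    out = []
    for ch in text:
        r = rng.random()
        if r < p and ch in SIMILAR_PRONUNCIATION:
            out.append(rng.choice(SIMILAR_PRONUNCIATION[ch]))  # replacement
        elif r < p * 1.5:
            continue                                           # deletion
        else:
            out.append(ch)
            if r > 1 - p / 2:
                out.append(ch)                                 # insertion
    return "".join(out)
```

With p = 0 the text is returned unchanged; raising p yields progressively noisier candidate sample texts for training.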
According to the speech recognition error correction method provided by the invention, the speech recognition error correction model is obtained by training based on the following steps:
pre-training the initial model based on context information of the sample text to obtain a pre-training model;
and training the pre-training model based on sample voice data, the standard recognition text of the sample voice data and the candidate sample recognition text to obtain the voice recognition error correction model.
The invention also provides a voice recognition error correction device, comprising:
a recognition text determining unit for determining the recognition text of the speech data to be corrected;
an acoustic feature determining unit, configured to determine, based on an alignment position of each character in the recognition text in the speech data, an acoustic feature corresponding to each character in the recognition text;
and the error correction unit is used for correcting the error of the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the voice recognition error correction method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition error correction method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a speech recognition error correction method as described in any one of the above.
With the speech recognition error correction method and apparatus, the electronic device, and the storage medium, error correction of the recognized text uses not only the semantic features of the characters but also their corresponding acoustic features. Compared with related techniques that consider only semantic features, this captures both the acoustic and the semantic properties of each character and fully exploits multiple kinds of features to strengthen the representation of the text to be corrected, thereby improving the accuracy of error localization and error correction.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speech recognition error correction method provided by the present invention;
FIG. 2 is a schematic flow chart of step 130 of the speech recognition error correction method provided by the present invention;
FIG. 3 is a second schematic flow chart of the speech recognition error correction method according to the present invention;
FIG. 4 is a schematic flow chart illustrating step 120 of the speech recognition error correction method provided by the present invention;
FIG. 5 is a schematic flow chart of a speech recognition model training method provided by the present invention;
FIG. 6 is a schematic diagram of a structure of a speech recognition error correction model provided by the present invention;
FIG. 7 is a schematic structural diagram of a speech recognition error correction apparatus provided by the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Automatic Speech Recognition (ASR), often simply called speech recognition, is a technique in which a computer receives a speech signal, processes it, and converts it into text information that conforms to human understanding. The technique is widely applied in mobile phone voice assistants, input method software, vehicle navigation, various intelligent wearable devices, and the like, and has important application value. Natural language processing, a branch of artificial intelligence, processes and interprets data in text form using techniques such as deep learning and machine learning, and generally covers directions such as natural language understanding and natural language generation.
The accuracy of speech recognition is very important for speech-based products and scenarios, such as voice input methods, conference transcription, and speech emotion recognition and translation systems. However, all current speech recognition systems inevitably make some recognition errors; if a key word in a sentence is misrecognized, the meaning of the whole sentence becomes wrong, causing further, more serious errors in downstream tasks such as translation and sentiment analysis. A sound error correction system is therefore of great significance for any speech-based application.
However, most existing speech recognition error correction methods model text alone, using erroneous texts as training data. The available information is thus relatively limited, which makes accurate error localization and correction difficult and leads to poor error correction results.
Based on this, an embodiment of the present invention provides a speech recognition error correction method that corrects the text transcribed by a speech recognition system, improving the accuracy of the speech recognition result so that application products based on speech recognition are more robust.
Application scenarios of the speech recognition error correction method provided by the embodiment of the invention include, but are not limited to, mobile phone voice assistants, input method software, vehicle navigation, translators, and the like. It should be noted that, besides Chinese, the method is applicable to other languages such as Japanese, Korean, English, and Latin.
Fig. 1 is a schematic flow chart of the speech recognition error correction method provided by the present invention. The steps of the method may be executed by a speech recognition error correction apparatus, which may be implemented in software and/or hardware and may be integrated in an electronic device, including but not limited to a mobile phone, a computer, an intelligent voice interaction device, a smart appliance, a vehicle-mounted terminal, and the like. As shown in Fig. 1, the speech recognition error correction method may include the following steps:
step 110, the recognition text of the speech data to be corrected is determined.
Specifically, the text to be corrected is the recognition text of the speech data to be corrected; that is, it is obtained by performing speech recognition on the speech data. For example, a piece of speech data whose reference transcript is the company name "科大讯飞" (iFlytek) may yield recognition texts that are near-homophones of the correct name, i.e., different character strings with similar pronunciations.
A voice acquisition device, such as a microphone, may be arranged on the smart device or in its surrounding area, and the voice data may be collected through this device. After the voice data is collected by the microphone, it may be amplified and denoised; this is not specifically limited in the embodiment of the present invention.
The recognized text of the speech data may be obtained offline and/or online. For example, a voice recognition device may be previously installed on the smart device, and the voice recognition device outputs a recognition text of voice data, thereby implementing offline voice recognition. Wherein the speech recognition means may comprise a speech recognition model.
For another example, the intelligent device may establish a network connection with the server, the intelligent device may send the voice data to the server, the voice recognition device in the server outputs a recognition text of the voice data, and the server may send the recognition text to the intelligent device, thereby implementing online voice recognition. The server may comprise a cloud server.
It should be noted that the recognition text of the speech data to be corrected may be a recognition text directly obtained by performing speech recognition on the speech data, or may also be obtained by further processing the recognition text directly obtained by speech recognition, for example, an alignment text obtained by aligning candidate recognition texts corresponding to the speech recognition, which is not specifically limited in this embodiment of the present invention.
And step 120, determining acoustic features corresponding to the characters in the recognized text based on the alignment positions of the characters in the recognized text in the voice data.
In particular, prior-art speech recognition error correction methods use text only, typically correcting errors based on the semantics between the erroneous recognized text and the standard text of the speech data. However, encoder-decoder ASR models often produce errors in which the output differs substantially from the speech data acoustically; for example, a piece of speech data labeled "functional department" may be misrecognized as "functional part". For such errors, without acoustic information the error correction model cannot fix the erroneous ASR result even if it can output a new sentence with reasonable semantics.
In view of the poor error correction effect, the method provided by the embodiment of the invention aligns each character in the recognized text with the voice data to obtain the alignment position of each character in the recognized text in the voice data, so as to determine the acoustic feature corresponding to each character in the recognized text. The acoustic features corresponding to the characters in the recognized text thus obtained can represent the acoustic features of the speech frame at the aligned position in the speech data.
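The selection just described — taking, for each character, the frame-level acoustic feature at its alignment position — can be sketched as follows (a hypothetical illustration; the function name and array shapes are assumptions, not the invention's implementation):

```python
import numpy as np

def gather_char_acoustics(frame_feats: np.ndarray,
                          align_positions: list) -> np.ndarray:
    """frame_feats: (num_frames, feat_dim) acoustic features of the speech
    frames; align_positions[i]: frame index the i-th character aligns to.
    Returns the (num_chars, feat_dim) per-character acoustic features."""
    return frame_feats[np.asarray(align_positions)]

# Toy data: 10 frames of 4-dim features; 3 characters aligned to frames 1, 4, 8.
frames = np.arange(40, dtype=float).reshape(10, 4)
char_feats = gather_char_acoustics(frames, [1, 4, 8])  # shape (3, 4)
```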
Determining the alignment position of each character of the recognized text in the speech data may be implemented with an audio-text alignment algorithm, such as Connectionist Temporal Classification (CTC) alignment or Hidden Markov Model (HMM) alignment.
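For illustration, a greedy CTC-style alignment — collapsing repeated labels and blanks from per-frame predictions while recording the first frame of each emitted character — might look like the following simplified sketch (not the alignment algorithm of the invention):

```python
def ctc_align(frame_labels, blank="-"):
    """Collapse per-frame greedy CTC labels (repeats and blanks) and record,
    for each emitted character, the index of its first aligned frame."""
    aligned = []
    prev = blank
    for t, lab in enumerate(frame_labels):
        if lab != blank and lab != prev:
            aligned.append((lab, t))
        prev = lab
    return aligned

# Frame-wise labels "--hh-e-ll-lo" collapse to "hello",
# with 'h' aligned to frame 2, 'e' to frame 5, and so on.
aligned = ctc_align(list("--hh-e-ll-lo"))
```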
And step 130, correcting the recognized text based on the acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text.
Specifically, the semantic feature of each character may be represented by a word vector, i.e., a word embedding, which is a quantized vector representation of character information. Optionally, a word-representation vector generation network generates the word vector of each character in the recognized text, that is, its semantic feature.
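A minimal sketch of such a word-embedding lookup (the vocabulary, embedding dimension, and random initialization here are illustrative assumptions):

```python
import numpy as np

vocab = {"<unk>": 0, "科": 1, "大": 2, "讯": 3, "飞": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))  # toy 8-dim embeddings

def semantic_features(text: str) -> np.ndarray:
    """Look up the word-embedding (semantic feature) of each character."""
    ids = [vocab.get(ch, vocab["<unk>"]) for ch in text]
    return embedding_table[ids]

feats = semantic_features("科大讯飞")  # shape (4, 8)
```

In a trained system the table would be learned, not random; the lookup itself is the same.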
The acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text can be input into a pre-trained speech recognition error correction model, the trained speech recognition error correction model performs feature fusion on the acoustic features and the semantic features, and error correction is performed based on a fusion result, so that an error-corrected text is output. The acoustic features corresponding to the characters and the semantic features of the characters may also be applied to correct errors, and the corrected text may be determined by combining the error correction result obtained based on the acoustic features and the error correction result obtained based on the semantic features.
In addition, error correction of the recognized text may be performed on every character, or only on one or more characters: for example, a user may select a character in the recognized text, and that character is then corrected according to the user's selection; if the user selects no character, no correction is needed.
With the speech recognition error correction method provided by the embodiment of the invention, correcting the recognized text uses not only the semantic features of the characters but also their corresponding acoustic features. Compared with related techniques that consider only semantic features, the method captures both the acoustic and the semantic properties of each character and fully exploits multiple kinds of features to strengthen the representation of the text to be corrected, thereby improving the accuracy of error localization and error correction.
Based on the foregoing embodiment, fig. 2 is a schematic flowchart of step 130 in the speech recognition error correction method provided by the present invention, and as shown in fig. 2, step 130 specifically includes:
step 131, determining the position characteristics of each character in the recognition text based on the alignment position of each character in the recognition text in the voice data;
step 132, correcting the recognized text based on the acoustic features, the position features and the semantic features corresponding to the characters in the recognized text.
Specifically, the position feature of each character in the recognized text may be produced by a position encoding network (position embedding). The position encoding network generates the position feature of each character based on its alignment position in the speech data. Optionally, the position feature may take the form of a position vector, quantizing the position information.
Having obtained the position features of the characters in the recognized text, the recognized text may be corrected based on the acoustic, position, and semantic features of each character. For example, these three features may be fused through a feature fusion network whose fusion weight parameters are learnable, with error correction performed on the fused features; feature fusion may also be done by vector addition or concatenation, again with error correction performed on the fused features. This is not specifically limited in the embodiment of the present invention.
When correcting the recognized text, the method provided by the embodiment of the invention considers not only the acoustic and semantic features of each character but also its position feature, fully exploiting multiple kinds of features to strengthen the representation of the text to be corrected and thereby further improving the accuracy of error localization and error correction.
Based on any of the above embodiments, fig. 3 is a second schematic flow chart of the speech recognition error correction method provided by the present invention, as shown in fig. 3, step 132 specifically includes:
step 132-1, adding the position feature and the semantic feature of each character in the recognized text to obtain the position-semantic feature of each character;
step 132-2, concatenating the position-semantic feature of each character with its acoustic feature to obtain the concatenated feature of each character;
and step 132-3, correcting the recognized text based on the concatenated features of the characters.
Specifically, to correct the recognized text, the position feature and the semantic feature of each character may first be added, yielding the position-semantic feature of each character. The resulting position-semantic features provide both position information and semantic information for each character during error correction. Then the position-semantic feature of each character is concatenated with its acoustic feature, giving the concatenated feature of each character in the recognized text, and the recognized text is corrected according to these concatenated features.
Suppose the recognized text contains n characters, where the semantic feature vector of the i-th character is W_i, its position feature vector is P_i, and its corresponding acoustic feature vector is A_i; n is an integer greater than 1 and i is a positive integer no greater than n.
The position-semantic feature vector of the i-th character is then W_i + P_i. Vector addition requires the dimensions of the semantic feature vector W_i and the position feature vector P_i to be identical, which simplifies the computation.
On this basis, the position-semantic feature vector W_i + P_i of the i-th character is concatenated with its acoustic feature vector A_i, giving the concatenated feature vector [W_i + P_i, A_i]. Concatenation does not require the vector dimensions of W_i + P_i and A_i to be the same.
After the concatenated features are obtained, the corrected text is computed by a Transformer operating in a non-autoregressive manner.
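The addition-then-concatenation above can be illustrated numerically (the dimensions and values are arbitrary assumptions for illustration):

```python
import numpy as np

W_i = np.array([0.1, 0.2, 0.3, 0.4])   # semantic feature of the i-th character
P_i = np.array([0.0, 1.0, 0.0, 1.0])   # position feature: same dim as W_i
A_i = np.array([5.0, 6.0, 7.0])        # acoustic feature: dim may differ

pos_sem = W_i + P_i                     # element-wise addition requires equal dims
fused = np.concatenate([pos_sem, A_i])  # concatenation [W_i + P_i, A_i]
```

Note that the addition constrains W_i and P_i to the same dimension, while the concatenation places no such constraint on A_i, exactly as stated above.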
In the method provided by the embodiment of the invention, the acoustic, position, and semantic features of the characters in the recognized text are fused by feature addition and concatenation according to the dimensions of the feature vectors. The computation involved is small, so efficiency improves along with the accuracy of error localization and correction.
Based on any of the above embodiments, an embodiment of the present invention provides a speech recognition error correction method, where step 110 specifically includes:
and step 111, determining an initial recognition text of the voice data, and displaying the initial recognition text.
And 112, aligning the initial recognition text with the candidate recognition text corresponding to the voice data, determining the aligned initial recognition text as the recognition text of the voice data to be corrected, and displaying the recognition text.
Correspondingly, step 130 specifically includes:
step 133, in response to the user's selection operation of the characters in the recognized text, determining the characters to be corrected from the recognized text.
And step 134, correcting the error of the character to be corrected based on the acoustic characteristic corresponding to the character to be corrected and the semantic characteristic of the character to be corrected.
Specifically, considering that correcting a character in the recognized text may increase the probability that the character that is originally recognized correctly is modified into an incorrect character, one or more characters in the recognized text may be selectively corrected, for example, a user may select to correct certain positions, and other unselected characters may be left without correction.
After the voice data is obtained, an initial recognized text may be determined. In a scenario in which a user inputs text through a terminal, the user inputs a piece of voice data, the voice data is recognized to obtain the initial recognized text, and the initial recognized text is displayed on the user terminal.
The initial recognized text may contain a recognition error. After it is confirmed that the user needs to correct the initial recognized text, the initial recognized text is aligned with the candidate recognized texts corresponding to the voice data. The purpose of this alignment is to confirm whether the initial recognized text contains a deletion error. For example, if the initial recognized text includes 6 characters and a candidate recognized text includes 7 characters, it may be determined that a character is missing from the initial recognized text; the length of the aligned initial recognized text then becomes 7, and the missing character is filled with a special symbol, for example "#". Of course, besides deletion errors there may also be substitution errors and the like.
Further, the initial recognized text and the plurality of candidate recognized texts may be aligned in the time dimension according to a minimum-edit-distance algorithm at the character level and the pronunciation level. Specifically, the longest text is selected as the anchor; the minimum character-level edit distance between each remaining text and the anchor text is calculated and the edit path is recorded; if multiple edit paths exist, the pronunciation-level edit path (converting the text into its pronunciation) of each alignment is calculated and the one with the minimum pronunciation-level distance is selected for alignment; finally, all alignment results are merged to obtain the aligned initial recognized text.
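The character-level part of this alignment can be sketched in Python as follows (a minimal illustration under stated assumptions: the pronunciation-level tie-breaking is omitted, characters inserted relative to the anchor are simply dropped, and the function name is hypothetical):

```python
def edit_align(anchor, other, pad="#"):
    """Align `other` to the anchor text with a character-level minimum-edit-
    distance dynamic program, padding positions where `other` is missing a
    character with `pad`. Simplified sketch: pronunciation-level tie-breaking
    is omitted, and characters inserted relative to the anchor are dropped."""
    n, m = len(anchor), len(other)
    # dp[i][j] = edit distance between anchor[:i] and other[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if anchor[i - 1] == other[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion: pad this slot
                           dp[i][j - 1] + 1)         # insertion: drop extra character
    # backtrace, emitting one aligned slot per anchor position
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if anchor[i - 1] == other[j - 1] else 1)):
            aligned.append(other[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            aligned.append(pad); i -= 1
        else:
            j -= 1  # extra character in `other`: skipped in this sketch
    return aligned[::-1]
```

Applied to an anchor of 7 characters and a candidate missing one, the candidate comes back padded to the anchor's length with "#" at the missing position.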
Then, the aligned initial recognized text is determined as the recognized text of the voice data to be corrected, and the recognized text is displayed.
After the user sees the aligned recognized text on the terminal, the user can decide which characters at which positions need to be modified. For example, the user may manually click or otherwise select one or more characters in the recognized text, and the speech recognition error correction device determines the characters to be corrected from the recognized text in response to the user's selection operation.
On the basis, the characters to be corrected can be corrected based on the acoustic features corresponding to the characters to be corrected and the semantic features of the characters to be corrected. The acoustic features corresponding to the characters to be corrected can be determined according to the corresponding positions of the characters to be corrected in the recognition text.
In the method provided by this embodiment of the invention, the recognized text is displayed and the characters to be corrected are determined based on the user's selection of characters in the recognized text. This reduces the negative influence that correcting the recognized text character by character might cause, leaves the choice to the user, enhances the human-computer interaction experience, and further improves the accuracy of speech recognition error correction.
Based on any of the above embodiments, the speech recognition error correction method provided by the embodiment of the present invention further includes:
and under the condition that the character to be corrected is a special symbol without semantics, correcting the character to be corrected based on the alignment position of the character to be corrected in the candidate recognition text.
Specifically, as described in the above embodiment, the recognized text may include a semantics-free special symbol used to complete missing characters. When the user selects this special symbol, the character to be corrected is the special symbol, and the special symbol needs to be corrected.
In this case, the character to be corrected may be corrected according to its alignment positions in the plurality of candidate recognized texts. If a recognized character exists at the corresponding alignment position in a candidate recognized text, the special symbol can be corrected to the recognized character at that position. Of course, the position may also remain empty.
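As a minimal sketch of how a placeholder might be filled from the candidates' aligned positions (the majority vote used here is an illustrative heuristic, not necessarily the patent's exact rule):

```python
from collections import Counter

def fill_placeholder(aligned_texts, pos, pad="#"):
    """Correct a semantics-free placeholder at column `pos` by looking at the
    same aligned position in the candidate recognition texts. Illustrative
    majority-vote heuristic; if no candidate has a real character there, the
    position stays empty (the placeholder is kept)."""
    votes = Counter(t[pos] for t in aligned_texts if t[pos] != pad)
    return votes.most_common(1)[0][0] if votes else pad
```

With three aligned candidates "AB#D", "ABCD", "ABCD", the placeholder at column 2 would be filled with "C".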
For example, a piece of speech data labeled "how is the weather today" is obtained; the initial recognized text is "how today's weather is like", and the other candidate recognized texts are "how today's weather is like" and "how today's weather is like young", respectively.

The initial recognized text is aligned with the plurality of candidate recognized texts; the recognized text determined after alignment is "today # how many and so on", and this recognized text is fed back to the user terminal, where "#" is a special symbol without meaning.

At this time, the user can click "#", and "#" can then be corrected to the correct character "day" according to the other candidate recognized texts; if the user believes there may be an error in "mo", the user clicks to select "mo", and "mo" is corrected to the correct character "no" based on the acoustic features and semantic features corresponding to "mo".
For the correctly recognized character samples in the text, automatic error correction is not applied; the user decides whether each position is corrected, which avoids the possible situation in which a correct sample is modified into an error (a false positive).
In the method provided by this embodiment of the invention, in the case where the character to be corrected is a special symbol without semantics, the character is corrected based on its alignment positions in the plurality of candidate recognized texts, so that only the characters specified by the user are corrected. This avoids the negative influence that correcting the recognized text character by character might cause, leaves the choice to the user, enhances the human-computer interaction experience, and further improves the accuracy of speech recognition error correction.
Based on any of the above embodiments, fig. 4 is a schematic flowchart of step 120 in the speech recognition error correction method provided by the present invention, as shown in fig. 4, step 120 specifically includes:
step 121, extracting acoustic features of the voice data to obtain acoustic features of each voice frame of the voice data;
step 122, aligning the recognition text with the predicted text of each voice frame of the voice data, and determining the alignment position of each character in the recognition text in the voice data;
Step 123, selecting, from the acoustic features of the voice frames of the voice data, the acoustic features at the alignment positions of the characters of the recognized text as the acoustic features corresponding to the characters in the recognized text.
Specifically, acoustic feature extraction is performed on the speech data, which can be realized through a trained ASR model based on an Encoder-Decoder structure; the output of the Encoder end is used as the acoustic feature of each speech frame of the speech data.
Before step 121 is executed, an ASR model based on an Encoder-Decoder structure may be obtained through pre-training. Fig. 5 is a schematic flow chart of the speech recognition model training method provided by the present invention. As shown in fig. 5, when the ASR model is trained, a loss function constraint needs to be added at the Encoder end: the output of the Encoder is passed through a classification layer, a CTC (Connectionist Temporal Classification) loss is then calculated between the predicted text output by the classification layer and the labeled text of the speech data, and this loss participates in updating the model parameters.
In fig. 5 (a), the loss function at the Decoder end is the cross-entropy constraint used when training a conventional Encoder-Decoder-based ASR model. Adding the CTC loss function at the Encoder end makes the acoustic features output by the Encoder end more concentrated; that is, the acoustic features corresponding to each character in the speech data are concentrated in a small region, and the output of the classification layer at the corresponding position shows an obvious spike, as shown in fig. 5 (b).
After the acoustic features of each speech frame of the speech data are obtained, the recognized text is aligned with the predicted text of each speech frame of the speech data using a CTC alignment algorithm, and the alignment position of each character of the recognized text in the speech data is determined. Then, from the acoustic features of the speech frames of the speech data, the acoustic features at the alignment positions of the characters of the recognized text are selected as the acoustic features corresponding to those characters. This can be realized through the following steps:
1) Calculate the output of the classification layer at the Encoder end to obtain a T×V matrix enc_out, which can be expressed as follows:
enc_out=F·E(x)
wherein F represents the classification layer, E represents the Encoder end, x represents the speech data, T represents the frame length of the speech data, and V represents the number of classes, namely the size of the vocabulary.
2) Align the recognized text on enc_out in order: place the text sequence of length m on enc_out of length T in sequence. The position where each character is placed corresponds to a vector of size V, and the value of that vector at the index corresponding to the character's class is the score of the character at that position, denoted Score_ij. The final score of the entire recognized text sequence can be expressed as:

Score_total = Σ_{i=1}^{m} Score_{i, pos_i}, subject to pos_{i-1} < pos_i ≤ T

wherein pos_{i-1} represents the alignment position of the (i-1)-th character of the recognized text in the speech data, Score_ij represents the score of the i-th character of the recognized text at the j-th position of the speech data, m represents the text length of the recognized text, and T represents the frame length of the speech data.
3) Traverse all legal permutations and combinations and select the combination with the highest overall score as the CTC alignment. This alignment contains the acoustically most accurate pronunciation positions corresponding to the recognized text, and the alignment positions obtained in this way can be denoted Pos = {pos_1, pos_2, …, pos_m}.
Then, the acoustic features at the alignment positions of the characters of the recognized text are selected from the acoustic features of the speech frames of the speech data as the acoustic features corresponding to those characters, denoted A = {A_1, A_2, …, A_m}. The alignment process does not take the placeholder "#" into account.
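The search over legal alignments in steps 2) and 3) need not enumerate all permutations: because the positions must be strictly increasing, a dynamic program suffices. A rough Python sketch (function name illustrative; assumes T is at least the text length m):

```python
def ctc_align(score, T):
    """score[i][j]: classification-layer score of the i-th recognized
    character at frame j. Returns positions pos_1 < pos_2 < ... < pos_m
    maximizing the total score, via dynamic programming instead of
    enumerating every legal combination."""
    m = len(score)
    NEG = float("-inf")
    best = [[NEG] * T for _ in range(m)]   # best[i][j]: best score placing char i at frame j
    back = [[0] * T for _ in range(m)]     # backpointer to the previous character's frame
    for j in range(T):
        best[0][j] = score[0][j]
    for i in range(1, m):
        run_best, run_arg = NEG, -1        # running max of best[i-1][0..j-1]
        for j in range(i, T):
            if best[i - 1][j - 1] > run_best:
                run_best, run_arg = best[i - 1][j - 1], j - 1
            best[i][j] = run_best + score[i][j]
            back[i][j] = run_arg
    # pick the best final frame and backtrace the positions
    j = max(range(m - 1, T), key=lambda t: best[m - 1][t])
    pos = [0] * m
    for i in range(m - 1, -1, -1):
        pos[i] = j
        j = back[i][j]
    return pos
```

For a 2-character text with scores [[5, 1, 0], [0, 2, 9]] over 3 frames, the best alignment places the characters at frames 0 and 2.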
According to the voice recognition error correction method provided by the embodiment of the invention, the acoustic characteristics of the aligned position of each character in the recognition text in the voice data are selected from the acoustic characteristics of each voice frame of the voice data and are used as the acoustic characteristics corresponding to each character in the recognition text, so that a foundation is provided for improving the accuracy of error correction.
Based on any of the above embodiments, step 130 specifically includes:
based on the voice recognition error correction model, applying acoustic features corresponding to all characters in the recognized text and semantic features of all characters in the recognized text to correct errors of the recognized text;
the speech recognition error correction model is obtained by training based on sample speech data, standard recognition texts of the sample speech data and candidate sample recognition texts.
Specifically, the error locating and error correction of the recognized text can be realized based on a speech recognition error correction model. Fig. 6 is a schematic structural diagram of the speech recognition error correction model provided by the present invention. As shown in fig. 6, the speech recognition error correction model may be a Transformer structure with 12 layers, a hidden size of 1024, and 12 heads in the multi-head attention. The speech recognition error correction model may include a word coding layer, a position coding layer, a feature addition layer, a feature concatenation layer, a linear layer, a high-dimensional feature extraction layer, and a classification layer.
The word coding layer is used to perform word coding on the recognized text to be corrected to obtain the semantic feature representation vectors of the recognized text, and the position coding layer is used to perform position coding on the alignment positions of the characters of the recognized text in the speech data to obtain the position feature representation vectors. Both coding layers are learnable, and their coding dimensions are the same, for example 512.
The feature addition layer is used to add the semantic feature representation vectors output by the word coding layer to the position feature representation vectors output by the position coding layer. The feature concatenation layer is used to concatenate the position-augmented word coding information with the acoustic features along the feature dimension (512) to obtain a 1024-dimensional vector, which contains the character information of each character in the recognized text as well as the position information and the acoustic information corresponding to the speech data.
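For a single character, the addition and concatenation layers reduce to simple vector operations, which can be sketched as follows (pure Python; the 512-dimensional sizes match the text, the function name is illustrative):

```python
def fuse_features(semantic, position, acoustic):
    """Fuse one character's features as described above: add the semantic and
    position vectors (which live in the same 512-dim space), then concatenate
    the acoustic vector, giving a 1024-dim representation."""
    assert len(semantic) == len(position) == len(acoustic)
    pos_sem = [s + p for s, p in zip(semantic, position)]  # feature-addition layer
    return pos_sem + acoustic                              # feature-concatenation layer
```

Adding keeps the dimension at 512 while concatenating doubles it, which is why the downstream Transformer sees 1024-dimensional inputs.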
The high-dimensional feature extraction layer may be a Transformer network that extracts high-dimensional information from the recognized text; the high-dimensional information of the text is then processed in an autoregressive or non-autoregressive manner. The autoregressive scheme mainly uses the Attention mechanism to compute a weighted average of the high-dimensional text information and then classifies the result with a classifier. The non-autoregressive scheme adds a classification layer at the output of the Transformer and directly classifies the output at each position, expecting a correct classification result.
Because the input and output of the error correction task are similar (a good ASR model has a low error rate), an autoregressive error correction scheme has difficulty learning context information; the model often learns a one-to-one mapping instead, so the error correction capability cannot really be exploited.
Therefore, the present application corrects errors in a non-autoregressive manner based on the Transformer: after the Transformer computation, the corrected text is obtained through classification by the classification layer.
Before this, the speech recognition error correction model may also be trained in advance, for example, the speech recognition error correction model may be trained as follows: first, a large amount of sample voice data, a standard recognition text of the sample voice data, and a candidate sample recognition text of the sample voice data are acquired. The candidate sample identification text may be one or multiple, which is not specifically limited in this embodiment of the present invention. And then training an initial model based on the sample voice data, the standard recognition text of the sample voice data and the candidate sample recognition text, thereby obtaining a voice recognition error correction model.
As shown in FIG. 6, there are 4 candidate sample recognition texts. Pos_11 represents the CTC alignment position corresponding to the character at the 1st position of the aligned 1st candidate sample recognition text, and Pos_23 represents the CTC alignment position corresponding to the character at the 3rd position of the aligned 2nd candidate sample recognition text. A_12 represents the acoustic feature corresponding to the character at the 2nd position of the 1st candidate sample recognition text, and A_35 represents the acoustic feature corresponding to the character at the 5th position of the 3rd candidate sample recognition text. All of this information consists of 512-dimensional vectors.
To train the Transformer model, the four 1024-dimensional vectors at each time step are fused and converted through a 4096->1024 linear layer; the converted 1024-dimensional vector contains the word information, position information, and acoustic information of the 4 candidate sample recognition texts.
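The fusion through the linear layer can be sketched as follows (pure Python with toy dimensions in the example; in the model the concatenation is 4 × 1024 = 4096 and the weight matrix is 4096 × 1024, and the bias is omitted here):

```python
def fuse_candidates(vectors, weight):
    """Fuse the candidates' vectors at one time step: concatenate them, then
    project back down through a linear layer. `weight` has one row per input
    dimension and one column per output dimension; bias omitted for brevity."""
    concat = [component for vec in vectors for component in vec]  # e.g. 4 x 1024 -> 4096
    # matrix-vector product: one dot product per output column
    return [sum(c * w for c, w in zip(concat, col)) for col in zip(*weight)]
```

With two 2-dimensional candidate vectors and a 4 × 2 weight matrix, the output is a single 2-dimensional fused vector.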
In the prior art, parallel data comprising speech data and the corresponding standard recognized texts are collected first; a pre-trained ASR model is then used to transcribe the speech data into recognized texts containing errors, which are used as training samples, with the standard recognized texts as labels for supervised learning. Such sample data contains only erroneous sample recognized texts, and the model training process lacks any reference to acoustic information.
In the method provided by this embodiment of the invention, the initial model is trained based on the sample speech data, the standard recognized texts of the sample speech data, and the candidate sample recognition texts, so that the resulting speech recognition error correction model can capture the acoustic and semantic features of each character in the candidate sample recognition texts and fully use these features to enhance the representation of the recognized text to be corrected, thereby improving the accuracy of error locating and error correction.
Based on any of the above embodiments, the sample speech data is obtained by performing speech synthesis on the standard recognition text, and the candidate sample recognition text is obtained by performing speech recognition on the sample speech data.
Specifically, considering that the cost of acquiring parallel audio-text data is relatively high and the amount of training data obtained is therefore limited, the sample speech data provided by this embodiment of the invention is obtained by performing speech synthesis on standard recognized texts. A large amount of plain text data is obtained in advance and used as standard recognized texts; corresponding audio is synthesized from the standard recognized texts with a pre-trained TTS (text-to-speech) model, and the synthesized audio is used as sample speech data. This increases the amount of sample data and thereby improves error correction accuracy.
The prior art typically corrects only the optimal ASR hypothesis, but sometimes the correct decoding result is present among the ASR candidate hypotheses, and existing error correction schemes fail to use this information. Therefore, when the sample recognition texts are obtained, Beam Search decoding is performed on the sample speech data to obtain the n-best candidate texts (n denotes the size of the candidate set), yielding n candidate sample recognition texts.
Unlike a greedy algorithm, the ASR Beam Search decoding algorithm expands the search space. For example, when the beam size is set to n, the n best results at the initial moment are selected as the optimal paths of the current time step; extensions are then searched from each of the n current best states, the resulting solutions are sorted to keep the n best, and this process is repeated until all n states reach the end-of-search condition. For example, for a piece of speech labeled "science fly", Beam Search decoding with 4-best might yield the results: "Korea fly", "Korea flood fly", "Korea training fly", and "Kodawa fly".
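The decoding loop described above can be sketched over a toy table of per-step log-probabilities (an illustrative simplification: real ASR beam search expands over a full vocabulary with pruning and end-of-sequence handling):

```python
def beam_search(step_logprobs, n):
    """Keep the n best partial paths ('beams') at every decoding step.
    step_logprobs[t][c] is the log-probability of class c at step t."""
    beams = [((), 0.0)]  # (path, cumulative log-probability)
    for dist in step_logprobs:
        expanded = [(path + (c,), lp + class_lp)
                    for path, lp in beams
                    for c, class_lp in enumerate(dist)]
        expanded.sort(key=lambda item: -item[1])  # best paths first
        beams = expanded[:n]                      # prune to the beam size
    return beams                                  # n-best (path, score) pairs
```

With two steps and two classes, the second-best path can differ from the greedy path at the first step, which is exactly the extra coverage beam search buys over a greedy decoder.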
According to the method provided by the embodiment of the invention, the sample voice data is obtained by carrying out voice synthesis on the standard recognition text, and the candidate sample recognition text is obtained by carrying out voice recognition on the sample voice data, so that a large amount of sample data can be conveniently obtained, and the accuracy of voice recognition and error correction is improved.
Based on any of the above embodiments, the candidate sample recognition texts are obtained by adding perturbations to the standard recognition text, wherein the perturbations include at least one of character substitution, insertion, or deletion; the substitution character is determined based on a pronunciation similar to that of a character in the standard recognition text.
Specifically, besides the n-best candidate texts obtained by ASR decoding, the candidate sample recognition texts can also be obtained by adding perturbations to the standard recognition text. Adding a perturbation includes at least one of substituting, inserting, or deleting characters of the standard recognition text, where the substitution character is determined based on a pronunciation similar to that of a character in the standard recognition text.
Adding a perturbation to a standard recognition text can be achieved by:
1) Collect the decoding results of the pre-trained ASR model on its test set, and count the proportions of substitution errors, insertion errors, and deletion errors in the test set.
2) Construct a similar-pronunciation rule table. Taking Chinese as an example, Table 1 is such a similar-pronunciation rule table. As shown in Table 1, the initials b and p are similar pronunciations, so the initial b of a Chinese character in the standard recognition text can be replaced with p to form another Chinese character. Likewise, the finals ai and ei are similar pronunciations, and the final ai of a Chinese character in the standard recognition text can be replaced with ei to form another Chinese character.
TABLE 1
3) Construct substitution errors according to the similar-pronunciation rules. For example, the pronunciation of "you" (您) is "nin", comprising the initial "n" and the final "in". If the initial "n" is replaced with "l", "you" can be replaced with similarly pronounced characters such as "forest" (林) or "adjacent" (邻); the final "in" can also be replaced with "ing", so "you" can be replaced with characters pronounced "ning", and so on. Such substitution-error construction simulates, to some extent, ASR's misrecognition of similar acoustic information.
4) Construct deletion and insertion errors by randomly deleting characters and inserting random characters. The construction proportions of the three error types all follow the real proportions counted in step 1).
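Steps 1) to 4) can be sketched as a single perturbation routine (all names, the table fragment, and the lookup interfaces are illustrative assumptions, not taken from the patent):

```python
import random

# Hypothetical fragment of a similar-pronunciation rule table (pinyin units).
SIMILAR = {"b": "p", "p": "b", "n": "l", "l": "n",
           "ai": "ei", "ei": "ai", "in": "ing", "ing": "in"}

def perturb(chars, pinyin, homophones, sub_p, ins_p, del_p, rng):
    """Build one candidate sample text from a standard text: substitution via
    the similar-pronunciation rule, random insertion (duplicating a character),
    and random deletion, at the error proportions counted from the ASR test
    set. `pinyin` maps a character to (initial, final); `homophones` maps a
    pinyin string to candidate characters. Only the initial is swapped here."""
    out = []
    for ch in chars:
        r = rng.random()
        if r < del_p:
            continue                       # deletion error: drop the character
        if r < del_p + ins_p:
            out.append(ch); out.append(ch) # insertion error: copy the character
            continue
        if r < del_p + ins_p + sub_p and ch in pinyin:
            ini, fin = pinyin[ch]
            key = SIMILAR.get(ini, ini) + fin   # swap the initial for a similar one
            out.append(rng.choice(homophones.get(key, [ch])))
            continue
        out.append(ch)                     # character left unchanged
    return "".join(out)
```

With a substitution probability of 1 and the toy dictionaries below, "您" ("nin") becomes "林" ("lin"), mirroring the n → l example in the text.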
According to the method provided by the embodiment of the invention, the candidate sample recognition text is obtained by performing at least one operation of character replacement, deletion or insertion on the standard recognition text, so that the data volume of the sample is increased, and the error correction accuracy of the speech recognition error correction model is improved.
Based on any of the above embodiments, in the case where there are a plurality of candidate sample recognition texts, the plurality of candidate sample recognition texts may be further aligned.
For the n candidate sample recognition texts decoded by ASR, the n texts may be aligned in the time dimension using a minimum edit distance algorithm. The specific operation is as follows: select the longest of the candidate sample recognition texts as the anchor, calculate the minimum edit distance between each of the remaining n-1 candidate sample recognition texts and the anchor text, record the edit paths, and finally merge the alignment results to obtain n aligned texts. For example, for sample speech data labeled "in functional department", the 4-best decoding results are "in functional part", "in functional department", and "in functional part" in this order, and the results of aligning the four candidate sample recognition texts are shown in Table 2, where "#" represents a completed placeholder.
TABLE 2
(character-by-character alignment of the four candidate sample recognition texts: each row is one candidate, each column one aligned character position, and "#" marks a completed placeholder)
For the n candidate sample recognition texts obtained by adding perturbations to the standard recognition text, substitution errors are constructed according to the similar-pronunciation rule table, and random insertion and deletion errors are constructed at the same time (an insertion error may be a copy of the previous character or a randomly inserted character); the n texts are then aligned. For the standard recognition text "in functional department", the results after aligning the 4 constructed candidate sample recognition texts are shown in Table 3.
TABLE 3
(character-by-character alignment of the four constructed candidate sample recognition texts, containing substitution, insertion, and deletion errors relative to the standard text; "#" marks a completed placeholder)
Based on any of the above embodiments, the speech recognition error correction model is obtained by training based on the following steps:
pre-training the initial model based on context information of the sample text to obtain a pre-training model;
and training the pre-training model based on the sample voice data, the standard recognition text of the sample voice data and the candidate sample recognition text to obtain a voice recognition error correction model.
Specifically, when the speech recognition error correction model corrects a character at a certain position, besides using the character, position, and acoustic information at that position, a good understanding of the context is sometimes needed. For example, in "in the moment of thunderstorm", "class" is misrecognized by ASR as "moment", and correcting this requires good context analysis capability. The initial model is therefore pre-trained before the speech recognition error correction model is trained. To reduce training cost, plain text is used in pre-training; the sample text is not specifically limited and may be any text. The initial model is pre-trained based on the context information of the sample text. Further, the pre-training may use a Masked Language Model scheme, and the resulting pre-trained model has good context understanding and analysis capability.
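The masked-language-model pre-training input can be sketched as follows (a minimal illustration; real MLM recipes such as BERT's also keep or randomly replace a fraction of the selected tokens, which is omitted here):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", p=0.15, rng=None):
    """Prepare one masked-language-model training example: randomly replace a
    fraction p of tokens with a mask symbol, and record (position, original
    token) pairs as the labels the model must recover from context."""
    rng = rng or random.Random()
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            masked.append(mask_token)
            targets.append((i, tok))  # label: what the model should predict here
        else:
            masked.append(tok)
    return masked, targets
```

Training on such examples forces the model to predict each masked character from its context, which is the context-analysis capability the error correction model needs.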
And finally, training the pre-training model based on the sample voice data, the standard recognition text of the sample voice data and the candidate sample recognition text to obtain a voice recognition error correction model.
Based on any one of the above embodiments, there is provided a speech recognition error correction method, including:
s1, determining a recognition text of voice data to be corrected.
S2, extracting acoustic features of the voice data to obtain the acoustic features of each voice frame of the voice data; aligning the recognition text with the predicted text of each voice frame of the voice data, and determining the alignment position of each character in the recognition text in the voice data; and selecting the acoustic features of the aligned positions of the characters in the recognized text in the voice data from the acoustic features of the voice frames of the voice data as the acoustic features corresponding to the characters in the recognized text.
S3, based on a voice recognition error correction model, determining the position characteristics of each character in the recognition text by applying the alignment position of each character in the recognition text in the voice data; and correcting the recognized text by applying the acoustic features corresponding to the characters in the recognized text, the semantic features of the characters in the recognized text and the position features.
The speech recognition error correction model is obtained by training based on sample speech data, standard recognition texts of the sample speech data and candidate sample recognition texts.
The sample speech data is obtained by performing speech synthesis on the standard recognition text, and the candidate sample recognition text is obtained by performing speech recognition on the sample speech data.
The candidate sample recognition text is obtained by adding disturbance to the standard recognition text, wherein the added disturbance comprises at least one of character replacement, insertion or deletion; wherein the replacement character is determined based on a similar pronunciation for each character in the standard recognized text.
Based on any one of the above embodiments, another embodiment of the present invention provides a speech recognition error correction method, including:
determining an initial recognition text of the voice data, and displaying the initial recognition text;
aligning the initial recognition text with the candidate recognition text corresponding to the voice data, determining the aligned initial recognition text as the recognition text of the voice data to be corrected, and displaying the recognition text;
determining acoustic features corresponding to all characters in the recognition text based on the alignment positions of all the characters in the recognition text in the voice data; responding to the selection operation of the user on the characters in the identification text, and determining the characters to be corrected from the identification text;
correcting the character to be corrected based on the acoustic characteristic corresponding to the character to be corrected and the semantic characteristic of the character to be corrected;
and under the condition that the character to be corrected is a special symbol without semantics, correcting the character to be corrected based on the alignment positions of the character to be corrected in the candidate recognition texts.
The following describes the speech recognition error correction apparatus provided by the present invention, and the speech recognition error correction apparatus described below and the speech recognition error correction method described above can be referred to correspondingly.
Fig. 7 is a schematic structural diagram of a speech recognition error correction apparatus provided by the present invention, and as shown in fig. 7, the speech recognition error correction apparatus includes a recognized text determination unit 710, an acoustic feature determination unit 720, and an error correction unit 730.
The device comprises a recognition text determination unit, an acoustic feature determination unit, and an error correction unit, wherein the recognition text determination unit is used for determining the recognition text of the voice data to be corrected;
an acoustic feature determining unit, configured to determine, based on an alignment position of each character in the recognition text in the speech data, an acoustic feature corresponding to each character in the recognition text;
and the error correction unit is configured to correct the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
When the speech recognition error correction apparatus provided by the embodiment of the present invention corrects the recognized text, it uses not only the semantic features of each character in the recognized text but also the acoustic features corresponding to each character. Compared with the related art, which considers only semantic features, the apparatus captures both the acoustic and semantic characteristics of each character and makes full use of multiple kinds of features to enhance the representation of the recognized text to be corrected, thereby improving the accuracy of error localization and error correction.
Based on any of the above embodiments, the error correction unit is further configured to:
determining the position characteristics of each character in the recognition text based on the alignment position of each character in the recognition text in the voice data;
and correcting the recognition text based on the acoustic features, the position features and the semantic features corresponding to the characters in the recognition text.
Based on any of the above embodiments, the error correction unit is further configured to:
adding the position features of the characters in the recognition text to the semantic features to obtain position-semantic features of the characters in the recognition text;
concatenating the position-semantic features of the characters in the recognition text with the acoustic features to obtain concatenated features of the characters in the recognition text;
and correcting the recognition text based on the concatenated features of the characters in the recognition text.
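As an illustrative sketch only (the plain-Python list representation and vector dimensions are assumptions, not the patent's implementation), the feature-combination step described above — adding the position feature to the semantic feature, then splicing the result with the acoustic feature — can be expressed as:

```python
def fuse_features(semantic, position, acoustic):
    """Fuse per-character features: the position feature is added
    element-wise to the semantic feature, and the result is spliced
    (concatenated) with the acoustic feature.

    semantic, position: lists of d_model-dim vectors, one per character
    acoustic:           lists of d_acoustic-dim vectors, one per character
    """
    fused = []
    for sem, pos, ac in zip(semantic, position, acoustic):
        pos_sem = [s + p for s, p in zip(sem, pos)]  # add position to semantics
        fused.append(pos_sem + ac)                   # splice with acoustics
    return fused

# toy example: 2 characters, d_model = 3, d_acoustic = 2
sem = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
pos = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
ac = [[9.0, 9.0], [8.0, 8.0]]
print(fuse_features(sem, pos, ac)[0])  # [1.5, 1.5, 1.5, 9.0, 9.0]
```

The addition keeps the fused vector at the model dimension, while the splice lets the downstream error-correction layers see the acoustic evidence alongside the position-aware semantics.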
Based on any of the above embodiments, the recognized text determining unit is further configured to:
determining an initial recognition text of voice data, and displaying the initial recognition text;
aligning the initial recognition text with the candidate recognition text corresponding to the voice data, determining the aligned initial recognition text as the recognition text of the voice data to be corrected, and displaying the recognition text;
accordingly, the error correction unit is further configured to:
in response to a user's selection operation on characters in the recognition text, determining a character to be corrected from the recognition text;
and correcting the character to be corrected based on the acoustic characteristic corresponding to the character to be corrected and the semantic characteristic of the character to be corrected.
Based on any of the above embodiments, the apparatus further includes a character error correction unit, configured to:
and under the condition that the character to be corrected is a special symbol without semantic meaning, correcting the character to be corrected based on the alignment positions of the character to be corrected in the candidate recognition texts.
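One simple way to realize the special-symbol case above is a majority vote over what the candidate recognition texts contain at the symbol's aligned position. The voting strategy and the assumption that the candidates are already aligned to equal length are illustrative choices, not details stated in the patent:

```python
from collections import Counter

def correct_special_symbol(char_index, candidate_texts):
    """Correct a non-semantic special symbol by majority vote over the
    aligned position in each candidate recognition text.

    candidate_texts: aligned candidate strings (assumed equal length)
    char_index:      aligned position of the symbol to correct
    """
    votes = Counter(text[char_index] for text in candidate_texts
                    if char_index < len(text))
    best, _count = votes.most_common(1)[0]
    return best

# three aligned candidates disagree at position 2
candidates = ["ab#cd", "ab,cd", "ab,cd"]
print(correct_special_symbol(2, candidates))  # ","
```

Because a symbol such as "#" carries no semantics for the language model, cross-candidate agreement at the aligned position is a reasonable fallback signal.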
Based on any of the above embodiments, the acoustic feature determination unit is further configured to:
extracting acoustic features of the voice data to obtain acoustic features of each voice frame of the voice data;
aligning the recognition text with the predicted text of each voice frame of the voice data, and determining the alignment position of each character in the recognition text in the voice data;
and selecting the acoustic features of the characters in the recognition text at the alignment positions in the voice data from the acoustic features of the voice frames of the voice data as the acoustic features corresponding to the characters in the recognition text.
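The frame-selection step above can be sketched as follows. The choice to mean-pool the frames within each character's aligned span is an assumption for illustration; the patent only states that the acoustic features at the alignment positions are selected:

```python
def select_char_acoustics(frame_feats, alignment):
    """For each character, gather the acoustic features of the speech
    frames it aligns to, and pool them into one vector per character
    (mean pooling here is an assumed choice).

    frame_feats: list of per-frame acoustic feature vectors
    alignment:   list of (start_frame, end_frame) spans, one per character
    """
    char_feats = []
    for start, end in alignment:
        span = frame_feats[start:end]
        dim = len(span[0])
        mean = [sum(f[d] for f in span) / len(span) for d in range(dim)]
        char_feats.append(mean)
    return char_feats

# 4 frames of 2-dim features; two characters over frames [0, 2) and [2, 4)
frames = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
print(select_char_acoustics(frames, [(0, 2), (2, 4)]))  # [[1.0, 1.0], [5.0, 5.0]]
```

The spans would in practice come from aligning the recognition text against the per-frame predicted text, e.g. via a forced-alignment pass of the recognizer.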
Based on any of the above embodiments, the error correction unit is further configured to:
correcting the recognition text by using the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text, based on a speech recognition error correction model;
the speech recognition error correction model is obtained by training based on sample speech data, standard recognition texts of the sample speech data, and candidate sample recognition texts.
Based on any of the above embodiments, the sample speech data is obtained by performing speech synthesis on the standard recognition text, and the candidate sample recognition text is obtained by performing speech recognition on the sample speech data.
Based on any of the above embodiments, the candidate sample recognition text is obtained by adding perturbation to the standard recognition text, the perturbation including at least one of character replacement, insertion, or deletion; the replacement character is determined based on characters whose pronunciation is similar to that of the characters in the standard recognition text.
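A minimal sketch of such perturbation-based candidate generation is shown below. The `SIMILAR` confusion table, the probabilities, and the insertion alphabet are all toy assumptions; a real system would draw replacements from a pronunciation (e.g. pinyin) confusion dictionary:

```python
import random

# toy similar-pronunciation table (assumed for illustration only)
SIMILAR = {"s": ["sh"], "n": ["l"], "c": ["ch"]}

def perturb(text, p=0.2, rng=None):
    """Build a candidate sample text from a standard recognition text by
    randomly replacing characters with similar-sounding ones, deleting
    characters, or inserting noise characters."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        op = rng.random()
        if op < p and ch in SIMILAR:       # replace with a near-homophone
            out.append(rng.choice(SIMILAR[ch]))
        elif op < 1.5 * p:                 # delete the character
            continue
        else:
            out.append(ch)                 # keep the character unchanged
        if rng.random() < 0.5 * p:         # occasionally insert noise
            out.append(rng.choice(sorted(SIMILAR)))
    return "".join(out)

print(perturb("sunshine"))
```

Pairing such perturbed candidates with the unmodified standard text gives the model supervised examples of recognizer-style errors without needing real misrecognitions.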
Based on any of the above embodiments, the speech recognition error correction apparatus further includes a model training unit configured to:
pre-training the initial model based on context information of the sample text to obtain a pre-training model;
and training the pre-training model based on sample voice data, a standard recognition text of the sample voice data and a candidate sample recognition text to obtain the voice recognition error correction model.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a processor (processor) 810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a speech recognition error correction method comprising: determining a recognition text of the voice data to be corrected; determining acoustic features corresponding to the characters in the recognition text based on the alignment positions of the characters in the recognition text in the voice data; and correcting the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
In addition, the logic instructions in the memory 830 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer-readable storage medium, the computer program, when executed by a processor, being capable of executing the speech recognition error correction method provided by the above methods, the method comprising: determining a recognition text of the voice data to be corrected; determining acoustic features corresponding to the characters in the recognition text based on the alignment positions of the characters in the recognition text in the voice data; and correcting the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the speech recognition error correction method provided by the above methods, the method comprising: determining a recognition text of the voice data to be corrected; determining acoustic features corresponding to the characters in the recognition text based on the alignment positions of the characters in the recognition text in the voice data; and correcting the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method for speech recognition error correction, comprising:
determining a recognition text of the voice data to be corrected;
determining acoustic features corresponding to the characters in the recognition text based on the alignment positions of the characters in the recognition text in the voice data;
and correcting the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
2. The method according to claim 1, wherein the correcting the recognized text based on the acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text comprises:
determining the position characteristics of each character in the recognition text based on the alignment position of each character in the recognition text in the voice data;
and correcting the recognition text based on the acoustic features, the position features and the semantic features corresponding to the characters in the recognition text.
3. The method of claim 2, wherein the correcting the recognized text based on the acoustic feature, the position feature and the semantic feature corresponding to each character in the recognized text comprises:
adding the position features of the characters in the recognition text to the semantic features to obtain position-semantic features of the characters in the recognition text;
concatenating the position-semantic features of the characters in the recognition text with the acoustic features to obtain concatenated features of the characters in the recognition text;
and correcting the recognition text based on the concatenated features of the characters in the recognition text.
4. The speech recognition error correction method of claim 1, wherein the determining the recognized text of the speech data to be error corrected comprises:
determining an initial recognition text of voice data, and displaying the initial recognition text;
aligning the initial recognition text with the candidate recognition text corresponding to the voice data, determining the aligned initial recognition text as the recognition text of the voice data to be corrected, and displaying the recognition text;
the correcting the recognized text based on the acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text comprises:
in response to a user's selection operation on characters in the recognition text, determining a character to be corrected from the recognition text;
and correcting the character to be corrected based on the acoustic characteristic corresponding to the character to be corrected and the semantic characteristic of the character to be corrected.
5. The speech recognition error correction method of claim 4, further comprising:
and under the condition that the character to be corrected is a special symbol without semantic meaning, correcting the character to be corrected based on the alignment position of the character to be corrected in the candidate recognition text.
6. The method according to any one of claims 1 to 5, wherein the determining the acoustic feature corresponding to each character in the recognized text based on the alignment position of each character in the recognized text in the speech data comprises:
extracting acoustic features of the voice data to obtain acoustic features of each voice frame of the voice data;
aligning the recognition text with the predicted text of each voice frame of the voice data, and determining the alignment position of each character in the recognition text in the voice data;
and selecting the acoustic features of the characters in the recognition text at the alignment positions in the voice data from the acoustic features of the voice frames of the voice data as the acoustic features corresponding to the characters in the recognition text.
7. The method according to claim 1, wherein the correcting the recognized text based on the acoustic features corresponding to the characters in the recognized text and the semantic features of the characters in the recognized text comprises:
based on a voice recognition error correction model, applying acoustic features corresponding to all characters in the recognition text and semantic features of all characters in the recognition text to correct errors of the recognition text;
the speech recognition error correction model is trained on sample speech data, standard recognition texts of the sample speech data and candidate sample recognition texts.
8. The speech recognition error correction method according to claim 7, wherein the sample speech data is obtained by speech synthesizing the standard recognition text, and the candidate sample recognition text is obtained by speech recognizing the sample speech data.
9. The speech recognition error correction method of claim 7, wherein the candidate sample recognition texts are obtained by adding perturbation to the standard recognition texts, wherein the added perturbation comprises at least one of character replacement, insertion or deletion;
wherein the replacement character is determined based on the similar pronunciation of each character in the standard recognition text.
10. The speech recognition error correction method according to any one of claims 7-9, wherein the speech recognition error correction model is trained based on the following steps:
pre-training the initial model based on context information of the sample text to obtain a pre-training model;
and training the pre-training model based on sample voice data, the standard recognition text of the sample voice data and the candidate sample recognition text to obtain the voice recognition error correction model.
11. A speech recognition error correction apparatus, comprising:
an identification text determination unit for determining an identification text of the voice data to be corrected;
an acoustic feature determining unit, configured to determine, based on an alignment position of each character in the recognition text in the speech data, an acoustic feature corresponding to each character in the recognition text;
and the error correction unit is used for correcting the error of the recognition text based on the acoustic features corresponding to the characters in the recognition text and the semantic features of the characters in the recognition text.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the speech recognition error correction method according to any one of claims 1 to 10 when executing the program.
13. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the speech recognition error correction method according to any one of claims 1 to 10.
CN202211080639.6A 2022-09-05 2022-09-05 Voice recognition error correction method and device, electronic equipment and storage medium Pending CN115455946A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211080639.6A CN115455946A (en) 2022-09-05 2022-09-05 Voice recognition error correction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211080639.6A CN115455946A (en) 2022-09-05 2022-09-05 Voice recognition error correction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115455946A true CN115455946A (en) 2022-12-09

Family

ID=84301964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211080639.6A Pending CN115455946A (en) 2022-09-05 2022-09-05 Voice recognition error correction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115455946A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757184A (en) * 2023-08-18 2023-09-15 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN116757184B (en) * 2023-08-18 2023-10-20 昆明理工大学 Vietnam voice recognition text error correction method and system integrating pronunciation characteristics
CN117238276A (en) * 2023-11-10 2023-12-15 深圳市托普思维商业服务有限公司 Analysis correction system based on intelligent voice data recognition
CN117238276B (en) * 2023-11-10 2024-01-30 深圳市托普思维商业服务有限公司 Analysis correction system based on intelligent voice data recognition

Similar Documents

Publication Publication Date Title
CN109887497B (en) Modeling method, device and equipment for speech recognition
US11797772B2 (en) Word lattice augmentation for automatic speech recognition
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN111177324B (en) Method and device for carrying out intention classification based on voice recognition result
CN112002308A (en) Voice recognition method and device
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
CN102063900A (en) Speech recognition method and system for overcoming confusing pronunciation
KR102199246B1 (en) Method And Apparatus for Learning Acoustic Model Considering Reliability Score
CN112397056B (en) Voice evaluation method and computer storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN112259083B (en) Audio processing method and device
JP2016062069A (en) Speech recognition method and speech recognition apparatus
KR20230147685A (en) Word-level reliability learning for subword end-to-end automatic speech recognition
CN112669845B (en) Speech recognition result correction method and device, electronic equipment and storage medium
JP6941494B2 (en) End-to-end Japanese speech recognition model learning device and program
CN115985342A (en) Pronunciation error detection method and device, electronic equipment and storage medium
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
Tanaka et al. Neural speech-to-text language models for rescoring hypotheses of dnn-hmm hybrid automatic speech recognition systems
KR101483947B1 (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
JP6718787B2 (en) Japanese speech recognition model learning device and program
KR102299269B1 (en) Method and apparatus for building voice database by aligning voice and script
CN115795008A (en) Spoken language dialogue state tracking model training method and spoken language dialogue state tracking method
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device
CN115376547A (en) Pronunciation evaluation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination