CN114596845A - Training method of voice recognition model, voice recognition method and device


Info

Publication number: CN114596845A
Application number: CN202210385772.6A
Authority: CN (China)
Prior art keywords: recognition, voice, speech, data, text
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 孟庆林, 蒋宁, 吴海英, 王洪斌, 刘敏, 陈燕丽
Current assignee: Mashang Xiaofei Finance Co Ltd
Original assignee: Mashang Xiaofei Finance Co Ltd
Application filed by Mashang Xiaofei Finance Co Ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/005 - Language recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a training method for a speech recognition model, a speech recognition method, and a speech recognition device. The training method comprises the following steps: acquiring a mixed data set and the labeled text and language label of each piece of speech data in the mixed data set, wherein the mixed data set comprises first-sample Mandarin speech data and sample dialect speech data; inputting the mixed data set, together with the labeled texts and language labels of its speech data, into an initial speech recognition model to obtain a recognition result for each piece of speech data, wherein a content recognition network encodes the speech data to obtain feature vectors and performs speech recognition based on those feature vectors to obtain a recognized text, and a language classifier performs language recognition based on the feature vectors to obtain a recognized language; determining a total recognition loss based on the recognition results for the speech data in the mixed data set and on the labeled texts and language labels of the speech data; and iteratively training the initial speech recognition model based on the total recognition loss to obtain the speech recognition model.

Description

Training method of voice recognition model, voice recognition method and device
Technical Field
The present application relates to the field of speech processing technologies, and in particular to a method for training a speech recognition model, a speech recognition method, and corresponding apparatuses.
Background
In real life, it is often necessary to recognize speech in various languages, such as Mandarin, dialects, and the like. In general, a corresponding speech recognition model is trained for each language; during recognition, the model corresponding to the language of the speech is then used, which can achieve a good recognition effect.
However, in practical applications, speech in different languages may be mixed together; for example, a speaker may mix Mandarin and dialect within a single utterance. This makes it difficult to determine and select a speech recognition model that is effective for such speech, so speech recognition cannot be performed effectively. Therefore, how to train a speech recognition model with a good recognition effect on speech of multiple languages is a problem that currently needs to be solved.
Disclosure of Invention
The embodiments of the present application aim to provide a training method for a speech recognition model, a speech recognition method, and corresponding devices, so that the trained speech recognition model has a good recognition effect on speech of multiple languages.
In order to achieve the above purpose, the following technical solutions are adopted in the embodiments of the present application:
In a first aspect, an embodiment of the present application provides a method for training a speech recognition model, including:
acquiring a mixed data set and the labeled text and language label of each piece of speech data in the mixed data set, wherein the mixed data set comprises first-sample Mandarin speech data and sample dialect speech data;
inputting the mixed data set and the labeled texts and language labels of its speech data into an initial speech recognition model to obtain a recognition result for the speech data in the mixed data set, wherein the recognition result comprises a recognized text and a recognized language;
determining a total recognition loss of the initial speech recognition model based on the recognition results and on the labeled texts and language labels of the speech data in the mixed data set;
performing iterative training on the initial speech recognition model based on the total recognition loss to obtain the speech recognition model;
wherein the initial speech recognition model comprises a content recognition network and a language classifier: the content recognition network encodes the speech data in the mixed data set to obtain corresponding feature vectors and performs speech recognition based on the feature vectors to obtain the recognized text; the language classifier performs language recognition based on the feature vectors to obtain the recognized language; and the content recognition network is pre-trained on second-sample Mandarin speech data and the labeled text of the second-sample Mandarin speech data.
It can be seen that, in the embodiment of the present application, a single speech recognition model is trained on a mixed data set containing first-sample Mandarin speech data and sample dialect speech data, instead of training a separate model for each language. The resulting model can recognize speech data of multiple languages, which avoids the problem that, when one model is trained per language, a model with a good recognition effect cannot be effectively selected for mixed-language speech. On this basis, a multi-task learning architecture comprising a content recognition network and a language classifier is adopted: the mixed data set is input into the initial speech recognition model; the content recognition network performs the speech content recognition task, learning content-related features of speech data of different languages from the mixed data set so as to recognize the text corresponding to the speech data; and the language classifier performs the language recognition task, learning language-related features of the speech data so as to recognize its language. Further, the total recognition loss of the initial model is determined from the recognition results it outputs for the mixed data set and from the labeled texts and language labels of the speech data, and the initial model is iteratively trained on this total loss to obtain the speech recognition model. In this way, the speech content recognition task performed by the content recognition network and the language recognition task performed by the language classifier are closely linked; the two tasks share information to learn more knowledge and mutually promote each other, improving the cross-language robustness of the speech recognition model, i.e., giving it a good recognition effect on speech data of multiple languages. In addition, the content recognition network is pre-trained on the second-sample Mandarin speech data and its labeled text, so that it can already recognize Mandarin speech before learning from the mixed data set. This accelerates the convergence of the speech recognition model, lets the model focus on the speech-to-text content recognition task, enables it to quickly learn the differences in content-related features between Mandarin and dialect speech data, and thereby improves the model's multilingual recognition effect.
In a second aspect, an embodiment of the present application provides a speech recognition method, including:
performing feature extraction on the speech to be processed to obtain speech data of the speech to be processed;
performing speech recognition on the speech data of the speech to be processed through a content recognition network of a speech recognition model to obtain a recognized text of the speech to be processed;
wherein the speech recognition model is obtained by model training based on the labeled texts and language labels of the speech data in a mixed data set and on the recognition results output by the model for the mixed data set; the mixed data set includes first-sample Mandarin speech data and sample dialect speech data; the speech recognition model includes a content recognition network and a language classifier; each recognition result comprises a recognized text, obtained by the content recognition network performing speech recognition on the speech data in the mixed data set, and a recognized language, obtained by the language classifier performing language recognition on that speech data; and the content recognition network is pre-trained on second-sample Mandarin speech data and the labeled text of the second-sample Mandarin speech data.
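For illustration only, a minimal Python inference sketch consistent with the above might look as follows. The model attributes (input_proj, encoder, ctc_head), the 80-dimensional fbank features, and a CTC blank id of 0 are assumptions of the sketch, not details fixed by this application; they match the hypothetical model structure sketched later in the detailed description.

    import torch
    import torchaudio

    @torch.no_grad()
    def recognize(model, vocab, wav_path):
        waveform, sr = torchaudio.load(wav_path)
        feats = torchaudio.compliance.kaldi.fbank(
            waveform, num_mel_bins=80, sample_frequency=sr)      # feature extraction
        h = model.encoder(model.input_proj(feats.unsqueeze(0)))  # feature vectors
        ids = model.ctc_head(h).argmax(dim=-1).squeeze(0)        # greedy CTC path
        chars, prev = [], -1
        for i in ids.tolist():   # collapse repeats and drop the blank (assumed id 0)
            if i != prev and i != 0:
                chars.append(vocab[i])
            prev = i
        return "".join(chars)

Note that only the content recognition network is needed at recognition time; the language classifier is used during training.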
Therefore, in the embodiment of the present application, the recognized text of the speech to be processed can be obtained simply by inputting its speech data into the pre-trained speech recognition model, which is simple, fast, and efficient. Moreover, the speech recognition model is trained with a multi-task learning approach. During training, a single model is trained on a mixed data set containing first-sample Mandarin speech data and sample dialect speech data instead of one model per language, so the model can recognize speech data of multiple languages, and the problem that a well-performing model cannot be effectively selected among per-language models is avoided. The multi-task learning architecture links the speech content recognition task performed by the content recognition network with the language recognition task performed by the language classifier: the two tasks share information to learn more knowledge and promote each other, which improves the cross-language robustness of the speech recognition model, i.e., its recognition effect on speech data of multiple languages. In addition, pre-training the content recognition network on the second-sample Mandarin speech data and its labeled text gives it the ability to recognize Mandarin speech before learning from the mixed data set, which accelerates convergence, focuses the model on the speech-to-text content recognition task, and lets it quickly learn the content-related differences between Mandarin and dialect speech data, improving the model's multilingual recognition effect. Performing speech recognition on the speech to be processed with a model trained in this way therefore improves recognition accuracy.
In a third aspect, an embodiment of the present application provides a training apparatus for a speech recognition model, including:
a first acquisition module, configured to acquire a mixed data set and the labeled text and language label of each piece of speech data in the mixed data set, wherein the mixed data set comprises first-sample Mandarin speech data and sample dialect speech data;
a first recognition module, configured to input the mixed data set and the labeled texts and language labels of its speech data into an initial speech recognition model to obtain a recognition result for the speech data in the mixed data set, where the recognition result includes a recognized text and a recognized language, and the initial speech recognition model includes a content recognition network and a language classifier; the content recognition network is configured to encode the speech data in the mixed data set to obtain corresponding feature vectors and to perform speech recognition based on the feature vectors to obtain the recognized text; the language classifier is configured to perform language recognition based on the feature vectors to obtain the recognized language; and the content recognition network is pre-trained on second-sample Mandarin speech data and the labeled text of the second-sample Mandarin speech data;
a first loss determining module, configured to determine a total recognition loss of the initial speech recognition model based on the recognition results and on the labeled texts and language labels of the speech data in the mixed data set;
and a first training module, configured to perform iterative training on the initial speech recognition model based on the total recognition loss to obtain the speech recognition model.
In a fourth aspect, an embodiment of the present application provides a speech recognition apparatus, including:
a feature extraction module, configured to perform feature extraction on the speech to be processed to obtain speech data of the speech to be processed;
a second recognition module, configured to perform speech recognition on the speech data of the speech to be processed through a content recognition network of a speech recognition model to obtain a recognized text of the speech to be processed;
wherein the speech recognition model is obtained by model training based on the labeled texts and language labels of the speech data in a mixed data set and on the recognition results output by the model for the mixed data set; the mixed data set includes first-sample Mandarin speech data and sample dialect speech data; the speech recognition model includes a content recognition network and a language classifier; each recognition result comprises a recognized text, obtained by the content recognition network performing speech recognition on the speech data in the mixed data set, and a recognized language, obtained by the language classifier performing language recognition on that speech data; and the content recognition network is pre-trained on second-sample Mandarin speech data and the labeled text of the second-sample Mandarin speech data.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, where instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method according to the first aspect or the second aspect.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart illustrating a method for training a speech recognition model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a speech recognition model according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating a method for pre-training a content recognition network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a pre-training method for a content recognition network according to another embodiment of the present application;
FIG. 5 is a flow chart illustrating a speech recognition method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for training a speech recognition model according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or described herein. In addition, "and/or" in the specification and claims means at least one of connected objects, and a character "/" generally means that the former and latter related objects are in an "or" relationship.
Partial concept description:
speech Recognition (Speech Recognition): the aim is to automatically convert the human voice content into corresponding words by a computer.
A Transformer: a time sequence model based on a self-attention mechanism can effectively encode time sequence information, has far better processing capability than other time sequence models, and is high in speed. The Transformer can be widely applied to the fields of natural language processing, computer vision, machine translation, voice recognition and the like.
Former: a time sequence model combining a Transformer and a Convolutional Neural Network (CNN), the Transformer is good at capturing global interaction based on contents, and the CNN can effectively utilize local features, so that the Transformer can effectively combine long-term global interaction information and local features.
LSTM (Long Short-Term Memory): a long and short term memory Network is a time Recurrent Neural Network (RNN) and mainly aims to solve the problems of gradient elimination and gradient explosion in the long sequence training process. In short, LSTM can perform better in longer sequences than normal RNNs.
Multi-task learning: is a parallel migration mode. The traditional transfer learning emphasizes the sequence of learning, namely, knowledge learned in one field is transferred to another field, and the process of knowledge transfer is performed in series. In the multi-task learning, information is shared among different tasks, so that knowledge is migrated among different tasks, and therefore, the multi-task learning is also called parallel migration learning. The multi-task learning method improves the overall learning effect through multi-task information sharing, which is particularly effective for learning on small samples. For a large number of small sample learning tasks, the multi-task learning method can make full use of the information of a plurality of small samples, and the overall multi-task learning effect is improved.
In order to enable a trained speech recognition model to have a good recognition effect on multiple types of speech, the embodiments of the present application provide a training method for a speech recognition model. A single model is trained on a mixed data set containing first-sample Mandarin speech data and sample dialect speech data, instead of training a separate model for each language, so that the model can recognize speech data of multiple languages and the problem of being unable to effectively select a well-performing per-language model is avoided. On this basis, a multi-task learning architecture comprising a content recognition network and a language classifier is adopted: the mixed data set is input into the initial speech recognition model; the content recognition network performs the speech content recognition task, learning content-related features of speech data of different languages from the mixed data set so as to recognize the corresponding text; and the language classifier performs the language recognition task, learning language-related features so as to recognize the language. The total recognition loss of the initial model is then determined from its recognition results for the mixed data set and from the labeled texts and language labels of the speech data, and the model is iteratively trained on this loss to obtain the speech recognition model. The two learning tasks are thereby closely linked, sharing information and promoting each other, which improves the cross-language robustness of the final model, i.e., its recognition effect on speech data of multiple languages. In addition, pre-training the content recognition network on second-sample Mandarin speech data and its labeled text gives it the ability to recognize Mandarin speech before learning from the mixed data set, which accelerates convergence, focuses the model on the speech-to-text content recognition task, enables it to quickly learn the content-related differences between Mandarin and dialect speech data, and improves its multilingual recognition effect.
Further, an embodiment of the present application provides a speech recognition method in which speech recognition is performed on the speech to be processed by the content recognition network of the trained speech recognition model. When the language of the speech to be processed is unknown, and especially when Mandarin and dialect are mixed within it, the recognized text can be obtained accurately without a human judging the language or switching recognition modes, which improves speech recognition efficiency.
It should be understood that the training method and the speech recognition method provided in the embodiments of the present application may be executed by an electronic device or by software installed in an electronic device, and specifically may be executed by a terminal device or a server device.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, a flow chart of a method for training a speech recognition model according to an embodiment of the present application is shown, where the method includes the following steps:
s102, acquiring the mixed data set and the labeled text and language labels of the voice data in the mixed data set.
In an embodiment of the present application, the mixed data set includes first-sample Mandarin speech data and sample dialect speech data. The first-sample Mandarin speech data may include acoustic features of first-sample Mandarin speech, and the sample dialect speech data may include acoustic features of sample dialect speech; the acoustic features of speech may include its fbank features, which may be obtained by extracting features from the speech with the kaldi toolkit or the torchaudio toolkit. In practical applications, the mixed data set may include multiple pieces of first-sample Mandarin speech data and multiple pieces of sample dialect speech data (including, for example, Chongqing dialect speech data, Cantonese speech data, Minnan speech data, etc.).
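As a concrete illustration, the fbank extraction mentioned above could be done with the torchaudio toolkit as in the following sketch; the 80 mel bins and the 25 ms / 10 ms framing are assumed values, not ones specified by this application.

    import torchaudio

    def extract_fbank(wav_path):
        # Load a sample speech file and compute Kaldi-style fbank features.
        waveform, sample_rate = torchaudio.load(wav_path)
        return torchaudio.compliance.kaldi.fbank(
            waveform,
            num_mel_bins=80,            # assumed feature dimension
            frame_length=25.0,          # ms, assumed
            frame_shift=10.0,           # ms, assumed
            sample_frequency=sample_rate,
        )                               # shape: (num_frames, num_mel_bins)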
In practical applications, each sample speech in the sample library has a corresponding voice index and language index. The voice index uniquely identifies the sample speech and may, for example, be represented by a number; the language index uniquely identifies the language of the sample speech, for example 0 for Mandarin speech, 1 for Sichuan-Chongqing dialect speech, 2 for Cantonese speech, and so on. Therefore, based on the voice index and language index of each sample speech in the sample library, the first-sample Mandarin speech and the sample dialect speech can be retrieved from the sample library, and feature extraction is performed on the retrieved sample speech to obtain the corresponding speech data.
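Purely as a hypothetical illustration of the voice-index and language-index bookkeeping described above (the file names, texts, and exact index set are invented for the example):

    # Hypothetical language-index table following the example above.
    LANG_INDEX = {"mandarin": 0, "sichuan_chongqing": 1, "cantonese": 2}

    mixed_dataset = [
        # (voice index, acoustic features, labeled text, language index)
        (0, extract_fbank("mandarin_0001.wav"), "今天天气很好", LANG_INDEX["mandarin"]),
        (1, extract_fbank("cantonese_0001.wav"), "今日天氣好好", LANG_INDEX["cantonese"]),
    ]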
S104, inputting the mixed data set and the labeled texts and language labels of its speech data into the initial speech recognition model to obtain recognition results for the speech data in the mixed data set.
In this embodiment, the initial speech recognition model may adopt a multi-task architecture. As shown in fig. 2, it may include a content recognition network and a language classifier, where the content recognition network recognizes the content of speech data to obtain a recognized text, and the language classifier recognizes the language of speech data to obtain a recognized language. On this basis, the mixed data set and the labeled texts of its speech data are input into the initial speech recognition model; the content recognition network performs the speech content recognition task, learning content-related features of speech data of different languages from the mixed data set based on the labeled texts so as to acquire content recognition capability, while the language classifier performs the language recognition task, learning language-related features from the mixed data set based on the language labels so as to acquire language recognition capability.
That is to say, in the embodiment of the present application, the mixed data set and the labeled texts and language labels of its speech data are input into the initial speech recognition model, and the obtained recognition results include the recognized text and recognized language of each piece of speech data in the mixed data set.
In this embodiment of the present application, the content recognition network may have any appropriate structure, which may be configured according to actual needs; this is not limited in the embodiments of the present application. Optionally, to improve the speech recognition effect of the content recognition network, as shown in fig. 2, it may include an encoder and a decoder.
The encoder is configured to encode the speech data in the mixed data set, based on the labeled texts of that speech data, to obtain feature vectors, and to perform speech recognition on the speech data based on a Connectionist Temporal Classification (CTC) mechanism and those feature vectors, obtaining a first recognized text for the speech data in the mixed data set. Specifically, the encoder may extract acoustic features from the speech data based on the labeled texts, encode the extracted acoustic features into feature vectors that characterize the acoustic features of the speech data, and, based on the CTC mechanism and the acoustic features characterized by the feature vectors, recognize the speech data over the whole time sequence, yielding the recognized text of the speech data in the mixed data set, i.e., the first recognized text.
In practical applications, the encoder may adopt any structure having encoding and speech recognition functions, which may be set according to actual needs; this is not limited in the embodiments of the present application. Optionally, since speech data is a kind of time-series data, the encoder may include a Transformer encoder and/or a Conformer encoder. The Transformer encoder can effectively encode time-series information, processes it far better than other sequence models, and is fast. The Conformer encoder is a sequence model combining the Transformer encoder and a CNN: the Transformer part is good at capturing content-based global interactions in the speech data, while the CNN part effectively exploits its local features, so the Conformer encoder can combine long-range global interaction information and local features of the speech data when encoding it and performing CTC-based speech recognition on it.
The decoder is configured to perform speech recognition on the speech data in the mixed data set based on an attention mechanism and the feature vectors of that speech data, obtaining a second recognized text. Specifically, the attention mechanism screens out a small amount of important information from a large amount of information, focuses on that important information, and ignores most of the unimportant information. In the embodiment of the present application, according to the relations and differences between the acoustic features of each frame of speech in the feature vectors produced by the encoder, the decoder increases the weight of the acoustic features of the frames that strongly influence the speech recognition result and decreases the weight of the acoustic features of the frames that influence it only weakly, obtaining a feature matrix that accurately expresses the semantics of the speech data in the mixed data set; the content of that speech data can then be recognized based on the obtained feature matrix.
In specific applications, the decoder may adopt any structure having decoding and speech recognition functions, which may be set according to actual needs; this is not limited in the embodiments of the present application. Optionally, since speech data is a kind of time-series data, the decoder may include a Transformer decoder and/or a long short-term memory network. The Transformer decoder can effectively and quickly decode the feature vectors of the speech data based on their time sequence, while the long short-term memory network avoids the vanishing- and exploding-gradient problems that occur when training on long speech sequences and therefore decodes long sequences well.
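For illustration, a structural sketch of the initial speech recognition model of fig. 2 is given below. It uses a plain Transformer encoder and decoder from PyTorch in place of the Transformer/Conformer and Transformer/LSTM options discussed above; all layer sizes, the embedded-text input to the decoder, and the mean-pooled utterance representation fed to the language classifier are assumptions of the sketch, not details fixed by this application.

    import torch
    import torch.nn as nn

    class InitialSpeechRecognitionModel(nn.Module):
        def __init__(self, feat_dim=80, d_model=256, vocab_size=5000, num_langs=4):
            super().__init__()
            self.input_proj = nn.Linear(feat_dim, d_model)
            # Encoder: encodes speech data into feature vectors; its CTC head
            # yields the first recognized text.
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=6)
            self.ctc_head = nn.Linear(d_model, vocab_size)
            # Decoder: attention-based recognition; yields the second recognized text.
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=3)
            self.att_head = nn.Linear(d_model, vocab_size)
            # Language classifier: predicts the language from the feature vectors.
            self.lang_classifier = nn.Linear(d_model, num_langs)

        def forward(self, feats, text_emb):
            # feats: (N, T, feat_dim); text_emb: embedded labeled text, (N, S, d_model)
            h = self.encoder(self.input_proj(feats))                # feature vectors
            ctc_logits = self.ctc_head(h)                           # CTC branch
            att_logits = self.att_head(self.decoder(text_emb, h))   # attention branch
            lang_logits = self.lang_classifier(h.mean(dim=1))       # utterance level
            return ctc_logits, att_logits, lang_logits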
It can be understood that, because a speech signal is non-stationary, only short-time Fourier transforms can be applied to it during recognition; a segment of speech therefore contains many frames, one character in the output text may correspond to multiple frames, and the final output text is much shorter than the input speech data. By using the CTC mechanism for speech recognition in the encoder, the sequence of speech frames over time and the sequence of characters in the corresponding text can be aligned automatically during model training, without labeling the start and end times of each character or phoneme; this amounts to classifying directly over the time sequence, i.e., recognizing the input speech data from the perspective of the whole time sequence. Meanwhile, using an attention mechanism in the decoder lets the content recognition network pay more attention, during training, to the relations among frames of speech data, including the relations among the different features within each frame and between each frame and its preceding and following frames, so that features with a large influence on the recognition result are strengthened and the semantics of the feature matrix are enhanced, producing an accurate recognized text; this amounts to recognizing speech from the perspective of individual frames within the time sequence. Therefore, the content recognition network learns speech content recognition from different perspectives, acquires richer knowledge, and its speech recognition effect is further improved.
S106, determining the total recognition loss of the initial speech recognition model based on the recognition results for the speech data in the mixed data set and on the labeled texts and language labels of that speech data.
In the embodiment of the present application, the total recognition loss of the initial speech recognition model represents the difference between the recognition results the model produces for the input speech data and the labeling information of that speech data.
In an optional implementation, considering that the initial speech recognition model adopts a multi-task learning architecture, it must recognize both the content and the language of the input speech data, and the recognition results of these two tasks may differ from their corresponding labels to different degrees. So that the obtained recognition loss accurately reflects the multi-task learning effect of the initial speech recognition model, the above S106 may include the following steps:
s161, determining a first recognition loss of the content recognition network based on the recognition text and the markup text of the voice data.
Wherein the first recognition loss of the content recognition network is used to represent the recognition loss caused by the content recognition network performing speech recognition on the mixed data set.
Specifically, as shown in fig. 2, since the encoder and the decoder in the content recognition network perform speech recognition on the input speech data from two different angles, in the above S161, a first recognizer loss may be determined based on a first recognition text of the speech data in the mixed data set and a labeled text of the speech data in the mixed data set, where the first recognizer loss is used to represent a recognition loss caused by the encoder performing speech recognition on the mixed data set based on a connection timing mechanism, that is, to reflect a difference between the recognition text and the labeled text obtained by the encoder performing speech recognition through a connection timing classification mechanism; determining a second identifier loss based on a second recognition text of the speech data in the mixed data set and an annotation text of the speech data in the mixed data set, wherein the second identifier loss is used for representing a recognition loss caused by the decoder performing the speech recognition on the mixed data set based on an attention mechanism, namely reflecting that the content recognition network performs the speech recognition through the attention mechanism to obtain a difference between the recognition text and the annotation text; then, a first identifier loss of the content identification network is determined based on the first identifier loss and the second identifier loss.
For example, the first recognizer loss may be determined based on the first recognized texts and labeled texts of the speech data in the mixed data set and the CTC loss function, which is commonly used in the art. The second recognizer loss may be determined based on the second recognized texts and labeled texts and a loss function corresponding to the attention mechanism. Further, the first recognizer loss and the second recognizer loss can be weighted and summed to obtain the first recognition loss of the content recognition network, namely:

Char_loss1 = λ1 · CTC_loss1 + λ2 · attention_loss1

where Char_loss1 denotes the first recognition loss of the content recognition network, CTC_loss1 denotes the first recognizer loss, λ1 denotes the weight of the first recognizer loss, attention_loss1 denotes the second recognizer loss, and λ2 denotes the weight of the second recognizer loss. λ1 and λ2 can be set according to actual needs, for example λ1 = λ2 = 0.5.
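A hedged sketch of this weighted combination follows, using PyTorch's built-in CTC and cross-entropy losses as stand-ins for the CTC-branch and attention-branch losses; padding handling is omitted for brevity, and the equal sequence lengths of att_logits and targets are an assumption.

    import torch.nn.functional as F

    def content_recognition_loss(ctc_logits, att_logits, targets,
                                 input_lengths, target_lengths,
                                 lam1=0.5, lam2=0.5):
        # CTC branch: F.ctc_loss expects log-probs shaped (T, N, vocab).
        log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
        ctc_loss1 = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
        # Attention branch: per-token cross entropy against the labeled text.
        attention_loss1 = F.cross_entropy(
            att_logits.reshape(-1, att_logits.size(-1)), targets.reshape(-1))
        # Char_loss1 = lam1 * CTC_loss1 + lam2 * attention_loss1
        return lam1 * ctc_loss1 + lam2 * attention_loss1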
Further, the mixed data set contains speech data of multiple languages, and the second recognized text obtained by the content recognition network through the attention mechanism is usually represented as a one-hot code or a text index (i.e., the index of the text in the dictionary). This can limit the knowledge the content recognition network learns: the network may become over-confident about the correct labeled text, the recognized texts of positive samples (speech data with correct labeled text) and negative samples (speech data with wrong labeled text) may differ little, and over-fitting easily occurs. In view of this, in the above S161, determining the second recognizer loss based on the second recognized texts and labeled texts of the speech data may be implemented as follows: based on the number of languages of the speech data in the mixed data set and the labeling form of the labeled texts, smooth the labeled texts of the speech data in the mixed data set; then determine the second recognizer loss based on the second recognized texts and the smoothed labeled texts.
The labeling form of a labeled text may be a one-hot coding form or a text index form, and a corresponding smoothing method is used for each form. For example, if the labeled text is in one-hot coded form, the labeled text of each piece of speech data in the mixed data set may be smoothed by the following formula (1); if it is in text index form, it may be smoothed by the following formula (2).
ŷ = (1 - α) · y_hot + α / K    (1)

ŷ_i = 1 - α, if i = target
ŷ_i = α / (K - 1), if i ≠ target    (2)

where ŷ (respectively ŷ_i) represents the smoothed labeled text of the i-th piece of speech data in the mixed data set; y_hot represents the labeled text of the i-th piece of speech data in one-hot coded form; target represents the text index of the real text of the i-th piece of speech data; i = target means that the text-index-form labeled text of the i-th piece of speech data is the same as the text index of the real text; i ≠ target means that it differs from the text index of the real text; K represents the number of languages of the speech data in the mixed data set; and α represents a preset adjustment coefficient, which may be set as needed, for example α = 0.1.
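The two smoothing formulas can be implemented directly; the following sketch assumes the reconstructed forms of equations (1) and (2) above (the original typeset formulas are image placeholders in the source text).

    import torch

    def smooth_one_hot(y_hot, alpha=0.1):
        # Equation (1): smooth a one-hot labeled text over K classes.
        K = y_hot.size(-1)
        return (1.0 - alpha) * y_hot + alpha / K

    def smooth_index(target, K, alpha=0.1):
        # Equation (2): build a smoothed distribution from a text index.
        y = torch.full((K,), alpha / (K - 1))
        y[target] = 1.0 - alpha
        return y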
It can be understood that smoothing the labeled text of each piece of speech data in the mixed data set is equivalent to adding noise to the true class distribution of that speech data. Determining the second recognizer loss based on the second recognized texts and the smoothed labeled texts then prevents the content recognition network from becoming over-confident about the correct labeled texts and improves its generalization to erroneous speech data.
It can also be understood that the recognition loss caused by the CTC mechanism and the recognition loss caused by the attention mechanism are determined separately, based on the first recognized texts, the second recognized texts, and the labeled texts of the speech data in the mixed data set, and the recognition loss of the content recognition network is then determined from these two losses. The obtained loss therefore accurately reflects the speech recognition effect of the content recognition network from different perspectives, which helps the network effectively fuse the knowledge learned from those perspectives and improves its recognition effect on the speech data.
S162, determining the recognition loss of the language classifier based on the recognized languages and the language labels of the speech data in the mixed data set.
In the embodiment of the present application, the language label of a piece of speech data indicates the real language to which it belongs. Language labels may be represented in any suitable form, such as one-hot codes or language indexes, where a language index uniquely identifies the corresponding language, for example 0 for Mandarin speech, 1 for Sichuan-Chongqing dialect speech, 2 for Cantonese speech, etc.
Specifically, the recognition loss of the language classifier can be determined based on the differences between the recognized languages and the language labels of the speech data in the mixed data set and on a preset loss function. In practical applications, the preset loss function may be set according to actual needs; for example, it may be a cross-entropy loss function, which is not limited in the embodiments of the present application.
S163, normalizing the first recognition loss of the content recognition network and the recognition loss of the language classifier to obtain the total recognition loss of the initial speech recognition model.
The distributions of the sample Mandarin speech data and the sample dialect speech data in the mixed data set are inconsistent, and this inconsistency can cause the initial speech recognition model to oscillate, degrading its recognition effect. In view of this, normalizing the first recognition loss of the content recognition network together with the recognition loss of the language classifier reduces, as much as possible, the oscillation caused by the inconsistent distributions and thereby improves the recognition effect of the initial speech recognition model.
Optionally, in order to closely associate the speech content recognition task performed by the content recognition network with the language recognition task performed by the language classifier, so that they share information and learn more knowledge while the oscillation caused by the inconsistent distributions is reduced as much as possible, thereby improving the recognition effect of the initial speech recognition model, the above S163 may be implemented as follows: perform a weighted summation of the first recognition loss of the content recognition network and the recognition loss of the language classifier to obtain the total recognition loss of the initial speech recognition model, namely:

Total_loss = λ3 · Char_loss1 + (1 - λ3) · CE_loss

where Total_loss denotes the total recognition loss of the initial speech recognition model, Char_loss1 denotes the first recognition loss of the content recognition network, CE_loss denotes the recognition loss of the language classifier, and λ3 denotes a weight adjustment coefficient that can be set according to actual needs, for example λ3 = 0.9 or 0.95.
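A minimal sketch of this total loss follows, with the language classifier's recognition loss computed as the cross entropy suggested in S162 (λ3 = 0.9 is one of the example values above):

    import torch.nn.functional as F

    def total_recognition_loss(char_loss1, lang_logits, lang_labels, lam3=0.9):
        ce_loss = F.cross_entropy(lang_logits, lang_labels)  # language classifier loss
        # Total_loss = lam3 * Char_loss1 + (1 - lam3) * CE_loss
        return lam3 * char_loss1 + (1.0 - lam3) * ce_loss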
It can be understood that the first recognition loss of the content recognition network and the recognition loss of the language classifier are determined separately, based on the recognized texts, recognized languages, labeled texts, and language labels of the speech data in the mixed data set, and the total recognition loss of the initial speech recognition model is then determined from these two losses. The obtained total loss therefore accurately reflects the model's multi-task learning effect, facilitates knowledge sharing between the different learning tasks so that they promote each other, and improves the cross-language robustness of the final speech recognition model, giving it a better recognition effect on speech data of multiple languages.
The embodiment of the present application shows a specific implementation manner of the above S106. Of course, it should be understood that S106 may also be implemented in other manners, and this is not limited in this embodiment of the application.
S108, performing iterative training on the initial speech recognition model based on its total recognition loss to obtain the speech recognition model.
Specifically, the model parameters of the initial speech recognition model may be adjusted based on its total recognition loss. These parameters include the model parameters of the content recognition network and those of the language classifier. For example, the model parameters of the content recognition network shown in fig. 2 include the model parameters of the encoder and of the decoder, where the encoder's parameters include, but are not limited to, the number of nodes in each of its network layers, the connection relationships and connection-edge weights between nodes of different layers, and the biases of the nodes in each layer; the decoder's parameters likewise include, but are not limited to, the number of nodes in each of its network layers, the connection relationships and connection-edge weights between nodes of different layers, and the biases of the nodes in each layer. The model parameters of the language classifier include, but are not limited to, the number of nodes in each of its network layers, the connection relationships and connection-edge weights between nodes of different layers, and the biases of the nodes in each layer.
Specifically, based on the total recognition loss of the initial speech recognition model and its current model parameters, a back-propagation algorithm is used to determine the recognition loss contributed by each network layer of the model; the model parameters are then adjusted layer by layer with the goal of reducing the total recognition loss.
For example, taking the initial speech recognition model shown in fig. 2, the above S108 may be implemented as follows: based on the total recognition loss of the initial speech recognition model and the current model parameters of the content recognition network, use a back-propagation algorithm to determine the recognition loss contributed by each network layer of the encoder and decoder of the content recognition network and by each network layer of the language classifier; then adjust the model parameters of the content recognition network and the language classifier layer by layer with the goal of reducing the recognition loss of the initial speech recognition model.
It should be noted that the above describes only a single adjustment; in practice multiple adjustments are usually needed, so the adjustment process may be repeated until a first preset training stop condition is met, yielding the final speech recognition model. The first preset training stop condition may include the total recognition loss of the initial speech recognition model falling below a first preset loss threshold, the number of adjustments reaching a first preset number, and the like, which is not limited in the embodiments of the present application.
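Putting the pieces together, a hedged sketch of the iterative training of S108 follows, reusing the hypothetical model and loss helpers sketched earlier; the optimizer choice, learning rate, and stop-condition values are assumptions of the sketch.

    import torch

    def train(model, batches, loss_threshold=0.1, max_steps=100_000, lr=1e-4):
        # `batches` is assumed to yield (feats, text_emb, targets, input_lengths,
        # target_lengths, lang_labels) tuples drawn from the mixed data set.
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for step, batch in enumerate(batches, start=1):
            feats, text_emb, targets, in_lens, tgt_lens, lang_labels = batch
            ctc_logits, att_logits, lang_logits = model(feats, text_emb)
            char_loss1 = content_recognition_loss(ctc_logits, att_logits,
                                                  targets, in_lens, tgt_lens)
            loss = total_recognition_loss(char_loss1, lang_logits, lang_labels)
            optimizer.zero_grad()
            loss.backward()   # back-propagation distributes the loss layer by layer
            optimizer.step()  # adjust parameters to reduce the total recognition loss
            # First preset training stop condition: loss below a threshold,
            # or a preset number of adjustments reached.
            if loss.item() < loss_threshold or step >= max_steps:
                break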
The embodiment of the present application shows a specific implementation manner of the above S108. Of course, it should be understood that S108 may also be implemented in other manners, and this is not limited in this embodiment of the application.
In the embodiment of the present application, in order to give the content recognition network the ability to recognize Mandarin speech data before it learns from the mixed data set, thereby accelerating the convergence of the initial speech recognition model, letting the model focus on the speech-to-text content recognition task, and enabling it to quickly learn the content-related differences between Mandarin and dialect speech data, thus improving the multilingual recognition effect of the final speech recognition model, the content recognition network in the initial speech recognition model may be pre-trained with second-sample Mandarin speech data and its labeled text before the initial model is trained with the mixed data set and the labeled texts and language labels of its speech data.
In an optional implementation, the difference between the recognized text that the content recognition network produces for input speech data and the labeled text of that speech data reflects the recognition effect of the content recognition network. To improve the speech recognition effect of the content recognition network, before S104, the training method of the speech recognition model provided in the embodiment of the present application may further include a pre-training method for the content recognition network, as shown in fig. 3, which may include:
S302, inputting the second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data into the initial content recognition network to obtain the recognized text of the second sample Mandarin speech data.
There may be a plurality of pieces of second sample Mandarin speech data. The second sample Mandarin speech data may include acoustic features of the second sample Mandarin speech, such as fbank features. These acoustic features may be obtained by various feature extraction methods in the art; for example, the fbank features of the second sample Mandarin speech may be extracted with the kaldi toolkit or the torchaudio toolkit.
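As one illustration of such feature extraction, the snippet below uses the torchaudio toolkit's kaldi-compatible fbank routine; the file name and the 80-bin setting are assumptions made for the example, not values fixed by the patent.

    import torchaudio
    import torchaudio.compliance.kaldi as kaldi

    # Load one piece of second sample Mandarin speech (path is illustrative).
    waveform, sample_rate = torchaudio.load("second_sample_mandarin.wav")

    # 80-dimensional fbank features, a common choice for ASR encoders.
    fbank = kaldi.fbank(
        waveform,                      # (channels, samples) float tensor
        num_mel_bins=80,
        sample_frequency=sample_rate,
    )
    print(fbank.shape)                 # (num_frames, 80)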
Specifically, as shown in fig. 4, the encoder in the initial content recognition network may encode the second sample Mandarin speech data based on its labeled text to obtain a feature vector of the second sample Mandarin speech data, and may perform speech recognition based on the connection timing classification (CTC) mechanism and that feature vector to obtain a third recognized text; the decoder in the initial content recognition network may perform speech recognition based on the attention mechanism and the feature vector produced by the encoder to obtain a fourth recognized text. That is, the recognized text of the second sample Mandarin speech data includes the third recognized text and the fourth recognized text, which the encoder and the decoder obtain by recognizing the same speech data from different angles.
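The following is a minimal sketch of such a dual-branch content recognition network. All layer types and dimensions are illustrative assumptions, and a single self-attention pass over the encoder output stands in for a full autoregressive attention decoder for brevity.

    import torch
    import torch.nn as nn

    class ContentRecognitionNetwork(nn.Module):
        def __init__(self, feat_dim=80, hidden=256, vocab=5000):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                                   batch_first=True)
            self.ctc_head = nn.Linear(hidden, vocab)   # CTC branch (encoder)
            self.attn = nn.MultiheadAttention(hidden, num_heads=4,
                                              batch_first=True)
            self.att_head = nn.Linear(hidden, vocab)   # attention branch

        def forward(self, feats):
            enc_out, _ = self.encoder(feats)       # feature vectors
            ctc_logits = self.ctc_head(enc_out)    # -> third recognized text
            dec_out, _ = self.attn(enc_out, enc_out, enc_out)
            att_logits = self.att_head(dec_out)    # -> fourth recognized text
            return ctc_logits, att_logits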
S304, determining a second recognition loss of the initial content recognition network based on the recognized text and the labeled text of the second sample Mandarin speech data.
The second recognition loss of the content recognition network represents the recognition loss incurred when the content recognition network performs speech recognition on the second sample Mandarin speech data.
Specifically, as shown in fig. 4, since the encoder and the decoder in the content recognition network perform speech recognition on the second sample Mandarin speech data from two different angles, in the above S304 a third recognizer loss may be determined based on the third recognized text and the labeled text of the second sample Mandarin speech data, where the third recognizer loss represents the recognition loss incurred by the encoder performing speech recognition on the second sample Mandarin speech data based on the connection timing classification mechanism, i.e., it reflects the difference between the labeled text and the recognized text obtained by the content recognition network through the connection timing classification mechanism; a fourth recognizer loss may be determined based on the fourth recognized text and the labeled text of the second sample Mandarin speech data, where the fourth recognizer loss represents the recognition loss incurred by the decoder performing speech recognition on the second sample Mandarin speech data based on the attention mechanism, i.e., it reflects the difference between the labeled text and the recognized text obtained by the content recognition network through the attention mechanism; then, the second recognition loss of the content recognition network is determined based on the third recognizer loss and the fourth recognizer loss.
For example, the third recognizer loss may be determined based on the third recognized text, the labeled text of the second sample Mandarin speech data, and the CTC loss function commonly used in the art; the fourth recognizer loss may be determined based on the fourth recognized text, the labeled text of the second sample Mandarin speech data, and a cross-entropy loss function. Further, the third recognizer loss and the fourth recognizer loss may be weighted and summed to obtain the second recognition loss of the content recognition network, i.e., Char_loss2 = λ1′·CTC_loss2 + λ2′·attention_loss2, where Char_loss2 denotes the second recognition loss of the content recognition network, CTC_loss2 denotes the third recognizer loss and λ1′ its weight, and attention_loss2 denotes the fourth recognizer loss and λ2′ its weight. λ1′ and λ2′ can be set according to actual needs, e.g., λ1′ = λ2′ = 0.5.
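A sketch of this weighted combination is given below. It assumes the CTC logits are frame-aligned and the attention logits are aligned with the target length, and it uses PyTorch's standard ctc_loss and cross_entropy functions; the default weights mirror the example λ1′ = λ2′ = 0.5.

    import torch.nn.functional as F

    def second_recognition_loss(ctc_logits, att_logits, targets,
                                input_lengths, target_lengths,
                                lam1=0.5, lam2=0.5):
        # Char_loss2 = lam1' * CTC_loss2 + lam2' * attention_loss2
        log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)  # (T, B, V)
        ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                         blank=0)
        att = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                              targets.reshape(-1))
        return lam1 * ctc + lam2 * att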
It can be understood that the recognition loss incurred by the content recognition network through the connection timing classification mechanism and the recognition loss incurred through the attention mechanism are determined from the third recognized text, the fourth recognized text, and the labeled text of the second sample Mandarin speech data, respectively, and the second recognition loss of the content recognition network is then determined from these two losses. The resulting second recognition loss can therefore accurately reflect how well the content recognition network recognizes the Mandarin speech data from different angles, and the network can effectively fuse the knowledge learned from those angles, which helps improve its recognition effect on Mandarin speech data.
S306, performing iterative training on the initial content recognition network based on the second recognition loss of the initial content recognition network to obtain the content recognition network in the initial speech recognition model.
Specifically, the model parameters of the initial content recognition network may be adjusted based on its second recognition loss. These model parameters may include, but are not limited to: the number of nodes in each network layer, the connection relationships and connection edge weights between nodes in different network layers, the biases corresponding to the nodes in each network layer, and the like. For example, for the initial content recognition network shown in fig. 4, the model parameters include the model parameters of the encoder and those of the decoder, each covering the same kinds of quantities for their respective network layers.
Specifically, based on the second recognition loss of the initial content recognition network and its current model parameters, a back propagation algorithm is used to determine the recognition loss attributable to each network layer in the initial content recognition network; the model parameters are then adjusted layer by layer with the goal of reducing the second recognition loss of the initial content recognition network.
For example, taking the initial content recognition network shown in fig. 4 as an example, the above S306 may be implemented as follows: based on the second recognition loss of the initial content recognition network and its current model parameters, a back propagation algorithm is used to determine the recognition loss attributable to each network layer in the decoder and in the encoder; then, with the goal of reducing the second recognition loss of the initial content recognition network, the respective model parameters of the decoder and the encoder are adjusted layer by layer.
It should be noted that the above is only a single adjustment; in practice, multiple adjustments are usually required, so steps S302 to S306 may be repeated until a second preset training stop condition is met, thereby obtaining the content recognition network in the initial speech recognition model. The second preset training stop condition may include the second recognition loss of the initial content recognition network falling below a second preset loss threshold, the number of adjustments reaching a second preset number, and the like, which is not limited in the embodiment of the present application.
The embodiment of the present application shows a specific implementation manner of pre-training a content recognition network in an initial speech recognition model. Of course, it should be understood that the pre-training of the content recognition network in the initial speech recognition model may also be implemented in other manners, which is not limited in the embodiment of the present application.
According to the training method of the speech recognition model provided by the embodiment of the present application, one speech recognition model is trained with a mixed data set containing the first sample Mandarin speech data and the sample dialect speech data, instead of training a separate speech recognition model for each language, so that the speech recognition model can recognize speech data of multiple languages; this avoids the problem that a speech recognition model with a good recognition effect cannot be effectively selected when one model is trained per language. On this basis, a multi-task learning framework comprising a content recognition network and a language classifier is adopted, and the mixed data set is input into the initial speech recognition model: the content recognition network performs the speech content recognition task and learns the content-related features of speech data of different languages from the mixed data set, acquiring the ability to recognize the text corresponding to the speech data; the language classifier performs the language recognition task and learns the language-related features of speech data of different languages from the mixed data set, acquiring the ability to recognize the language. Further, the total recognition loss of the initial speech recognition model is determined based on the recognition result output by the initial model for the mixed data set and the labeled texts and language labels of the speech data in the mixed data set, and the initial model is iteratively trained based on this total recognition loss to obtain the speech recognition model. In this way, the speech content recognition task performed by the content recognition network and the language recognition task performed by the language classifier are closely linked and share information, so the two learning tasks promote each other and improve the cross-language robustness of the speech recognition model, i.e., the model achieves a good recognition effect on speech data of multiple languages. In addition, the content recognition network is pre-trained with the second sample Mandarin speech data and its labeled text, so that it can already recognize Mandarin speech before learning the mixed data set; this accelerates the convergence of the speech recognition model, lets the model focus on the speech-to-text content recognition task, helps it quickly learn the differences in content-related features between Mandarin and dialect speech data, and improves the multilingual recognition effect of the speech recognition model.
The above embodiments describe a method for training a speech recognition model. A speech recognition model trained by this method can be used for speech recognition in various application scenarios, for example and without limitation: speech translation, voice memos, customer-service speech quality inspection, speech content review, audio/video subtitling, and the like.
Based on the training method of the speech recognition model provided by the embodiment of the application, the trained speech recognition model can be applied to any scene needing speech recognition. The application process based on the speech recognition model is explained in detail below.
Referring to fig. 5, a flow chart of a speech recognition method according to an embodiment of the present application is schematically shown, where the method includes the following steps:
S502, performing feature extraction on the speech to be processed to obtain the speech data of the speech to be processed.
In the embodiment of the present application, the speech to be processed refers to speech on which speech recognition needs to be performed. It may be Mandarin speech, dialect speech, or a mixture of the two.
The speech data of the speech to be processed may comprise acoustic features of the speech to be processed, such as fbank features. These acoustic features can be obtained by various feature extraction methods in the art; for example, the fbank features of the speech to be processed can be extracted with the kaldi toolkit or the torchaudio toolkit.
S504, performing speech recognition on the speech data of the speech to be processed through the content recognition network of the speech recognition model to obtain the recognized text of the speech to be processed.
Specifically, the speech data of the speech to be processed may be input into the content recognition network of the speech recognition model. The encoder in the content recognition network encodes the speech data to obtain the corresponding feature vector and performs speech recognition based on the connection timing classification mechanism and that feature vector, obtaining a first recognized text of the speech to be processed; the decoder in the content recognition network performs speech recognition based on the attention mechanism and the feature vector, obtaining a second recognized text of the speech to be processed.
Further, the first recognized text and the second recognized text of the speech to be processed may be combined to determine its recognized text. For example, if the first recognized text is consistent with the second recognized text, either may be determined as the recognized text of the speech to be processed; if they are not consistent, the recognized text may be determined based on, for example, the overlap between the first recognized text and the second recognized text.
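Putting S502 and S504 together, the following sketch shows one possible end-to-end use. Here model is assumed to be the trained content recognition network, and greedy_decode and combine_hypotheses are hypothetical helpers, since the patent leaves both the decoding procedure and the combination rule open.

    import torchaudio
    import torchaudio.compliance.kaldi as kaldi

    waveform, sr = torchaudio.load("to_be_processed.wav")  # illustrative path
    feats = kaldi.fbank(waveform, num_mel_bins=80, sample_frequency=sr)

    ctc_logits, att_logits = model(feats.unsqueeze(0))  # add batch dimension
    first_text = greedy_decode(ctc_logits)    # CTC-branch hypothesis
    second_text = greedy_decode(att_logits)   # attention-branch hypothesis

    if first_text == second_text:
        recognized_text = first_text          # the two branches agree
    else:
        # any overlap- or rescoring-based rule may be substituted here
        recognized_text = combine_hypotheses(first_text, second_text)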
According to the speech recognition method provided by the embodiment of the present application, the speech data of the speech to be processed is input into the pre-trained speech recognition model to obtain the recognized text of the speech to be processed, which is simple, fast, and efficient. Moreover, the speech recognition model is trained with the multi-task learning approach described above: one model is trained with a mixed data set containing the first sample Mandarin speech data and the sample dialect speech data instead of one model per language, so the model can recognize speech data of multiple languages and the problem of effectively selecting among per-language models is avoided; the content recognition network and the language classifier learn the content-related and language-related features of different languages from the mixed data set, respectively; the total recognition loss determined from the recognition results and from the labeled texts and language labels drives the iterative training, so the content recognition task and the language recognition task share information and promote each other, improving the cross-language robustness of the model; and the content recognition network is pre-trained with the second sample Mandarin speech data and its labeled text, which accelerates convergence, focuses the model on the speech-to-text content recognition task, helps it quickly learn the differences in content-related features between Mandarin and dialect speech data, and improves its multilingual recognition effect. Performing speech recognition on the speech to be processed with a model trained in this way therefore improves recognition accuracy.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
In addition, in correspondence with the above-described method for training a speech recognition model shown in fig. 1, an embodiment of the present application also provides a device for training a speech recognition model. Referring to fig. 6, a schematic structural diagram of an apparatus 600 for training a speech recognition model according to an embodiment of the present application is provided, the apparatus including:
a first obtaining module 610, configured to obtain a mixed data set and the labeled texts and language labels of the voice data in the mixed data set, where the mixed data set includes first sample Mandarin speech data and sample dialect voice data;
a first recognition module 620, configured to input the mixed data set and the labeled texts of the voice data in the mixed data set into an initial speech recognition model to obtain a recognition result of the voice data in the mixed data set, where the recognition result includes a recognized text and a recognized language; the initial speech recognition model includes a content recognition network and a language classifier, the content recognition network is configured to encode the voice data in the mixed data set to obtain corresponding feature vectors and perform speech recognition based on the feature vectors to obtain the recognized text, the language classifier is configured to perform language recognition based on the feature vectors to obtain the recognized language, and the content recognition network is obtained by pre-training with second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data;
a first loss determining module 630, configured to determine a total recognition loss of the initial speech recognition model based on a recognition result of the speech data in the mixed data set and a labeled text and a language label of the speech data in the mixed data set;
a first training module 640, configured to perform iterative training on the initial speech recognition model based on the total recognition loss to obtain the speech recognition model.
The training device of the speech recognition model provided by the embodiment of the present application achieves the same effects as the training method described above: a single speech recognition model trained with the mixed data set of first sample Mandarin speech data and sample dialect speech data replaces one model per language, so the model can recognize speech data of multiple languages and the problem of effectively selecting among per-language models is avoided; under the multi-task learning framework, the content recognition network and the language classifier learn content-related and language-related features from the mixed data set, respectively, and the iterative training on the total recognition loss links the two tasks closely so that they share information and promote each other, improving the cross-language robustness of the model; and pre-training the content recognition network with the second sample Mandarin speech data and its labeled text accelerates convergence, focuses the model on the speech-to-text content recognition task, helps it quickly learn the differences in content-related features between Mandarin and dialect speech data, and improves the multilingual recognition effect of the speech recognition model.
Optionally, the first loss determination module includes:
a first loss determination sub-module for determining a first recognition loss of the content recognition network based on the recognition text and an annotation text of the voice data;
a second loss determination submodule, configured to determine a recognition loss of the language classifier based on the recognition language and a language tag of the speech data;
and a total loss determination sub-module, configured to normalize the first recognition loss of the content recognition network and the recognition loss of the language classifier to obtain the total recognition loss of the initial speech recognition model.
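One plausible reading of this normalization, offered only as a sketch, is a weighted sum whose weights sum to one; the patent does not fix the rule or the weight value here, so both are assumptions.

    def normalized_total_loss(char_loss, lang_loss, weight=0.7):
        # Combine the content recognition loss and the language
        # classification loss; `weight` is an illustrative assumption.
        return weight * char_loss + (1.0 - weight) * lang_loss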
Optionally, the content recognition network comprises:
an encoder, configured to encode the voice data based on the labeled text of the voice data to obtain a feature vector of the voice data, and to perform speech recognition on the voice data based on a connection timing classification mechanism and the feature vector to obtain a first recognized text of the voice data;
a decoder, configured to perform speech recognition on the voice data based on an attention mechanism and the feature vector of the voice data to obtain a second recognized text of the voice data.
Optionally, the first loss determination submodule is configured to:
determining a first recognizer loss based on a first recognition text of the voice data and a label text of the voice data, wherein the first recognizer loss is used for representing the recognition loss caused by the voice recognition of the mixed data set by the encoder based on a connection time sequence classification mechanism;
determining a second recognizer loss based on a second recognition text of the voice data and an annotation text of the voice data, wherein the second recognizer loss is used for representing a recognition loss caused by the decoder performing voice recognition on the mixed data set based on an attention mechanism;
determining a first recognition loss of the content recognition network based on the first recognizer loss and the second recognizer loss.
Optionally, when determining a second recognizer loss based on the second recognized text of the voice data and the labeled text of the voice data, the first loss determination sub-module is configured to perform:
performing smoothing processing on the labeled text of the voice data based on the number of languages of the voice data and the labeling form of the labeled text; and
determining the second recognizer loss based on the second recognized text of the voice data and the smoothed labeled text.
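As an illustration of such smoothing, the sketch below applies standard uniform label smoothing. The patent ties the smoothing to the number of languages and the labeling form without fixing a formula, so the uniform rule and the epsilon value are assumptions.

    import torch.nn.functional as F

    def smooth_labels(targets, vocab_size, epsilon=0.1):
        # Keep 1 - epsilon on the labeled token and spread epsilon
        # uniformly over the vocabulary.
        one_hot = F.one_hot(targets, num_classes=vocab_size).float()
        return one_hot * (1.0 - epsilon) + epsilon / vocab_size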
Optionally, the training apparatus further comprises a second training module, the second training module is configured to:
before the first recognition module inputs the mixed data set and the labeled texts of the voice data in the mixed data set into the initial speech recognition model, inputting the second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data into an initial content recognition network to obtain a third recognized text and a fourth recognized text, where the third recognized text is obtained by the encoder in the initial content recognition network performing speech recognition based on a connection timing classification mechanism and a feature vector of the second sample Mandarin speech data, and the fourth recognized text is obtained by the decoder in the initial content recognition network performing speech recognition based on an attention mechanism and the feature vector of the second sample Mandarin speech data;
determining a third recognizer loss based on the third recognized text and the labeled text of the second sample Mandarin speech data, where the third recognizer loss represents the recognition loss incurred by the encoder in the initial content recognition network performing speech recognition on the second sample Mandarin speech data based on the connection timing classification mechanism;
determining a fourth recognizer loss based on the fourth recognized text and the labeled text of the second sample Mandarin speech data, where the fourth recognizer loss represents the recognition loss incurred by the decoder in the initial content recognition network performing speech recognition on the second sample Mandarin speech data based on the attention mechanism;
determining a second recognition loss of the initial content recognition network based on the third recognizer loss and the fourth recognizer loss;
and performing iterative training on the initial content recognition network based on the second recognition loss of the initial content recognition network to obtain the content recognition network.
Optionally, the encoder comprises a Conformer encoder and/or a Transformer encoder.
Optionally, the decoder comprises a Transformer decoder and/or a long short-term memory (LSTM) network.
Obviously, the training device of the speech recognition model provided in the embodiment of the present application can serve as the execution subject of the training method of the speech recognition model shown in fig. 1, and can therefore realize the functions provided by that method. Since the principle is the same, it is not described again here.
In addition, corresponding to the voice recognition method shown in fig. 5, the embodiment of the present application further provides a voice recognition apparatus. Referring to fig. 7, a schematic structural diagram of a speech recognition apparatus 700 according to an embodiment of the present application is provided, the apparatus including:
the feature extraction module 710 is configured to perform feature extraction on a voice to be processed to obtain voice data of the voice to be processed;
a second recognition module 720, configured to perform voice recognition on the voice data of the voice to be processed through a content recognition network of a voice recognition model to obtain a recognition text of the voice to be processed;
wherein the voice recognition model is obtained by model training based on the labeled text and language label of the voice data in a mixed data set and the recognition result output by the voice recognition model for the mixed data set; the mixed data set includes first sample Mandarin speech data and sample dialect voice data; the voice recognition model includes a content recognition network and a language classifier; the recognition result includes a recognized text and a recognized language, the recognized text being obtained by the content recognition network performing voice recognition on the voice data in the mixed data set, and the recognized language being obtained by the language classifier performing language recognition on the voice data in the mixed data set; and the content recognition network is obtained by pre-training with second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data.
The speech recognition device provided by the embodiment of the present application obtains the recognized text of the speech to be processed by inputting its speech data into the pre-trained speech recognition model, which is simple, fast, and efficient. As with the speech recognition method above, the model is trained with the multi-task learning approach: one model trained on the mixed data set of first sample Mandarin speech data and sample dialect speech data replaces one model per language; the content recognition network and the language classifier learn content-related and language-related features from the mixed data set, share information through the iterative training on the total recognition loss, and promote each other, improving the cross-language robustness of the model; and the content recognition network is pre-trained with the second sample Mandarin speech data and its labeled text, which accelerates convergence, focuses the model on the speech-to-text content recognition task, helps it quickly learn the differences in content-related features between Mandarin and dialect speech data, and improves its multilingual recognition effect. Performing speech recognition on the speech to be processed with this model therefore improves recognition accuracy.
Obviously, the speech recognition device provided in the embodiment of the present application can serve as the execution subject of the speech recognition method shown in fig. 5, and can therefore realize the functions provided by that method. Since the principle is the same, it is not described again here.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 8, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the training device of the speech recognition model on the logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
acquiring a mixed data set and a labeled text and a language label of voice data in the mixed data set, wherein the mixed data set comprises first sample Mandarin speech data and sample dialect voice data;
inputting the mixed data set and a labeled text and a language label of the voice data in the mixed data set into an initial voice recognition model to obtain a recognition result of the voice data in the mixed data set, wherein the recognition result comprises a recognition text and a recognition language;
determining the total recognition loss of the initial voice recognition model based on the recognition result of the voice data in the mixed data set and the labeled text and the language label of the voice data in the mixed data set;
performing iterative training on the initial voice recognition model based on the total recognition loss to obtain the voice recognition model;
the initial voice recognition model comprises a content recognition network and a language classifier, wherein the content recognition network is used for encoding voice data in the mixed data set to obtain corresponding feature vectors and performing voice recognition based on the feature vectors to obtain the recognized text; the language classifier is used for performing language recognition based on the feature vectors to obtain the recognized language; and the content recognition network is obtained by pre-training with second sample Mandarin speech data and a labeled text of the second sample Mandarin speech data.
Alternatively, the processor reads a corresponding computer program from the non-volatile memory into the memory and then runs the computer program, thereby forming the voice recognition device on a logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
carrying out feature extraction on the voice to be processed to obtain voice data of the voice to be processed;
performing voice recognition on the voice data of the voice to be processed through a content recognition network of a voice recognition model to obtain a recognition text of the voice to be processed;
wherein the voice recognition model is obtained by model training based on the labeled text and language label of the voice data in a mixed data set and the recognition result output by the voice recognition model for the mixed data set; the mixed data set includes first sample Mandarin speech data and sample dialect voice data; the voice recognition model includes a content recognition network and a language classifier; the recognition result includes a recognized text and a recognized language, the recognized text being obtained by the content recognition network performing voice recognition on the voice data in the mixed data set, and the recognized language being obtained by the language classifier performing language recognition on the voice data in the mixed data set; and the content recognition network is obtained by pre-training with second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data.
The method performed by the training device of the speech recognition model disclosed in the embodiment of fig. 1 of the present application, or the method performed by the speech recognition device disclosed in the embodiment of fig. 5, can be applied to or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
The electronic device may also execute the method shown in fig. 1, and implement the function of the training apparatus for speech recognition models in the embodiment shown in fig. 1 or the function of the speech recognition apparatus in the embodiment shown in fig. 5, which is not described herein again in this embodiment of the present application.
Of course, besides the software implementation, the electronic device of the present application does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
Embodiments of the present application also provide a computer-readable storage medium storing one or more programs, where the one or more programs include instructions, which when executed by a portable electronic device including a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 1, and are specifically configured to:
acquiring a mixed data set and a labeled text and a language label of voice data in the mixed data set, wherein the mixed data set comprises first sample Mandarin speech data and sample dialect voice data;
inputting the mixed data set and the labeled text and language labels of the voice data in the mixed data set into an initial voice recognition model to obtain a recognition result of the voice data in the mixed data set, wherein the recognition result comprises a recognition text and a recognition language;
determining the total recognition loss of the initial voice recognition model based on the recognition result of the voice data in the mixed data set and the labeled text and the language label of the voice data in the mixed data set;
performing iterative training on the initial voice recognition model based on the total recognition loss to obtain the voice recognition model;
the initial voice recognition model comprises a content recognition network and a language classifier, wherein the content recognition network is used for encoding voice data in the mixed data set to obtain corresponding feature vectors and performing voice recognition based on the feature vectors to obtain the recognized text; the language classifier is used for performing language recognition based on the feature vectors to obtain the recognized language; and the content recognition network is obtained by pre-training with second sample Mandarin speech data and a labeled text of the second sample Mandarin speech data.
Alternatively, the instructions, when executed by a portable electronic device comprising a plurality of application programs, can cause the portable electronic device to perform the method of the embodiment shown in fig. 5, and in particular to perform the following operations:
carrying out feature extraction on the voice to be processed to obtain voice data of the voice to be processed;
performing voice recognition on the voice data of the voice to be processed through a content recognition network of a voice recognition model to obtain a recognition text of the voice to be processed;
wherein the voice recognition model is obtained by model training based on the labeled text and language label of the voice data in a mixed data set and the recognition result output by the voice recognition model for the mixed data set; the mixed data set includes first sample Mandarin speech data and sample dialect voice data; the voice recognition model includes a content recognition network and a language classifier; the recognition result includes a recognized text and a recognized language, the recognized text being obtained by the content recognition network performing voice recognition on the voice data in the mixed data set, and the recognized language being obtained by the language classifier performing language recognition on the voice data in the mixed data set; and the content recognition network is obtained by pre-training with second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data.
In short, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (10)

1. A method for training a speech recognition model, comprising:
Acquiring a mixed data set and a labeled text and a language label of voice data in the mixed data set, wherein the mixed data set comprises first sample Mandarin speech data and sample dialect voice data;
inputting the mixed data set and a labeled text and a language label of the voice data in the mixed data set into an initial voice recognition model to obtain a recognition result of the voice data in the mixed data set, wherein the recognition result comprises a recognition text and a recognition language;
determining the total recognition loss of the initial voice recognition model based on the recognition result of the voice data in the mixed data set and the labeled text and the language label of the voice data in the mixed data set;
performing iterative training on the initial voice recognition model based on the total recognition loss to obtain the voice recognition model;
the initial voice recognition model comprises a content recognition network and a language classifier, wherein the content recognition network is used for encoding voice data in the mixed data set to obtain corresponding feature vectors and performing voice recognition based on the feature vectors to obtain the recognized text; the language classifier is used for performing language recognition based on the feature vectors to obtain the recognized language, and the content recognition network is obtained by pre-training with second sample Mandarin speech data and a labeled text of the second sample Mandarin speech data.
2. The method of claim 1, wherein determining the total recognition loss of the initial speech recognition model based on the recognition results of the speech data in the mixed data set and the labeled text and language tags of the speech data in the mixed data set comprises:
determining a first recognition loss of the content recognition network based on the recognition text and an annotation text of the speech data;
determining the recognition loss of the language classifier based on the recognition language and the language label of the voice data;
and normalizing the first recognition loss of the content recognition network and the recognition loss of the language classifier to obtain the total recognition loss of the initial speech recognition model.
3. The method of claim 2, wherein the content recognition network comprises:
an encoder, configured to encode the voice data based on the labeled text of the voice data to obtain the feature vector, and to perform voice recognition on the voice data based on a connection timing classification mechanism and the feature vector to obtain a first recognized text of the voice data;
a decoder, configured to perform voice recognition on the voice data based on an attention mechanism and the feature vector to obtain a second recognized text of the voice data.
4. The method of claim 3, wherein determining a first recognition loss of the content recognition network based on the recognition text and an annotation text of the speech data comprises:
determining a first recognizer loss based on a first recognition text of the voice data and a label text of the voice data, wherein the first recognizer loss is used for representing the recognition loss caused by the voice recognition of the mixed data set by the encoder based on a connection time sequence classification mechanism;
determining a second recognizer loss based on a second recognition text of the voice data and a label text of the voice data, wherein the second recognizer loss is used for representing the recognition loss caused by the decoder performing voice recognition on the mixed data set based on an attention mechanism;
determining a first recognition loss of the content recognition network based on the first recognizer loss and the second recognizer loss.
5. The method of claim 4, wherein determining a second recognizer loss based on the second recognized text of the speech data and the annotated text of the speech data comprises:
based on the language quantity of the voice data and the labeling form of the labeling text of the voice data, performing smoothing processing on the labeling text of the voice data;
and determining the second recognizer loss based on the second recognition text of the voice data and the smoothed annotation text.
6. The method of claim 3, wherein before inputting the mixed data set and the labeled text of the voice data into an initial voice recognition model, the method further comprises:
inputting the second sample Mandarin speech data and the labeled text of the second sample Mandarin speech data into an initial content recognition network to obtain a third recognized text and a fourth recognized text, wherein the third recognized text is obtained by an encoder in the initial content recognition network performing voice recognition based on a connection timing classification mechanism and a feature vector of the second sample Mandarin speech data, and the fourth recognized text is obtained by a decoder in the initial content recognition network performing voice recognition based on an attention mechanism and the feature vector of the second sample Mandarin speech data;
determining a third recognizer loss based on the third recognized text and the labeled text of the second sample Mandarin speech data, wherein the third recognizer loss represents the recognition loss incurred by the encoder in the initial content recognition network performing voice recognition on the second sample Mandarin speech data based on the connection timing classification mechanism;
determining a fourth recognizer loss based on the fourth recognized text and the labeled text of the second sample Mandarin speech data, wherein the fourth recognizer loss represents the recognition loss incurred by the decoder in the initial content recognition network performing voice recognition on the second sample Mandarin speech data based on the attention mechanism;
determining a second recognition loss of the initial content recognition network based on the third recognizer loss and the fourth recognizer loss;
and performing iterative training on the initial content recognition network based on the second recognition loss of the initial content recognition network to obtain the content recognition network.
7. A speech recognition method, comprising:
performing feature extraction on speech to be processed to obtain speech data of the speech to be processed;
performing speech recognition on the speech data of the speech to be processed through a content recognition network of a speech recognition model to obtain a recognition text of the speech to be processed;
wherein the speech recognition model is obtained by model training based on the annotation text and language label of the speech data in a mixed data set and the recognition result output by the speech recognition model for the mixed data set; the mixed data set includes first sample Mandarin speech data and sample dialect speech data; the speech recognition model includes the content recognition network and a language classifier; the recognition result includes a recognition text and a recognized language; the recognition text is obtained by the content recognition network performing speech recognition on the speech data in the mixed data set; the recognized language is obtained by the language classifier performing language identification on the speech data in the mixed data set; and the content recognition network is obtained by pre-training on second sample Mandarin speech data and the annotation text of the second sample Mandarin speech data.
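Illustrative note, not part of the claims: claim 7's two steps (feature extraction, then content-network recognition) could look as follows. The 80-dimensional log-Mel filterbank features, the torchaudio-based extraction, and the model.decode interface are assumptions; the patent does not fix the feature type or decoding API.

```python
import torch
import torchaudio.compliance.kaldi as kaldi
import torchaudio

def recognize(wav_path, model, token_table):
    """Feature extraction followed by content-network decoding (claim 7)."""
    waveform, sample_rate = torchaudio.load(wav_path)
    # Speech data of the speech to be processed: log-Mel filterbank features
    # are a common choice for such models (an illustrative assumption).
    feats = kaldi.fbank(waveform, num_mel_bins=80,
                        sample_frequency=sample_rate)
    with torch.no_grad():
        token_ids = model.decode(feats.unsqueeze(0))  # assumed model API
    return "".join(token_table[i] for i in token_ids)
```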
8. An apparatus for training a speech recognition model, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a mixed data set and a labeled text and a language label of voice data in the mixed data set, and the mixed data set comprises first sample common speech voice data and sample dialect voice data;
a first recognition module, configured to input an initial speech recognition model into the mixed data set and a labeled text of the speech data in the mixed data set to obtain a recognition result of the speech data in the mixed data set, where the recognition result includes a recognized text and a recognized language, and the initial speech recognition model includes a content recognition network and a language classifier, the content recognition network is configured to encode the speech data in the mixed data set to obtain a corresponding feature vector, and perform speech recognition based on the feature vector to obtain the recognized text; the language classifier is used for performing language identification based on the feature vector to obtain the identified language, and the content identification network is obtained by pre-training second sample common speech data and a labeled text of the second sample common speech data;
a first loss determining module, configured to determine a total recognition loss of the initial speech recognition model based on a recognition result of the speech data in the mixed data set and a labeled text and a language label of the speech data in the mixed data set;
and the first training module is used for carrying out iterative training on the initial voice recognition model based on the total recognition loss to obtain the voice recognition model.
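Illustrative note, not part of the claims: claim 8 restates the method claims as an apparatus. One plausible reading of its first loss determining module is a total recognition loss that sums the content recognition loss with a language-classification cross-entropy; the weight alpha and the reuse of content_recognition_loss from the claim-4 sketch are assumptions.

```python
import torch.nn.functional as F

def total_recognition_loss(ctc_log_probs, attn_logits, lang_logits,
                           targets, lang_labels, input_lengths,
                           target_lengths, alpha=0.5):
    """Total recognition loss: content recognition + language classification."""
    content_loss = content_recognition_loss(ctc_log_probs, attn_logits,
                                            targets, input_lengths,
                                            target_lengths)
    # Recognized language vs. the language label of the speech data.
    language_loss = F.cross_entropy(lang_logits, lang_labels)
    return content_loss + alpha * language_loss
```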
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 7.
CN202210385772.6A 2022-04-13 2022-04-13 Training method of voice recognition model, voice recognition method and device Pending CN114596845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210385772.6A CN114596845A (en) 2022-04-13 2022-04-13 Training method of voice recognition model, voice recognition method and device

Publications (1)

Publication Number Publication Date
CN114596845A true CN114596845A (en) 2022-06-07

Family

ID=81813220

Country Status (1)

Country Link
CN (1) CN114596845A (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1412741A (en) * 2002-12-13 2003-04-23 郑方 Chinese speech identification method with dialect background
WO2017054122A1 (en) * 2015-09-29 2017-04-06 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN106251859A (en) * 2016-07-22 2016-12-21 百度在线网络技术(北京)有限公司 Voice recognition processing method and apparatus
CN110895932A (en) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multi-language voice recognition method based on language type and voice content collaborative classification
CN111429889A (en) * 2019-01-08 2020-07-17 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN110033765A (en) * 2019-04-11 2019-07-19 中国联合网络通信集团有限公司 A kind of method and terminal of speech recognition
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
CN110930982A (en) * 2019-10-31 2020-03-27 国家计算机网络与信息安全管理中心 Multi-accent acoustic model and multi-accent voice recognition method
CN110827805A (en) * 2019-12-09 2020-02-21 苏州思必驰信息科技有限公司 Speech recognition model training method, speech recognition method and device
CN110930996A (en) * 2019-12-11 2020-03-27 广州市百果园信息技术有限公司 Model training method, voice recognition method, device, storage medium and equipment
CN111128137A (en) * 2019-12-30 2020-05-08 广州市百果园信息技术有限公司 Acoustic model training method and device, computer equipment and storage medium
CN111261144A (en) * 2019-12-31 2020-06-09 华为技术有限公司 Voice recognition method, device, terminal and storage medium
CN111833844A (en) * 2020-07-28 2020-10-27 苏州思必驰信息科技有限公司 Training method and system of mixed model for speech recognition and language classification
CN111862942A (en) * 2020-07-28 2020-10-30 苏州思必驰信息科技有限公司 Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN112634867A (en) * 2020-12-11 2021-04-09 平安科技(深圳)有限公司 Model training method, dialect recognition method, device, server and storage medium
CN112687263A (en) * 2021-03-11 2021-04-20 南京硅基智能科技有限公司 Voice recognition neural network model, training method thereof and voice recognition method
CN113192492A (en) * 2021-04-28 2021-07-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN113851111A (en) * 2021-09-13 2021-12-28 联想(北京)有限公司 Voice recognition method and voice recognition device
CN114171002A (en) * 2021-12-17 2022-03-11 科大讯飞股份有限公司 Voice recognition method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANMOL GULATI: "Label Smoothing - An Overview" ["标签平滑 - Label Smoothing概述"], pages 1 - 3, Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/1815786?from=15425&areaSource=102001.2&traceId=jslYVfLmeupMw-RsV8cGR> *
徐博文: "Design and Implementation of a Korean Speech Retrieval System" ["朝鲜语语音检索系统的设计与实现"], China Masters' Theses Full-text Database (Information Science and Technology) *
蓑衣客: "Label Smoothing Explained in Detail" ["标签平滑(Label Smoothing)详解"], pages 1 - 2, Retrieved from the Internet <URL:https://www.cnblogs.com/irvingluo/p/13873699.html> *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115206293A (en) * 2022-09-15 2022-10-18 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115206293B (en) * 2022-09-15 2022-11-29 四川大学 Multi-task air traffic control voice recognition method and device based on pre-training
CN115831094A (en) * 2022-11-08 2023-03-21 北京数美时代科技有限公司 Multilingual voice recognition method, system, storage medium and electronic equipment
CN115831094B (en) * 2022-11-08 2023-08-15 北京数美时代科技有限公司 Multilingual voice recognition method, multilingual voice recognition system, storage medium and electronic equipment
CN116030793A (en) * 2023-03-30 2023-04-28 北京建筑大学 Dialect recognition system and training method thereof
CN116758902A (en) * 2023-06-01 2023-09-15 镁佳(北京)科技有限公司 Audio and video recognition model training and recognition method under multi-person speaking scene

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
CN114596845A (en) Training method of voice recognition model, voice recognition method and device
CN109344406B (en) Part-of-speech tagging method and device and electronic equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114254660A (en) Multi-modal translation method and device, electronic equipment and computer-readable storage medium
CN113221555A (en) Keyword identification method, device and equipment based on multitask model
CN112016271A (en) Language style conversion model training method, text processing method and device
CN112668333A (en) Named entity recognition method and device, and computer-readable storage medium
CN112597301A (en) Voice intention recognition method and device
CN112417878A (en) Entity relationship extraction method, system, electronic equipment and storage medium
CN116150306A (en) Training method of question-answering robot, question-answering method and device
CN114627868A (en) Intention recognition method and device, model and electronic equipment
CN114429635A (en) Book management method
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN113743117B (en) Method and device for entity labeling
CN113887206A (en) Model training and keyword extraction method and device
CN112908315A (en) Question-answer intention judgment method based on voice characteristics and voice recognition
CN117496960A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN114254588B (en) Data tag processing method and device
CN117496984A (en) Interaction method, device and equipment of target object and readable storage medium
CN116757208A (en) Data processing method, device and equipment
CN116127316A (en) Model training method, text abstract generating method and related equipment
CN115134660A (en) Video editing method and device, computer equipment and storage medium
CN114926437A (en) Image quality evaluation method and device
CN114611513A (en) Sample generation method, model training method, entity identification method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination