CN115881101A - Training method and device of voice recognition model and processing equipment - Google Patents


Info

Publication number
CN115881101A
CN115881101A (application CN202211392542.9A)
Authority
CN
China
Prior art keywords
audio
model
feature
training
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211392542.9A
Other languages
Chinese (zh)
Inventor
李登实
高雨
朱晨倚
王前瑞
宋昊
薛童
陈澳雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jianghan University
Original Assignee
Jianghan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jianghan University filed Critical Jianghan University
Priority to CN202211392542.9A priority Critical patent/CN115881101A/en
Publication of CN115881101A publication Critical patent/CN115881101A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a training method, a training device and a processing device of a voice recognition model, which are used for training the voice recognition model with higher voice recognition precision, so that the interference of environmental noise on voice recognition can be greatly reduced in specific application, and more accurate voice recognition results can be obtained.

Description

Training method and device of voice recognition model and processing equipment
Technical Field
The present application relates to the field of speech recognition, and in particular, to a method and an apparatus for training a speech recognition model, and a processing device.
Background
In Artificial Intelligence (AI) applications, speech recognition is one of the major application scenarios. Speech recognition technology can be understood as converting the vocabulary content of human speech into computer-readable input, so that data entry can be completed directly from audio and video according to the sound. Speech recognition technology is relevant to fields as diverse as industry, household appliances, communication, automotive electronics, medical treatment, home services and consumer electronics, and has broad application prospects.
In the process of recording audio and video data, it is practically impossible for a user to be in an absolutely quiet place; in real conditions there is often some environmental noise around the user, so this environmental noise is recorded into the audio and video data together with the speech and interferes with the subsequent speech recognition processing.
In the research process of the existing related technologies, the inventor finds that the recognition accuracy of the existing voice recognition technology needs to be improved under the interference of environmental noise.
Disclosure of Invention
The application provides a training method, a device and a processing device of a voice recognition model, which are used for training the voice recognition model with higher voice recognition precision, so that the interference of environmental noise on voice recognition can be greatly reduced in specific application, and more accurate voice recognition results can be obtained.
In a first aspect, the present application provides a method for training a speech recognition model, the method including:
acquiring a sample set, wherein the sample set comprises audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
inputting audio data in the audio and video data LRW into an initial MoCov2 model for pre-training to obtain a pre-training MoCov2 model;
inputting video data L_v in the audio and video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder, and encoding to obtain a lip feature X_v, and inputting audio data L_a in the audio and video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder, and encoding to obtain an audio feature X_a;
inputting the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
inputting the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio and video data LRS2.
In a second aspect, the present application provides an apparatus for training a speech recognition model, the apparatus comprising:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a sample set, and the sample set comprises audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
the pre-training unit is used for inputting audio data in the audio and video data LRW into an initial MoCov2 model for pre-training to obtain a pre-training MoCov2 model;
a feature encoding unit, configured to input the video data L_v in the audio and video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder and encode it to obtain a lip feature X_v, and to input the audio data L_a in the audio and video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder and encode it to obtain an audio feature X_a;
a feature fusion unit, configured to input the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
a training unit, configured to input the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio and video data LRS2.
In a third aspect, the present application provides a processing device, which includes a processor and a memory, where the memory stores a computer program, and the processor executes the method provided by the first aspect of the present application or any one of the possible implementation manners of the first aspect of the present application when calling the computer program in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method provided in the first aspect of the present application or any one of the possible implementations of the first aspect of the present application.
From the above, the present application has the following advantageous effects:
For the training of the speech recognition model, the MoCov2 model and the Wav2vec2.0 model are added to the model training architecture, so that more robust and stable audio and video features can be obtained. A cross-modal attention mechanism and a temporal attention mechanism are then introduced, so that the audio and video feature information can be corrected and aligned and a fused representation can be obtained, after which the speech recognition model is trained to decode and output the speech recognition result. In the training process, the video features are combined with the text features so that the video features carry more textual information, which yields sample data of better quality and allows the speech recognition model to be trained more effectively. The trained speech recognition model therefore has higher speech recognition accuracy, so that in specific applications the interference of environmental noise on speech recognition can be greatly reduced and a more accurate speech recognition result can be obtained.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for training a speech recognition model according to the present application;
FIG. 2 is a schematic diagram of a model training architecture of the present application;
FIG. 3 is a schematic diagram of a structure of a training apparatus for speech recognition models according to the present application;
FIG. 4 is a schematic diagram of a processing apparatus according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The naming or numbering of the steps appearing in the present application does not mean that the steps in the method flow have to be executed in the chronological/logical order indicated by the naming or numbering, and the named or numbered process steps may be executed in a modified order depending on the technical purpose to be achieved, as long as the same or similar technical effects are achieved.
The division of the modules presented in this application is a logical division, and in practical applications, there may be another division, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed, and in addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some interfaces, and the indirect coupling or communication connection between the modules may be in an electrical or other similar form, which is not limited in this application. The modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the present disclosure.
Before describing the training method of the speech recognition model provided in the present application, the background related to the present application will be described first.
The training method and device for the voice recognition model and the computer readable storage medium can be applied to processing equipment and used for training the voice recognition model with higher voice recognition precision, so that the interference of environmental noise on voice recognition can be greatly reduced in specific application, and a more accurate voice recognition result can be obtained.
In the training method of the speech recognition model, the main execution body may be a training apparatus of the speech recognition model, or different types of processing devices such as a server, a physical host, or User Equipment (UE) that integrates the training apparatus of the speech recognition model. The training device of the speech recognition model may be implemented in hardware or software, the UE may specifically be a terminal device such as a smart phone, a tablet computer, a notebook computer, a desktop computer, or a Personal Digital Assistant (PDA), and the processing device may be set in a device cluster manner.
It can be understood that the specific form of the processing device can be configured according to actual needs, and the main function of the processing device is to serve as a carrier to implement data processing related to training of the speech recognition model of the application, and further, the trained speech recognition model can be loaded according to needs to perform functions of a speech recognition application.
In addition, the following abbreviations are referred to in the description below.
Connectionist Temporal Classification (CTC), a neural-network-based temporal sequence classification;
Region of Interest (ROI);
Bidirectional Encoder Representations from Transformers (BERT).
Next, a method for training a speech recognition model provided by the present application will be described.
First, referring to fig. 1, fig. 1 shows a schematic flow chart of a training method of a speech recognition model according to the present application, and the training method of the speech recognition model according to the present application may specifically include the following steps S101 to S105:
step S101, a sample set is obtained, wherein the sample set comprises audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
it will be appreciated that for training of the speech recognition model, the acquisition process of its sample set may be performed in an initial stage.
The acquisition processing of the sample set may be real-time acquisition of related data, or may be capture processing of thread data, and the acquisition path is not specifically limited in this application.
The sample set referred to in the present application specifically includes two types: one is the character-level audiovisual data LRW, and the other is the sentence-level audiovisual data LRS2. The audiovisual data LRW may specifically include the audio data referred to below, and the audiovisual data LRS2 may specifically include the video data L_v, the audio data L_a and the text data L_w referred to below, which may be extracted from the original video or configured directly with the original video.
For the character-level audio and video data LRW, each sample in the data set contains only a single character or word, and the audio-visual speech recognition correspondingly only needs to recognize that single character or word.
For the sentence-level audio and video data LRS2, each sample in the data set is a sentence, the whole sentence is recognized correspondingly, and the spaces between English words and the end-of-sentence symbol can be distinguished in the audio-visual speech recognition.
In a specific application, the audio and video data can be obtained by capturing from a related audio and video library, or such a library can be used directly.
The sample set corresponds to the training process of a subsequent speech recognition model and can be further divided into a training set, a verification set and a test set, the training set, the verification set and the test set correspond to three different model training stages, namely a training stage, a verification stage and a test stage, and the basic model training mechanism is not the key point of the application, so the description is not given.
As an example, the data in the sample set may be configured to: 98% as training data, 1% as validation data and 1% as test data.
In addition, data preprocessing may also be applied to the sample set in order to improve the data quality of the samples and/or to normalize the data. For example, the mouth region of the video may be cropped, an ROI of size 112 × 112 may be extracted and converted into a grayscale image, and this, together with the audio data L_a extracted from the video, may be normalized.
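As an illustrative sketch only (the patent does not specify the crop coordinates or the normalization scheme, so the mouth-box values and the zero-mean, unit-variance audio normalization below are assumptions), this preprocessing might look as follows:

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, mouth_box=(60, 120, 172, 232)) -> np.ndarray:
    """Crop an assumed mouth ROI from a BGR frame and resize it to a 112x112 grayscale image."""
    y1, x1, y2, x2 = mouth_box                          # hypothetical crop coordinates
    roi = cv2.resize(frame[y1:y2, x1:x2], (112, 112))   # fixed-size ROI
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)        # convert to a gray image
    return gray.astype(np.float32) / 255.0

def normalize_audio(waveform: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-variance normalization of the audio L_a extracted from the video."""
    waveform = waveform - waveform.mean()
    return waveform / (waveform.std() + 1e-8)
```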
Step S102, inputting audio data in the audio and video data LRW into an initial MoCov2 model for pre-training to obtain a pre-training MoCov2 model;
for the overall model training architecture, the method and the device for training the speech recognition model combine the pre-training model to assist the training of the speech recognition model, so as to help the speech recognition model to perform model training with better effect.
Specifically, the MoCov2 model in its initial state (which may be referred to as the initial MoCov2 model) may be trained on the audio data in the audio/video data LRW in the sample set, thereby completing the pre-training and obtaining the pre-trained MoCov2 model.
It is understood that the specific training mode for the pre-training of the initial MoCov2 model is similar to the prior art and is not described herein.
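Although the pre-training itself follows the prior art, a minimal MoCo-v2-style sketch is given below for illustration; the backbone interface, projection dimension, queue length, momentum value and temperature are assumptions and are not taken from the patent, and in practice the key branch is initialized from the query branch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoCoV2Sketch(nn.Module):
    """Momentum-contrast pre-training sketch: query/key encoders, MLP heads and a key queue."""
    def __init__(self, encoder_q, encoder_k, hidden_dim, feat_dim=128, queue_len=4096, momentum=0.999):
        super().__init__()
        self.encoder_q = encoder_q                       # encoder module (queries)
        self.encoder_k = encoder_k                       # momentum encoder (keys)
        self.mlp_q = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, feat_dim))
        self.mlp_k = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, feat_dim))
        self.m = momentum
        # queue module: stores past keys as the dictionary used for unsupervised learning
        self.register_buffer("queue", F.normalize(torch.randn(feat_dim, queue_len), dim=0))

    @torch.no_grad()
    def _momentum_update(self):
        for pq, pk in zip(list(self.encoder_q.parameters()) + list(self.mlp_q.parameters()),
                          list(self.encoder_k.parameters()) + list(self.mlp_k.parameters())):
            pk.data = pk.data * self.m + pq.data * (1.0 - self.m)

    def forward(self, x_q, x_k, temperature=0.07):
        q = F.normalize(self.mlp_q(self.encoder_q(x_q)), dim=1)      # query features
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.mlp_k(self.encoder_k(x_k)), dim=1)  # key features
        l_pos = (q * k).sum(dim=1, keepdim=True)                     # positive logits
        l_neg = q @ self.queue.clone().detach()                      # negative logits from the queue
        logits = torch.cat([l_pos, l_neg], dim=1) / temperature
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)                       # InfoNCE objective
        # maintain the queue: enqueue the newest keys, dequeue the oldest ones
        self.queue = torch.cat([k.t(), self.queue[:, :-k.size(0)]], dim=1).detach()
        return loss
```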
The characteristics of the MoCov2 model are illustrated below, taking the initial MoCov2 model as an example.
The initial MoCov2 model is specifically a self-supervised model, and may include an encoder module, a multilayer perceptron module and a queue module, where:
the encoder module processes the tensor corresponding to the input image data to obtain a feature matrix, and constructs the keys of the unsupervised-learning dictionary data (the keys in the key-value pairs) used to retrieve the corresponding data;
the multilayer perceptron module acquires the image features;
the queue module stores and maintains the dictionary data by setting a queue rule.
Step S103, inputting the video data L_v in the audio/video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder, and encoding to obtain a lip feature X_v, and inputting the audio data L_a in the audio/video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder, and encoding to obtain an audio feature X_a;
After the pre-trained MoCov2 model and the Wav2vec2.0 model are configured, training of the subsequent speech recognition model can be carried out.
In the present application, the speech recognition model is centered on a Transformer decoder, and its input is provided by both the pre-trained MoCov2 model and the Wav2vec2.0 model, each of which may be paired with a Transformer encoder; for convenience of description, these two Transformer encoders are referred to as the first Transformer encoder and the second Transformer encoder.
For this arrangement, it can be understood that the output of the three-dimensional convolutional layer is intentionally configured to match the input of the first residual block in the MoCov2 model, thereby providing a compatible interface and extracting temporally and spatially deeper features.
At this time, in carrying out the training process of the speech recognition model, on the one hand the video data L_v in the audio/video data LRS2 may be input into the video coding module configured with the pre-trained MoCov2 model and the first Transformer encoder and encoded to obtain the lip feature X_v, and on the other hand the audio data L_a in the audio/video data LRS2 may be input into the audio coding module configured with the Wav2vec2.0 model and the second Transformer encoder and encoded to obtain the audio feature X_a, thereby providing the features of both modalities.
Here, the Wav2vec2.0 model is an existing, already-trained model.
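A rough sketch of the two encoding branches is shown below; a dummy front-end stands in for the pre-trained MoCov2 and Wav2vec2.0 models, and the hidden size, layer count and head count of the Transformer encoders are assumptions used only for illustration:

```python
import torch
import torch.nn as nn

class DummyFrontend(nn.Module):
    """Stand-in for a pre-trained MoCov2 or Wav2vec2.0 front-end (assumption for illustration)."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                    # x: (batch, sequence, in_dim)
        return self.proj(x)

class ModalityEncoder(nn.Module):
    """Front-end followed by a Transformer encoder, as in the video/audio coding modules."""
    def __init__(self, frontend: nn.Module, feat_dim: int = 512, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        self.frontend = frontend
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        return self.encoder(self.frontend(x))

# video branch (MoCov2 + first Transformer encoder) and audio branch (Wav2vec2.0 + second one)
video_encoder = ModalityEncoder(DummyFrontend(in_dim=112 * 112))
audio_encoder = ModalityEncoder(DummyFrontend(in_dim=320))
X_v = video_encoder(torch.randn(2, 50, 112 * 112))    # lip feature X_v: (batch, L, 512)
X_a = audio_encoder(torch.randn(2, 50, 320))          # audio feature X_a: (batch, L, 512)
```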
Step S104, inputting the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
In addition, to facilitate the input of the speech recognition model and to characterize more accurate feature content, the two previously acquired features, namely the lip feature X_v and the audio feature X_a, undergo the feature fusion process specifically designed in this application, so as to achieve the purpose of alignment correction and feature fusion.
The feature fusion process referred to herein can be performed by a combination module composed of both the cross-modal attention module and the temporal attention module configured in the present application.
It is easy to understand that the combination module comprises two modules, namely the cross-modal attention module and the temporal attention module, through which a better fusion of the two features is achieved.
In particular, the attention mechanism for both can also be understood with reference to the following.
1) The feature processing of the cross-modal attention module includes the following:
X_a ∈ R^(d_a × L) denotes the audio features extracted from the audio modality and X_v ∈ R^(d_v × L) denotes the video features extracted from the video modality, where L denotes the number of sequence segments in the given video and audio input sequences, and x_a^l and x_v^l, l = 1, 2, ..., L, denote the feature vectors of the l-th segment.
The audio feature X_a and the video feature X_v are spliced into the audio-video feature J = [X_a; X_v] ∈ R^(d × L), with d = d_a + d_v, where d_a denotes the feature dimension of the audio (A) modality and d_v the feature dimension of the video (V) modality.
A correlation matrix C_a is computed between the audio feature X_a and the spliced feature J, and a correlation matrix C_v is computed between the video feature X_v and the spliced feature J. The correlation matrices C_a and C_v provide not only the semantic correlation within the same modality but also the semantic correlation across modalities; a higher correlation coefficient in C_a or C_v indicates that the corresponding samples are strongly correlated both within the same modality and with the other modality.
On the basis of the correlation matrix C_a and the audio feature X_a, combined with learnable weight matrices including W_a, the attention weight H_a of the audio modality is calculated. Similarly, on the basis of the correlation matrix C_v and the video feature X_v, combined with learnable weight matrices including W_v, the attention weight H_v of the video modality is calculated.
Using the attention maps, the attention features of the audio modality and the video modality are respectively calculated as:
X_att,a = W_ha · H_a + X_a
X_att,v = W_hv · H_v + X_v
The features X_att,a and X_att,v are spliced to obtain the feature that is input to the temporal attention module:
X_att = [X_att,a; X_att,v].
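Since the exact correlation and attention-weight formulas are given as figures in the original filing, the sketch below is only one plausible co-attention-style realization of the cross-modal attention module described above: the correlation step is folded into learned projections of the spliced feature J, and every layer name and dimension is an assumption.

```python
import torch
import torch.nn as nn

class CrossModalAttentionSketch(nn.Module):
    """Plausible realization of the described cross-modal attention; not the filing's exact equations."""
    def __init__(self, d_a: int, d_v: int, d_hidden: int = 256):
        super().__init__()
        d = d_a + d_v
        self.W_ja = nn.Linear(d, d_hidden)   # projects the spliced feature J for the audio branch
        self.W_jv = nn.Linear(d, d_hidden)   # projects J for the video branch
        self.W_a = nn.Linear(d_a, d_hidden)
        self.W_v = nn.Linear(d_v, d_hidden)
        self.W_ha = nn.Linear(d_hidden, d_a)
        self.W_hv = nn.Linear(d_hidden, d_v)

    def forward(self, X_a, X_v):             # X_a: (B, L, d_a), X_v: (B, L, d_v)
        J = torch.cat([X_a, X_v], dim=-1)    # spliced audio-video feature J = [X_a; X_v]
        H_a = torch.tanh(self.W_a(X_a) + self.W_ja(J))   # audio attention weight (assumed form)
        H_v = torch.tanh(self.W_v(X_v) + self.W_jv(J))   # video attention weight (assumed form)
        X_att_a = self.W_ha(H_a) + X_a       # X_att,a = W_ha * H_a + X_a
        X_att_v = self.W_hv(H_v) + X_v       # X_att,v = W_hv * H_v + X_v
        return torch.cat([X_att_a, X_att_v], dim=-1)     # X_att = [X_att,a ; X_att,v]
```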
for the specific setting of the cross-modal attention mechanism, it is easy to understand that the audio features and the video features belong to two different modalities, so that the audio features and the video features cannot be directly spliced together, the features of the two modalities need to be fused, and the features of the two modalities can be learned mutually through the nested application of the series of formulas, so that the semantic modal relationships of the audio features and the visual features can be better captured.
2) The feature processing of the temporal attention module includes the following:
On the basis of a trainable weight matrix W_T, a corresponding representation matrix T = X_att · W_T, T ∈ R^(L × d), is created, and the temporal attention feature of the temporal attention module is defined from it, where Q is the query vector in the attention mechanism, K is the key vector and V is the value vector, each obtained by multiplying the input features X by the corresponding weight matrix.
With this arrangement, it will be appreciated that, to calculate the attention score, the query matrix is multiplied with the time matrix T, the product is then multiplied by its transpose so that the dimension remains unchanged, and the result is divided by the norm of the time matrix T to avoid obtaining values that are too large.
Regarding this specific arrangement of the temporal attention mechanism, it is readily understood that it provides a concrete way of aggregating and fusing features along the time dimension; the present application considers that both video data and audio data are temporally sequential, so attention can be paid to the content information at the current time as well as at all previous times.
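A minimal sketch of the temporal attention module under this description is given below, assuming a standard scaled dot-product form (the filing divides by the norm of T, so the exact scaling and the dimensions here are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionSketch(nn.Module):
    """Self-attention over the time axis of X_att; a simplified stand-in for the temporal attention module."""
    def __init__(self, d_in: int, d_model: int = 256):
        super().__init__()
        self.W_T = nn.Linear(d_in, d_model)  # trainable weight matrix W_T, giving T = X_att * W_T
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

    def forward(self, X_att):                # X_att: (B, L, d_in)
        T = self.W_T(X_att)                  # representation matrix T
        Q, K, V = self.W_q(T), self.W_k(T), self.W_v(T)
        scores = Q @ K.transpose(-2, -1) / (T.size(-1) ** 0.5)   # keep the scores from growing too large
        return F.softmax(scores, dim=-1) @ V                     # fused feature f over the time dimension
```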
Step S105, inputting the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio/video data LRS2.
It can be seen that, for the training of the speech recognition model, the loss function involved not only processes the speech recognition result output by the model itself but also involves the reference speech recognition result, namely the text feature X_w of the audio/video data LRS2 mentioned here.
For the text feature X_w of the audio/video data LRS2, the method of the present application may further include a corresponding extraction process, that is:
extracting the text feature X_w from the text data L_w of the audio/video data LRS2 through a BERT model.
For the BERT model, in the extraction process, token embedding, segment embedding and position embedding are obtained through an input layer of the BERT model, and then the token embedding, the segment embedding and the position embedding are added to finally obtain an output vector of the input layer.
The BERT model can not only extract the word vectors in a text but also capture the semantics of the text and the relations between words. BERT is a language model; other models attend only to the word vector of each word or character when extracting text features and ignore the contextual relations of the sentence, whereas the BERT model exploits the properties of a language model and adjusts the text features in combination with the context.
Of course, in a specific application, the extraction of the text feature X_w can also be realized by other types of model algorithms, and this can be adjusted according to actual needs.
In addition, at a more detailed level, in the text feature extraction process of the BERT model, random words can be masked and the text feature X_w can then be predicted, thereby obtaining good word vector features.
It should be understood that the BERT model itself can extract text features. Here a masked-word task is configured: 15% of the words in the data set are replaced with [MASK], and to identify the masked words the model has to learn, from the semantic relationships in the preceding and following text, the probability of each candidate for the current mask, selecting the value with the highest probability as the final output. From such a task the BERT model learns the semantic relationships between sentences well, so the context can also be taken into account when extracting the features of a single word, thereby obtaining good word vector features.
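A sketch of extracting the text feature X_w with an off-the-shelf BERT encoder is shown below; the Hugging Face transformers interface and the bert-base-uncased checkpoint are assumptions, since the description only specifies that a BERT model is used:

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the description only specifies that a BERT model is used.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def extract_text_feature(sentence: str) -> torch.Tensor:
    """Return the token-level text feature X_w for one transcript sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")   # token/segment/position embeddings are built internally
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state                    # shape: (1, num_tokens, 768)

X_w = extract_text_feature("an example transcript sentence")
```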
It can be seen that the speech recognition model to be trained in the present application is specifically composed of both the Transformer decoder and the CTC model.
It can be understood that the purpose of the Transformer is to project the audio modality and the video modality into the same coding space through its encoder-decoder structure, so as to fuse the multi-modal information, and that the decoder is composed of a CTC and a Transformer decoder.
During the training of the Transformer decoder, the loss may specifically consist of a Smooth L1 loss, a CTC loss and a cross-entropy loss, specifically as follows.
The Smooth L1 loss takes the standard form:
L_1(d) = 0.5 · d^2 if |d| < 1, and |d| - 0.5 otherwise,
where d is the difference between the text feature X_w of the audio/video data LRS2 and the language feature decoded by the Transformer decoder.
The CTC loss assumes conditional independence between the output predictions, and takes the form:
p_CTC(y|x) ≈ Π_{t=1..T} p(y_t | x).
The autoregressive decoder gets rid of this assumption by directly estimating the posterior on the basis of the chain rule, which takes the form:
p_CE(y|x) = Π_{l=1..L} p(y_l | y_1, ..., y_{l-1}, x),
where x = [x_1, ..., x_T] is the output sequence output by the speech recognition model, y = [y_1, ..., y_L] is the target, T denotes the input length and L denotes the target length.
The total loss is defined as:
Loss = λ · log p_CTC(y|x) + (1 - λ) · log p_CE(y|x) + L_1(d),
where λ controls the relative weight of the hybrid CTC/attention mechanism between the CTC loss and the cross-entropy loss.
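A hedged sketch of this combined objective is given below, using the usual PyTorch loss functions (which return negative log-likelihoods, whereas the terms above are written as log-probabilities); the argument shapes, the CTC blank index and the default λ = 0.5 are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)   # assumed blank index

def total_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
               dec_logits, dec_targets, decoded_text_feat, bert_text_feat, lam=0.5):
    # CTC branch: ctc_log_probs is (T, batch, vocab) and already log-softmaxed
    ctc = ctc_criterion(ctc_log_probs, ctc_targets, input_lens, target_lens)
    # attention branch: dec_logits is (batch, L, vocab), dec_targets is (batch, L)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets, ignore_index=-100)
    # Smooth L1 term L_1(d) between the decoded language feature and the BERT text feature X_w
    l1 = F.smooth_l1_loss(decoded_text_feat, bert_text_feat)
    return lam * ctc + (1.0 - lam) * ce + l1
```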
In addition, before training the Transformer decoder, related network settings, such as the number of hidden layers, the number of nodes of the hidden layers, the learning rate, and the like, can be set.
In the training process, the model can be optimized according to the calculated value of the loss function, for example according to the calculated Smooth L1 loss, CTC loss and cross-entropy loss: back propagation is carried out, the model parameters are optimized, and when, after multiple rounds of propagation, training requirements such as 1000 iterations, a target recognition accuracy or a training duration are met, training of the model is completed.
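An illustrative optimization loop for this stage is sketched below; the optimizer, the learning rate and the assumption that the model returns the combined loss for a batch are not taken from the description, while the 1000-iteration budget follows the example above:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train(model: nn.Module, train_loader: DataLoader, max_steps: int = 1000, lr: float = 1e-4):
    """Back-propagate the combined loss and optimize the model parameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # assumed optimizer and learning rate
    step = 0
    for batch in train_loader:
        loss = model(**batch)          # assumed to return the combined loss defined above
        optimizer.zero_grad()
        loss.backward()                # back propagation
        optimizer.step()               # optimize the model parameters
        step += 1
        if step >= max_steps:          # stop once the configured training requirement is met
            return
```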
For ease of understanding of the above description, including exemplary embodiments, reference may also be made to an architectural diagram of the model training architecture of the present application, shown in FIG. 2.
With reference to the training architecture shown in fig. 2, it can be understood that, for the visual modality, self-supervised pre-training is performed on the visual front end, where the self-supervised pre-training is completed using the MoCov2 model, and then sequence-classification adaptation and training are performed on the word-level video segments in the audio/video data LRW;
after that, the visual front end is inherited by a video-only model in which the visual back end and a dedicated decoder are used. Finally, the audio features and the visual features are extracted by the trained audio model and lip-reading model and sent to the fusion module; in view of computational limits, the audio and video back-end outputs can be pre-computed, and only the parameters of the fusion module and of the decoder part are learned in the final stage.
After training of the model is finished, the model can be put into practical use and speech recognition applications can be carried out.
Correspondingly, the method of the application can further comprise:
and inputting the voice data to be recognized into the voice recognition model, performing voice recognition processing by the voice recognition model, and obtaining a voice recognition result output by the voice recognition model.
In addition, the speech recognition accuracy of the trained model can be verified (the same verification approach can also be adopted during the training process).
As an example, the speaker's video together with audio at signal-to-noise ratios of 0 dB, 5 dB and 10 dB may each be fed to the trained model, the recognized text is output, and the recognition effect is evaluated using the word error rate, an index of speech recognition performance that measures the error rate between the predicted text and the reference text (the smaller the word error rate, the better). The calculation formula is:
WER = (S + D + I) / N,
wherein S represents the number of substitutions needed when converting the predicted sample into the reference sample, D represents the number of deletions, I represents the number of insertions, N represents the total number of words (or English words) in the reference sentence, and C represents the number of words correctly recognized in the predicted sentence.
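A minimal sketch of this metric, computing the word-level edit distance (S + D + I) and dividing by the number of words N in the reference, is:

```python
def word_error_rate(predicted: str, reference: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / number of reference words."""
    pred, ref = predicted.split(), reference.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j predicted words
    dp = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(pred) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(pred) + 1):
            sub = 0 if ref[i - 1] == pred[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + sub)   # substitution or match
    return dp[len(ref)][len(pred)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # 0.25
```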
From the above, for the training of the speech recognition model, the MoCov2 model and the Wav2vec2.0 model are added to the model training architecture, so that more robust and stable audio and video features can be obtained. A cross-modal attention mechanism and a temporal attention mechanism are then introduced, so that the audio and video feature information can be corrected and aligned and a fused representation can be obtained, after which the speech recognition model is trained to decode and output the speech recognition result. In the training process, the video features are combined with the text features so that the video features carry more textual information, which yields sample data of better quality and allows the speech recognition model to be trained more effectively; the trained speech recognition model therefore has higher speech recognition accuracy, so that in specific applications the interference of environmental noise on speech recognition can be greatly reduced and a more accurate speech recognition result can be obtained.
The above is the introduction of the training method of the speech recognition model provided by the present application, and in order to better implement the training method of the speech recognition model provided by the present application, the present application also provides a training device of the speech recognition model from the perspective of a functional module.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a training apparatus for speech recognition models according to the present application, in which the training apparatus 300 for speech recognition models specifically includes the following structure:
a sample obtaining unit 301, configured to obtain a sample set, where the sample set includes audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
the pre-training unit 302 is configured to input audio data in the audio/video data LRW into an initial MoCov2 model for pre-training to obtain a pre-training MoCov2 model;
a feature encoding unit 303, configured to input the video data L_v in the audio/video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder and encode it to obtain a lip feature X_v, and to input the audio data L_a in the audio/video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder and encode it to obtain an audio feature X_a;
a feature fusion unit 304, configured to input the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
a training unit 305, configured to input the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio/video data LRS2.
In one exemplary implementation, the initial MoCov2 model is a self-supervised model, and the initial MoCov2 model includes an encoder module, a multilayer perceptron module and a queue module;
the encoder module processes the tensor corresponding to the input image data to obtain a feature matrix, and constructs the keys of the unsupervised-learning dictionary data used to retrieve the corresponding data;
the multilayer perceptron module acquires the image features;
the queue module stores and maintains the dictionary data by setting a queue rule.
In yet another exemplary implementation, the apparatus further includes a feature extraction unit 306, configured to:
extracting the text feature X_w from the text data L_w of the audio/video data LRS2 through a BERT model.
In another exemplary implementation, in the text feature extraction process, the BERT model masks random words and then predicts the text feature X_w.
In yet another exemplary implementation, the feature processing of the cross-modal attention module includes the following:
X_a ∈ R^(d_a × L) denotes the audio features extracted from the audio modality and X_v ∈ R^(d_v × L) denotes the video features extracted from the video modality, where L denotes the number of sequence segments in the given video and audio input sequences, and x_a^l and x_v^l, l = 1, 2, ..., L, denote the feature vectors of the l-th segment.
The audio feature X_a and the video feature X_v are spliced into the audio-video feature J = [X_a; X_v] ∈ R^(d × L), with d = d_a + d_v, where d_a denotes the feature dimension of the audio (A) modality and d_v the feature dimension of the video (V) modality.
A correlation matrix C_a is computed between the audio feature X_a and the spliced feature J, and a correlation matrix C_v is computed between the video feature X_v and the spliced feature J.
On the basis of the correlation matrix C_a and the audio feature X_a, combined with learnable weight matrices including W_a, the attention weight H_a of the audio modality is calculated; on the basis of the correlation matrix C_v and the video feature X_v, combined with learnable weight matrices including W_v, the attention weight H_v of the video modality is calculated.
Using the attention maps, the attention features of the audio modality and the video modality are respectively calculated as:
X_att,a = W_ha · H_a + X_a
X_att,v = W_hv · H_v + X_v
The features X_att,a and X_att,v are spliced to obtain the feature that is input to the temporal attention module:
X_att = [X_att,a; X_att,v].
in yet another exemplary implementation, the feature processing of the temporal attention modality includes the following:
in a trainable weight matrix W T On the basis of (A), a corresponding representation matrix T = X is created att W T ,T∈R L×d And defining a temporal attention feature of the temporal attention modality by:
Figure BDA0003931930520000141
in another exemplary implementation, the training process of the transform decoder includes the following steps:
the Smooth L1 loss function using the following formula:
Figure BDA0003931930520000142
wherein,
Figure BDA0003931930520000143
d is the text characteristic X of the audio/video data LRS2 w Based on the language feature decoded by the transform decoder>
Figure BDA0003931930520000144
The difference between the values of the two signals,
CTC loss assumes conditional independence between each output prediction, in the form:
Figure BDA0003931930520000145
the autoregressive decoder breaks away from the assumption by directly estimating the posteriori of the chain rule, which is of the form:
Figure BDA0003931930520000146
x = [ x1, \8230;, xT ] is the output sequence of the speech recognition model output, y = [ y1, \8230;, yL ] is the target, T represents the input length, L represents the target length,
the total loss is defined as:
Loss=λlog p CTC (y | x) + (1- λ) log p CE (y|x)+L 1 (d),
Where λ controls the relative weight of the hybrid CTC/attention mechanism between CTC loss and cross-entropy loss.
The present application further provides a processing device from a hardware structure perspective, referring to fig. 4, fig. 4 shows a schematic structural diagram of the processing device of the present application, specifically, the processing device of the present application may include a processor 401, a memory 402, and an input/output device 403, where the processor 401 is configured to implement the steps of the training method of the speech recognition model in the corresponding embodiment of fig. 1 when executing the computer program stored in the memory 402; alternatively, the processor 401 is configured to implement the functions of the units in the embodiment corresponding to fig. 3 when executing the computer program stored in the memory 402, and the memory 402 is configured to store the computer program required by the processor 401 to execute the training method of the speech recognition model in the embodiment corresponding to fig. 1.
Illustratively, a computer program may be partitioned into one or more modules/units, which are stored in memory 402 and executed by processor 401 to accomplish the present application. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.
The processing devices may include, but are not limited to, a processor 401, a memory 402, and input-output devices 403. Those skilled in the art will appreciate that the illustration is merely an example of a processing device and does not constitute a limitation of the processing device and may include more or less components than those illustrated, or combine certain components, or different components, e.g., the processing device may also include a network access device, bus, etc., through which the processor 401, memory 402, input output device 403, etc., are connected.
The Processor 401 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, the processor being the control center for the processing device and the various interfaces and lines connecting the various parts of the overall device.
The memory 402 may be used to store computer programs and/or modules, and the processor 401 may implement various functions of the computer device by operating or executing the computer programs and/or modules stored in the memory 402 and invoking data stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to use of the processing apparatus, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The processor 401, when executing the computer program stored in the memory 402, may specifically implement the following functions:
acquiring a sample set, wherein the sample set comprises audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
inputting audio data in the audio and video data LRW into an initial MoCov2 model for pre-training to obtain a pre-training MoCov2 model;
inputting the video data L_v in the audio and video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder, and encoding to obtain a lip feature X_v, and inputting the audio data L_a in the audio and video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder, and encoding to obtain an audio feature X_a;
inputting the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
inputting the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio and video data LRS2.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the above-described specific working processes of the training apparatus and the processing device for the speech recognition model and the corresponding units thereof may refer to the description of the training method for the speech recognition model in the embodiment corresponding to fig. 1, and are not described herein again in detail.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
For this reason, the present application provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps of the training method of the speech recognition model in the embodiment corresponding to fig. 1 in the present application, and for specific operations, reference may be made to the description of the training method of the speech recognition model in the embodiment corresponding to fig. 1, which is not repeated herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), random Access Memory (RAM), magnetic or optical disk, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps of the method for training the speech recognition model in the embodiment corresponding to fig. 1, the beneficial effects that can be achieved by the method for training the speech recognition model in the embodiment corresponding to fig. 1 can be achieved, which are detailed in the foregoing description and will not be repeated herein.
The above detailed description is provided for the training method, apparatus, processing device and computer-readable storage medium of the speech recognition model provided in the present application, and a specific example is applied in this document to explain the principle and implementation of the present application, and the description of the above embodiment is only used to help understanding the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for training a speech recognition model, the method comprising:
acquiring a sample set, wherein the sample set comprises audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
inputting audio data in the audio and video data LRW into an initial MoCov2 model for pre-training to obtain a pre-trained MoCov2 model;
inputting the video data L_v in the audio and video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder, and encoding to obtain a lip feature X_v, and inputting the audio data L_a in the audio and video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder, and encoding to obtain an audio feature X_a;
inputting the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
inputting the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio and video data LRS2.
2. The method of claim 1, wherein the initial MoCov2 model is a self-supervised model, the initial MoCov2 model comprising an encoder module, a multilayer perceptron module and a queue module;
the encoder module processes a tensor corresponding to the input image data to obtain a feature matrix, and constructs the keys of the unsupervised-learning dictionary data used to retrieve corresponding data;
the multilayer perceptron module acquires image features;
the queue module stores and maintains the dictionary data by setting a queue rule.
3. The method of claim 1, further comprising:
extracting the text feature X_w from the text data L_w of the audio/video data LRS2 through a BERT model.
4. The method of claim 3, wherein, during the text feature extraction process, the BERT model masks random words and then predicts the text feature X_w.
5. The method according to claim 1, wherein the feature processing of the cross-modal attention module comprises the following:
X_a ∈ R^(d_a × L) denotes the audio features extracted from the audio modality and X_v ∈ R^(d_v × L) denotes the video features extracted from the video modality, where L denotes the number of sequence segments in the given video and audio input sequences, and x_a^l and x_v^l, l = 1, 2, ..., L, denote the feature vectors of the l-th segment;
the audio feature X_a and the video feature X_v are spliced into the audio-video feature J = [X_a; X_v] ∈ R^(d × L), with d = d_a + d_v, where d_a denotes the feature dimension of the audio (A) modality and d_v the feature dimension of the video (V) modality;
a correlation matrix C_a is computed between the audio feature X_a and the spliced feature J, and a correlation matrix C_v is computed between the video feature X_v and the spliced feature J;
on the basis of the correlation matrix C_a and the audio feature X_a, combined with learnable weight matrices including W_a, the attention weight H_a of the audio modality is calculated; on the basis of the correlation matrix C_v and the video feature X_v, combined with learnable weight matrices including W_v, the attention weight H_v of the video modality is calculated;
using the attention maps, the attention features of the audio modality and the video modality are respectively calculated as:
X_att,a = W_ha · H_a + X_a,
X_att,v = W_hv · H_v + X_v;
the features X_att,a and X_att,v are spliced to obtain the feature input to the temporal attention module:
X_att = [X_att,a; X_att,v].
6. The method according to claim 1, wherein the feature processing of the temporal attention module comprises the following:
on the basis of a trainable weight matrix W_T, a corresponding representation matrix T = X_att · W_T, T ∈ R^(L × d), is created, and the temporal attention feature of the temporal attention module is defined from it using the query, key and value vectors of the attention mechanism.
7. The method of claim 1, wherein the training process of the Transformer decoder comprises the following:
the Smooth L1 loss takes the standard form
L_1(d) = 0.5 · d^2 if |d| < 1, and |d| - 0.5 otherwise,
where d is the difference between the text feature X_w of the audio/video data LRS2 and the language feature decoded by the Transformer decoder;
the CTC loss assumes conditional independence between the output predictions and takes the form
p_CTC(y|x) ≈ Π_{t=1..T} p(y_t | x);
the autoregressive decoder gets rid of this assumption by directly estimating the posterior on the basis of the chain rule, which takes the form
p_CE(y|x) = Π_{l=1..L} p(y_l | y_1, ..., y_{l-1}, x),
where x = [x_1, ..., x_T] is the output sequence output by the speech recognition model, y = [y_1, ..., y_L] is the target, T denotes the input length and L denotes the target length;
the total loss is defined as
Loss = λ · log p_CTC(y|x) + (1 - λ) · log p_CE(y|x) + L_1(d),
where λ controls the relative weight of the hybrid CTC/attention mechanism between the CTC loss and the cross-entropy loss.
8. An apparatus for training a speech recognition model, the apparatus comprising:
the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a sample set, and the sample set comprises audio and video data LRW at a character level and audio and video data LRS2 at a sentence level;
the pre-training unit is used for inputting the audio data in the audio and video data LRW into an initial MoCov2 model for pre-training to obtain a pre-training MoCov2 model;
a feature encoding unit, configured to input the video data L_v in the audio/video data LRS2 into a video coding module configured with the pre-trained MoCov2 model and a first Transformer encoder and encode it to obtain a lip feature X_v, and to input the audio data L_a in the audio/video data LRS2 into an audio coding module configured with a Wav2vec2.0 model and a second Transformer encoder and encode it to obtain an audio feature X_a;
a feature fusion unit, configured to input the lip feature X_v and the audio feature X_a into a combination module consisting of a cross-modal attention module and a temporal attention module to obtain a fusion feature f;
a training unit, configured to input the fusion feature f into a speech recognition model formed by a Transformer decoder and a CTC model for training, wherein the loss function in the training process is calculated from the speech recognition result output by the speech recognition model and the text feature X_w of the audio/video data LRS2.
9. A processing device comprising a processor and a memory, a computer program being stored in the memory, the processor performing the method according to any of claims 1 to 7 when calling the computer program in the memory.
10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method of any one of claims 1 to 7.

Priority Applications (1)

CN202211392542.9A (priority date 2022-11-08, filing date 2022-11-08): Training method and device of voice recognition model and processing equipment

Publications (1)

CN115881101A, published 2023-03-31

Family

ID: 85759508; family application CN202211392542.9A, filed 2022-11-08 (pending); country: CN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination