CN116110378B - Model training method, voice recognition device and electronic equipment

Model training method, voice recognition device and electronic equipment

Info

Publication number: CN116110378B
Application number: CN202310383270.4A
Authority: CN (China)
Prior art keywords: feature vector, voice, sequence, text, fusion
Other languages: Chinese (zh)
Other versions: CN116110378A
Inventors: 韩明伦, 石晶, 徐爽, 徐波
Current Assignee: Institute of Automation of Chinese Academy of Science
Original Assignee: Institute of Automation of Chinese Academy of Science
Application filed by Institute of Automation of Chinese Academy of Science; priority to CN202310383270.4A; published as CN116110378A, granted as CN116110378B.
Legal status: Active (granted)

Classifications

    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/26 Speech to text systems

Abstract

The application provides a model training method, a voice recognition method, a device and electronic equipment, and relates to the technical field of voice recognition. The method comprises the following steps: acquiring a voice recognition model obtained through training based on a continuous integrate-and-fire (CIF) mechanism, and initializing the model parameters of an initial acoustic coding module and the model parameters of an initial CIF module in an initial multi-modal voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model, respectively; and training the initialized multi-modal voice recognition model based on voice samples and the visual image samples and text samples corresponding to the voice samples. The trained multi-modal voice recognition model thus introduces contextual visual knowledge and contextual language knowledge into multi-modal voice recognition when performing voice recognition, thereby effectively improving voice recognition performance and expanding the boundary of multi-modal voice recognition.

Description

Model training method, voice recognition device and electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a model training method, a speech recognition method, a device, and an electronic apparatus.
Background
In recent years, with the continuous development of speech recognition technology, speech recognition models have been widely used in various scenes such as video subtitle generation, video conference transcription, and the like because of their strong learning ability.
In the prior art, a voice recognition model is mainly trained on voice samples combined with the lip movement visual information corresponding to the voice samples. However, this scheme requires the lip movement visual information and the content of the voice sample to be strictly aligned in time, and in many actual voice recognition scenes such strict temporal alignment is difficult to guarantee, so the voice recognition performance of the resulting voice recognition model is poor.
Therefore, how to train a speech recognition model with better speech recognition performance is a problem to be solved by those skilled in the art.
Disclosure of Invention
The application provides a model training method, a voice recognition method, a device and electronic equipment, which can effectively improve voice recognition performance.
In a first aspect, the present application provides a model training method, which may include:
Acquiring a voice recognition model obtained through training based on a continuous integrate-and-fire (CIF) mechanism.
And initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model respectively based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model.
A plurality of sample pairs are obtained, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample.
Training the initialized multi-modal speech recognition model based on the plurality of sample pairs to obtain a trained multi-modal speech recognition model.
According to the model training method provided by the application, the initialized multi-modal speech recognition model comprises a multi-modal sensing module, an acoustic coding module, a CIF module and a decoding module, and the training of the initialized multi-modal speech recognition model based on the plurality of sample pairs comprises the following steps:
for each of the sample pairs, the following is performed:
inputting an acoustic characterization sequence corresponding to a voice sample in the sample pair into the acoustic coding module to obtain a first voice feature vector sequence corresponding to the voice sample, inputting the first voice feature vector sequence into the CIF module, determining a prediction weight sequence corresponding to the voice sample through the CIF module, and determining a second voice feature vector sequence corresponding to the voice sample based on the prediction weight sequence.
And inputting the visual image samples in the sample pair into a visual image encoder in the multi-mode sensing module to obtain a visual feature vector sequence corresponding to the visual image samples.
And inputting the text samples in the sample pair into a text encoder in the multi-mode sensing module to obtain a text feature vector sequence corresponding to the text samples.
And at each decoding moment, inputting a predicted text representation vector at the previous moment, a second voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain a probability value of the predicted text of the voice sample at the current moment.
And training the initialized multi-modal voice recognition model according to the label text sequence corresponding to each voice sample, the first voice feature vector sequence, the weight sequence label, the predicted weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence.
According to the model training method provided by the application, the decoding module comprises a feature fusion layer and a post-processing module which are connected in series, the predicted text representation vector at the previous moment, the second voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence are input into the decoding module, so as to obtain the probability value of the predicted text of the voice sample at the current moment, and the method comprises the following steps:
And inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the second voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer for fusion, so as to obtain a target fusion feature vector.
And inputting the target fusion feature vector and the second voice feature vector corresponding to the current moment into the post-processing module to obtain the probability value of the predicted text.
According to the model training method provided by the application, the feature fusion layer comprises an acoustic language fusion layer, a visual fusion layer and a language fusion layer which are sequentially connected in series, the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the second voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence are all input into the feature fusion layer for fusion, and a target fusion feature vector is obtained, and the model training method comprises the following steps:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a first fused feature vector.
And inputting the visual feature vector sequence and the first fusion feature vector into the visual fusion layer for fusion to obtain a second fusion feature vector.
And inputting the text feature vector sequence and the second fusion feature vector to the language fusion layer for fusion to obtain the target fusion feature vector.
According to the model training method provided by the application, the feature fusion layer comprises an acoustic language fusion layer, a language fusion layer and a visual fusion layer which are sequentially connected in series, the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the second voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence are all input into the feature fusion layer for fusion, and a target fusion feature vector is obtained, and the model training method comprises the following steps:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a first fused feature vector.
And inputting the text feature vector sequence and the first fusion feature vector into the language fusion layer for fusion to obtain a third fusion feature vector.
And inputting the visual feature vector sequence and the third fusion feature vector into the visual fusion layer for fusion to obtain the target fusion feature vector.
According to the model training method provided by the application, training the initialized multi-modal speech recognition model according to the label text sequence corresponding to each speech sample, the first speech feature vector sequence, the weight sequence label, the predicted weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence includes:
for each voice sample, constructing a connection time sequence classification loss function corresponding to the voice sample according to a label text sequence corresponding to the voice sample and the first voice feature vector sequence; constructing a quantity loss function corresponding to the voice sample according to the weight sequence label and the predicted weight sequence corresponding to the voice sample; and constructing a cross entropy loss function corresponding to the voice sample according to the probability value of the label text sequence corresponding to the voice sample and the probability value of the predicted text.
And training the initialized multi-modal voice recognition model according to the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to each voice sample.
According to the model training method provided by the application, the training of the initialized multi-modal speech recognition model according to the connection time sequence classification loss function, the number loss function and the cross entropy loss function corresponding to each speech sample comprises the following steps:
and for each voice sample, carrying out weighting processing on the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to the voice sample to obtain a target loss function corresponding to the voice sample.
And training the initialized multi-modal voice recognition model according to the target loss function corresponding to each voice sample.
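Illustratively, the following simplified sketch (in Python, using PyTorch) shows one way the connection time sequence classification loss, the quantity loss and the cross entropy loss described above could be combined into a weighted target loss function. The function and argument names, the interpretation of the weight sequence label as the length of the label text sequence, and the weighting coefficients are assumptions made for the example only and are not fixed by the present application.

```python
import torch
import torch.nn.functional as F

def target_loss(ctc_log_probs, label_ids, input_lengths, label_lengths,
                pred_weights, pred_logits, lambda_ctc=0.5, lambda_qua=1.0):
    """Weighted combination of the CTC loss, the quantity loss and the cross entropy loss."""
    # CTC loss between the first speech feature vector sequence (projected to the
    # vocabulary and log-softmaxed, shape (T, batch, vocab)) and the label text sequence.
    loss_ctc = F.ctc_loss(ctc_log_probs, label_ids, input_lengths, label_lengths)

    # Quantity loss: the summed predicted CIF weights should match the weight sequence
    # label, taken here to be the number of tokens in the label text sequence.
    loss_qua = torch.abs(pred_weights.sum(dim=-1) - label_lengths.float()).mean()

    # Cross entropy between the predicted text probabilities and the label text sequence
    # (padding handling omitted for brevity).
    loss_ce = F.cross_entropy(pred_logits.reshape(-1, pred_logits.size(-1)),
                              label_ids.reshape(-1))

    # Weighted sum giving the target loss for one batch of speech samples.
    return loss_ce + lambda_ctc * loss_ctc + lambda_qua * loss_qua
```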
In a second aspect, the present application further provides a speech recognition method, the speech recognition method including:
and acquiring the voice to be recognized, and a visual image and a text corresponding to the voice to be recognized.
Inputting the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain a predicted text corresponding to the voice to be recognized and a probability value of the predicted text, wherein the multi-modal voice recognition model is the multi-modal voice recognition model obtained by training any one of the first aspect.
According to the voice recognition method provided by the application, the multi-modal voice recognition model comprises a multi-modal sensing module, an acoustic coding module, a continuous integrate-and-fire (CIF) module and a decoding module, wherein the voice to be recognized, the visual image and the text are input into the multi-modal voice recognition model to obtain a predicted text corresponding to the voice to be recognized and a probability value of the predicted text, and the voice recognition method comprises the following steps:
inputting the acoustic characterization sequence corresponding to the voice to be recognized into the acoustic encoding module to obtain a third voice feature vector sequence corresponding to the voice to be recognized, inputting the third voice feature vector sequence into the CIF module, determining a prediction weight sequence corresponding to the voice to be recognized through the CIF module, and determining a fourth voice feature vector sequence corresponding to the voice to be recognized based on the prediction weight sequence.
And inputting the visual image into a visual image encoder in the multi-mode sensing module to obtain a visual feature vector sequence corresponding to the visual image.
And inputting the text into a text encoder in the multi-mode sensing module to obtain a text feature vector sequence corresponding to the text.
And at each decoding moment, inputting a predicted text representation vector at the previous moment, a fourth voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain probability values of the predicted text and the predicted text of the voice to be recognized at the current moment.
According to the voice recognition method provided by the application, the decoding module includes a feature fusion layer and a post-processing module connected in series, and the inputting the predicted text characterization vector at the previous moment, the fourth voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module, to obtain the probability values of the predicted text and the predicted text of the voice to be recognized at the current moment includes:
and inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer for fusion to obtain a fusion feature vector.
And inputting the fusion feature vector and the fourth voice feature vector corresponding to the current moment into the post-processing module to obtain the predicted text and the probability value of the predicted text.
According to the voice recognition method provided by the application, the feature fusion layer comprises an acoustic language fusion layer, a visual fusion layer and a language fusion layer which are sequentially connected in series, the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence are input to the feature fusion layer to be fused, and the fusion feature vector is obtained, and comprises the following steps:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the visual feature vector sequence and the fourth fusion feature vector into the visual fusion layer for fusion to obtain a fifth fusion feature vector.
And inputting the text feature vector sequence and the fifth fusion feature vector into the language fusion layer for fusion to obtain the fusion feature vector.
According to the voice recognition method provided by the application, the feature fusion layer comprises an acoustic language fusion layer, a language fusion layer and a visual fusion layer which are sequentially connected in series, the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence are input to the feature fusion layer to be fused, and the fusion feature vector is obtained, and comprises the following steps:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the text feature vector sequence and the fourth fusion feature vector into the language fusion layer for fusion to obtain a sixth fusion feature vector.
And inputting the visual feature vector sequence and the sixth fusion feature vector into the visual fusion layer for fusion to obtain the fusion feature vector.
In a third aspect, the present application further provides a model training apparatus, including:
the first acquisition unit is used for acquiring a voice recognition model obtained through training based on a continuous integrate-and-fire (CIF) mechanism.
The first processing unit is used for respectively initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model.
And the second acquisition unit is used for acquiring a plurality of sample pairs, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample.
And the second processing unit is used for training the initialized multi-modal voice recognition model based on the plurality of sample pairs so as to obtain a trained multi-modal voice recognition model.
According to the model training device provided by the application, the initialized multi-mode voice recognition model comprises a multi-mode sensing module, an acoustic coding module, a CIF module and a decoding module, and the second processing unit is specifically used for:
for each of the sample pairs, the following is performed:
inputting an acoustic characterization sequence corresponding to a voice sample in the sample pair into the acoustic coding module to obtain a first voice feature vector sequence corresponding to the voice sample, inputting the first voice feature vector sequence into the CIF module, determining a prediction weight sequence corresponding to the voice sample through the CIF module, and determining a second voice feature vector sequence corresponding to the voice sample based on the prediction weight sequence.
And inputting the visual image samples in the sample pair into a visual image encoder in the multi-mode sensing module to obtain a visual feature vector sequence corresponding to the visual image samples.
And inputting the text samples in the sample pair into a text encoder in the multi-mode sensing module to obtain a text feature vector sequence corresponding to the text samples.
And at each decoding moment, inputting a predicted text representation vector at the previous moment, a second voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain a probability value of the predicted text of the voice sample at the current moment.
And training the initialized multi-modal voice recognition model according to the label text sequence corresponding to each voice sample, the first voice feature vector sequence, the weight sequence label, the predicted weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence.
According to the model training device provided by the application, the decoding module comprises a feature fusion layer and a post-processing module which are connected in series, and the second processing unit is specifically used for:
And inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the second voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer for fusion, so as to obtain a target fusion feature vector.
And inputting the target fusion feature vector and the second voice feature vector corresponding to the current moment into the post-processing module to obtain the probability value of the predicted text.
According to the model training device provided by the application, the feature fusion layer comprises an acoustic language fusion layer, a visual fusion layer and a language fusion layer which are sequentially connected in series, and the second processing unit is specifically used for:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a first fused feature vector.
And inputting the visual feature vector sequence and the first fusion feature vector into the visual fusion layer for fusion to obtain a second fusion feature vector.
And inputting the text feature vector sequence and the second fusion feature vector to the language fusion layer for fusion to obtain the target fusion feature vector.
According to the model training device provided by the application, the feature fusion layer comprises an acoustic language fusion layer, a language fusion layer and a visual fusion layer which are sequentially connected in series, and the second processing unit is specifically used for:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a first fused feature vector.
And inputting the text feature vector sequence and the first fusion feature vector into the language fusion layer for fusion to obtain a third fusion feature vector.
And inputting the visual feature vector sequence and the third fusion feature vector into the visual fusion layer for fusion to obtain the target fusion feature vector.
According to the model training device provided by the application, the second processing unit is specifically used for:
for each voice sample, constructing a connection time sequence classification loss function corresponding to the voice sample according to a label text sequence corresponding to the voice sample and the first voice feature vector sequence; constructing a quantity loss function corresponding to the voice sample according to the weight sequence label and the predicted weight sequence corresponding to the voice sample; and constructing a cross entropy loss function corresponding to the voice sample according to the probability value of the label text sequence corresponding to the voice sample and the probability value of the predicted text sequence.
And training the initialized multi-modal voice recognition model according to the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to each voice sample.
According to the model training device provided by the application, the second processing unit is specifically configured to:
and for each voice sample, carrying out weighting processing on the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to the voice sample to obtain a target loss function corresponding to the voice sample.
And training the initialized multi-modal voice recognition model according to the target loss function corresponding to each voice sample.
In a fourth aspect, the present application further provides a speech recognition apparatus, including:
and the third acquisition unit is used for acquiring the voice to be recognized, the visual image and the text corresponding to the voice to be recognized.
The third processing unit is configured to input the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain a predicted text corresponding to the voice to be recognized and a probability value of the predicted text, where the multi-modal voice recognition model is the multi-modal voice recognition model obtained by training any one of the first aspect.
According to the voice recognition device provided by the application, the multi-modal voice recognition model comprises a multi-modal sensing module, an acoustic coding module, a continuous integrate-and-fire (CIF) module and a decoding module, and the third processing unit is specifically configured to:
inputting the acoustic characterization sequence corresponding to the voice to be recognized into the acoustic encoding module to obtain a third voice feature vector sequence corresponding to the voice to be recognized, inputting the third voice feature vector sequence into the CIF module, determining a prediction weight sequence corresponding to the voice to be recognized through the CIF module, and determining a fourth voice feature vector sequence corresponding to the voice to be recognized based on the prediction weight sequence.
And inputting the visual image into a visual image encoder in the multi-mode sensing module to obtain a visual feature vector sequence corresponding to the visual image.
And inputting the text into a text encoder in the multi-mode sensing module to obtain a text feature vector sequence corresponding to the text.
And at each decoding moment, inputting a predicted text representation vector at the previous moment, a fourth voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain probability values of the predicted text and the predicted text of the voice to be recognized at the current moment.
According to the voice recognition device provided by the application, the decoding module comprises a feature fusion layer and a post-processing module which are connected in series, and the third processing unit is specifically used for:
and inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer for fusion to obtain a fusion feature vector.
And inputting the fusion feature vector and the fourth voice feature vector corresponding to the current moment into the post-processing module to obtain the predicted text and the probability value of the predicted text.
According to the voice recognition device provided by the application, the characteristic fusion layer comprises an acoustic language fusion layer, a visual fusion layer and a language fusion layer which are sequentially connected in series, and the third processing unit is specifically used for:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the visual feature vector sequence and the fourth fusion feature vector into the visual fusion layer for fusion to obtain a fifth fusion feature vector.
And inputting the text feature vector sequence and the fifth fusion feature vector into the language fusion layer for fusion to obtain the fusion feature vector.
According to the voice recognition device provided by the application, the characteristic fusion layer comprises an acoustic language fusion layer, a language fusion layer and a visual fusion layer which are sequentially connected in series, and the third processing unit is specifically used for:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the text feature vector sequence and the fourth fusion feature vector into the language fusion layer for fusion to obtain a sixth fusion feature vector.
And inputting the visual feature vector sequence and the sixth fusion feature vector into the visual fusion layer for fusion to obtain the fusion feature vector.
In a fifth aspect, the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the model training method according to any one of the first aspects when executing the program; alternatively, a speech recognition method as in any one of the second aspects above is implemented.
In a sixth aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the model training method according to any of the first aspects above; alternatively, a speech recognition method as in any one of the second aspects above is implemented.
In a seventh aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the model training method according to any of the first aspects above; alternatively, a speech recognition method as in any one of the second aspects above is implemented.
According to the model training method, the voice recognition method, the device and the electronic equipment provided by the application, when the multi-modal voice recognition model is trained, a voice recognition model obtained through training based on a continuous integrate-and-fire (CIF) mechanism is acquired, and the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-modal voice recognition model are initialized respectively based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model; the initialized multi-modal voice recognition model is then trained based on the plurality of sample pairs to obtain a trained multi-modal voice recognition model. Multi-modal perception information is integrated, that is, the voice sample is fused with the visual image sample and the text sample corresponding to the voice sample, and the initialized multi-modal voice recognition model is trained jointly, so that the trained multi-modal voice recognition model introduces contextual visual knowledge and contextual language knowledge into multi-modal voice recognition when carrying out voice recognition, and the lip movement visual information and the voice content are no longer required to be strictly aligned in time, thereby effectively improving the voice recognition performance and expanding the boundary of multi-modal voice recognition.
Drawings
For a clearer description of the present application or of the prior art, the drawings that are used in the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description below are some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a model training method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech recognition model obtained based on CIF mechanism training according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an initialized multimodal speech recognition model according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a voice recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a model training device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application;
fig. 7 is a schematic entity structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone, where A and B may be singular or plural. In the text description of the present application, the character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The technical scheme provided by the embodiment of the application can be applied to many scenes such as video subtitle generation, video conference transcription and the like. Taking the video subtitle generation scene as an example, the relevant voice information can be recognized through a voice recognition model and converted into text information, so that the video subtitle is generated according to the converted text information.
In the prior art, when a voice recognition model is trained, voice information is mainly used as main mode information, lip movement visual information corresponding to the voice information is combined to be used as auxiliary information for training the voice recognition model, namely, a voice sample is used as main mode, and lip movement visual information corresponding to the voice sample is combined to train the voice recognition model. However, with the technical scheme, the lip movement visual information and the content of the voice sample are required to be strictly aligned in time, and many actual voice recognition scenes are difficult to ensure that the lip movement visual information and the content of the voice sample are strictly aligned in time, so that the voice recognition performance of the voice recognition model is poor.
Therefore, in order to train a speech recognition model with better speech recognition performance, in the present application, when training a multi-modal speech recognition model, speech samples are taken as the main modality, and the model parameters of the acoustic coding module and of the CIF module in a speech recognition model obtained through training based on a continuous integrate-and-fire (Continuous Integrate-and-Fire, CIF) mechanism are used to initialize the model parameters of the initial acoustic coding module and of the initial CIF module in the initial multi-modal speech recognition model, respectively, so as to obtain an initialized multi-modal speech recognition model; this process can be understood as a pre-training process of the multi-modal speech recognition model. On the basis of the initialized multi-modal speech recognition model obtained through pre-training, multi-modal perception information is integrated, that is, the speech sample is fused with the visual image sample and the text sample corresponding to the speech sample, and the initialized multi-modal speech recognition model is trained jointly; this process can be understood as hybrid fine-tuning of the initialized multi-modal speech recognition model with the multi-modal perception information, so that the trained multi-modal speech recognition model expands the boundary of multi-modal speech recognition when performing speech recognition and introduces contextual visual knowledge and contextual language knowledge into multi-modal speech recognition, thereby effectively improving speech recognition performance.
It can be appreciated that, in the embodiment of the present application, the purpose of the pre-training process and the hybrid fine-tuning process of the multi-modal speech recognition model is as follows: on one hand, the acoustic coding module is pre-trained on a general speech recognition data set used for training the speech recognition model, so as to provide stronger general acoustic modeling capability; on the other hand, in the hybrid fine-tuning stage, the initialized multi-modal speech recognition model is trained on the speech samples together with the visual image samples and text samples corresponding to the speech samples, contextual visual knowledge and contextual language knowledge are introduced into multi-modal speech recognition, the boundary of multi-modal speech recognition is expanded, and the trained multi-modal speech recognition model gains the capability of integrating contextual multi-modal knowledge.
The model training method provided in the present application will be described in detail by the following examples. It is to be understood that the following embodiments may be combined with each other and that some embodiments may not be repeated for the same or similar concepts or processes.
Example 1
Fig. 1 is a flow chart of a model training method provided in an embodiment of the present application, where the model training method may be performed by a software and/or hardware device. For example, referring to fig. 1, the model training method may include:
S101, acquiring a voice recognition model obtained through training based on a continuous integrate-and-fire (CIF) mechanism.
The speech recognition model obtained through training based on the continuous integrate-and-fire CIF mechanism is trained on a plurality of speech samples. For example, referring to fig. 2, fig. 2 is a schematic structural diagram of a speech recognition model obtained based on CIF mechanism training according to an embodiment of the present application, where the speech recognition model includes an acoustic encoder, a CIF module, and a decoder. The acoustic encoder mainly comprises a convolution front end and a Conformer module; the CIF module mainly comprises a one-dimensional convolution layer, a full connection layer and a sigmoid activation function which follows the full connection layer; the decoder mainly comprises a plurality of full connection layers and a Transformer module, and is an autoregressive decoder with future masks.
When training the speech recognition model shown in fig. 2 on a plurality of speech samples, for each speech sample, the acoustic characterization feature sequence of the speech sample is first input to the convolution front end in the acoustic encoder; the convolution front end downsamples the acoustic feature sequence by a factor of 2, the downsampled output of the convolution front end is taken as the input of the Conformer module, and the Conformer module further downsamples this input by a factor of 4 through two max pooling layers, so as to obtain and output the low-level acoustic representation sequence of the speech sample.
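Illustratively, a minimal sketch of such an acoustic encoder is given below; the feature dimension, the kernel sizes, and the use of standard Transformer encoder layers as a stand-in for the Conformer blocks are assumptions made for the example only.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Illustrative acoustic encoder: a convolutional front end that downsamples the
    acoustic feature sequence by 2x, followed by Conformer-style blocks interleaved
    with two max-pooling layers for a further 4x downsampling (8x in total)."""
    def __init__(self, feat_dim: int = 80, model_dim: int = 256):
        super().__init__()
        # Convolutional front end: stride 2 -> 2x temporal downsampling.
        self.front_end = nn.Conv1d(feat_dim, model_dim, kernel_size=3, stride=2, padding=1)
        # Stand-ins for the Conformer blocks (assumption for the sketch).
        self.blocks_a = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.pool_a = nn.MaxPool1d(kernel_size=2)
        self.blocks_b = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4, batch_first=True)
        self.pool_b = nn.MaxPool1d(kernel_size=2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) -> low-level acoustic representation: (batch, ~T/8, model_dim)
        h = self.front_end(feats.transpose(1, 2)).transpose(1, 2)
        h = self.blocks_a(h)
        h = self.pool_a(h.transpose(1, 2)).transpose(1, 2)
        h = self.blocks_b(h)
        h = self.pool_b(h.transpose(1, 2)).transpose(1, 2)
        return h
```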
The low-level acoustic representation sequence output by the acoustic encoder is input to the one-dimensional convolution layer in the CIF module; the one-dimensional convolution layer extracts features from the low-level acoustic representation sequence and outputs a corresponding feature sequence, the feature sequence is mapped to a feature sequence of dimension 1, and the feature sequence of dimension 1 is passed through the activation function to obtain the corresponding predicted weight sequence, in which each weight is a value in [0, 1] representing the weight corresponding to the i-th moment. The predicted weight sequence is then processed by scaling (Scaling) and integrate-and-fire (Integrate & Fire) to obtain the high-level acoustic representation sequence of the speech sample.
In the process in which the CIF module converts the low-level acoustic representation sequence into the high-level acoustic representation sequence, the predicted weights are accumulated step by step; when the accumulated weight exceeds the threshold value, an acoustic boundary between adjacent symbols is fired.
Illustratively, in the present embodiment, the weight at the firing moment is divided into two parts: the first part is used to complete the weight accumulation of the symbol before the boundary, and when the accumulated weight reaches the threshold value, the CIF module summarizes the corresponding acoustic information; the second part is accumulated for the symbol after the boundary at the firing moment. Repeating this integrate-and-fire process finally yields and outputs the high-level acoustic representation sequence of the speech sample.
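Illustratively, a minimal sketch of the weight prediction and integrate-and-fire procedure described above is given below; the layer dimensions, the threshold value of 1.0, and the frame-by-frame loop (written for readability rather than efficiency, with tail handling omitted) are assumptions made for the example only.

```python
import torch
import torch.nn as nn

class CIFModule(nn.Module):
    """Illustrative continuous integrate-and-fire (CIF) module: a 1-D convolution and a
    fully connected layer with sigmoid predict a weight in [0, 1] per frame; weights are
    accumulated, and a high-level acoustic vector is fired whenever the accumulator
    crosses the threshold."""
    def __init__(self, dim: int = 256, beta: float = 1.0):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.weight_proj = nn.Linear(dim, 1)
        self.beta = beta  # firing threshold (assumption: 1.0)

    def forward(self, low_level: torch.Tensor):
        # low_level: (batch, U, dim) -- the low-level acoustic representation sequence.
        feat = self.conv(low_level.transpose(1, 2)).transpose(1, 2)
        alphas = torch.sigmoid(self.weight_proj(feat)).squeeze(-1)  # predicted weight sequence, (batch, U)

        fired = []
        for b in range(low_level.size(0)):
            acc_w, acc_state, outputs = 0.0, low_level.new_zeros(low_level.size(-1)), []
            for i in range(low_level.size(1)):
                w = alphas[b, i]
                if acc_w + w < self.beta:
                    # keep integrating the current symbol
                    acc_w = acc_w + w
                    acc_state = acc_state + w * low_level[b, i]
                else:
                    # fire: split the weight at the acoustic boundary
                    first = self.beta - acc_w              # part that completes the current symbol
                    outputs.append(acc_state + first * low_level[b, i])
                    remainder = w - first                  # part carried over to the next symbol
                    acc_w = remainder
                    acc_state = remainder * low_level[b, i]
            # (handling of the residual accumulation at the end of the utterance is omitted)
            fired.append(torch.stack(outputs) if outputs else low_level.new_zeros(0, low_level.size(-1)))
        return alphas, fired  # predicted weights and high-level acoustic representation vectors
```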
At each decoding moment, the predicted text at the previous moment and the high-level acoustic representation at the previous moment are fused and mapped to the dimension matched with the Transformer module; the mapped information is input to the Transformer module for further encoding, and the encoded representation is fused and mapped with the high-level acoustic representation at the current moment, so as to output the predicted text sequence corresponding to the speech sample and the probability value corresponding to each predicted text in the sequence.
In combination with the above description, the predicted text sequence corresponding to each of the plurality of speech samples used for training, and the probability value corresponding to the predicted text sequence, can be obtained. For each speech sample, a connectionist temporal classification (CTC) loss function corresponding to the speech sample is constructed according to the label text sequence and the low-level acoustic representation sequence corresponding to the speech sample; a quantity loss function corresponding to the speech sample is constructed according to the weight sequence label and the predicted weight sequence corresponding to the speech sample; and a cross entropy loss function corresponding to the speech sample is constructed according to the probability value of the label text sequence corresponding to the speech sample and the probability value of the predicted text sequence. The model is then trained according to the connectionist temporal classification loss function, the quantity loss function and the cross entropy loss function corresponding to each speech sample, so as to obtain the speech recognition model trained based on the continuous integrate-and-fire CIF mechanism.
After the speech recognition model trained based on the continuous integrate-and-fire CIF mechanism is obtained, the model parameters of the acoustic coding module and the model parameters of the CIF module in the speech recognition model may be retained, and the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-modal speech recognition model are initialized based on them respectively, that is, the following S102 is executed:
S102, respectively initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model.
It should be noted that, in the embodiment of the present application, the process of initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-modal speech recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the speech recognition model may be understood as a pre-training process of the multi-modal speech recognition model; after that, the multi-modal sensing information, that is, the voice sample, and the visual image sample and the text sample corresponding to the voice sample can be integrated on the basis of the initialized multi-modal voice recognition model, so that the initialized multi-modal voice recognition model is trained together, that is, the following steps S103 and S104 are executed, and the process can be understood as a process of performing mixed fine tuning on the initialized multi-modal voice recognition model by adopting the multi-modal sensing information, so that the general voice recognition capability of the trained multi-modal voice recognition model can be better improved.
It can be understood that when the initial multi-modal speech recognition model is initialized, other parameters are also initialized in addition to the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module, which will not be described in detail in the embodiments of the present application.
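Illustratively, a minimal sketch of this initialization step is given below; the attribute names and the checkpoint layout are assumptions made for the example only.

```python
import torch

def init_from_pretrained(multimodal_model, pretrained_ckpt_path: str):
    """Copy the acoustic coding module and CIF module parameters of a CIF-based speech
    recognition model into the corresponding modules of the initial multi-modal model
    (step S102); module and checkpoint key names are assumptions."""
    state = torch.load(pretrained_ckpt_path, map_location="cpu")
    multimodal_model.acoustic_encoder.load_state_dict(state["acoustic_encoder"])
    multimodal_model.cif.load_state_dict(state["cif"])
    # The remaining modules (multi-modal sensing module, decoding module) keep their own
    # initialization and are trained in the hybrid fine-tuning stage.
    return multimodal_model
```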
S103, a plurality of sample pairs are obtained, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample.
The voice sample is a main basis for voice recognition, and basic pronunciation information is provided, so that an audio mode in the voice sample is an input main mode of the model; the visual image sample and the text sample corresponding to the voice sample correspond to a visual mode and a language mode respectively, and multi-mode information consisting of an audio mode, the visual mode and the language mode is used for realizing cross-mode knowledge fusion, so that more semantically related information can be captured at a model level.
For example, when multiple sample pairs are acquired, multiple sample pairs sent by other electronic devices may be received; multiple pairs of samples may also be looked up from local storage; alternatively, a plurality of sample pairs may be obtained from a third party database, and may be specifically set according to actual needs, where the embodiment of the present application only uses these three ways to obtain a plurality of sample pairs as an example, but the embodiment of the present application is not limited thereto.
After the plurality of sample pairs are acquired, the initialized multimodal speech recognition model may be trained based on the plurality of sample pairs, i.e. the following S104 is performed:
S104, training the initialized multi-modal speech recognition model based on the plurality of sample pairs to obtain a trained multi-modal speech recognition model.
For example, referring to fig. 3, fig. 3 is a schematic structural diagram of an initialized multi-modal speech recognition model according to an embodiment of the present application. It differs from the speech recognition model based on the continuous integrate-and-fire CIF mechanism shown in fig. 2 in that a multi-modal sensing module is added, and its decoding module differs from the decoding module of the CIF-based speech recognition model. That is, in the embodiment of the present application, as shown in fig. 3, the initialized multi-modal speech recognition model includes a multi-modal sensing module, an acoustic encoding module, a CIF module, and a decoding module.
For example, the multi-modal sensing module may employ modality-independent encoding, i.e., it includes a text encoder and a visual encoder. Illustratively, in the embodiments of the present application, BERT may be used as the text encoder and Vision Transformer as the visual encoder, although other text encoders and visual encoders, or a joint visual-text encoder, may also be used. The specific choice may be made according to actual needs and is not limited in the embodiments of the present application.
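Illustratively, a minimal sketch of such a modality-independent multi-modal sensing module, instantiated with the Hugging Face Transformers library, is given below; the specific pretrained checkpoints are assumptions made for the example only, and any other text encoder, visual encoder or joint visual-text encoder may be substituted.

```python
import torch
from transformers import BertModel, ViTModel

class MultiModalSensingModule(torch.nn.Module):
    """Modality-independent encoders: BERT for the contextual text, a Vision Transformer
    for the contextual visual image (checkpoint names are assumptions)."""
    def __init__(self):
        super().__init__()
        self.text_encoder = BertModel.from_pretrained("bert-base-chinese")
        self.visual_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

    def forward(self, text_inputs, pixel_values):
        # text_inputs: tokenizer output (input_ids, attention_mask, ...)
        # pixel_values: preprocessed image tensor of shape (batch, 3, 224, 224)
        text_seq = self.text_encoder(**text_inputs).last_hidden_state                   # text feature vector sequence
        visual_seq = self.visual_encoder(pixel_values=pixel_values).last_hidden_state   # visual feature vector sequence
        return visual_seq, text_seq
```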
When the initialized multi-modal speech recognition model is trained based on the plurality of sample pairs, for each sample pair, the acoustic characterization sequence corresponding to the voice sample in the sample pair may be input into the acoustic coding module to obtain a first voice feature vector sequence corresponding to the voice sample; the first voice feature vector sequence is input into the CIF module, which determines a prediction weight sequence corresponding to the voice sample and determines a second voice feature vector sequence corresponding to the voice sample based on the prediction weight sequence. Meanwhile, the visual image sample in the sample pair is input into the visual image encoder in the multi-modal sensing module to obtain a visual feature vector sequence corresponding to the visual image sample, and the text sample in the sample pair is input into the text encoder in the multi-modal sensing module to obtain a text feature vector sequence corresponding to the text sample. It is understood that the second voice feature vector sequence can be understood as a sequence of a plurality of second voice feature vectors. Then, at each decoding moment, the predicted text representation vector at the previous moment, the second voice feature vector corresponding to the previous moment output by the CIF module, the second voice feature vector corresponding to the current moment output by the CIF module, the visual feature vector sequence and the text feature vector sequence are input into the decoding module to obtain the probability value of the predicted text of the voice sample at the current moment. Finally, the initialized multi-modal speech recognition model is trained according to the label text sequence, the first voice feature vector sequence, the weight sequence label, the predicted weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence corresponding to each voice sample.
Illustratively, in the present embodiment, the acoustic encoder mainly includes a convolution front end and a Conformer module. When the first voice feature vector sequence corresponding to the voice sample is obtained through the acoustic encoder, the acoustic characterization feature sequence of the voice sample in the sample pair is first input to the convolution front end in the acoustic encoder; the convolution front end downsamples it by a factor of 2, the downsampled output of the convolution front end is taken as the input of the Conformer module, and the Conformer module further downsamples this input by a factor of 4 through two max pooling layers to obtain the low-level acoustic representation sequence of the voice sample, which is the first voice feature vector sequence corresponding to the voice sample.
Illustratively, in the embodiment of the present application, the CIF module mainly includes a one-dimensional convolution layer, a full connection layer and an immediately following sigmoid activation function. When the predicted weight sequence corresponding to the voice sample is obtained through the CIF module, and the second voice feature vector sequence corresponding to the voice sample is determined based on the predicted weight sequence, the low-level acoustic representation sequence output by the acoustic encoder is first input to the one-dimensional convolution layer in the CIF module; the one-dimensional convolution layer further extracts features from the low-level acoustic representation sequence and outputs a corresponding feature sequence, the feature sequence is mapped to a feature sequence of dimension 1, and the feature sequence of dimension 1 is passed through the sigmoid activation function to obtain the corresponding predicted weight sequence, in which each weight is a value in [0, 1] representing the weight corresponding to the i-th moment. The predicted weight sequence is then processed by scaling (Scaling) and integrate-and-fire (Integrate & Fire) to obtain the high-level acoustic representation sequence of the voice sample, which is the second voice feature vector sequence corresponding to the voice sample.
It should be noted that, in the embodiment of the present application, the implementation of the CIF module here is similar to that of the CIF module in S101; reference may be made to the related description of the CIF module in S101, and details are not repeated here.
Illustratively, in the embodiment of the present application, the decoding module mainly includes a feature fusion layer and a post-processing module connected in series. When the probability value of the predicted text corresponding to the voice sample is obtained through the decoding module, the feature vector obtained by fusing the predicted text characterization vector at the previous moment with the second voice feature vector corresponding to the previous moment (i.e., the high-level acoustic characterization at the previous moment), together with the visual feature vector sequence and the text feature vector sequence, can first be input into the feature fusion layer for fusion to obtain a target fusion feature vector; the target fusion feature vector and the second voice feature vector corresponding to the current moment (i.e., the high-level acoustic characterization at the current moment) are then input into the post-processing module to obtain the probability value of the predicted text at the current moment.
Illustratively, in one possible scenario, the feature fusion layer includes an acoustic language fusion layer, a visual fusion layer and a language fusion layer connected in series in sequence. In this scenario, the feature vector obtained by fusing the predicted text characterization vector at the previous moment with the second voice feature vector corresponding to the previous moment (the high-level acoustic characterization at the previous moment) can first be input into the acoustic language fusion layer to obtain a first fusion feature vector; the visual feature vector sequence and the first fusion feature vector are then input into the visual fusion layer for fusion to obtain a second fusion feature vector; and finally the text feature vector sequence and the second fusion feature vector are input into the language fusion layer for fusion to obtain the target fusion feature vector.
Illustratively, in another possible scenario, the feature fusion layer includes an acoustic language fusion layer, a language fusion layer and a visual fusion layer connected in series in sequence. In this scenario, the feature vector obtained by fusing the predicted text characterization vector at the previous moment with the second voice feature vector corresponding to the previous moment (the high-level acoustic characterization at the previous moment) can first be input into the acoustic language fusion layer to obtain a first fusion feature vector; the text feature vector sequence and the first fusion feature vector are then input into the language fusion layer for fusion to obtain a third fusion feature vector; and finally the visual feature vector sequence and the third fusion feature vector are input into the visual fusion layer for fusion to obtain the target fusion feature vector.
The visual fusion layer integrates the contextual visual knowledge through a cross-attention mechanism (Cross-Attention), and the language fusion layer likewise integrates the contextual language knowledge through a cross-attention mechanism, as illustrated by the sketch below.
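A minimal sketch of the feature fusion layer under the first ordering (acoustic language fusion, then visual fusion, then language fusion) is given below; the second ordering simply swaps the two cross-attention calls. The linear acoustic-language fusion and all dimensions are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Acoustic-language fusion, then cross-attention over the visual feature vector
    sequence, then cross-attention over the text feature vector sequence."""
    def __init__(self, dim=256, nhead=4):
        super().__init__()
        self.acoustic_language = nn.Linear(2 * dim, dim)   # fuses previous text and acoustic vectors
        self.visual_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.language_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, prev_text_vec, prev_acoustic_vec, visual_seq, text_seq):
        # prev_text_vec, prev_acoustic_vec: (batch, dim); visual_seq, text_seq: (batch, len, dim)
        q = self.acoustic_language(torch.cat([prev_text_vec, prev_acoustic_vec], dim=-1))
        q = q.unsqueeze(1)                                  # (batch, 1, dim) query
        q, _ = self.visual_attn(q, visual_seq, visual_seq)  # integrate contextual visual knowledge
        q, _ = self.language_attn(q, text_seq, text_seq)    # integrate contextual language knowledge
        return q.squeeze(1)                                 # target fusion feature vector
```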
Illustratively, in an embodiment of the present application, referring to FIG. 3, the post-processing module mainly includes a fully connected (FC) layer and a softmax layer. After the target fusion feature vector is obtained through the feature fusion layer, the target fusion feature vector is fused with the second voice feature vector corresponding to the current moment, i.e., the above-described high-level acoustic characterization at the current moment; the FC layer outputs the predicted text corresponding to the voice sample, and the softmax layer outputs the probability value corresponding to the predicted text.
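A corresponding sketch of the post-processing module is shown below; concatenating the target fusion feature vector with the current high-level acoustic vector before the FC layer, as well as the vocabulary size, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PostProcessor(nn.Module):
    """Fully connected layer plus softmax over the vocabulary."""
    def __init__(self, dim=256, vocab_size=5000):
        super().__init__()
        self.fc = nn.Linear(2 * dim, vocab_size)

    def forward(self, fusion_vec, cur_acoustic_vec):        # both: (batch, dim)
        logits = self.fc(torch.cat([fusion_vec, cur_acoustic_vec], dim=-1))
        return torch.softmax(logits, dim=-1)                # probability values of the predicted text
```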
By combining the above description, the label text sequence, the first voice feature vector sequence, the prediction weight sequence and the probability value of the predicted text corresponding to each voice sample are obtained. The initialized multi-modal voice recognition model can then be trained according to the label text sequence, the first voice feature vector sequence, the weight sequence label, the prediction weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence corresponding to each voice sample.
For example, when training the initialized multi-modal voice recognition model according to the label text sequence, the first voice feature vector sequence, the weight sequence label, the prediction weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence corresponding to each voice sample, a connection time sequence classification (CTC, connectionist temporal classification) loss function corresponding to the voice sample can be constructed according to the label text sequence and the first voice feature vector sequence corresponding to the voice sample; a quantity loss function corresponding to the voice sample is constructed according to the weight sequence label and the prediction weight sequence corresponding to the voice sample; a cross-entropy loss function corresponding to the voice sample is constructed according to the probability value of the label text sequence and the probability value of the predicted text sequence corresponding to the voice sample; and the initialized multi-modal voice recognition model is trained according to the connection time sequence classification loss function, the quantity loss function and the cross-entropy loss function corresponding to each voice sample.
For example, when training the initialized multi-modal voice recognition model according to the connection time sequence classification loss function, the quantity loss function and the cross-entropy loss function corresponding to each voice sample, the three loss functions corresponding to each voice sample can first be weighted to obtain a target loss function corresponding to the voice sample, as sketched below; the initialized multi-modal voice recognition model is then trained according to the target loss function corresponding to each voice sample until the trained multi-modal voice recognition model meets a preset condition, and the multi-modal voice recognition model meeting the preset condition is determined as the trained multi-modal voice recognition model.
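The weighted target loss can be sketched as follows, using PyTorch's built-in CTC loss. The weighting coefficients, the absolute-difference form of the quantity loss and the omission of padding handling are assumptions made for illustration; the application does not specify these values here.

```python
import torch
import torch.nn.functional as F

def target_loss(log_probs, labels, input_lens, label_lens, alpha, pred_probs,
                w_ctc=0.5, w_qua=1.0, w_ce=1.0):
    """Weighted sum of the connection time sequence classification (CTC) loss,
    the quantity loss and the cross-entropy loss."""
    # CTC loss between the first voice feature vector sequence and the label text sequence;
    # log_probs: (time, batch, vocab), labels: (batch, max_label_len)
    ctc = F.ctc_loss(log_probs, labels, input_lens, label_lens)
    # quantity loss: the predicted weights should sum to the number of label tokens
    qua = torch.abs(alpha.sum(dim=-1) - label_lens.float()).mean()
    # cross-entropy between the predicted text probabilities and the label text sequence;
    # pred_probs: (batch, max_label_len, vocab)
    ce = F.nll_loss(torch.log(pred_probs + 1e-8).transpose(1, 2), labels)
    return w_ctc * ctc + w_qua * qua + w_ce * ce
```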
Illustratively, the preset condition includes that the number of training iterations reaches a preset number, and/or that the trained multi-modal voice recognition model converges.
It can be seen that, in the embodiment of the present application, when the multi-modal voice recognition model is trained, a voice recognition model trained based on the continuous integrate-and-fire (CIF) mechanism is first obtained, and the model parameters of the initial acoustic coding module and of the initial CIF module in the initial multi-modal voice recognition model are respectively initialized based on the model parameters of the acoustic coding module and of the CIF module in that voice recognition model; the initialized multi-modal voice recognition model is then trained based on the plurality of sample pairs to obtain the trained multi-modal voice recognition model. Because multi-modal perception information is integrated, that is, the voice sample and the visual image sample and text sample corresponding to the voice sample are fused and used together to train the initialized multi-modal voice recognition model, the trained multi-modal voice recognition model introduces contextual visual knowledge and contextual language knowledge into multi-modal voice recognition when performing voice recognition, without requiring lip-movement visual information strictly aligned in time with the voice content, thereby effectively improving voice recognition performance and expanding the boundary of multi-modal voice recognition.
Example two
Fig. 4 is a flowchart of a voice recognition method according to an embodiment of the present application, where the voice recognition method may be performed by software and/or hardware. For example, referring to fig. 4, the voice recognition method may include:
s401, acquiring a voice to be recognized, and a visual image and a text corresponding to the voice to be recognized.
S402, inputting the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain a predicted text corresponding to the voice to be recognized and a probability value of the predicted text.
The multi-modal voice recognition model is the multi-modal voice recognition model trained in the foregoing embodiment.
The multi-modal voice recognition model includes a multi-modal sensing module, an acoustic encoding module, a continuous integration issuing CIF module and a decoding module. Inputting the voice to be recognized, the visual image and the text into the multi-modal voice recognition model to obtain the predicted text corresponding to the voice to be recognized and the probability value of the predicted text includes:
inputting the acoustic characterization sequence corresponding to the voice to be recognized into an acoustic encoding module to obtain a third voice feature vector sequence corresponding to the voice to be recognized, inputting the third voice feature vector sequence into a CIF module, determining a predicted weight sequence corresponding to the voice to be recognized through the CIF module, and determining a fourth voice feature vector sequence corresponding to the voice to be recognized based on the predicted weight sequence.
And inputting the visual image into a visual image encoder in the multi-modal sensing module to obtain a visual feature vector sequence corresponding to the visual image.
Inputting the text into a text encoder in the multi-modal sensing module to obtain a text feature vector sequence corresponding to the text.
And at each decoding moment, inputting the predicted text characterization vector at the previous moment, the fourth voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into a decoding module to obtain the predicted text and the probability value of the predicted text of the voice to be recognized at the current moment.
The decoding module includes a feature fusion layer and a post-processing module connected in series. Inputting the predicted text characterization vector at the previous moment, the fourth voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain the predicted text of the voice to be recognized at the current moment and the probability value of the predicted text includes:
and inputting the feature vector, the visual feature vector sequence and the text feature vector sequence obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into a feature fusion layer for fusion to obtain a fusion feature vector.
And inputting the fusion feature vector and the fourth voice feature vector into a post-processing module to obtain the predicted text and the probability value of the predicted text.
The feature fusion layer includes an acoustic language fusion layer, a visual fusion layer and a language fusion layer connected in series in sequence. Inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer to obtain the fusion feature vector includes:
and inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into an acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the visual feature vector sequence and the fourth fusion feature vector into a visual fusion layer for fusion to obtain a fifth fusion feature vector.
And inputting the text feature vector sequence and the fifth fusion feature vector into a language fusion layer for fusion to obtain the fusion feature vector.
The feature fusion layer includes an acoustic language fusion layer, a language fusion layer and a visual fusion layer connected in series in sequence. Inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer to obtain the fusion feature vector includes:
And inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into an acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the text feature vector sequence and the fourth fusion feature vector into a language fusion layer for fusion to obtain a sixth fusion feature vector.
And inputting the visual feature vector sequence and the sixth fusion feature vector into a visual fusion layer for fusion to obtain the fusion feature vector.
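Putting the pieces together, a greedy decoding sketch for inference is shown below. It reuses the illustrative modules sketched in the training embodiment above, assumes a hypothetical token embedding table embed, and emits one token per high-level acoustic vector fired by the CIF module; it is not the exact decoding procedure of this application.

```python
import torch

def recognize(encoder, cif, fusion, post, embed, feats, visual_seq, text_seq, bos_id=1):
    """Greedy decoding: one decoding step per high-level acoustic vector fired by CIF."""
    H = encoder(feats)                            # third voice feature vector sequence
    alpha, C = cif(H.squeeze(0))                  # fourth voice feature vector sequence
    prev_acoustic = torch.zeros(1, C.size(-1))    # no acoustic vector before the first step
    prev_text = embed(torch.tensor([bos_id]))     # characterization vector of the start token
    tokens, probs = [], []
    for t in range(C.size(0)):
        cur_acoustic = C[t].unsqueeze(0)
        fused = fusion(prev_text, prev_acoustic, visual_seq, text_seq)
        p = post(fused, cur_acoustic)             # probability value of the predicted text
        token = p.argmax(dim=-1)
        tokens.append(token.item())
        probs.append(p.max().item())
        prev_text, prev_acoustic = embed(token), cur_acoustic
    return tokens, probs
```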
It should be noted that, in the embodiment of the present application, the specific implementation of the speech recognition method is similar to the specific implementation of the multi-modal speech recognition model training method in the embodiment shown in fig. 1, and reference may be made to the specific implementation of the multi-modal speech recognition model training method in the embodiment shown in fig. 1, and here, the embodiment of the present application will not be repeated.
It can be seen that, in the embodiment of the present application, when performing speech recognition, the speech to be recognized, the visual image and the text corresponding to the speech to be recognized are obtained; and inputting the voice to be recognized, the visual image and the text into the multi-modal voice recognition model to obtain the predicted text corresponding to the voice to be recognized and the probability value of the predicted text, so that when the multi-modal voice recognition model is used for voice recognition, the situation visual knowledge and the situation language knowledge are introduced into the multi-modal voice recognition, thereby effectively improving the voice recognition performance and expanding the boundary of the multi-modal voice recognition.
Fig. 5 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application, for example, please refer to fig. 5, the model training apparatus 50 may include:
the first obtaining unit 501 is configured to obtain a speech recognition model obtained by training based on a continuous integration and issuance CIF mechanism.
The first processing unit 502 is configured to initialize the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-modal speech recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the speech recognition model, respectively, to obtain an initialized multi-modal speech recognition model.
The second obtaining unit 503 is configured to obtain a plurality of sample pairs, where each sample pair includes a voice sample, a visual image sample corresponding to the voice sample, and a text sample.
The second processing unit 504 is configured to train the initialized multimodal speech recognition model based on the plurality of samples, so as to obtain a trained multimodal speech recognition model.
Optionally, the initialized multi-modal speech recognition model includes a multi-modal sensing module, an acoustic encoding module, a CIF module, and a decoding module, and the second processing unit 504 is specifically configured to:
The following is performed for each sample pair:
inputting an acoustic characterization sequence corresponding to a voice sample in the sample pair into an acoustic coding module to obtain a first voice feature vector sequence corresponding to the voice sample, inputting the first voice feature vector sequence into a CIF module, determining a prediction weight sequence corresponding to the voice sample through the CIF module, and determining a second voice feature vector sequence corresponding to the voice sample based on the prediction weight sequence.
And inputting the visual image samples in the sample pair into a visual image encoder in the multi-modal sensing module to obtain a visual feature vector sequence corresponding to the visual image samples.
And inputting the text samples in the sample pairs into a text encoder in the multi-modal sensing module to obtain a text feature vector sequence corresponding to the text samples.
And at each decoding moment, inputting a predicted text representation vector at the previous moment, a second voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain a probability value of the predicted text of the voice sample at the current moment.
Training the initialized multi-modal speech recognition model according to the label text sequence, the first speech feature vector sequence, the weight sequence label, the predicted weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence corresponding to each speech sample.
Optionally, the decoding module includes a feature fusion layer and a post-processing module connected in series, and the second processing unit 504 is specifically configured to:
and inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the second voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer for fusion, so as to obtain a target fusion feature vector.
And inputting the target fusion feature vector and the second voice feature vector corresponding to the current moment into the post-processing module to obtain the probability value of the predicted text.
Optionally, the feature fusion layer includes an acoustic language fusion layer, a visual fusion layer, and a language fusion layer sequentially connected in series, and the second processing unit 504 is specifically configured to:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a first fused feature vector.
And inputting the visual feature vector sequence and the first fusion feature vector into a visual fusion layer for fusion to obtain a second fusion feature vector.
Inputting the text feature vector sequence and the second fusion feature vector into a language fusion layer for fusion, and obtaining a target fusion feature vector.
Optionally, the feature fusion layer includes an acoustic language fusion layer, a language fusion layer, and a visual fusion layer sequentially connected in series, and the second processing unit 504 is specifically configured to:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a first fused feature vector.
And inputting the text feature vector sequence and the first fusion feature vector into a language fusion layer for fusion to obtain a third fusion feature vector.
And inputting the visual feature vector sequence and the third fusion feature vector into a visual fusion layer for fusion to obtain a target fusion feature vector.
Optionally, the second processing unit 504 is specifically configured to:
constructing a connection time sequence classification loss function corresponding to the voice sample according to the label text sequence and the first voice feature vector sequence corresponding to the voice sample aiming at each voice sample; constructing a quantity loss function corresponding to the voice sample according to the weight sequence label and the predicted weight sequence corresponding to the voice sample; and constructing a cross entropy loss function corresponding to the voice sample according to the probability value of the label text sequence corresponding to the voice sample and the probability value of the predicted text sequence.
And training the initialized multi-modal voice recognition model according to the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to each voice sample.
Optionally, the second processing unit 504 is specifically configured to:
and for each voice sample, carrying out weighting treatment on the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to the voice sample to obtain the target loss function corresponding to the voice sample.
And training the initialized multi-modal voice recognition model according to the target loss function corresponding to each voice sample.
The model training device 50 provided in this embodiment may execute the technical scheme of the model training method in any of the above embodiments, and the implementation principle and beneficial effects of the model training device are similar to those of the model training method, and may refer to the implementation principle and beneficial effects of the model training method, which are not described herein.
Fig. 6 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application, for example, referring to fig. 6, the voice recognition device 60 may include:
the third obtaining unit 601 is configured to obtain a voice to be recognized, a visual image and a text corresponding to the voice to be recognized.
The third processing unit 602 is configured to input the speech to be recognized, the visual image and the text into a multi-modal speech recognition model, to obtain a predicted text corresponding to the speech to be recognized and a probability value of the predicted text, where the multi-modal speech recognition model is a multi-modal speech recognition model obtained by model training.
Optionally, the multi-modal speech recognition model includes a multi-modal sensing module, an acoustic encoding module, a continuous integration issuing CIF module, and a decoding module, and the third processing unit 602 is specifically configured to:
inputting the acoustic characterization sequence corresponding to the voice to be recognized into an acoustic encoding module to obtain a third voice feature vector sequence corresponding to the voice to be recognized, inputting the third voice feature vector sequence into a CIF module, determining a predicted weight sequence corresponding to the voice to be recognized through the CIF module, and determining a fourth voice feature vector sequence corresponding to the voice to be recognized based on the predicted weight sequence.
And inputting the visual image into a visual image encoder in the multi-modal sensing module to obtain a visual feature vector sequence corresponding to the visual image.
Inputting the text into a text encoder in the multi-modal sensing module to obtain a text feature vector sequence corresponding to the text.
And at each decoding moment, inputting a predicted text representation vector at the previous moment, a fourth voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain probability values of the predicted text and the predicted text of the voice to be recognized at the current moment.
Optionally, the decoding module includes a feature fusion layer and a post-processing module connected in series, and the third processing unit 602 is specifically configured to:
and inputting the feature vector obtained by fusing the predicted text characterization vector at the previous moment and the fourth voice feature vector corresponding to the previous moment, the visual feature vector sequence and the text feature vector sequence into the feature fusion layer for fusion to obtain a fusion feature vector.
And inputting the fusion feature vector and the fourth voice feature vector corresponding to the current moment into a post-processing module to obtain the predicted text and the probability value of the predicted text.
Optionally, the feature fusion layer includes an acoustic language fusion layer, a visual fusion layer, and a language fusion layer sequentially connected in series, and the third processing unit 602 is specifically configured to:
And inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the visual feature vector sequence and the fourth fusion feature vector into a visual fusion layer for fusion to obtain a fifth fusion feature vector.
And inputting the text feature vector sequence and the fifth fusion feature vector into a language fusion layer for fusion to obtain the fusion feature vector.
Optionally, the feature fusion layer includes an acoustic language fusion layer, a language fusion layer, and a visual fusion layer sequentially connected in series, and the third processing unit 602 is specifically configured to:
and inputting the feature vector obtained by fusing the predicted text representation vector at the previous moment and the fourth voice feature vector corresponding to the previous moment into the acoustic language fusion layer to obtain a fourth fused feature vector.
And inputting the text feature vector sequence and the fourth fusion feature vector into a language fusion layer for fusion to obtain a sixth fusion feature vector.
And inputting the visual feature vector sequence and the sixth fusion feature vector into a visual fusion layer for fusion to obtain the fusion feature vector.
The voice recognition device 60 provided in this embodiment may execute the technical scheme of the voice recognition method in any of the above embodiments, and the implementation principle and beneficial effects of the voice recognition device are similar to those of the voice recognition method, and reference may be made to the implementation principle and beneficial effects of the voice recognition method, which are not described herein.
Fig. 7 is a schematic physical structure diagram of an electronic device according to an embodiment of the present application. As shown in fig. 7, the electronic device may include: a processor 710, a communication interface (Communications Interface) 720, a memory 730 and a communication bus 740, wherein the processor 710, the communication interface 720 and the memory 730 communicate with each other via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform the model training method or the voice recognition method described above.
The model training method comprises the following steps: acquiring a voice recognition model obtained based on continuous integration and release CIF mechanism training; respectively initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model; acquiring a plurality of sample pairs, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample; training the initialized multi-modal speech recognition model based on the plurality of samples to obtain a trained multi-modal speech recognition model.
The voice recognition method comprises the following steps: the method comprises the steps of obtaining voice to be recognized, and a visual image and a text corresponding to the voice to be recognized; inputting the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain the predicted text corresponding to the voice to be recognized and the probability value of the predicted text, wherein the multi-modal voice recognition model is the multi-modal voice recognition model obtained by training any one of the above embodiments.
Further, the logic instructions in the memory 730 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application may be embodied essentially, or in the part contributing to the prior art, or in part, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present application also provides a computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the model training method, or the speech recognition method, of the methods described above.
The model training method comprises the following steps: acquiring a voice recognition model obtained based on continuous integration and release CIF mechanism training; respectively initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model; acquiring a plurality of sample pairs, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample; training the initialized multi-modal speech recognition model based on the plurality of samples to obtain a trained multi-modal speech recognition model.
The voice recognition method comprises the following steps: the method comprises the steps of obtaining voice to be recognized, and a visual image and a text corresponding to the voice to be recognized; inputting the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain the predicted text corresponding to the voice to be recognized and the probability value of the predicted text, wherein the multi-modal voice recognition model is the multi-modal voice recognition model obtained by training any one of the above embodiments.
In yet another aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the model training method or the speech recognition method described above.
The model training method comprises the following steps: acquiring a voice recognition model obtained based on continuous integration and release CIF mechanism training; respectively initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model; acquiring a plurality of sample pairs, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample; training the initialized multi-modal speech recognition model based on the plurality of samples to obtain a trained multi-modal speech recognition model.
The voice recognition method comprises the following steps: the method comprises the steps of obtaining voice to be recognized, and a visual image and a text corresponding to the voice to be recognized; inputting the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain the predicted text corresponding to the voice to be recognized and the probability value of the predicted text, wherein the multi-modal voice recognition model is the multi-modal voice recognition model obtained by training any one of the above embodiments.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (9)

1. A method of model training, comprising:
acquiring a voice recognition model obtained based on continuous integration and release CIF mechanism training;
respectively initializing the model parameters of the initial acoustic coding module and the model parameters of the initial CIF module in the initial multi-mode voice recognition model based on the model parameters of the acoustic coding module and the model parameters of the CIF module in the voice recognition model to obtain an initialized multi-mode voice recognition model;
acquiring a plurality of sample pairs, wherein each sample pair comprises a voice sample, a visual image sample and a text sample corresponding to the voice sample;
training the initialized multi-modal speech recognition model based on the plurality of samples to obtain a trained multi-modal speech recognition model;
The initialized multi-modal speech recognition model comprises a multi-modal sensing module, an acoustic coding module, a CIF module and a decoding module, and the training of the initialized multi-modal speech recognition model based on the plurality of samples comprises the following steps:
for each of the sample pairs, the following is performed:
inputting an acoustic characterization sequence corresponding to a voice sample in the sample pair into the acoustic coding module to obtain a first voice feature vector sequence corresponding to the voice sample, inputting the first voice feature vector sequence into the CIF module, determining a prediction weight sequence corresponding to the voice sample through the CIF module, and determining a second voice feature vector sequence corresponding to the voice sample based on the prediction weight sequence;
inputting the visual image samples in the sample pair into a visual image encoder in the multi-mode sensing module to obtain a visual feature vector sequence corresponding to the visual image samples;
inputting the text samples in the sample pair into a text encoder in the multi-mode sensing module to obtain a text feature vector sequence corresponding to the text samples;
At each decoding moment, inputting a predicted text characterization vector at the previous moment, a second voice feature vector corresponding to the current moment, the visual feature vector sequence and the text feature vector sequence into the decoding module to obtain a probability value of a predicted text of the voice sample at the current moment;
and training the initialized multi-modal voice recognition model according to the label text sequence corresponding to each voice sample, the first voice feature vector sequence, the weight sequence label, the predicted weight sequence, the probability value of the label text sequence and the probability value of the predicted text sequence.
2. The model training method according to claim 1, wherein the decoding module includes a feature fusion layer and a post-processing module connected in series, the inputting the predicted text token vector at the previous time, the second speech feature vector corresponding to the current time, the visual feature vector sequence, and the text feature vector sequence into the decoding module, to obtain a probability value of the predicted text of the speech sample at the current time, includes:
The feature vector, the visual feature vector sequence and the text feature vector sequence which are obtained by fusing the predicted text characterization vector at the previous moment and the second voice feature vector corresponding to the previous moment are input into the feature fusion layer for fusion, so that a target fusion feature vector is obtained;
and inputting the target fusion feature vector and the second voice feature vector corresponding to the current moment into the post-processing module to obtain the probability value of the predicted text.
3. The model training method according to claim 2, wherein the feature fusion layer includes an acoustic language fusion layer, a visual fusion layer, and a language fusion layer that are sequentially connected in series, the feature vector obtained by fusing the predicted text token vector at the previous time and the second speech feature vector corresponding to the previous time, the visual feature vector sequence, and the text feature vector sequence are all input to the feature fusion layer to be fused, and a target fusion feature vector is obtained, and the method includes:
the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment is input to the acoustic language fusion layer, and a first fusion feature vector is obtained;
Inputting the visual feature vector sequence and the first fusion feature vector into the visual fusion layer for fusion to obtain a second fusion feature vector;
and inputting the text feature vector sequence and the second fusion feature vector to the language fusion layer for fusion to obtain the target fusion feature vector.
4. The model training method according to claim 2, wherein the feature fusion layer includes an acoustic language fusion layer, a language fusion layer, and a visual fusion layer that are sequentially connected in series, the feature vector obtained by fusing the predicted text token vector at the previous time and the second speech feature vector corresponding to the previous time, the visual feature vector sequence, and the text feature vector sequence are all input to the feature fusion layer to be fused, so as to obtain a target fusion feature vector, and the method includes:
the feature vector obtained by fusing the predicted text representation vector at the previous moment and the second voice feature vector corresponding to the previous moment is input to the acoustic language fusion layer, and a first fusion feature vector is obtained;
inputting the text feature vector sequence and the first fusion feature vector into the language fusion layer for fusion to obtain a third fusion feature vector;
And inputting the visual feature vector sequence and the third fusion feature vector into the visual fusion layer for fusion to obtain the target fusion feature vector.
5. The model training method according to any one of claims 1 to 4, wherein training the initialized multi-modal speech recognition model according to the tag text sequence corresponding to each of the speech samples, the first speech feature vector sequence, the weight sequence tag, the predicted weight sequence, the probability value of the tag text sequence, and the probability value of the predicted text sequence comprises:
for each voice sample, constructing a connection time sequence classification loss function corresponding to the voice sample according to a label text sequence corresponding to the voice sample and the first voice feature vector sequence; constructing a quantity loss function corresponding to the voice sample according to the weight sequence label and the predicted weight sequence corresponding to the voice sample; constructing a cross entropy loss function corresponding to the voice sample according to the probability value of the label text sequence corresponding to the voice sample and the probability value of the predicted text sequence;
And training the initialized multi-modal voice recognition model according to the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to each voice sample.
6. The method according to claim 5, wherein the training the initialized multi-modal speech recognition model according to the connection time sequence classification loss function, the number loss function, and the cross entropy loss function corresponding to each of the speech samples comprises:
for each voice sample, carrying out weighting processing on the connection time sequence classification loss function, the quantity loss function and the cross entropy loss function corresponding to the voice sample to obtain a target loss function corresponding to the voice sample;
and training the initialized multi-modal voice recognition model according to the target loss function corresponding to each voice sample.
7. A method of speech recognition, comprising:
acquiring a voice to be recognized, and a visual image and a text corresponding to the voice to be recognized;
inputting the voice to be recognized, the visual image and the text into a multi-modal voice recognition model to obtain a predicted text corresponding to the voice to be recognized and a probability value of the predicted text, wherein the multi-modal voice recognition model is the multi-modal voice recognition model obtained by training any one of claims 1-6.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the model training method of any of claims 1 to 6 when the program is executed by the processor; alternatively, the speech recognition method of claim 7 is implemented.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the model training method according to any of claims 1 to 6; alternatively, the speech recognition method of claim 7 is implemented.
CN202310383270.4A 2023-04-12 2023-04-12 Model training method, voice recognition device and electronic equipment Active CN116110378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310383270.4A CN116110378B (en) 2023-04-12 2023-04-12 Model training method, voice recognition device and electronic equipment


Publications (2)

Publication Number Publication Date
CN116110378A CN116110378A (en) 2023-05-12
CN116110378B true CN116110378B (en) 2023-07-18

Family

ID=86258275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310383270.4A Active CN116110378B (en) 2023-04-12 2023-04-12 Model training method, voice recognition device and electronic equipment

Country Status (1)

Country Link
CN (1) CN116110378B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611845B (en) * 2024-01-24 2024-04-26 浪潮通信信息系统有限公司 Multi-mode data association identification method, device, equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436620B (en) * 2021-06-30 2022-08-30 北京有竹居网络技术有限公司 Training method of voice recognition model, voice recognition method, device, medium and equipment
CN113470626B (en) * 2021-06-30 2024-01-26 北京有竹居网络技术有限公司 Training method, device and equipment for voice recognition model
CN113470619B (en) * 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 Speech recognition method, device, medium and equipment
CN113270086B (en) * 2021-07-19 2021-10-15 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance
CN113936647B (en) * 2021-12-17 2022-04-01 中国科学院自动化研究所 Training method of voice recognition model, voice recognition method and system
CN114242071A (en) * 2021-12-21 2022-03-25 中山大学 Low-resource voice recognition method and system and voice model training method
CN114724548A (en) * 2022-03-11 2022-07-08 中国科学技术大学 Training method of multi-mode speech recognition model, speech recognition method and equipment
CN115273830A (en) * 2022-07-22 2022-11-01 阿里巴巴达摩院(杭州)科技有限公司 Method, device and equipment for stream type speech recognition and model training
CN115565533A (en) * 2022-09-21 2023-01-03 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN115762489A (en) * 2022-10-27 2023-03-07 阿里巴巴达摩院(杭州)科技有限公司 Data processing system and method of voice recognition model and voice recognition method

Also Published As

Publication number Publication date
CN116110378A (en) 2023-05-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant