CN111462733B - Multi-modal speech recognition model training method, device, equipment and storage medium - Google Patents

Multi-modal speech recognition model training method, device, equipment and storage medium

Info

Publication number
CN111462733B
CN111462733B (granted publication of application CN202010247184.7A; earlier publication CN111462733A)
Authority
CN
China
Prior art keywords
recognition model
voice
image
sample
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010247184.7A
Other languages
Chinese (zh)
Other versions
CN111462733A (en)
Inventor
景子君
潘嘉
吴华鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202010247184.7A
Publication of CN111462733A
Priority to PCT/CN2020/142166 (published as WO2021196802A1)
Application granted
Publication of CN111462733B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/24 Speech recognition using non-acoustical features
    • G10L 15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a multi-modal speech recognition model training method, apparatus, device and storage medium. During training of the multi-modal speech recognition model, the training data may include audio-only signals (i.e., signals for which no video was synchronously acquired) together with a data set from which corresponding image features are generated for those audio-only signals. This enriches the training data set available for training the multi-modal speech recognition model, improves the generalization ability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.

Description

Multi-modal speech recognition model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training a multi-modal speech recognition model.
Background
Traditional speech recognition technology obtains a recognition result by processing only the speech signal, and this approach can achieve very high accuracy in clean speech environments. However, in high-noise or far-field environments, the recognition rate of conventional speech recognition drops rapidly. To improve the recognition rate, multi-modal speech recognition methods that perform recognition with the help of lip-motion video have been proposed, which improve the recognition rate in high-noise scenarios to a certain extent.
However, existing multi-modal speech recognition models generalize poorly, which results in poor reliability of multi-modal speech recognition.
Therefore, how to improve the reliability of the multi-modal speech recognition model is a technical problem to be solved.
Disclosure of Invention
In view of the foregoing, the present application provides a method, apparatus, device and storage medium for training a multi-modal speech recognition model, so as to improve the reliability of the multi-modal speech recognition model.
In order to achieve the above object, the following solutions have been proposed:
a multi-modal speech recognition model training method, comprising:
acquiring training data through the multi-modal speech recognition model;
if the training data only comprises a sample speech signal, processing, by the multi-modal speech recognition model, each basic image feature in a preset data set by using the sample speech signal to obtain a target image feature corresponding to the sample speech signal; the basic image features are obtained from known lip movement related region images;
performing speech recognition according to the speech features of the sample speech signal and the target image feature to obtain a speech recognition result of the sample speech signal;
and updating parameters of the multi-modal speech recognition model with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
The above method, preferably, further comprises:
if the training data comprises both a sample speech signal and a lip movement related region image synchronously acquired with the sample speech signal, acquiring, by the multi-modal speech recognition model, the features of the lip movement related region image as the target image feature corresponding to the sample speech signal.
The above method, preferably, wherein processing each basic image feature in the preset data set by using the sample speech signal comprises:
obtaining a weight for each basic image feature by using the sample speech signal;
and weighting and summing the basic image features with their weights to obtain the target image feature corresponding to the sample speech signal.
The above method, preferably, wherein obtaining the weight of each basic image feature by using the sample speech signal comprises:
performing space conversion on the speech feature of the sample speech signal and on each basic image feature, respectively, by using space conversion parameters;
and calculating the weight of each basic image feature by using the converted speech feature and the converted basic image features.
Preferably, updating the parameters of the multi-modal speech recognition model includes updating the space conversion parameters.
In the above method, preferably, the sample speech signal is a speech signal in a first language, and after the multi-modal speech recognition model is trained, the method further comprises:
obtaining, through the speech feature extraction module of the multi-modal speech recognition model, the speech features of a sample speech signal in a second language;
processing, through the image feature generation module of the multi-modal speech recognition model, each basic image feature in the preset data set by using the speech features of the second-language sample speech signal, to obtain a target image feature corresponding to the second-language sample speech signal;
performing, through the recognition module of the multi-modal speech recognition model, speech recognition according to the speech features of the second-language sample speech signal and the corresponding target image feature, to obtain a speech recognition result of the second-language sample speech signal;
and updating parameters of the speech feature extraction module, the image feature generation module and the recognition module with the goal that the speech recognition result of the second-language sample speech signal approaches the speech content of the second-language sample speech signal.
The above method, preferably, the process of obtaining the basic image features from the known lip movement related region image, includes:
acquiring a lip movement related region image sequence synchronously acquired with a plurality of known voice signals;
sampling each lip movement related region image sequence respectively to obtain a basic lip movement related region image corresponding to each voice signal;
and acquiring the characteristic of each basic lip movement related region image as the basic image characteristic.
The above method, preferably, the process of obtaining the basic image features from the known lip movement related region images, includes:
acquiring the characteristics of a plurality of known lip movement related region images;
clustering the characteristics of the known lip movement related region images to obtain a plurality of clusters;
and extracting the clustering center of each cluster as the basic image characteristic.
The above method, preferably, the clustering of the features of the known lip movement related region images includes:
For the feature of each lip movement related region image to be clustered, determining a clustering center with the minimum distance from the feature of the lip movement related region image as a target clustering center;
the features of the lip movement related region image are aggregated to a cluster to which the target cluster center belongs;
updating the cluster center of the cluster to which the target cluster center belongs.
The method, preferably, the acquiring of the features of the known lip movement related region images, includes:
and acquiring the characteristics of the known lip movement related region images by using an image characteristic extraction model.
In the above method, preferably, the image feature extraction model is the image feature extraction module, used for extracting features of lip movement related region images, of a lip language recognition model trained by taking lip movement related region images and their corresponding lip pronunciation content as training data.
A multimodal speech recognition model training apparatus comprising:
the data acquisition module is used for acquiring training data through the multi-modal speech recognition model;
the first feature acquisition module is used for, if the training data only comprises a sample speech signal, processing each basic image feature in a preset data set by using the sample speech signal through the multi-modal speech recognition model, to obtain a target image feature corresponding to the sample speech signal; the basic image features are obtained from known lip movement related region images;
the recognition module is used for performing speech recognition according to the speech features of the sample speech signal and the target image feature through the multi-modal speech recognition model, to obtain a speech recognition result of the sample speech signal;
and the updating module is used for updating, through the multi-modal speech recognition model, the parameters of the multi-modal speech recognition model with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
A multi-modal speech recognition model training device, comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the multi-modal speech recognition model training method according to any one of the above claims.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the multimodal speech recognition model training method as defined in any of the preceding claims.
As can be seen from the above technical solutions, with the multi-modal speech recognition model training method, apparatus, device and storage medium provided by the embodiments of the present application, after training data are acquired, if the training data only comprise a sample speech signal, each basic image feature in a preset data set obtained from known lip movement related region images is processed by using the sample speech signal to obtain a target image feature corresponding to the sample speech signal; speech recognition is performed according to the speech features of the sample speech signal and the target image feature to obtain a speech recognition result of the sample speech signal; and the parameters of the multi-modal speech recognition model are updated with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal. With this model training scheme, the training data used for the multi-modal speech recognition model may include audio-only signals (i.e., signals for which no video was synchronously acquired) together with a data set from which corresponding image features are generated for those audio-only signals. This enriches the training data set used in training the multi-modal speech recognition model, improves the generalization ability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present application, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1a is a flowchart of one implementation of a multi-modal speech recognition model training method disclosed in an embodiment of the present application;
FIG. 1b is a flowchart of another implementation of the multi-modal speech recognition model training method disclosed in embodiments of the present application;
FIG. 2a is a flowchart of one implementation of obtaining basic image features from a known lip movement related region image as disclosed in an embodiment of the present application;
FIG. 2b is a flowchart of another implementation of obtaining basic image features from a known lip movement related region image as disclosed in an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of the multi-modal speech recognition model disclosed in the embodiments of the present application, in which each basic image feature in a preset data set is processed by using a sample speech signal to obtain a target image feature corresponding to the sample speech signal;
FIG. 4 is a schematic structural diagram of a multi-modal speech recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart of an implementation of further training a first multi-modal speech recognition model using a second-language sample speech signal after the first multi-modal speech recognition model is obtained according to an embodiment of the present disclosure;
FIG. 6a is a schematic structural diagram of a multi-modal speech recognition model training apparatus according to an embodiment of the present disclosure;
FIG. 6b is a schematic diagram of another embodiment of a multi-modal speech recognition model training apparatus disclosed in the present application;
fig. 7 is a block diagram of a hardware structure of a multi-modal speech recognition model training apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The inventors have found through research that current multi-modal speech recognition models are trained on synchronized audio-video data (i.e., the speaker's speech and lip video are acquired synchronously). Such synchronized audio-video data are difficult to obtain and limited in quantity, so current multi-modal speech recognition models can only be trained on small data sets, which leads to poor generalization and over-fitting: the trained multi-modal speech recognition model performs well on the training data set but poorly on the test data set.
To overcome these technical problems, the basic idea of the present scheme is that the training data set used for training the multi-modal speech recognition model can be enriched with audio-only data (i.e., only the speaker's speech is acquired, without video of the speaker), or with a combination of audio-only data and synchronously acquired audio and video data, thereby improving the generalization ability of the multi-modal speech recognition model and the reliability of multi-modal speech recognition.
Based on the basic idea, an implementation flowchart of the multi-modal speech recognition model training method provided in the embodiment of the present application is shown in fig. 1a, and may include:
step S111: training data is obtained through the multi-modal speech recognition model.
The training data may include only a sample speech signal, or may include both a sample speech signal and a lip movement related region image acquired in synchronization with the sample speech signal. That is, the training data set for training the multi-modal speech recognition model may contain two types of training data: audio-only speech signals, and synchronously acquired speech signals and video. In the embodiments of the present application, the speech signals in the training data set are collectively referred to as sample speech signals.
Step S112: if the training data only comprises the sample speech signal, processing each basic image feature in the preset data set by using the sample speech signal through the multi-modal speech recognition model, to obtain a target image feature corresponding to the sample speech signal; the basic image features are obtained from known lip movement related region images.
If the training data contain only a sample speech signal, no lip movement related region image was synchronously acquired when the sample speech signal was collected.
A known lip movement related region image refers to an image, or part of an image, in synchronized audio-video data. The lip movement related region may be the lip region only; it may be the lips and the surrounding area, such as the lips and the chin; or it may be the entire face region.
In the embodiments of the present application, the set of basic image features, i.e., the preset data set, is determined in advance from a number of known lip movement related region images. During training of the multi-modal speech recognition model, if a training sample contains speech only, a virtual lip feature corresponding to that speech is generated from the speech and the set of basic image features, and serves as the target image feature corresponding to the sample speech signal.
There are various ways to obtain the basic image features, and the following describes two preferred implementations:
referring to fig. 2a, fig. 2a is a flowchart of an implementation of obtaining basic image features according to a known lip movement related region image according to an embodiment of the present application, which may include:
step S211: a sequence of labial area-related images acquired in synchrony with a number of known speech signals is acquired.
If the number of the lip movement related region image sequences is N, the lip movement related region image sequences are synchronously acquired by N known voice signals.
Step S212: and respectively sampling each lip movement related region image sequence to obtain a basic lip movement related region image corresponding to each voice signal.
The sampling rate of each lip-motion related region image sequence is not particularly limited, and only one frame of lip-motion related region image may be sampled in each lip-motion related region image sequence, or two or more frames of lip-motion related region images may be sampled in each lip-motion related region image sequence.
The specific sampling mode may be random sampling, or may be sampling according to a predetermined sampling mode. For example, 1 frame per Q frames, etc.
Step S213: and acquiring the characteristic of each basic lip movement related region image as a basic image characteristic.
Alternatively, an image feature extraction model may be used to obtain the features of the basic lip movement related region images. The image feature extraction model may specifically be the image feature extraction module, used for extracting features of lip movement related region images, of a lip language recognition model trained by taking lip movement related region images and their corresponding lip pronunciation content as training data. Specifically, a basic lip movement related region image can be input into the lip language recognition model, and the features output by the image feature extraction module of the lip language recognition model are the basic image features.
In the embodiments of the present application, the specific architecture of the lip language recognition model is not limited; whatever its architecture, it includes an image feature extraction module. For example, in an alternative embodiment, the lip language recognition model may include: an image feature extraction module, used for extracting features from the image sequence input into the lip language recognition model; and a lip language recognition module, used for performing lip language recognition according to the features extracted by the image feature extraction module. The training process of the lip language recognition model can follow existing training methods and is not described in detail here.
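To make the flow of fig. 2a concrete, the following is a minimal Python sketch of steps S211 to S213. The 2-D convolutional extractor, the every-Q-frames sampling and all tensor shapes are illustrative assumptions; in the scheme described above, the extractor weights would come from the image feature extraction module of a trained lip language recognition model.

```python
import torch
import torch.nn as nn

class LipFeatureExtractor(nn.Module):
    """Stand-in for the image feature extraction module of a trained
    lip language recognition model; the architecture here is assumed."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 1, 80, 80) grayscale mouth-region crops
        return self.backbone(images)

def sample_base_images(sequence: torch.Tensor, q: int = 5) -> torch.Tensor:
    """Keep 1 frame every q frames of a lip movement related region image
    sequence (steps S211/S212); random sampling would work equally well."""
    return sequence[::q]

# Steps S211-S213: sample each synchronized sequence, then extract features.
extractor = LipFeatureExtractor().eval()  # weights would come from the trained lip language recognition model
sequences = [torch.rand(100, 1, 80, 80) for _ in range(3)]  # three dummy image sequences
with torch.no_grad():
    base_image_features = torch.cat([extractor(sample_base_images(s)) for s in sequences])
print(base_image_features.shape)  # (num_basic_images, 512)
```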
Referring to fig. 2b, fig. 2b is a flowchart of another implementation of obtaining basic image features according to a known lip movement related region image according to an embodiment of the present application, which may include:
step S221: features of several known lip movement related region images are acquired.
The known lip movement related region images may be all of the images in the lip movement related region image sequences synchronously acquired with a number of known speech signals. The features of the lip movement related region images may be obtained with the image feature extraction model described in the embodiment shown in fig. 2 a. A feature of a lip movement related region image may be a feature vector of a certain dimension, for example a 512-dimensional, 1024-dimensional, 256-dimensional or 128-dimensional feature vector. Optionally, the lip language recognition module is a frame classification network that may contain only a single fully connected layer, so that the features extracted by the image feature extraction module of the lip language recognition model reflect the lip features more directly, which makes it easier to obtain the basic image features.
Step S222: and clustering the characteristics of a plurality of known lip movement related region images to obtain a plurality of clusters.
Alternatively, cosine-distance-based clustering, such as k-means clustering, may be performed on all feature vectors. The number of clusters may be 128, 56, 256 or another number, which is not specifically limited here. The specific clustering process may include:
for the feature of each lip movement related region image to be clustered, determining the cluster center with the smallest distance to that feature as the target cluster center; that is, for the feature of each lip movement related region image to be clustered, the distance between that feature and every cluster center is calculated, and the calculated distances are compared to find the minimum. Specifically, the distance between the feature P of a lip movement related region image and a cluster center Center may be calculated as the cosine distance Distance(P, Center) = 1 - (P · Center) / (||P|| · ||Center||);
aggregating the feature of the lip movement related region image into the cluster to which the target cluster center belongs; that is, if the distance between the feature of the lip movement related region image and cluster center J is the smallest, the feature is aggregated into the cluster to which cluster center J belongs;
updating the cluster center of the cluster to which the target cluster center belongs. Alternatively, the new cluster center may be determined from the current cluster center, the feature of the lip movement related region image, and the number of image features in that cluster. Specifically, assume the cluster currently has a cluster center (denoted Center_{n-1} for convenience of description) and n-1 clustered points (i.e., the features of the lip movement related region images already in the cluster); after the feature P of the lip movement related region image is aggregated into the cluster, the cluster center is updated to Center_n, which can be obtained as Center_n = ((n-1) · Center_{n-1} + P) / n.
step S223: and extracting the clustering center of each cluster as a basic image characteristic.
Step S113: performing speech recognition, through the multi-modal speech recognition model, according to the speech features of the sample speech signal and the target image feature corresponding to the sample speech signal, to obtain a speech recognition result of the sample speech signal.
The specific process of obtaining the speech recognition result may follow existing approaches and is not described in detail here.
Step S114: updating parameters of the multi-modal speech recognition model, through the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
With the multi-modal speech recognition model training method disclosed in the embodiments of the present application, the training data used during training of the multi-modal speech recognition model may include audio-only signals (i.e., signals for which no video was synchronously acquired) together with a data set from which corresponding image features are generated for those audio-only signals. This enriches the training data set used in training the multi-modal speech recognition model, improves the generalization ability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
To further enrich the training set, the training data may also include synchronously collected audio data and video data. On this basis, referring to fig. 1b, another implementation flowchart of the multi-modal speech recognition model training method provided in the embodiments of the present application may include:
step S121: training data is obtained through the multi-modal speech recognition model.
Step S122: if the training data only comprises the sample speech signal, processing each basic image feature in the preset data set by using the sample speech signal through the multi-modal speech recognition model, to obtain a target image feature corresponding to the sample speech signal; the basic image features are obtained from known lip movement related region images.
For the specific implementation of step S121 to step S122, reference may be made to the foregoing step S111 to step S112, which are not described herein.
Step S123: if the training data include both a sample speech signal and a lip movement related region image synchronously acquired with the sample speech signal, acquiring the features of the lip movement related region image through the multi-modal speech recognition model as the target image feature corresponding to the sample speech signal.
If the training data contain both speech and a lip movement related region image, features are extracted directly from the lip movement related region image to obtain the target image feature corresponding to the sample speech signal. In the embodiments of the present application, whether the lip movement related region image is acquired directly or cropped from an acquired image, any such image acquired in synchronization with the speech signal is referred to as a lip movement related region image.
Step S124: performing speech recognition, through the multi-modal speech recognition model, according to the speech features of the sample speech signal and the target image feature corresponding to the sample speech signal, to obtain a speech recognition result of the sample speech signal.
Step S125: updating parameters of the multi-modal speech recognition model, through the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
The specific implementation of step S124 to step S125 can refer to the foregoing step S113 to step S114, and will not be described herein.
In this embodiment of the present application, the training data set includes two types of training data (audio-only data, and synchronously collected audio and video data). Training the multi-modal speech recognition model on such a training data set can further improve the generalization ability of the multi-modal speech recognition method and further improve the reliability of the multi-modal speech recognition model.
In an alternative embodiment, a flowchart of one implementation in which the multi-modal speech recognition model processes each basic image feature in the preset data set with the sample speech signal to obtain the target image feature corresponding to the sample speech signal is shown in fig. 3, and may include:
Step S31: weights for the individual basic image features are obtained using the sample speech signal.
For the same basic image feature, different sample speech signals may produce different weights.
Alternatively, space conversion parameters may be used to perform space conversion on the speech feature of the sample speech signal and on each of the basic image features, and the weights of the basic image features may be calculated using the converted speech feature and the converted basic image features. Specifically,
a first space conversion parameter may be used to perform space conversion on the speech feature of the sample speech signal to obtain the converted speech feature, and a second space conversion parameter may be used to perform space conversion on the basic image features to obtain the converted basic image features. The second space conversion parameter is composed of a plurality of subspace conversion parameters, each basic image feature corresponding to one subspace conversion parameter.
The first space conversion parameter and each subspace conversion parameter may each be a space conversion matrix.
Alternatively, the weight a_Ai of the i-th basic image feature (i = 1, 2, 3, ..., n, where n is the number of basic image features in the preset data set) for a speech feature A can be calculated by normalizing, over all n basic image features, the similarity between the converted speech feature and each converted basic image feature. Here K_A denotes the space conversion matrix corresponding to the speech feature A; M_i denotes the i-th basic image feature; K_Mi denotes the space conversion matrix corresponding to the i-th basic image feature M_i; and K_Mj and M_j denote the space conversion matrix and the basic image feature for index j.
Step S32: and weighting and summing the basic image features by using the weights of the basic image features to obtain target image features corresponding to the sample voice signals.
If feature extraction on the speech signal yields the speech feature A, the target image feature M_Ao corresponding to the speech feature A can be expressed as the weighted sum M_Ao = Σ_{i=1..n} a_Ai · M_i.
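The exact weighting formula is not fixed here; the sketch below assumes a softmax-normalized dot-product similarity between the converted speech feature (K_A applied to A) and each converted basic image feature (K_Mi applied to M_i), followed by the weighted sum M_Ao. The module name, dimensions and initialization are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class ImageFeatureGenerator(nn.Module):
    """Sketch of the image feature generation module: produces a virtual
    target image feature from a speech feature and the preset data set."""
    def __init__(self, base_features: torch.Tensor, speech_dim: int = 512, attn_dim: int = 256):
        super().__init__()
        n, image_dim = base_features.shape
        self.register_buffer("base_features", base_features)            # M_1 ... M_n
        self.k_a = nn.Linear(speech_dim, attn_dim, bias=False)          # first space conversion parameter K_A
        self.k_m = nn.Parameter(torch.randn(n, image_dim, attn_dim) * 0.02)  # one subspace conversion per M_i

    def forward(self, speech_feat: torch.Tensor) -> torch.Tensor:
        # speech_feat: (batch, speech_dim)
        q = self.k_a(speech_feat)                                        # converted speech feature
        keys = torch.einsum("nd,nda->na", self.base_features, self.k_m) # converted basic image features
        weights = torch.softmax(q @ keys.T, dim=-1)                      # a_Ai, normalized over all i
        return weights @ self.base_features                              # M_Ao = sum_i a_Ai * M_i

base = torch.rand(128, 512)                   # basic image features, e.g. from clustering
generator = ImageFeatureGenerator(base)
target_image_feature = generator(torch.rand(4, 512))
print(target_image_feature.shape)             # (4, 512)
```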
in an alternative embodiment, a schematic structural diagram of the multi-modal speech recognition model is shown in fig. 4, and may include:
a voice feature extraction module 41, an image feature generation module 42, an image feature extraction module 43 and a recognition module 44; wherein,
the speech feature extraction module 41 is configured to obtain speech features of the sample speech signal. The speech features may be hidden features of acoustic features such as fbank features, or mel-frequency cepstrum coefficient (MFCC) features, etc.
The image feature generation module 42 is configured to, if the training data acquired by the multi-modal speech recognition model include only the sample speech signal, process each basic image feature in the preset data set by using the sample speech signal, so as to obtain a target image feature corresponding to the sample speech signal.
The image feature extraction module 43 is configured to, if the training data acquired by the multi-modal speech recognition model include both the sample speech signal and the lip movement related region image synchronously acquired with it, perform feature extraction on the lip movement related region image to obtain the target image feature corresponding to the sample speech signal.
In practice, when synchronized audio-video data are acquired, an audio signal of a certain duration and the video covering the same duration are generally collected.
Taking the case where the lip movement related region image contains only the lip region as an example, the lip movement related region image may be a mouth-region image of a predetermined size cropped from the captured video frame, centered on the center point of the mouth. The predetermined size may be 80×80.
To keep the audio synchronized with the video data, in this embodiment of the present application a sliding window is used to frame the acquired speech signal. Specifically, a sliding window with a window length of 25 ms and a frame shift of 10 ms may be slid over the acquired speech signal to obtain speech frames at 100 fps, and an initial feature (such as an fbank feature) is extracted for each speech frame to obtain an initial fbank feature sequence, where each initial fbank feature is a 40-dimensional vector. In the embodiments of the present application, the sample speech signal input into the multi-modal speech recognition model is this 100 fps initial fbank feature sequence of the sample speech signal. The speech feature extraction module 41 then performs feature extraction on the 100 fps initial fbank feature sequence to obtain 512-dimensional speech feature vectors (usually hidden-layer features) at 25 fps.
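A minimal sketch of the framing described above, assuming 16 kHz audio. The fbank computation is replaced by a rough placeholder (any standard filter-bank implementation could be substituted); the point of the sketch is the 25 ms window with a 10 ms shift that yields a 100 fps, 40-dimensional initial feature sequence.

```python
import numpy as np

SAMPLE_RATE = 16000              # assumed sampling rate
WIN = int(0.025 * SAMPLE_RATE)   # 25 ms window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)   # 10 ms shift  -> 160 samples, i.e. 100 frames per second

def frame_signal(signal: np.ndarray) -> np.ndarray:
    """Slide a 25 ms window with a 10 ms shift over the waveform."""
    n_frames = 1 + (len(signal) - WIN) // HOP
    return np.stack([signal[i * HOP : i * HOP + WIN] for i in range(n_frames)])

def fbank_placeholder(frame: np.ndarray, n_mels: int = 40) -> np.ndarray:
    """Placeholder for a 40-dimensional fbank feature; a real system would
    apply a mel filter bank to the frame's power spectrum."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    edges = np.linspace(0, len(spectrum), n_mels, endpoint=False).astype(int)
    return np.log(np.add.reduceat(spectrum, edges) + 1e-8)

waveform = np.random.randn(SAMPLE_RATE * 2)                           # 2 s of dummy audio
frames = frame_signal(waveform)                                        # ~200 frames at 100 fps
initial_features = np.stack([fbank_placeholder(f) for f in frames])   # (~200, 40)
print(initial_features.shape)
# The speech feature extraction module 41 would map this 100 fps, 40-dim sequence
# to 25 fps, 512-dim hidden features (e.g. with a stride-4 temporal encoder).
```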
The recognition module 44 is configured to perform speech recognition according to the speech feature and the target image feature, so as to obtain a speech recognition result of the sample speech signal. Specifically, the recognition module fuses the speech feature and the target image feature to obtain a fusion feature, and then performs speech recognition using the fusion feature to obtain the speech recognition result.
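A sketch of the fusion step in the recognition module, assuming simple concatenation followed by a linear classifier; the fusion operator, dimensions and number of output units are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class RecognitionModule(nn.Module):
    """Fuses the speech feature and the target image feature, then classifies."""
    def __init__(self, speech_dim: int = 512, image_dim: int = 512, n_classes: int = 4000):
        super().__init__()
        self.fuse = nn.Linear(speech_dim + image_dim, 512)   # fusion feature
        self.classifier = nn.Linear(512, n_classes)          # recognition output units

    def forward(self, speech_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        fusion = torch.relu(self.fuse(torch.cat([speech_feat, image_feat], dim=-1)))
        return self.classifier(fusion)                        # per-frame logits

recognizer = RecognitionModule()
logits = recognizer(torch.rand(4, 512), torch.rand(4, 512))
print(logits.shape)  # (4, 4000)
```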
The loss function Loss of the multi-modal speech recognition model may be:
Loss = α · CELoss(φ(A, V), Label) + (1 − α) · CELoss(φ_M(A, M), Label)
where α = 0 if the training data acquired by the multi-modal speech recognition model include only the sample speech signal, and α = 1 if the training data include both the sample speech signal and the lip movement related region image synchronously acquired with it. φ(A, V) denotes the output of the multi-modal speech recognition model when the training data include both the sample speech signal and the lip movement related region image; φ_M(A, M) denotes the output of the multi-modal speech recognition model when the training data include only the sample speech signal; Label denotes the label corresponding to the training data, i.e., the true speech content; and CELoss denotes the cross-entropy loss function. Of course, in the embodiments of the present application, the loss function is not limited to cross entropy; other loss functions may be used, and the present application does not specifically limit this.
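The following sketch shows one training step driven by this loss, with α chosen per sample according to whether a synchronized lip movement related region image is available. The toy model, its module shapes and the image feature generation by a single linear layer are illustrative assumptions; only the form of the loss follows the formula above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultimodalASR(nn.Module):
    """Toy stand-in exposing the four modules; real encoders would be deeper."""
    def __init__(self, dim: int = 512, n_classes: int = 100):
        super().__init__()
        self.speech_encoder = nn.Linear(40, dim)             # speech feature extraction module (41)
        self.image_generator = nn.Linear(dim, dim)           # image feature generation module (42), simplified
        self.image_encoder = nn.Linear(80 * 80, dim)         # image feature extraction module (43), simplified
        self.recognizer = nn.Linear(2 * dim, n_classes)      # recognition module (44)

    def recognize(self, speech_feat, image_feat):
        return self.recognizer(torch.cat([speech_feat, image_feat], dim=-1))

def training_step(model, optimizer, speech, lip_images, labels):
    """Loss = alpha * CE(phi(A, V), Label) + (1 - alpha) * CE(phi_M(A, M), Label)."""
    speech_feat = model.speech_encoder(speech)
    gen_image_feat = model.image_generator(speech_feat)       # virtual image feature from the preset data set
    alpha = 1.0 if lip_images is not None else 0.0
    loss = (1.0 - alpha) * F.cross_entropy(model.recognize(speech_feat, gen_image_feat), labels)
    if lip_images is not None:                                # audio-video synchronized sample
        real_image_feat = model.image_encoder(lip_images.flatten(1))
        loss = loss + alpha * F.cross_entropy(model.recognize(speech_feat, real_image_feat), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = ToyMultimodalASR()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# audio-only batch (alpha = 0) followed by an audio-video batch (alpha = 1)
print(training_step(model, opt, torch.rand(8, 40), None, torch.randint(0, 100, (8,))))
print(training_step(model, opt, torch.rand(8, 40), torch.rand(8, 1, 80, 80), torch.randint(0, 100, (8,))))
```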
In an alternative embodiment, updating the parameters of the multi-modal speech recognition model includes updating the parameters of the speech feature extraction module 41, the parameters of the image feature generation module 42, the parameters of the image feature extraction module 43, and the parameters of the recognition module 44. The parameters of the image feature generation module 42 include the space conversion parameters; therefore, updating the parameters of the multi-modal speech recognition model includes updating the space conversion parameters.
In an alternative embodiment, in order to further increase the recognition accuracy of the multimodal speech recognition model, some of the functional modules may be pre-trained before training the multimodal speech recognition model.
Optionally, before training the multi-modal speech recognition model, the initial parameters of the speech feature extraction module 41 may be parameters of a feature extraction module for extracting features of the speech signal in a speech recognition model trained by using the speech signal and its corresponding speech content as training data.
That is, the initial parameters of the speech feature extraction module 41 are parameters of the feature extraction module in a speech recognition model trained with pure speech samples.
In the embodiment of the present application, the specific architecture of the speech recognition model is not limited, but no matter what the architecture of the speech recognition model is, the feature extraction module may be included. For example, in an alternative embodiment, the speech recognition model may include: the feature extraction module is used for extracting acoustic features of the input voice recognition model; and the recognition module is used for carrying out voice recognition according to the features extracted by the feature extraction module. The training process of the speech recognition model can refer to the existing training method, and will not be described in detail here.
The speech samples for training the speech recognition model may or may not include the speech samples for training the multimodal speech recognition model, which is not particularly limited in this application.
Optionally, before training the multimodal speech recognition model, the initial parameters of the image feature extraction module 43 may be parameters of an image feature extraction module for extracting features of the image sequence in the lip language recognition model trained by using the image sequence and the corresponding pronunciation content as training data.
That is, the initial parameters of the image feature extraction module 43 are parameters of the image feature extraction module in the lip recognition model trained using the pure image sequence samples.
In the embodiment of the application, the specific architecture of the lip language recognition model is not limited, but no matter what the architecture of the lip language recognition model is, the image feature extraction module may be included. For example, in an alternative embodiment, the lip language recognition model may include: the image feature extraction module is used for extracting features of an image sequence input into the lip language identification model; and the identification module is used for carrying out lip language identification according to the features extracted by the image feature extraction module. The training process of the lip language recognition model can refer to the existing training method, and the detailed description is omitted here.
The image sequence sample for training the lip language recognition model may or may not include the image sequence sample for training the multi-modal speech recognition model, which is not specifically limited in this application.
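A minimal sketch of the pre-training initialization described above: the feature extractors of the single-modality models are copied into the corresponding modules of the multi-modal model before multi-modal training starts. The module types and shapes are illustrative assumptions; load_state_dict is the standard PyTorch mechanism used for the copy.

```python
import torch
import torch.nn as nn

# Toy stand-ins; in practice these are the feature extractors of the trained
# single-modality speech recognition and lip language recognition models.
pretrained_asr_encoder = nn.Linear(40, 512)        # feature extraction module of a speech recognition model
pretrained_lip_encoder = nn.Linear(80 * 80, 512)   # image feature extraction module of a lip-reading model

# Modules of the (not yet trained) multi-modal model, same shapes by construction.
speech_feature_extraction_module = nn.Linear(40, 512)       # module 41
image_feature_extraction_module = nn.Linear(80 * 80, 512)   # module 43

# Initialize the multi-modal modules from the pretrained single-modality extractors.
speech_feature_extraction_module.load_state_dict(pretrained_asr_encoder.state_dict())
image_feature_extraction_module.load_state_dict(pretrained_lip_encoder.state_dict())
```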
In an optional embodiment, the speech signal included in the training data is a speech signal of the first language, and after the multi-modal speech recognition model is trained, the trained multi-modal speech recognition model may be used to perform speech recognition of the first language. The first language may be any one of the following languages, for example: chinese, english, korean, japanese, french, italian, etc.
After the multi-modal speech recognition model has been trained with training data in the first language, the trained multi-modal speech recognition model (denoted the first multi-modal speech recognition model for convenience of description) can be transferred to the training of a multi-modal speech recognition model for the second language (denoted the second multi-modal speech recognition model), even when the second-language training data contain no video. That is, if the first-language training data set contains synchronized audio-video data while the second-language training data set does not, the multi-modal speech recognition model can first be trained on the first-language training data set according to the method described above to obtain the first multi-modal speech recognition model, and the trained first multi-modal speech recognition model can then be further trained with the second-language training data set to obtain the second multi-modal speech recognition model, which can perform multi-modal speech recognition on synchronized audio-video data in the second language. Because the second multi-modal speech recognition model is trained on the basis of the pretrained first multi-modal speech recognition model, training with the second-language training data set converges quickly, the resulting multi-modal speech recognition model achieves high accuracy in multi-modal speech recognition on second-language audio-video data, and migration of the multi-modal speech recognition model between different languages is realized.
Specifically, after the first multi-modal speech recognition model is obtained, an implementation flowchart for further training the first multi-modal speech recognition model by using the sample speech signal of the second language is shown in fig. 5, which may include:
step S51: the speech features of the sample speech signal in the second language are obtained by the speech feature extraction module 41 of the first multimodal speech recognition model.
Step S52: processing, through the image feature generation module 42 of the first multi-modal speech recognition model, each basic image feature in the preset data set by using the speech features of the second-language sample speech signal, to obtain the target image feature corresponding to the second-language sample speech signal.
Step S53: performing speech recognition, through the recognition module 44 of the first multi-modal speech recognition model, according to the speech features of the second-language sample speech signal and the corresponding target image feature, to obtain a speech recognition result of the second-language sample speech signal.
Step S54: updating the parameters of the speech feature extraction module 41, the image feature generation module 42 and the recognition module 44 with the goal that the speech recognition result of the second-language sample speech signal approaches the speech content of the second-language sample speech signal.
Since the second-language training data set contains only audio-only data, the image feature extraction module 43 is not used when the first multi-modal speech recognition model is further trained with the second-language training data set. Consequently, the parameters of the image feature extraction module in the resulting second multi-modal speech recognition model are the same as those of the image feature extraction module in the first multi-modal speech recognition model.
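A minimal sketch of the second-language fine-tuning of fig. 5, assuming the same illustrative module shapes as above: the image feature extraction module is left untouched (it is not used for the audio-only second-language data), while the speech feature extraction module, the image feature generation module and the recognition module are updated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the modules of a trained first multi-modal model (shapes are illustrative).
speech_encoder = nn.Linear(40, 512)       # speech feature extraction module (41)
image_generator = nn.Linear(512, 512)     # image feature generation module (42), simplified
image_encoder = nn.Linear(80 * 80, 512)   # image feature extraction module (43)
recognizer = nn.Linear(1024, 100)         # recognition module (44)

# The image feature extraction module is not used for the (audio-only) second language,
# so it is frozen and excluded from the optimizer; only the other three modules are updated.
for p in image_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(
    list(speech_encoder.parameters()) + list(image_generator.parameters()) + list(recognizer.parameters()),
    lr=1e-4,
)

# One second-language training step on a dummy audio-only batch.
speech, labels = torch.rand(8, 40), torch.randint(0, 100, (8,))
speech_feat = speech_encoder(speech)
image_feat = image_generator(speech_feat)                     # virtual lip feature from the preset data set
logits = recognizer(torch.cat([speech_feat, image_feat], dim=-1))
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```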
Corresponding to the method embodiment, the embodiment of the application also provides a multi-modal speech recognition model training device.
As shown in fig. 6a, a schematic structural diagram of a multi-modal speech recognition model training apparatus according to an embodiment of the present application may include:
a data acquisition module 611, a first feature acquisition module 612, an identification module 613 and an update module 614; wherein,
the data acquisition module 611 is configured to acquire training data through the multi-modal speech recognition model;
the first feature obtaining module 612 is configured to process, by using the multi-modal speech recognition model, each basic image feature in a preset dataset by using the sample speech signal if the training data only includes the sample speech signal, so as to obtain a target image feature corresponding to the sample speech signal; the basic image features are obtained according to known lip movement related region images;
The recognition module 613 is configured to perform speech recognition according to the speech features of the sample speech signal and the target image features through the multi-modal speech recognition model, so as to obtain a speech recognition result of the sample speech signal;
the updating module 614 is configured to update parameters of the multi-modal speech recognition model by using the multi-modal speech recognition model to target that the speech recognition result of the sample speech signal approaches to the speech content of the sample speech signal.
With the multi-modal speech recognition model training apparatus disclosed in the embodiments of the present application, the training data used during training of the multi-modal speech recognition model may include audio-only signals (i.e., signals for which no video was synchronously acquired) together with the data set from which the corresponding image features are generated for those audio-only signals. This enriches the training data set used in training the multi-modal speech recognition model, improves the generalization ability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
In an alternative embodiment, the first feature acquisition module 612 may include:
the weight acquisition module is used for acquiring the weight of each basic image characteristic by utilizing the sample voice signal through the multi-mode voice recognition model if the training data only comprises the sample voice signal;
And the target acquisition module is used for weighting and summing the basic image features by using the weights of the basic image features through the multi-mode voice recognition model to obtain target image features corresponding to the sample voice signals.
In an alternative embodiment, the weight acquisition module may include:
the space conversion module is used for carrying out space conversion on the voice characteristics of the sample voice signals and the characteristics of each basic image respectively by utilizing space conversion parameters through the multi-mode voice recognition model if the training data only comprises the sample voice signals;
and the calculating module is used for calculating the weight of each basic image feature by using the converted voice feature and the converted basic image feature through the multi-mode voice recognition model.
In an alternative embodiment, the parameters of the multi-modal speech recognition model updated by the updating module 614 include the space conversion parameters.
In an alternative embodiment, the sample speech signal is a speech signal in a first language; the multi-modal speech recognition model training device is also used for: obtaining the voice characteristics of the sample voice signal of the second language through the voice characteristic extraction module of the multi-mode voice recognition model;
The first feature acquisition module 612 is further configured to: processing each basic image feature in the preset data set by using the voice feature of the sample voice signal of the second language through the image feature generation module of the multi-mode voice recognition model to obtain a target image feature corresponding to the sample voice signal of the second language;
the identification module 613 is further configured to: performing voice recognition according to the voice characteristics of the second-language sample voice signal and the target image characteristics corresponding to the second-language sample voice signal by the recognition module of the multi-mode voice recognition model to obtain a voice recognition result of the second-language sample voice signal;
the update module 614 is further configured to: update the parameters of the speech feature extraction module, the image feature generation module and the recognition module with the goal that the speech recognition result of the second-language sample speech signal approaches the speech content of the second-language sample speech signal.
In an alternative embodiment, the multi-modal speech recognition model training apparatus may further include:
the basic image feature acquisition module is used for acquiring a lip movement related region image sequence synchronously acquired with a plurality of known voice signals; sampling each lip movement related region image sequence respectively to obtain a basic lip movement related region image corresponding to each voice signal; and acquiring the characteristic of each basic lip movement related region image as the basic image characteristic.
In an alternative embodiment, the multi-modal speech recognition model training apparatus may further include:
the basic image feature acquisition module is used for acquiring features of a plurality of known lip movement related region images; clustering the characteristics of the known lip movement related region images to obtain a plurality of clusters; and extracting the clustering center of each cluster as the basic image characteristic.
In an alternative embodiment, when the basic image feature acquisition module clusters the features of the several known lip movement related region images, the basic image feature acquisition module is specifically configured to:
for the feature of each lip movement related region image to be clustered, determining a clustering center with the minimum distance from the feature of the lip movement related region image as a target clustering center;
the features of the lip movement related region image are aggregated to a cluster to which the target cluster center belongs;
updating the cluster center of the cluster to which the target cluster center belongs.
In an alternative embodiment, when acquiring the features of the several known lip movement related region images, the basic image feature acquisition module is specifically configured to:
and acquiring the characteristics of the known lip movement related region images by using an image characteristic extraction model.
In an alternative embodiment, the image feature extraction model is the image feature extraction module, used for extracting features of lip movement related region images, of a lip language recognition model trained by taking lip movement related region images and their corresponding lip pronunciation content as training data.
As shown in fig. 6b, another schematic structural diagram of the multi-modal speech recognition model training apparatus provided in the embodiment of the present application may include:
a data acquisition module 621, a first feature acquisition module 622, a second feature acquisition module 623, an identification module 624, and an update module 625; wherein,
the data acquisition module 621 is configured to acquire training data through the multi-modal speech recognition model;
the first feature obtaining module 622 is configured to process, by using the multi-modal speech recognition model, each basic image feature in a preset dataset by using the sample speech signal if the training data only includes the sample speech signal, so as to obtain a target image feature corresponding to the sample speech signal; the basic image features are obtained according to known lip movement related region images;
the second feature obtaining module 623 is configured to obtain, if the training data includes both a sample speech signal and a lip movement related region image synchronously collected therewith, features of the lip movement related region image through the multi-modal speech recognition model as target image features corresponding to the sample speech signal;
The recognition module 624 is configured to perform speech recognition according to the speech features of the sample speech signal and the target image features through the multi-modal speech recognition model, so as to obtain a speech recognition result of the sample speech signal;
the updating module 625 is configured to update, through the multi-modal speech recognition model, the parameters of the multi-modal speech recognition model with the goal of making the speech recognition result of the sample speech signal approach the speech content of the sample speech signal.
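The following PyTorch-style sketch illustrates, under stated assumptions, one possible training step matching the modules above; the speech-driven weighting of the basic image features (see claims 3 to 5 below) is written as a simple attention over a common projection space, and the names speech_encoder, image_encoder, recognizer, ImageFeatureGenerator and the cross-entropy objective are illustrative assumptions rather than details prescribed by the present application.

import torch
import torch.nn as nn

class ImageFeatureGenerator(nn.Module):
    # Generates target image features from a sample voice signal's speech
    # features and the basic image features in the preset data set.
    def __init__(self, speech_dim, image_dim, common_dim):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, common_dim)  # spatial transformation
        self.image_proj = nn.Linear(image_dim, common_dim)    # parameters (trainable)

    def forward(self, speech_feat, basic_image_feats):
        # speech_feat: (T, speech_dim); basic_image_feats: (K, image_dim)
        q = self.speech_proj(speech_feat)              # transformed speech features
        k = self.image_proj(basic_image_feats)         # transformed basic image features
        weights = torch.softmax(q @ k.t(), dim=-1)     # weight of each basic image feature
        return weights @ basic_image_feats             # weighted sum -> target image features

def train_step(model, generator, batch, basic_image_feats, optimizer, loss_fn):
    speech_feat = model.speech_encoder(batch["audio"])
    if batch.get("video") is not None:
        # Training data contains synchronously collected lip movement related region images.
        target_image_feat = model.image_encoder(batch["video"])
    else:
        # Audio-only training data: generate the target image features instead.
        target_image_feat = generator(speech_feat, basic_image_feats)
    logits = model.recognizer(speech_feat, target_image_feat)
    loss = loss_fn(logits, batch["text"])   # drive the result toward the speech content
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this reading, the two trainable projections play the role of the spatial transformation parameters and are updated together with the rest of the model during back-propagation.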
With the above multi-modal speech recognition model training device, the training data used in training the multi-modal speech recognition model are not limited to synchronously collected audio and video data; audio-only signals (namely, voice signals with no synchronously collected video) can also be used, together with a data set from which the image features corresponding to these audio-only signals are generated. The training data set is thus further enriched, which further improves the generalization capability and the reliability of the multi-modal speech recognition model.
The multi-modal speech recognition model training device provided by the embodiment of the application can be applied to multi-modal speech recognition model training equipment, such as a PC terminal, a cloud platform, a server or a server cluster. Optionally, fig. 7 shows a block diagram of the hardware structure of the multi-modal speech recognition model training equipment; referring to fig. 7, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application;
the memory 3 may comprise a high-speed RAM, and may further comprise a non-volatile memory, for example at least one disk memory;
wherein the memory stores a program, and the processor may invoke the program stored in the memory, the program being configured to:
acquiring training data through the multi-modal voice recognition model;
if the training data only comprises a sample voice signal, the multi-mode voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to known lip movement related region images;
Performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating parameters of the multi-mode voice recognition model with the goal of making the voice recognition result of the sample voice signal approach the voice content of the sample voice signal.
Optionally, for the refined and extended functions of the program, reference may be made to the corresponding description above.
The embodiment of the application also provides a storage medium, which may store a program adapted to be executed by a processor, the program being configured to:
acquiring training data through the multi-modal voice recognition model;
if the training data only comprises a sample voice signal, the multi-mode voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to known lip movement related region images;
performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating parameters of the multi-mode voice recognition model with the goal of making the voice recognition result of the sample voice signal approach the voice content of the sample voice signal.
Optionally, for the refined and extended functions of the program, reference may be made to the corresponding description above.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
It should be understood that, in the embodiments of the present application, the claims, the embodiments and the features thereof may be combined with one another to solve the foregoing technical problems.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for identical or similar parts between the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for training a multimodal speech recognition model, comprising:
Acquiring training data through the multi-modal voice recognition model;
if the training data only comprises a sample voice signal, the multi-mode voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to known lip movement related region images;
performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating parameters of the multi-mode voice recognition model with the goal of making the voice recognition result of the sample voice signal approach the voice content of the sample voice signal.
2. The method as recited in claim 1, further comprising:
if the training data simultaneously comprises a sample voice signal and a lip movement related region image synchronously collected with the sample voice signal, the multi-mode voice recognition model acquires the characteristics of the lip movement related region image as target image characteristics corresponding to the sample voice signal;
the sample voice signal is a voice signal of a first language; after the multi-modal speech recognition model is trained, the method further comprises:
Obtaining the voice characteristics of the sample voice signal of the second language through the voice characteristic extraction module of the multi-mode voice recognition model;
processing each basic image feature in the preset data set by using the voice feature of the sample voice signal of the second language through the image feature generation module of the multi-mode voice recognition model to obtain a target image feature corresponding to the sample voice signal of the second language;
performing voice recognition according to the voice characteristics of the second-language sample voice signal and the target image characteristics corresponding to the second-language sample voice signal by the recognition module of the multi-mode voice recognition model to obtain a voice recognition result of the second-language sample voice signal;
and updating parameters of the voice feature extraction module, the image feature generation module and the recognition module with the goal of making the voice recognition result of the second-language sample voice signal approach the voice content of the second-language sample voice signal.
3. The method of claim 1, wherein the processing each basic image feature in the preset data set by using the sample voice signal comprises:
obtaining weights of the basic image features by using the sample voice signal;
and weighting and summing the basic image features by using the weights of the basic image features to obtain target image features corresponding to the sample voice signals.
4. The method according to claim 3, wherein the obtaining weights of the basic image features by using the sample voice signal comprises:
respectively performing spatial transformation on the voice characteristics of the sample voice signal and on each basic image feature by using spatial transformation parameters;
and calculating the weight of each basic image feature by using the transformed voice characteristics and the transformed basic image features.
5. The method of claim 4, wherein updating parameters of the multimodal speech recognition model includes updating the spatial transformation parameters.
6. The method according to any one of claims 1-5, wherein the process of obtaining the basic image features from the known lip movement related region images comprises:
acquiring a lip movement related region image sequence synchronously acquired with a plurality of known voice signals;
sampling each lip movement related region image sequence respectively to obtain a basic lip movement related region image corresponding to each voice signal;
And acquiring the characteristic of each basic lip movement related region image as the basic image characteristic.
7. The method according to any one of claims 1-5, wherein the process of obtaining the basic image features from the known lip movement related region images comprises:
acquiring the characteristics of a plurality of known lip movement related region images;
clustering the characteristics of the known lip movement related region images to obtain a plurality of clusters;
and extracting the clustering center of each cluster as the basic image characteristic.
8. The method of claim 7, wherein clustering the features of the known lip movement related region images comprises:
for the feature of each lip movement related region image to be clustered, determining a clustering center with the minimum distance from the feature of the lip movement related region image as a target clustering center;
the features of the lip movement related region image are aggregated to a cluster to which the target cluster center belongs;
updating the cluster center of the cluster to which the target cluster center belongs.
9. The method of claim 7, wherein acquiring the features of the plurality of known lip movement related region images comprises:
Acquiring the characteristics of the known lip movement related region images by using an image characteristic extraction model;
the image feature extraction model is the image feature extraction module, used for extracting features of lip movement related region images, in a lip language recognition model trained by taking lip movement related region images and the lip pronunciation content corresponding to the lip movement related region images as training data.
10. A multi-modal speech recognition model training apparatus, comprising:
the data acquisition module is used for acquiring training data through the multi-modal voice recognition model;
the first feature acquisition module is used for processing each basic image feature in a preset data set by utilizing the sample voice signal through the multi-mode voice recognition model if the training data only comprises the sample voice signal, so as to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to known lip movement related region images;
the recognition module is used for carrying out voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics through the multi-mode voice recognition model to obtain a voice recognition result of the sample voice signal;
and the updating module is used for updating, through the multi-modal voice recognition model, parameters of the multi-modal voice recognition model with the goal of making the voice recognition result of the sample voice signal approach the voice content of the sample voice signal.
11. A multi-modal speech recognition model training device, comprising a memory and a processor;
the memory is used for storing programs;
the processor being configured to execute the program to implement the steps of the multi-modal speech recognition model training method of any one of claims 1-9.
12. A computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the multimodal speech recognition model training method according to any of claims 1-9.
CN202010247184.7A 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium Active CN111462733B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010247184.7A CN111462733B (en) 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium
PCT/CN2020/142166 WO2021196802A1 (en) 2020-03-31 2020-12-31 Method, apparatus, and device for training multimode voice recognition model, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010247184.7A CN111462733B (en) 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111462733A CN111462733A (en) 2020-07-28
CN111462733B true CN111462733B (en) 2024-04-16

Family

ID=71682420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247184.7A Active CN111462733B (en) 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111462733B (en)
WO (1) WO2021196802A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium
CN112464993B (en) * 2020-11-05 2022-12-09 苏州浪潮智能科技有限公司 Multi-mode model training method, device, equipment and storage medium
CN114494930B (en) * 2021-09-09 2023-09-22 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device
CN114692778B (en) * 2022-04-13 2023-07-25 北京百度网讯科技有限公司 Multi-mode sample set generation method, training method and device for intelligent inspection
CN116434027A (en) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 Artificial intelligent interaction system based on image recognition


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7840409B2 (en) * 2007-02-27 2010-11-23 Nuance Communications, Inc. Ordering recognition results produced by an automatic speech recognition engine for a multimodal application
KR101092820B1 (en) * 2009-09-22 2011-12-12 현대자동차주식회사 Lipreading and Voice recognition combination multimodal interface system
US9883259B2 (en) * 2011-08-15 2018-01-30 Digimarc Corporation Synchronized metrology in power generation and distribution networks
CN105022470A (en) * 2014-04-17 2015-11-04 中兴通讯股份有限公司 Method and device of terminal operation based on lip reading
US9741342B2 (en) * 2014-11-26 2017-08-22 Panasonic Intellectual Property Corporation Of America Method and apparatus for recognizing speech by lip reading
US10540975B2 (en) * 2016-03-25 2020-01-21 Intel Corporation Technologies for automatic speech recognition using articulatory parameters
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1522431A (en) * 2001-12-12 2004-08-18 �Ҵ���˾ Method and system for non-intrusive speaker verification using behavior model
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
CN102708862A (en) * 2012-04-27 2012-10-03 苏州思必驰信息科技有限公司 Touch-assisted real-time speech recognition system and real-time speech/action synchronous decoding method thereof
CN104217226A (en) * 2014-09-09 2014-12-17 天津大学 Dialogue act identification method based on deep neural networks and conditional random fields
CN108804453A (en) * 2017-04-28 2018-11-13 上海荆虹电子科技有限公司 A kind of video and audio recognition methods and device
CN110019776A (en) * 2017-09-05 2019-07-16 腾讯科技(北京)有限公司 Article classification method and device, storage medium
CN108182477A (en) * 2017-12-26 2018-06-19 南京信息工程大学 A kind of quantum perceptron method measured based on POVM
CN107945789A (en) * 2017-12-28 2018-04-20 努比亚技术有限公司 Audio recognition method, device and computer-readable recording medium
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 A kind of audio recognition method, device, equipment and storage medium
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN108520741A (en) * 2018-04-12 2018-09-11 科大讯飞股份有限公司 A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN109241912A (en) * 2018-09-08 2019-01-18 河南大学 The target identification method based on class brain across media intelligent towards unmanned autonomous system
CN109615016A (en) * 2018-12-20 2019-04-12 北京理工大学 A kind of object detection method of the convolutional neural networks based on pyramid input gain
CN110096966A (en) * 2019-04-10 2019-08-06 天津大学 A kind of audio recognition method merging the multi-modal corpus of depth information Chinese
CN110111783A (en) * 2019-04-10 2019-08-09 天津大学 A kind of multi-modal audio recognition method based on deep neural network
CN110188673A (en) * 2019-05-29 2019-08-30 京东方科技集团股份有限公司 Expression recognition method and device
CN110516536A (en) * 2019-07-12 2019-11-29 杭州电子科技大学 A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN110544479A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Denoising voice recognition method and device
CN110570862A (en) * 2019-10-09 2019-12-13 三星电子(中国)研发中心 voice recognition method and intelligent voice engine device

Also Published As

Publication number Publication date
CN111462733A (en) 2020-07-28
WO2021196802A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
CN111462733B (en) Multi-modal speech recognition model training method, device, equipment and storage medium
CN110751208B (en) Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder
Tao et al. End-to-end audiovisual speech recognition system with multitask learning
CN112131988B (en) Method, apparatus, device and computer storage medium for determining virtual character lip shape
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
US11257493B2 (en) Vision-assisted speech processing
CN110517689B (en) Voice data processing method, device and storage medium
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN109147763B (en) Audio and video keyword identification method and device based on neural network and inverse entropy weighting
CN110111808B (en) Audio signal processing method and related product
KR20060090687A (en) System and method for audio-visual content synthesis
CN112188306B (en) Label generation method, device, equipment and storage medium
CN109948721A (en) A kind of video scene classification method based on video presentation
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN112967713A (en) Audio-visual voice recognition method, device, equipment and storage medium based on multi-modal fusion
WO2023137922A1 (en) Voice message generation method and apparatus, computer device and storage medium
CN111554279A (en) Multi-mode man-machine interaction system based on Kinect
CN108538283B (en) Method for converting lip image characteristics into voice coding parameters
CN114567819B (en) Video generation method, device, electronic equipment and storage medium
CN111462732B (en) Speech recognition method and device
CN111488813A (en) Video emotion marking method and device, electronic equipment and storage medium
CN114387945A (en) Voice generation method and device, electronic equipment and storage medium
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN116665695B (en) Virtual object mouth shape driving method, related device and medium
CN112183430A (en) Sign language identification method and device based on double neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant