CN111462733A - Multi-modal speech recognition model training method, device, equipment and storage medium - Google Patents

Multi-modal speech recognition model training method, device, equipment and storage medium

Info

Publication number
CN111462733A
CN111462733A
Authority
CN
China
Prior art keywords
recognition model
sample
modal
voice
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010247184.7A
Other languages
Chinese (zh)
Other versions
CN111462733B (en)
Inventor
景子君
潘嘉
吴华鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202010247184.7A
Publication of CN111462733A
Priority to PCT/CN2020/142166 (WO2021196802A1)
Application granted
Publication of CN111462733B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a multi-modal speech recognition model training method, device, equipment and storage medium. In the training process of the multi-modal speech recognition model, the training data may contain a single audio signal (namely, an audio signal for which no video signal was synchronously acquired) together with a preset data set used to generate the corresponding image features based on that audio signal. This enriches the training data set used in training the multi-modal speech recognition model, improves the generalization capability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.

Description

Multi-modal speech recognition model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training a multi-modal speech recognition model.
Background
Traditional speech recognition techniques obtain a recognition result by processing only the speech signal, and this approach can achieve high recognition accuracy in a clean speech environment. However, in some high-noise, far-field environments, the recognition rate of traditional speech recognition techniques drops rapidly. In order to improve the speech recognition rate, multi-modal speech recognition methods that perform speech recognition with the assistance of lip motion video have been proposed, which improves the speech recognition rate in high-noise scenes to a certain extent.
However, existing multi-modal speech recognition models have weak generalization capability, which results in poor reliability of the multi-modal speech recognition models.
Therefore, how to improve the reliability of the multi-modal speech recognition model becomes an urgent technical problem to be solved.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, a device and a storage medium for training a multi-modal speech recognition model to improve the reliability of the multi-modal speech recognition model.
In order to achieve the above object, the following solutions are proposed:
a multi-modal speech recognition model training method, comprising:
acquiring training data through the multi-modal voice recognition model;
if the training data only contains a sample voice signal, the multi-modal voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating the parameters of the multi-modal speech recognition model by taking the speech recognition result of the sample speech signal approaching the speech content of the sample speech signal as a target.
The above method, preferably, further comprises:
and if the training data simultaneously comprises a sample voice signal and a lip movement related area image synchronously acquired with the sample voice signal, the multi-modal voice recognition model acquires the characteristics of the lip movement related area image as target image characteristics corresponding to the sample voice signal.
In the above method, preferably, the processing of each basic image feature in the preset data set by using the sample voice signal includes:
obtaining the weight of each basic image characteristic by using the sample voice signal;
and weighting and summing all the basic image features by utilizing the weight of each basic image feature to obtain the target image feature corresponding to the sample voice signal.
The above method, preferably, the obtaining the weight of each basic image feature by using the sample speech signal includes:
respectively carrying out space conversion on the voice characteristics of the sample voice signal and each basic image characteristic by using space conversion parameters;
and calculating the weight of each basic image characteristic by using the converted voice characteristic and the converted basic image characteristic.
In the method, preferably, the updating of the parameters of the multi-modal speech recognition model includes updating the spatial conversion parameters.
In the above method, preferably, the sample speech signal is a speech signal of a first language; after the multi-modal speech recognition model is trained, the method further comprises the following steps:
acquiring the voice characteristics of a sample voice signal of a second language through a voice characteristic extraction module of the multi-modal voice recognition model;
processing each basic image feature in the preset data set by using the voice feature of the sample voice signal of the second language through an image feature generation module of the multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal of the second language;
performing voice recognition according to the voice characteristics of the sample voice signal of the second language and the target image characteristics corresponding to the sample voice signal of the second language through a recognition module of the multi-modal voice recognition model to obtain a voice recognition result of the sample voice signal of the second language;
and updating the parameters of the voice feature extraction module, the image feature generation module and the recognition module by taking the voice recognition result of the sample voice signal of the second language approaching to the voice content of the sample voice signal of the second language as a target.
In the above method, preferably, the process of obtaining the basic image features from known lip movement related region images includes:
acquiring a lip movement related region image sequence synchronously acquired with a plurality of known voice signals;
sampling each lip movement related region image sequence respectively to obtain a basic lip movement related region image corresponding to each voice signal;
and acquiring the characteristics of each basic lip movement related area image as the basic image characteristics.
In the above method, preferably, the process of obtaining the basic image features from known lip movement related region images may alternatively include:
acquiring the characteristics of a plurality of known lip movement related area images;
clustering the characteristics of the known lip movement related area images to obtain a plurality of cluster clusters;
and extracting the clustering center of each clustering cluster as the basic image feature.
In the above method, preferably, the clustering of the features of the several known lip movement related region images includes:
for the characteristics of each lip movement related area image to be clustered, determining a clustering center with the minimum distance from the characteristics of the lip movement related area image as a target clustering center;
aggregating the features of the lip movement related region image to a cluster to which the target cluster center belongs;
and updating the cluster center of the cluster to which the target cluster center belongs.
In the above method, preferably, the acquiring of the features of the plurality of known lip movement related region images includes:
and acquiring the characteristics of the images of the known lip movement related regions by using an image characteristic extraction model.
In the above method, preferably, the image feature extraction model is: an image feature extraction module, used for extracting features of lip movement related region images, in a lip language recognition model trained by taking lip movement related region images and the corresponding lip pronunciation contents as training data.
A multi-modal speech recognition model training apparatus, comprising:
the data acquisition module is used for acquiring training data through the multi-mode voice recognition model;
the first feature acquisition module is used for processing each basic image feature in a preset data set by using the sample voice signal through the multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal if the training data only contains the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
the recognition module is used for carrying out voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics through the multi-modal voice recognition model to obtain a voice recognition result of the sample voice signal;
and the updating module is used for updating the parameters of the multi-modal speech recognition model by taking the speech recognition result of the sample speech signal approaching the speech content of the sample speech signal as a target through the multi-modal speech recognition model.
A multi-modal speech recognition model training device, comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the multi-modal speech recognition model training method according to any one of the above descriptions.
A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the multi-modal speech recognition model training method according to any one of the above.
According to the technical scheme, after the multi-modal speech recognition model acquires the training data, if the training data only contain the sample speech signal, each basic image feature in the preset data set obtained from known lip movement related region images is processed by using the sample speech signal, so as to obtain the target image feature corresponding to the sample speech signal; speech recognition is performed according to the speech features of the sample speech signal and the target image feature to obtain a speech recognition result of the sample speech signal; and the parameters of the multi-modal speech recognition model are updated with the aim that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal. Based on this model training scheme, the training data used for the multi-modal speech recognition model may contain a single audio signal (namely, one for which no video signal was synchronously acquired) together with a data set used to generate the corresponding image features based on that audio signal. This enriches the training data set, improves the generalization capability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1a is a flowchart of an implementation of a multi-modal speech recognition model training method disclosed in an embodiment of the present application;
FIG. 1b is a flowchart illustrating another implementation of a multi-modal speech recognition model training method disclosed in the embodiments of the present application;
FIG. 2a is a flow chart of one implementation of obtaining basic image features from known lip movement-related region images as disclosed in embodiments of the present application;
FIG. 2b is a flowchart of another implementation of obtaining basic image features from known lip movement-related region images, as disclosed in an embodiment of the present application;
FIG. 3 is a flow chart of an implementation of the multi-modal speech recognition model disclosed in the embodiment of the present application, which uses a sample speech signal to process each of the basic image features in a preset data set to obtain a target image feature corresponding to the sample speech signal;
FIG. 4 is a schematic diagram of a multi-modal speech recognition model according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating an implementation of further training a first multi-modal speech recognition model with a sample speech signal of a second language after obtaining the first multi-modal speech recognition model according to an embodiment of the present application;
FIG. 6a is a schematic structural diagram of a multi-modal speech recognition model training apparatus according to an embodiment of the present application;
FIG. 6b is a schematic structural diagram of a multi-modal speech recognition model training apparatus according to an embodiment of the present application;
fig. 7 is a block diagram of a hardware structure of a multi-modal speech recognition model training device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The inventors of the present application have found that existing multi-modal speech recognition models are trained on audio-video synchronous data (namely, data in which the speech and the lip video of a speaker are collected synchronously). Such audio-video synchronous data are difficult to obtain and small in quantity, so existing multi-modal speech recognition models can only be trained on small data sets, which makes them generalize poorly and overfit. That is, the trained multi-modal speech recognition model performs well on the training data set but poorly on the test data set.
In order to overcome the above technical problems, the basic idea of the present application is that the training data set used in training the multi-modal speech recognition model can be enriched with audio-only data (namely, data in which only the speech of a speaker is collected and no video of the speaker is collected), optionally combined with synchronously collected audio and video data, thereby improving the generalization capability of the multi-modal speech recognition model and the reliability of multi-modal speech recognition.
Based on the foregoing basic idea, an implementation flowchart of the multi-modal speech recognition model training method provided in the embodiment of the present application is shown in fig. 1a, and may include:
step S111: training data is obtained through a multi-modal speech recognition model.
The training data may contain only the sample voice signal, or may contain both the sample voice signal and a lip movement related region image acquired in synchronization with the sample voice signal. That is, the training data set used to train the multi-modal speech recognition model may contain two types of training data: one type is a single speech signal, and the other type is a synchronously captured speech signal and video. In the embodiment of the present application, the speech signals in the training data set are collectively referred to as sample speech signals.
Step S112: if the training data only contain a sample voice signal, processing each basic image feature in the preset data set by using the sample voice signal through a multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained from known lip movement related region images.
If the training data only contains the sample voice signal, it indicates that the lip movement related area image is not synchronously acquired when the sample voice signal is acquired.
The known lip movement related region image refers to an image, or a portion of an image, in the audio-video synchronous data. Wherein,
the lip movement related region may refer to the lip region only; or,
the lip movement related region may be the lips and their surrounding regions, e.g., the lip and chin region; or, the lip movement related region may be the entire face region.
In the embodiment of the present application, the basic image feature set, i.e. the preset data set, is determined in advance according to several known lip movement related region images. In the process of training the multi-modal voice recognition model, if the training data is a single voice, a virtual lip language feature corresponding to the single voice is generated by using the single voice and the basic image feature set and is used as a target image feature corresponding to the sample voice signal.
There are various ways to obtain the basic image features, and two preferred implementations are described below:
referring to fig. 2a, fig. 2a is a flowchart illustrating an implementation of obtaining basic image features according to a known lip movement-related region image according to an embodiment of the present application, which may include:
step S211: a sequence of lip motion-related region images acquired in synchronization with a number of known speech signals is acquired.
If N known voice signals are synchronously acquired, the number of the lip motion related region image sequences is also N.
Step S212: and respectively sampling each lip movement related area image sequence to obtain a basic lip movement related area image corresponding to each voice signal.
The sampling rate of each lip movement related region image sequence is not specifically limited, and only one frame of lip movement related region image may be sampled in each lip movement related region image sequence, or two or more frames of lip movement related region images may be sampled in each lip movement related region image sequence.
The specific sampling mode may be random sampling, or sampling may be performed according to a predetermined sampling mode. For example, 1 frame is sampled per Q frame, etc.
Step S213: and acquiring the characteristics of each basic lip movement related area image as basic image characteristics.
Optionally, the image feature extraction model may be used to obtain features of the basic lip movement related region image. The image feature extraction model may specifically be: an image feature extraction module, used for extracting features of lip movement related region images, in a lip language recognition model trained by taking lip movement related region images and the corresponding lip pronunciation contents as training data. Specifically, the basic lip movement related region image may be input into the lip language recognition model, and the features output by the image feature extraction module in the lip language recognition model are the basic image features.
In the embodiment of the present application, the specific architecture of the lip language recognition model is not limited, but the image feature extraction module may be included regardless of the architecture of the lip language recognition model. For example, in an alternative embodiment, the lip language recognition model may include: the image feature extraction module is used for extracting features of an image sequence input into the lip language recognition model; and the lip language identification module is used for carrying out lip language identification according to the features extracted by the image feature extraction module. The training process of the lip language recognition model can refer to the existing training method, and is not detailed here.
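As a rough illustration of the sampling-based construction described above, the following Python sketch builds the preset data set from synchronized image sequences; the function names, array shapes and the fixed sampling interval are assumptions for illustration, not details taken from this disclosure.

import numpy as np

def build_basic_image_features(image_sequences, extract_feature, sample_every_q=5):
    """Build the preset data set from lip movement related region image sequences
    captured in sync with known speech signals (Fig. 2a, steps S211-S213).

    image_sequences : list of arrays, one sequence per known speech signal,
                      each of assumed shape (num_frames, H, W, C)
    extract_feature : callable mapping one image to a feature vector, e.g. the
                      image feature extraction module of a pretrained lip language
                      recognition model
    sample_every_q  : assumed fixed sampling interval (1 frame every Q frames)
    """
    basic_features = []
    for seq in image_sequences:
        # Step S212: sample basic lip movement related region images from the sequence
        sampled = seq[::sample_every_q]
        # Step S213: take their features as basic image features
        for img in sampled:
            basic_features.append(extract_feature(img))
    return np.stack(basic_features)  # shape: (n_basic_features, feature_dim)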
Referring to fig. 2b, fig. 2b is a flowchart illustrating another implementation of obtaining basic image features according to a known lip movement-related region image according to an embodiment of the present application, where the flowchart may include:
step S221: features of a number of known lip movement-related region images are acquired.
The number of known lip movement related region images may be all images in the sequences of lip movement related region images acquired in synchronization with the plurality of known speech signals. The features of the lip movement related region image may be features obtained by using the image feature extraction model in the embodiment shown in fig. 2a. The features of the lip movement related region image may be a feature vector of a certain dimension, such as a 512-dimensional, 1024-dimensional, 256-dimensional or 128-dimensional feature vector, etc. Optionally, the lip language recognition module is a frame classification network, which may include only a single fully connected layer, so that the features extracted by the image feature extraction module in the lip language recognition model more directly reflect lip language characteristics, which facilitates obtaining the basic image features.
Step S222: and clustering the characteristics of a plurality of known lip movement related area images to obtain a plurality of cluster clusters.
Alternatively, cosine-distance-based clustering, such as k-means clustering, may be performed on all feature vectors. The number of clusters may be 128, 56, 256, etc., or may be another number, which is not limited herein. The specific clustering process may include:
For the feature of each lip movement related region image to be clustered, determining the cluster center with the minimum distance from the feature of the lip movement related region image as the target cluster center; namely, for the feature of each lip movement related region image to be clustered, the distance between that feature and each cluster center is calculated respectively, and the calculated distances are compared to determine the minimum distance. Specifically, the distance between the feature P of a lip movement related region image and a cluster center Center can be calculated as the cosine distance:
Distance(P, Center) = 1 - (P · Center) / (|P| * |Center|)
and aggregating the characteristics of the lip movement related area images to a cluster to which the target cluster center belongs. And if the distance between the features of the lip movement related area image and the clustering center J is minimum, aggregating the features of the lip movement related area image to the clustering cluster to which the target clustering center J belongs.
And updating the cluster center of the cluster to which the target cluster center belongs. Optionally, a new cluster center of the cluster to which the target cluster center belongs may be determined according to the current cluster center of that cluster, the feature of the lip movement related region image, and the number of image features in the cluster. Specifically, it is assumed that, before the cluster to which the target cluster center belongs is updated, there are n members in the cluster, i.e. 1 cluster center (for convenience of description, denoted as Center_(n-1)) and n-1 cluster points (namely the features of the lip movement related region images in the cluster). After the feature P of the lip movement related region image is aggregated into the cluster to which the target cluster center belongs, the cluster center of that cluster is updated to Center_n, which can be obtained by the following formula:
Center_n = (Center_(n-1) * (n-1) + P) / n
step S223: and extracting the clustering center of each clustering cluster as a basic image feature.
Step S113: and performing voice recognition through the multi-modal voice recognition model according to the voice characteristics of the sample voice signal and the target image characteristics corresponding to the sample voice signal to obtain a voice recognition result of the sample voice signal.
The specific process of obtaining the speech recognition result can be referred to in the prior art, and is not described in detail here.
Step S114: and updating the parameters of the multi-modal speech recognition model by taking the speech recognition result of the sample speech signal approaching the speech content of the sample speech signal as a target through the multi-modal speech recognition model.
In the multi-modal speech recognition model training method disclosed in the embodiment of the application, the training data used during training may contain a single audio signal (namely, one for which no video signal was synchronously acquired) together with a data set used to generate the corresponding image features based on that audio signal. This enriches the training data set used in training the multi-modal speech recognition model, improves the generalization capability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
In order to further enrich the training set, the training data may further include audio data and video data that are synchronously collected, and based on this, please refer to fig. 1b, where fig. 1b is a flowchart of another implementation of the multimodal speech recognition model training method provided in the embodiment of the present application, which may include:
step S121: training data is obtained through a multi-modal speech recognition model.
Step S122: if the training data only contain a sample voice signal, processing each basic image feature in the preset data set by using the sample voice signal through a multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained from known lip movement related region images.
The specific implementation of steps S121 to S122 can refer to the aforementioned steps S111 to S112, which are not described herein again.
Step S123: if the training data simultaneously comprises the sample voice signal and the lip movement related area image synchronously acquired with the sample voice signal, the characteristics of the lip movement related area image are acquired through the multi-modal voice recognition model and serve as the target image characteristics corresponding to the sample voice signal.
If the training data simultaneously contain the voice and the lip movement related region image, features are extracted directly from the lip movement related region image to obtain the target image features corresponding to the sample voice signal. In the embodiment of the application, whether it is acquired directly or cropped from an acquired image, the image acquired synchronously with the voice signal is referred to as the lip movement related region image.
Step S124: speech recognition is performed according to the speech features of the sample speech signal and the target image features corresponding to the sample speech signal by the multi-modal speech recognition model, so as to obtain a speech recognition result of the sample speech signal.
Step S125: and updating the parameters of the multi-modal speech recognition model by taking the speech recognition result of the sample speech signal approaching the speech content of the sample speech signal as a target through the multi-modal speech recognition model.
The specific implementation of steps S124 to S125 can refer to the aforementioned steps S113 to S114, which are not described herein again.
In the embodiment of the application, the training data set comprises two types of training data (one type of training data is single-voice data, and the other type of training data is synchronously acquired audio data and video data), and the multi-modal voice recognition model is trained based on the training data set, so that the generalization capability of the multi-modal voice processing method can be further improved, and the reliability of the multi-modal voice recognition model can be further improved.
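To make the two-branch training flow of Fig. 1b concrete, here is a schematic Python sketch of a single training step; the model attribute names (speech_feature_extractor, image_feature_generator, image_feature_extractor, recognizer, criterion, backward_and_update) are illustrative assumptions mirroring the modules of Fig. 4, not an actual API.

def training_step(model, batch, preset_basic_features):
    """One conceptual training step of the multi-modal speech recognition model."""
    speech_feat = model.speech_feature_extractor(batch["speech"])
    if batch.get("lip_images") is None:
        # Audio-only training data: generate the target image features from the
        # preset data set using the sample speech signal (steps S121-S122)
        target_img_feat = model.image_feature_generator(speech_feat, preset_basic_features)
    else:
        # Audio-video training data: extract features from the synchronously
        # acquired lip movement related region images (step S123)
        target_img_feat = model.image_feature_extractor(batch["lip_images"])
    # Steps S124-S125: recognize, then update parameters so the recognition result
    # approaches the true speech content
    prediction = model.recognizer(speech_feat, target_img_feat)
    loss = model.criterion(prediction, batch["label"])
    model.backward_and_update(loss)
    return loss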
In an alternative embodiment, an implementation flow chart of the multi-modal speech recognition model processing each basic image feature in the preset data set by using the sample speech signal to obtain the target image feature corresponding to the sample speech signal is shown in fig. 3, and may include:
step S31: the sample speech signal is used to obtain the weights of the respective basic image features.
For the same basic image feature, different sample speech signals may yield different weights.
Optionally, the speech features of the sample speech signal and each of the basic image features may be subjected to spatial conversion respectively using spatial conversion parameters, and the weights of each of the basic image features may be calculated using the converted speech features and the converted basic image features. Wherein,
the first spatial conversion parameter may be used to perform spatial conversion on the speech feature of the sample speech signal to obtain a converted speech feature, and the second spatial conversion parameter may be used to perform spatial conversion on the basic image feature to obtain a converted basic image feature. The second spatial transformation parameter is composed of a plurality of subspace transformation parameters, and each basic image feature corresponds to one subspace transformation parameter.
The first spatial transformation parameter and each subspace transformation parameter may be a spatial transformation matrix.
Alternatively, when the speech feature is A, the weight a_Ai of the i-th basic image feature (i = 1, 2, 3, ..., n; n is the number of basic image features in the preset data set) may be calculated by using the following formula:
a_Ai = exp((K_A*A)^T * (K_Mi*M_i)) / Σ_j exp((K_A*A)^T * (K_Mj*M_j))
wherein K_A represents the spatial transformation matrix corresponding to the speech feature A; M_i represents the i-th basic image feature; K_Mi represents the spatial transformation matrix corresponding to the i-th basic image feature M_i; K_Mj represents the spatial transformation matrix corresponding to the j-th basic image feature M_j; and M_j represents the j-th basic image feature.
Step S32: and weighting and summing the basic image features by using the weights of the basic image features to obtain the target image features corresponding to the sample voice signals.
If feature extraction is performed on the voice signal to obtain a voice feature A, the target image feature M_Ao corresponding to the voice feature A can be expressed by the formula:
M_Ao = Σ_i a_Ai * M_i
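For illustration, a minimal numpy sketch of steps S31-S32 follows, under the assumption that the weights are softmax-normalized similarities between the spatially converted speech feature and the spatially converted basic image features; the matrix shapes are assumptions. Each basic image feature has its own subspace transformation matrix, matching the description that the second spatial conversion parameter consists of one subspace transformation parameter per basic image feature.

import numpy as np

def generate_target_image_feature(speech_feat_A, basic_feats_M, K_A, K_M):
    """Generate the target image feature for an audio-only sample (steps S31-S32).

    speech_feat_A : (d_a,) speech feature A of the sample speech signal
    basic_feats_M : (n, d_m) basic image features M_1..M_n from the preset data set
    K_A           : (d, d_a) spatial transformation matrix for the speech feature
    K_M           : (n, d, d_m) one spatial transformation matrix per basic image feature
    """
    a_conv = K_A @ speech_feat_A                          # converted speech feature
    m_conv = np.einsum("ndm,nm->nd", K_M, basic_feats_M)  # converted basic image features
    scores = m_conv @ a_conv                              # similarity in the common space
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                     # assumed softmax-normalized weights a_Ai
    # Step S32: weighted sum of the basic image features
    return weights @ basic_feats_M                        # target image feature M_Ao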
in an alternative embodiment, a structural schematic diagram of the multi-modal speech recognition model is shown in fig. 4, and may include:
a voice feature extraction module 41, an image feature generation module 42, an image feature extraction module 43 and a recognition module 44; wherein,
the voice feature extraction module 41 is configured to obtain a voice feature of the sample voice signal. The speech features may be hidden layer features of acoustic features such as fbank features, or mel-frequency cepstral coefficient (MFCC) features, etc.
The image feature generation module 42 is configured to, if the training data acquired by the multimodal speech recognition model only includes a sample speech signal, process each basic image feature in the preset data set by using the sample speech signal to obtain a target image feature corresponding to the sample speech signal.
The image feature extraction module 43 is configured to, if the training data obtained by the multi-modal speech recognition model includes both the sample speech signal and the lip movement related region image synchronously acquired with the sample speech signal, perform feature extraction on the lip movement related region image to obtain a target image feature corresponding to the sample speech signal.
When acquiring audio-video synchronous data, an audio signal of a certain duration and the video within that duration are generally acquired.
Taking the lip movement related region image only containing the lip region as an example, the lip movement related region image may be a mouth region image taking a predetermined size in the captured video image with the center point of the mouth as the center.
The frame rate of the acquired video is usually 25 fps. In order to synchronize with the video data, in the embodiment of the present application, a sliding window is used to frame the acquired voice signal. Specifically, 100 fps speech frames can be obtained by sliding a window with a window length of 25 ms and a frame shift of 10 ms over the acquired voice signal, and initial features (such as fbank features) are extracted for each speech frame to obtain an initial fbank feature sequence, where each initial fbank feature is a 40-dimensional vector. In the embodiment of the present application, the sample speech signal input into the multi-modal speech recognition model is the 100 fps initial fbank feature sequence of the sample speech signal. The voice feature extraction module 41 performs feature extraction on the 100 fps initial fbank feature sequence to obtain 25 fps, 512-dimensional speech feature vectors (usually hidden layer features).
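A small Python sketch of the framing arithmetic described above (25 ms window, 10 ms shift, giving 100 fps speech frames aligned with 25 fps video); the 16 kHz sampling rate is an assumption, and the fbank computation itself is omitted.

import numpy as np

def frame_speech_signal(waveform, sample_rate=16000, win_ms=25, shift_ms=10):
    """Frame a speech signal with a 25 ms window and 10 ms shift, giving 100 fps
    speech frames that are then turned into 40-dimensional initial fbank features."""
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = [waveform[i:i + win]
              for i in range(0, len(waveform) - win + 1, shift)]
    return np.stack(frames)  # roughly 100 frames per second of audio

# The 100 fps initial fbank sequence is then reduced by the speech feature
# extraction module to 25 fps, 512-dimensional hidden features, i.e. one speech
# feature vector per video frame (video captured at 25 fps): 100 fps / 4 = 25 fps.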
The recognition module 44 is configured to perform speech recognition according to the speech features and the target image features to obtain a speech recognition result of the sample speech signal. Specifically, the recognition module fuses the voice feature and the target image feature to obtain a fused feature, and then performs voice recognition by using the fused feature to obtain a voice recognition result.
The loss function Loss of the multi-modal speech recognition model may be:
Loss = α * CELoss(φ(A, V), Label) + (1 - α) * CELoss(φ_M(A, M), Label)
wherein, if the training data obtained by the multi-modal speech recognition model only comprise the sample speech signal, α is equal to 0; if the training data simultaneously comprise the sample speech signal and the lip movement related region image synchronously collected with the sample speech signal, α is equal to 1. φ(A, V) represents the output of the multi-modal speech recognition model when the training data simultaneously comprise the sample speech signal and the lip movement related region image; φ_M(A, M) represents the output of the multi-modal speech recognition model when the training data only comprise the sample speech signal; Label represents the label corresponding to the training data, namely the real speech content; and CELoss represents the cross-entropy loss function.
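The following numpy sketch mirrors the loss above, with α switching between the audio-video output φ(A, V) and the audio-only output φ_M(A, M); treating the label as a single class index over one output frame is a simplifying assumption.

import numpy as np

def cross_entropy(logits, label):
    # Simple cross-entropy for one output frame; label is the index of the true unit
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return -np.log(probs[label] + 1e-12)

def multimodal_loss(output_av, output_am, label, has_video):
    """Loss = alpha * CELoss(phi(A, V), Label) + (1 - alpha) * CELoss(phi_M(A, M), Label).
    alpha = 1 when the training data contain the synchronously acquired lip images,
    alpha = 0 when they contain only the sample speech signal."""
    alpha = 1.0 if has_video else 0.0
    loss = 0.0
    if output_av is not None:
        loss += alpha * cross_entropy(output_av, label)
    if output_am is not None:
        loss += (1.0 - alpha) * cross_entropy(output_am, label)
    return loss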
In an alternative embodiment, the updating of the parameters of the multi-modal speech recognition model includes updating the parameters of the speech feature extraction module 41, the parameters of the image feature generation module 42, the parameters of the image feature extraction module 43 and the parameters of the recognition module 44. Since the parameters of the image feature generation module 42 include the spatial conversion parameters, the updating of the parameters of the multi-modal speech recognition model includes updating the spatial conversion parameters.
In an alternative embodiment, in order to further improve the recognition accuracy of the multi-modal speech recognition model, some functional modules may be pre-trained before the multi-modal speech recognition model is trained.
Optionally, before training the multi-modal speech recognition model, the initial parameters of the speech feature extraction module 41 may be parameters of a feature extraction module, which is used for performing feature extraction on a speech signal in a speech recognition model trained by using the speech signal and corresponding speech content as training data.
That is, the initial parameters of the speech feature extraction module 41 are the parameters of the feature extraction module in the speech recognition model trained with pure speech samples.
In the embodiment of the present application, the specific architecture of the speech recognition model is not limited, but the feature extraction module may be included regardless of the architecture of the speech recognition model. For example, in an alternative embodiment, the speech recognition model may include: the characteristic extraction module is used for extracting acoustic characteristics of the input voice recognition model; and the recognition module is used for carrying out voice recognition according to the features extracted by the feature extraction module. The training process of the speech recognition model can refer to the existing training method, and is not detailed here.
The speech samples used for training the speech recognition model may or may not include the speech samples used for training the multi-modal speech recognition model, and the present application is not limited thereto.
Optionally, before training the multi-modal speech recognition model, the initial parameters of the image feature extraction module 43 may be parameters of an image feature extraction module, which is used for performing feature extraction on the image sequence, in a lip speech recognition model trained by using the image sequence and its corresponding pronunciation content as training data.
That is, the initial parameters of the image feature extraction module 43 are the parameters of the image feature extraction module in the lip language recognition model trained by using pure image sequence samples.
In the embodiment of the present application, the specific architecture of the lip language recognition model is not limited, but the image feature extraction module may be included regardless of the architecture of the lip language recognition model. For example, in an alternative embodiment, the lip language recognition model may include: the image feature extraction module is used for extracting features of an image sequence input into the lip language recognition model; and the recognition module is used for carrying out lip language recognition according to the features extracted by the image feature extraction module. The training process of the lip language recognition model can refer to the existing training method, and is not detailed here.
The image sequence samples used for training the lip language recognition model may or may not include the image sequence samples used for training the multi-modal speech recognition model, and the present application is not limited thereto.
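As an illustration of this initialization strategy, a schematic sketch is given below; the load_state/get_state helpers and module names are hypothetical, standing in for whatever parameter-copy mechanism an actual implementation uses.

def initialize_from_pretrained(model, pretrained_asr, pretrained_lipreader):
    """Optional pre-training based initialization before multi-modal training.
    - speech feature extraction module 41: parameters taken from the feature
      extraction module of a speech recognition model trained on pure speech samples
    - image feature extraction module 43: parameters taken from the image feature
      extraction module of a lip language recognition model trained on image sequences
    """
    model.speech_feature_extractor.load_state(pretrained_asr.feature_extractor.get_state())
    model.image_feature_extractor.load_state(pretrained_lipreader.image_feature_extractor.get_state())
    return model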
In an alternative embodiment, the speech signal included in the training data is a speech signal of a first language, and after the multi-modal speech recognition model is trained, the trained multi-modal speech recognition model can be used for performing speech recognition of the first language. The first language may be any one of the following languages, for example: chinese, english, korean, japanese, french, italian, etc.
After training the multimodal speech recognition model using the training data of the first language, in the case where the training data of the second language does not include video data, the trained multimodal speech recognition model (for convenience of description, referred to as the first multimodal speech recognition model) may be migrated to the multimodal speech recognition model for multimodal speech recognition of the second language (for convenience of description, referred to as the second multimodal speech recognition model). That is, if the training data set of the first language has audio/video synchronous data, and the training data set of the second language does not have audio/video synchronous data, the multi-modal speech recognition model may be trained according to the aforementioned method by using the training data set of the first language, after the first multi-modal speech recognition model is obtained by training using the training data set of the first language, the trained first multi-modal speech recognition model may be further trained by using the training data set of the second language, so as to obtain a second multi-modal speech recognition model, and the second multi-modal speech recognition model may realize multi-modal speech recognition by using the audio/video synchronous data of the second language. The second multi-modal speech recognition model is obtained by training on the basis of the first multi-modal speech recognition model, and the first multi-modal speech recognition model is obtained by pre-training, so that the first multi-modal speech recognition model is trained by utilizing the training data set of the second language, the convergence speed is high, the accuracy of the multi-modal speech recognition model obtained by training for performing multi-modal speech recognition on the audio and video synchronous data of the second language is high, and the multi-modal speech recognition model can migrate among different languages.
Specifically, after obtaining the first multi-modal speech recognition model, an implementation flowchart of further training the first multi-modal speech recognition model by using the sample speech signal of the second language is shown in fig. 5, and may include:
step S51: the speech feature of the sample speech signal of the second language is obtained by the speech feature extraction module 41 of the first multimodal speech recognition model.
Step S52: and processing each basic image feature in the preset data set by using the voice feature of the sample voice signal of the second language through the image feature generation module 42 of the first multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal of the second language.
Step S53: and performing voice recognition through the recognition module 44 of the first multi-modal voice recognition model according to the voice feature of the sample voice signal of the second language and the target image feature corresponding to the sample voice signal of the second language to obtain a voice recognition result of the sample voice signal of the second language.
Step S54: the parameters of the speech feature extraction module 41, the image feature generation module 42, and the recognition module 44 are updated with the aim that the speech recognition result of the sample speech signal in the second language approaches the speech content of the sample speech signal in the second language.
Since the training data set of the second language only contains single-voice data, the image feature extraction module 43 is not used when the training data set of the second language is used to train the first multi-modal speech recognition model, so that the parameters of the image feature extraction module in the trained second multi-modal speech recognition model are the same as the parameters of the image feature extraction module in the first multi-modal speech recognition model.
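A schematic Python sketch of the second-language fine-tuning flow of Fig. 5 follows; module and method names are illustrative assumptions, and the key point mirrored from the description is that the image feature extraction module is left untouched because the second-language training data contain no video.

def finetune_for_second_language(model, second_language_batches, preset_basic_features):
    """Fine-tune the first multi-modal speech recognition model with audio-only
    sample speech signals of the second language (Fig. 5, steps S51-S54)."""
    # The image feature extraction module is not used, so its parameters stay
    # identical to those of the first multi-modal speech recognition model.
    trainable = (model.speech_feature_extractor, model.image_feature_generator, model.recognizer)
    for batch in second_language_batches:
        speech_feat = model.speech_feature_extractor(batch["speech"])              # step S51
        target_img_feat = model.image_feature_generator(speech_feat,
                                                        preset_basic_features)     # step S52
        prediction = model.recognizer(speech_feat, target_img_feat)                # step S53
        loss = model.criterion(prediction, batch["label"])                         # step S54
        model.update_parameters(loss, modules=trainable)
    return model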
Corresponding to the embodiment of the method, the embodiment of the application also provides a multi-mode speech recognition model training device.
As shown in fig. 6a, a schematic structural diagram of a multi-modal speech recognition model training apparatus provided in an embodiment of the present application may include:
a data acquisition module 611, a first feature acquisition module 612, an identification module 613 and an update module 614; wherein,
the data obtaining module 611 is configured to obtain training data through the multi-modal speech recognition model;
the first feature obtaining module 612 is configured to, if the training data only includes a sample voice signal, process, by using the multi-modal voice recognition model, each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
the recognition module 613 is configured to perform voice recognition according to the voice feature of the sample voice signal and the target image feature through the multi-modal voice recognition model to obtain a voice recognition result of the sample voice signal;
the updating module 614 is configured to update parameters of the multimodal speech recognition model by using the multimodal speech recognition model as a target that a speech recognition result of the sample speech signal approaches a speech content of the sample speech signal.
With the multi-modal speech recognition model training device provided in the embodiment of the application, the training data used during training of the multi-modal speech recognition model may contain a single audio signal (namely, one for which no video signal was synchronously acquired) together with a data set used to generate the image features corresponding to that audio signal. This enriches the training data set, improves the generalization capability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
In an alternative embodiment, the first feature obtaining module 612 may include:
the weight acquisition module is used for acquiring the weight of each basic image feature by utilizing the sample voice signal through the multi-modal voice recognition model if the training data only contains the sample voice signal;
and the target acquisition module is used for weighting and summing all the basic image features by utilizing the weights of all the basic image features through the multi-modal voice recognition model to obtain the target image features corresponding to the sample voice signals.
In an alternative embodiment, the weight obtaining module may include:
the spatial conversion module is used for respectively carrying out spatial conversion on the voice characteristics of the sample voice signals and all the basic image characteristics by utilizing spatial conversion parameters through the multi-modal voice recognition model if the training data only contains the sample voice signals;
and the calculation module is used for calculating the weight of each basic image characteristic by utilizing the converted voice characteristic and the converted basic image characteristic through the multi-modal voice recognition model.
In an alternative embodiment, the updating module 614 updates the parameters of the multi-modal speech recognition model by: updating the spatial transformation parameters.
In an alternative embodiment, the sample speech signal is a speech signal of a first language; the multi-modal speech recognition model training device is further used for: acquiring the voice characteristics of a sample voice signal of a second language through a voice characteristic extraction module of the multi-modal voice recognition model;
the first feature obtaining module 612 is further configured to: processing each basic image feature in the preset data set by using the voice feature of the sample voice signal of the second language through an image feature generation module of the multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal of the second language;
the identifying module 613 is further configured to: performing voice recognition according to the voice characteristics of the sample voice signal of the second language and the target image characteristics corresponding to the sample voice signal of the second language through a recognition module of the multi-modal voice recognition model to obtain a voice recognition result of the sample voice signal of the second language;
the update module 614 is further configured to: and updating the parameters of the voice feature extraction module, the image feature generation module and the recognition module by taking the voice recognition result of the sample voice signal of the second language approaching to the voice content of the sample voice signal of the second language as a target.
In an optional embodiment, the multi-modal speech recognition model training apparatus may further include:
the basic image characteristic acquisition module is used for acquiring a lip movement related region image sequence synchronously acquired with a plurality of known voice signals; sampling each lip movement related region image sequence respectively to obtain a basic lip movement related region image corresponding to each voice signal; and acquiring the characteristics of each basic lip movement related area image as the basic image characteristics.
In an optional embodiment, the multi-modal speech recognition model training apparatus may further include:
the basic image characteristic acquisition module is used for acquiring the characteristics of a plurality of known lip movement related area images; clustering the characteristics of the known lip movement related area images to obtain a plurality of cluster clusters; and extracting the clustering center of each clustering cluster as the basic image feature.
In an alternative embodiment, the basic image feature obtaining module is specifically configured to, when clustering features of the known lip movement-related region images:
for the characteristics of each lip movement related area image to be clustered, determining a clustering center with the minimum distance from the characteristics of the lip movement related area image as a target clustering center;
aggregating the features of the lip movement related region image to a cluster to which the target cluster center belongs;
and updating the cluster center of the cluster to which the target cluster center belongs.
In an optional embodiment, when the basic image feature obtaining module obtains features of a plurality of known lip-related images, the basic image feature obtaining module is specifically configured to:
and acquiring the characteristics of the images of the known lip movement related regions by using an image characteristic extraction model.
In an optional embodiment, the image feature extraction model is: an image feature extraction module, used for extracting features of lip movement related region images, in a lip language recognition model trained by taking lip movement related region images and the corresponding lip pronunciation contents as training data.
As shown in fig. 6b, another schematic structural diagram of a multi-modal speech recognition model training apparatus provided in the embodiment of the present application may include:
a data acquisition module 621, a first feature acquisition module 622, a second feature acquisition module 623, a recognition module 624, and an update module 625; wherein,
the data acquisition module 621 is configured to acquire training data through the multi-modal speech recognition model;
the first feature obtaining module 622 is configured to, if the training data only includes a sample voice signal, process, through the multi-modal voice recognition model, each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
the second feature obtaining module 623 is configured to, if the training data simultaneously includes a sample voice signal and a lip movement related region image synchronously acquired with the sample voice signal, obtain a feature of the lip movement related region image through the multi-modal voice recognition model, and use the feature as a target image feature corresponding to the sample voice signal;
the recognition module 624 is configured to perform voice recognition according to the voice feature of the sample voice signal and the target image feature through the multi-modal voice recognition model, so as to obtain a voice recognition result of the sample voice signal;
the update module 625 is configured to update, through the multi-modal speech recognition model, the parameters of the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
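As a non-authoritative illustration of how the first feature acquisition module may "process each basic image feature using the sample voice signal" (claims 3 and 4 describe this as speech-derived weights followed by a weighted sum), the following PyTorch-style sketch uses learned linear space-conversion parameters, dot-product scoring and a softmax; these concrete choices are assumptions, not the only admissible implementation.

import torch
import torch.nn.functional as F

class ImageFeatureGenerator(torch.nn.Module):
    # Generates target image features for an audio-only training sample by
    # weighting and summing the preset basic image features (dimensions assumed).
    def __init__(self, speech_dim, image_dim, shared_dim):
        super().__init__()
        # Space conversion parameters for the two modalities.
        self.speech_proj = torch.nn.Linear(speech_dim, shared_dim)
        self.image_proj = torch.nn.Linear(image_dim, shared_dim)

    def forward(self, speech_feat, basic_image_feats):
        # speech_feat: (B, T, speech_dim); basic_image_feats: (N, image_dim)
        q = self.speech_proj(speech_feat)           # (B, T, shared_dim)
        k = self.image_proj(basic_image_feats)      # (N, shared_dim)
        weights = F.softmax(q @ k.t(), dim=-1)      # weight of each basic image feature
        return weights @ basic_image_feats          # target image features, (B, T, image_dim)

During training, such a generator would only be used when the sample contains no synchronously acquired lip movement related region image; otherwise the features of the real image are used directly, and in both cases the parameters are updated so that the recognition result approaches the annotated speech content.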
In the multi-modal speech recognition model training apparatus provided by this embodiment of the application, the training data used to train the multi-modal speech recognition model is not limited to synchronously acquired audio and video; it also includes standalone audio signals (i.e., audio without synchronously acquired video) together with a data set used to generate the image features corresponding to such audio. This further enriches the training data set, improves the generalization capability of the multi-modal speech recognition method, and thereby improves the reliability of the multi-modal speech recognition model.
The multi-modal speech recognition model training apparatus provided by the embodiments of the application may be applied to multi-modal speech recognition model training devices, such as a PC terminal, a cloud platform, or a server cluster. Optionally, fig. 7 shows a block diagram of the hardware structure of the multi-modal speech recognition model training device; referring to fig. 7, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiments of the application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the application;
the memory 3 may include a high-speed RAM, and may further include a non-volatile memory, such as at least one disk memory;
the memory stores a program, and the processor may invoke the program stored in the memory; the program is configured to:
acquiring training data through the multi-modal voice recognition model;
if the training data only contains a sample voice signal, the multi-modal voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating the parameters of the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring training data through the multi-modal voice recognition model;
if the training data only contains a sample voice signal, the multi-modal voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating the parameters of the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the features of the various embodiments and of the claims may be combined with one another.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A multi-modal speech recognition model training method, comprising:
acquiring training data through the multi-modal voice recognition model;
if the training data only contains a sample voice signal, the multi-modal voice recognition model processes each basic image feature in a preset data set by using the sample voice signal to obtain a target image feature corresponding to the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
performing voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics to obtain a voice recognition result of the sample voice signal;
and updating the parameters of the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
2. The method of claim 1, further comprising:
and if the training data simultaneously comprises a sample voice signal and a lip movement related area image synchronously acquired with the sample voice signal, the multi-modal voice recognition model acquires the characteristics of the lip movement related area image as target image characteristics corresponding to the sample voice signal.
3. The method of claim 1, wherein processing each of the basic image features in the preset data set using the sample speech signal comprises:
obtaining the weight of each basic image characteristic by using the sample voice signal;
and weighting and summing all the basic image features by utilizing the weight of each basic image feature to obtain the target image feature corresponding to the sample voice signal.
4. The method of claim 3, wherein the obtaining weights for the respective basic image features using the sample speech signal comprises:
respectively carrying out space conversion on the voice characteristics of the sample voice signal and each basic image characteristic by using space conversion parameters;
and calculating the weight of each basic image characteristic by using the converted voice characteristic and the converted basic image characteristic.
5. The method of claim 4, wherein the updating parameters of the multi-modal speech recognition model comprises updating the spatial transformation parameters.
6. The method of claim 1, wherein the sample speech signal is a speech signal in a first language; after the multi-modal speech recognition model is trained, the method further comprises the following steps:
acquiring the voice characteristics of a sample voice signal of a second language through a voice characteristic extraction module of the multi-modal voice recognition model;
processing each basic image feature in the preset data set by using the voice feature of the sample voice signal of the second language through an image feature generation module of the multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal of the second language;
performing voice recognition according to the voice characteristics of the sample voice signal of the second language and the target image characteristics corresponding to the sample voice signal of the second language through a recognition module of the multi-modal voice recognition model to obtain a voice recognition result of the sample voice signal of the second language;
and updating the parameters of the voice feature extraction module, the image feature generation module and the recognition module, with the goal that the voice recognition result of the sample voice signal of the second language approaches the voice content of the sample voice signal of the second language.
7. The method according to any one of claims 1 to 6, wherein the process of obtaining the basic image features from the known lip movement related region image comprises:
acquiring a lip movement related region image sequence synchronously acquired with a plurality of known voice signals;
sampling each lip movement related region image sequence respectively to obtain a basic lip movement related region image corresponding to each voice signal;
and acquiring the characteristics of each basic lip movement related area image as the basic image characteristics.
8. The method according to any one of claims 1 to 6, wherein the process of obtaining the basic image features from the known lip movement related region images comprises:
acquiring the characteristics of a plurality of known lip movement related area images;
clustering the characteristics of the known lip movement related area images to obtain a plurality of cluster clusters;
and extracting the clustering center of each clustering cluster as the basic image feature.
9. The method of claim 8, wherein clustering the features of the plurality of known lip movement related region images comprises:
for the characteristics of each lip movement related area image to be clustered, determining a clustering center with the minimum distance from the characteristics of the lip movement related area image as a target clustering center;
aggregating the features of the lip movement related region image to a cluster to which the target cluster center belongs;
and updating the cluster center of the cluster to which the target cluster center belongs.
10. The method of claim 8, wherein said obtaining features of a plurality of known lip movement related region images comprises:
obtaining the characteristics of the known lip movement related area images by using an image characteristic extraction model;
the image feature extraction model is the image feature extraction module, used for extracting the features of lip movement related region images, of a lip language recognition model trained with lip movement related region images and the corresponding lip pronunciation content as training data.
11. A multi-modal speech recognition model training apparatus, comprising:
the data acquisition module is used for acquiring training data through the multi-mode voice recognition model;
the first feature acquisition module is used for processing each basic image feature in a preset data set by using the sample voice signal through the multi-modal voice recognition model to obtain a target image feature corresponding to the sample voice signal if the training data only contains the sample voice signal; the basic image features are obtained according to the known lip movement related region images;
the recognition module is used for carrying out voice recognition according to the voice characteristics of the sample voice signal and the target image characteristics through the multi-modal voice recognition model to obtain a voice recognition result of the sample voice signal;
and the update module is used for updating, through the multi-modal speech recognition model, the parameters of the multi-modal speech recognition model, with the goal that the speech recognition result of the sample speech signal approaches the speech content of the sample speech signal.
12. A multi-modal speech recognition model training apparatus, comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the multi-modal speech recognition model training method according to any one of claims 1 to 10.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for training a multi-modal speech recognition model according to any one of claims 1-10.
CN202010247184.7A 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium Active CN111462733B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010247184.7A CN111462733B (en) 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium
PCT/CN2020/142166 WO2021196802A1 (en) 2020-03-31 2020-12-31 Method, apparatus, and device for training multimode voice recognition model, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010247184.7A CN111462733B (en) 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111462733A true CN111462733A (en) 2020-07-28
CN111462733B CN111462733B (en) 2024-04-16

Family

ID=71682420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010247184.7A Active CN111462733B (en) 2020-03-31 2020-03-31 Multi-modal speech recognition model training method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111462733B (en)
WO (1) WO2021196802A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434027A (en) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 Artificial intelligent interaction system based on image recognition


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022470A (en) * 2014-04-17 2015-11-04 中兴通讯股份有限公司 Method and device of terminal operation based on lip reading
US9741342B2 (en) * 2014-11-26 2017-08-22 Panasonic Intellectual Property Corporation Of America Method and apparatus for recognizing speech by lip reading
US10540975B2 (en) * 2016-03-25 2020-01-21 Intel Corporation Technologies for automatic speech recognition using articulatory parameters
CN111462733B (en) * 2020-03-31 2024-04-16 科大讯飞股份有限公司 Multi-modal speech recognition model training method, device, equipment and storage medium

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1522431A (en) * 2001-12-12 2004-08-18 国际商业机器公司 Method and system for non-intrusive speaker verification using behavior model
US20080208585A1 (en) * 2007-02-27 2008-08-28 Soonthorn Ativanichayaphong Ordering Recognition Results Produced By An Automatic Speech Recognition Engine For A Multimodal Application
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
US20160198245A1 (en) * 2011-08-15 2016-07-07 Geoffrey B. Rhoads Synchronized metrology in power generation and distribution networks
CN102708862A (en) * 2012-04-27 2012-10-03 苏州思必驰信息科技有限公司 Touch-assisted real-time speech recognition system and real-time speech/action synchronous decoding method thereof
CN104217226A (en) * 2014-09-09 2014-12-17 天津大学 Dialogue act identification method based on deep neural networks and conditional random fields
CN108804453A (en) * 2017-04-28 2018-11-13 上海荆虹电子科技有限公司 A kind of video and audio recognition methods and device
CN110019776A (en) * 2017-09-05 2019-07-16 腾讯科技(北京)有限公司 Article classification method and device, storage medium
CN108182477A (en) * 2017-12-26 2018-06-19 南京信息工程大学 A kind of quantum perceptron method measured based on POVM
CN107945789A (en) * 2017-12-28 2018-04-20 努比亚技术有限公司 Audio recognition method, device and computer-readable recording medium
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 A kind of audio recognition method, device, equipment and storage medium
CN108389573A (en) * 2018-02-09 2018-08-10 北京易真学思教育科技有限公司 Language Identification and device, training method and device, medium, terminal
CN108520741A (en) * 2018-04-12 2018-09-11 科大讯飞股份有限公司 A kind of whispering voice restoration methods, device, equipment and readable storage medium storing program for executing
CN109241912A (en) * 2018-09-08 2019-01-18 河南大学 The target identification method based on class brain across media intelligent towards unmanned autonomous system
CN109615016A (en) * 2018-12-20 2019-04-12 北京理工大学 A kind of object detection method of the convolutional neural networks based on pyramid input gain
CN110096966A (en) * 2019-04-10 2019-08-06 天津大学 A kind of audio recognition method merging the multi-modal corpus of depth information Chinese
CN110111783A (en) * 2019-04-10 2019-08-09 天津大学 A kind of multi-modal audio recognition method based on deep neural network
CN110188673A (en) * 2019-05-29 2019-08-30 京东方科技集团股份有限公司 Expression recognition method and device
CN110516536A (en) * 2019-07-12 2019-11-29 杭州电子科技大学 A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN110544479A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Denoising voice recognition method and device
CN110570862A (en) * 2019-10-09 2019-12-13 三星电子(中国)研发中心 voice recognition method and intelligent voice engine device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021196802A1 (en) * 2020-03-31 2021-10-07 科大讯飞股份有限公司 Method, apparatus, and device for training multimode voice recognition model, and storage medium
CN112464993A (en) * 2020-11-05 2021-03-09 苏州浪潮智能科技有限公司 Multi-mode model training method, device, equipment and storage medium
CN112464993B (en) * 2020-11-05 2022-12-09 苏州浪潮智能科技有限公司 Multi-mode model training method, device, equipment and storage medium
CN114494930A (en) * 2021-09-09 2022-05-13 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
CN114494930B (en) * 2021-09-09 2023-09-22 马上消费金融股份有限公司 Training method and device for voice and image synchronism measurement model
CN113782048A (en) * 2021-09-24 2021-12-10 科大讯飞股份有限公司 Multi-modal voice separation method, training method and related device
CN114692778A (en) * 2022-04-13 2022-07-01 北京百度网讯科技有限公司 Multi-modal sample set generation method, training method and device for intelligent inspection
CN114692778B (en) * 2022-04-13 2023-07-25 北京百度网讯科技有限公司 Multi-mode sample set generation method, training method and device for intelligent inspection

Also Published As

Publication number Publication date
CN111462733B (en) 2024-04-16
WO2021196802A1 (en) 2021-10-07

Similar Documents

Publication Publication Date Title
CN111462733A (en) Multi-modal speech recognition model training method, device, equipment and storage medium
CN110838286B (en) Model training method, language identification method, device and equipment
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN110517689B (en) Voice data processing method, device and storage medium
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN110808034A (en) Voice conversion method, device, storage medium and electronic equipment
CN110111775A (en) A kind of Streaming voice recognition methods, device, equipment and storage medium
CN110853618A (en) Language identification method, model training method, device and equipment
CN110853617B (en) Model training method, language identification method, device and equipment
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN110111808B (en) Audio signal processing method and related product
CN111816162B (en) Voice change information detection method, model training method and related device
KR20060090687A (en) System and method for audio-visual content synthesis
CN111326143B (en) Voice processing method, device, equipment and storage medium
CN112188306B (en) Label generation method, device, equipment and storage medium
CN111554279A (en) Multi-mode man-machine interaction system based on Kinect
CN111326152A (en) Voice control method and device
EP4392972A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN111312223A (en) Training method and device of voice segmentation model and electronic equipment
CN111222854A (en) Interview method, device and equipment based on interview robot and storage medium
KR20200095947A (en) Electronic device and Method for controlling the electronic device thereof
CN115937726A (en) Speaker detection method, device, equipment and computer readable storage medium
Liu et al. MSDWild: Multi-modal Speaker Diarization Dataset in the Wild.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant