CN115861670A - Training method of feature extraction model and data processing method and device

Info

Publication number: CN115861670A
Application number: CN202211415707.XA
Authority: CN (China)
Legal status: Pending
Prior art keywords: data, feature extraction, sample data, features, extraction model
Other languages: Chinese (zh)
Inventors: 万根顺, 潘嘉, 熊世富, 高建清, 刘聪, 胡国平, 刘庆峰
Current Assignee: iFlytek Co Ltd
Original Assignee: iFlytek Co Ltd
Application filed by iFlytek Co Ltd

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a training method of a feature extraction model, a data processing method and a data processing device. The training method comprises the following steps: acquiring sample data of at least one modality; executing a supervised task corresponding to the modality to which the sample data belongs, and acquiring data features of the sample data generated in the execution process of the supervised task; clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data feature matched with the sample data based on the similarity between the reference data features and the data features of the sample data; and training the feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data. The method and the device provided by the invention strengthen the distinguishability and representation capability of the guide labels used during training of the feature extraction model, thereby accelerating the convergence of the feature extraction model and improving its expression capability.

Description

Training method of feature extraction model and data processing method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a training method of a feature extraction model, a data processing method and a data processing device.
Background
Multi-modal data are complementary and, compared with single-modal data, better meet the generalization requirements of complex scenes. In addition, because multi-modal data are difficult to acquire and label, multi-modal pre-training techniques are currently used to build a multi-modal pre-training framework that can then be transferred to various multi-modal tasks.
However, the training objectives of multi-modal pre-training are mostly constructed from unsupervised data, typically using audio-video synchronization information, cross-modal information clustering, mask prediction and the like. Because the distinctiveness among the resulting features is insufficient, these approaches usually suffer from a long pre-training period and a poor pre-training effect.
Disclosure of Invention
The invention provides a training method of a feature extraction model, a data processing method and a data processing device, so as to overcome the defect of poor multi-modal pre-training effect in the prior art.
The invention provides a training method of a feature extraction model, which comprises the following steps:
obtaining sample data of at least one modality, the at least one modality comprising audio and/or video;
executing a supervised task corresponding to the modality to which the sample data belongs, and acquiring data features of the sample data generated in the execution process of the supervised task;
clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data feature matched with the sample data based on the similarity between the reference data features and the data features of the sample data;
and training a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
According to the training method of the feature extraction model provided by the invention, the training of the feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data comprises the following steps:
performing mask processing on the sample data of the at least one modality to obtain mask data of the at least one modality;
based on an initial feature extraction model, performing feature prediction on a mask part in the mask data of the at least one modality to obtain prediction features of the mask part;
and performing parameter iteration on the initial feature extraction model based on the reference data features matched with the sample data and the prediction features of the mask part to obtain the feature extraction model.
According to a training method of a feature extraction model provided by the present invention, in a case that the at least one modality includes audio and video, the feature prediction is performed on a mask portion in mask data of the at least one modality based on an initial feature extraction model to obtain a predicted feature of the mask portion, including:
and respectively extracting the audio features of the mask data of the audio and the video features of the mask data of the video based on the initial feature extraction model, and fusing the audio features and the video features to obtain the predicted features of the mask part of the audio and the predicted features of the mask part of the video.
According to the training method of the feature extraction model provided by the invention, the parameter iteration is performed on the initial feature extraction model based on the reference data features matched with the sample data and the prediction features of the mask part to obtain the feature extraction model, and the training method comprises the following steps:
determining a first loss based on a reference data feature that matches the sample data and a predicted feature of the mask portion;
determining a second loss based on the audio features and the video features;
and performing parameter iteration on the initial feature extraction model based on the first loss and the second loss to obtain the feature extraction model.
According to the training method of the feature extraction model provided by the invention, the sample data of the video comprises a complete human face region.
According to the training method of the feature extraction model provided by the invention, the supervised task corresponding to the audio comprises at least one of voice recognition, voiceprint recognition and prosody recognition;
and the corresponding supervised task of the video comprises emotion recognition and/or face recognition.
The invention also provides a data processing method, which comprises the following steps:
acquiring data to be processed of at least one modality, wherein the at least one modality comprises audio and/or video;
performing data processing on the data to be processed based on a multi-modal processing model;
the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is obtained by training based on sample data of an audio and video mode and reference data features matched with the sample data, and the reference data features are obtained by clustering the data features of the sample data generated in the process of executing a supervised task corresponding to an audio and a video respectively.
The invention also provides a training device of the feature extraction model, which comprises:
a sample obtaining unit, configured to obtain sample data of at least one modality, where the at least one modality includes audio and/or video;
the feature acquisition unit is used for executing a supervised task corresponding to the modality to which the sample data belongs and acquiring the data features of the sample data generated in the execution process of the supervised task;
the reference determining unit is used for clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data feature matched with the sample data based on the similarity between the reference data features and the data features of the sample data;
and the training unit is used for training a feature extraction model based on the sample data of the at least one mode and the reference data features matched with the sample data.
The present invention also provides a data processing apparatus comprising:
the data acquisition unit is used for acquiring data to be processed of at least one modality, and the at least one modality comprises audio and/or video;
the data processing unit is used for processing the data to be processed based on a multi-modal processing model;
the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is obtained by training based on sample data of an audio and video mode and reference data features matched with the sample data, and the reference data features are obtained by clustering the data features of the sample data generated in the process of executing a supervised task corresponding to an audio and a video respectively.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the training method of the feature extraction model.
The invention also provides an electronic device, which comprises a microphone and/or a camera, and further comprises a memory, a processor and a computer program which is stored on the memory and can be run on the processor, wherein the microphone is used for collecting audio data to be processed;
the camera is used for acquiring data to be processed of the video;
the processor executes a multi-mode processing model in the computer program to process the data to be processed, the multi-mode processing model is obtained by carrying out transfer learning on a feature extraction model, the feature extraction model is obtained by training sample data of an audio-video mode and reference data features matched with the sample data, and the reference data features are obtained by clustering the data features of the sample data generated in the process of executing supervision tasks corresponding to audio and video respectively.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method of training a feature extraction model as described in any of the above, or a method of processing data as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method of training a feature extraction model as described in any one of the above, or a method of processing data as described above.
According to the training method of the feature extraction model and the data processing method and device provided by the invention, high-level data features of the sample data are extracted through supervised tasks, and more representative reference data features are then obtained through clustering and used as guide labels during training of the feature extraction model, so that the distinguishability and representation capability of the guide labels are strengthened, the convergence of the feature extraction model is accelerated, and the expression capability of the feature extraction model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a training method of a feature extraction model provided in the present invention;
FIG. 2 is a schematic flow chart illustrating step 140 of the training method for feature extraction models provided in the present invention;
FIG. 3 is a second schematic flowchart of a training method for feature extraction models according to the present invention;
FIG. 4 is a schematic flow chart of a data processing method provided by the present invention;
FIG. 5 is a schematic structural diagram of a training apparatus for feature extraction models provided in the present invention;
FIG. 6 is a schematic diagram of a data processing apparatus according to the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to the present invention;
FIG. 8 is a second schematic structural diagram of the electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the application and popularization of artificial intelligence technology, data of a single modality are easily affected by factors such as noise interference, which causes information loss or weak expression capability, making it difficult to meet the generalization requirements of complex scenes. Therefore, the complementarity between multi-modal data is usually exploited to enhance modality information. In addition, because multi-modal data are difficult to acquire and label, multi-modal pre-training techniques are currently used to build a multi-modal pre-training framework that can then be transferred to various multi-modal tasks.
However, the training objectives of multi-modal pre-training are mostly constructed from unsupervised data, typically using audio-video synchronization information, cross-modal information clustering, mask prediction and the like. Because the distinctiveness among the resulting features is insufficient, these approaches usually suffer from a long pre-training period and a poor pre-training effect.
To address this problem, the invention provides a training method of a feature extraction model. The feature extraction model can serve as a multi-modal pre-training framework and be transferred to various subsequent multi-modal tasks.
Fig. 1 is a schematic flow chart of a training method of a feature extraction model provided in the present invention, as shown in fig. 1, the method includes:
step 110, sample data of at least one modality is obtained, wherein the at least one modality includes audio and/or video.
The sample data here is a sample for training the feature extraction model. The sample data may comprise at least one modality, for example the sample data may be audio data, or the sample data may be video data. To meet the requirements of multi-modal pre-training, the sample data needs to include at least two modalities, i.e., the sample data may include audio data and video data, and may also include data of other modalities besides audio and video modalities, such as text data.
It is understood that the sample data here is unsupervised data.
Step 120, executing a supervised task corresponding to the modality to which the sample data belongs, and acquiring data features of the sample data generated in the process of executing the supervised task.
Step 130, clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data feature matched with the sample data based on the similarity between the reference data features and the data features of the sample data.
Specifically, in order to solve the problem of insufficient distinctiveness between features in unsupervised pre-training in the related art, the embodiment of the invention applies existing supervised tasks to extract high-level features from unsupervised sample data, from which reference data features are derived and used as guide labels for pre-training the feature extraction model, thereby accelerating the convergence of the feature extraction model and improving its expression capability.
Here, the selection of the supervised task depends on the modality of the sample data itself. For example, sample data of the audio modality is generally applied to supervised tasks such as speech recognition, voiceprint recognition and prosody recognition, while sample data of the video modality is generally applied to supervised tasks such as emotion recognition and face recognition. That is, different modalities may correspond to different supervised tasks.
After the sample data is obtained, a supervised task can be executed on the sample data based on a pre-trained supervised model corresponding to the modality to which the sample data belongs. For example, voice recognition may be performed on sample data of the audio modality, and emotion recognition may be performed on sample data of the video modality.
In the execution process of the supervised task, the supervised model extracts the features of sample data and acquires the supervised task result based on the extracted features. Therefore, in the embodiment of the invention, the characteristics obtained by performing characteristic extraction through the supervised model in the process can be obtained and are marked as the data characteristics of the sample data. For example, speech recognition may be performed on sample data of an audio modality based on a supervised speech recognition model, and a final hidden layer feature in the supervised speech recognition model is used as a content feature of each frame of speech in the sample data of the audio modality, that is, a data feature of the sample data of the audio modality; for another example, emotion recognition may be performed on sample data of the video modality based on the supervised emotion recognition model, and a last hidden layer feature in the supervised emotion recognition model is used as an emotion feature of each frame image in the sample data of the video modality, that is, a data feature of the sample data of the video modality. It will be appreciated that feature extraction is directed to obtaining supervised task results, and the resulting data features are high-level features.
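As an illustration of this step, the sketch below collects the last hidden-layer outputs of a trained supervised model (for example a speech recognition or emotion recognition network) as frame-level data features; the use of a forward hook and the module names are assumptions for illustration, not part of the original disclosure.

```python
import torch

def collect_hidden_features(supervised_model, last_hidden_layer, batches):
    """Run a trained supervised model on unsupervised sample data and keep its
    last hidden-layer outputs as frame-level data features.

    supervised_model / last_hidden_layer: a trained torch.nn.Module and its
    final hidden sub-module (assumed names); batches: iterable of input tensors.
    """
    captured = []

    def hook(_module, _inputs, output):
        # output: (batch, frames, hidden_dim) -> flatten to (batch*frames, hidden_dim)
        captured.append(output.detach().reshape(-1, output.shape[-1]).cpu())

    handle = last_hidden_layer.register_forward_hook(hook)
    supervised_model.eval()
    with torch.no_grad():
        for batch in batches:
            supervised_model(batch)   # the task prediction itself is discarded
    handle.remove()
    return torch.cat(captured, dim=0)  # (total_frames, hidden_dim)
```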
After the high-level data features of the sample data are obtained, more representative features, namely the reference data features, can be selected from them to serve as reference labels in subsequent training of the feature extraction model. It can be understood that, in step 130, the selection and matching of reference data features are performed separately for the sample data of each modality.
Here, taking any modality as an example, the data features of the sample data under this modality may be clustered to obtain a plurality of feature clusters, and the cluster center of each feature cluster may be used as a representative reference data feature under this modality, so that a plurality of reference data features may exist for one modality. Moreover, considering the scale of the sample data under one modality, clustering the data features of all sample data may consume a large amount of time and computing resources; alternatively, a subset of the sample data under the modality may be extracted, and only the data features of this subset used for clustering. When extracting this subset, factors such as the source and form of the sample data under the modality need to be balanced so that the extracted subset is itself diverse, and thus the reference data features obtained by clustering remain representative.
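A minimal sketch of this clustering step, assuming the frame-level data features of one modality are available as a NumPy array; the use of scikit-learn's KMeans and the sampling ratio are illustrative choices not specified here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_reference_features(frame_features, num_clusters, sample_ratio=0.1, seed=0):
    """Cluster a sampled subset of frame-level data features of one modality
    and return the cluster centers as the reference data features."""
    rng = np.random.default_rng(seed)
    n = frame_features.shape[0]
    # sample a diverse subset to keep clustering tractable
    idx = rng.choice(n, size=max(1, int(n * sample_ratio)), replace=False)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    kmeans.fit(frame_features[idx])
    return kmeans.cluster_centers_   # (num_clusters, feature_dim)
```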
After the reference data features under the modality are obtained, feature classification can be performed on all sample data under the modality, that is, the reference data feature matched with each piece of sample data is selected by calculating the similarity between the data features of the sample data and each reference data feature under the modality to which it belongs.
Here, the reference data feature with the highest similarity may be selected as the reference data feature matching the sample data. In addition, the probability that the sample data matches each reference data feature may be calculated from the similarity between the data feature of the sample data and each reference data feature under the modality to which it belongs, so as to determine the matched reference data feature. The probability $p(c \mid A_x)$ can be expressed in the following form:

$$p(c \mid A_x) = \frac{\exp\left(\mathrm{sim}(A_x, e_c)/\tau\right)}{\sum_{c'=1}^{C}\exp\left(\mathrm{sim}(A_x, e_{c'})/\tau\right)}$$

where $A_x$ denotes a frame-level data feature of the sample data, $e_c$ denotes the corresponding reference data feature, $C$ denotes the number of reference data features, $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between the data feature and a reference data feature, and $\tau$ is a hyperparameter that adjusts the degree of scaling.
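The assignment defined by this formula can be sketched as follows; torch tensors, cosine similarity as the similarity function and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def match_reference_features(frame_features, reference_features, tau=0.1):
    """Compute p(c | A_x) for each frame feature and return the index of the
    best-matching reference data feature, used as the guide label."""
    # cosine similarity between every frame feature and every reference feature
    sim = F.cosine_similarity(frame_features.unsqueeze(1),      # (N, 1, D)
                              reference_features.unsqueeze(0),  # (1, C, D)
                              dim=-1)                           # -> (N, C)
    probs = torch.softmax(sim / tau, dim=-1)   # p(c | A_x)
    return probs.argmax(dim=-1)                # guide-label index per frame
```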
It can be understood that the reference data features matched with the sample data are guide labels when the sample data is applied to feature extraction model training.
Step 140, training a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
Specifically, after the sample data and the reference data characteristics matched with the sample data are obtained, the sample data can be used as a training sample of the feature extraction model, the reference data characteristics matched with the sample data are used as training labels, and the feature extraction model is trained, so that a feature extraction model which can be used as a transfer learning basis in the following process is obtained.
It can be understood that, if sample data of a plurality of modalities exist, they may be input into the feature extraction model in parallel, so that the feature extraction model can better learn feature extraction based on the relationships between the sample data of the plurality of modalities.
Taking the at least one modality including audio and/or video as an example, the trained feature extraction model can extract high-level features from data of the audio and/or video modality, and can be applied to transfer learning of downstream tasks such as voice recognition, identity recognition, conversational recognition and emotion recognition, so as to extract the semantic features, identity features, conversational features, emotion features and the like required by these downstream tasks.
According to the method provided by the embodiment of the invention, high-level data features of the sample data are extracted through supervised tasks, and more representative reference data features are then obtained through clustering and used as guide labels during training of the feature extraction model, so that the distinguishability and representation capability of the guide labels are strengthened, the convergence of the feature extraction model is accelerated, and the expression capability of the feature extraction model is improved.
Based on the above embodiment, fig. 2 is a schematic flow chart of step 140 in the training method of the feature extraction model provided by the present invention, as shown in fig. 2, step 140 includes:
step 141, performing mask processing on the sample data of the at least one modality to obtain mask data of the at least one modality;
step 142, performing feature prediction on a mask part in the mask data of the at least one modality based on an initial feature extraction model to obtain a predicted feature of the mask part;
step 143, performing parameter iteration on the initial feature extraction model based on the reference data features matched with the sample data and the prediction features of the mask part to obtain the feature extraction model.
Specifically, the training of the feature extraction model may be implemented in a mask prediction manner. In other words, in the training process of the feature extraction model, the input of the initial feature extraction model is mask data, which is sample data subjected to mask processing, and the output of the initial feature extraction model is a predicted feature of the mask data obtained by performing feature extraction on the mask data, including a predicted feature of a mask portion obtained by performing feature prediction on a mask portion covered in the mask data. The masking processing is performed on the sample data, that is, a part of frames in the sample data are masked in a random masking manner.
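A sketch of the random masking described above, assuming frame-level tensors; the mask ratio and the replacement of masked frames with zeros are illustrative choices rather than prescribed values.

```python
import torch

def random_mask(frames, mask_ratio=0.3):
    """Randomly mask a portion of the frames in sample data.

    frames: (batch, num_frames, feature_dim) tensor.
    Returns the masked frames and a boolean mask marking masked positions.
    """
    batch, num_frames, _ = frames.shape
    mask = torch.rand(batch, num_frames) < mask_ratio   # True = masked frame
    masked_frames = frames.clone()
    masked_frames[mask] = 0.0   # illustrative: masked frames replaced by zeros
    return masked_frames, mask
```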
After the prediction features aiming at the mask part and output by the initial feature extraction model are obtained, the reference data features corresponding to the sample data which are matched in advance can be compared with the prediction features of the mask part, the loss is determined based on the difference between the reference data features and the prediction features, and then the initial feature extraction model is subjected to parameter iteration by applying the loss, so that the feature extraction model is obtained. For example, for a case where sample data of both audio and video modalities exists, a mask portion of audio and a mask portion of video may be taken as a part of an objective function of parameter iteration, and thus, a loss for parameter iteration may be expressed as a sum of a loss determined based on the mask portion of audio and a loss determined based on the mask portion of video.
In this process, the initial feature extraction model may learn a mapping relationship between sample data of at least one modality and a reference data feature, and, for a case where the sample data is input as a plurality of modalities, the initial feature extraction model may learn complementary features of inter-modality data in a feature extraction process, thereby enhancing reliability of output features.
Based on any of the above embodiments, in the case that the at least one modality includes audio and video, step 142 includes:
and respectively extracting the audio features of the mask data of the audio and the video features of the mask data of the video based on the initial feature extraction model, and fusing the audio features and the video features to obtain the predicted features of the mask part of the audio and the predicted features of the mask part of the video.
Specifically, for multi-modal sample data, namely synchronously acquired audio and video data, in the training process of the feature extraction model, modality-specific feature extraction is first performed on the audio and video data through the initial feature extraction model, that is, the audio features in the mask data of the audio and the video features in the mask data of the video are extracted respectively. On this basis, the extracted audio features and video features can be fused through the initial feature extraction model, so that the complementarity between audio and video data provides richer supplementary information for the data of each modality during feature extraction. Through feature fusion, the predicted features of the mask data of the audio and the predicted features of the mask data of the video can be obtained, and these predicted features include the predicted features of the mask part.
Further, for the fusion of audio and video features, an attention mechanism may be employed to enhance the information complementation between different modalities; for example, cross attention may be applied to enable fusion interaction of the features.
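One way this cross-attention fusion could look, sketched with torch.nn.MultiheadAttention; the feature dimension, number of heads and the symmetric audio-video attention with residual addition are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse audio and video features with cross attention: each modality queries
    the other, and the attended result is added back as complementary information."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.audio_query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_feat, video_feat):
        # audio_feat, video_feat: (batch, frames, dim)
        audio_from_video, _ = self.audio_query_attn(audio_feat, video_feat, video_feat)
        video_from_audio, _ = self.video_query_attn(video_feat, audio_feat, audio_feat)
        fused_audio = audio_feat + audio_from_video  # audio enriched by video
        fused_video = video_feat + video_from_audio  # video enriched by audio
        return fused_audio, fused_video
```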
The method provided by the embodiment of the invention fuses the features of different modalities in the multi-modal feature extraction process, makes full use of the complementarity between multi-modal data, and ensures the reliability of the trained feature extraction model in multi-modal feature extraction.
In any of the above embodiments, where the at least one modality includes audio and video, step 143 includes:
determining a first loss based on a baseline data feature matching the sample data and a predicted feature of the mask portion;
determining a second loss based on the audio feature and the video feature;
and performing parameter iteration on the initial feature extraction model based on the first loss and the second loss to obtain the feature extraction model.
Specifically, in the process of training the feature extraction model based on multi-modal data, the reference data features matched with the sample data can be compared with the predicted features of the mask part output by the initial feature extraction model, and a first loss can be determined based on the difference between them, so that the initial feature extraction model is trained with the reference data features as guide labels. In addition, the identity consistency of the persons in the synchronously recorded multi-modal data can be applied to constrain the initial feature extraction model in the independent feature extraction process for the data of each modality.
In the process of feature extraction of the initial feature extraction model for the mask data of the audio and the mask data of the video, firstly, the audio features in the mask data of the audio and the video features in the mask data of the video are respectively extracted, and then feature fusion is performed on the audio features and the video features. That is, the audio feature and the video feature are obtained by performing feature extraction on data of an audio modality and data of a video modality in the initial feature extraction model.
Considering that the data of the audio modality and the data of the video modality are recorded synchronously, the persons contained in them should be the same person, and thus the person identity information reflected by the audio features obtained from the data of the audio modality and by the video features obtained from the data of the video modality should be consistent.
Based on this, the audio features may be regarded as a feature representation of the voiceprint ID of the person, the video features may be regarded as a feature representation of the face ID of the person, and the second loss may be determined based on the difference between the audio features and the video features under the constraint of consistency of the person identity information. It is understood that the smaller the difference between the audio feature and the video feature, the smaller the difference between the identities of the persons respectively reflected by the audio feature and the video feature, and the smaller the second loss; the greater the difference between the audio features and the video features, the greater the difference between the identities of the persons that the audio features and the video features each reflect, and the greater the second loss.
After the first loss and the second loss are obtained, parameter iteration can be performed on the initial feature extraction model by combining the two. In this process, the first loss and the second loss can be added to serve as a total loss, or they can be weighted and summed to obtain the total loss, and the total loss is then applied to perform parameter iteration on the initial feature extraction model.
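A sketch of how the two losses might be combined; the cross-entropy form of the first loss over masked frames, the cosine-based second loss and the weight lam are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_logits, guide_labels, mask, audio_feat, video_feat, lam=0.1):
    """pred_logits: (batch, frames, C) scores over reference data features;
    guide_labels: (batch, frames) matched reference-feature indices;
    mask: (batch, frames) boolean, True at masked positions;
    audio_feat / video_feat: utterance-level identity features, (batch, dim)."""
    # first loss: prediction of the guide label, only at masked positions
    first = F.cross_entropy(pred_logits[mask], guide_labels[mask])
    # second loss: person-identity consistency between audio and video features
    second = (1.0 - F.cosine_similarity(audio_feat, video_feat, dim=-1)).mean()
    return first + lam * second   # weighted sum as the total loss
```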
The method provided by the embodiment of the invention determines the second loss based on the audio characteristics and the video characteristics so as to realize the constraint of personnel identity consistency in the training process of the feature extraction model and strengthen and supplement identity information, thereby ensuring the reliability of the trained feature extraction model in multi-mode feature extraction.
In addition, current multi-modal pre-training frameworks generalize poorly to downstream tasks. For example, in a common multi-modal pre-training framework, the input of the model includes a lip branch and a voice branch; after training is completed, the model can only be used in scenes focusing on speech information recognition, such as lip language recognition and multi-modal voice recognition, and cannot be generalized to multi-modal interaction tasks focusing on information such as expression and identity, for example multi-modal emotion recognition and multi-modal voiceprint recognition. To address this problem, in the embodiment of the present invention, in the case where the modality includes video, the sample data of the video includes a complete human face region.
Specifically, in the sample data of the video modality, that is, in the sample video data, each frame image includes a complete human face region. Video data containing the complete human face region is used as sample data of the video modality and applied to training of the feature extraction model, so that the feature extraction model does not only pay attention to information of the lip area but can refer to all areas of the human face; therefore, the trained feature extraction model has stronger generalization when applied to downstream tasks as a pre-training framework.
For example, in speech expression, information feedback is mainly reflected in the region below the nose and above the neck (including the neck); in emotion expression, information feedback is mainly reflected in the whole face; in identity confirmation, information feedback is also mainly reflected in the whole face. When the embodiment of the invention is used for model training, the applied sample data of the video includes a complete human face region and can cover the information required by various downstream tasks such as speech expression, emotion expression and identity confirmation, and therefore has better downstream task generalization.
According to any one of the above embodiments, the audio-corresponding supervised task includes at least one of speech recognition, voiceprint recognition and prosody recognition;
and the corresponding supervised task of the video comprises emotion recognition and/or face recognition.
Specifically, the selection of supervised tasks for different modalities needs to take into account the roles played by different modalities in different tasks. For example, for content-type supervised tasks such as speech recognition, the voice branch generally performs better than the video branch, and the data and technical reserves for audio are richer, so audio data can be selected as the data for such supervised tasks. For emotion-type supervised tasks such as emotion recognition, the video branch generally performs better than the voice branch, and the labeling quality of video data is higher than that of audio data, with lower labeling confusion, so video data can be selected as the data for such supervised tasks.
Based on this, the supervised tasks corresponding to different modalities can be determined, so that the reference data features under different modalities are obtained and training guidance for the feature extraction model is realized.
Based on any of the above embodiments, fig. 3 is a second schematic flow chart of the training method of the feature extraction model provided by the present invention. As shown in fig. 3, for training an audio-video multi-modal feature extraction model, sample data of the video and sample data of the audio can first be obtained.
For the sample data of the video, a supervised emotion recognition model can be applied to perform emotion recognition on it, and the last hidden layer representation extracted by the emotion recognition model in the emotion recognition process is obtained and used as the data feature of each frame image in the sample data of the video. Feature clustering is then performed on the data features of each frame image, and the cluster centers of the preset number of categories obtained by clustering are used as the reference data features of the video modality. On this basis, according to the similarity between the data feature of each frame image in the sample data of the video and each reference data feature, the matched reference data feature can be determined and used as the emotion category label of the sample data of the video. Here, feature clustering may be implemented using the K-means clustering algorithm, and the preset number of categories for video modality clustering can refer to the number of emotion categories of the actual emotion recognition task.
For the sample data of the audio, a supervised speech recognition model can be applied to perform speech recognition on it, and the last hidden layer representation extracted by the speech recognition model in the speech recognition process is obtained and used as the data feature of each frame of speech in the sample data of the audio. Feature clustering is then performed on the data features of each frame of speech, and the cluster centers of the preset number of categories obtained by clustering are used as the reference data features of the audio modality. On this basis, according to the similarity between the data feature of each frame of speech in the sample data of the audio and each reference data feature, the matched reference data feature can be determined and used as the content category label of the sample data of the audio. The preset number of categories for audio modality clustering may be determined with reference to the number of modeling units of the actual speech recognition task.
On this basis, random masking can be performed on the sample data of the video and the sample data of the audio respectively, so that the mask data of the video and the mask data of the audio are obtained and input into the initial feature extraction model.
The initial feature extraction model here corresponds to three modules of video feature extraction, audio feature extraction, and feature fusion in fig. 3. The video feature extraction is used for independently extracting the video feature of the mask data of the video modality, the audio feature extraction is used for independently extracting the audio feature of the mask data of the audio modality, and the feature fusion is used for fusing the video feature and the audio feature, so that the prediction feature of the mask data of the video and the prediction feature of the mask data of the audio are output.
The predicted features of the mask data of the video can be compared with the reference data features serving as the emotion category labels, and the predicted features of the mask data of the audio can be compared with the reference data features serving as the content category labels, so that the first loss is calculated. The video features can be regarded as a feature representation of the face ID, the audio features can be regarded as a feature representation of the voiceprint ID, and the second loss can be determined based on the difference between the two to realize the person identity consistency constraint.
Based on the first loss and the second loss, parameter iteration can be carried out on the initial feature extraction model, and therefore the feature extraction model which can be used as an audio and video multi-mode pre-training framework is obtained.
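Tying the above steps together, one parameter-iteration step for the flow of fig. 3 could be sketched as follows; it reuses the random_mask and CrossModalFusion sketches above, and the sub-module names (audio_encoder, video_encoder, audio_head, video_head) and the loss weight lam are hypothetical, not names from the original disclosure.

```python
import torch
import torch.nn.functional as F

def training_step(audio, video, audio_labels, video_labels,
                  model, fusion, optimizer, lam=0.1):
    """One illustrative parameter-iteration step for audio-video pre-training."""
    masked_audio, audio_mask = random_mask(audio)   # (B, T, Da), (B, T)
    masked_video, video_mask = random_mask(video)

    # independent per-modality feature extraction, then cross-attention fusion
    audio_feat = model.audio_encoder(masked_audio)  # (B, T, D)
    video_feat = model.video_encoder(masked_video)  # (B, T, D)
    fused_audio, fused_video = fusion(audio_feat, video_feat)

    # logits over the reference data features (content / emotion category labels)
    audio_logits = model.audio_head(fused_audio)    # (B, T, C_audio)
    video_logits = model.video_head(fused_video)    # (B, T, C_video)

    # first loss: guide-label prediction at masked positions, summed over modalities
    first = (F.cross_entropy(audio_logits[audio_mask], audio_labels[audio_mask])
             + F.cross_entropy(video_logits[video_mask], video_labels[video_mask]))
    # second loss: person-identity consistency between voiceprint- and face-like features
    second = (1.0 - F.cosine_similarity(audio_feat.mean(dim=1),
                                        video_feat.mean(dim=1), dim=-1)).mean()
    loss = first + lam * second

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```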
According to the method provided by the embodiment of the invention, existing supervised tasks are utilized to extract high-level data features from unsupervised sample data, which serve as guide labels of the feature extraction model, accelerating the convergence of the model and improving its expression capability. Meanwhile, in the training process, sample data including a complete human face region are applied, extending the conventional local lip information to global face information and making full use of face-related information such as expression, speech and identity. In addition, the information complementation and information enhancement between the video and voice branches are strengthened from the aspects of emotion, content, identity (ID) attributes and the like, thereby matching more downstream task usage scenarios.
Based on any of the above embodiments, fig. 4 is a schematic flow chart of a data processing method provided by the present invention, as shown in fig. 4, the method includes:
step 410, acquiring data to be processed of at least one modality, wherein the at least one modality comprises audio and/or video;
step 420, processing data of the data to be processed based on a multi-modal processing model;
the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is obtained by training based on sample data of an audio and video mode and reference data features matched with the sample data, and the reference data features are obtained by clustering the data features of the sample data generated in the process of executing a supervised task corresponding to an audio and a video respectively.
Specifically, the data to be processed is data that needs to be processed, and the data to be processed may be in a single mode, for example, audio data or video data, or may be in a multi-mode, for example, synchronously recorded audio and video data, which is not specifically limited in this embodiment of the present invention.
After the data to be processed is obtained, the data to be processed can be input into a multi-modal processing model obtained by pre-training, so that a corresponding data processing result is obtained.
Here, the multi-modal processing model can implement data processing of the audio/video modalities, and the specific data processing function it implements depends on the downstream task to which the feature extraction model is transferred. For example, the data processing functions implemented by the multi-modal processing model may include one or more of identity recognition, emotion recognition, voice recognition, and the like.
In particular, the multi-modal pre-training framework serving as the basis of transfer learning for the multi-modal processing model in the embodiment of the present invention is the feature extraction model obtained by training in the above embodiments. The feature extraction model can support the transfer of different tasks such as multi-modal voice recognition, multi-modal emotion recognition and multi-modal voiceprint recognition, and can also meet the transfer and use of single-modal tasks such as lip language recognition and voice recognition.
In the training process of the feature extraction model, the supervised task corresponding to the modality of the sample data is used to extract high-level data features of the sample data, and more representative reference data features are then obtained through clustering and used as guide labels during training of the feature extraction model, so that the distinguishability and representation capability of the guide labels are strengthened, the convergence of the feature extraction model is accelerated, and the expression capability of the feature extraction model is improved. The specific training method of the feature extraction model is the same as that in the above embodiments, and is not described herein again.
By applying the feature extraction model obtained in this way to transfer learning, a more reliable and accurate multi-modal processing model can be obtained, so that data processing for multi-modal data can be better realized.
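A hedged sketch of this transfer-learning step: the pre-trained feature extraction model is reused as an encoder and only a small task head is added; the class and attribute names, the mean pooling and the class count are hypothetical.

```python
import torch
import torch.nn as nn

class MultiModalProcessingModel(nn.Module):
    """Downstream model built on the pre-trained feature extraction model
    (e.g. for multi-modal emotion recognition); names are illustrative."""
    def __init__(self, pretrained_extractor, feature_dim=256, num_classes=7):
        super().__init__()
        self.extractor = pretrained_extractor      # frozen or fine-tuned encoder
        self.task_head = nn.Linear(feature_dim, num_classes)

    def forward(self, audio=None, video=None):
        feats = self.extractor(audio=audio, video=video)  # (B, T, D) fused features
        pooled = feats.mean(dim=1)                        # utterance-level feature
        return self.task_head(pooled)                     # downstream task logits
```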
The data processing method provided by the embodiment of the invention performs transfer learning on the feature extraction model trained with the reference data features as guide labels, so that the multi-modal processing model can realize more accurate and reliable multi-modal data processing.
Based on any of the above embodiments, fig. 5 is a schematic structural diagram of a training apparatus for feature extraction models provided by the present invention, as shown in fig. 5, the apparatus includes:
a sample obtaining unit 510 for obtaining sample data of at least one modality, the at least one modality comprising audio and/or video;
a feature obtaining unit 520, configured to execute a supervised task corresponding to a modality to which the sample data belongs, and obtain a data feature of the sample data generated in a process of executing the supervised task;
a reference determining unit 530, configured to cluster the data features of the sample data, determine reference data features under the modality to which the sample data belongs based on the clustering result, and determine the reference data feature matched with the sample data based on the similarity between the reference data features and the data features of the sample data;
a training unit 540, configured to train a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
According to the device provided by the embodiment of the invention, high-level data features of the sample data are extracted through supervised tasks, and more representative reference data features are then obtained through clustering and used as guide labels during training of the feature extraction model, so that the distinguishability and representation capability of the guide labels are strengthened, the convergence of the feature extraction model is accelerated, and the expression capability of the feature extraction model is improved.
Based on any of the above embodiments, the training unit 540 includes:
the mask subunit is configured to perform mask processing on the sample data of the at least one modality to obtain mask data of the at least one modality;
the prediction subunit is configured to perform feature prediction on a mask portion in the mask data of the at least one modality based on an initial feature extraction model to obtain a prediction feature of the mask portion;
and the iteration subunit is used for performing parameter iteration on the initial feature extraction model based on the reference data features matched with the sample data and the prediction features of the mask part to obtain the feature extraction model.
Based on any of the above embodiments, the prediction subunit is specifically configured to:
and respectively extracting the audio features of the mask data of the audio and the video features of the mask data of the video based on the initial feature extraction model, and fusing the audio features and the video features to obtain the prediction features of the mask part of the audio and the prediction features of the mask part of the video.
Based on any of the embodiments described above, the iteration subunit is specifically configured to:
determining a first loss based on a baseline data feature matching the sample data and a predicted feature of the mask portion;
determining a second loss based on the audio feature and the video feature;
and performing parameter iteration on the initial feature extraction model based on the first loss and the second loss to obtain the feature extraction model.
Based on any of the above embodiments, the sample data of the video comprises a complete human face region.
According to any one of the above embodiments, the audio-corresponding supervised task includes at least one of speech recognition, voiceprint recognition and prosody recognition;
and the supervision tasks corresponding to the videos comprise emotion recognition and/or face recognition.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of a data processing apparatus provided by the present invention, as shown in fig. 6, the apparatus includes:
a data obtaining unit 610, configured to obtain data to be processed in at least one modality, where the at least one modality includes audio and/or video;
the data processing unit 620 is configured to perform data processing on the data to be processed based on a multi-modal processing model;
the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is obtained by training based on sample data of an audio and video mode and reference data features matched with the sample data, and the reference data features are obtained by clustering the data features of the sample data generated in the process of executing a supervised task corresponding to an audio and a video respectively.
The data processing device provided by the embodiment of the invention performs transfer learning on the feature extraction model trained with the reference data features as guide labels, so that the multi-modal processing model can realize more accurate and reliable multi-modal data processing.
Fig. 7 illustrates one of the entity structure diagrams of an electronic device, which may include, as shown in fig. 7: a processor (processor) 710, a communication Interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may invoke logic instructions in memory 730 to perform a method of training a feature extraction model, the method comprising: obtaining sample data of at least one modality, the at least one modality comprising audio and/or video; executing a supervised task corresponding to the mode to which the sample data belongs, and acquiring data characteristics of the sample data generated in the execution process of the supervised task; clustering the data characteristics of the sample data, determining benchmark data characteristics under the mode of the sample data based on a clustering result, and determining the benchmark data characteristics matched with the sample data based on the similarity between the benchmark data characteristics and the data characteristics of the sample data; and training a feature extraction model based on the sample data of the at least one modality and the benchmark data features matched with the sample data.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Fig. 8 illustrates a second physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: processor (processor) 810, communication Interface (Communications Interface) 820, memory (memory) 830, and communication bus 840, and further including microphone 850 and/or camera 860, wherein processor 810, communication Interface 820, memory 830, microphone 850, and camera 860 communicate with each other via communication bus 840. The microphone 850 is used for collecting audio data to be processed, and the camera 860 is used for collecting video data to be processed; the processor 810 may call a logic instruction in the memory 830 to execute a multi-modal processing model in the computer program, and perform data processing on the data to be processed, where the multi-modal processing model is obtained by performing migration learning on a feature extraction model, the feature extraction model is obtained by training based on sample data of an audio/video modality and reference data features matched with the sample data, and the reference data features are obtained by clustering data features of the sample data generated in a process of executing a supervised task corresponding to audio and video, respectively.
In addition, the logic instructions in the memory 830 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program, and the computer program may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is capable of executing the training method of a feature extraction model provided by the above methods, the method comprising: obtaining sample data of at least one modality, the at least one modality comprising audio and/or video; executing a supervised task corresponding to the modality to which the sample data belongs, and obtaining data features of the sample data generated during execution of the supervised task; clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data features matched with the sample data based on the similarity between the reference data features and the data features of the sample data; and training a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
The present invention further provides a computer program product including a computer program, the computer program being storable on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is further capable of executing the data processing method provided by the above methods, the method comprising: acquiring data to be processed of at least one modality, the at least one modality comprising audio and/or video; and performing data processing on the data to be processed based on a multi-modal processing model; where the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is trained based on sample data of the audio and video modalities and reference data features matched with the sample data, and the reference data features are obtained by clustering data features of the sample data generated during execution of supervised tasks corresponding to audio and video, respectively.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of a feature extraction model provided by the above methods, the method comprising: obtaining sample data of at least one modality, the at least one modality comprising audio and/or video; executing a supervised task corresponding to the modality to which the sample data belongs, and obtaining data features of the sample data generated during execution of the supervised task; clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data features matched with the sample data based on the similarity between the reference data features and the data features of the sample data; and training a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
The present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data processing method provided by the above methods, the method comprising: acquiring data to be processed of at least one modality, the at least one modality comprising audio and/or video; and performing data processing on the data to be processed based on a multi-modal processing model; where the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is trained based on sample data of the audio and video modalities and reference data features matched with the sample data, and the reference data features are obtained by clustering data features of the sample data generated during execution of supervised tasks corresponding to audio and video, respectively.
The above-described apparatus embodiments are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
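As a further non-limiting illustration, the parameter iteration based on a first loss and a second loss described in claims 2 to 4 below may be sketched as follows, assuming a PyTorch implementation in which the first loss is a mean-squared error between the predicted features of the masked portion and the matched reference data features, and the second loss is a cosine distance between the audio features and the video features; the specific loss choices and the function name are illustrative assumptions.

# Illustrative sketch: a first loss between the predicted features of the masked portion
# and the matched reference data features, and a second loss between the audio features
# and the video features, combined for one parameter iteration.
import torch
import torch.nn.functional as F


def first_and_second_losses(predicted_masked, matched_reference, audio_features, video_features):
    first_loss = F.mse_loss(predicted_masked, matched_reference)              # masked prediction vs. reference
    cos = F.cosine_similarity(audio_features, video_features, dim=-1).mean()  # audio-video agreement
    second_loss = 1.0 - cos
    return first_loss, second_loss


if __name__ == "__main__":
    pred = torch.randn(8, 256)    # predicted features of the masked portion
    ref = torch.randn(8, 256)     # reference data features matched with the sample data
    audio = torch.randn(8, 256)   # audio features extracted from the mask data of the audio
    video = torch.randn(8, 256)   # video features extracted from the mask data of the video
    l1, l2 = first_and_second_losses(pred, ref, audio, video)
    total = l1 + l2               # combined objective used to update the initial feature extraction model
    print(float(l1), float(l2), float(total))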

Claims (12)

1. A method for training a feature extraction model, characterized by comprising:
obtaining sample data of at least one modality, the at least one modality comprising audio and/or video;
executing a supervised task corresponding to the modality to which the sample data belongs, and obtaining data features of the sample data generated during execution of the supervised task;
clustering the data features of the sample data, determining reference data features under the modality to which the sample data belongs based on the clustering result, and determining the reference data features matched with the sample data based on the similarity between the reference data features and the data features of the sample data;
and training a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
2. The method for training a feature extraction model according to claim 1, wherein the training a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data comprises:
performing mask processing on the sample data of the at least one modality to obtain mask data of the at least one modality;
performing feature prediction on a masked portion in the mask data of the at least one modality based on an initial feature extraction model to obtain predicted features of the masked portion;
and performing parameter iteration on the initial feature extraction model based on the reference data features matched with the sample data and the predicted features of the masked portion to obtain the feature extraction model.
3. The method for training a feature extraction model according to claim 2, wherein, in a case where the at least one modality comprises audio and video, the performing feature prediction on a masked portion in the mask data of the at least one modality based on the initial feature extraction model to obtain predicted features of the masked portion comprises:
respectively extracting audio features from the mask data of the audio and video features from the mask data of the video based on the initial feature extraction model, and fusing the audio features and the video features to obtain the predicted features of the masked portion of the audio and the predicted features of the masked portion of the video.
4. The method for training a feature extraction model according to claim 3, wherein the performing parameter iteration on the initial feature extraction model based on the reference data features matched with the sample data and the predicted features of the masked portion to obtain the feature extraction model comprises:
determining a first loss based on the reference data features matched with the sample data and the predicted features of the masked portion;
determining a second loss based on the audio feature and the video feature;
and performing parameter iteration on the initial feature extraction model based on the first loss and the second loss to obtain the feature extraction model.
5. The method for training a feature extraction model according to any one of claims 1 to 4, wherein the sample data of the video comprises a complete face region.
6. The method for training a feature extraction model according to any one of claims 1 to 4, wherein the supervised task corresponding to the audio comprises at least one of speech recognition, voiceprint recognition, and prosody recognition;
and the supervised task corresponding to the video comprises emotion recognition and/or face recognition.
7. A data processing method, comprising:
acquiring data to be processed of at least one modality, wherein the at least one modality comprises audio and/or video;
performing data processing on the data to be processed based on a multi-modal processing model;
wherein the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is trained based on sample data of the audio and video modalities and reference data features matched with the sample data, and the reference data features are obtained by clustering data features of the sample data generated during execution of supervised tasks corresponding to audio and video, respectively.
8. A training device for a feature extraction model, characterized by comprising:
a sample obtaining unit, configured to obtain sample data of at least one modality, where the at least one modality includes audio and/or video;
a feature obtaining unit, configured to execute a supervised task corresponding to the modality to which the sample data belongs, and obtain data features of the sample data generated during execution of the supervised task;
a reference determining unit, configured to cluster the data features of the sample data, determine reference data features under the modality to which the sample data belongs based on the clustering result, and determine the reference data features matched with the sample data based on the similarity between the reference data features and the data features of the sample data;
and a training unit, configured to train a feature extraction model based on the sample data of the at least one modality and the reference data features matched with the sample data.
9. A data processing apparatus, comprising:
a data acquisition unit, configured to acquire data to be processed of at least one modality, where the at least one modality comprises audio and/or video;
and a data processing unit, configured to perform data processing on the data to be processed based on a multi-modal processing model;
wherein the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is trained based on sample data of the audio and video modalities and reference data features matched with the sample data, and the reference data features are obtained by clustering data features of the sample data generated during execution of supervised tasks corresponding to audio and video, respectively.
10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for training a feature extraction model according to any one of claims 1 to 6.
11. An electronic device, comprising a microphone and/or a camera, and further comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the microphone is configured to collect audio data to be processed;
the camera is configured to collect video data to be processed;
and the processor executes a multi-modal processing model in the computer program to perform data processing on the data to be processed, where the multi-modal processing model is obtained by performing transfer learning on a feature extraction model, the feature extraction model is trained based on sample data of the audio and video modalities and reference data features matched with the sample data, and the reference data features are obtained by clustering data features of the sample data generated during execution of supervised tasks corresponding to audio and video, respectively.
12. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for training a feature extraction model according to any one of claims 1 to 6, or implements the data processing method according to claim 7.
CN202211415707.XA 2022-11-11 2022-11-11 Training method of feature extraction model and data processing method and device Pending CN115861670A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211415707.XA CN115861670A (en) 2022-11-11 2022-11-11 Training method of feature extraction model and data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211415707.XA CN115861670A (en) 2022-11-11 2022-11-11 Training method of feature extraction model and data processing method and device

Publications (1)

Publication Number Publication Date
CN115861670A true CN115861670A (en) 2023-03-28

Family

ID=85663227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211415707.XA Pending CN115861670A (en) 2022-11-11 2022-11-11 Training method of feature extraction model and data processing method and device

Country Status (1)

Country Link
CN (1) CN115861670A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116776157A (en) * 2023-08-17 2023-09-19 鹏城实验室 Model learning method supporting modal increase and device thereof
CN116776157B (en) * 2023-08-17 2023-12-12 鹏城实验室 Model learning method supporting modal increase and device thereof

Similar Documents

Publication Publication Date Title
CN112465935A (en) Virtual image synthesis method and device, electronic equipment and storage medium
CN114694076A (en) Multi-modal emotion analysis method based on multi-task learning and stacked cross-modal fusion
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
Sadoughi et al. Speech-driven expressive talking lips with conditional sequential generative adversarial networks
CN111930992A (en) Neural network training method and device and electronic equipment
CN111145282A (en) Virtual image synthesis method and device, electronic equipment and storage medium
CN107844481B (en) Text recognition error detection method and device
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN108920639A (en) Context acquisition methods and equipment based on interactive voice
CN113380271B (en) Emotion recognition method, system, device and medium
CN116704085B (en) Avatar generation method, apparatus, electronic device, and storage medium
CN111581970A (en) Text recognition method, device and storage medium for network context
CN114091466A (en) Multi-modal emotion analysis method and system based on Transformer and multi-task learning
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN115187704A (en) Virtual anchor generation method, device, equipment and storage medium
CN115861670A (en) Training method of feature extraction model and data processing method and device
Ivanko et al. An experimental analysis of different approaches to audio–visual speech recognition and lip-reading
CN114171002A (en) Voice recognition method and device, electronic equipment and storage medium
CN111462762B (en) Speaker vector regularization method and device, electronic equipment and storage medium
CN117216234A (en) Artificial intelligence-based speaking operation rewriting method, device, equipment and storage medium
CN115376214A (en) Emotion recognition method and device, electronic equipment and storage medium
CN114743056A (en) Dynamic early-quit-based image description generation model and model training method
Jia et al. ET-GAN: cross-language emotion transfer based on cycle-consistent generative adversarial networks
Ai et al. A Two-Stage Multimodal Emotion Recognition Model Based on Graph Contrastive Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination