CN115130591A

CN115130591A - Cross supervision-based multi-mode data classification method and device

Info

Publication number: CN115130591A
Application number: CN202210773999.8A
Authority: CN
Inventors: 朱心洲; 潘晓华; 沈诗靖
Original assignee: Binjiang Research Institute Of Zhejiang University; Zhejiang University ZJU
Current assignee: Binjiang Research Institute Of Zhejiang University; Zhejiang University ZJU
Priority date: 2022-07-01
Filing date: 2022-07-01
Publication date: 2022-09-30

Abstract

The invention discloses a cross supervision-based multi-modal data classification method, which comprises the following steps: step 1, obtaining multi-modal data, and constructing a sample set containing marked data and unmarked data; step 2, constructing a first classification model and a second classification model based on the same network structure; step 3, training and parameter adjusting the first classification model and the second classification model by using the sample set; step 4, adopting the labeled data, respectively training the obtained first classification model and the second classification model to test, and selecting the model with the best test result as a final multi-modal data classification model; and 5, inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting a classification result corresponding to the multi-modal data. The invention also provides a multi-mode data classification device. The method provided by the invention can ensure the robustness, generalization capability and prediction accuracy of the classification model under the condition of small sample multi-modal data.

Description

Cross supervision-based multi-mode data classification method and device

Technical Field

The invention relates to the technical field of deep learning data classification, in particular to a cross supervision-based multi-mode data classification method and device.

Background

The development of the internet and 5G technology provides a great deal of multi-modal data (including texts, videos and images) for deep learning research, and when the multi-modal data is used for research, the characteristics of each modal data can be fully utilized, thereby avoiding the situation that the expressive force of the single-mode data is limited, such as when deep learning is used for a short-video classification prediction task, only short video title data are used for classification, the data are one-sided, the classification accuracy is influenced, meanwhile, the prediction result can be better obtained by combining the modal data such as video, audio and the like, in a multi-modal data scene, the data which is manually subjected to category marking is less, the unmarked data is more, how to combine labeled data and unlabeled data in the model training process so as to improve the accuracy of model prediction becomes an important problem.

Patent document CN114443864A discloses a cross-modal data matching method, device and computer program product, the method: acquiring a training sample set; the training sample set comprises first modality data, second modality data and a label for representing whether the multimodal data are matched or not; respectively extracting first-layer characteristic information and second-layer characteristic information of first modality data and second modality data in a training sample; and constraining a matching result between the first modal data and the second modal data based on the first-level feature information by using a matching loss, constraining a classification result obtained respectively based on the first modal data, the first-level feature information and the second-level feature information by using a classification loss function, and training to obtain a cross-modal matching model based on the classification result obtained respectively based on the first-level feature information and the second-level feature information of the second modal data. The method can better utilize unmarked multi-modal data, improves the relevance of the model to different modal data through cross-modal contrast learning, but if the method is applied to a multi-modal data classification scene, a model needs to be retrained by combining the marked data with the model, the network structure of the model cannot be shared in many times, and the model parameters are updated according to new data in the later period, so that the cost is high.

Patent document CN110363239A discloses a method, system and medium for learning a hand sample machine for multi-modal data, which includes training and testing with 3 functional modules of multi-modal data representation, hierarchical pooling and relational network, firstly vectorizing multi-modal data features through an encoder, then adopting hierarchical pooling of first maximum pooling and then average pooling to reduce the dimension of a time/space continuous vector sequence into category feature vectors, and finally performing learning classification under the condition of developing a hand sample based on the relational network. The method only uses a small amount of existing marked data, so that the model has poor generalization capability, and information in unmarked sample data cannot be utilized, thereby affecting the robustness, the generalization capability and the prediction accuracy of the model.

Disclosure of Invention

In order to solve the problems, the invention provides a cross supervision-based multi-modal data classification method, which can complete the training of a classification model under the condition of small sample multi-modal data and simultaneously ensure the robustness, generalization capability and prediction accuracy of the classification model.

A cross supervision based multi-modal data classification method comprises the following steps:

step 1, obtaining multi-modal data, labeling part of the multi-modal data with three dimensions of text, audio and video, and constructing a sample set containing labeled data and unlabeled data;

step 2, constructing a first classification model and a second classification model on the basis of the same network structure, wherein the parameter initialization modes of the first classification model and the second classification model are different;

step 3, training and parameter adjustment are carried out on the first classification model and the second classification model constructed in the step 2 by using the sample set constructed in the step 1, wherein the training comprises supervised training and cross-supervised training;

step 4, adopting the labeled data to respectively test the first classification model and the second classification model obtained by training in the step 3, and screening out a model with a higher corresponding F1 value according to a test result to serve as a final multi-modal data classification model;

and 5, inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting a classification result corresponding to the multi-modal data.

In the invention, a cross supervision training method is introduced into the training process of the multi-modal data classification model, the two classification models with the same network structure and different model parameter initialization modes are used for mutual supervision, limited sample data is fully utilized to train the model, finally, the model with the highest F1 value is selected as the final multi-modal data classification model by judging the F1 value of each model, and the generalization capability and the prediction accuracy of the multi-modal data classification model are ensured by the training strategy.

Preferably, the sample set construction of step 1 further comprises a preprocessing of multi-modal data:

generating a corresponding digital combination by a word segmentation conversion method aiming at the text;

extracting features by using a pre-training model aiming at video and audio, and generating corresponding tensors according to the extracted features;

the expression of multi-modal features is convenient to unify, so that the model can be better learned.

Preferably, the ratio of the annotated data to the unlabeled data in the step 1 is 1:10, so that the time consumption of manual labeling is saved, and meanwhile, the proportion of the unlabeled data is improved, so that the finally trained model has better generalization capability.

Specifically, the network structure in step 2 includes a feature extraction module, a processing module, a fusion module and a classification module, the feature extraction module includes a multi-modal feature extractor, the multi-modal feature extractor is used for extracting multi-modal feature vectors of input data, the processing module is used for converting the extracted multi-modal feature quantities into dense vectors with unified dimensions, and inputting the dense vectors into the fusion module, the fusion module is used for splicing the input dense vectors to obtain fusion vectors and inputting the fusion vectors into the classification module, and the classification module performs a hidden projection operation according to the input fusion vectors to obtain classification results.

Specifically, the unified dimension of the processing module is realized by normalization processing, and the expression thereof is as follows:

where e (x) represents the mean value of the input x, var (x) represents the variance of x, and e represents noise, and γ ═ 1 and β ═ 0 in the initialization stage, and γ and β are adjusted according to the back-propagating gradient during the training of the neural network.

Specifically, the splicing operation of the fusion module is expressed by the following formula:

E _i ＝concat({P _t (x _i ),P _m (v _i ),P _v (y _i )})

in the formula, P _t (x _i ) Embedding function, P, representing text modal data _m (v _i ) Embedding function, P, representing data of a video modality _v (y _i ) An embedding function representing audio modality data.

Specifically, the supervised training in step 3 is to train the same labeled data set respectively corresponding to the first classification model and the second classification model so as to minimize the difference between the model prediction result and the actual labeled label, and comprehensively consider the result back propagation update gradient of the loss function to learn the model prediction target, where the expression of the loss function is as follows:

in the formula, bs represents the number of a batch _ size in the labeled data, j represents the model number, i represents the ith data in the batch, CE represents the cross entropy loss function, y _i Label representing the ith piece of data, p _ji And representing the prediction result of the model j on the ith multi-modal data.

Specifically, the cross-supervision training in step 3 includes prediction result cross-supervision and cross-modal cross-supervision.

Preferably, the prediction result cross-supervision is to predict the unlabelled data by using a first classification model and a second classification model respectively, exchange respective prediction results as training data labels for supervised training, minimize the difference between the two model prediction results, comprehensively consider the result back propagation update gradient of respective cross loss functions to update the model parameters, expand the unlabelled data set by adopting a pseudo label mode, and enable the model to learn a more compact characteristic coding representation in the training process.

Preferably, the cross-modal cross supervision is to perform separation and reconstruction on the modalities of the data to obtain partial modality data, predict the partial modality data respectively by using the first classification model and the second classification model to minimize the difference between the prediction results of the two models, and update the model parameters by comprehensively considering the result back propagation update gradient of the respective cross loss function, so that the model can learn the cross-modal connection in the multi-modal data better, and meanwhile, pollution to the model caused by weak modality relevance in partial data is avoided when the cross-modal connection is learned.

Specifically, the expression of the cross-over loss function is as follows:

in the formula, un _ bs represents the number of one batch _ size with label data, i represents the ith piece of data in the batch, and CE is a cross entropy loss function.

Specifically, after cross supervision and transmembrane state cross supervision training is completed, Loss generated by the cross supervision and transmembrane state cross supervision training is calculated, and the expression is as follows:

Loss _unlabeled ＝Loss _csc *β+Loss _cs *γ

in the initialization phase, γ is 1, and β is 1, and during the training process of the neural network, both γ and β parameters are learnable and can be adjusted accordingly according to the gradient of back propagation.

Specifically, the calculation formula of the model F1 value in step 4 is as follows:

wherein l represents the number of classes, Precision _i Indicating the prediction accuracy, Recall, of the ith category _i Representing the predicted recall for the ith category.

The invention also provides a multi-modal data classification device, which comprises a computer, a computer processor and a computer program stored in a computer memory and executed on the computer processor, wherein the computer memory adopts the multi-modal data classification model; the computer processor when executing the computer program implements the steps of: and inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting the classification corresponding to the multi-modal data through calculation and analysis.

Compared with the prior art, the invention has the beneficial effects that:

(1) on the basis of traditional supervision training, prediction result cross supervision and cross-mode cross supervision are introduced, so that the utilization rate of sample data can be improved, and the robustness, generalization capability and prediction accuracy of the model can be guaranteed.

(2) The two models with the same network structure but different model parameter initialization modes are adopted for supervision training, so that the relevance of different modal data can be ensured, and the cost for adjusting the model parameters in the preparation stage can be saved.

Drawings

FIG. 1 is a schematic flow chart of a cross-supervised based multi-modal data classification method according to the present invention;

fig. 2 is a schematic diagram of a network structure provided in this embodiment;

fig. 3 is a schematic flow chart of cross supervision of prediction results provided in this embodiment;

fig. 4 is a schematic flow chart of cross-modal cross supervision provided in this embodiment.

Detailed Description

There are a lot of multi-modal english learning data on the internet, such as english learning short videos, english audio, english example sentences, english short texts, etc., and each english learning content provider has different chapter standards and classification granularities for the english learning content produced by the provider, collects and collates the multi-modal content on the internet, and classifies chapters according to corresponding textbook data (labeled data) by referring to part of data, and there is still a lot of data that is not classified according to chapters corresponding to the textbooks (unlabeled data).

As shown in fig. 1, in order to complete section classification of a textbook, the present embodiment provides a cross-supervision-based multi-modal data classification method, including:

step 1, obtaining multi-modal data comprising character data, video data and audio data, and processing the multi-modal data:

processing the character data, and selecting a fixed value y which represents the audio length used in processing the video and audio data, for example, when y is 256, representing the data trained by using the first 256 characters of the text as a model, the part exceeding 256 characters is cut off, and the part less than 256 characters is completed by using a placeholder; the text content is then participled, the participles are converted into corresponding ID representations according to a vocabulary, and when placeholders are identified, a "-1" is used for substitution.

Processing video data, selecting a fixed value k representing the number of frames used in processing video modality data, for example, when k is 32, representing data trained by using the first 32 frames of a video as a model, cutting off the part exceeding 32 frames, and completing the part less than 32 frames by using placeholders; the video frame features extracted using the pre-trained model (Swin-Transformer) are then converted to corresponding tensors, and when placeholders are identified, the tensors are replaced with all zeros.

Processing the audio data, processing the audio modal data, selecting a fixed value x representing the length of the audio used when the audio data is processed, for example, when x is 32, representing data trained by using the first 32 seconds of the audio as a model, cutting off the part exceeding 32 seconds, and completing the part less than 32 seconds by using a placeholder; features of the audio are then extracted using a pre-trained model (TERA), the audio features are converted to corresponding tensors, and when placeholders are identified, the tensors are replaced with fully zero tensors.

Labeling the multimode data in three dimensions of text, audio and video in a labeling ratio of 1:10 to construct a sample set containing labeled data and unlabeled data.

And 2, constructing a first classification model and a second classification model on the basis of the same network structure.

As shown in fig. 2, the network structure includes three identical imbedding layers, which are respectively used for performing an imbedding operation on data of three modalities, and for the imbedding layers, a pretrained model of Bert is used for performing corresponding initialization.

For text data, after a data preparation stage, a text mode is converted into a corresponding ID representation, and three different functions of dense vectors representing the text are obtained by using an embedding layer of the text mode: word _ embedding, token _ type _ embedding and position _ embedding, adding the three vectors, converting the text mode into a dense vector with fixed dimension, and then acquiring the embedded representation corresponding to the text data through normalization operation, wherein the embedded representation is marked as P _t (x _i )。

For the video modality, the video modality is converted into corresponding tensor representation through a linear layer after the data preparation stage, and the video modality is displayed through the linear layerMapping tensor dimensionality of the modality to be the same as dimensionality expressed by embedding of the text modality to be used as vector expression of the video modality, acquiring token _ type _ embeddings and position _ embeddings corresponding to the video modality according to the size of the video modality and whether placeholder content exists, adding the three vectors, and acquiring embedded expression corresponding to the video modality through normalization operation, wherein the embedded expression is marked as P _m (v _i )。

For an audio modality, after a data preparation stage, a video modality is converted into corresponding tensor representation, tensor dimension of the audio modality is mapped to be the same as dimension represented by embedding of a text modality through a linear layer to serve as vector representation of the audio modality, token _ type _ embeddings and position _ embeddings corresponding to the audio modality are obtained according to the size of the video modality and whether placeholder content exists, the three vectors are added, and then embedded representation corresponding to the audio modality is obtained through normalization operation and is recorded as P _v (y _i )。

If some modal data is missing, all 0 s are used, and the tensor with the same dimension represents the embedded representation of the missing mode.

The specific expression of the normalization operation is as follows:

After the conversion of the embedded representation of the multi-modal data is completed, the embedded representations of different modalities are spliced, and the splicing part can be represented as follows:

E _i ＝concat({P _t (x _i ),P _m (v _i ),P _v (y _i )})

subsequently, the embedded representation E _i Through the Encoder layer of Bert, obtain the last _ hidden _ sta of the outputAnd te, performing mean pooling operation on last _ hidden _ state, passing the result through a linear layer with the corresponding classification number, and obtaining the final corresponding classification result by obtaining the subscript of the maximum element of the linear layer.

Step 3, training and parameter adjusting the first classification model and the second classification model constructed in the step 2 by using the sample set constructed in the step 1:

1. supervised data training phase

In this stage, the first classification model and the second classification model respectively use two different sets of labeled data to train the models, learning is performed by minimizing the difference between the prediction result of the labeled data and the actual labeled label, and in this stage, based on a cross entropy loss function as a method for measuring the difference between the prediction result and the actual labeled label, the loss function can be expressed as:

in the formula, bs represents the number of a batch _ size in labeled data, j represents a model number, i represents the ith data in batch, CE represents a cross entropy loss function, and y represents _i Label representing the ith piece of data, p _ji And representing the prediction result of the model j on the ith multi-modal data.

After the calculation of the loss function is completed, the result of the minimized loss function is used as a training target, back propagation is carried out according to the result of the loss function respectively, the gradients of the first classification model and the second classification model are updated, and the learning of the model prediction target is carried out.

2. Prediction result cross supervision phase

As shown in fig. 3, after completing the training of the monitor data pair model of the batch, the cross monitor training stage is entered, in which the model is trained using unlabeled data, the same data is predicted using the first classification model and the second classification model, after passing through a linear classification layer of a corresponding category, the prediction result of the first classification model is denoted as Pa, the prediction result of the second classification model is denoted as Pb, the calculation method can be expressed as:

the method comprises the following steps of obtaining a first classification model, predicting a first feature code of a sample, and carrying out cross-supervised training on the first classification model, wherein un _ bs is the number of a batch _ size with labeled data, i is the ith data in the batch, and CE is a cross entropy loss function.

3. Cross-modal cross-supervision phase

As shown in fig. 4, training is performed by using labeled data and unlabeled data simultaneously, and two data forms are generated by performing cross-modality combination on different parts of a piece of data, where a first part of data includes a part of modalities of the piece of data, such as X1 in the figure, including text and video modalities in data X; the second part of data comprises another part of modality of the piece of data, such as X2 in the figure, including video and audio modalities in the data; predicting X1 by using a first classification model, predicting X2 by using a second classification model, and after a linear classification layer of a corresponding class is passed, recording the prediction result of the first classification model as Pa, recording the prediction result of the second classification model as Pb, obtaining the subscript corresponding to the maximum value in Pa as Ya, and the subscript corresponding to the maximum value in Pb as Yb, and calculating the cross entropy loss of Yb and Pa, and the cross entropy loss of Ya and Pb so as to calculate the loss in a cross-mode cross supervision training stage, wherein the calculation method can be expressed as follows:

in the formula, un _ bs represents the number of one batch _ size with labeled data, i represents the ith data in the batch, and CE is a cross entropy loss function, so that the model can better learn the cross-modal connection in the multi-modal data, and meanwhile, the pollution to the model caused by the weak modal relevance when the cross-modal connection is learned is avoided.

After the training is completed, calculating the generated Loss in a way that:

Loss _unlabeled ＝Loss _csc *β+Loss _cs *γ

After the Loss is calculated _unlabeled And then, carrying out back propagation according to the loss, updating the gradients of the first classification model and the second classification model, iteratively completing the steps until the models are converged, completing the training of the models, and freezing corresponding model parameters to realize the prediction of specific characteristic data after completing the training of the models.

And 4, adopting the marked data to respectively test the first classification model and the second classification model obtained by training in the step 3:

in the formula, l represents the number of classes, Precision _i Indicating the prediction accuracy, Recall, of the ith category _i Representing the predicted recall for the ith category.

And selecting the model with the highest F1 value from the two models as the final multi-modal data classification model.

And 5, inputting the unclassified teaching material content into the multi-modal data classification model, and outputting chapter classification results corresponding to the teaching material content.

The invention also provides a multi-modal data classification device, which comprises a computer, a computer processor and a computer program stored in the computer memory and executed on the computer processor, wherein the computer memory adopts the multi-modal data classification model;

the computer program when executed by a computer processor implements the steps of: and inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting the classification corresponding to the multi-modal data through calculation and analysis.

The method provided by the invention can fully utilize marked data and unmarked data, train out a model with strong generalization capability and high prediction accuracy, and realize high-efficiency classification of massive multi-mode English data.

Claims

1. A multi-modal data classification method based on cross supervision is characterized by comprising the following steps:

2. The cross-supervised-based multimodal data classification method of claim 1, wherein the sample set construction of step 1 further comprises a pre-processing of multimodal data:

and aiming at the video and the audio, extracting the features by using a pre-training model, and generating a corresponding tensor according to the extracted features.

3. A cross-supervision based multi-modal data classification method according to claim 1, characterized in that the ratio of annotated data to unlabeled data in step 1 is 1: 10.

4. The cross supervision-based multi-modal data classification method according to claim 1, wherein the network structure in step 2 includes a feature extraction module, a processing module, a fusion module and a classification module, the feature extraction module includes a multi-modal feature extractor, the multi-modal feature extractor is used for extracting multi-modal feature vectors of input data, the processing module is used for converting the extracted multi-modal feature quantities into dense vectors with uniform dimensions and inputting the dense vectors into the fusion module, the fusion module is used for performing a splicing operation on the input dense vectors to obtain fusion vectors and inputting the fusion vectors into the classification module, and the classification module performs a hidden projection operation according to the input fusion vectors to obtain classification results.

5. The cross-supervision-based multi-modal data classification method according to claim 1, wherein the supervision training in step 3 is to train the same labeled data set respectively corresponding to the first classification model and the second classification model so as to minimize the difference between the model prediction result and the actual labeled label, and comprehensively consider the result back propagation update gradient of the loss function to learn the model prediction target.

6. The cross-supervision-based multimodal data classification method according to claim 1, characterized in that the cross-supervision training in step 3 comprises predictive outcome cross-supervision and cross-modality cross-supervision.

7. The cross-supervision-based multi-modal data classification method according to claim 6, wherein the prediction result cross-supervision is to predict the unlabeled data by using a first classification model and a second classification model respectively, and exchange the respective prediction results as training data labels for supervised training, so as to minimize the difference between the two model prediction results, and update the model parameters by comprehensively considering the result back propagation update gradient of the respective cross-loss function.

8. A cross-mode supervision based multi-modal data classification method according to claim 6, characterized in that the cross-mode cross supervision is to perform separation and reconstruction on the modes of the data, obtain partial mode data, predict the partial mode data respectively by using a first classification model and a second classification model, so as to minimize the difference between the two model prediction results, and update model parameters by comprehensively considering the result back propagation update gradient of the respective cross loss function.

9. The cross-supervision-based multi-modal data classification method according to claim 1, characterized in that the model F1 value in step 4 is calculated as follows:

10. A multi-modal data classification apparatus comprising a computer, a computer processor, and a computer program stored in and executed on the computer processor, wherein the multi-modal data classification model of claim 1 is employed in the computer memory; the computer processor when executing the computer program implements the steps of: and inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting the classification corresponding to the multi-modal data through calculation and analysis.