CN115130591A - Cross supervision-based multi-mode data classification method and device - Google Patents

Cross supervision-based multi-mode data classification method and device

Info

Publication number
CN115130591A
CN115130591A (application CN202210773999.8A)
Authority
CN
China
Prior art keywords
data
classification
cross
model
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210773999.8A
Other languages
Chinese (zh)
Inventor
朱心洲
潘晓华
沈诗靖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Binjiang Research Institute Of Zhejiang University
Zhejiang University ZJU
Original Assignee
Binjiang Research Institute Of Zhejiang University
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Binjiang Research Institute Of Zhejiang University, Zhejiang University ZJU filed Critical Binjiang Research Institute Of Zhejiang University
Priority to CN202210773999.8A priority Critical patent/CN115130591A/en
Publication of CN115130591A publication Critical patent/CN115130591A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-supervision-based multi-modal data classification method comprising the following steps: step 1, obtaining multi-modal data and constructing a sample set containing labeled data and unlabeled data; step 2, constructing a first classification model and a second classification model based on the same network structure; step 3, training and tuning the first classification model and the second classification model with the sample set; step 4, testing the trained first classification model and second classification model with the labeled data and selecting the model with the better test result as the final multi-modal data classification model; and step 5, inputting the multi-modal data to be classified into the multi-modal data classification model and outputting the corresponding classification result. The invention also provides a multi-modal data classification device. The method can guarantee the robustness, generalization ability and prediction accuracy of the classification model even with small-sample multi-modal data.

Description

Cross supervision-based multi-mode data classification method and device
Technical Field
The invention relates to the technical field of deep-learning-based data classification, in particular to a cross-supervision-based multi-modal data classification method and device.
Background
The development of the internet and 5G technology provides a large amount of multi-modal data (including text, video and images) for deep learning research. When multi-modal data are used, the characteristics of each modality can be fully exploited, avoiding the limited expressiveness of single-modality data. For example, in a short-video classification task, classifying on the title text alone uses one-sided information and hurts accuracy, whereas combining the video, audio and other modalities yields better predictions. In multi-modal scenarios, manually category-labeled data are scarce while unlabeled data are abundant, so how to combine labeled and unlabeled data during model training to improve prediction accuracy has become an important problem.
Patent document CN114443864A discloses a cross-modal data matching method, device and computer program product. The method acquires a training sample set comprising first-modality data, second-modality data and a label indicating whether the multi-modal data match; extracts first-level and second-level feature information from the first-modality and second-modality data in each training sample; constrains the matching result between the first-modality and second-modality data based on the first-level feature information with a matching loss; constrains the classification results obtained from the first-level and second-level feature information of each modality with a classification loss function; and thereby trains a cross-modal matching model. The method makes better use of unlabeled multi-modal data and improves the model's association of different modalities through cross-modal contrastive learning. However, if it is applied to a multi-modal data classification scenario, a model must be retrained on the labeled data, the network structure often cannot be shared, and updating the model parameters on new data later is costly.
Patent document CN110363239A discloses a few-shot machine learning method, system and medium for multi-modal data. It trains and tests with three functional modules: multi-modal data representation, hierarchical pooling and a relation network. Multi-modal data features are first vectorized by an encoder, hierarchical pooling (max pooling followed by average pooling) then reduces the temporally/spatially continuous vector sequence to category feature vectors, and classification is finally learned under few-shot conditions based on the relation network. Because the method uses only a small amount of existing labeled data, the model generalizes poorly and cannot exploit the information in unlabeled samples, which affects the robustness, generalization ability and prediction accuracy of the model.
Disclosure of Invention
In order to solve the above problems, the invention provides a cross-supervision-based multi-modal data classification method that can train a classification model on small-sample multi-modal data while guaranteeing the robustness, generalization ability and prediction accuracy of the classification model.
A cross supervision based multi-modal data classification method comprises the following steps:
step 1, obtaining multi-modal data, labeling part of the multi-modal data in the three dimensions of text, audio and video, and constructing a sample set containing labeled data and unlabeled data;
step 2, constructing a first classification model and a second classification model on the basis of the same network structure, wherein the parameter initialization modes of the first classification model and the second classification model are different;
step 3, training and parameter adjustment are carried out on the first classification model and the second classification model constructed in the step 2 by using the sample set constructed in the step 1, wherein the training comprises supervised training and cross-supervised training;
step 4, using the labeled data to test the first classification model and the second classification model obtained by training in step 3, and selecting the model with the higher F1 value on the test results as the final multi-modal data classification model;
and 5, inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting a classification result corresponding to the multi-modal data.
In the invention, a cross-supervision training method is introduced into the training of the multi-modal data classification model: two classification models with the same network structure but different parameter initializations supervise each other, so that the limited sample data are fully used for training. Finally, the model with the higher F1 value is selected as the final multi-modal data classification model. This training strategy guarantees the generalization ability and prediction accuracy of the multi-modal data classification model.
Preferably, the sample set construction of step 1 further comprises a preprocessing of multi-modal data:
for text, segmenting the text into tokens and converting them into a corresponding sequence of numeric IDs;
for video and audio, extracting features with a pre-trained model and generating the corresponding tensors from the extracted features;
this unifies the representation of the multi-modal features so that the model can learn better.
Preferably, the ratio of labeled data to unlabeled data in step 1 is 1:10, which saves manual labeling time while raising the proportion of unlabeled data, so that the finally trained model generalizes better.
Specifically, the network structure in step 2 comprises a feature extraction module, a processing module, a fusion module and a classification module. The feature extraction module contains a multi-modal feature extractor that extracts multi-modal feature vectors from the input data. The processing module converts the extracted multi-modal feature vectors into dense vectors of a unified dimension and feeds them to the fusion module. The fusion module concatenates the input dense vectors into a fusion vector and feeds it to the classification module. The classification module maps the input fusion vector through a classification layer to obtain the classification result.
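To make the module structure concrete, a minimal PyTorch sketch is given below. The feature dimensions, the class count and the use of a single dense vector per modality are illustrative assumptions (feature extraction by pre-trained models is assumed to happen upstream), not details fixed by the patent.

```python
# Minimal sketch of the four-module structure: processing (per-modality projection
# and normalization), fusion (concatenation) and classification (linear layer).
import torch
import torch.nn as nn

class MultiModalClassifier(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, audio_dim=768,
                 hidden_dim=768, num_classes=10):
        super().__init__()
        # processing module: map each modality to a dense vector of unified dimension
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.LayerNorm(hidden_dim))
        self.video_proj = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.LayerNorm(hidden_dim))
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.LayerNorm(hidden_dim))
        # classification module: linear layer over the fused representation
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, text_feat, video_feat, audio_feat):
        # fusion module: concatenate the per-modality dense vectors
        fused = torch.cat([self.text_proj(text_feat),
                           self.video_proj(video_feat),
                           self.audio_proj(audio_feat)], dim=-1)
        return self.classifier(fused)  # logits per class

# two models with the same structure but different parameter initializations
torch.manual_seed(0); model_a = MultiModalClassifier()
torch.manual_seed(1); model_b = MultiModalClassifier()
```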
Specifically, the unified-dimension representation in the processing module is obtained through a normalization operation whose expression is:
y = γ · (x − E(x)) / sqrt(Var(x) + ε) + β
where E(x) denotes the mean of the input x, Var(x) denotes the variance of x, and ε is a small noise term; in the initialization stage γ = 1 and β = 0, and γ and β are adjusted according to the back-propagated gradient during training of the neural network.
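As a sanity check, the normalization above behaves like standard layer normalization with learnable γ and β; the following sketch assumes torch.nn.LayerNorm is an acceptable stand-in.

```python
# Verify that the formula above matches torch.nn.LayerNorm (gamma = 1, beta = 0 at init).
import torch
import torch.nn as nn

x = torch.randn(4, 768)                      # a batch of dense vectors
ln = nn.LayerNorm(768)                       # weight (gamma) = 1, bias (beta) = 0 at init
y_manual = (x - x.mean(-1, keepdim=True)) / torch.sqrt(
    x.var(-1, unbiased=False, keepdim=True) + ln.eps)
print(torch.allclose(ln(x), y_manual * ln.weight + ln.bias, atol=1e-5))  # True
```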
Specifically, the splicing operation of the fusion module is expressed by the following formula:
E_i = concat(P_t(x_i), P_m(v_i), P_v(y_i))
where P_t(x_i) denotes the embedding function of the text modality data, P_m(v_i) the embedding function of the video modality data, and P_v(y_i) the embedding function of the audio modality data.
Specifically, in the supervised training of step 3 the first classification model and the second classification model are each trained on the same labeled data set so as to minimize the difference between the model prediction and the actual label; the result of the loss function is back-propagated to update the gradients and learn the prediction target. The loss function is expressed as:
Loss_labeled_j = (1/bs) · Σ_{i=1}^{bs} CE(p_ji, y_i)
where bs denotes the number of samples in one batch_size of labeled data, j denotes the model index, i denotes the i-th sample in the batch, CE denotes the cross-entropy loss function, y_i denotes the label of the i-th sample, and p_ji denotes the prediction of model j on the i-th multi-modal sample.
Specifically, the cross-supervision training in step 3 includes prediction result cross-supervision and cross-modal cross-supervision.
Preferably, in prediction-result cross supervision, the first classification model and the second classification model each predict the unlabeled data, their predictions are exchanged and used as pseudo labels for supervised training, the difference between the two models' predictions is minimized, and the result of each cross loss function is back-propagated to update the model parameters. Expanding the unlabeled data set with pseudo labels in this way lets the models learn a more compact feature-encoding representation during training.
Preferably, in cross-modal cross supervision, the modalities of a sample are separated and recombined to obtain partial-modality data; the first classification model and the second classification model predict different partial-modality views so as to minimize the difference between the two models' predictions, and the result of each cross loss function is back-propagated to update the model parameters. This lets the models better learn the cross-modal connections within multi-modal data while avoiding the contamination caused by samples whose modalities are only weakly related.
Specifically, the cross loss function is expressed as:
Loss_cs = (1/un_bs) · Σ_{i=1}^{un_bs} [ CE(Pa_i, Yb_i) + CE(Pb_i, Ya_i) ]
where un_bs denotes the number of samples in one batch_size of unlabeled data, i denotes the i-th sample in the batch, CE is the cross-entropy loss function, Pa_i and Pb_i are the predictions of the first and second classification models, and Ya_i and Yb_i are the corresponding pseudo labels (the class index of the maximum element of Pa_i and Pb_i respectively).
Specifically, after the prediction-result cross supervision and cross-modal cross supervision training is completed, the loss they generate is calculated as:
Loss_unlabeled = Loss_csc · β + Loss_cs
In the initialization stage γ = 1 and β = 1; during training of the neural network both γ and β are learnable parameters and are adjusted according to the back-propagated gradient.
Specifically, the calculation formula of the model F1 value in step 4 is as follows:
F1 = (1/l) · Σ_{i=1}^{l} (2 · Precision_i · Recall_i) / (Precision_i + Recall_i)
where l denotes the number of classes, Precision_i denotes the prediction precision of the i-th class, and Recall_i denotes the prediction recall of the i-th class.
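A minimal sketch of this macro-averaged F1 computation is given below, assuming integer class labels; the function name and the final selection line are illustrative.

```python
# Macro-averaged F1 over l classes, mirroring the formula above.
from collections import Counter

def macro_f1(y_true, y_pred, num_classes):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    total = 0.0
    for c in range(num_classes):
        precision = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        total += 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return total / num_classes

# example: keep the model with the higher F1 on the labeled test split
# f1_a = macro_f1(test_labels, preds_model_a, num_classes=l)
# f1_b = macro_f1(test_labels, preds_model_b, num_classes=l)
```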
The invention also provides a multi-modal data classification device comprising a computer memory, a computer processor and a computer program stored in the computer memory and executed on the computer processor, the above multi-modal data classification model being stored in the computer memory; when the computer processor executes the computer program it implements the following step: inputting the multi-modal data to be classified into the multi-modal data classification model and, through computation and analysis, outputting the classification corresponding to the multi-modal data.
Compared with the prior art, the invention has the beneficial effects that:
(1) On top of traditional supervised training, prediction-result cross supervision and cross-modal cross supervision are introduced, which improves the utilization of sample data and guarantees the robustness, generalization ability and prediction accuracy of the model.
(2) Two models with the same network structure but different parameter initializations are used to supervise each other, which preserves the association between different modality data and saves the cost of tuning model parameters in the preparation stage.
Drawings
FIG. 1 is a schematic flow chart of the cross-supervision-based multi-modal data classification method of the present invention;
fig. 2 is a schematic diagram of a network structure provided in this embodiment;
fig. 3 is a schematic flow chart of cross supervision of prediction results provided in this embodiment;
fig. 4 is a schematic flow chart of cross-modal cross supervision provided in this embodiment.
Detailed Description
The internet contains a large amount of multi-modal English-learning data, such as English-learning short videos, English audio, English example sentences and short English texts. Each content provider uses different chapter standards and classification granularities for the English-learning content it produces. After collecting and collating this multi-modal content from the internet, part of the data has been classified into chapters according to the corresponding textbook (labeled data), while a large amount of data has not yet been classified into the chapters of the textbook (unlabeled data).
As shown in fig. 1, in order to complete the chapter classification of the textbook, this embodiment provides a cross-supervision-based multi-modal data classification method, including:
Step 1, obtaining multi-modal data comprising text data, video data and audio data, and processing the multi-modal data:
Processing the text data: a fixed value y representing the text length used when processing the text data is selected. For example, when y is 256, the first 256 characters of the text are used as model training data; the part beyond 256 characters is truncated, and texts shorter than 256 characters are padded with a placeholder. The text is then segmented into tokens, the tokens are converted into the corresponding ID representation according to a vocabulary, and wherever a placeholder is identified it is replaced with "-1".
Processing the video data: a fixed value k representing the number of frames used when processing video-modality data is selected. For example, when k is 32, the first 32 frames of the video are used as model training data; the part beyond 32 frames is truncated, and videos shorter than 32 frames are padded with placeholders. The video-frame features extracted with a pre-trained model (Swin-Transformer) are then converted into the corresponding tensors, and wherever a placeholder is identified it is replaced with an all-zero tensor.
Processing the audio data: a fixed value x representing the audio length used when processing audio-modality data is selected. For example, when x is 32, the first 32 seconds of the audio are used as model training data; the part beyond 32 seconds is truncated, and audio shorter than 32 seconds is padded with placeholders. Features of the audio are then extracted with a pre-trained model (TERA), the audio features are converted into the corresponding tensors, and wherever a placeholder is identified it is replaced with an all-zero tensor.
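The truncate-and-pad logic above can be sketched as follows; the helper names, feature dimensions and the upstream extractors (vocabulary lookup, Swin-Transformer, TERA) are assumptions for illustration.

```python
# Length handling and placeholder logic for the three modalities.
import torch

def pad_or_truncate_ids(token_ids, y=256, pad_id=-1):
    """Text: keep the first y token IDs, pad the rest with the placeholder ID."""
    ids = token_ids[:y]
    return ids + [pad_id] * (y - len(ids))

def pad_or_truncate_feats(feats, length, dim):
    """Video/audio: keep the first `length` feature vectors, pad with all-zero tensors."""
    feats = feats[:length]
    if feats.shape[0] < length:
        pad = torch.zeros(length - feats.shape[0], dim)
        feats = torch.cat([feats, pad], dim=0)
    return feats

text_ids   = pad_or_truncate_ids([101, 2023, 2003, 102], y=256)
video_feat = pad_or_truncate_feats(torch.randn(20, 1024), length=32, dim=1024)  # 20 frames, padded to 32
audio_feat = pad_or_truncate_feats(torch.randn(40, 768),  length=32, dim=768)   # 40 s, truncated to 32
```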
The multi-modal data are labeled in the three dimensions of text, audio and video at a labeled-to-unlabeled ratio of 1:10 to construct a sample set containing labeled data and unlabeled data.
And 2, constructing a first classification model and a second classification model on the basis of the same network structure.
As shown in fig. 2, the network structure includes three identical embedding layers, which respectively perform the embedding operation on the data of the three modalities; the embedding layers are initialized with the pre-trained Bert model.
For text data, after the data preparation stage the text modality has been converted into its corresponding ID representation. The embedding layer of the text modality produces three dense-vector representations of the text: word_embedding, token_type_embedding and position_embedding. The three vectors are added, converting the text modality into a dense vector of fixed dimension, and the embedded representation of the text data, denoted P_t(x_i), is then obtained through the normalization operation.
For the video modality, after the data preparation stage the video modality has been converted into its corresponding tensor representation. A linear layer maps the tensor dimension of the video modality to the same dimension as the text-modality embedding, which serves as the vector representation of the video modality. The token_type_embeddings and position_embeddings of the video modality are obtained according to its size and whether placeholder content is present. The three vectors are added, and the embedded representation of the video modality, denoted P_m(v_i), is then obtained through the normalization operation.
For the audio modality, after the data preparation stage the audio modality has been converted into its corresponding tensor representation. A linear layer maps the tensor dimension of the audio modality to the same dimension as the text-modality embedding, which serves as the vector representation of the audio modality. The token_type_embeddings and position_embeddings of the audio modality are obtained according to its size and whether placeholder content is present. The three vectors are added, and the embedded representation of the audio modality, denoted P_v(y_i), is then obtained through the normalization operation.
If the data of some modality is missing, an all-zero tensor of the same dimension is used as the embedded representation of the missing modality.
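A minimal sketch of such a per-modality embedding layer follows; the dimensions, the number of token types and the form of the position embedding are illustrative assumptions.

```python
# Per-modality embedding: linear projection to the text embedding dimension,
# plus token-type and position embeddings, followed by normalization.
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    def __init__(self, feat_dim, hidden_dim=768, max_len=512, num_token_types=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)                 # map to text embedding dim
        self.token_type = nn.Embedding(num_token_types, hidden_dim) # which modality this segment is
        self.position = nn.Embedding(max_len, hidden_dim)           # position within the segment
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, feats, token_type_id):
        # feats: (seq_len, feat_dim); a missing modality would be an all-zero tensor
        seq_len = feats.shape[0]
        positions = torch.arange(seq_len)
        emb = (self.proj(feats)
               + self.token_type(torch.full((seq_len,), token_type_id))
               + self.position(positions))
        return self.norm(emb)   # e.g. P_m(v_i) or P_v(y_i)

video_embed = ModalityEmbedding(feat_dim=1024)
P_m = video_embed(torch.randn(32, 1024), token_type_id=1)  # shape (32, 768)
```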
The normalization operation is expressed as:
y = γ · (x − E(x)) / sqrt(Var(x) + ε) + β
where E(x) denotes the mean of the input x, Var(x) denotes the variance of x, and ε is a small noise term; in the initialization stage γ = 1 and β = 0, and γ and β are adjusted according to the back-propagated gradient during training of the neural network.
After the conversion of the embedded representations of the multi-modal data is completed, the embedded representations of the different modalities are concatenated; the concatenation can be expressed as:
E_i = concat(P_t(x_i), P_m(v_i), P_v(y_i))
Subsequently, the embedded representation E_i is passed through the Encoder layers of Bert to obtain the output last_hidden_state. A mean-pooling operation is applied to last_hidden_state, the result is passed through a linear layer whose output size equals the number of classes, and the final classification result is obtained as the index of the maximum element of that layer's output.
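A minimal sketch of this fusion-and-classification step is given below, assuming the Hugging Face transformers BertModel (fed with inputs_embeds) as a stand-in for the Bert Encoder layers named above; sequence lengths and the class count are illustrative.

```python
# Concatenate per-modality embeddings along the sequence axis, encode with BERT,
# mean-pool, then classify with a linear layer.
import torch
import torch.nn as nn
from transformers import BertModel

class FusionClassifier(nn.Module):
    def __init__(self, num_classes, hidden_dim=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, P_t, P_m, P_v):
        # E_i = concat(P_t(x_i), P_m(v_i), P_v(y_i)) along the sequence dimension
        E = torch.cat([P_t, P_m, P_v], dim=1)                        # (batch, seq, hidden)
        last_hidden_state = self.encoder(inputs_embeds=E).last_hidden_state
        pooled = last_hidden_state.mean(dim=1)                       # mean pooling
        return self.head(pooled)                                     # logits; argmax gives the class

model = FusionClassifier(num_classes=20)
logits = model(torch.randn(2, 256, 768), torch.randn(2, 32, 768), torch.randn(2, 32, 768))
pred = logits.argmax(dim=-1)
```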
Step 3, training and parameter adjusting the first classification model and the second classification model constructed in the step 2 by using the sample set constructed in the step 1:
1. supervised data training phase
In this stage the first classification model and the second classification model are trained with two different sets of labeled data respectively, learning by minimizing the difference between the prediction on the labeled data and the actual label. The cross-entropy loss function is used to measure this difference, and the loss can be expressed as:
Loss_labeled_j = (1/bs) · Σ_{i=1}^{bs} CE(p_ji, y_i)
where bs denotes the number of samples in one batch_size of labeled data, j denotes the model index, i denotes the i-th sample in the batch, CE denotes the cross-entropy loss function, y_i denotes the label of the i-th sample, and p_ji denotes the prediction of model j on the i-th multi-modal sample.
After the loss is computed, minimizing the loss is taken as the training objective; back propagation is performed according to each model's loss, the gradients of the first classification model and the second classification model are updated, and the prediction target is learned.
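One supervised training step could look like the following sketch, assuming model_a and model_b share the structure sketched earlier and each has its own optimizer and labeled batch of extracted features.

```python
# Supervised step: cross-entropy on labeled data, back-propagate, update gradients.
import torch.nn.functional as F

def supervised_step(model, optimizer, batch):
    text, video, audio, labels = batch          # labels: (bs,) class indices
    logits = model(text, video, audio)          # p_ji for every sample i in the batch
    loss = F.cross_entropy(logits, labels)      # (1/bs) * sum_i CE(p_ji, y_i)
    optimizer.zero_grad()
    loss.backward()                             # back-propagate and update the gradient
    optimizer.step()
    return loss.item()

# loss_a = supervised_step(model_a, optimizer_a, labeled_batch_a)
# loss_b = supervised_step(model_b, optimizer_b, labeled_batch_b)
```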
2. Prediction result cross supervision phase
As shown in fig. 3, after the supervised training of the models on a batch is finished, the cross-supervised training stage begins. In this stage the models are trained with unlabeled data: the same data are predicted by the first classification model and the second classification model, and after the linear classification layer of the corresponding number of classes, the prediction of the first classification model is denoted Pa and the prediction of the second classification model is denoted Pb. The class index of the maximum value of Pa is denoted Ya and that of Pb is denoted Yb; the cross-entropy loss of (Pa, Yb) and the cross-entropy loss of (Pb, Ya) are computed, and the loss of this stage can be expressed as:
Loss_cs = (1/un_bs) · Σ_{i=1}^{un_bs} [ CE(Pa_i, Yb_i) + CE(Pb_i, Ya_i) ]
where un_bs denotes the number of samples in one batch_size of unlabeled data, i denotes the i-th sample in the batch, and CE is the cross-entropy loss function. By exchanging prediction results for cross-supervised training, each model learns a more compact feature-encoding representation of the samples.
3. Cross-modal cross-supervision phase
As shown in fig. 4, this stage trains with labeled and unlabeled data at the same time. Two data views are generated by recombining different modalities of one sample: the first view contains part of the sample's modalities, for example X1 in the figure, which contains the text and video modalities of data X; the second view contains another part of the modalities, for example X2 in the figure, which contains the video and audio modalities. X1 is predicted with the first classification model and X2 with the second classification model. After the linear classification layer of the corresponding number of classes, the prediction of the first classification model is denoted Pa and that of the second classification model Pb; the class index of the maximum value of Pa is denoted Ya and that of Pb is denoted Yb. The cross-entropy loss of (Pa, Yb) and the cross-entropy loss of (Pb, Ya) are then computed to obtain the loss of the cross-modal cross supervision stage, which can be expressed as:
Loss_csc = (1/un_bs) · Σ_{i=1}^{un_bs} [ CE(Pa_i, Yb_i) + CE(Pb_i, Ya_i) ]
where un_bs denotes the number of samples in one batch_size, i denotes the i-th sample in the batch, and CE is the cross-entropy loss function. This lets the models better learn the cross-modal connections within multi-modal data while avoiding the contamination caused by samples whose modalities are only weakly related.
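A minimal sketch of the cross-modal cross supervision follows. Here the partial-modality views X1 and X2 are built by zeroing out the dropped modality, which mirrors the all-zero placeholder used for missing modalities; the patent's exact separation-and-reconstruction scheme may differ.

```python
# Cross-modal cross supervision: each model sees a different partial-modality view
# of the same sample, and their argmax predictions supervise each other.
import torch
import torch.nn.functional as F

def cross_modal_loss(model_a, model_b, batch):
    text, video, audio = batch                  # already-extracted feature tensors
    zeros_t, zeros_a = torch.zeros_like(text), torch.zeros_like(audio)
    pa = model_a(text, video, zeros_a)          # X1: text + video modalities
    pb = model_b(zeros_t, video, audio)         # X2: video + audio modalities
    ya = pa.argmax(dim=-1).detach()             # Ya
    yb = pb.argmax(dim=-1).detach()             # Yb
    # cross entropy of (Pa, Yb) and of (Pb, Ya)
    return F.cross_entropy(pa, yb) + F.cross_entropy(pb, ya)
```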
After the training of this stage is completed, the generated loss is calculated as:
Loss_unlabeled = Loss_csc · β + Loss_cs
In the initialization stage γ = 1 and β = 1; during training of the neural network both γ and β are learnable parameters and are adjusted according to the back-propagated gradient.
After Loss_unlabeled is calculated, back propagation is performed according to this loss and the gradients of the first classification model and the second classification model are updated. The above steps are iterated until the models converge, which completes the training; the corresponding model parameters are then frozen so that specific feature data can be predicted.
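Putting the pieces together, one training iteration could look like the sketch below. It is a simplified single-optimizer variant (the patent back-propagates each model's losses separately), beta is the learnable weight from the formula above initialized to 1, and the optimizer opt is assumed to hold the parameters of both models together with beta; the helpers cross_supervision_loss and cross_modal_loss are the ones sketched above.

```python
# Combine the supervised loss and the two cross-supervision losses in one iteration.
import torch
import torch.nn.functional as F

beta = torch.nn.Parameter(torch.tensor(1.0))    # learnable weight, initialized to 1

def train_iteration(model_a, model_b, opt, labeled_batch, unlabeled_batch):
    text, video, audio, labels = labeled_batch
    # supervised loss for both models
    loss_sup = (F.cross_entropy(model_a(text, video, audio), labels)
                + F.cross_entropy(model_b(text, video, audio), labels))
    # cross-supervision losses on unlabeled data
    loss_cs = cross_supervision_loss(model_a, model_b, unlabeled_batch)
    loss_csc = cross_modal_loss(model_a, model_b, unlabeled_batch)
    loss_unlabeled = loss_csc * beta + loss_cs   # Loss_unlabeled = Loss_csc * beta + Loss_cs
    total = loss_sup + loss_unlabeled
    opt.zero_grad()
    total.backward()
    opt.step()
    return total.item()
```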
Step 4, using the labeled data to test the first classification model and the second classification model trained in step 3, where the F1 value of each model is calculated as:
F1 = (1/l) · Σ_{i=1}^{l} (2 · Precision_i · Recall_i) / (Precision_i + Recall_i)
where l denotes the number of classes, Precision_i denotes the prediction precision of the i-th class, and Recall_i denotes the prediction recall of the i-th class.
The model with the higher F1 value of the two is selected as the final multi-modal data classification model.
Step 5, inputting the unclassified textbook content into the multi-modal data classification model and outputting the chapter classification result corresponding to the content.
The invention also provides a multi-modal data classification device comprising a computer memory, a computer processor and a computer program stored in the computer memory and executed on the computer processor, the above multi-modal data classification model being stored in the computer memory;
the computer program, when executed by the computer processor, implements the following step: inputting the multi-modal data to be classified into the multi-modal data classification model and, through computation and analysis, outputting the classification corresponding to the multi-modal data.
The method provided by the invention makes full use of labeled and unlabeled data, trains a model with strong generalization ability and high prediction accuracy, and achieves efficient classification of massive multi-modal English data.

Claims (10)

1. A multi-modal data classification method based on cross supervision is characterized by comprising the following steps:
step 1, obtaining multi-modal data, labeling part of the multi-modal data with three dimensions of text, audio and video, and constructing a sample set containing labeled data and unlabeled data;
step 2, constructing a first classification model and a second classification model on the basis of the same network structure, wherein the parameter initialization modes of the first classification model and the second classification model are different;
step 3, training and parameter adjustment are carried out on the first classification model and the second classification model constructed in the step 2 by using the sample set constructed in the step 1, wherein the training comprises supervised training and cross-supervised training;
step 4, adopting the labeled data to respectively test the first classification model and the second classification model obtained by training in the step 3, and screening out a model with a higher corresponding F1 value according to a test result to serve as a final multi-modal data classification model;
and 5, inputting the multi-modal data to be classified into a multi-modal data classification model, and outputting a classification result corresponding to the multi-modal data.
2. The cross-supervised-based multimodal data classification method of claim 1, wherein the sample set construction of step 1 further comprises a pre-processing of multimodal data:
generating a corresponding digital combination by a word segmentation conversion method aiming at the text;
and aiming at the video and the audio, extracting the features by using a pre-training model, and generating a corresponding tensor according to the extracted features.
3. A cross-supervision based multi-modal data classification method according to claim 1, characterized in that the ratio of annotated data to unlabeled data in step 1 is 1: 10.
4. The cross supervision-based multi-modal data classification method according to claim 1, wherein the network structure in step 2 includes a feature extraction module, a processing module, a fusion module and a classification module, the feature extraction module includes a multi-modal feature extractor, the multi-modal feature extractor is used for extracting multi-modal feature vectors of input data, the processing module is used for converting the extracted multi-modal feature quantities into dense vectors with uniform dimensions and inputting the dense vectors into the fusion module, the fusion module is used for performing a splicing operation on the input dense vectors to obtain fusion vectors and inputting the fusion vectors into the classification module, and the classification module performs a hidden projection operation according to the input fusion vectors to obtain classification results.
5. The cross-supervision-based multi-modal data classification method according to claim 1, wherein the supervision training in step 3 is to train the same labeled data set respectively corresponding to the first classification model and the second classification model so as to minimize the difference between the model prediction result and the actual labeled label, and comprehensively consider the result back propagation update gradient of the loss function to learn the model prediction target.
6. The cross-supervision-based multimodal data classification method according to claim 1, characterized in that the cross-supervision training in step 3 comprises predictive outcome cross-supervision and cross-modality cross-supervision.
7. The cross-supervision-based multi-modal data classification method according to claim 6, wherein the prediction result cross-supervision is to predict the unlabeled data by using a first classification model and a second classification model respectively, and exchange the respective prediction results as training data labels for supervised training, so as to minimize the difference between the two model prediction results, and update the model parameters by comprehensively considering the result back propagation update gradient of the respective cross-loss function.
8. A cross-mode supervision based multi-modal data classification method according to claim 6, characterized in that the cross-mode cross supervision is to perform separation and reconstruction on the modes of the data, obtain partial mode data, predict the partial mode data respectively by using a first classification model and a second classification model, so as to minimize the difference between the two model prediction results, and update model parameters by comprehensively considering the result back propagation update gradient of the respective cross loss function.
9. The cross-supervision-based multi-modal data classification method according to claim 1, characterized in that the model F1 value in step 4 is calculated as follows:
F1 = (1/l) · Σ_{i=1}^{l} (2 · Precision_i · Recall_i) / (Precision_i + Recall_i)
where l denotes the number of classes, Precision_i denotes the prediction precision of the i-th class, and Recall_i denotes the prediction recall of the i-th class.
10. A multi-modal data classification apparatus comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executed on the computer processor, wherein the multi-modal data classification model of claim 1 is employed in the computer memory; the computer processor, when executing the computer program, implements the following step: inputting the multi-modal data to be classified into the multi-modal data classification model and outputting, through computation and analysis, the classification corresponding to the multi-modal data.
CN202210773999.8A 2022-07-01 2022-07-01 Cross supervision-based multi-mode data classification method and device Pending CN115130591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210773999.8A CN115130591A (en) 2022-07-01 2022-07-01 Cross supervision-based multi-mode data classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210773999.8A CN115130591A (en) 2022-07-01 2022-07-01 Cross supervision-based multi-mode data classification method and device

Publications (1)

Publication Number Publication Date
CN115130591A true CN115130591A (en) 2022-09-30

Family

ID=83382332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210773999.8A Pending CN115130591A (en) 2022-07-01 2022-07-01 Cross supervision-based multi-mode data classification method and device

Country Status (1)

Country Link
CN (1) CN115130591A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115828162A (en) * 2023-02-08 2023-03-21 支付宝(杭州)信息技术有限公司 Classification model training method and device, storage medium and electronic equipment
CN116594838A (en) * 2023-05-18 2023-08-15 上海麓霏信息技术服务有限公司 Multi-mode data pre-training method and system
CN116594838B (en) * 2023-05-18 2023-12-22 上海好芯好翼智能科技有限公司 Multi-mode data pre-training method and system
CN116701303A (en) * 2023-07-06 2023-09-05 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning
CN116701303B (en) * 2023-07-06 2024-03-12 浙江档科信息技术有限公司 Electronic file classification method, system and readable storage medium based on deep learning

Similar Documents

Publication Publication Date Title
CN111079601A (en) Video content description method, system and device based on multi-mode attention mechanism
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
JP7290861B2 (en) Answer classifier and expression generator for question answering system and computer program for training the expression generator
CN110866542A (en) Depth representation learning method based on feature controllable fusion
CN115131613B (en) Small sample image classification method based on multidirectional knowledge migration
CN114443899A (en) Video classification method, device, equipment and medium
CN116975776A (en) Multi-mode data fusion method and device based on tensor and mutual information
CN113806494A (en) Named entity recognition method based on pre-training language model
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN117494051A (en) Classification processing method, model training method and related device
CN115965818A (en) Small sample image classification method based on similarity feature fusion
CN114780723A (en) Portrait generation method, system and medium based on guide network text classification
CN114398488A (en) Bilstm multi-label text classification method based on attention mechanism
CN116579345B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
Wang et al. MT-TCCT: Multi-task learning for multimodal emotion recognition
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
CN114120074B (en) Training method and training device for image recognition model based on semantic enhancement
CN115659242A (en) Multimode emotion classification method based on mode enhanced convolution graph
CN115544210A (en) Model training and event extraction method based on event extraction of continuous learning
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system
CN117473119B (en) Text video retrieval method and device
CN118154987A (en) Training and classifying method, device, medium and equipment for dynamic data classifying network
CN116882398B (en) Implicit chapter relation recognition method and system based on phrase interaction
CN118113871A (en) Multi-label emotion classification method and system based on non-autoregressive model
Zouitni et al. A Comparison Between LSTM and Transformers for Image Captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination