CN114841335A - Multi-modal joint representation learning method and system based on variational distillation


Info

Publication number
CN114841335A
Authority
CN
China
Prior art keywords
text
modal
image
input
mode
Prior art date
Legal status
Pending
Application number
CN202210062288.XA
Other languages
Chinese (zh)
Inventor
张亚伟
王晶晶
李寿山
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202210062288.XA
Publication of CN114841335A
Legal status: Pending

Classifications

    • G06N 3/082: Neural networks; learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06F 16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/217: Pattern recognition; validation; performance evaluation; active pattern learning techniques
    • G06F 18/24: Pattern recognition; classification techniques
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks

Abstract

The invention relates to a multi-modal joint representation learning method based on variational distillation. A student model, a text teacher model, and an image teacher model are deployed; the multi-modal data comprise raw text-modal data and raw image-modal data, which are arranged into the same input form to obtain a text-modal input and an image-modal input. These are respectively fed to a modal joint representation module to obtain the text output and image output of the student model, while the raw text-modal data and raw image-modal data are fed to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models. Variational mutual information is used to represent the correlation between the corresponding text and image outputs of the student and teacher models, and a distillation loss function is used to perform joint distillation training on the text output and image output, so that the student model acquires the ability to match the teacher models. The multi-modal joint representation learning method and system based on variational distillation provided by the invention surpass existing reference models on datasets of different modalities.

Description

Multi-modal joint representation learning method and system based on variational distillation
Technical Field
The invention relates to the technical field of multi-modal distillation, and in particular to a multi-modal joint representation learning method and system based on variational distillation.
Background
Large-scale pre-trained models, such as BERT, GPT, and RoBERTa in the text modality, or ResNet, BiT, and ViT in the image modality, have brought revolutionary advances in their respective fields. However, as pre-trained models grow larger, deploying them in resource-scarce environments becomes increasingly challenging. Model compression methods that reduce the size of a pre-trained model while preserving most of its performance are therefore receiving increasing attention.
In the text modality, PKD was an early exploration, very simple and effective, which mainly compresses the BERT model at the fine-tuning stage. Subsequently, DistilBERT, TinyBERT, and MobileBERT performed task-independent, efficient knowledge distillation of the intermediate-layer information of the BERT model with KL-divergence or L2 loss functions during the training phase, and CoDIR distilled RoBERTa based on contrastive learning during training and achieved better performance. In the image modality, FitNet fits the outputs of the teacher model and the student model on a task-specific dataset; VID uses a variational Gaussian distribution in place of the sample distribution to compute the mutual information between the output feature maps of the teacher and student models; DeiT adds a distillation token, distinguished from the classification token, so that training fits the teacher from two different angles; and CRD uses contrastive learning with a large number of negative samples to raise the lower bound on the mutual information between the teacher's and student's outputs.
Currently, single-modality distillation in the text and image fields is mature, but unified distillation frameworks covering both the text and image modalities are rare. Conventional methods fit the output probability distributions of the teacher and student models with KL divergence, or fit their representation vectors with an L2 loss function. Although these methods also reduce the output difference between teacher and student, they have drawbacks: the L2 loss first requires a dimension transformation that loses some information, and second it only considers the relationship between corresponding values of the representation vectors while ignoring the information as a whole. Contrastive distillation, compared with other methods, requires a large number of negative samples, which increases the training cost, and this cost is multiplied across several modalities, making it unsuitable under resource constraints. On the other hand, serious forgetting problems arise in multi-modal distillation; for example, distilling text information before distilling image information can cause the encoder to lose most of its text-encoding capacity.
Therefore, there is at present no modality-unified distillation method: the forgetting problem produced by joint multi-modal training remains unsolved, and a large number of additional negative samples are needed, greatly increasing the computational cost.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the problems in the prior art and provide a multi-modal joint representation learning method and system based on variational distillation, thereby addressing the lack of a modality-unified distillation method in the prior art and surpassing existing reference models on datasets of different modalities.
In order to solve the above technical problem, the invention provides a multi-modal joint representation learning method based on variational distillation, comprising the following steps:
deploying a student model and a teacher model, wherein the teacher model comprises a text teacher model and an image teacher model, and the student model comprises a multi-modal data unification module; inputting raw multi-modal data comprising raw text-modal data and raw image-modal data; inputting the raw text-modal data and the raw image-modal data to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form, and performing a normalization operation on the text-modal input and the image-modal input;
the student model further comprises a modal joint representation module; the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model, while the raw text-modal data and the raw image-modal data are respectively fed to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
In an embodiment of the invention, the multi-modal data unification module is deployed at the front end of the modal joint representation module, and the multi-modal data unification module is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In an embodiment of the invention, arranging the raw text-modal data and the raw image-modal data into the same input form to obtain the text-modal input and the image-modal input comprises:
adding a [CLS] symbol and a [SEP] symbol to the raw text-modal data, adding a [DIS] symbol at the end of each sentence in the raw text-modal data, and obtaining the text-modal input through a word-vector matrix;
dividing the raw image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
In one embodiment of the invention, the modal joint representation module comprises a MobileBERT model consisting of 24 Transformer layers, with a linear layer added to each Transformer layer.
In one embodiment of the invention, the distillation loss function is the sum of the loss function of the text teacher model and the loss function of the image teacher model.
In addition, the invention also provides a multi-modal joint representation learning system based on variational distillation, comprising:
a student model comprising a multi-modal data unification module and a modal joint representation module, wherein raw multi-modal data comprising raw text-modal data and raw image-modal data are input; the raw text-modal data and the raw image-modal data are input to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form; a normalization operation is performed on the text-modal input and the image-modal input; and the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model;
a teacher model comprising a text teacher model and an image teacher model, wherein the raw text-modal data and the raw image-modal data are respectively input to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and a modality-unified distillation module for representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and for performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
In an embodiment of the invention, the multi-modal data unification module is deployed at the front end of the modal joint representation module, and the multi-modal data unification module is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In one embodiment of the invention, the multi-modal data unification module comprises:
a text-modal data arrangement sub-module for adding a [CLS] symbol and a [SEP] symbol to the text-modal data, adding a [DIS] symbol at the end of each sentence in the text-modal data, and obtaining the text-modal input through a word-vector matrix;
and an image-modal data arrangement sub-module for dividing the image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
Furthermore, the present invention also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the program.
Furthermore, the present invention also provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
1. Aiming at the lack of a modality-unified distillation method in the prior art, the invention provides a multi-modal joint representation learning method and system based on variational distillation that surpass existing reference models on datasets of different modalities;
2. The invention distills from the angle of variational mutual information, which not only greatly reduces the information loss of the teacher model but also requires no large number of negative samples to participate in the computation; the method is simple and effective;
3. The invention adopts a joint distillation scheme, which solves the forgetting problem caused by multi-modal distillation.
Drawings
In order that the present disclosure may be more readily and clearly understood, the invention is described in further detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the multi-modal joint representation learning method based on variational distillation of the present invention.
FIG. 2 is a schematic diagram of a framework of a modal unified distillation module in the multi-modal joint representation learning system based on variational distillation of the present invention.
Detailed Description
The present invention is further described below in conjunction with the accompanying drawings and specific examples so that those skilled in the art may better understand and practice it; the examples, however, are not intended to limit the present invention.
Example one
Referring to fig. 1 and 2, the present embodiment provides a multi-modal joint representation learning method based on variational distillation, including the following steps:
s1: deploying a student model and a teacher model, wherein the teacher model comprises a text teacher model and an image teacher model, the student model comprises a multi-modal data unification module, inputting original multi-modal data, the original multi-modal data comprises original text modal data and original image modal data, inputting the original text modal data and the original image modal data to the multi-modal data unification module to obtain text modal input and image modal input which have the same input form, and performing normalization operation on the text modal input and the image modal input;
s2: the student model comprises a modal joint representation module, wherein the text modal input and the image modal input after normalization operation are respectively input into the modal joint representation module to obtain the text output and the image output of the student model, and simultaneously, the original text modal data and the original image modal data are respectively input into the text teacher model and the image teacher model to obtain the text output and the image output of the teacher model;
s3: and representing the correlation between text output and image output corresponding to the student model and the teacher model by using variational mutual information, and performing combined distillation training on the text output and the image output by using a distillation loss function so that the student model can simultaneously obtain the capability of matching the text teacher model and the image teacher model.
The multi-modal joint representation learning method based on variational distillation disclosed by the invention addresses the lack of a modality-unified distillation method in the prior art, and it is superior to existing reference models on datasets of different modalities.
In the multi-modal joint representation learning method based on variational distillation disclosed by the invention, for S1 of the above embodiment, the student model comprises a multi-modal data unification module deployed at the front end of the modal joint representation module; this module arranges the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In the multi-modal joint representation learning method based on variational distillation disclosed by the invention, for S1 of the above embodiment, the raw text-modal data and the raw image-modal data are arranged into the same input form as follows: on the one hand, a [CLS] symbol and a [SEP] symbol are added to the text-modal data, a [DIS] symbol is added at the end of each sentence, and the text-modal input is obtained through a word-vector matrix; on the other hand, the image-modal data is divided into a plurality of picture blocks, each picture block is stretched into a one-dimensional vector, a [CLS] symbol and a [DIS] symbol are added at the start and end positions, and an image-modal input of the same form as the text-modal input is obtained through dimension scaling.
Specifically, on the one hand, for raw text-modal data $D_l$ of length $l$, a [CLS] symbol and a [SEP] symbol are added, and a [DIS] symbol is appended at the end of the sentence for bilateral distillation, which improves performance and accelerates fitting. The corresponding sub-word indices are obtained with BPE, and the final input text word vectors are obtained through a word-vector matrix of dimension $d$:

$$X_l = \mathrm{Embed}\big(\mathrm{BPE}(D_l)\big) \in \mathbb{R}^{L \times d}, \qquad L = l + 3.$$

On the other hand, because text and images have different representation forms, it is difficult to unify them directly. The invention therefore divides the image into picture blocks, so that an input of the same form as the text word vectors can be generated. Raw image input data $D_t$ is first scaled to a size of $256 \times 256 \times 3$ and then divided into 256 picture blocks of size $16 \times 16 \times 3$:

$$D_t \rightarrow \{p_1, p_2, \ldots, p_{256}\}, \qquad p_i \in \mathbb{R}^{16 \times 16 \times 3}.$$

Each picture block is stretched into a one-dimensional vector,

$$v_i = \mathrm{flatten}(p_i) \in \mathbb{R}^{768},$$

then a [CLS] symbol and a [DIS] symbol are added at the start and end positions, and dimension scaling by a final linear layer yields the same form as the text input:

$$X_t = \big[x_{\mathrm{CLS}};\; W v_1;\; \ldots;\; W v_{256};\; x_{\mathrm{DIS}}\big] \in \mathbb{R}^{258 \times d}.$$

Since the distributions of text and image data differ somewhat, which causes large fluctuations in the numerical values, the data are finally normalized. The invention thus unifies the input forms and distributions of the text and image modalities, which facilitates processing by the subsequent modal joint layers.
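To make the unification step concrete, the following is a minimal PyTorch sketch of the multi-modal data unification module described above. The class name, the hyper-parameters (d = 512, a BERT-sized vocabulary), and the exact handling of the special symbols are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class MultiModalUnifier(nn.Module):
    """Maps raw text token ids and raw 256x256 images into sequences of the
    same form ([CLS] ... [DIS], one d-dimensional vector per position)."""

    def __init__(self, vocab_size=30522, d=512, patch=16):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)         # word-vector matrix
        self.patch_proj = nn.Linear(patch * patch * 3, d)   # dimension scaling
        self.cls = nn.Parameter(torch.randn(1, 1, d))       # learned [CLS] vector
        self.dis = nn.Parameter(torch.randn(1, 1, d))       # learned [DIS] vector
        self.norm = nn.LayerNorm(d)                         # final normalization
        self.patch = patch

    def forward_text(self, token_ids):
        # token_ids: (B, L) ids that already include [CLS]/[SEP]/[DIS]
        return self.norm(self.word_emb(token_ids))          # (B, L, d)

    def forward_image(self, images):
        # images: (B, 3, 256, 256) -> 256 picture blocks of 16x16x3 each
        B, p = images.size(0), self.patch
        blocks = images.unfold(2, p, p).unfold(3, p, p)     # (B, 3, 16, 16, p, p)
        blocks = blocks.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, p * p * 3)
        x = self.patch_proj(blocks)                         # (B, 256, d)
        x = torch.cat([self.cls.expand(B, -1, -1), x,
                       self.dis.expand(B, -1, -1)], dim=1)  # (B, 258, d)
        return self.norm(x)
```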
In the multi-modal joint representation learning method based on variational distillation disclosed by the invention, for S2 of the above embodiment, the modal joint representation module comprises a MobileBERT model consisting of 24 Transformer layers, with a linear layer added to each Transformer layer, so that the Transformer parameter scale of MobileBERT remains small. For an input of length $N$,

$$X \in \mathbb{R}^{N \times d},$$

the output of the $i$-th Transformer layer is

$$H^{(i)} = \mathrm{Transformer}^{(i)}\big(H^{(i-1)}\big) \in \mathbb{R}^{N \times d}, \qquad H^{(0)} = X.$$

For convenience of distillation, the feature representation corresponding to the [CLS] symbol,

$$h^{(i)} = H^{(i)}_{[\mathrm{CLS}]} \in \mathbb{R}^{d},$$

is taken as the student output representation used during training.
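The sketch below shows how such a student might expose its per-layer [CLS] and [DIS] features for distillation. Using torch.nn.TransformerEncoderLayer in place of the actual MobileBERT blocks, and all layer sizes, are assumptions made only so the example runs.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """24 Transformer layers, each followed by an added linear layer; the
    [CLS] (first) and [DIS] (last) features of every layer are collected."""

    def __init__(self, d=512, n_layers=24, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                        batch_first=True)
             for _ in range(n_layers)])
        self.proj = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])

    def forward(self, x):
        # x: (B, N, d) unified text or image input from the unification module
        cls_feats, dis_feats, h = [], [], x
        for layer, proj in zip(self.layers, self.proj):
            h = proj(layer(h))           # Transformer block + added linear layer
            cls_feats.append(h[:, 0])    # h^(i) at the [CLS] position
            dis_feats.append(h[:, -1])   # h^(i) at the [DIS] position
        return h, cls_feats, dis_feats
```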
In the multi-modal joint representation learning method based on variational distillation disclosed in the present invention, in S3 of the above embodiment, the distillation loss function is a sum of a loss function of the text teacher model and a loss function of the image teacher model.
Specifically, for the two different kinds of modal information, text and image, the invention unifies the distillation scheme from the angle of mutual information. The following description is divided into four parts: variational mutual information, knowledge distillation, a summary of the distillation process, and experimental analysis.
(3.1) Variational mutual information
Mutual information quantifies how much the uncertainty of one random variable is reduced by knowing another random variable. In representation learning, it can further be used to measure the correlation between different representations. From the perspective of information theory, knowledge transfer is a process of keeping the mutual information between the corresponding outputs of the teacher and student models high, and can be regarded as the teacher model's knowledge being retained in the student model. Given a pair of random variables (X, Y), the mutual information between X and Y is defined as:

$$I(X;Y) = H(X) - H(X \mid Y),$$
where $H(X)$ is the entropy of $X$, and $H(X \mid Y)$ is the conditional entropy under the joint distribution $P(X, Y)$. Because the joint distribution is difficult to compute directly, the invention aggregates the results of the input distributions over the various layers into a joint distribution, so that the mutual information can adapt to distributions at higher semantic levels. However, since the precise calculation of mutual information is very difficult, it is hard to maximize directly. One solution is contrastive learning, which pulls samples closer to positive samples and away from negative samples so as to raise the lower bound on the mutual information; this is effective when a large number of negative samples is available, but computationally expensive. The invention therefore uses a variational lower bound to approximately compute the mutual information $I(X; Y)$. Following VID [13], the distribution $p(x \mid y)$ is considered difficult to calculate, so the invention approximates it with a variational distribution $q(x \mid y)$. The bound can then be derived as:

$$\begin{aligned} I(X;Y) &= H(X) - H(X \mid Y) \\ &= H(X) + \mathbb{E}_{p(x,y)}\big[\log p(x \mid y)\big] \\ &= H(X) + \mathbb{E}_{p(x,y)}\big[\log q(x \mid y)\big] + \mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p(x \mid y)\,\|\,q(x \mid y)\big)\Big] \\ &\geq H(X) + \mathbb{E}_{p(x,y)}\big[\log q(x \mid y)\big]. \end{aligned}$$
On the one hand, the last inequality follows from the non-negativity of the KL divergence. On the other hand, $H(X)$ is a constant, so the invention only needs to compute $\mathbb{E}_{p(x,y)}\big[\log q(x \mid y)\big]$.
Further, since a single fixed Gaussian is too simple to approximate some complex distributions, the invention parameterizes $q(x \mid y)$ as a Gaussian whose mean and variance are learned, so that $\log q(x \mid y)$ can be further expanded as:

$$-\log q(x \mid y) = \sum_{n=1}^{N} \left( \log \sigma_n + \frac{\big(x_n - \mu_n(y)\big)^2}{2\sigma_n^2} \right) + \text{constant},$$

where $x_n$ is the scalar component of $x$ at position $n$, $\mu_n(y)$ is the output of an encoder network $\mu(\cdot)$ composed of Transformers, and the variance is kept positive via the softplus function:

$$\sigma_n = \log\big(1 + e^{c_n}\big) + \varepsilon,$$

where $c_n$ is the parameter to be optimized and $\varepsilon$ is a very small constant set greater than 0 to ensure that the variance remains positive; "constant" denotes a constant term. The loss function of the final mutual-information distillation is:

$$L_{MI}(x, y) = \sum_{n=1}^{N} \left( \log \sigma_n + \frac{\big(x_n - \mu_n(y)\big)^2}{2\sigma_n^2} \right).$$
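A minimal PyTorch sketch of this variational mutual-information loss follows. The patent describes a Transformer-based μ(·); the small MLP here, and the constructor's dimension arguments, are simplifying assumptions. The σ parameterization follows the softplus formula above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalMILoss(nn.Module):
    """-log q(x|y) for a diagonal Gaussian q(x|y) = N(x; mu(y), sigma^2);
    minimizing it maximizes the variational lower bound on I(X; Y)."""

    def __init__(self, d_student, d_teacher, eps=1e-5):
        super().__init__()
        # mu(.): assumed MLP mapping student features into the teacher space
        self.mu = nn.Sequential(nn.Linear(d_student, d_teacher), nn.ReLU(),
                                nn.Linear(d_teacher, d_teacher))
        self.c = nn.Parameter(torch.zeros(d_teacher))  # per-dim variance params
        self.eps = eps

    def forward(self, x_teacher, y_student):
        sigma = F.softplus(self.c) + self.eps   # sigma_n = log(1+e^{c_n}) + eps
        mu = self.mu(y_student)                 # mu_n(y)
        # sum_n [ log sigma_n + (x_n - mu_n(y))^2 / (2 sigma_n^2) ]
        nll = torch.log(sigma) + (x_teacher - mu) ** 2 / (2 * sigma ** 2)
        return nll.sum(dim=-1).mean()
```

Note that, unlike contrastive distillation, no negative samples appear anywhere in this loss, which is where the method's computational saving comes from.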
(3.2) Knowledge distillation
In a preferred embodiment, BERT_large is adopted as the teacher model of the text modality, and ResNet_152 is adopted as the teacher model of the image modality. Let $H_l^{(i)}$, $H_t^{(i)}$, and $H_s^{(i)}$ be the outputs of the $i$-th hidden layer of BERT_large, ResNet_152, and MXBERT respectively, where $h_l^{(i)}$ and $h_s^{(i)}$ are the feature vectors corresponding to [CLS] in the output representations of BERT_large and MXBERT, and $d_s^{(i)}$ is the feature vector corresponding to [DIS] in the output representation of MXBERT.
For the text modality, the loss function is:

$$L_l = \sum_{i} \alpha(i)\, L_{MI}\Big(h_l^{(i)},\; g\big(h_s^{(i)}\big)\Big),$$

where $\alpha(i)$ is a coefficient function that sets a different weight for each layer and increases as the number of layers increases, and $g(\cdot)$ is a multi-layer non-linear transformation function for realizing the variational effect.
For the image modality, the loss function is:

$$L_T = \sum_{i} \beta(i)\, L_{MI}\Big(\phi\big(H_t^{(i')}\big),\; d_s^{(i)}\Big), \qquad i' = f(i),$$

where $\beta(i)$ is a coefficient function that sets a different weight for each layer and increases as the number of layers increases. Because ResNet and MXBERT have different numbers of layers, $f(\cdot)$ is a layer-mapping function that selects the corresponding student layer for each teacher layer, and $\phi(\cdot)$ is a multi-layer linear transformation that mainly converts the three-dimensional feature map output by ResNet into a one-dimensional feature vector.
Because the output value ranges of the teacher models are the same, and in order to solve the forgetting problem, the invention trains the distillation of the two modalities jointly; the loss function of the whole distillation is finally defined as:

$$L_{dis} = L_l + L_T.$$
(3.3) Summary of the distillation process
Given two teacher models, the text teacher model BERT_large of the text modality and the image teacher model ResNet_152 of the image modality, and given one student model, MXBERT: the purpose of modality-unified distillation is to let MXBERT, which has far fewer parameters, learn the capabilities of the two teacher models simultaneously, so that MXBERT acquires both text- and image-encoding ability.
The distillation process is as follows. Given one text input and one image input: the text is passed through BERT_large and MXBERT respectively, and the outputs of the two models' intermediate layers are fitted by means of $L_l$ (fitting means making the two variables as equal as possible; in this respect, the loss function is reduced, and a lower value means the two variables are more similar, i.e., the output of the student model is made more and more similar to the output of the teacher model, so that the student model obtains performance matching the teacher model). The image is passed through ResNet_152 and MXBERT respectively, and the outputs of the two models are fitted by means of $L_T$. Because sequential training creates the forgetting problem, the invention performs joint distillation, i.e., $L_{dis} = L_l + L_T$, and only the $L_{dis}$ loss function needs to be optimized.
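Under the interfaces assumed in the sketches above, one joint distillation step might look as follows. The teacher wrappers returning per-layer features, the layer_map function realizing f(·), and the weighting callables alpha and beta are all illustrative assumptions.

```python
import torch

def joint_distillation_step(text_ids, images, unifier, student,
                            text_teacher, image_teacher,
                            vid_text, vid_img, alpha, beta, layer_map,
                            optimizer):
    """One joint step: L_dis = L_l + L_T is optimized in a single backward
    pass, so neither modality's encoding ability is forgotten."""
    with torch.no_grad():                        # teachers stay frozen
        t_cls = text_teacher(text_ids)           # per-layer [CLS] features
        t_maps = image_teacher(images)           # per-layer 3-D feature maps

    _, s_cls, _ = student(unifier.forward_text(text_ids))
    _, _, s_dis = student(unifier.forward_image(images))

    # Text modality: fit teacher [CLS] features, layer weights alpha(i).
    L_l = sum(alpha(i) * vid_text[i](t_cls[i], s_cls[i])
              for i in range(len(s_cls)))
    # Image modality: flatten the teacher map (phi) and align layers via f(.).
    L_T = sum(beta(i) * vid_img[i](t_maps[layer_map(i)].flatten(1), s_dis[i])
              for i in range(len(s_dis)))

    L_dis = L_l + L_T                            # joint loss against forgetting
    optimizer.zero_grad()
    L_dis.backward()
    optimizer.step()
    return float(L_dis)
```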
In actual use, the student model MXBERT can take both text and images as input, and only the final output of the model is needed for downstream tasks such as text classification, sentiment analysis, and image classification.
(3.4) Experimental analysis
Table 1 shows the performance comparison between MXBERT and text-modality encoders. In Table 1, the models in the second group (ELMo, GPT, and BERT) are pre-trained models, and the models in the third group (MobileBERT and others) are reference models used for fair comparison. It can be seen that in the text modality, the method of the invention not only surpasses the original reference model MobileBERT on multiple GLUE datasets but also further surpasses the pre-trained model BERT_base on most tasks. Table 2 shows the performance comparison between MXBERT and image encoders; it can be seen from Table 2 that in the image modality, the model surpasses the reference model ResNet_50 and far surpasses ResNet_18. Therefore, unifying the modalities causes no mutual interference and instead shows a certain complementarity. Compared with other single-modality reference methods, CMDIR is simple and effective, needs no additional samples to participate in the computation, unifies the distillation schemes of different modalities, and matches or even surpasses the original distillation schemes in distillation performance.
Table 1. Performance comparison of MXBERT and text-modality encoders, where the evaluation metric for CoLA is the Matthews correlation coefficient; for SST-2, MNLI, QNLI, and RTE, accuracy; for MRPC and QQP, the average of the F1 value and the accuracy; and for STS-B, the Pearson correlation coefficient.
[Table 1 is presented as an image in the original publication.]
Table 2. Performance comparison of MXBERT and image encoders, where the CIFAR datasets report the top-1 error rate and ImageNet reports the top-5 error rate.
[Table 2 is presented as an image in the original publication.]
The multi-modal joint representation learning method based on variational distillation disclosed by the invention adopts distillation from the angle of variational mutual information, which greatly reduces the information loss of the teacher model, requires no large number of negative samples to participate in the computation, and is simple and effective.
The multi-modal joint representation learning method based on variational distillation disclosed by the invention adopts a joint distillation scheme, which solves the forgetting problem caused by multi-modal distillation.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer device, including:
a memory for storing a computer program;
a processor for implementing the steps of the above multi-modal joint representation learning method based on variational distillation when executing a computer program.
In the embodiment of the present invention, the processor may be a Central Processing Unit (CPU), an application specific integrated circuit, a digital signal processor, a field programmable gate array or other programmable logic device, etc.
The processor may invoke a program stored in the memory, and in particular, the processor may perform operations in an embodiment of a multi-modal joint representation learning method based on variational distillation.
The memory is used for storing one or more programs, which may include program code including computer operating instructions.
Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device or other non-volatile solid-state storage device.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above multi-modal joint representation learning method based on variational distillation.
Example two
In the following, the multi-modal joint representation learning system based on variational distillation disclosed in the second embodiment of the invention is introduced; the system described below and the method described above may be referred to in correspondence with each other.
The second embodiment of the invention discloses a multi-modal joint representation learning system based on variational distillation, comprising:
a student model comprising a multi-modal data unification module and a modal joint representation module, wherein raw multi-modal data comprising raw text-modal data and raw image-modal data are input; the raw text-modal data and the raw image-modal data are input to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form; a normalization operation is performed on the text-modal input and the image-modal input; and the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model;
a teacher model comprising a text teacher model and an image teacher model, wherein the raw text-modal data and the raw image-modal data are respectively input to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and a modality-unified distillation module for representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and for performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
In the multi-modal joint representation learning system based on variational distillation disclosed by the invention, the multi-modal data unification module is deployed at the front end of the modal joint representation module and is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In the multi-modal joint representation learning system based on variational distillation disclosed by the invention, the multi-modal data unification module comprises:
a text-modal data arrangement sub-module for adding a [CLS] symbol and a [SEP] symbol to the text-modal data, adding a [DIS] symbol at the end of each sentence in the text-modal data, and obtaining the text-modal input through a word-vector matrix;
and an image-modal data arrangement sub-module for dividing the image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
The multi-modal joint representation learning system based on variational distillation of this embodiment is used to implement the aforementioned multi-modal joint representation learning method based on variational distillation, so the specific implementation of the system can be found in the foregoing embodiment of the method; reference may be made to the descriptions of the corresponding parts, which are not repeated here.
In addition, since the multi-modal joint representation learning system based on variational distillation of this embodiment is used to implement the aforementioned method, its role corresponds to that of the method described above and is not described again here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A multi-modal joint representation learning method based on variational distillation, characterized by comprising the following steps:
deploying a student model and a teacher model, wherein the teacher model comprises a text teacher model and an image teacher model, and the student model comprises a multi-modal data unification module; inputting raw multi-modal data comprising raw text-modal data and raw image-modal data; inputting the raw text-modal data and the raw image-modal data to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form, and performing a normalization operation on the text-modal input and the image-modal input;
the student model further comprises a modal joint representation module; the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model, while the raw text-modal data and the raw image-modal data are respectively fed to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
2. The multi-modal joint representation learning method based on variational distillation according to claim 1, characterized in that: the multi-modal data unification module is deployed at the front end of the modal joint representation module, and the multi-modal data unification module is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
3. The multi-modal joint representation learning method based on variational distillation according to claim 1, characterized in that arranging the raw text-modal data and the raw image-modal data into the same input form to obtain the text-modal input and the image-modal input comprises:
adding a [CLS] symbol and a [SEP] symbol to the raw text-modal data, adding a [DIS] symbol at the end of each sentence in the raw text-modal data, and obtaining the text-modal input through a word-vector matrix;
dividing the raw image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
4. The multi-modal joint representation learning method based on variational distillation according to claim 1 or 2, characterized in that: the modal joint representation module comprises a MobileBERT model consisting of 24 Transformer layers, with a linear layer added to each Transformer layer.
5. The multi-modal joint representation learning method based on variational distillation according to claim 1, characterized in that: the distillation loss function is the sum of the loss function of the text teacher model and the loss function of the image teacher model.
6. A multi-modal joint representation learning system based on variational distillation, comprising:
a student model comprising a multi-modal data unification module and a modal joint representation module, wherein raw multi-modal data comprising raw text-modal data and raw image-modal data are input; the raw text-modal data and the raw image-modal data are input to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form; a normalization operation is performed on the text-modal input and the image-modal input; and the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model;
a teacher model comprising a text teacher model and an image teacher model, wherein the raw text-modal data and the raw image-modal data are respectively input to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and a modality-unified distillation module for representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and for performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
7. The multi-modal joint representation learning system based on variational distillation according to claim 6, characterized in that: the multi-modal data unification module is deployed at the front end of the modal joint representation module and is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
8. The multi-modal joint representation learning system based on variational distillation according to claim 6 or 7, characterized in that the multi-modal data unification module comprises:
a text-modal data arrangement sub-module for adding a [CLS] symbol and a [SEP] symbol to the text-modal data, adding a [DIS] symbol at the end of each sentence in the text-modal data, and obtaining the text-modal input through a word-vector matrix;
and an image-modal data arrangement sub-module for dividing the image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202210062288.XA 2022-01-19 2022-01-19 Multi-modal joint representation learning method and system based on variational distillation Pending CN114841335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210062288.XA CN114841335A (en) Multi-modal joint representation learning method and system based on variational distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210062288.XA CN114841335A (en) Multi-modal joint representation learning method and system based on variational distillation

Publications (1)

Publication Number Publication Date
CN114841335A (en) 2022-08-02

Family

ID=82562516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210062288.XA Pending CN114841335A (en) 2022-01-19 2022-01-19 Multi-mode joint representation learning method and system based on variational distillation

Country Status (1)

Country Link
CN (1) CN114841335A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239937A (en) * 2022-09-23 2022-10-25 西南交通大学 Cross-modal emotion prediction method
CN115239937B (en) * 2022-09-23 2022-12-20 西南交通大学 Cross-modal emotion prediction method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination