CN114841335A - Multi-modal joint representation learning method and system based on variational distillation


Info

Publication number
CN114841335A
Authority
CN
China
Prior art keywords
text
modal
image
input
mode
Prior art date
Legal status
Pending
Application number
CN202210062288.XA
Other languages
Chinese (zh)
Inventor
张亚伟
王晶晶
李寿山
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202210062288.XA
Publication of CN114841335A
Legal status: Pending

Classifications

    • G06N 3/082: Neural networks; learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06F 16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/217: Pattern recognition; validation; performance evaluation; active pattern learning techniques
    • G06F 18/24: Pattern recognition; classification techniques
    • G06N 3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks

Abstract

The invention relates to a multi-modal joint representation learning method based on variational distillation. A student model, a text teacher model, and an image teacher model are deployed; the multi-modal data comprise raw text-modal data and raw image-modal data, which are arranged into the same input form to obtain a text-modal input and an image-modal input. These are respectively fed to a modal joint representation module to obtain the text output and image output of the student model, while the raw text-modal data and raw image-modal data are fed to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models. Variational mutual information is used to represent the correlation between the corresponding text and image outputs of the student and teacher models, and a distillation loss function is used to perform joint distillation training on the text output and image output, so that the student model acquires the ability to match the teacher models. The multi-modal joint representation learning method and system based on variational distillation provided by the invention surpass existing reference models on datasets of different modalities.

Description

Multi-modal joint representation learning method and system based on variational distillation
Technical Field
The invention relates to the technical field of multi-modal distillation, and in particular to a multi-modal joint representation learning method and system based on variational distillation.
Background
Large-scale pre-trained models, such as BERT, GPT, and RoBERTa in the text modality, or ResNet, BiT, and ViT in the image modality, have brought revolutionary advances in their respective fields. However, as pre-trained models grow larger, deploying them in resource-scarce environments becomes increasingly challenging. Model compression methods that reduce the size of a pre-trained model while preserving most of its performance are therefore receiving increasing attention.
In the text modality, PKD was an early exploration, very simple and effective, which mainly compresses the BERT model at the fine-tuning stage. Subsequently, DistilBERT, TinyBERT, and MobileBERT performed task-independent, efficient knowledge distillation of the intermediate-layer information of the BERT model with KL-divergence or L2 loss functions during the training phase, and CoDIR distilled RoBERTa based on contrastive learning during training and achieved better performance. In the image modality, FitNet fits the outputs of the teacher model and the student model on a task-specific dataset; VID uses a variational Gaussian distribution in place of the sample distribution to compute the mutual information between the output feature maps of the teacher and student models; DeiT adds a distillation token, distinguished from the classification token, so that training fits the teacher from two different angles; and CRD uses contrastive learning with a large number of negative samples to raise the lower bound on the mutual information between the teacher's and student's outputs.
Currently, single-modality distillation in the text and image fields is mature, but unified distillation frameworks covering both the text and image modalities are rare. Conventional methods fit the output probability distributions of the teacher and student models with KL divergence, or fit their representation vectors with an L2 loss function. Although these methods also reduce the output difference between teacher and student, they have drawbacks: the L2 loss first requires a dimension transformation that loses some information, and second it only considers the relationship between corresponding values of the representation vectors while ignoring the information as a whole. Contrastive distillation, compared with other methods, requires a large number of negative samples, which increases the training cost, and this cost is multiplied across several modalities, making it unsuitable under resource constraints. On the other hand, serious forgetting problems arise in multi-modal distillation; for example, distilling text information before distilling image information can cause the encoder to lose most of its text-encoding capacity.
Therefore, there is at present no modality-unified distillation method: the forgetting problem produced by joint multi-modal training remains unsolved, and a large number of additional negative samples are needed, greatly increasing the computational cost.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the problems in the prior art and provide a multi-modal joint representation learning method and system based on variational distillation, thereby addressing the lack of a modality-unified distillation method in the prior art and surpassing existing reference models on datasets of different modalities.
In order to solve the above technical problem, the invention provides a multi-modal joint representation learning method based on variational distillation, comprising the following steps:
deploying a student model and a teacher model, wherein the teacher model comprises a text teacher model and an image teacher model, and the student model comprises a multi-modal data unification module; inputting raw multi-modal data comprising raw text-modal data and raw image-modal data; inputting the raw text-modal data and the raw image-modal data to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form, and performing a normalization operation on the text-modal input and the image-modal input;
the student model further comprises a modal joint representation module; the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model, while the raw text-modal data and the raw image-modal data are respectively fed to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
In an embodiment of the invention, the multi-modal data unification module is deployed at the front end of the modal joint representation module, and the multi-modal data unification module is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In an embodiment of the invention, arranging the raw text-modal data and the raw image-modal data into the same input form to obtain the text-modal input and the image-modal input comprises:
adding a [CLS] symbol and a [SEP] symbol to the raw text-modal data, adding a [DIS] symbol at the end of each sentence in the raw text-modal data, and obtaining the text-modal input through a word-vector matrix;
dividing the raw image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
In one embodiment of the invention, the modal joint representation module comprises a MobileBERT model consisting of 24 Transformer layers, with a linear layer added to each Transformer layer.
In one embodiment of the invention, the distillation loss function is the sum of the loss function of the text teacher model and the loss function of the image teacher model.
In addition, the invention also provides a multi-modal joint representation learning system based on variational distillation, comprising:
a student model comprising a multi-modal data unification module and a modal joint representation module, wherein raw multi-modal data comprising raw text-modal data and raw image-modal data are input; the raw text-modal data and the raw image-modal data are input to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form; a normalization operation is performed on the text-modal input and the image-modal input; and the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model;
a teacher model comprising a text teacher model and an image teacher model, wherein the raw text-modal data and the raw image-modal data are respectively input to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and a modality-unified distillation module for representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and for performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
In an embodiment of the invention, the multi-modal data unification module is deployed at the front end of the modal joint representation module, and the multi-modal data unification module is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In one embodiment of the invention, the multi-modal data unification module comprises:
a text-modal data arrangement sub-module for adding a [CLS] symbol and a [SEP] symbol to the text-modal data, adding a [DIS] symbol at the end of each sentence in the text-modal data, and obtaining the text-modal input through a word-vector matrix;
and an image-modal data arrangement sub-module for dividing the image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
Furthermore, the present invention also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the program.
Furthermore, the present invention also provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method as described above.
Compared with the prior art, the technical scheme of the invention has the following advantages:
1. Aiming at the lack of a modality-unified distillation method in the prior art, the invention provides a multi-modal joint representation learning method and system based on variational distillation that surpass existing reference models on datasets of different modalities;
2. The invention distills from the angle of variational mutual information, which not only greatly reduces the information loss of the teacher model but also requires no large number of negative samples to participate in the computation; the method is simple and effective;
3. The invention adopts a joint distillation scheme, which solves the forgetting problem caused by multi-modal distillation.
Drawings
In order that the present disclosure may be more readily and clearly understood, the invention is described in further detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the multi-modal joint representation learning method based on variational distillation of the present invention.
FIG. 2 is a schematic diagram of a framework of a modal unified distillation module in the multi-modal joint representation learning system based on variational distillation of the present invention.
Detailed Description
The present invention is further described below in conjunction with the accompanying drawings and specific examples so that those skilled in the art may better understand and practice it; the examples, however, are not intended to limit the present invention.
Example one
Referring to fig. 1 and 2, the present embodiment provides a multi-modal joint representation learning method based on variational distillation, including the following steps:
s1: deploying a student model and a teacher model, wherein the teacher model comprises a text teacher model and an image teacher model, the student model comprises a multi-modal data unification module, inputting original multi-modal data, the original multi-modal data comprises original text modal data and original image modal data, inputting the original text modal data and the original image modal data to the multi-modal data unification module to obtain text modal input and image modal input which have the same input form, and performing normalization operation on the text modal input and the image modal input;
s2: the student model comprises a modal joint representation module, wherein the text modal input and the image modal input after normalization operation are respectively input into the modal joint representation module to obtain the text output and the image output of the student model, and simultaneously, the original text modal data and the original image modal data are respectively input into the text teacher model and the image teacher model to obtain the text output and the image output of the teacher model;
s3: and representing the correlation between text output and image output corresponding to the student model and the teacher model by using variational mutual information, and performing combined distillation training on the text output and the image output by using a distillation loss function so that the student model can simultaneously obtain the capability of matching the text teacher model and the image teacher model.
The multi-modal joint representation learning method based on variational distillation disclosed by the invention addresses the lack of a modality-unified distillation method in the prior art, and it is superior to existing reference models on datasets of different modalities.
In the multi-modal joint representation learning method based on variational distillation disclosed by the invention, for S1 of the above embodiment, the student model comprises a multi-modal data unification module deployed at the front end of the modal joint representation module; this module arranges the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In the multi-modal joint representation learning method based on variational distillation disclosed by the invention, for S1 of the above embodiment, the raw text-modal data and the raw image-modal data are arranged into the same input form as follows: on the one hand, a [CLS] symbol and a [SEP] symbol are added to the text-modal data, a [DIS] symbol is added at the end of each sentence, and the text-modal input is obtained through a word-vector matrix; on the other hand, the image-modal data is divided into a plurality of picture blocks, each picture block is stretched into a one-dimensional vector, a [CLS] symbol and a [DIS] symbol are added at the start and end positions, and an image-modal input of the same form as the text-modal input is obtained through dimension scaling.
Specifically, on the one hand, for raw text-modal data $D_l$ of length $l$, a [CLS] symbol and a [SEP] symbol are added, and a [DIS] symbol is appended at the end of the sentence for bilateral distillation, which improves performance and accelerates fitting. The corresponding sub-word indices are obtained with BPE, and the final input text word vectors are obtained through a word-vector matrix of dimension $d$:

$$X_l = \mathrm{Embed}\big(\mathrm{BPE}(D_l)\big) \in \mathbb{R}^{L \times d}, \qquad L = l + 3.$$

On the other hand, because text and images have different representation forms, it is difficult to unify them directly. The invention therefore divides the image into picture blocks, so that an input of the same form as the text word vectors can be generated. Raw image input data $D_t$ is first scaled to a size of $256 \times 256 \times 3$ and then divided into 256 picture blocks of size $16 \times 16 \times 3$:

$$D_t \rightarrow \{p_1, p_2, \ldots, p_{256}\}, \qquad p_i \in \mathbb{R}^{16 \times 16 \times 3}.$$

Each picture block is stretched into a one-dimensional vector,

$$v_i = \mathrm{flatten}(p_i) \in \mathbb{R}^{768},$$

then a [CLS] symbol and a [DIS] symbol are added at the start and end positions, and dimension scaling by a final linear layer yields the same form as the text input:

$$X_t = \big[x_{\mathrm{CLS}};\; W v_1;\; \ldots;\; W v_{256};\; x_{\mathrm{DIS}}\big] \in \mathbb{R}^{258 \times d}.$$

Since the distributions of text and image data differ somewhat, which causes large fluctuations in the numerical values, the data are finally normalized. The invention thus unifies the input forms and distributions of the text and image modalities, which facilitates processing by the subsequent modal joint layers.
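To make the unification step concrete, the following is a minimal PyTorch sketch of the multi-modal data unification module described above. The class name, the hyper-parameters (d = 512, a BERT-sized vocabulary), and the exact handling of the special symbols are illustrative assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class MultiModalUnifier(nn.Module):
    """Maps raw text token ids and raw 256x256 images into sequences of the
    same form ([CLS] ... [DIS], one d-dimensional vector per position)."""

    def __init__(self, vocab_size=30522, d=512, patch=16):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)         # word-vector matrix
        self.patch_proj = nn.Linear(patch * patch * 3, d)   # dimension scaling
        self.cls = nn.Parameter(torch.randn(1, 1, d))       # learned [CLS] vector
        self.dis = nn.Parameter(torch.randn(1, 1, d))       # learned [DIS] vector
        self.norm = nn.LayerNorm(d)                         # final normalization
        self.patch = patch

    def forward_text(self, token_ids):
        # token_ids: (B, L) ids that already include [CLS]/[SEP]/[DIS]
        return self.norm(self.word_emb(token_ids))          # (B, L, d)

    def forward_image(self, images):
        # images: (B, 3, 256, 256) -> 256 picture blocks of 16x16x3 each
        B, p = images.size(0), self.patch
        blocks = images.unfold(2, p, p).unfold(3, p, p)     # (B, 3, 16, 16, p, p)
        blocks = blocks.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, p * p * 3)
        x = self.patch_proj(blocks)                         # (B, 256, d)
        x = torch.cat([self.cls.expand(B, -1, -1), x,
                       self.dis.expand(B, -1, -1)], dim=1)  # (B, 258, d)
        return self.norm(x)
```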
In the multi-modal joint representation learning method based on variational distillation disclosed by the invention, for S2 of the above embodiment, the modal joint representation module comprises a MobileBERT model consisting of 24 Transformer layers, with a linear layer added to each Transformer layer, so that the Transformer parameter scale of MobileBERT remains small. For an input of length $N$,

$$X \in \mathbb{R}^{N \times d},$$

the output of the $i$-th Transformer layer is

$$H^{(i)} = \mathrm{Transformer}^{(i)}\big(H^{(i-1)}\big) \in \mathbb{R}^{N \times d}, \qquad H^{(0)} = X.$$

For convenience of distillation, the feature representation corresponding to the [CLS] symbol,

$$h^{(i)} = H^{(i)}_{[\mathrm{CLS}]} \in \mathbb{R}^{d},$$

is taken as the student output representation used during training.
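The sketch below shows how such a student might expose its per-layer [CLS] and [DIS] features for distillation. Using torch.nn.TransformerEncoderLayer in place of the actual MobileBERT blocks, and all layer sizes, are assumptions made only so the example runs.

```python
import torch
import torch.nn as nn

class StudentEncoder(nn.Module):
    """24 Transformer layers, each followed by an added linear layer; the
    [CLS] (first) and [DIS] (last) features of every layer are collected."""

    def __init__(self, d=512, n_layers=24, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                        batch_first=True)
             for _ in range(n_layers)])
        self.proj = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])

    def forward(self, x):
        # x: (B, N, d) unified text or image input from the unification module
        cls_feats, dis_feats, h = [], [], x
        for layer, proj in zip(self.layers, self.proj):
            h = proj(layer(h))           # Transformer block + added linear layer
            cls_feats.append(h[:, 0])    # h^(i) at the [CLS] position
            dis_feats.append(h[:, -1])   # h^(i) at the [DIS] position
        return h, cls_feats, dis_feats
```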
In the multi-modal joint representation learning method based on variational distillation disclosed in the present invention, in S3 of the above embodiment, the distillation loss function is a sum of a loss function of the text teacher model and a loss function of the image teacher model.
Specifically, for the two different kinds of modal information, text and image, the invention unifies the distillation scheme from the angle of mutual information. The following description is divided into four parts: variational mutual information, knowledge distillation, a summary of the distillation process, and experimental analysis.
(3.1) Variational mutual information
Mutual information quantifies how much the uncertainty of one random variable is reduced by knowing another random variable. In representation learning, it can further be used to measure the correlation between different representations. From the perspective of information theory, knowledge transfer is a process of keeping the mutual information between the corresponding outputs of the teacher and student models high, and can be regarded as the teacher model's knowledge being retained in the student model. Given a pair of random variables (X, Y), the mutual information between X and Y is defined as:

$$I(X;Y) = H(X) - H(X \mid Y),$$
where $H(X)$ is the entropy of $X$, and $H(X \mid Y)$ is the conditional entropy under the joint distribution $P(X, Y)$. Because the joint distribution is difficult to compute directly, the invention aggregates the results of the input distributions over the various layers into a joint distribution, so that the mutual information can adapt to distributions at higher semantic levels. However, since the precise calculation of mutual information is very difficult, it is hard to maximize directly. One solution is contrastive learning, which pulls samples closer to positive samples and away from negative samples so as to raise the lower bound on the mutual information; this is effective when a large number of negative samples is available, but computationally expensive. The invention therefore uses a variational lower bound to approximately compute the mutual information $I(X; Y)$. Following VID [13], the distribution $p(x \mid y)$ is considered difficult to calculate, so the invention approximates it with a variational distribution $q(x \mid y)$. The bound can then be derived as:

$$\begin{aligned} I(X;Y) &= H(X) - H(X \mid Y) \\ &= H(X) + \mathbb{E}_{p(x,y)}\big[\log p(x \mid y)\big] \\ &= H(X) + \mathbb{E}_{p(x,y)}\big[\log q(x \mid y)\big] + \mathbb{E}_{p(y)}\Big[\mathrm{KL}\big(p(x \mid y)\,\|\,q(x \mid y)\big)\Big] \\ &\geq H(X) + \mathbb{E}_{p(x,y)}\big[\log q(x \mid y)\big]. \end{aligned}$$
On the one hand, the last inequality follows from the non-negativity of the KL divergence. On the other hand, $H(X)$ is a constant, so the invention only needs to compute $\mathbb{E}_{p(x,y)}\big[\log q(x \mid y)\big]$.
Further, since a single fixed Gaussian is too simple to approximate some complex distributions, the invention parameterizes $q(x \mid y)$ as a Gaussian whose mean and variance are learned, so that $\log q(x \mid y)$ can be further expanded as:

$$-\log q(x \mid y) = \sum_{n=1}^{N} \left( \log \sigma_n + \frac{\big(x_n - \mu_n(y)\big)^2}{2\sigma_n^2} \right) + \text{constant},$$

where $x_n$ is the scalar component of $x$ at position $n$, $\mu_n(y)$ is the output of an encoder network $\mu(\cdot)$ composed of Transformers, and the variance is kept positive via the softplus function:

$$\sigma_n = \log\big(1 + e^{c_n}\big) + \varepsilon,$$

where $c_n$ is the parameter to be optimized and $\varepsilon$ is a very small constant set greater than 0 to ensure that the variance remains positive; "constant" denotes a constant term. The loss function of the final mutual-information distillation is:

$$L_{MI}(x, y) = \sum_{n=1}^{N} \left( \log \sigma_n + \frac{\big(x_n - \mu_n(y)\big)^2}{2\sigma_n^2} \right).$$
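A minimal PyTorch sketch of this variational mutual-information loss follows. The patent describes a Transformer-based μ(·); the small MLP here, and the constructor's dimension arguments, are simplifying assumptions. The σ parameterization follows the softplus formula above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalMILoss(nn.Module):
    """-log q(x|y) for a diagonal Gaussian q(x|y) = N(x; mu(y), sigma^2);
    minimizing it maximizes the variational lower bound on I(X; Y)."""

    def __init__(self, d_student, d_teacher, eps=1e-5):
        super().__init__()
        # mu(.): assumed MLP mapping student features into the teacher space
        self.mu = nn.Sequential(nn.Linear(d_student, d_teacher), nn.ReLU(),
                                nn.Linear(d_teacher, d_teacher))
        self.c = nn.Parameter(torch.zeros(d_teacher))  # per-dim variance params
        self.eps = eps

    def forward(self, x_teacher, y_student):
        sigma = F.softplus(self.c) + self.eps   # sigma_n = log(1+e^{c_n}) + eps
        mu = self.mu(y_student)                 # mu_n(y)
        # sum_n [ log sigma_n + (x_n - mu_n(y))^2 / (2 sigma_n^2) ]
        nll = torch.log(sigma) + (x_teacher - mu) ** 2 / (2 * sigma ** 2)
        return nll.sum(dim=-1).mean()
```

Note that, unlike contrastive distillation, no negative samples appear anywhere in this loss, which is where the method's computational saving comes from.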
(3.2) Knowledge distillation
In a preferred embodiment, BERT_large is adopted as the teacher model of the text modality, and ResNet_152 is adopted as the teacher model of the image modality. Let $H_l^{(i)}$, $H_t^{(i)}$, and $H_s^{(i)}$ be the outputs of the $i$-th hidden layer of BERT_large, ResNet_152, and MXBERT respectively, where $h_l^{(i)}$ and $h_s^{(i)}$ are the feature vectors corresponding to [CLS] in the output representations of BERT_large and MXBERT, and $d_s^{(i)}$ is the feature vector corresponding to [DIS] in the output representation of MXBERT.
For the text modality, the loss function is:

$$L_l = \sum_{i} \alpha(i)\, L_{MI}\Big(h_l^{(i)},\; g\big(h_s^{(i)}\big)\Big),$$

where $\alpha(i)$ is a coefficient function that sets a different weight for each layer and increases as the number of layers increases, and $g(\cdot)$ is a multi-layer non-linear transformation function for realizing the variational effect.
For the image modality, the loss function is:

$$L_T = \sum_{i} \beta(i)\, L_{MI}\Big(\phi\big(H_t^{(i')}\big),\; d_s^{(i)}\Big), \qquad i' = f(i),$$

where $\beta(i)$ is a coefficient function that sets a different weight for each layer and increases as the number of layers increases. Because ResNet and MXBERT have different numbers of layers, $f(\cdot)$ is a layer-mapping function that selects the corresponding student layer for each teacher layer, and $\phi(\cdot)$ is a multi-layer linear transformation that mainly converts the three-dimensional feature map output by ResNet into a one-dimensional feature vector.
Because the output value ranges of the teacher models are the same, and in order to solve the forgetting problem, the invention trains the distillation of the two modalities jointly; the loss function of the whole distillation is finally defined as:

$$L_{dis} = L_l + L_T.$$
(3.3) Summary of the distillation process
Given two teacher models, the text teacher model BERT_large of the text modality and the image teacher model ResNet_152 of the image modality, and given one student model, MXBERT: the purpose of modality-unified distillation is to let MXBERT, which has far fewer parameters, learn the capabilities of the two teacher models simultaneously, so that MXBERT acquires both text- and image-encoding ability.
The distillation process is as follows. Given one text input and one image input: the text is passed through BERT_large and MXBERT respectively, and the outputs of the two models' intermediate layers are fitted by means of $L_l$ (fitting means making the two variables as equal as possible; in this respect, the loss function is reduced, and a lower value means the two variables are more similar, i.e., the output of the student model is made more and more similar to the output of the teacher model, so that the student model obtains performance matching the teacher model). The image is passed through ResNet_152 and MXBERT respectively, and the outputs of the two models are fitted by means of $L_T$. Because sequential training creates the forgetting problem, the invention performs joint distillation, i.e., $L_{dis} = L_l + L_T$, and only the $L_{dis}$ loss function needs to be optimized.
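Under the interfaces assumed in the sketches above, one joint distillation step might look as follows. The teacher wrappers returning per-layer features, the layer_map function realizing f(·), and the weighting callables alpha and beta are all illustrative assumptions.

```python
import torch

def joint_distillation_step(text_ids, images, unifier, student,
                            text_teacher, image_teacher,
                            vid_text, vid_img, alpha, beta, layer_map,
                            optimizer):
    """One joint step: L_dis = L_l + L_T is optimized in a single backward
    pass, so neither modality's encoding ability is forgotten."""
    with torch.no_grad():                        # teachers stay frozen
        t_cls = text_teacher(text_ids)           # per-layer [CLS] features
        t_maps = image_teacher(images)           # per-layer 3-D feature maps

    _, s_cls, _ = student(unifier.forward_text(text_ids))
    _, _, s_dis = student(unifier.forward_image(images))

    # Text modality: fit teacher [CLS] features, layer weights alpha(i).
    L_l = sum(alpha(i) * vid_text[i](t_cls[i], s_cls[i])
              for i in range(len(s_cls)))
    # Image modality: flatten the teacher map (phi) and align layers via f(.).
    L_T = sum(beta(i) * vid_img[i](t_maps[layer_map(i)].flatten(1), s_dis[i])
              for i in range(len(s_dis)))

    L_dis = L_l + L_T                            # joint loss against forgetting
    optimizer.zero_grad()
    L_dis.backward()
    optimizer.step()
    return float(L_dis)
```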
In actual use, the student model MXBERT can take both text and images as input, and only the final output of the model is needed for downstream tasks such as text classification, sentiment analysis, and image classification.
(3.4) Experimental analysis
Table 1 shows the performance comparison between MXBERT and text-modality encoders. In Table 1, the models in the second group (ELMo, GPT, and BERT) are pre-trained models, and the models in the third group (MobileBERT and others) are reference models used for fair comparison. It can be seen that in the text modality, the method of the invention not only surpasses the original reference model MobileBERT on multiple GLUE datasets but also further surpasses the pre-trained model BERT_base on most tasks. Table 2 shows the performance comparison between MXBERT and image encoders; it can be seen from Table 2 that in the image modality, the model surpasses the reference model ResNet_50 and far surpasses ResNet_18. Therefore, unifying the modalities causes no mutual interference and instead shows a certain complementarity. Compared with other single-modality reference methods, CMDIR is simple and effective, needs no additional samples to participate in the computation, unifies the distillation schemes of different modalities, and matches or even surpasses the original distillation schemes in distillation performance.
Table 1. Performance comparison of MXBERT and text-modality encoders, where the evaluation metric for CoLA is the Matthews correlation coefficient; for SST-2, MNLI, QNLI, and RTE, accuracy; for MRPC and QQP, the average of the F1 value and the accuracy; and for STS-B, the Pearson correlation coefficient.
[Table 1 is presented as an image in the original publication.]
Table 2. Performance comparison of MXBERT and image encoders, where the CIFAR datasets report the top-1 error rate and ImageNet reports the top-5 error rate.
[Table 2 is presented as an image in the original publication.]
The multi-modal joint representation learning method based on variational distillation disclosed by the invention adopts distillation from the angle of variational mutual information, which greatly reduces the information loss of the teacher model, requires no large number of negative samples to participate in the computation, and is simple and effective.
The multi-modal joint representation learning method based on variational distillation disclosed by the invention adopts a joint distillation scheme, which solves the forgetting problem caused by multi-modal distillation.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a computer device, including:
a memory for storing a computer program;
a processor for implementing the steps of the above multi-modal joint representation learning method based on variational distillation when executing a computer program.
In the embodiment of the present invention, the processor may be a Central Processing Unit (CPU), an application specific integrated circuit, a digital signal processor, a field programmable gate array or other programmable logic device, etc.
The processor may invoke a program stored in the memory, and in particular, the processor may perform operations in an embodiment of a multi-modal joint representation learning method based on variational distillation.
The memory is used for storing one or more programs, which may include program code including computer operating instructions.
Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device or other non-volatile solid-state storage device.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above multi-modal joint representation learning method based on variational distillation.
Example two
In the following, the multi-modal joint representation learning system based on variational distillation disclosed in the second embodiment of the invention is introduced; the system described below and the method described above may be referred to in correspondence with each other.
The second embodiment of the invention discloses a multi-modal joint representation learning system based on variational distillation, comprising:
a student model comprising a multi-modal data unification module and a modal joint representation module, wherein raw multi-modal data comprising raw text-modal data and raw image-modal data are input; the raw text-modal data and the raw image-modal data are input to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form; a normalization operation is performed on the text-modal input and the image-modal input; and the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model;
a teacher model comprising a text teacher model and an image teacher model, wherein the raw text-modal data and the raw image-modal data are respectively input to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and a modality-unified distillation module for representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and for performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
In the multi-modal joint representation learning system based on variational distillation disclosed by the invention, the multi-modal data unification module is deployed at the front end of the modal joint representation module and is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
In the multi-modal joint representation learning system based on variational distillation disclosed by the invention, the multi-modal data unification module comprises:
a text-modal data arrangement sub-module for adding a [CLS] symbol and a [SEP] symbol to the text-modal data, adding a [DIS] symbol at the end of each sentence in the text-modal data, and obtaining the text-modal input through a word-vector matrix;
and an image-modal data arrangement sub-module for dividing the image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
The multi-modal joint representation learning system based on variational distillation of this embodiment is used to implement the aforementioned multi-modal joint representation learning method based on variational distillation, so the specific implementation of the system can be found in the foregoing embodiment of the method; reference may be made to the descriptions of the corresponding parts, which are not repeated here.
In addition, since the multi-modal joint representation learning system based on variational distillation of this embodiment is used to implement the aforementioned method, its role corresponds to that of the method described above and is not described again here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A multi-modal joint representation learning method based on variational distillation, characterized by comprising the following steps:
deploying a student model and a teacher model, wherein the teacher model comprises a text teacher model and an image teacher model, and the student model comprises a multi-modal data unification module; inputting raw multi-modal data comprising raw text-modal data and raw image-modal data; inputting the raw text-modal data and the raw image-modal data to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form, and performing a normalization operation on the text-modal input and the image-modal input;
the student model further comprises a modal joint representation module; the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model, while the raw text-modal data and the raw image-modal data are respectively fed to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
2. The multi-modal joint representation learning method based on variational distillation according to claim 1, characterized in that: the multi-modal data unification module is deployed at the front end of the modal joint representation module, and the multi-modal data unification module is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
3. The multi-modal joint representation learning method based on variational distillation according to claim 1, characterized in that arranging the raw text-modal data and the raw image-modal data into the same input form to obtain the text-modal input and the image-modal input comprises:
adding a [CLS] symbol and a [SEP] symbol to the raw text-modal data, adding a [DIS] symbol at the end of each sentence in the raw text-modal data, and obtaining the text-modal input through a word-vector matrix;
dividing the raw image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
4. The multi-modal joint representation learning method based on variational distillation according to claim 1 or 2, characterized in that: the modal joint representation module comprises a MobileBERT model consisting of 24 Transformer layers, with a linear layer added to each Transformer layer.
5. The multi-modal joint representation learning method based on variational distillation according to claim 1, characterized in that: the distillation loss function is the sum of the loss function of the text teacher model and the loss function of the image teacher model.
6. A multi-modal joint representation learning system based on variational distillation, comprising:
a student model comprising a multi-modal data unification module and a modal joint representation module, wherein raw multi-modal data comprising raw text-modal data and raw image-modal data are input; the raw text-modal data and the raw image-modal data are input to the multi-modal data unification module to obtain a text-modal input and an image-modal input with the same input form; a normalization operation is performed on the text-modal input and the image-modal input; and the normalized text-modal input and image-modal input are respectively fed to the modal joint representation module to obtain the text output and image output of the student model;
a teacher model comprising a text teacher model and an image teacher model, wherein the raw text-modal data and the raw image-modal data are respectively input to the text teacher model and the image teacher model to obtain the text output and image output of the teacher models;
and a modality-unified distillation module for representing the correlation between the corresponding text and image outputs of the student model and the teacher models with variational mutual information, and for performing joint distillation training on the text output and the image output with a distillation loss function, so that the student model simultaneously acquires the ability to match the text teacher model and the image teacher model.
7. The multi-modal joint representation learning system based on variational distillation according to claim 6, characterized in that: the multi-modal data unification module is deployed at the front end of the modal joint representation module and is used to arrange the raw text-modal data and the raw image-modal data into the same input form, yielding the text-modal input and the image-modal input.
8. The multi-modal joint representation learning system based on variational distillation according to claim 6 or 7, characterized in that the multi-modal data unification module comprises:
a text-modal data arrangement sub-module for adding a [CLS] symbol and a [SEP] symbol to the text-modal data, adding a [DIS] symbol at the end of each sentence in the text-modal data, and obtaining the text-modal input through a word-vector matrix;
and an image-modal data arrangement sub-module for dividing the image-modal data into a plurality of picture blocks, stretching each picture block into a one-dimensional vector, adding a [CLS] symbol and a [DIS] symbol at the start and end positions, and obtaining, through dimension scaling, an image-modal input of the same form as the text-modal input.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 5 are implemented when the program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202210062288.XA 2022-01-19 2022-01-19 Multi-modal joint representation learning method and system based on variational distillation Pending CN114841335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210062288.XA CN114841335A (en) Multi-modal joint representation learning method and system based on variational distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210062288.XA CN114841335A (en) Multi-modal joint representation learning method and system based on variational distillation

Publications (1)

Publication Number Publication Date
CN114841335A (en) 2022-08-02

Family

ID=82562516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210062288.XA Pending CN114841335A (en) 2022-01-19 2022-01-19 Multi-mode joint representation learning method and system based on variational distillation

Country Status (1)

Country Link
CN (1) CN114841335A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239937A (en) * 2022-09-23 2022-10-25 西南交通大学 Cross-modal emotion prediction method
CN115239937B (en) * 2022-09-23 2022-12-20 西南交通大学 Cross-modal emotion prediction method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination