CN117010331A - Method for expanding multi-modal model language capability - Google Patents

Method for expanding multi-modal model language capability

Info

Publication number
CN117010331A
CN117010331A (application number CN202310810210.6A)
Authority
CN
China
Prior art keywords
text
training
encoder
representation
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310810210.6A
Other languages
Chinese (zh)
Inventor
邓卉
危明
田泽康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ysten Technology Co ltd
Original Assignee
Ysten Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ysten Technology Co ltd
Priority to CN202310810210.6A
Publication of CN117010331A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 - Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method for expanding the language capability of a multi-modal model, comprising the following steps: retain and freeze the B-modality encoder, select and freeze a pre-trained multilingual text encoder, define a multi-layer MLP network as a multilingual text adapter, connect the multilingual text adapter behind the pre-trained multilingual text encoder, select the training set, and train the multilingual text adapter. Because the selected pre-trained multilingual text encoder covers the source language, the training set of the original cross-modal model can be sampled as the training set of this scheme; the B-modality encoding representation and the text representation are obtained separately, and the adapter is designed to align the text representation directly with the B-modality encoding representation, eliminating the discrepancy introduced by aligning the target-language text representation with the source-language text representation. The full training data of the original cross-modal model is not needed; only part of it is sampled, so the training cost is low.

Description

Method for expanding multi-modal model language capability
Technical Field
The application relates to the technical field of cross-modal retrieval, and in particular to a method for expanding the language capability of a multi-modal model.
Background
With the continuous development of self-media, multi-modal data such as images, text, voice and video keep growing, creating a rich and colorful world on the Internet. To accurately model users' multi-modal content, cross-modal retrieval has become an important task in cross-modal understanding: data of one modality is used as the query to retrieve data of another modality.
With OpenAI's release of CLIP, the text and visual fields were linked, and cross-modal retrieval has made great progress. As shown in FIG. 3, in the cross-modal retrieval framework the text on the left is called the A modality, and the other modalities on the right, such as image, video and voice, are called the B modality. The text is passed through a text encoder to obtain a text representation; the other modalities such as image, video and voice obtain their corresponding representations through their respective encoders; the cross-modal retrieval model achieves mutual retrieval between text and the other modalities by aligning the text representation with the other-modality representations.
Currently, cross-modal retrieval work usually focuses on high-resource languages (such as English). Extending the language capability of a cross-modal retrieval model, for example enabling retrieval between Chinese and other modalities such as image, video and voice, faces two difficulties (herein English is referred to as the source language and Chinese as the target language). First, target-language annotation data are scarce, and both the quantity and the quality of low-resource language data are problematic. Second, training a multi-modal model requires a large amount of computing resources; taking ViT-L/14 as an example, training the model takes 256 V100 GPUs running for 12 days. Such high training costs prevent ordinary developers who lack computing resources from working on extending the language capability of multi-modal models.
At present, the main schemes for extending the language capability of a multi-modal model are as follows:
Scheme one: re-collect data of B-modality and target-language description pairs and train a cross-modal model, as shown in FIG. 4 (a). For example, the Chinese CLIP released by Alibaba DAMO Academy is trained with large-scale Chinese data (about 200 million image-text pairs), realizing Chinese cross-modal image-text retrieval;
The problems with this scheme are that the training data are difficult to obtain and the training cost is high, requiring a large amount of computing resources and training time;
Scheme two: use machine translation to translate the source language into the target language and generate B-modality and target-language description pairs, alleviating the difficulty of manually annotating B-modality data with multilingual descriptions, as shown in FIG. 4 (b);
Since the accuracy of translation cannot be guaranteed, a large amount of noise is introduced in the translation process, so the translated target-language sentences cannot accurately describe the content of the corresponding image, video or voice B-modality data.
Scheme three: use knowledge distillation to distill a target-language text encoder from the source-language text encoder of the cross-modal retrieval model. As shown in FIG. 4 (c), the B-modality encoder is frozen, and knowledge distillation is performed on the source-language text encoder of the cross-modal retrieval model based on a parallel corpus to obtain a target-language text encoder;
This scheme directly aligns the target-language text representation with the source-language text representation, which reduces the introduction of machine-translation noise, but a discrepancy remains between the target-language text representation and the B-modality encoding representation;
Scheme four: as shown in FIG. 4 (d), freeze the B-modality encoder and the target-language text encoder, train only a text adapter, and adapt the target-language text representation to the source-language text representation;
This scheme only requires training an adapter, so training is simple; a parallel corpus is used as the training set to align the target-language text representation with the source-language text representation, and the training cost is low; but a discrepancy still remains between the target-language text representation and the B-modality encoding representation;
Scheme five: as shown in FIG. 4 (e), to eliminate the discrepancy between the target-language text representation and the B-modality encoding representation, a two-stage training method is adopted to learn a target-language text encoder;
In the first stage, a parallel corpus is used and a target-language text encoder is learned from the source-language text encoder of the cross-modal retrieval model through knowledge distillation;
In the second stage, a training set of B-modality data and target-language description pairs is collected, and the target-language text representation is aligned with the B-modality representation;
This scheme uses a parallel corpus together with a small number of B-modality data and target-language description pairs, and compensates for the discrepancy between the target-language text representation and the B-modality encoding representation through two-stage training; however, it requires re-collecting a dataset of B-modality data and target-language description pairs, and the two-stage training procedure is cumbersome.
Disclosure of Invention
An object of the present application is to provide a method for expanding the language capability of a multi-modal model. The application solves the problems in existing schemes that training data are difficult to obtain or that multi-stage training makes the training process cumbersome, and has the advantages of a simple and effective model, easily obtained training data and low training cost.
The technical scheme of the application is as follows: a method for expanding the language capability of a multi-modal model comprises the following steps:
A. Retain and freeze the B-modality encoder: retain the B-modality encoder of the original cross-modal model and freeze its parameters; define the output representation of the B-modality encoder as v;
B. Select and freeze a pre-trained multilingual text encoder: select any pre-trained multilingual text encoder and freeze its parameters; define the text representation output by the pre-trained multilingual text encoder as t;
C. Define a multi-layer MLP network as the multilingual text adapter, whose output is the adapted text representation a; the dimension of the adapted text representation a is consistent with that of the output representation v of the B-modality encoder;
D. Connect the multilingual text adapter behind the pre-trained multilingual text encoder;
E. Select the training set: sample a part of the training set of the original cross-modal model as the training set;
F. Train the multilingual text adapter: the multilingual text adapter is trained by contrastive learning; in a single step, the multilingual text representation t is aligned with the B-modality encoder output representation v, eliminating the discrepancy between the text representation and the other-modality representations.
In the above method for expanding the language capability of a multi-modal model, the pre-trained multilingual text encoder in step B is at least one of XLM, XLM-R, XLM-100 or mMiniLM-L12xH384.
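As an illustration of steps A to D, the following is a minimal PyTorch sketch; the loader functions and the adapter's hidden width are hypothetical placeholders chosen for illustration, not the patent's reference implementation.

```python
# Minimal PyTorch sketch of steps A-D. The loader functions and the hidden width
# are hypothetical; any B-modality encoder and any pre-trained multilingual
# text encoder (e.g. XLM-R) could be plugged in.
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Steps A/B: freeze all parameters of a pre-trained encoder."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


class MultilingualTextAdapter(nn.Module):
    """Step C: a multi-layer MLP mapping the text representation t to an adapted
    representation a whose dimension matches the B-modality representation v."""

    def __init__(self, text_dim: int, b_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, b_dim),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.mlp(t)


# Step D: the adapter is connected behind the frozen multilingual text encoder.
# b_encoder = freeze(load_b_modality_encoder())            # hypothetical loader
# text_encoder = freeze(load_multilingual_text_encoder())  # hypothetical loader
# adapter = MultilingualTextAdapter(text_dim=768, b_dim=512)  # the only trainable part
```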
In the above method for expanding the language capability of a multi-modal model, the contrastive-learning loss function used to train the multilingual text adapter in step F is as follows:
Loss = Loss_v2a + Loss_a2v
where Loss_v2a denotes the loss of matching the B-modality encoder output representation to the adapted text representation, Loss_a2v denotes the loss of matching the adapted text representation to the B-modality encoder output representation, B is the training batch size, τ is the temperature hyper-parameter, and sim(x, y) denotes the cosine similarity between two vectors.
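The exact expressions for Loss_v2a and Loss_a2v are not reproduced here; the sketch below is the standard symmetric contrastive (InfoNCE) form suggested by the batch size B, the temperature τ and the cosine similarity sim(x, y), and is an assumption rather than the patent's verbatim formula.

```python
# Hedged sketch of the symmetric contrastive loss implied by the description above.
import torch
import torch.nn.functional as F


def contrastive_loss(v: torch.Tensor, a: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """v: [B, d] B-modality representations; a: [B, d] adapted text representations.
    Returns Loss_v2a + Loss_a2v, where matching pairs share the same batch index."""
    v = F.normalize(v, dim=-1)   # cosine similarity = dot product of unit vectors
    a = F.normalize(a, dim=-1)
    logits = v @ a.t() / tau     # [B, B] matrix of sim(v_i, a_j) / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2a = F.cross_entropy(logits, targets)      # match each v_i to its a_i
    loss_a2v = F.cross_entropy(logits.t(), targets)  # match each a_i to its v_i
    return loss_v2a + loss_a2v
```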
Compared with the prior art, the present application chooses a pre-trained multilingual text encoder (covering the source language) as the text encoder; because the source language is covered, the training set of the original cross-modal model can be sampled as the training set of this scheme. On the one hand, B-modality data are input into the B-modality encoder to obtain the B-modality encoding representation; on the other hand, the source-language text description corresponding to the B-modality data is input into the multilingual text encoder to obtain the text representation. The application designs an adapter that aligns the text representation directly with the B-modality encoding representation, eliminating the discrepancy caused by aligning the target-language text representation with the source-language text representation;
The training cost of extending the language capability of the cross-modal model is low. First, the training data are easy to obtain: there is no need to re-collect a dataset of B-modality data and target-language description pairs, and the training set is obtained by sampling from the original cross-modal training set;
Second, the training model is simple: neither the text encoder nor the B-modality encoder needs to be retrained; only an adapter is trained to adapt the text encoding representation to the B-modality encoding representation;
Finally, cross-lingual and multilingual capabilities are unified: there is no need to train a separate target-language text encoder for each language; for example, if the multilingual model supports 104 languages, the application can extend the cross-modal model to 104 languages at once;
In summary, the present application has the following advantages:
The model is simple: multilingual cross-modal retrieval can be achieved by adding only a multilingual text adapter;
The model is effective: it makes full use of the effective representations of the pre-trained multilingual model and aligns them directly with the other-modality representations through the multilingual text adapter, avoiding the introduction of machine-translation noise and eliminating the discrepancy between the target-language text representation and the other-modality representations;
The training data are easy to obtain: no re-annotation is needed, and the training data of the original cross-modal model can be reused;
The training cost is low: the full training data of the original cross-modal model are not needed; only part of the training data is sampled.
Drawings
FIG. 1 is a flow chart of the steps of the present application;
FIG. 2 is a flow chart of the framework of the present application;
FIG. 3 is a cross-modal retrieval model framework diagram;
FIG. 4 is a diagram of several implementations of expanding the linguistic capabilities of a multimodal model.
Detailed Description
The application is further illustrated by the following figures and examples, which are not intended to be limiting.
Example. A method for expanding the language capability of a multi-modal model, as shown in FIG. 1 and FIG. 2, comprises the following steps:
A. Retain and freeze the B-modality encoder: retain the B-modality encoder of the original cross-modal model and freeze its parameters; define the output representation of the B-modality encoder as v;
B. Select and freeze a pre-trained multilingual text encoder: select any pre-trained multilingual text encoder and freeze its parameters; define the text representation output by the pre-trained multilingual text encoder as t;
C. Define a multi-layer MLP network as the multilingual text adapter, whose output is the adapted text representation a; the dimension of the adapted text representation a is consistent with that of the output representation v of the B-modality encoder;
D. Connect the multilingual text adapter behind the pre-trained multilingual text encoder;
E. Select the training set: sample 3% of the training set of the original cross-modal model as the training set;
F. Train the multilingual text adapter: the multilingual text adapter is trained by contrastive learning; in a single step, the multilingual text representation t is aligned with the B-modality encoder output representation v, eliminating the discrepancy between the text representation and the other-modality representations.
In the above method for expanding the language capability of a multi-modal model, the pre-trained multilingual text encoder in step B is at least one of XLM, XLM-R, XLM-100 or mMiniLM-L12xH384.
In the above method for expanding the language capability of a multi-modal model, the contrastive-learning loss function used to train the multilingual text adapter in step F is as follows:
Loss = Loss_v2a + Loss_a2v
where Loss_v2a denotes the loss of matching the B-modality encoder output representation to the adapted text representation, Loss_a2v denotes the loss of matching the adapted text representation to the B-modality encoder output representation, B is the training batch size, τ is the temperature hyper-parameter, and sim(x, y) denotes the cosine similarity between two vectors.
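A single-stage training loop for step F might look like the sketch below; it assumes the frozen encoders, the adapter and the contrastive loss sketched earlier, plus a data loader yielding batches of (B-modality input, source-language text input) pairs sampled from the original training set. The names are illustrative, not the patent's implementation.

```python
import torch

# Single-stage training sketch for step F: only the adapter's parameters are updated.
# The encoders are assumed to return pooled representations for each batch element.
def train_adapter(b_encoder, text_encoder, adapter, loader,
                  epochs: int = 3, lr: float = 1e-4, tau: float = 0.07):
    optimizer = torch.optim.AdamW(adapter.parameters(), lr=lr)
    adapter.train()
    for _ in range(epochs):
        for b_inputs, text_inputs in loader:
            with torch.no_grad():                 # both encoders stay frozen
                v = b_encoder(b_inputs)           # B-modality representation v
                t = text_encoder(text_inputs)     # multilingual text representation t
            a = adapter(t)                        # adapted text representation a
            loss = contrastive_loss(v, a, tau)    # align a with v in one step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapter
```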
comparative experiments
A comparative experiment was carried out on a video-text cross-modal retrieval task: the text side of the task was extended to multiple languages, and good results were obtained at a relatively low training cost.
The experiment is based on Microsoft's text-video alignment model CLIP-ViP, which aligns text representations with video representations to achieve video retrieval. The language of the model is English (the source language), and the task is to extend it into a multilingual text-video cross-modal model.
The experiment uses VaTEX as the test set. The videos in the VaTEX dataset are a subset of Kinetics-600 and cover 600 human activities. Each video has 10 Chinese descriptions and 10 English descriptions, 5 of which are parallel Chinese-English translation pairs. Since some of the videos are no longer available, a total of 2653 videos were downloaded for the experiment, corresponding to 26530 Chinese descriptions and 26530 English descriptions. The test metric is the recall@N of text-to-video retrieval.
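For reference, text-to-video recall@N can be computed from precomputed representations roughly as follows; this is a generic evaluation sketch, not code from the patent, and the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F


def recall_at_n(text_reps: torch.Tensor, video_reps: torch.Tensor,
                gt_video_idx: torch.Tensor, ns=(1, 5, 10)) -> dict:
    """text_reps: [Q, d] adapted text representations; video_reps: [V, d] B-modality
    representations; gt_video_idx: [Q] index of the ground-truth video per caption
    (several captions can share one target video, as in VaTEX)."""
    sims = F.normalize(text_reps, dim=-1) @ F.normalize(video_reps, dim=-1).t()
    ranking = sims.argsort(dim=-1, descending=True)   # [Q, V] video indices by similarity
    hits = ranking == gt_video_idx.unsqueeze(1)       # True at the rank of the true video
    return {f"R@{n}": hits[:, :n].any(dim=1).float().mean().item() for n in ns}
```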
Scheme three (distilling a target-language text encoder from the source-language text encoder of the cross-modal retrieval model with knowledge distillation) and scheme four (freezing the B-modality encoder and the target-language text encoder and training only a text adapter to adapt the target-language text representation to the source-language text representation) both use the Chinese-English translation corpus of WMT19 as the training set, comprising 2598 pairs of parallel Chinese-English translations. CLIP-ViP uses HD-VILA-100M as its training set, a large video-text cross-modal dataset containing 100 million video-text pairs from 3 million videos.
The application randomly samples about 3% of HD-VILA-100M (100,000 videos), obtaining 1.12 million video-text pairs as its training set.
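The sampling in step E is straightforward; a minimal sketch, assuming the original training set is available as a list of (B-modality item, source-language caption) metadata entries:

```python
import random


def sample_training_subset(all_pairs: list, fraction: float = 0.03, seed: int = 0) -> list:
    """Randomly sample a small fraction (here about 3%) of the original
    cross-modal training set's (B-modality item, source-language caption) entries."""
    rng = random.Random(seed)
    k = max(1, int(len(all_pairs) * fraction))
    return rng.sample(all_pairs, k)
```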
Table 1 compares the training difficulty of the five schemes and the present application. Scheme one (re-collecting data of B-modality and target-language description pairs and training a cross-modal model) has the highest data-acquisition difficulty; the training data generated by scheme two (using machine translation to translate the source language into the target language and generate B-modality and target-language description pairs) introduce translation noise and have the worst quality. Because of the data-acquisition difficulty and quality problems of schemes one and two, these two schemes were not trained in the experiment and do not take part in the effect comparison with the present application. In addition, scheme five adopts a two-stage training mode and is the most cumbersome to train. For the effect comparison, the application is therefore compared only with scheme three and scheme four, whose data are easy to obtain and which are trained in a single stage; the comparison results are shown in Table 2:
table 1: comparison of the training difficulty level of the application and each scheme
Table 2: comparison of the effects of the application and the various schemes
The comparative experiment clearly shows that on the VaTEX test set, the recall of retrieving videos with the extended language (Chinese) is far better than that of scheme four, which trains only the adapter, and slightly better than that of scheme three, which trains a multilingual encoder with knowledge distillation; the recall of retrieving videos with the source language (English) is better than that of scheme three and scheme four, and slightly lower than that of the original model;
Overall, the training dataset of the application is easy to obtain, the single-stage training is simple and effective, and a good balance is achieved between training cost and effect.

Claims (3)

1. A method for expanding the language capability of a multi-modal model, comprising the following steps:
A. Retaining and freezing the B-modality encoder: retaining the B-modality encoder of the original cross-modal model and freezing its parameters; defining the output representation of the B-modality encoder as v;
B. Selecting and freezing a pre-trained multilingual text encoder: selecting any pre-trained multilingual text encoder and freezing its parameters; defining the text representation output by the pre-trained multilingual text encoder as t;
C. Defining a multi-layer MLP network as the multilingual text adapter, whose output is the adapted text representation a; the dimension of the adapted text representation a being consistent with that of the output representation v of the B-modality encoder;
D. Connecting the multilingual text adapter behind the pre-trained multilingual text encoder;
E. Selecting the training set: sampling a part of the training set of the original cross-modal model as the training set;
F. Training the multilingual text adapter: the multilingual text adapter is trained by contrastive learning, the multilingual text representation t is aligned with the B-modality encoder output representation v, and the discrepancy between the text representation and the other-modality representations is eliminated.
2. The method for expanding the language capability of a multi-modal model according to claim 1, wherein the pre-trained multilingual text encoder in step B is at least one of XLM, XLM-R, XLM-100 or mMiniLM-L12xH384.
3. The method for expanding the language capability of a multi-modal model according to claim 1, wherein the contrastive-learning loss function used to train the multilingual text adapter in step F is as follows:
Loss = Loss_v2a + Loss_a2v
where Loss_v2a denotes the loss of matching the B-modality encoder output representation to the adapted text representation, Loss_a2v denotes the loss of matching the adapted text representation to the B-modality encoder output representation, B is the training batch size, τ is the temperature hyper-parameter, and sim(x, y) denotes the cosine similarity between two vectors.
CN202310810210.6A 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability Pending CN117010331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310810210.6A CN117010331A (en) 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310810210.6A CN117010331A (en) 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability

Publications (1)

Publication Number Publication Date
CN117010331A true CN117010331A (en) 2023-11-07

Family

ID=88566390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310810210.6A Pending CN117010331A (en) 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability

Country Status (1)

Country Link
CN (1) CN117010331A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218498A (en) * 2023-11-08 2023-12-12 苏州大学 Multi-modal large language model training method and system based on multi-modal encoder
CN117218498B (en) * 2023-11-08 2024-02-23 苏州大学 Multi-modal large language model training method and system based on multi-modal encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination