CN117010331A - Method for expanding multi-modal model language capability - Google Patents

Method for expanding multi-modal model language capability

Info

Publication number
CN117010331A
CN117010331A (application number CN202310810210.6A)
Authority
CN
China
Prior art keywords
text
training
encoder
representation
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310810210.6A
Other languages
Chinese (zh)
Inventor
邓卉
危明
田泽康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ysten Technology Co ltd
Original Assignee
Ysten Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ysten Technology Co ltd
Priority to CN202310810210.6A
Publication of CN117010331A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 - Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method for expanding the language capability of a multi-modal model, comprising the following steps: retain and freeze the B-modality encoder, select and freeze a pre-trained multilingual text encoder, define a multi-layer MLP network as a multilingual text adapter, connect the multilingual text adapter behind the pre-trained multilingual text encoder, select the training set, and train the multilingual text adapter. Because the selected pre-trained multilingual text encoder covers the source language, the training set of the original cross-modal model can be sampled as the training set of this scheme; the B-modality encoding representation and the text representation are obtained separately, and the adapter is designed to align the text representation directly with the B-modality encoding representation, eliminating the discrepancy introduced by aligning the target-language text representation with the source-language text representation. The full training data of the original cross-modal model is not needed; only part of it is sampled, so the training cost is low.

Description

Method for expanding multi-modal model language capability
Technical Field
The application relates to the technical field of cross-modal retrieval, and in particular to a method for expanding the language capability of a multi-modal model.
Background
With the continuous development of self-media, multi-modal data such as images, text, voice and video keep growing, creating a rich and colorful world on the Internet. To accurately model users' multi-modal content, cross-modal retrieval has become an important task in cross-modal understanding: data of one modality is used as the query to retrieve data of another modality.
With OpenAI's release of CLIP, the text and visual fields were linked, and cross-modal retrieval has made great progress. As shown in FIG. 3, in the cross-modal retrieval framework the text on the left is called the A modality, and the other modalities on the right, such as image, video and voice, are called the B modality. The text is passed through a text encoder to obtain a text representation; the other modalities such as image, video and voice obtain their corresponding representations through their respective encoders; the cross-modal retrieval model achieves mutual retrieval between text and the other modalities by aligning the text representation with the other-modality representations.
Currently, cross-modal retrieval work usually focuses on high-resource languages (such as English). Extending the language capability of a cross-modal retrieval model, for example enabling retrieval between Chinese and other modalities such as image, video and voice, faces two difficulties (herein English is referred to as the source language and Chinese as the target language). First, target-language annotation data are scarce, and both the quantity and the quality of low-resource language data are problematic. Second, training a multi-modal model requires a large amount of computing resources; taking ViT-L/14 as an example, training the model takes 256 V100 GPUs running for 12 days. Such high training costs prevent ordinary developers who lack computing resources from working on extending the language capability of multi-modal models.
At present, the main schemes for extending the language capability of a multi-modal model are as follows:
Scheme one: re-collect data of B-modality and target-language description pairs and train a cross-modal model, as shown in FIG. 4 (a). For example, the Chinese CLIP released by Alibaba DAMO Academy is trained with large-scale Chinese data (about 200 million image-text pairs), realizing Chinese cross-modal image-text retrieval;
The problems with this scheme are that the training data are difficult to obtain and the training cost is high, requiring a large amount of computing resources and training time;
Scheme two: use machine translation to translate the source language into the target language and generate B-modality and target-language description pairs, alleviating the difficulty of manually annotating B-modality data with multilingual descriptions, as shown in FIG. 4 (b);
Since the accuracy of translation cannot be guaranteed, a large amount of noise is introduced in the translation process, so the translated target-language sentences cannot accurately describe the content of the corresponding image, video or voice B-modality data.
Scheme three: use knowledge distillation to distill a target-language text encoder from the source-language text encoder of the cross-modal retrieval model. As shown in FIG. 4 (c), the B-modality encoder is frozen, and knowledge distillation is performed on the source-language text encoder of the cross-modal retrieval model based on a parallel corpus to obtain a target-language text encoder;
This scheme directly aligns the target-language text representation with the source-language text representation, which reduces the introduction of machine-translation noise, but a discrepancy remains between the target-language text representation and the B-modality encoding representation;
Scheme four: as shown in FIG. 4 (d), freeze the B-modality encoder and the target-language text encoder, train only a text adapter, and adapt the target-language text representation to the source-language text representation;
This scheme only requires training an adapter, so training is simple; a parallel corpus is used as the training set to align the target-language text representation with the source-language text representation, and the training cost is low; but a discrepancy still remains between the target-language text representation and the B-modality encoding representation;
Scheme five: as shown in FIG. 4 (e), to eliminate the discrepancy between the target-language text representation and the B-modality encoding representation, a two-stage training method is adopted to learn a target-language text encoder;
In the first stage, a parallel corpus is used and a target-language text encoder is learned from the source-language text encoder of the cross-modal retrieval model through knowledge distillation;
In the second stage, a training set of B-modality data and target-language description pairs is collected, and the target-language text representation is aligned with the B-modality representation;
This scheme uses a parallel corpus together with a small number of B-modality data and target-language description pairs, and compensates for the discrepancy between the target-language text representation and the B-modality encoding representation through two-stage training; however, it requires re-collecting a dataset of B-modality data and target-language description pairs, and the two-stage training procedure is cumbersome.
Disclosure of Invention
An object of the present application is to provide a method for expanding the language capability of a multi-modal model. The application solves the problems in existing schemes that training data are difficult to obtain or that multi-stage training makes the training process cumbersome, and has the advantages of a simple and effective model, easily obtained training data and low training cost.
The technical scheme of the application is as follows: a method for expanding the language capability of a multi-modal model comprises the following steps:
A. Retain and freeze the B-modality encoder: retain the B-modality encoder of the original cross-modal model and freeze its parameters; define the output representation of the B-modality encoder as v;
B. Select and freeze a pre-trained multilingual text encoder: select any pre-trained multilingual text encoder and freeze its parameters; define the text representation output by the pre-trained multilingual text encoder as t;
C. Define a multi-layer MLP network as the multilingual text adapter, whose output is the adapted text representation a; the dimension of the adapted text representation a is consistent with that of the output representation v of the B-modality encoder;
D. Connect the multilingual text adapter behind the pre-trained multilingual text encoder;
E. Select the training set: sample a part of the training set of the original cross-modal model as the training set;
F. Train the multilingual text adapter: the multilingual text adapter is trained by contrastive learning; in a single step, the multilingual text representation t is aligned with the B-modality encoder output representation v, eliminating the discrepancy between the text representation and the other-modality representations.
In the above method for expanding the language capability of a multi-modal model, the pre-trained multilingual text encoder in step B is at least one of XLM, XLM-R, XLM-100 or mMiniLM-L12xH384.
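As an illustration of steps A to D, the following is a minimal PyTorch sketch; the loader functions and the adapter's hidden width are hypothetical placeholders chosen for illustration, not the patent's reference implementation.

```python
# Minimal PyTorch sketch of steps A-D. The loader functions and the hidden width
# are hypothetical; any B-modality encoder and any pre-trained multilingual
# text encoder (e.g. XLM-R) could be plugged in.
import torch
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Steps A/B: freeze all parameters of a pre-trained encoder."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


class MultilingualTextAdapter(nn.Module):
    """Step C: a multi-layer MLP mapping the text representation t to an adapted
    representation a whose dimension matches the B-modality representation v."""

    def __init__(self, text_dim: int, b_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, b_dim),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.mlp(t)


# Step D: the adapter is connected behind the frozen multilingual text encoder.
# b_encoder = freeze(load_b_modality_encoder())            # hypothetical loader
# text_encoder = freeze(load_multilingual_text_encoder())  # hypothetical loader
# adapter = MultilingualTextAdapter(text_dim=768, b_dim=512)  # the only trainable part
```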
In the above method for expanding the language capability of a multi-modal model, the contrastive-learning loss function used to train the multilingual text adapter in step F is as follows:
Loss = Loss_v2a + Loss_a2v
where Loss_v2a denotes the loss of matching the B-modality encoder output representation to the adapted text representation, Loss_a2v denotes the loss of matching the adapted text representation to the B-modality encoder output representation, B is the training batch size, τ is the temperature hyper-parameter, and sim(x, y) denotes the cosine similarity between two vectors.
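The exact expressions for Loss_v2a and Loss_a2v are not reproduced here; the sketch below is the standard symmetric contrastive (InfoNCE) form suggested by the batch size B, the temperature τ and the cosine similarity sim(x, y), and is an assumption rather than the patent's verbatim formula.

```python
# Hedged sketch of the symmetric contrastive loss implied by the description above.
import torch
import torch.nn.functional as F


def contrastive_loss(v: torch.Tensor, a: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """v: [B, d] B-modality representations; a: [B, d] adapted text representations.
    Returns Loss_v2a + Loss_a2v, where matching pairs share the same batch index."""
    v = F.normalize(v, dim=-1)   # cosine similarity = dot product of unit vectors
    a = F.normalize(a, dim=-1)
    logits = v @ a.t() / tau     # [B, B] matrix of sim(v_i, a_j) / tau
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2a = F.cross_entropy(logits, targets)      # match each v_i to its a_i
    loss_a2v = F.cross_entropy(logits.t(), targets)  # match each a_i to its v_i
    return loss_v2a + loss_a2v
```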
Compared with the prior art, the present application chooses a pre-trained multilingual text encoder (covering the source language) as the text encoder; because the source language is covered, the training set of the original cross-modal model can be sampled as the training set of this scheme. On the one hand, B-modality data are input into the B-modality encoder to obtain the B-modality encoding representation; on the other hand, the source-language text description corresponding to the B-modality data is input into the multilingual text encoder to obtain the text representation. The application designs an adapter that aligns the text representation directly with the B-modality encoding representation, eliminating the discrepancy caused by aligning the target-language text representation with the source-language text representation;
The training cost of extending the language capability of the cross-modal model is low. First, the training data are easy to obtain: there is no need to re-collect a dataset of B-modality data and target-language description pairs, and the training set is obtained by sampling from the original cross-modal training set;
Second, the training model is simple: neither the text encoder nor the B-modality encoder needs to be retrained; only an adapter is trained to adapt the text encoding representation to the B-modality encoding representation;
Finally, cross-lingual and multilingual capabilities are unified: there is no need to train a separate target-language text encoder for each language; for example, if the multilingual model supports 104 languages, the application can extend the cross-modal model to 104 languages at once;
In summary, the present application has the following advantages:
The model is simple: multilingual cross-modal retrieval can be achieved by adding only a multilingual text adapter;
The model is effective: it makes full use of the effective representations of the pre-trained multilingual model and aligns them directly with the other-modality representations through the multilingual text adapter, avoiding the introduction of machine-translation noise and eliminating the discrepancy between the target-language text representation and the other-modality representations;
The training data are easy to obtain: no re-annotation is needed, and the training data of the original cross-modal model can be reused;
The training cost is low: the full training data of the original cross-modal model are not needed; only part of the training data is sampled.
Drawings
FIG. 1 is a flow chart of the steps of the present application;
FIG. 2 is a flow chart of the framework of the present application;
FIG. 3 is a cross-modal retrieval model framework diagram;
FIG. 4 is a diagram of several implementations of expanding the linguistic capabilities of a multimodal model.
Detailed Description
The application is further illustrated by the following figures and examples, which are not intended to be limiting.
Example. A method for expanding the language capability of a multi-modal model, as shown in FIG. 1 and FIG. 2, comprises the following steps:
A. Retain and freeze the B-modality encoder: retain the B-modality encoder of the original cross-modal model and freeze its parameters; define the output representation of the B-modality encoder as v;
B. Select and freeze a pre-trained multilingual text encoder: select any pre-trained multilingual text encoder and freeze its parameters; define the text representation output by the pre-trained multilingual text encoder as t;
C. Define a multi-layer MLP network as the multilingual text adapter, whose output is the adapted text representation a; the dimension of the adapted text representation a is consistent with that of the output representation v of the B-modality encoder;
D. Connect the multilingual text adapter behind the pre-trained multilingual text encoder;
E. Select the training set: sample 3% of the training set of the original cross-modal model as the training set;
F. Train the multilingual text adapter: the multilingual text adapter is trained by contrastive learning; in a single step, the multilingual text representation t is aligned with the B-modality encoder output representation v, eliminating the discrepancy between the text representation and the other-modality representations.
In the above method for expanding the language capability of a multi-modal model, the pre-trained multilingual text encoder in step B is at least one of XLM, XLM-R, XLM-100 or mMiniLM-L12xH384.
In the above method for expanding the language capability of a multi-modal model, the contrastive-learning loss function used to train the multilingual text adapter in step F is as follows:
Loss = Loss_v2a + Loss_a2v
where Loss_v2a denotes the loss of matching the B-modality encoder output representation to the adapted text representation, Loss_a2v denotes the loss of matching the adapted text representation to the B-modality encoder output representation, B is the training batch size, τ is the temperature hyper-parameter, and sim(x, y) denotes the cosine similarity between two vectors.
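A single-stage training loop for step F might look like the sketch below; it assumes the frozen encoders, the adapter and the contrastive loss sketched earlier, plus a data loader yielding batches of (B-modality input, source-language text input) pairs sampled from the original training set. The names are illustrative, not the patent's implementation.

```python
import torch

# Single-stage training sketch for step F: only the adapter's parameters are updated.
# The encoders are assumed to return pooled representations for each batch element.
def train_adapter(b_encoder, text_encoder, adapter, loader,
                  epochs: int = 3, lr: float = 1e-4, tau: float = 0.07):
    optimizer = torch.optim.AdamW(adapter.parameters(), lr=lr)
    adapter.train()
    for _ in range(epochs):
        for b_inputs, text_inputs in loader:
            with torch.no_grad():                 # both encoders stay frozen
                v = b_encoder(b_inputs)           # B-modality representation v
                t = text_encoder(text_inputs)     # multilingual text representation t
            a = adapter(t)                        # adapted text representation a
            loss = contrastive_loss(v, a, tau)    # align a with v in one step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return adapter
```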
comparative experiments
A comparative experiment was carried out on a video-text cross-modal retrieval task: the text side of the task was extended to multiple languages, and good results were obtained at a relatively low training cost.
The experiment is based on Microsoft's text-video alignment model CLIP-ViP, which aligns text representations with video representations to achieve video retrieval. The language of the model is English (the source language), and the task is to extend it into a multilingual text-video cross-modal model.
The experiment uses VaTEX as the test set. The videos in the VaTEX dataset are a subset of Kinetics-600 and cover 600 human activities. Each video has 10 Chinese descriptions and 10 English descriptions, 5 of which are parallel Chinese-English translation pairs. Since some of the videos are no longer available, a total of 2653 videos were downloaded for the experiment, corresponding to 26530 Chinese descriptions and 26530 English descriptions. The test metric is the recall@N of text-to-video retrieval.
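For reference, text-to-video recall@N can be computed from precomputed representations roughly as follows; this is a generic evaluation sketch, not code from the patent, and the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F


def recall_at_n(text_reps: torch.Tensor, video_reps: torch.Tensor,
                gt_video_idx: torch.Tensor, ns=(1, 5, 10)) -> dict:
    """text_reps: [Q, d] adapted text representations; video_reps: [V, d] B-modality
    representations; gt_video_idx: [Q] index of the ground-truth video per caption
    (several captions can share one target video, as in VaTEX)."""
    sims = F.normalize(text_reps, dim=-1) @ F.normalize(video_reps, dim=-1).t()
    ranking = sims.argsort(dim=-1, descending=True)   # [Q, V] video indices by similarity
    hits = ranking == gt_video_idx.unsqueeze(1)       # True at the rank of the true video
    return {f"R@{n}": hits[:, :n].any(dim=1).float().mean().item() for n in ns}
```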
Scheme three (distilling a target-language text encoder from the source-language text encoder of the cross-modal retrieval model with knowledge distillation) and scheme four (freezing the B-modality encoder and the target-language text encoder and training only a text adapter to adapt the target-language text representation to the source-language text representation) both use the Chinese-English translation corpus of WMT19 as the training set, comprising 2598 pairs of parallel Chinese-English translations. CLIP-ViP uses HD-VILA-100M as its training set, a large video-text cross-modal dataset containing 100 million video-text pairs from 3 million videos.
The application randomly samples about 3% of HD-VILA-100M (100,000 videos), obtaining 1.12 million video-text pairs as its training set.
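The sampling in step E is straightforward; a minimal sketch, assuming the original training set is available as a list of (B-modality item, source-language caption) metadata entries:

```python
import random


def sample_training_subset(all_pairs: list, fraction: float = 0.03, seed: int = 0) -> list:
    """Randomly sample a small fraction (here about 3%) of the original
    cross-modal training set's (B-modality item, source-language caption) entries."""
    rng = random.Random(seed)
    k = max(1, int(len(all_pairs) * fraction))
    return rng.sample(all_pairs, k)
```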
Table 1 compares the training difficulty of the five schemes and the present application. Scheme one (re-collecting data of B-modality and target-language description pairs and training a cross-modal model) has the highest data-acquisition difficulty; the training data generated by scheme two (using machine translation to translate the source language into the target language and generate B-modality and target-language description pairs) introduce translation noise and have the worst quality. Because of the data-acquisition difficulty and quality problems of schemes one and two, these two schemes were not trained in the experiment and do not take part in the effect comparison with the present application. In addition, scheme five adopts a two-stage training mode and is the most cumbersome to train. For the effect comparison, the application is therefore compared only with scheme three and scheme four, whose data are easy to obtain and which are trained in a single stage; the comparison results are shown in Table 2:
table 1: comparison of the training difficulty level of the application and each scheme
Table 2: comparison of the effects of the application and the various schemes
The comparative experiment clearly shows that on the VaTEX test set, the recall of retrieving videos with the extended language (Chinese) is far better than that of scheme four, which trains only the adapter, and slightly better than that of scheme three, which trains a multilingual encoder with knowledge distillation; the recall of retrieving videos with the source language (English) is better than that of scheme three and scheme four, and slightly lower than that of the original model;
Overall, the training dataset of the application is easy to obtain, the single-stage training is simple and effective, and a good balance is achieved between training cost and effect.

Claims (3)

1. A method for expanding the language capability of a multi-modal model, comprising the following steps:
A. Retaining and freezing the B-modality encoder: retaining the B-modality encoder of the original cross-modal model and freezing its parameters; defining the output representation of the B-modality encoder as v;
B. Selecting and freezing a pre-trained multilingual text encoder: selecting any pre-trained multilingual text encoder and freezing its parameters; defining the text representation output by the pre-trained multilingual text encoder as t;
C. Defining a multi-layer MLP network as the multilingual text adapter, whose output is the adapted text representation a; the dimension of the adapted text representation a being consistent with that of the output representation v of the B-modality encoder;
D. Connecting the multilingual text adapter behind the pre-trained multilingual text encoder;
E. Selecting the training set: sampling a part of the training set of the original cross-modal model as the training set;
F. Training the multilingual text adapter: the multilingual text adapter is trained by contrastive learning, the multilingual text representation t is aligned with the B-modality encoder output representation v, and the discrepancy between the text representation and the other-modality representations is eliminated.
2. The method for expanding the language capability of a multi-modal model according to claim 1, wherein the pre-trained multilingual text encoder in step B is at least one of XLM, XLM-R, XLM-100 or mMiniLM-L12xH384.
3. The method for expanding the language capability of a multi-modal model according to claim 1, wherein the contrastive-learning loss function used to train the multilingual text adapter in step F is as follows:
Loss = Loss_v2a + Loss_a2v
where Loss_v2a denotes the loss of matching the B-modality encoder output representation to the adapted text representation, Loss_a2v denotes the loss of matching the adapted text representation to the B-modality encoder output representation, B is the training batch size, τ is the temperature hyper-parameter, and sim(x, y) denotes the cosine similarity between two vectors.
CN202310810210.6A 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability Pending CN117010331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310810210.6A CN117010331A (en) 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310810210.6A CN117010331A (en) 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability

Publications (1)

Publication Number Publication Date
CN117010331A true CN117010331A (en) 2023-11-07

Family

ID=88566390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310810210.6A Pending CN117010331A (en) 2023-07-03 2023-07-03 Method for expanding multi-modal model language capability

Country Status (1)

Country Link
CN (1) CN117010331A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218498A (en) * 2023-11-08 2023-12-12 苏州大学 Multi-modal large language model training method and system based on multi-modal encoder
CN117218498B (en) * 2023-11-08 2024-02-23 苏州大学 Multi-modal large language model training method and system based on multi-modal encoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination