CN114005012A - Training method, apparatus, device and storage medium for a multi-modal pre-training model - Google Patents

Training method, apparatus, device and storage medium for a multi-modal pre-training model

Info

Publication number
CN114005012A
Authority
CN
China
Prior art keywords
training
feature vector
image
loss value
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111306978.7A
Other languages
Chinese (zh)
Inventor
刘丰刚
李阳光
梁丰
赵立晨
崔玉峰
邵婧
余锋伟
闫俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202111306978.7A
Publication of CN114005012A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a training method, apparatus, device and storage medium for a multi-modal pre-training model. The method comprises: acquiring initial sample data, the initial sample data comprising a plurality of groups of sample pairs, each group of sample pairs comprising an initial image and supervised text information corresponding to the initial image; inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs; performing data augmentation processing on the initial image and/or the supervised text information in the initial sample data, and determining, based on the multi-modal pre-training model to be trained, a second feature vector corresponding to the augmented sample data obtained by the data augmentation processing; and training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector.

Description

Training method, apparatus, device and storage medium for a multi-modal pre-training model
Technical Field
The present disclosure relates to the technical field of deep learning, and in particular to a training method, apparatus, device and storage medium for a multi-modal pre-training model.
Background
With the rapid development of deep learning technology, large-scale multi-modal pre-training has become one of the popular research directions in the field; its main content is to train a pre-training model on large-scale image samples together with the supervised texts corresponding to those image samples.
In the related art, the loss function used during model training often only includes the feature similarity between the feature vectors corresponding to an image sample and its supervised text, respectively. Such supervision data is limited, so the training effect on a multi-modal pre-training model is poor.
Disclosure of Invention
Embodiments of the present disclosure provide at least a training method, apparatus, device and storage medium for a multi-modal pre-training model.
In a first aspect, an embodiment of the present disclosure provides a training method for a multi-modal pre-training model, including:
acquiring initial sample data, wherein the initial sample data comprises a plurality of groups of sample pairs; each group of sample pairs comprises an initial image and supervised text information corresponding to the initial image;
inputting the multiple groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the multiple groups of sample pairs;
performing data augmentation processing on the initial image and/or the supervised text information in the initial sample data, and determining, based on the multi-modal pre-training model to be trained, a second feature vector corresponding to the augmented sample data obtained by the data augmentation processing;
and training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector.
In this way, data augmentation processing is performed on the initial image and/or the supervised text information in the initial sample data, and the second feature vector corresponding to the augmented sample data is determined based on the multi-modal pre-training model to be trained, so that loss values of more dimensions can be determined when computing the loss values used to train the multi-modal pre-training model, thereby improving the training effect of the multi-modal pre-training model.
In a possible embodiment, the inputting the multiple groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the multiple groups of sample pairs includes:
for any sample pair, inputting the initial image in the sample pair into an image encoder of the multi-modal pre-training model to obtain a first image feature vector corresponding to the initial image; and inputting the supervised text information in the sample pair into a text encoder of the multi-modal pre-training model to obtain a first text feature vector corresponding to the supervised text information.
In a possible implementation, the method further includes performing data augmentation processing on an initial image in the initial sample data according to the following method:
for any initial image, cropping the initial image at least once according to a preset image cropping ratio; and/or adjusting image parameters of the initial image.
In this way, richer sample images can be constructed by performing data augmentation processing on the initial images, so that the second image feature vector can be determined from the augmented sample images obtained by the data augmentation processing, facilitating the training of the multi-modal pre-training model.
In a possible implementation manner, the supervised text information in the initial sample data includes at least two words;
the method further comprises performing data augmentation processing on the supervised text information in the initial sample data as follows:
for any piece of supervised text information, randomly deleting at least one first target word; and/or randomly exchanging the order of at least two second target words in the supervised text information; and/or randomly determining at least one third target word, and replacing the third target word in the supervised text information with an associated word of the third target word.
In this way, richer sample texts can be constructed by performing data augmentation processing on the supervised text information, so that the second text feature vector can be determined from the augmented text information obtained by the data augmentation processing, facilitating the training of the multi-modal pre-training model.
In a possible embodiment, the training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector includes:
determining a target loss value in the training process based on the first feature vector and the second feature vector;
and training the multi-modal pre-training model to be trained based on the target loss value.
In one possible implementation, the first feature vector includes a first image feature vector and a first text feature vector;
the determining a target loss value in the training process based on the first feature vector and the second feature vector includes:
determining a first loss value based on the first image feature vector and the first text feature vector; and determining a second loss value based on the second feature vector, the first image feature vector and the first text feature vector;
determining the target loss value based on the first loss value and the second loss value.
In this way, by determining the second loss value, the gain brought by data augmentation processing of the supervised text information and the initial images can be mapped into the loss function, thereby improving the training effect on the multi-modal pre-training model.
In one possible embodiment, the determining the target loss value based on the first loss value and the second loss value includes:
for any piece of supervised text information, determining similar text information whose feature similarity with that supervised text information meets a first preset condition, and taking the text feature vector corresponding to the similar text information as a third text feature vector corresponding to that supervised text information;
determining a third loss value based on the third text feature vector and the first feature vector;
determining the target loss value based on the first loss value, the second loss value, and the third loss value.
In this way, by determining the third loss value, the gain caused by the process of determining the similar text information corresponding to the supervised text information can be mapped into the loss function, so as to improve the training effect on the multi-modal pre-training model.
In one possible embodiment, the determining the target loss value based on the first loss value and the second loss value includes:
for any initial image, determining a similar image whose feature similarity with that initial image meets a second preset condition, and taking the image feature vector corresponding to the similar image as a third image feature vector corresponding to that initial image;
determining a fourth loss value based on the third image feature vector and the first feature vector;
determining the target loss value based on the first loss value, the second loss value, and the fourth loss value.
In this way, by determining the fourth loss value, the gain brought by the process of determining the similar image corresponding to the initial image can be mapped into the loss function, thereby improving the training effect on the multi-modal pre-training model.
In one possible embodiment, after obtaining the initial sample data, before inputting the plurality of sets of sample pairs to a multi-modal pre-training model to be trained, the method further includes:
and based on preset screening conditions, carrying out data cleaning on the initial sample data.
Therefore, by screening the initial sample data, the quality of the sample data during training of the multi-mode pre-training model to be trained can be higher, and the training effect of the multi-mode pre-training model can be improved.
In one possible embodiment, after the training of the multi-modal pre-training model is completed, the method further includes:
building a multi-modal detection model based on the multi-modal pre-training model and a feature pyramid network;
mapping the region features corresponding to the detection boxes in the training sample data to different layers of the feature pyramid network based on random mapping parameters respectively set for each layer of the feature pyramid network, and calculating a loss value in the training process based on the output of the mapped feature pyramid network;
and adjusting network parameters of network layers of the multi-modal detection model except the multi-modal pre-training model based on the loss value.
In a second aspect, an embodiment of the present disclosure further provides a training apparatus for a multi-modal pre-training model, including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring initial sample data which comprises a plurality of groups of sample pairs; each group of sample pairs comprises an initial image and supervised text information corresponding to the initial image;
the input module is used for inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs;
the processing module is used for performing data augmentation processing on the initial image and/or the supervised text information in the initial sample data, and determining, based on the multi-modal pre-training model to be trained, a second feature vector corresponding to the augmented sample data obtained by the data augmentation processing;
and the training module is used for training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector.
In a possible implementation manner, when the plurality of sets of sample pairs are input to a multi-modal pre-training model to be trained, and a first feature vector corresponding to the plurality of sets of sample pairs is obtained, the input module is configured to:
for any sample pair, inputting the initial image in the sample pair into an image encoder of the multi-modal pre-training model to obtain a first image feature vector corresponding to the initial image; and inputting the supervised text information in the sample pair into a text encoder of the multi-modal pre-training model to obtain a first text feature vector corresponding to the supervised text information.
In a possible implementation manner, the processing module is further configured to perform data augmentation processing on an initial image in the initial sample data according to the following steps:
for any initial image, cropping the initial image at least once according to a preset image cropping ratio; and/or adjusting image parameters of the initial image.
In a possible implementation manner, the supervised text information in the initial sample data includes at least two words;
the processing module is further configured to perform data augmentation processing on the supervised text information in the initial sample data according to the following method:
for any piece of supervised text information, randomly deleting at least one first target word; and/or randomly exchanging the order of at least two second target words in the supervised text information; and/or randomly determining at least one third target word, and replacing the third target word in the supervised text information with an associated word of the third target word.
In a possible embodiment, the training module, when training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector, is configured to:
determining a target loss value in the training process based on the first feature vector and the second feature vector;
and training the multi-modal pre-training model to be trained based on the target loss value.
In one possible implementation, the first feature vector includes a first image feature vector and a first text feature vector;
the training module, when determining a target loss value in the training process based on the first feature vector and the second feature vector, is configured to:
determining a first loss value based on the first image feature vector and the first text feature vector; and determining a second loss value based on the second feature vector, the first image feature vector and the first text feature vector;
determining the target loss value based on the first loss value and the second loss value.
In one possible embodiment, the training module, when determining the target loss value based on the first loss value and the second loss value, is configured to:
for any piece of supervised text information, determine similar text information whose feature similarity with that supervised text information meets a first preset condition, and take the text feature vector corresponding to the similar text information as a third text feature vector corresponding to that supervised text information;
determine a third loss value based on the third text feature vector and the first feature vector;
determine the target loss value based on the first loss value, the second loss value, and the third loss value.
In one possible embodiment, the training module, when determining the target loss value based on the first loss value and the second loss value, is configured to:
for any initial image, determine a similar image whose feature similarity with that initial image meets a second preset condition, and take the image feature vector corresponding to the similar image as a third image feature vector corresponding to that initial image;
determine a fourth loss value based on the third image feature vector and the first feature vector;
determine the target loss value based on the first loss value, the second loss value, and the fourth loss value.
In a possible embodiment, the obtaining module, after obtaining the initial sample data, before inputting the plurality of sets of sample pairs to a multi-modal pre-training model to be trained, is further configured to:
and based on preset screening conditions, carrying out data cleaning on the initial sample data.
In one possible embodiment, after the training of the multi-modal pre-training model is completed, the training module is further configured to:
build a multi-modal detection model based on the multi-modal pre-training model and a feature pyramid network;
map the region features corresponding to the detection boxes in the training sample data to different layers of the feature pyramid network based on random mapping parameters respectively set for each layer of the feature pyramid network, and calculate a loss value in the training process based on the output of the mapped feature pyramid network;
and adjusting network parameters of network layers of the multi-modal detection model except the multi-modal pre-training model based on the loss value.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the first aspect or any possible implementation of the first aspect.
Regarding the beneficial effects of the training apparatus, the computer device and the storage medium for the multi-modal pre-training model provided by the embodiments of the present disclosure, reference is made to the beneficial effects of the training method of the multi-modal pre-training model described above, which are not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those of ordinary skill in the art can derive additional related drawings from them without creative effort.
FIG. 1 is a flow chart illustrating a training method of a multi-modal pre-training model provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a first loss value determination in a training method of a multi-modal pre-training model provided by an embodiment of the disclosure;
FIG. 3 is a diagram illustrating a second loss value determination in a training method of a multi-modal pre-training model provided by an embodiment of the disclosure;
FIG. 4 is a diagram illustrating a third loss value determined in the training method of the multi-modal pre-training model provided by the embodiment of the disclosure;
FIG. 5 is a flowchart illustrating a specific method for constructing a multi-modal detection model in a training method of a multi-modal pre-training model provided by an embodiment of the present disclosure;
FIG. 6 is a diagram illustrating a loss value calculation in a training method of a multi-modal pre-training model provided by an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating an architecture of a training apparatus for a multi-modal pre-training model provided in an embodiment of the present disclosure;
fig. 8 shows a schematic structural diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research shows that, during model training, the loss function often only contains the feature similarity between the feature vectors corresponding to the image sample and its supervised text, respectively; such supervision data is limited, so the training effect on the multi-modal pre-training model is poor.
Specifically, the purpose of multi-modal pre-training is to enable the model to learn semantic mapping relationships between different modalities, such as to associate the word "cat" with the area in which the cat appears in the picture.
Based on the above research, the present disclosure provides a training method, apparatus, device and storage medium for a multi-modal pre-training model, in which data augmentation processing is performed on the initial images and/or the supervised text information in the initial sample data, and a second feature vector corresponding to the augmented sample data obtained by the data augmentation processing is determined based on the multi-modal pre-training model to be trained, so that loss values of more dimensions can be determined when computing the loss values used to train the multi-modal pre-training model, thereby improving the training effect of the multi-modal pre-training model.
To facilitate understanding of the present embodiment, a detailed description is first given of a training method of a multi-modal pre-training model disclosed in an embodiment of the present disclosure. The execution subject of the training method provided in the embodiments of the present disclosure is generally a computer device with certain computing power, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a user terminal, a computing device, or another processing device. In some possible implementations, the training method of the multi-modal pre-training model may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a training method of a multi-modal pre-training model provided in an embodiment of the present disclosure is shown; the method includes steps S101 to S104:
s101: acquiring initial sample data, wherein the initial sample data comprises a plurality of groups of sample pairs; each set of the sample pairs includes an initial image and supervised text information corresponding to the initial image.
S102: and inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs.
S103: and respectively performing data augmentation processing on the initial image and/or the supervision text information in the initial sample data, and determining a second feature vector corresponding to the augmented sample data after the data augmentation processing based on the multi-mode pre-training model to be trained.
S104: and training the multi-mode pre-training model to be trained based on the first feature vector and the second feature vector.
The following is a detailed description of the above steps.
For S101, the supervised text information corresponding to the initial image is used to describe the initial image, so as to perform supervised training on the multi-modal pre-training model through the content described by the supervised text information.
For example, in a multi-modal pre-training scenario, the initial image may include a person and a dog, and the supervised text information corresponding to the initial image may be "one person walks the dog" to describe the content in the initial image.
Specifically, when the initial sample data is obtained, a web crawler (a program for automatically obtaining web page content) may be used to obtain an initial image and supervised text information (i.e. description information of the initial image) corresponding to the initial image from each web page; or, the initial sample data may also be directly obtained from a sample data set constructed in advance.
In practical application, initial sample data in a sample data set (or initial sample data crawled by a web crawler) may not meet the requirement of the multi-modal pre-training model in an application scene, so that the training effect is poor when the initial sample data is directly used for training.
In a possible implementation manner, after the initial sample data is obtained, data cleaning may be performed on the initial sample data based on a preset screening condition.
Specifically, the preset screening condition may include at least one of the following conditions:
condition 1, whether there is a messy code or a special symbol in the text
Here, if a messy code or a special symbol exists in the text in the supervised text information, the sample pair where the supervised text information is located can be deleted, thereby avoiding the problems of low training efficiency and the like caused by difficulty or incapability of recognizing the supervised text information.
Condition 2: whether the word count of the text is within a preset range
Here, supervised text information with too many words often describes a relatively complex scene, which is more difficult for the pre-training model to learn from; correspondingly, supervised text information with too few words often describes a relatively simple scene, which has lower training value for the pre-training model. Therefore, if the word count of the text in the supervised text information falls outside the preset range, the sample pair in which the supervised text information is located can be deleted, thereby avoiding problems such as poor training effect caused by initial sample data that is too difficult or too simple.
Illustratively, suppose the preset word-count range is 5-8 and the supervised text information is a long sentence such as "the child broke the window of Lao Wang next door, and the child's father is disciplining him", describing a parent in the corresponding initial image disciplining a child. Since the supervised text information exceeds the preset word-count range, the sample pair corresponding to this supervised text information can be deleted.
Condition 3: whether a target image parameter of the image meets a threshold requirement
Here, the target image parameter may be the resolution, brightness, contrast, or the like of the image.
For example, taking the target image parameter as the resolution and the threshold requirement as being no less than 1920px × 1080px (that is, the resolution of the image needs to reach the 1080P picture format), if the resolution of the initial image in a sample pair is 1280px × 720px (that is, 720P), which is lower than 1080P, the sample pair in which the initial image is located may be deleted, thereby avoiding problems such as low training efficiency caused by the initial image being difficult to recognize.
Condition 4: whether the text contains keywords corresponding to the application scenario
Here, if the text of the supervised text information contains no keyword related to the target application scenario in which the model will actually be used, the sample pair in which the supervised text information is located may be deleted, thereby avoiding problems such as poor training effect caused by weak relevance between the supervised text information and the actual application scenario.
It should be noted that, in actual use, at least one of the above conditions may be selected as required to perform data cleaning on the initial sample data; which conditions are selected depends on actual requirements, and this is not limited by the present disclosure.
In this way, by screening the initial sample data, the quality of the sample data used for training the multi-modal pre-training model to be trained can be made higher, which improves the training effect of the multi-modal pre-training model.
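By way of illustration only, the four screening conditions above can be sketched as a single filtering pass. The thresholds, keyword list and regular expression below are assumptions for the example rather than values fixed by this disclosure, and the image object is assumed to expose a PIL-style `size` attribute:

```python
import re

def clean_sample_pairs(sample_pairs, min_words=5, max_words=50,
                       min_resolution=(1920, 1080),
                       scene_keywords=("dog", "cat", "person")):
    """Drop sample pairs that violate any of the four screening conditions."""
    kept = []
    for image, text in sample_pairs:
        # Condition 1: garbled characters or special symbols in the text.
        if re.search(r"[\ufffd#@$%^*]", text):
            continue
        # Condition 2: word count outside the preset range.
        if not (min_words <= len(text.split()) <= max_words):
            continue
        # Condition 3: target image parameter (resolution here) below threshold.
        w, h = image.size  # assumes a PIL.Image-like object
        if w < min_resolution[0] or h < min_resolution[1]:
            continue
        # Condition 4: no keyword related to the target application scenario.
        if not any(k in text.lower() for k in scene_keywords):
            continue
        kept.append((image, text))
    return kept
```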
S102: and inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs.
Here, for any sample pair, the initial image in the sample pair may be input into an image encoder of the multi-modal pre-training model to obtain a first image feature vector corresponding to the initial image; and the supervised text information in the sample pair may be input into a text encoder of the multi-modal pre-training model to obtain a first text feature vector corresponding to the supervised text information.
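A minimal PyTorch sketch of this dual-encoder step follows. The disclosure does not specify the encoder architectures, so the ResNet-50 image encoder and the bag-of-embeddings text encoder are illustrative stand-ins; features are L2-normalised so that dot products equal cosine similarities:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        backbone = resnet50()
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)  # project to joint space
        self.image_encoder = backbone
        self.token_emb = nn.EmbeddingBag(vocab_size, dim)      # stand-in text encoder
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, images, token_ids):
        # First image / first text feature vectors, L2-normalised.
        img_vec = F.normalize(self.image_encoder(images), dim=-1)
        txt_vec = F.normalize(self.text_proj(self.token_emb(token_ids)), dim=-1)
        return img_vec, txt_vec

model = DualEncoder()
images = torch.randn(4, 3, 224, 224)          # a batch of 4 initial images
token_ids = torch.randint(0, 30000, (4, 12))  # 4 tokenised supervised texts
img_vec, txt_vec = model(images, token_ids)   # each of shape (4, 256)
```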
S103: and respectively performing data augmentation processing on the initial image and/or the supervision text information in the initial sample data, and determining a second feature vector corresponding to the augmented sample data after the data augmentation processing based on the multi-mode pre-training model to be trained.
When data augmentation processing is performed, data augmentation processing may be performed only on the initial image in the initial sample data, and the augmented sample data after the data augmentation processing includes an augmented sample image obtained after the data augmentation processing and supervised text information corresponding to the initial image that is not subjected to the data augmentation processing;
or, when performing data augmentation processing, only the supervised text information corresponding to the initial image in the initial sample data may be subjected to data augmentation processing, and the augmented sample data after the data augmentation processing includes the augmented text information obtained after the data augmentation processing and the initial image corresponding to the supervised text information without the data augmentation processing;
or, during data augmentation processing, both the initial image in the initial sample data and the supervised text information corresponding to the initial image may be subjected to data augmentation processing; the augmented sample data after the data augmentation processing then includes the augmented text information and the augmented sample image obtained by the data augmentation processing.
In the following, several ways of performing data augmentation processing on the initial image will be described respectively:
Mode A1: cropping the image according to a preset image cropping ratio
Here, an association between original image ratios and image cropping ratios may be established in advance. When performing image cropping, at least one image cropping ratio corresponding to the original image ratio of the initial image is determined, and the initial image is cropped according to the at least one image cropping ratio to obtain at least one cropped augmented sample image.
For example, taking an initial image with an original image ratio of 16:10, the corresponding image cropping ratio may be determined as 16:9 according to the preset association between original image ratios and image cropping ratios, and the initial image may then be cropped at the 16:9 ratio.
Mode A2: adjusting image parameters of the image
Here, the image parameter may be resolution, brightness, contrast, saturation, or the like; by adjusting the image parameters of the initial image, at least one augmented sample image corresponding to the initial image can be obtained.
For example, for an initial image with a brightness of 30 nits, two augmented sample images with brightnesses of 40 nits and 50 nits, respectively, can be generated by adjusting the brightness of the initial image.
Mode A3: cropping the image according to preset cropping parameters
Here, the cropping parameter may be set for first target position information of the vertices of the initial image in a pre-constructed target coordinate system. The first target position information is adjusted according to the cropping parameter; the second target position information obtained after the adjustment then determines the position, in the target coordinate system, of the augmented sample image produced by the cropping, so that the cropped augmented sample image can be obtained from its position in the target coordinate system.
The cropping parameter may be randomly selected from a preset numerical range, for example -0.1 to 0.1. When the cropping parameter is used to adjust the first target position information, it may be applied to each of the first coordinates corresponding to the first target position information to obtain the second target position information.
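The three augmentation modes A1-A3 can be sketched as follows with PIL. The 16:9 crop, the brightness adjustment and the ±0.1 coordinate shift come from the examples above, while the concrete parameter ranges are assumptions:

```python
import random
from PIL import Image, ImageEnhance

def augment_image(img: Image.Image) -> Image.Image:
    # Mode A1: crop to a preset ratio (e.g. 16:9 for a 16:10 original).
    w, h = img.size
    target_h = int(w * 9 / 16)
    if target_h <= h:
        top = (h - target_h) // 2
        img = img.crop((0, top, w, top + target_h))
    # Mode A2: adjust an image parameter (brightness, as in the 30-nit example).
    img = ImageEnhance.Brightness(img).enhance(random.uniform(1.1, 1.7))
    # Mode A3: shift the corner coordinates by a random factor in [-0.1, 0.1].
    w, h = img.size
    dx = int(random.uniform(-0.1, 0.1) * w)
    dy = int(random.uniform(-0.1, 0.1) * h)
    return img.crop((max(0, dx), max(0, dy), min(w, w + dx), min(h, h + dy)))
```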
In the following, several ways of performing data augmentation processing on the supervised text information are described:
Mode B1: randomly deleting at least one word from the supervised text information
For example, taking the supervised text information "one lovely white cat", the word "one" may be deleted, giving the augmented text information "lovely white cat" after the data augmentation processing.
Mode B2: randomly exchanging the order of at least two words in the supervised text information
Here, since the order of at least two words needs to be exchanged, the supervised text information must contain at least two words.
Illustratively, taking the supervised text information "person and dog playing", the augmented text information "dog and person playing" can be obtained after the data augmentation processing by exchanging the order of "dog" and "person".
Mode B3: randomly determining at least one word in the supervised text information and replacing it with an associated word
Here, the associated word may be a synonym, an alias, a formal name, a colloquial name, or the like with the same or similar semantics.
For example, where the supervised text information is "braised pork with potatoes", the word for "potato" may be replaced by a synonymous word for potato (the original Chinese has two words for potato, one colloquial and one formal), giving augmented text information with the same meaning after the data augmentation processing.
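Modes B1-B3 can be sketched as follows; the synonym table is a hypothetical stand-in for whatever associated-word source (a thesaurus, word vectors) an implementation would use:

```python
import random

SYNONYMS = {"potato": "spud", "lovely": "cute"}  # hypothetical associated words

def augment_text(text: str) -> str:
    words = text.split()
    mode = random.choice(("delete", "swap", "replace"))
    if mode == "delete" and len(words) > 1:    # Mode B1: drop a random word
        words.pop(random.randrange(len(words)))
    elif mode == "swap" and len(words) >= 2:   # Mode B2: swap two word positions
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    elif mode == "replace":                    # Mode B3: associated-word replacement
        for i, wd in enumerate(words):
            if wd in SYNONYMS:
                words[i] = SYNONYMS[wd]
                break
    return " ".join(words)

print(augment_text("one lovely white cat"))  # e.g. "lovely white cat" (Mode B1)
```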
In this way, by performing data augmentation processing on the initial sample data, more varied sample data can be obtained, allowing more loss terms to be set in the subsequent training of the multi-modal pre-training model, which can improve the training effect on the multi-modal pre-training model.
Further, after obtaining the augmented sample data produced by the data augmentation processing, for the sample pair in any augmented sample data, the target image (which may be an initial image or an augmented sample image) in the sample pair may be input into the image encoder of the multi-modal pre-training model to obtain a second image feature vector corresponding to the target image; and the target text information (which may be supervised text information or augmented text information) in the sample pair may be input into the text encoder of the multi-modal pre-training model to obtain a second text feature vector corresponding to the target text information.
S104: and training the multi-mode pre-training model to be trained based on the first feature vector and the second feature vector.
Here, when the multi-modal pre-training model to be trained is trained, a target loss value in the current training process may be determined based on the first feature vector and the second feature vector; and training the multi-modal pre-training model to be trained based on the target loss value.
Specifically, the target loss value may be determined from a plurality of loss functions: each loss function determines a corresponding loss value based on the first feature vector and/or the second feature vector, and the target loss value is determined from the loss values given by the respective loss functions.
In the following, several methods of determining the loss value that may constitute the target loss value will be described:
Type 1: the first loss value
Here, the first loss value may be determined based on a first image feature vector and a first text feature vector in the first feature vector.
Specifically, for any sample pair, a first feature similarity (for example, cosine similarity) between the first image feature vector corresponding to the initial image in the sample pair and the first text feature vector corresponding to the supervised text information in the sample pair may be determined; second feature similarities (for example, cosine similarities) between the first image feature vector in the sample pair and the text feature vectors corresponding to the supervised text information in the other sample pairs of the initial sample data may be determined; third feature similarities (for example, cosine similarities) between the first text feature vector in the sample pair and the image feature vectors corresponding to the initial images in the other sample pairs of the initial sample data may be determined; and the first loss value is determined based on the first feature similarity, the second feature similarities, and the third feature similarities.
For example, a schematic diagram for determining the first loss value is shown in fig. 2. In fig. 2 there are 3 groups of sample pairs: W1, W2, W3 denote the first text feature vectors corresponding to the supervised text information in the 1st, 2nd and 3rd sample pairs, and P1, P2, P3 denote the first image feature vectors corresponding to the initial images in the 1st, 2nd and 3rd sample pairs. When determining the first loss value corresponding to the 1st group of sample pairs: the first feature similarity X between W1 and P1 is determined; the three second feature similarities N1, N2, N3 between P1 and W1, W2, W3 respectively are determined; and the three third feature similarities M1, M2, M3 between W1 and P1, P2, P3 respectively are determined. The first loss value corresponding to the 1st group of sample pairs can then be expressed, in the symmetric contrastive form with temperature parameter $\tau$, as
$$L_1 = -\log\frac{e^{X/\tau}}{\sum_{j=1}^{3} e^{N_j/\tau}} - \log\frac{e^{X/\tau}}{\sum_{j=1}^{3} e^{M_j/\tau}}$$
When determining the first loss values corresponding to the 2nd and 3rd groups of sample pairs, the steps are the same as for the 1st group and are not repeated here; after the first loss values corresponding to the respective groups of sample pairs are obtained, they can be added together to obtain the first loss value for this round of training.
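Under the contrastive form given above, the first loss value over a batch can be sketched as follows; the temperature value is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def first_loss(img_vec: torch.Tensor, txt_vec: torch.Tensor, tau: float = 0.07):
    """img_vec, txt_vec: L2-normalised, shape (B, D); row i is sample pair i."""
    logits = img_vec @ txt_vec.t() / tau             # entry (i, j) = similarity P_i-W_j
    targets = torch.arange(img_vec.size(0))          # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text terms (the N_j)
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image terms (the M_j)
    return loss_i2t + loss_t2i
```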
Type 2: the second loss value
Here, the second loss value may be determined based on the second feature vector, the first image feature vector, and the first text feature vector.
Specifically, the following cases may be classified according to whether to perform data augmentation processing on the initial image and/or the supervised text information in the initial sample data:
case 1, data expansion processing is performed only on the initial image
Specifically, for any sample pair, when determining the second loss value, a fourth feature similarity (for example, cosine similarity, etc.) between the second image feature vector and the first text feature vector (that is, the second text feature vector, which is the same at this time) may be determined; determining a fifth feature similarity (such as cosine similarity) between the second image feature vector in the sample pair and the text feature vectors corresponding to the supervised text information (or augmented text information) in the initial sample data and/or augmented sample data and in other sample pairs respectively; determining sixth feature similarity (such as cosine similarity) between second text feature vectors in the sample pair and image feature vectors corresponding to the initial image (or augmented sample image) in other sample pairs in the initial sample data and/or augmented sample data respectively; determining the first loss value based on the fourth feature similarity, the fifth feature similarity, and the sixth feature similarity.
For example, a schematic diagram for determining the second loss value is shown in fig. 3. In fig. 3, P1, P2, P3, W1, W2, W3 have the same meanings as in fig. 2; P4, P5, P6 denote the second image feature vectors corresponding to the augmented sample images in the 1st, 2nd and 3rd sample pairs after data augmentation, and W4, W5, W6 denote the second text feature vectors corresponding to the supervised text information in the 1st, 2nd and 3rd sample pairs after data augmentation. The computation for case 1 lies in the left half of fig. 3; the terms to be computed are P4-W1, P5-W2 and P6-W3 (the three cells filled with diagonal lines in the lower-left part of fig. 3). Their computation is similar to, and follows the same principle as, that of P1-W1, P2-W2 and P3-W3 (the three shaded cells appearing in both fig. 2 and fig. 3). The physical meaning of computing the second loss value is to map the gain brought by data augmentation of the initial images into the loss function, so as to improve the training effect on the multi-modal pre-training model.
Case 2: data augmentation processing is performed only on the supervised text information
For example, the schematic diagram for determining the second loss value is again shown in fig. 3. The computation for case 2 lies in the upper half of fig. 3; the terms to be computed are P1-W4, P2-W5 and P3-W6 (the three cells filled with diagonal lines in the upper-right part of fig. 3). Their computation is similar to, and follows the same principle as, that of P1-W1, P2-W2 and P3-W3 (the three shaded cells appearing in both fig. 2 and fig. 3). The physical meaning of computing the second loss value is to map the gain brought by data augmentation of the supervised text information into the loss function, so as to improve the training effect on the multi-modal pre-training model.
Case 3: data augmentation processing is performed on both the initial image and the corresponding supervised text information
For example, the schematic diagram for determining the second loss value is again shown in fig. 3. In case 3, the terms to be computed are P1-W4, P2-W5, P3-W6, P4-W1, P5-W2, P6-W3, P4-W4, P5-W5 and P6-W6 (all the cells filled with diagonal lines in fig. 3). Their computation is similar to, and follows the same principle as, that of P1-W1, P2-W2 and P3-W3 (the three shaded cells appearing in both fig. 2 and fig. 3). The physical meaning of computing the second loss value is to map the gain brought by data augmentation of both the supervised text information and the initial images into the loss function, so as to improve the training effect on the multi-modal pre-training model.
Type 3: the third loss value
Here, for any piece of supervised text information, similar text information whose feature similarity with that supervised text information satisfies a first preset condition may be determined, and the text feature vector corresponding to the similar text information is used as a third text feature vector corresponding to that supervised text information; a third loss value is then determined based on the third text feature vector and the first feature vector.
Specifically, the first preset condition may be that the feature similarity with the supervised text information is the highest. Taking the supervised text information "a dog" as an example, the similar text information may be another phrasing of "a dog" that has the highest feature similarity with it (the two phrasings differ in the original Chinese but have the same meaning).
For example, a schematic diagram for determining the third loss value is shown in fig. 4. In fig. 4, P1-P6 and W1-W6 have the same meanings as in fig. 3; W7, W8, W9 denote the third text feature vectors corresponding to the supervised text information in the 1st, 2nd and 3rd sample pairs. The terms to be computed are P1-W7, P2-W8 and P3-W9 (the three cells filled with straight lines in the upper-right part of fig. 4). Their computation is similar to, and follows the same principle as, that of P1-W1, P2-W2 and P3-W3 (the three shaded cells appearing in both fig. 2 and fig. 3), and is not explained again here. The physical meaning of computing the third loss value is to map the gain brought by the process of determining similar text information corresponding to the supervised text information into the loss function, so as to improve the training effect on the multi-modal pre-training model.
Type 4: the fourth loss value
Here, for any initial image, a similar image whose feature similarity with that initial image meets a second preset condition is determined, and the image feature vector corresponding to the similar image is used as a third image feature vector corresponding to that initial image; a fourth loss value is then determined based on the third image feature vector and the first feature vector.
For example, a schematic diagram for determining the fourth loss value is also shown in fig. 4. In fig. 4, P7, P8, P9 denote the third image feature vectors corresponding to the initial images in the 1st, 2nd and 3rd sample pairs. The terms to be computed are P7-W1, P8-W2 and P9-W3 (the three cells filled with straight lines in the lower-left part of fig. 4). Their computation is similar to, and follows the same principle as, that of P1-W1, P2-W2 and P3-W3 (the three shaded cells appearing in both fig. 2 and fig. 3). The physical meaning of computing the fourth loss value is to map the gain brought by the process of determining the similar image corresponding to the initial image into the loss function, so as to improve the training effect on the multi-modal pre-training model.
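The third and fourth loss values share one pattern: replace a sample's own feature with that of its most similar neighbour and score it against the paired feature of the other modality. Below is a sketch under the assumption that candidate features are held in a feature bank (similar in spirit to nearest-neighbour contrastive learning); the bank and the loss form are not specified by this disclosure:

```python
import torch
import torch.nn.functional as F

def neighbour_loss(query_vec, paired_vec, bank, tau=0.07):
    """query_vec: (B, D) text (or image) features; paired_vec: (B, D) features of
    the other modality; bank: (K, D) candidate features for neighbour lookup."""
    sim = query_vec @ bank.t()              # feature similarity to all candidates
    nn_vec = bank[sim.argmax(dim=1)]        # third text/image feature vector
    logits = nn_vec @ paired_vec.t() / tau  # e.g. the P1-W7, P2-W8, P3-W9 terms
    targets = torch.arange(query_vec.size(0))
    return F.cross_entropy(logits, targets)
```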
Type 5: the fifth loss value
Here, in a self-supervised training manner, for any piece of supervised text information (or any initial image), the fifth loss value may be determined based on the first text feature vector (or first image feature vector) corresponding to it and the second text feature vector (or second image feature vector) corresponding to it after data augmentation processing. The physical meaning of computing the fifth loss value is to add a self-supervised training signal to the training of the multi-modal pre-training model, so as to improve the training effect on the multi-modal pre-training model.
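A minimal sketch of such a self-supervised term; the negative-cosine form is an assumption, chosen because it directly rewards agreement between a sample and its augmented version:

```python
import torch.nn.functional as F

def fifth_loss(vec, aug_vec):
    """vec / aug_vec: (B, D) features of a sample and of its augmented version."""
    return -F.cosine_similarity(vec, aug_vec, dim=-1).mean()
```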
When determining the target loss value, any one or more of the above loss values may be combined, for example by weighted summation; for instance, the first loss value, the second loss value and the third loss value may be weighted and summed to determine the target loss value. In addition, when several of the above determination methods are used, loss values may be computed not only for the initial sample data but also for the augmented sample data obtained by the data augmentation processing: for example, when determining the fourth loss value, in addition to the terms P7-W1, P8-W2 and P9-W3 corresponding to the initial sample data, the terms P7-W4, P8-W5 and P9-W6 corresponding to the augmented sample data may also be computed (the specific calculation is similar to that described above and is not repeated here). Which loss values are used to determine the target loss value may be selected according to actual requirements, and this is not limited by the embodiments of the present disclosure.
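Combining the terms by weighted summation can be sketched as below; the weights are illustrative assumptions, and optional terms are simply omitted when not selected:

```python
def target_loss(l1, l2, l3=None, l4=None, l5=None,
                weights=(1.0, 0.5, 0.3, 0.3, 0.2)):
    """Weighted sum of the selected loss values; l3-l5 are optional terms."""
    total = weights[0] * l1 + weights[1] * l2
    for w, l in zip(weights[2:], (l3, l4, l5)):
        if l is not None:
            total = total + w * l
    return total
```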
In one possible embodiment, as shown in fig. 5, after the training of the multi-modal pre-training model is completed, a multi-modal detection model may be further constructed according to the following steps:
s501: and building a multi-mode detection model based on the multi-mode pre-training model and the characteristic pyramid network.
Here, the multi-modal detection model may be built from the image encoder of the multi-modal pre-training model, a feature pyramid network, and a head network.
Specifically, the image encoder produces five outputs, C1-C5. Of these, C2-C5 are input to the feature pyramid network, and the output of the feature pyramid network is passed to a subsequent head network to obtain an output feature vector, so that the multi-modal detection model is constructed.
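A minimal sketch of this construction using torchvision's FeaturePyramidNetwork; the encoder stages are assumed to be ResNet-style with C2-C5 channel widths 256/512/1024/2048, and the head network is a placeholder convolution:

```python
import torch
import torch.nn as nn
from torchvision.ops import FeaturePyramidNetwork

class MultiModalDetector(nn.Module):
    def __init__(self, encoder_stages: nn.ModuleDict):
        super().__init__()
        self.stages = encoder_stages  # pre-trained image-encoder stages yielding C2-C5
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], out_channels=256)
        self.head = nn.Conv2d(256, 256, 3, padding=1)  # placeholder head network

    def forward(self, x):
        feats = {}
        for name, stage in self.stages.items():  # run stages in order to get C2-C5
            x = stage(x)
            feats[name] = x
        pyramid = self.fpn(feats)                # dict of FPN levels
        return {k: self.head(v) for k, v in pyramid.items()}
```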
S502: and mapping the regional characteristics corresponding to the detection boxes in the training sample data to different layers of the feature pyramid network based on random mapping parameters respectively set for each layer of the feature pyramid network, and calculating a loss value in the training process based on the output of the mapped feature pyramid network.
S503: and adjusting network parameters of network layers of the multi-modal detection model except the multi-modal pre-training model based on the loss value.
Here, when training the multi-modal detection model, two groups of training models may be constructed. The first group is the multi-modal detection model that is ultimately required; it consists of the image encoder, the feature pyramid network, the head network, and several auxiliary training modules used during training (such as a mapping layer and a prediction layer), and its output is a first training vector. The second group is a model used to train the first group; it consists of an image encoder, a feature pyramid network, a head network, and an auxiliary training module (such as a mapping layer), and its output is a second training vector. The loss value of the first group of training models can be determined based on the feature similarity between the first training vector and the second training vector, and the first group of training models is trained according to this loss value to obtain the trained multi-modal detection model.
In one possible implementation, during training, a feature mapping module may be used to map the region features corresponding to a plurality of detection boxes in the training sample data to different layers of the feature pyramid network according to random mapping parameters respectively set for each layer (for example, the mapping probabilities for the layers where C1 to C4 are located may be 0.4, 0.3, 0.2 and 0.1), replacing features in the feature pyramid network. Subsequent processing is then performed on the replaced features to obtain the final first training vector (or second training vector), which can enhance the detection capability of the multi-modal detection model.
Specifically, the plurality of detection boxes can be obtained by labeling an initial sample image with a target detection algorithm. Before feature mapping, the input of both the first and second groups of training models is the initial sample image, and C1 to C4 in the feature pyramid are features extracted from that image. When performing feature mapping, data augmentation processing may be applied to the initial sample image labeled with the detection boxes to obtain two augmented training images with detection boxes (i.e., the training sample data); for either training image, the detection boxes in that image and the random mapping parameters are used to map the region features corresponding to the detection boxes to different layers of the feature pyramid network, completing the feature mapping process.
For example, the process of calculating the loss value may be as shown in Fig. 6. The first group of training models is the model at the upper right of the figure: a feature pyramid network (FPN in the figure) is built from C2 to C5 and is followed, in order, by the head network, a Proj layer (mapping layer), and a Pred layer (prediction layer). The second group of training models is the model at the lower right of the figure: a feature pyramid network (FPN in the figure) is built from C2 to C5 and is followed, in order, by the head network and a Proj layer (mapping layer). The loss value is calculated from the feature similarity (similarity in the figure) between the output vectors of the two groups of training models for the initial sample image. The parts outside the dashed box in Fig. 6 belong to the earlier training process and are not explained here. When feature mapping is performed, the feature mapping module (Random assign in the figure) maps the features corresponding to the detection boxes in the augmented images after data augmentation processing to the FPNs of the first and second groups of training models, respectively, according to the random mapping parameters set for each layer of the feature pyramid network.
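The Random assign step may be sketched as follows; the use of roi_align, the fixed 7x7 region size, and the paste-back location are illustrative assumptions rather than the exact mechanism of the present disclosure:

```python
import random

from torchvision.ops import roi_align

def random_assign(pyramid_feats, boxes, probs=(0.4, 0.3, 0.2, 0.1)):
    """Map each box's region feature onto a randomly chosen pyramid level.

    pyramid_feats: list of four tensors [N, C, H, W] (H, W >= 7 assumed);
    boxes: tensor [K, 5] of (batch_index, x1, y1, x2, y2).
    """
    for box in boxes:
        level = random.choices(range(len(pyramid_feats)), weights=probs)[0]
        feat = pyramid_feats[level]
        region = roi_align(feat, box.unsqueeze(0), output_size=(7, 7))  # [1, C, 7, 7]
        # replace part of the chosen level's features with the region feature
        feat[int(box[0]), :, :7, :7] = region[0]
    return pyramid_feats
```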
According to the training method of the multi-modal pre-training model provided by the embodiments of the present disclosure, data augmentation processing is performed on the initial image and/or the supervised text information in the initial sample data, and the second feature vector corresponding to the augmented sample data after the data augmentation processing is determined based on the multi-modal pre-training model to be trained. In this way, loss values of more dimensions can be determined when training the multi-modal pre-training model, improving the training effect of the multi-modal pre-training model.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, an embodiment of the present disclosure further provides a training device for a multi-modal pre-training model, corresponding to the training method of the multi-modal pre-training model. Since the principle by which the device solves the problem is similar to that of the training method of the multi-modal pre-training model in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
Referring to Fig. 7, an architecture diagram of a training apparatus for a multi-modal pre-training model provided by an embodiment of the present disclosure is shown. The apparatus includes: an acquisition module 701, an input module 702, a processing module 703, and a training module 704; wherein:
an acquisition module 701, configured to acquire initial sample data, where the initial sample data includes multiple groups of sample pairs; each group of sample pairs includes an initial image and supervised text information corresponding to the initial image;
an input module 702, configured to input the multiple groups of sample pairs into a multi-modal pre-training model to be trained, to obtain first feature vectors corresponding to the multiple groups of sample pairs;
a processing module 703, configured to perform data augmentation processing on the initial image and/or the supervised text information in the initial sample data, and determine, based on the to-be-trained multi-modal pre-training model, a second feature vector corresponding to augmented sample data after the data augmentation processing;
a training module 704, configured to train the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector.
In a possible implementation manner, when inputting the plurality of groups of sample pairs into the multi-modal pre-training model to be trained to obtain the first feature vectors corresponding to the plurality of groups of sample pairs, the input module 702 is configured to:
for any sample pair, inputting the initial image in the sample pair into an image encoder of the multi-modal pre-training model to obtain a first image feature vector corresponding to the initial image; and inputting the supervised text information in the sample pair into a text encoder of the multi-modal pre-training model to obtain a first text feature vector corresponding to the supervised text information.
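For illustration, this dual-encoder forward pass may be sketched as follows (the encoder objects and their call signatures are assumptions):

```python
def encode_sample_pair(image_encoder, text_encoder, initial_image, supervised_text):
    """Sketch of the dual-encoder forward pass for one sample pair."""
    first_image_feature = image_encoder(initial_image)  # first image feature vector
    first_text_feature = text_encoder(supervised_text)  # first text feature vector
    return first_image_feature, first_text_feature
```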
In a possible implementation manner, the processing module 703 is further configured to perform data augmentation processing on an initial image in the initial sample data according to the following steps:
for any initial image, cropping the initial image at least once according to a preset image cropping ratio; and/or adjusting image parameters of the initial image.
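A minimal sketch of the image-side augmentation follows; the 0.8 cropping ratio and the jitter strengths are assumed values:

```python
import torchvision.transforms as T

def augment_image(initial_image, crop_ratio=0.8):
    """Crop once at a preset ratio, then adjust image parameters."""
    w, h = initial_image.size  # PIL image assumed
    crop = T.RandomCrop((int(h * crop_ratio), int(w * crop_ratio)))
    jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)
    return jitter(crop(initial_image))
```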
In a possible implementation manner, the supervised text information in the initial sample data includes at least two words;
the processing module 703 is further configured to perform data augmentation processing on the supervised text information in the initial sample data according to the following method:
randomly deleting at least one first target word from any supervised text information; and/or randomly exchanging the order of at least two second target words in the supervised text information; and/or randomly determining at least one third target word and replacing the third target word in the supervised text information with an associated word of the third target word.
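These three text augmentations may be sketched as follows; the associated-word table here is a hypothetical stand-in for whatever association source is actually used:

```python
import random

SYNONYMS = {"dog": "puppy", "photo": "picture"}  # hypothetical associated words

def augment_text(supervised_text):
    words = supervised_text.split()
    if len(words) < 2:
        return supervised_text
    op = random.choice(["delete", "swap", "replace"])
    if op == "delete":
        words.pop(random.randrange(len(words)))        # remove a first target word
    elif op == "swap":
        i, j = random.sample(range(len(words)), 2)     # exchange two second target words
        words[i], words[j] = words[j], words[i]
    else:
        idx = random.randrange(len(words))             # choose a third target word
        words[idx] = SYNONYMS.get(words[idx], words[idx])
    return " ".join(words)
```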
In a possible implementation, the training module 704, when training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector, is configured to:
determining a target loss value in the training process based on the first feature vector and the second feature vector;
and training the multi-modal pre-training model to be trained based on the target loss value.
In one possible implementation, the first feature vector includes a first image feature vector and a first text feature vector;
the training module 704, when determining the target loss value in the current training process based on the first feature vector and the second feature vector, is configured to:
determining a first loss value based on the first image feature vector and the first text feature vector; and determining a second loss value based on the second feature vector, the first image feature vector, and the first text feature vector;
determining the target loss value based on the first loss value and the second loss value.
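For illustration only, the first loss value could be instantiated as a symmetric contrastive loss over a batch of first image feature vectors and first text feature vectors; the present disclosure does not fix the exact form, so the InfoNCE-style formulation and the temperature below are assumptions:

```python
import torch
import torch.nn.functional as F

def first_loss_value(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss between matched image/text features."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(len(img), device=img.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```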
In one possible implementation, the training module 704, when determining the target loss value based on the first loss value and the second loss value, is configured to:
for any piece of supervised text information, determining similar text information whose feature similarity with the supervised text information meets a first preset condition, and taking the text feature vector corresponding to the similar text information as a third text feature vector corresponding to the supervised text information;
determining a third loss value based on the third text feature vector and the first feature vector;
determining the target loss value based on the first loss value, the second loss value, and the third loss value.
In one possible implementation, the training module 704, when determining the target loss value based on the first loss value and the second loss value, is configured to:
for any initial image, determining a similar image whose feature similarity with the initial image meets a second preset condition, and taking the image feature vector corresponding to the similar image as a third image feature vector corresponding to the initial image;
determining a fourth loss value based on the third image feature vector and the first feature vector;
determining the target loss value based on the first loss value, the second loss value, and the fourth loss value.
In a possible implementation, after acquiring the initial sample data and before inputting the plurality of groups of sample pairs into the multi-modal pre-training model to be trained, the acquisition module 701 is further configured to:
and based on preset screening conditions, carrying out data cleaning on the initial sample data.
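A minimal sketch of such data cleaning follows; both screening conditions shown (a minimum caption length and a minimum image side) are hypothetical examples:

```python
def clean_samples(sample_pairs, min_words=3, min_side=64):
    """Keep only sample pairs that satisfy the screening conditions."""
    kept = []
    for image, text in sample_pairs:  # image assumed to be a PIL image
        if len(text.split()) >= min_words and min(image.size) >= min_side:
            kept.append((image, text))
    return kept
```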
In one possible embodiment, after the training of the multi-modal pre-training model is completed, the training module 704 is further configured to:
building a multi-modal detection model based on the multi-modal pre-training model and a feature pyramid network;
mapping the region features corresponding to the detection boxes in the training sample data to different layers of the feature pyramid network based on random mapping parameters respectively set for each layer of the feature pyramid network, and calculating a loss value in the training process based on the output of the mapped feature pyramid network;
and adjusting network parameters of network layers of the multi-modal detection model except the multi-modal pre-training model based on the loss value.
According to the training device for the multi-modal pre-training model provided by the embodiments of the present disclosure, data augmentation processing is performed on the initial image and/or the supervised text information in the initial sample data, and the second feature vector corresponding to the augmented sample data after the data augmentation processing is determined based on the multi-modal pre-training model to be trained. In this way, loss values of more dimensions can be determined when training the multi-modal pre-training model, improving the training effect of the multi-modal pre-training model.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Based on the same technical concept, an embodiment of the present disclosure further provides a computer device. Referring to Fig. 8, a schematic structural diagram of a computer device 800 provided by the embodiment of the present disclosure is shown, including a processor 801, a memory 802, and a bus 803. The memory 802 is used for storing execution instructions and includes an internal memory 8021 and an external storage 8022; the internal memory 8021 is used for temporarily storing operation data in the processor 801 and data exchanged with the external storage 8022, such as a hard disk. The processor 801 exchanges data with the external storage 8022 through the internal memory 8021. When the computer device 800 runs, the processor 801 communicates with the memory 802 through the bus 803, so that the processor 801 executes the following instructions:
acquiring initial sample data, wherein the initial sample data includes a plurality of groups of sample pairs; each group of sample pairs includes an initial image and supervised text information corresponding to the initial image;
inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs;
respectively performing data augmentation processing on the initial image and/or the supervised text information in the initial sample data, and determining, based on the multi-modal pre-training model to be trained, a second feature vector corresponding to the augmented sample data after the data augmentation processing;
and training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector.
The embodiments of the present disclosure further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the training method for the multi-modal pre-training model described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides a computer program product, where the computer program product carries program code, and the instructions included in the program code may be used to execute the steps of the training method for a multi-modal pre-training model described in the above method embodiments; for details, reference may be made to the above method embodiments, which are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, which are used for illustrating the technical solutions of the present disclosure and not for limiting the same, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with this technical field can still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some of the technical features thereof within the technical scope of the present disclosure; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure, and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A method for training a multi-modal pre-training model, comprising:
acquiring initial sample data, wherein the initial sample data comprises a plurality of groups of sample pairs; each group of sample pairs comprises an initial image and supervised text information corresponding to the initial image;
inputting the multiple groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the multiple groups of sample pairs;
respectively performing data augmentation processing on the initial image and/or the supervised text information in the initial sample data, and determining, based on the multi-modal pre-training model to be trained, a second feature vector corresponding to the augmented sample data after the data augmentation processing;
and training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector.
2. The method of claim 1, wherein inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs comprises:
for any sample pair, inputting the initial image in the sample pair into an image encoder of the multi-modal pre-training model to obtain a first image feature vector corresponding to the initial image; and inputting the supervised text information in the sample pair into a text encoder of the multi-modal pre-training model to obtain a first text feature vector corresponding to the supervised text information.
3. The method according to claim 1 or 2, further comprising performing data augmentation processing on an initial image in the initial sample data according to:
for any initial image, cropping the initial image at least once according to a preset image cropping ratio; and/or adjusting image parameters of the initial image.
4. The method according to any one of claims 1 to 3, wherein the supervised text information in the initial sample data comprises at least two words;
the method further comprises the step of carrying out data augmentation processing on the supervision text information in the initial sample data according to the following method:
randomly deleting at least one first target word from any supervised text information; and/or randomly exchanging the order of at least two second target words in the supervised text information; and/or randomly determining at least one third target word and replacing the third target word in the supervised text information with an associated word of the third target word.
5. The method according to any one of claims 1 to 4, wherein the training the multi-modal pre-training model to be trained based on the first feature vector and the second feature vector comprises:
determining a target loss value in the training process based on the first feature vector and the second feature vector;
and training the multi-modal pre-training model to be trained based on the target loss value.
6. The method of claim 5, wherein the first feature vector comprises a first image feature vector and a first text feature vector;
the determining a target loss value in the training process based on the first feature vector and the second feature vector includes:
determining a first loss value based on the first image feature vector and the first text feature vector; and determining a second loss value based on the second feature vector, the first image feature vector, and the first text feature vector;
determining the target loss value based on the first loss value and the second loss value.
7. The method of claim 6, wherein determining the target loss value based on the first loss value and the second loss value comprises:
for any piece of supervised text information, determining similar text information whose feature similarity with the supervised text information meets a first preset condition, and taking the text feature vector corresponding to the similar text information as a third text feature vector corresponding to the supervised text information;
determining a third loss value based on the third text feature vector and the first feature vector;
determining the target loss value based on the first loss value, the second loss value, and the third loss value.
8. The method of claim 6 or 7, wherein determining the target loss value based on the first loss value and the second loss value comprises:
for any initial image, determining a similar image whose feature similarity with the initial image meets a second preset condition, and taking the image feature vector corresponding to the similar image as a third image feature vector corresponding to the initial image;
determining a fourth loss value based on the third image feature vector and the first feature vector;
determining the target loss value based on the first loss value, the second loss value, and the fourth loss value.
9. The method according to any one of claims 1 to 8, wherein after acquiring the initial sample data, before inputting the plurality of sets of sample pairs into a multi-modal pre-training model to be trained, the method further comprises:
and based on preset screening conditions, carrying out data cleaning on the initial sample data.
10. The method of any one of claims 1 to 9, wherein after the training of the multi-modal pre-training model is completed, the method further comprises:
building a multi-modal detection model based on the multi-modal pre-training model and a feature pyramid network;
mapping the region features corresponding to the detection boxes in the training sample data to different layers of the feature pyramid network based on random mapping parameters respectively set for each layer of the feature pyramid network, and calculating a loss value in the training process based on the output of the mapped feature pyramid network;
and adjusting network parameters of network layers of the multi-modal detection model except the multi-modal pre-training model based on the loss value.
11. A training device for multi-modal pre-training models, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring initial sample data which comprises a plurality of groups of sample pairs; each group of sample pairs comprises an initial image and supervised text information corresponding to the initial image;
the input module is used for inputting the plurality of groups of sample pairs into a multi-modal pre-training model to be trained to obtain first feature vectors corresponding to the plurality of groups of sample pairs;
the processing module is used for respectively carrying out data augmentation processing on the initial image and/or the supervision text information in the initial sample data and determining a second feature vector corresponding to the augmented sample data after the data augmentation processing based on the multi-mode pre-training model to be trained;
and the training module is used for training the multi-mode pre-training model to be trained on the basis of the first feature vector and the second feature vector.
12. A computer device, comprising: a processor, a memory and a bus, the memory storing machine readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is run, the machine readable instructions when executed by the processor performing the steps of the method of training a multimodal pre-trained model according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for training a multi-modal pre-training model as claimed in any one of claims 1 to 10.
CN202111306978.7A 2021-11-05 2021-11-05 Training method, device, equipment and storage medium of multi-mode pre-training model Pending CN114005012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111306978.7A CN114005012A (en) 2021-11-05 2021-11-05 Training method, device, equipment and storage medium of multi-mode pre-training model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111306978.7A CN114005012A (en) 2021-11-05 2021-11-05 Training method, device, equipment and storage medium of multi-mode pre-training model

Publications (1)

Publication Number Publication Date
CN114005012A true CN114005012A (en) 2022-02-01

Family

ID=79927919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111306978.7A Pending CN114005012A (en) 2021-11-05 2021-11-05 Training method, device, equipment and storage medium of multi-mode pre-training model

Country Status (1)

Country Link
CN (1) CN114005012A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510585B (en) * 2022-02-15 2023-11-21 北京有竹居网络技术有限公司 Information characterization model construction method and information characterization method
CN114510585A (en) * 2022-02-15 2022-05-17 北京有竹居网络技术有限公司 Information representation model construction method and information representation method
CN114519395A (en) * 2022-02-22 2022-05-20 平安科技(深圳)有限公司 Model training method and device, text abstract generating method and device, and equipment
WO2023159763A1 (en) * 2022-02-22 2023-08-31 平安科技(深圳)有限公司 Model training method and apparatus, text summary generating method and apparatus, and device
CN114519395B (en) * 2022-02-22 2024-05-14 平安科技(深圳)有限公司 Model training method and device, text abstract generating method and device and equipment
CN114239760A (en) * 2022-02-25 2022-03-25 苏州浪潮智能科技有限公司 Multi-modal model training and image recognition method and device, and electronic equipment
WO2023159945A1 (en) * 2022-02-25 2023-08-31 苏州浪潮智能科技有限公司 Multi-modal model training method and apparatus, image recognition method and apparatus, and electronic device
WO2023173546A1 (en) * 2022-03-15 2023-09-21 平安科技(深圳)有限公司 Method and apparatus for training text recognition model, and computer device and storage medium
CN115310547A (en) * 2022-08-12 2022-11-08 中国电信股份有限公司 Model training method, article recognition method and device, electronic device and medium
CN115310547B (en) * 2022-08-12 2023-11-17 中国电信股份有限公司 Model training method, article identification method and device, electronic equipment and medium
WO2024046189A1 (en) * 2022-08-30 2024-03-07 阿里巴巴(中国)有限公司 Text generation method and apparatus
CN117407754A (en) * 2023-10-27 2024-01-16 北京中科闻歌科技股份有限公司 Multi-mode large model training strategy determination method, electronic equipment and medium
CN117407754B (en) * 2023-10-27 2024-04-19 北京中科闻歌科技股份有限公司 Multi-mode large model training strategy determination method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN114005012A (en) Training method, device, equipment and storage medium of multi-mode pre-training model
TWI737006B (en) Cross-modal information retrieval method, device and storage medium
JP2022515620A (en) Image area recognition method by artificial intelligence, model training method, image processing equipment, terminal equipment, server, computer equipment and computer program
KR20210110713A (en) Model training method and apparatus, and terminal and storage medium therefor
CN111242844B (en) Image processing method, device, server and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN110648397A (en) Scene map generation method and device, storage medium and electronic equipment
CN111898735A (en) Distillation learning method, distillation learning device, computer equipment and storage medium
JP2015036939A (en) Feature extraction program and information processing apparatus
CN111553838A (en) Model parameter updating method, device, equipment and storage medium
CN111612004A (en) Image clipping method and device based on semantic content
CN113641797A (en) Data processing method, device, equipment, storage medium and computer program product
CN116778148A (en) Target detection method, target detection device, electronic equipment and storage medium
CN111144407A (en) Target detection method, system, device and readable storage medium
CN112270384B (en) Loop detection method and device, electronic equipment and storage medium
CN113516697A (en) Image registration method and device, electronic equipment and computer-readable storage medium
CN110851629A (en) Image retrieval method
CN116049691A (en) Model conversion method, device, electronic equipment and storage medium
CN113052156B (en) Optical character recognition method, device, electronic equipment and storage medium
CN115618099A (en) Neural network architecture searching method and device and electronic equipment
CN113537398A (en) Color value evaluation model training method and component, and color value evaluation method and component
CN113822521A (en) Method and device for detecting quality of question library questions and storage medium
CN116630629B (en) Domain adaptation-based semantic segmentation method, device, equipment and storage medium
CN113610064B (en) Handwriting recognition method and device
CN113095128B (en) Semi-supervised time sequence behavior positioning method based on K furthest cross consistency regularization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination