CN114461839B - Multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment - Google Patents

Multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment

Info

Publication number
CN114461839B
CN114461839B
Authority
CN
China
Prior art keywords
picture
text
encoder
training
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210376939.2A
Other languages
Chinese (zh)
Other versions
CN114461839A (en)
Inventor
孟凡飞
李飞阳
薛娇
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co ltd
Priority to CN202210376939.2A
Publication of CN114461839A
Application granted
Publication of CN114461839B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Abstract

The application provides a multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment. The method includes the following steps: obtaining a picture feature encoder, where the picture feature encoder and a text encoder are jointly obtained through multi-modal pre-training; based on the picture feature encoder, obtaining the picture features of the picture to be retrieved and of the pictures in a picture database; based on these picture features, recalling picture data whose features are similar to those of the picture to be retrieved from the picture database as recalled picture data; and sorting the recalled picture data and returning the nearest-neighbor data as the retrieval result of the picture to be retrieved. Through multi-modal pre-training, picture feature extraction, similar picture recall, and similarity sorting, the method can efficiently and accurately retrieve a group of semantically and contextually similar pictures from massive picture data.

Description

Multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment
Technical Field
The application relates to the technical field of computer applications, and in particular to a multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment.
Background
As data accumulates, Chinese content communities hold massive image-text content: the volume of picture data reaches the hundreds of millions, and the content is rich in variety and semantically complex. Retrieving and matching semantically similar content has strong business value in scenarios such as search, recommendation, and advertising, and image-text data with similar semantic content is typically aggregated using neural-network labeling and content-representation methods. Because the volume of picture data is extremely large, how to retrieve a group of semantically similar pictures from massive picture data has become a complex and important problem.
Traditional similar picture retrieval methods include perceptual-hash retrieval, scale-invariant feature transform (SIFT) feature retrieval, picture label retrieval, and neural-network-based picture feature retrieval. Perceptual-hash-based retrieval performs poorly on semantically similar pictures; SIFT feature retrieval performs poorly on pictures lacking texture information; picture-label-based retrieval suffers from low text label accuracy and requires manual labeling, which incurs great manual expense; and in neural-network-based picture feature retrieval, the picture feature extractor obtained from a traditional classification task has poor semantic representation ability and lacks supervision from semantic information, while the traditional supervised classification task needs manually labeled data and cannot be applied to massive unlabeled data, so the retrieval effect and robustness for similar pictures are poor.
At present, multi-modal pre-training-based similar picture retrieval methods that use image-text information have also appeared one after another; for example, a dual network of a ViT model and a BERT model is adopted in the pre-training of the retrieval model to extract picture features and text features. However, in such methods the picture feature extraction model uses a ViT model trained on a self-supervised task, so the training process can be unstable with obvious loss jitter, and model training efficiency can drop significantly.
Disclosure of Invention
In view of this, embodiments of the present application aim to provide a multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment, which further improve the stability of multi-modal pre-training and the retrieval effect of the model through the design of multi-modal pre-training tasks such as cross-modal alignment of image-text data and staged training, and solve the problem of similar picture retrieval over massive image-text data.
In a first aspect, an embodiment of the present application provides a multi-modal pre-training-based similar picture retrieval method, where the method includes: obtaining a picture feature encoder, where the picture feature encoder and a text encoder are jointly obtained through multi-modal pre-training; the multi-modal pre-training includes a first stage of pre-training the picture feature encoder and the text encoder based on a gradient-updated Query model and a momentum-updated Key model, and a second stage of pre-training the picture feature encoder and the text encoder based on the gradient-updated Query model, where the Patch Projection layer of the picture feature encoder is fixed during the first-stage training; based on the picture feature encoder, obtaining the picture features of the picture to be retrieved and of the pictures in a picture database; based on these picture features, recalling picture data whose features are similar to those of the picture to be retrieved from the picture database as recalled picture data; and sorting the recalled picture data and returning the nearest-neighbor data as the retrieval result of the picture to be retrieved.
Optionally, before the multi-modal pre-training, the method further includes: acquiring pictures and their corresponding text information, and constructing picture-text pairs as a training data set; and constructing a multi-modal pre-training model, where the model adopts a double-tower mode and includes a picture feature encoder on the picture side and a text feature encoder on the text side, the picture feature encoder adopts a ViT model to extract picture features, and the text feature encoder adopts the pre-trained language model BERT to extract text features.
Optionally, the first-stage training of pre-training the picture feature encoder and the text encoder based on the gradient-updated Query model and the momentum-updated Key model includes: acquiring a batch of picture-text pairs from the training data set, adding the pictures of the picture-text pairs to a picture sample queue, and adding the texts of the picture-text pairs to a text sample queue, where the picture sample queue and the text sample queue have fixed lengths, and whenever a new batch of data enters a sample queue the oldest batch of data is dequeued; inputting the picture of a given picture-text pair into the picture feature encoder Query to obtain picture features; inputting the pictures of the picture sample queue into the picture feature encoder Query to obtain the picture features of the picture sample queue, matching the picture features against the picture features of the picture sample queue, and calculating the loss function of the first single-mode contrast learning; inputting the texts of the text sample queue into the text feature encoder Key to obtain the text features of the text sample queue, matching the picture features against the text features of the text sample queue, and calculating the loss function of the first cross-modal contrast learning; calculating a first total loss function and updating the parameters of the picture feature encoder Query by gradient descent, where the first total loss function is the sum of the first single-mode contrast learning loss function and the first cross-modal contrast learning loss function; and updating the parameters of the picture feature encoder Key through momentum based on the updated parameters of the picture feature encoder Query. And,
inputting the text of a given picture-text pair into the text feature encoder Query to obtain text features; inputting the texts of the text sample queue into the text feature encoder Query to obtain the text features of the text sample queue, matching the text features against the text features of the text sample queue, and calculating the loss function of the second single-mode contrast learning; inputting the pictures of the picture sample queue into the picture feature encoder Key to obtain the picture features of the picture sample queue, matching the text features against the picture features of the picture sample queue, and calculating the loss function of the second cross-modal contrast learning; calculating a second total loss function and updating the parameters of the text feature encoder Query by gradient descent, where the second total loss function is the sum of the second single-mode contrast learning loss function and the second cross-modal contrast learning loss function; and updating the parameters of the text feature encoder Key through momentum based on the updated parameters of the text feature encoder Query.
Optionally, the second stage training of pre-training the picture feature encoder and the text encoder by the gradient update-based Query model includes: acquiring a certain batch of picture-text pairs of the training data set; inputting the picture in the picture-text pair into a picture characteristic encoder Query to obtain picture characteristics; inputting a text in the picture-text pair into a text feature encoder Query to obtain a text feature; and matching the picture characteristics with the text characteristics, calculating a third loss function, and updating parameters of a picture characteristic encoder Query and a text characteristic encoder Query by adopting a gradient descent method.
Optionally, the recalling, from the picture database, of picture data whose features are similar to those of the picture to be retrieved as the recalled picture data, based on the picture to be retrieved and the picture features of the pictures in the picture database, includes: based on the picture features in the picture database, performing dimension reduction, compression, and quantization with a PCA dimension-reduction algorithm and a PQ compression algorithm, and assigning each vector a new index to obtain an index database; performing the same PCA dimension reduction and PQ compression on the features of the picture to be retrieved to obtain the feature index of the picture to be retrieved; based on the feature index of the picture to be retrieved, performing inverted sorting on the index database with the IVF (Inverted File) technique to obtain a first sorting table; and calculating the distance between the feature index to be retrieved and the top-ranked index vectors of the first sorting table, and selecting the picture data with the closest distances as the recalled picture data.
Optionally, sorting the recalled picture data and returning the nearest-neighbor data includes: calculating the similarity between the picture to be retrieved and the recalled picture data; sorting in descending order of similarity to obtain a second sorting table; and returning the top-ranked data of the second sorting table as the nearest-neighbor data.
In a second aspect, an embodiment of the present application further provides a multi-modal pre-training-based similar picture retrieval device, where the device includes:
the model construction module is used for constructing a multi-modal pre-training model, the model adopts a double-tower mode and comprises a picture feature encoder on the picture side and a text feature encoder on the text side, the picture feature encoder adopts a ViT model to extract picture features, and the text feature encoder adopts the language pre-training model BERT to extract text features;
the data acquisition module is used for acquiring the pictures and the corresponding text information thereof and constructing picture-text pairs as a training data set;
the model training module is used for performing multi-modal pre-training on the basis of the training data set to obtain the picture feature encoder and the text encoder; the model training module comprises a first training module and a second training module, the first training module is used for pre-training the picture feature encoder and the text encoder based on a gradient-updated Query model and a momentum-updated Key model, and the second training module is used for pre-training the picture feature encoder and the text encoder based on the gradient-updated Query model; wherein a Patch Projection layer of the picture feature encoder is fixed in the first training module;
the feature extraction module is used for acquiring the picture features of the picture to be retrieved and of the pictures in the picture database based on the trained picture feature encoder;
the recall module is used for recalling picture data whose features are similar to those of the picture to be retrieved as recalled picture data, based on the picture to be retrieved and the picture features of the pictures in the picture database;
and the sorting module is used for sorting the recalled picture data and returning the nearest neighbor data as the retrieval result of the picture to be retrieved.
In a third aspect, an embodiment of the present application further provides electronic equipment including a memory and a processor, where the memory stores a computer program, and when executing the computer program the processor performs the steps of any implementation of the above multi-modal pre-training-based similar picture retrieval method.
In a fourth aspect, an embodiment of the present application further provides a readable storage medium storing a computer program that, when executed by a processor, performs the steps of any implementation of the above multi-modal pre-training-based similar picture retrieval method.
In summary, the application provides a multi-modal pre-training-based similar picture retrieval method and device, and electronic equipment. Through the design of multi-modal pre-training tasks such as cross-modal alignment of image-text data and staged training, a strong correlation between visual information and semantic information is achieved, so that the picture feature extractor has a richer semantic representation space and the retrieval effect and robustness are improved. In the first-stage training, by fixing the Patch Projection layer of the picture feature encoder and performing single-mode contrast learning and cross-modal contrast learning simultaneously, the stability of the training process and the retrieval effect of the model are further improved. Similar pictures are recalled using PCA dimension reduction, PQ compression, and IVF techniques, which greatly improves the speed and efficiency of similar picture retrieval. By re-ranking the recalled similar pictures by similarity, the precision loss caused by the recall process can be avoided, further improving the accuracy of the retrieval result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic flowchart of a similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application;
fig. 2a is a schematic diagram of the multi-modal pre-training process in the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application;
fig. 2b is a schematic diagram of the first-stage training process of the multi-modal pre-training model in the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application;
fig. 2c is a schematic diagram of the second-stage training process of the multi-modal pre-training model in the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application;
fig. 3 is a schematic flowchart illustrating a similar picture recalling process in a similar picture retrieving method based on multi-modal pre-training according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a similar picture retrieval apparatus based on multi-modal pre-training according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device for retrieving similar pictures based on multi-modal pre-training according to an embodiment of the present disclosure.
Icon: 400-a model training device; 410-a model building module; 420-a data acquisition module; 430-model training module; 440-feature extraction module; 450-recall module; 460-a sorting module; 500-model training electronics; 510-a processor; 520-a memory; 530-bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. In the description of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance. It should be apparent that the embodiments described below are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without any creative effort belong to the protection scope of the embodiments of the present application.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for retrieving similar pictures based on multi-modal pre-training according to an embodiment of the present application, including the following steps:
s11, obtaining a picture feature encoder, wherein the picture feature encoder and a text encoder are obtained through multi-mode pre-training.
Optionally, the multi-modal pre-training comprises a first stage of pre-training the picture feature encoder and the text encoder based on a gradient updated Query model and a momentum updated Key model, and a second stage of pre-training the picture feature encoder and the text encoder based on a gradient updated Query model; wherein a Patch Projection layer of the picture feature encoder is fixed in the first stage training.
The Query model trained in the first stage includes a picture feature encoder Query and a text feature encoder Query; the Key model includes a picture feature encoder Key built from the picture feature encoder Query and a text feature encoder Key built from the text feature encoder Query. The picture feature encoder Query and the picture feature encoder Key have the same initial parameters, and the text feature encoder Query and the text feature encoder Key have the same initial parameters. In the first-stage training, the parameters of the Query model are updated by gradient descent, and the parameters of the Key model are updated based on the gradient-updated parameters of the Query model.
In some embodiments, fixing the Patch Projection layer of the picture feature encoder in the first-stage training means performing block embedding with a fixed random Patch Projection layer instead of learning the Patch Projection layer of the ViT; this can greatly improve the stability of the training process of the ViT picture feature encoder and speed up model training.
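As an illustration, a minimal PyTorch sketch of this idea follows; the ViT module here is a stub and the attribute name `patch_proj` is an assumption for illustration, not the patent's code:

```python
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    """Stub of a ViT picture feature encoder (transformer blocks omitted)."""
    def __init__(self, patch_size: int = 16, dim: int = 768):
        super().__init__()
        # The "Patch Projection" layer: a strided convolution that embeds
        # each image patch into a dim-dimensional vector.
        self.patch_proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

def freeze_patch_projection(vit: ViTEncoder) -> None:
    # Keep the randomly initialised projection fixed: exclude it from
    # gradient updates for the entire first training stage.
    for p in vit.patch_proj.parameters():
        p.requires_grad = False

vit = ViTEncoder()
freeze_patch_projection(vit)
# Only the remaining (trainable) parameters are handed to the optimizer.
trainable = [p for p in vit.parameters() if p.requires_grad]
```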
Optionally, before the multi-modal pre-training, the method further includes: acquiring pictures and their corresponding text information, and constructing picture-text pairs as a training data set; and constructing a multi-modal pre-training model, where the model adopts a double-tower mode and includes a picture feature encoder on the picture side and a text feature encoder on the text side. The picture feature encoder adopts a Vision Transformer (ViT) model, which retains more spatial information, to extract picture features, and the text feature encoder adopts BERT (Bidirectional Encoder Representation from Transformers), currently the best-performing pre-trained language representation model.
It is worth noting that the picture side and the text side are completely independent and do not interfere with each other. The multi-modal pre-training model provided in this embodiment mainly focuses on training the picture side; the text side of the double-tower structure serves only as auxiliary training. In some embodiments, the text encoder on the text side uses Fastformer, a model with lower time complexity than the Transformer, to improve model training efficiency.
And S12, acquiring the picture features of the picture to be retrieved and of the pictures in the picture database based on the picture feature encoder.
The picture database refers to a database storing all pictures on a webpage or website; the picture to be retrieved can be a picture in the picture database or a picture outside it, such as a picture uploaded by a user through a webpage; and there can be one or more pictures to be retrieved.
In some implementations, before picture feature extraction with the picture feature encoder, the picture to be retrieved and the pictures in the picture database may be preprocessed. Preprocessing includes, but is not limited to, operations that unify picture formats, such as format conversion and cropping, and ensures that the preprocessed picture can be fed directly into the picture feature encoder for feature extraction.
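A minimal preprocessing sketch follows (torchvision; the 224x224 input size and the ImageNet normalization constants are assumptions, since the patent does not fix these values):

```python
from torchvision import transforms
from PIL import Image

# Unify format and size so any input picture becomes the tensor the
# picture feature encoder expects.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("query.jpg").convert("RGB")  # format conversion, e.g. PNG -> RGB
x = preprocess(img).unsqueeze(0)              # shape (1, 3, 224, 224)
```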
And S13, based on the picture to be retrieved and the picture features of the pictures in the picture database, recalling picture data whose features are similar to those of the picture to be retrieved from the picture database as recalled picture data.
Picture data similar in features to the picture to be retrieved may be several pictures with similar meanings, such as "an aircraft crosses the sky" and "an aircraft in the sky"; several pictures with similar content, such as pictures that all contain the Mona Lisa but differ in viewing angle or background; or several pictures of the same kind of object: if the picture to be retrieved shows an electric kettle of a certain brand, similar pictures may be kettles of different styles or even different brands with a similar appearance.
In some embodiments, a FAISS-based KNN algorithm may be used to recall picture data whose features are similar to those of the picture to be retrieved as the recalled picture data, which specifically includes: based on the picture features in the picture database, performing dimension reduction, compression, and quantization with the PCA (Principal Components Analysis) dimension-reduction algorithm and the PQ (Product Quantization) compression algorithm, and assigning each vector a new index to obtain an index database; performing the same PCA dimension reduction and PQ compression on the features of the picture to be retrieved to obtain the feature index of the picture to be retrieved; based on the feature index of the picture to be retrieved, performing inverted sorting on the index database with the IVF (Inverted File) technique to obtain a first sorting table; and calculating the distance between the feature index to be retrieved and the top-ranked index vectors of the first sorting table, and selecting the picture data with the closest distances as the recalled picture data.
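For illustration, this recall pipeline can be sketched with the FAISS library roughly as follows (all dimensions and hyper-parameters are assumed values, not fixed by the patent):

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

d_raw, d_pca = 768, 128          # encoder output dim -> PCA-reduced dim (assumed)
nlist, m, nbits = 1024, 16, 8    # IVF cells; PQ sub-vectors and bits per code

xb = np.random.rand(100_000, d_raw).astype("float32")  # database picture features
xq = np.random.rand(1, d_raw).astype("float32")        # query picture feature

pca = faiss.PCAMatrix(d_raw, d_pca)                    # PCA dimension reduction
quantizer = faiss.IndexFlatL2(d_pca)                   # coarse IVF quantizer
ivfpq = faiss.IndexIVFPQ(quantizer, d_pca, nlist, m, nbits)  # PQ compression
index = faiss.IndexPreTransform(pca, ivfpq)            # PCA -> IVF-PQ pipeline

index.train(xb)              # learn the PCA transform and the PQ codebooks
index.add(xb)                # each vector gets a new index in the index database
ivfpq.nprobe = 16            # number of inverted-file cells visited per query
distances, ids = index.search(xq, 100)  # top-100 recall candidates
```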
In other embodiments, an ANN (Approximate Nearest Neighbor) algorithm based on LSH (Locality-Sensitive Hashing) may also be used, which includes: selecting an LSH function that satisfies the sensitivity condition; determining the number of hash tables, the number of hash functions per table, and the parameters of the LSH function; hashing each picture feature in the picture database into the corresponding bucket through the LSH function to form one or more hash tables; and passing the features of the picture to be retrieved through the LSH function to obtain the corresponding bucket number, then taking out the corresponding data as the recalled picture data.
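A corresponding sketch using FAISS's binary LSH index follows (the code size and candidate count are assumed values):

```python
import numpy as np
import faiss

d, nbits = 768, 256
xb = np.random.rand(100_000, d).astype("float32")  # database picture features
xq = np.random.rand(1, d).astype("float32")        # query picture feature

lsh = faiss.IndexLSH(d, nbits)  # random-projection LSH into nbits-bit codes
lsh.train(xb)
lsh.add(xb)
distances, ids = lsh.search(xq, 100)  # candidates from the matching buckets
```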
And S14, sorting the recalled picture data, and returning the nearest-neighbor data as the retrieval result of the picture to be retrieved.
Optionally, the similarity between the picture to be retrieved and the recalled picture data is calculated; the data is sorted in descending order of similarity to obtain a second sorting table; and the top-ranked data of the second sorting table is returned as the nearest-neighbor data.
In some embodiments, the similarity between the picture to be retrieved and the recalled picture data, for example the cosine similarity, is calculated and sorted in descending order, and the top-ranked data is returned as the retrieval result of the picture to be retrieved.
In other embodiments, the distance between the picture to be retrieved and the recalled picture data, such as the Euclidean, Manhattan, Mahalanobis, or Chebyshev distance, may also be calculated and sorted in ascending order, and the top-ranked data is then returned as the retrieval result of the picture to be retrieved.
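A minimal NumPy sketch of the re-ranking step follows (cosine similarity in descending order; function and variable names are illustrative assumptions):

```python
import numpy as np

def rerank(query: np.ndarray, candidates: np.ndarray, top_k: int = 10):
    """Exact cosine-similarity re-ranking of the recalled candidates."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q                       # cosine similarity per candidate
    order = np.argsort(-sims)[:top_k]  # descending similarity (second sorting table)
    return order, sims[order]

# `order` indexes into the recalled picture data; its top entries are the
# nearest neighbours returned as the retrieval result.
```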
According to the multi-modal pre-training-based similar picture retrieval method provided above, a strong correlation between visual information and semantic information is achieved through the design of multi-modal pre-training tasks such as cross-modal alignment of image-text data and staged training, so that the picture feature extractor has a richer semantic representation space and the retrieval effect and robustness are improved. In the first-stage training, by fixing the Patch Projection layer of the picture feature encoder and performing single-mode contrast learning and cross-modal contrast learning simultaneously, the stability of the training process and the retrieval effect of the model are further improved. Similar pictures are recalled using PCA dimension reduction, PQ compression, and IVF techniques, which greatly improves the efficiency of similar picture retrieval. By re-ranking the recalled similar pictures by similarity, the precision loss caused by the recall process can be avoided, further improving the accuracy of the retrieval result.
Referring to fig. 2a, fig. 2a is a schematic diagram of the multi-modal pre-training process of the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application, which includes the following steps:
s21, obtaining the picture and the corresponding text information thereof, and constructing a picture-text pair as a training data set.
Optionally, the picture and the text information corresponding to the picture may be directly obtained from a website having a large amount of image-text data, where the text information includes context information or title information, and may be in a long text form or a short text form.
In some embodiments, acquiring pictures and their corresponding text information and constructing picture-text pairs as the training data set includes: forming a picture-text pair from a picture and its corresponding context information; and forming a picture-text pair from a picture and its corresponding title information.
In other embodiments, preprocessing such as transformation, cropping, rotation, and blurring may be applied to the pictures in the training data set, and each preprocessed picture and its corresponding text information form a picture-text pair, further increasing the amount of training data.
It is worth noting that, during training, for a given picture-text pair, a large-capacity negative sample queue with consistent representations may be maintained for the picture information and the text information respectively, to further improve the generalization ability of the model. In some embodiments, a large number of negative samples may be constructed based on the MoCo (Momentum Contrast) framework.
Constructing picture-text pairs directly from websites with massive image-text data preserves the integrity of the information, losing as little information as possible; this way of constructing data is also cheap, and training data on the order of hundreds of millions of pairs can be constructed for actual model training.
And S22, constructing a multi-modal pre-training model, where the model adopts a double-tower mode and comprises a picture feature encoder on the picture side and a text feature encoder on the text side.
The picture feature encoder adopts a ViT model to extract picture features, which includes: first, the picture is divided into several patches, and each patch is mapped through an embedding layer, namely a linear projection layer; then a one-dimensional position embedding is added; finally, a classification token is prepended to the sequence, and the feature encoding of the picture is obtained through a multi-layer encoder structure, i.e., the picture features produced by the picture feature encoder.
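A minimal PyTorch sketch of this ViT front end follows (patch size, image size, and embedding dimension are typical assumed values):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the picture into patches, project them, add position
    embeddings, and prepend a classification token."""
    def __init__(self, img_size: int = 224, patch_size: int = 16, dim: int = 768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x).flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                # prepend classification token
        return x + self.pos_embed                     # add 1-D position embedding
        # the result is then fed through the multi-layer Transformer encoder
```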
The text feature encoder adopts the pre-trained language model BERT to extract text features, which includes: the input is tokenized into a token sequence, the embedding of the text is obtained, and the sequence is passed through Fastformer layers to obtain the feature encoding of the text, i.e., the text features produced by the text feature encoder.
And S23, based on the training data set, carrying out first-stage training on the model until the model initially converges.
Initial convergence of the model means that the loss function over the model parameters is close to a minimum; equivalently, the model parameters are initially stable and change little during training.
Referring to fig. 2b, fig. 2b is a schematic diagram of the first-stage training process of the multi-modal pre-training in the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application, including:
s231, a certain batch of picture-text pairs of the training data set are obtained, pictures in the picture-text pairs are added into a picture sample queue, and texts in the picture-text pairs are added into a text sample queue.
The picture sample queue and the text sample queue have fixed lengths: in each iteration, when a new batch of data enters a sample queue, the oldest batch of data is dequeued, so the queue keeps its fixed length. In actual training, the fixed length of the picture and text sample queues is a hyper-parameter and can be set manually from empirical values.
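A minimal sketch of such a fixed-length queue follows (MoCo-style; the queue length 65536 is an assumed hyper-parameter):

```python
import torch
import torch.nn.functional as F

class FeatureQueue:
    """Fixed-length queue of encoded samples used as negatives."""
    def __init__(self, dim: int, length: int = 65536):
        self.feats = F.normalize(torch.randn(length, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, batch_feats: torch.Tensor) -> None:
        # The newest batch overwrites the oldest entries, so the queue
        # always holds exactly `length` samples.
        n = batch_feats.size(0)
        idx = (self.ptr + torch.arange(n)) % self.feats.size(0)
        self.feats[idx] = batch_feats
        self.ptr = int((self.ptr + n) % self.feats.size(0))
```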
And S232, calculating a loss function of the first single-mode contrast learning.
Optionally, the pictures in a certain picture-text pair are input into the picture feature encoder Query to obtain picture features, the pictures in the picture sample queue are input into the picture feature encoder Query to obtain the picture features in the picture sample queue, the picture features and the picture features in the picture sample queue are matched, and a loss function of the first single-mode contrast learning is calculated.
S233, calculating a loss function of the first cross-modal contrast learning.
Optionally, the text in the text sample queue is input into the text feature encoder Key to obtain the text features of the text sample queue, and the image features and the text features of the text sample queue are matched to calculate a loss function of the first cross-modal contrast learning.
And S234, updating parameters of the picture feature encoder Query and the picture feature encoder Key.
Optionally, a first total loss function is calculated, the parameters of the picture feature encoder Query are updated by gradient descent, and the parameters of the picture feature encoder Key are updated through momentum based on the updated parameters of the picture feature encoder Query, where the first total loss function is the sum of the first single-mode contrast learning loss function and the first cross-modal contrast learning loss function.
optionally, the parameters of the picture feature encoder are recorded asθ Q I The parameters of the secondary model of the picture feature encoder are recorded asθ K I Then, the formula by the gradient update and the momentum update is expressed as,
Figure M_220411141850885_885326001
Figure M_220411141850979_979074001
wherein the content of the first and second substances,ϒit is indicated that the learning rate is,
Figure M_220411141851042_042573001
representing the loss function of the picture feature encoder,mrepresenting a momentum parameter, typically the optimal momentum update factor is 0.99.
And S235, calculating a loss function of the second single-mode contrast learning.
Optionally, a text in a certain picture-text pair is input into a text feature encoder Query to obtain a text feature, the text in the text sample queue is input into the text feature encoder Query to obtain a text feature in the text sample queue, the text feature and the text feature in the text sample queue are matched, and a loss function of second single-mode contrast learning is calculated.
And S236, calculating a loss function of the second cross-modal contrast learning.
Optionally, the pictures in the picture sample queue are input into the picture feature encoder Key to obtain picture features of the picture sample queue, the text features are matched with the picture features of the picture sample queue, and a second cross-modal contrast learning loss function is calculated.
And S237, updating parameters of the text feature encoder Query and the text feature encoder Key.
Optionally, a second total loss function is calculated, the parameters of the text feature encoder Query are updated by gradient descent, and the parameters of the text feature encoder Key are updated through momentum based on the updated parameters of the text feature encoder Query, where the second total loss function is the sum of the second single-mode contrast learning loss function and the second cross-modal contrast learning loss function.
Optionally, the parameters of the text feature encoder Query are denoted $\theta_Q^T$ and the parameters of its momentum model, the text feature encoder Key, are denoted $\theta_K^T$. The gradient update and the momentum update are then expressed as

$$\theta_Q^T \leftarrow \theta_Q^T - \gamma \nabla_{\theta_Q^T} \mathcal{L}_T$$

$$\theta_K^T \leftarrow m\,\theta_K^T + (1 - m)\,\theta_Q^T$$

where $\gamma$ denotes the learning rate, $\mathcal{L}_T$ denotes the loss function of the text feature encoder, and $m$ denotes the momentum parameter; the optimal momentum update factor is typically 0.99.
It should be noted that steps S232-S234 and steps S235-S237 are two training processes carried out simultaneously on the picture side and the text side. During the first-stage training, fixing the Patch Projection layer of the ViT picture feature encoder on the picture side improves the stability of the training process and speeds up model training.
In some embodiments, the first and second single-mode contrast learning loss functions may be triplet loss functions, and the first and second cross-modal contrast learning loss functions may be focal loss functions.
The triplet loss function learns finer-grained features during training by pulling the anchor closer to the positive sample and pushing it away from the negative sample. Its mathematical expression is

$$\mathcal{L}_{triplet} = \max\big(d(a, p) - d(a, n) + \mathrm{margin},\ 0\big)$$

where $a$ denotes the input (anchor) sample, $p$ denotes the positive sample corresponding to $a$, $n$ denotes the negative sample corresponding to $a$, $d$ denotes the distance between two samples, and margin denotes the threshold of the triplet loss function, which can generally be set according to model training.
The focal loss function modifies the binary cross-entropy function by adding a class weight $\alpha_t$ and a sample-difficulty modulating factor $(1 - p_t)^{\gamma}$, which alleviates problems such as class imbalance and the imbalance between easy and hard samples and improves model accuracy. Its mathematical expression is

$$FL(p_t) = -\alpha_t\,(1 - p_t)^{\gamma} \log(p_t)$$

$$p_t = \begin{cases} p_i, & y_i = 1 \\ 1 - p_i, & y_i = 0 \end{cases}$$

where $y_i$ denotes the label of sample $i$ (1 for the positive class, 0 for the negative class) and $p_i$ denotes the probability that sample $i$ is predicted to be positive. In the context of this application, $y_i$ indicates whether the picture-text pair matches (1 for a match, 0 for a mismatch), and $p_i$ denotes the probability that the picture-text pair is a positive sample.
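As an illustration, minimal PyTorch sketches of both loss functions follow (the margin, alpha, and gamma values are illustrative assumptions; `nn.TripletMarginLoss` is a standard library implementation of the formula above):

```python
import torch
import torch.nn as nn

# Triplet loss: max(d(a, p) - d(a, n) + margin, 0)
triplet_loss = nn.TripletMarginLoss(margin=0.2)

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss; targets is 1 if the picture-text pair matches, else 0."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)            # p_t as defined above
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha)) # class weight
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t + 1e-8)).mean()
```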
And S24, training the model in the second stage based on the training data set until the model is completely converged.
Complete convergence of the model means that the loss function over the model parameters is minimized; equivalently, the model parameters are completely stable and change only slightly during training.
Referring to fig. 2c, fig. 2c is a schematic diagram of the second-stage training process of the multi-modal pre-training in the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application, including:
s241, acquiring a certain batch of picture-text pairs of the training data set;
s242, inputting the picture in the picture-text pair into a picture characteristic encoder Query to obtain picture characteristics;
s243, inputting the text in the picture-text pair into a text feature encoder Query to obtain text features;
and S244, matching the picture characteristics with the text characteristics, calculating a third loss function, and updating parameters of a picture characteristic encoder Query and a text characteristic encoder Query by adopting a gradient descent method.
Optionally, the third loss function is the binary cross-entropy, whose mathematical expression is

$$\mathcal{L}_{CE} = -\big[y_i \log p_i + (1 - y_i) \log(1 - p_i)\big]$$
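A one-function sketch of this matching loss in PyTorch follows (the logit/label names are assumptions):

```python
import torch
import torch.nn.functional as F

def matching_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between the predicted match score of a
    picture-text pair and its label (1 = true pair, 0 = mismatched pair)."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```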
in some embodiments, in the pre-training processes of steps S23 and S24, a commonly used LAMB (Layer-wide Adaptive motion optimizer for locking training) optimizer may also be used as the model parameter optimizer, so as to better maintain the parameter precision when the parameters of the picture feature encoder and the text feature encoder are updated by using the gradient descent method, thereby improving the model effect.
In the multi-modal pre-training process of the similar picture retrieval method based on multi-modal pre-training, the first training stage maintains both a gradient-updated model and a momentum-updated model for the picture encoder and the text encoder, and performs single-mode contrast learning and cross-modal contrast learning over image-text data while fixing the Patch Projection layer of the picture feature encoder, so that the picture feature extractor gains a richer semantic representation space and retrieval robustness improves. In the second training stage, only one gradient-updated model is maintained for the picture encoder and the text encoder, and cross-modal alignment is performed over image-text data, further improving the retrieval effect of the model. A LAMB parameter optimizer is also used during training to further improve the model.
Referring to fig. 3, fig. 3 is a schematic flowchart of recalling similar pictures with the FAISS-based KNN algorithm in the similar picture retrieval method based on multi-modal pre-training according to an embodiment of the present application, which includes the following steps:
and S31, based on the picture characteristics in the picture database, adopting a PCA dimension reduction algorithm to reduce the dimension.
Optionally, each picture feature in the picture database is high-dimensional data, and the dimensionality reduction by PCA may be to multiply each picture feature by a transformation matrix to obtain a vector in a low-dimensional space, where the transformation matrix is closely related to the picture database and may be obtained by data training, so as to ensure that information loss in the entire dimensionality reduction process is minimum.
And S32, compressing and quantizing with a PQ compression algorithm, and assigning a new index to each vector to obtain an index database.
Optionally, the PQ compression algorithm is an encoding method that can be understood as decomposing the original vector space into the Cartesian product of n low-dimensional vector spaces and quantizing each of the low-dimensional spaces obtained by the decomposition separately:
S321, decompose the original vector into n groups.
Optionally, the original vector is the low-dimensional vector obtained by PCA dimension reduction of each picture feature in the picture database, and n must divide the dimension of the original vector exactly; for example, when the original vector is 128-dimensional, n must be a divisor of 128: n may be 4, but not 6.
S322, perform a clustering operation on each of the n groups of sub-vectors to obtain n*k cluster centers.
Optionally, k is the number of cluster centers per group, and the clustering operation may be K-means clustering.
S323, for the n sub-vectors of each original vector, perform an Assign operation to determine the nearest cluster center; each original vector can then be represented as a vector of the n cluster-center IDs corresponding to its sub-vectors.
In some embodiments, when the picture database is too large, the overhead of storage resources may be greatly reduced and the efficiency of similarity retrieval may be improved through the operations of steps S321-S323.
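For illustration, a toy sketch of steps S321-S323 with scikit-learn K-means follows (the values of n and k are assumptions; a production system would use FAISS's built-in PQ instead):

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_encode(vectors: np.ndarray, n: int = 4, k: int = 256):
    """Encode each vector as n cluster-center IDs (one per sub-vector)."""
    d = vectors.shape[1]
    assert d % n == 0, "n must divide the vector dimension exactly"
    sub_dim = d // n
    codebooks, codes = [], []
    for i in range(n):                                  # S321: split into n groups
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        km = KMeans(n_clusters=k, n_init=4).fit(sub)    # S322: n*k cluster centers
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_)                        # S323: nearest-center IDs
    # each original vector is now n cluster-center IDs (its PQ code)
    return np.stack(codes, axis=1).astype(np.uint8), codebooks
```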
And S33, performing the same PCA dimension reduction and PQ compression on the features of the picture to be retrieved to obtain the feature index of the picture to be retrieved.
And S34, performing inverted sorting on the index database with the IVF technique to obtain a first sorting table.
Optionally, K-means clustering is performed directly on all the original vectors to obtain K cluster centers; the distance between each query vector and the K cluster centers is calculated, and the centers are inversely sorted by distance to obtain the first sorting table.
And S35, calculating the distance between the feature index to be retrieved and the top-ranked index vectors of the first sorting table, and selecting the picture data with the closest distances as the recalled picture data.
Optionally, the inverted sorting operation reduces the number of vectors for which distances must be calculated by several orders of magnitude, further speeding up vector retrieval.
In the above FAISS-based KNN similar picture recall method, a full IndexIVFPQ index is built through PCA dimension reduction, PQ compression, and the IVF technique. Compared with traditional hash-based retrieval methods, FAISS focuses on compressing the original vectors, provides efficient similarity search and clustering for dense vectors, supports billion-scale vector retrieval, and can perform distributed large-scale dense vector retrieval quickly and accurately with low memory consumption.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a similar picture retrieval apparatus based on multi-modal pre-training according to an embodiment of the present application, where the model training apparatus 400 includes:
the model building module 410 is configured to obtain a picture and text information corresponding to the picture, and build a picture-text pair as a training data set;
the data acquisition module 420 is configured to construct a multi-modal pre-training model, where the model adopts a double-tower mode and includes a picture side and a text side, the picture side extracts picture features by using a picture feature encoder, and the text side extracts text features by using a text feature encoder;
the model training module 430 is configured to perform multi-modal pre-training based on the training data set to obtain an image feature encoder and a text encoder; the model training module comprises a first training module and a second training module, the first training module is used for pre-training the picture feature encoder and the text encoder based on a gradient updated Query model and a momentum updated Key model, and the second training module is used for pre-training the picture feature encoder and the text encoder based on the gradient updated Query model; wherein a Patch Projection layer of the picture feature encoder is fixed in the first training module;
a feature extraction module 440, configured to perform preprocessing and feature extraction on the picture to be retrieved and the pictures in the picture database, respectively, based on the trained picture feature encoder, to obtain features of the picture to be retrieved and features of the pictures in the picture database;
a recall module 450, configured to recall, based on the features of the picture to be retrieved and the features of each picture in the picture database, picture data having features similar to those of the picture to be retrieved as recalled picture data by using a KNN algorithm based on the FAISS;
and the sorting module 460 is configured to sort the recalled picture data, and return the nearest neighbor data as the retrieval result of the picture to be retrieved.
For a detailed description of the similar picture retrieving apparatus based on multi-modal pre-training, please refer to the description of the related method steps in the above embodiments.
Referring to fig. 5, fig. 5 is a schematic structural diagram of electronic equipment according to an embodiment of the present disclosure. The electronic device 500 includes a memory 510 and a processor 520 connected by a bus 530; the memory 510 stores a computer program, and when the processor 520 reads and runs the computer program, the electronic device 500 can execute all or part of the flow of the method in the above embodiments, so as to realize similar picture retrieval based on the multi-modal pre-training model.
It should be understood that the electronic device may be a Personal Computer (PC), a tablet Computer, a smart phone, or other electronic device having a logic calculation function.
An embodiment of the present application further provides a readable storage medium, where a computer program is stored in the readable storage medium, and when the computer program is read and executed by a processor, the steps in the above similar picture retrieval method based on multi-modal pre-training are executed.
The above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A similar picture retrieval method based on multi-modal pre-training is characterized by comprising the following steps:
acquiring pictures and corresponding text information thereof, and constructing picture-text pairs as training data sets;
constructing a multi-modal pre-training model, wherein the model adopts a double-tower mode and comprises a picture feature encoder at a picture side and a text feature encoder at a text side;
acquiring a picture feature encoder, wherein the picture feature encoder and a text encoder are jointly obtained through multi-modal pre-training, and the text side is used only as auxiliary training; the multi-modal pre-training comprises a first stage of training of single-mode contrast learning and cross-modal contrast learning of the picture feature encoder and the text encoder based on a gradient-updated Query model and a momentum-updated Key model, and a second stage of training of cross-modal contrast learning of the picture feature encoder and the text encoder based on the gradient-updated Query model; wherein a Patch Projection layer of the picture feature encoder is fixed in the first-stage training;
acquiring the picture features of the picture to be retrieved and of the pictures in the picture database based on the picture feature encoder;
based on the picture to be retrieved and the picture features of the pictures in the picture database, recalling picture data similar in features to the picture to be retrieved from the picture database as recalled picture data;
and sorting the recalled picture data, and returning the nearest neighbor data as the retrieval result of the picture to be retrieved.
2. The method of claim 1, wherein the picture feature encoder uses a ViT model to extract picture features, and wherein the text feature encoder uses a language pre-training model BERT to extract text features.
3. The method of claim 2, wherein the gradient-update-based Query model and the momentum-update-based Key model perform a first-stage training of single-mode contrast learning and cross-modal contrast learning for the picture feature encoder and the text encoder, comprising:
acquiring a certain batch of picture-text pairs of the training data set, adding pictures in the picture-text pairs into a picture sample queue, and adding texts in the picture-text pairs into a text sample queue, wherein the picture sample queue and the text sample queue are of fixed lengths, and a new batch of data is kept to enter the sample queue while an old batch of data is dequeued;
inputting the picture in a certain picture-text pair into the picture feature encoder Query to obtain picture features;
inputting the pictures of the picture sample queue into the picture feature encoder Query to obtain the picture features of the picture sample queue, matching the picture features with the picture features of the picture sample queue, and calculating a loss function of first single-mode contrast learning;
inputting the text of the text sample queue into the text feature encoder Key to obtain the text features of the text sample queue, matching the picture features with the text features of the text sample queue, and calculating a first cross-modal contrast learning loss function;
calculating a first total loss function, and updating parameters of the picture feature encoder Query by adopting a gradient descent method, wherein the first total loss function is the sum of the first single-mode contrast learning loss function and the first cross-modal contrast learning loss function;
and updating the parameter of the picture feature encoder Key through momentum based on the updated parameter of the picture feature encoder Query.
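One MoCo-flavored reading of the claim-3 picture branch is sketched below: the Query encoder receives gradients, the Key encoder is updated by momentum, and fixed-length feature queues supply contrastive targets. The temperature, the momentum coefficient, the use of precomputed queue features, and the placement of the positives at the front of each queue are illustrative assumptions rather than the claimed procedure.

```python
import torch
import torch.nn.functional as F

def stage1_picture_step(pic_q, pic_k, batch_pics, pic_queue_feats,
                        txt_queue_feats, optimizer,
                        temperature=0.07, momentum=0.999):
    """One first-stage step for the picture branch.

    pic_q           -- gradient-updated picture Query encoder
    pic_k           -- momentum-updated picture Key encoder
    pic_queue_feats -- (Q, D) detached features of the picture sample queue
    txt_queue_feats -- (Q, D) detached features of the text sample queue
    """
    q = F.normalize(pic_q(batch_pics), dim=-1)           # (B, D) picture features

    # Single-modal contrast: picture query vs. the picture sample queue.
    logits_pp = q @ pic_queue_feats.t() / temperature    # (B, Q)
    # Cross-modal contrast: picture query vs. the text sample queue.
    logits_pt = q @ txt_queue_feats.t() / temperature    # (B, Q)

    # Assumes the current batch was enqueued at the front of both queues,
    # so the positive for sample i sits at queue position i (illustrative).
    labels = torch.arange(q.size(0), device=q.device)
    loss = F.cross_entropy(logits_pp, labels) + F.cross_entropy(logits_pt, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Momentum update: Key parameters drift slowly toward Query parameters.
    with torch.no_grad():
        for p_k, p_q in zip(pic_k.parameters(), pic_q.parameters()):
            p_k.mul_(momentum).add_(p_q, alpha=1.0 - momentum)
    return loss.item()
```

The text-side branch of the next claim mirrors this step with the roles of picture and text swapped, updating the text feature encoder Key from the text feature encoder Query.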
4. The method of claim 3, wherein the first-stage training of single-modal contrastive learning and cross-modal contrastive learning of the picture feature encoder and the text encoder, based on the gradient-updated Query model and the momentum-updated Key model, further comprises:
inputting the text of a given picture-text pair into the text feature encoder Query to obtain a text feature;
inputting the texts of the text sample queue into the text feature encoder Query to obtain the text features of the text sample queue, matching the text feature against the text features of the text sample queue, and computing a second single-modal contrastive learning loss function;
inputting the pictures of the picture sample queue into the picture feature encoder Key to obtain the picture features of the picture sample queue, matching the text feature against the picture features of the picture sample queue, and computing a second cross-modal contrastive learning loss function;
computing a second total loss function and updating the parameters of the text feature encoder Query by gradient descent, wherein the second total loss function is the sum of the second single-modal contrastive learning loss function and the second cross-modal contrastive learning loss function;
and updating the parameters of the text feature encoder Key by momentum, based on the updated parameters of the text feature encoder Query.
5. The method of claim 2, wherein the second-stage training of cross-modal contrastive learning of the picture feature encoder and the text encoder, based on the gradient-updated Query model, comprises:
acquiring a batch of picture-text pairs from the training data set;
inputting the picture of a picture-text pair into the picture feature encoder Query to obtain a picture feature;
inputting the text of the picture-text pair into the text feature encoder Query to obtain a text feature;
and matching the picture feature against the text feature, computing a third loss function, and updating the parameters of the picture feature encoder Query and the text feature encoder Query by gradient descent.
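A compact sketch of the claim-5 second-stage step, read in the style of CLIP-like symmetric contrastive training where both Query towers receive gradients; the in-batch negatives and the temperature value are assumptions of the sketch, not claim language.

```python
import torch
import torch.nn.functional as F

def stage2_step(pic_q, txt_q, batch_pics, batch_txts, optimizer,
                temperature=0.07):
    """Second-stage cross-modal step: both Query encoders are gradient-updated."""
    img = F.normalize(pic_q(batch_pics), dim=-1)     # (B, D) picture features
    txt = F.normalize(txt_q(batch_txts), dim=-1)     # (B, D) text features

    logits = img @ txt.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)

    # Symmetric InfoNCE over picture->text and text->picture directions.
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```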
6. The method of claim 1, wherein the recalling, from the picture database, of picture data whose features are similar to those of the picture to be retrieved, as recalled picture data, based on the picture features of the picture to be retrieved and of the pictures in the picture database, comprises:
applying a PCA dimension-reduction algorithm and a PQ compression algorithm to the picture features in the picture database to reduce, compress, and quantize them, and assigning a new index to each vector to obtain an index database;
applying the same PCA dimension reduction and PQ compression to the features of the picture to be retrieved to obtain a feature index of the picture to be retrieved;
based on the feature index of the picture to be retrieved, performing inverted-file (IVF) retrieval over the index database to obtain a first ranking list;
and computing the distances between the feature index to be retrieved and the top-ranked index vectors of the first ranking list, and selecting the picture data with the smallest distances as the recalled picture data.
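The recall path of claim 6 maps naturally onto a Faiss index that chains PCA, IVF inverted lists, and PQ compression. The sketch below assumes 768-dimensional raw features and illustrative factory parameters (PCA to 128 dimensions, 1024 inverted lists, 16-byte PQ codes), with random data standing in for real picture features.

```python
import faiss
import numpy as np

d = 768                                              # raw feature dimension (assumed)
xb = np.random.rand(100_000, d).astype("float32")    # database picture features (stand-in)
xq = np.random.rand(1, d).astype("float32")          # feature of the picture to be retrieved

# PCA dimension reduction + IVF inverted lists + PQ compression in one index.
index = faiss.index_factory(d, "PCA128,IVF1024,PQ16")
index.train(xb)          # learns the PCA map, coarse centroids, and PQ codebooks
index.add(xb)            # each vector is assigned a new sequential index id

# The query passes through the same PCA/PQ transform; probe a few inverted lists.
faiss.ParameterSpace().set_index_parameter(index, "nprobe", 16)
distances, ids = index.search(xq, 100)   # closest candidates become the recalled data
```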
7. The method of claim 1, wherein the ranking of the recalled picture data and the returning of nearest-neighbor data comprises:
computing the similarity between the picture to be retrieved and the recalled picture data;
sorting in order of decreasing similarity to obtain a second ranking list;
and returning the top-ranked data of the second ranking list as the nearest-neighbor data.
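The re-ranking stage of claim 7 can be as simple as an exact similarity pass over the recalled candidates; the sketch assumes L2-normalized features, so the dot product equals the cosine similarity.

```python
import numpy as np

def rerank(query_feat, recalled_feats, recalled_ids, top_k=10):
    """Sort recalled candidates by decreasing exact similarity (the second
    ranking list) and return the top-ranked ids as the nearest-neighbor data."""
    sims = recalled_feats @ query_feat     # (N,) cosine scores, features normalized
    order = np.argsort(-sims)[:top_k]      # descending similarity
    return recalled_ids[order], sims[order]
```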
8. A similar picture retrieval device based on multi-modal pre-training, characterized by comprising:
a model construction module, configured to construct a multi-modal pre-training model, wherein the model adopts a two-tower architecture comprising a picture feature encoder on the picture side and a text feature encoder on the text side, the picture feature encoder uses a ViT model to extract picture features, and the text feature encoder uses the pre-trained language model BERT to extract text features;
a data acquisition module, configured to acquire pictures and their corresponding text information and construct picture-text pairs as a training data set;
a model training module, configured to perform multi-modal pre-training based on the training data set to obtain the picture feature encoder and a text encoder, wherein the text side serves only as auxiliary training; the model training module comprises a first training module and a second training module, the first training module being configured to perform single-modal contrastive learning and cross-modal contrastive learning on the picture feature encoder and the text encoder based on a gradient-updated Query model and a momentum-updated Key model, and the second training module being configured to perform cross-modal contrastive learning on the picture feature encoder and the text encoder based on the gradient-updated Query model; wherein the Patch Projection layer of the picture feature encoder is frozen in the first training module;
a feature extraction module, configured to obtain, based on the trained picture feature encoder, the picture features of the picture to be retrieved and of the pictures in a picture database;
a recall module, configured to recall picture data whose features are similar to those of the picture to be retrieved, as recalled picture data, based on the picture features of the picture to be retrieved and of the pictures in the picture database;
and a ranking module, configured to rank the recalled picture data and return nearest-neighbor data as the retrieval result for the picture to be retrieved.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the method for retrieving similar pictures based on multi-modal pre-training as claimed in any one of claims 1 to 7.
10. A readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the method for retrieving similar pictures based on multi-modal pre-training according to any of claims 1 to 7.
CN202210376939.2A 2022-04-12 2022-04-12 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment Active CN114461839B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210376939.2A CN114461839B (en) 2022-04-12 2022-04-12 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114461839A (en) 2022-05-10
CN114461839B (en) 2023-02-07

Family

ID=81417166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210376939.2A Active CN114461839B (en) 2022-04-12 2022-04-12 Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114461839B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937615B (en) * 2023-02-20 2023-05-16 智者四海(北京)技术有限公司 Topic label classification method and device based on multi-mode pre-training model
CN116662599A (en) * 2023-07-28 2023-08-29 知呱呱(天津)大数据技术有限公司 Multimode trademark retrieval method and system based on contrast learning algorithm
CN116680422A (en) * 2023-07-31 2023-09-01 山东山大鸥玛软件股份有限公司 Multi-mode question bank resource duplicate checking method, system, device and storage medium
CN117725076B (en) * 2024-02-01 2024-04-09 厦门她趣信息技术有限公司 Faiss-based distributed massive similarity vector increment training system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283551A (en) * 2021-07-22 2021-08-20 智者四海(北京)技术有限公司 Training method and training device of multi-mode pre-training model and electronic equipment
CN113590850A (en) * 2021-01-29 2021-11-02 腾讯科技(深圳)有限公司 Multimedia data searching method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11651037B2 (en) * 2019-12-20 2023-05-16 Rakuten Group, Inc. Efficient cross-modal retrieval via deep binary hashing and quantization
US11915129B2 (en) * 2020-04-29 2024-02-27 International Business Machines Corporation Method and system for table retrieval using multimodal deep co-learning with helper query-dependent and query-independent relevance labels
CN111581510B (en) * 2020-05-07 2024-02-09 腾讯科技(深圳)有限公司 Shared content processing method, device, computer equipment and storage medium
CN112001180A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-mode pre-training model acquisition method and device, electronic equipment and storage medium
CN112015923A (en) * 2020-09-04 2020-12-01 平安科技(深圳)有限公司 Multi-mode data retrieval method, system, terminal and storage medium

Similar Documents

Publication Publication Date Title
CN114461839B (en) Multi-mode pre-training-based similar picture retrieval method and device and electronic equipment
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
Wang et al. Annosearch: Image auto-annotation by search
CN106202256B (en) Web image retrieval method based on semantic propagation and mixed multi-instance learning
Liu et al. Collaborative hashing
US8594468B2 (en) Statistical approach to large-scale image annotation
US8254699B1 (en) Automatic large scale video object recognition
CN106033426B (en) Image retrieval method based on latent semantic minimum hash
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN110083729B (en) Image searching method and system
CN110990596A (en) Multi-mode hash retrieval method and system based on self-adaptive quantization
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN114048354B (en) Test question retrieval method, device and medium based on multi-element characterization and metric learning
Lu et al. Inferring user image-search goals under the implicit guidance of users
CN116362221A (en) Aviation document keyword similarity judging method integrating multi-mode semantic association patterns
CN116756363A (en) Strong-correlation non-supervision cross-modal retrieval method guided by information quantity
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
CN113094538A (en) Image retrieval method, device and computer-readable storage medium
CN113297485B (en) Method for generating cross-modal representation vector and cross-modal recommendation method
CN115410185A (en) Method for extracting specific name and unit name attributes in multi-modal data
CN113886615A (en) Hand-drawn image real-time retrieval method based on multi-granularity association learning
Zhong et al. Deep convolutional hamming ranking network for large scale image retrieval
CN113190706A (en) Twin network image retrieval method based on second-order attention mechanism
Ji et al. Diversifying the image relevance reranking with absorbing random walks
Du et al. A Low Overhead Progressive Transmission for Visual Descriptor Based on Image Saliency.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant