CN117390213A - Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval - Google Patents
- Publication number
- CN117390213A (application number CN202311395517.0A)
- Authority
- CN
- China
- Prior art keywords
- sample
- image
- text
- negative
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/538—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Library & Information Science (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a training method for an OSCAR-based image-text retrieval model and a method for realizing image-text retrieval. The training method comprises the following steps: acquiring a training set; inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks and performing feature extraction to obtain image feature representations and text feature representations; taking each sample in the training set as an anchor sample, and generating a plurality of negative samples of different difficulties corresponding to the anchor sample based on the image feature representations and the text feature representations; calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair; and calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain the trained OSCAR image-text retrieval model. This scheme improves the generalization capability of the model as well as the accuracy and efficiency of its image-text retrieval.
Description
Technical Field
The invention relates to the technical field of information retrieval, and in particular to a training method for an OSCAR-based image-text retrieval model and a method for realizing image-text retrieval.
Background
The purpose of image-text retrieval is to associate a given picture with its corresponding textual description, thereby achieving matching between the image and the text. Image-text retrieval plays a key role in a number of important cross-modal tasks, such as semantic image retrieval, image captioning and visual question answering. However, image-text matching faces some important challenges, chiefly the heterogeneity gap, which refers to the inconsistency between the feature representations of image and text data from different modalities, and the semantic gap, which refers to the misalignment that occurs when capturing cross-modal correspondences between images and text.
Currently, many studies extract image and text features with pre-trained modules such as convolutional neural networks and recurrent neural networks in order to bridge the heterogeneity gap. However, the feature extractors in these pre-trained modules have not undergone dedicated image-text training or fine-tuning on the data, so they cannot achieve high-quality image or text embeddings. Another common approach to image-text matching uses a triplet loss to encourage the model to give positive image-text pairs a higher similarity score than negative image-text pairs. However, existing loss functions do not adequately account for hard negative samples, which is one of the main reasons for inaccurate image-text matching. Some studies have shown that increasing the batch size to obtain more negative samples causes a dramatic increase in computational complexity, while the returns in performance improvement gradually diminish.
Currently, among models for visual-language tasks, OSCAR is very powerful: pre-trained on millions of image-text pairs, it jointly processes images and text to obtain meaningful feature representations, captures the intricate associations between text and images, and learns more discriminative image-text embeddings. The OSCAR model has good learning and understanding capabilities for image and text feature representations, but its generalization ability remains relatively weak.
Therefore, the present invention constructs a new image-text retrieval model based on the OSCAR model, so as to improve the generalization capability of the model and improve the accuracy and efficiency of its image-text retrieval.
Disclosure of Invention
The invention aims to provide a training method of an OSCAR-based image-text retrieval model and a method for realizing image-text retrieval, which can improve the generalization capability of the model and improve the accuracy and efficiency of image-text retrieval of the model.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, the present invention provides a training method for an OSCAR-based image-text retrieval model, the method comprising:
acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to generate an image feature representation and a text feature representation;
taking each sample in the training set as an anchor sample, and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
Further, the generating, with each sample in the training set as an anchor sample, a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image feature representation and the text feature representation includes:
selecting a sample as the anchor point sample q, wherein the sample is an image sample or a text sample;
based on the anchor sample $q$, performing global semantic clustering on each sample in the training set to obtain a negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, wherein $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a negative sample set of $N$ semantically similar negative samples, $x_{ij}$ represents the $j$-th negative sample in the negative sample set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$;
and calculating the similarity between each negative sample and the anchor point sample q and the corresponding weight based on a kernel function, and carrying out weighted average to obtain a plurality of negative samples with different difficulties.
Further, the calculating, based on a kernel function, of the similarity between each negative sample and the anchor sample $q$ and the corresponding weight, and the performing of a weighted average to obtain a plurality of negative samples of different difficulties, comprises:

calculating the similarity between each negative sample and the anchor sample based on the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

wherein $k$ represents the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\sigma$ is a width parameter, and $\lVert \cdot \rVert$ denotes a norm distance;

calculating the weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample according to the following formula:

$$J(W) = \min \lVert X - W \rVert$$

wherein $J(W)$ is the cost function representing the error in the least square method, and $W$ is the weight matrix to be optimized; $X$ represents the input data matrix, each row representing a negative sample and each column a feature; $W_n$ is a weight value in the weight matrix $W$; $\lVert \cdot \rVert$ denotes the error norm;

obtaining the generated negative sample through a weighted average:

$$\hat{x} = \frac{\sum_{n=1}^{N} W_n x_{in}}{\sum_{n=1}^{N} W_n}$$

wherein $\hat{x}$ represents the generated negative sample corresponding to the anchor sample.
Further, the loss function may be expressed, for example, in the following form:

$$\mathcal{L} = -\log\frac{\exp\left(s_{vc+}/\tau\right)}{\sum_{s \in S_{vc}} \exp\left(s/\tau\right)} - \log\frac{\exp\left(s_{cv+}/\tau\right)}{\sum_{s \in S_{cv}} \exp\left(s/\tau\right)} + \hat{P}_v + \hat{P}_c$$

wherein $v$ represents an image feature representation and $c$ represents a text feature representation; $s_{vc+}$ represents the positive similarity when the anchor sample is an image sample, and $s_{cv+}$ represents the positive similarity when the anchor sample is a text sample; $S_{vc}$ represents the set of positive and negative similarities when the anchor sample is an image sample, and $S_{cv}$ represents the set of positive and negative similarities when the anchor sample is a text sample; $\hat{P}_v$ and $\hat{P}_c$ represent penalty terms; $\tau$ is a hyperparameter; and $\lvert \cdot \rvert$ denotes the set size.
Further, the inputting of the plurality of image-text sample pairs in the training set into the pre-training model OSCAR for feature extraction to generate an image feature representation and a text feature representation comprises:
acquiring an image sample in the training set, extracting regional visual characteristics and regional position characteristics of the image sample, and linearly combining the regional visual characteristics and the regional position characteristics to obtain image embedding; the image sample includes n object regions;
obtaining the text samples in the training set, dividing each text sample into a plurality of tokens by a tokenization technique, and obtaining the text embedding corresponding to each token based on the OSCAR-base model;
and based on the image embedding and the text embedding, generating a joint feature representation using an attention mechanism, and generating the image feature representation and the text feature representation by average pooling.
In a second aspect, the present invention also provides a method for realizing image-text retrieval using an OSCAR image-text retrieval model, wherein the OSCAR image-text retrieval model is obtained by training with the training method according to any one of claims 1 to 5, the method comprising:
acquiring a target text and a target image to be retrieved;
extracting features of the target text based on a text encoder in the image-text retrieval model to obtain text feature representation;
extracting features of the target image based on an image encoder in the image-text retrieval model to obtain image feature representation;
and determining an image retrieval result of the target text in the target image based on the text feature representation and the image feature representation, and/or determining a text retrieval result of the target image in the target text.
In a third aspect, the present invention further provides an OSCAR-based image-text retrieval model training device, where the device includes:
the data acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
the feature extraction module is used for inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to generate an image feature representation and a text feature representation;
the negative sample synthesis module is used for taking each sample in the training set as an anchor sample and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
the similarity calculation module is used for calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and the contrast loss calculation module is used for calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
In a fourth aspect, the present invention also provides a computer device comprising a processor and a memory; the memory stores at least one instruction for execution by the processor to implement a method as described in any of the above.
In a fifth aspect, the invention also provides a computer readable storage medium storing at least one instruction for execution by a processor to implement a method as described in any one of the above.
The invention has the beneficial effects that: in the training method for the OSCAR-based image-text retrieval model provided by the invention, the visual-language pre-training model OSCAR is used to extract features from image samples and text samples; challenging negative samples are generated by a negative sample synthesis module, increasing the matching difficulty between images and texts; a loss function is designed using the positive similarity between the image and the text in each positive sample pair and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair; and the target OSCAR model is obtained by training with this brand-new loss function. This improves the generalization capability of the image-text retrieval model and, in turn, the efficiency and accuracy of its image-text retrieval.
The foregoing description is only an overview of the present invention, and is intended to provide a better understanding of the present invention, as it is embodied in the following description, with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
Fig. 1 is a schematic flow chart of an OSCAR-based image-text retrieval model training method provided by an embodiment of the present invention;
fig. 2 is a schematic flow chart of a method for implementing image-text retrieval according to an embodiment of the present invention;
FIG. 3 is a block diagram of an OSCAR-based image-text retrieval model training device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is apparent that the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without making inventive effort shall fall within the protection scope of the present invention.
In addition, the term "and/or" is merely an association relationship describing an association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
The embodiment of the application provides an OSCAR-based image-text retrieval model training method, and an execution subject of the training method includes, but is not limited to, one of a server, a terminal and the like which can be configured to execute the method provided by the embodiment of the application.
Referring to fig. 1, a flow chart of an OSCAR-based image-text retrieval model training method according to an embodiment of the present invention is shown. In this embodiment, the training method includes:
step S101, a training set is acquired, the training set including a plurality of image-text sample pairs.
In the embodiment of the invention, the data set can be acquired from a designated open-source corpus, and a large number of image-text pairs can be collected from designated websites using a Python script with data-crawling capability.
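As a non-limiting illustration of this data-collection step, the following Python sketch downloads image-caption pairs with such a script; the endpoint URL and the `image_url`/`caption` JSON field names are hypothetical placeholders, not part of the invention.

```python
# Hypothetical sketch of collecting image-text pairs; the endpoint and the
# JSON field names ("image_url", "caption") are illustrative placeholders.
import os
import requests

def fetch_image_text_pairs(api_url, out_dir, limit=1000):
    os.makedirs(out_dir, exist_ok=True)
    records = requests.get(api_url, params={"limit": limit}, timeout=30).json()
    pairs = []
    for i, rec in enumerate(records):
        image_path = os.path.join(out_dir, f"{i}.jpg")
        with open(image_path, "wb") as f:
            f.write(requests.get(rec["image_url"], timeout=30).content)
        pairs.append((image_path, rec["caption"]))  # one image-text sample pair
    return pairs
```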
Step S102, inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to obtain an image feature representation and a text feature representation.
It will be appreciated that the pre-trained OSCAR visual-language model has been pre-trained on millions of image-text pairs, enabling joint processing of images and text to obtain meaningful feature representations, capturing the intricate associations between text and images, and learning more discriminative image-text embeddings. That is, the OSCAR model has good learning and understanding capabilities for image and text feature representations, enabling it to extract richer feature information from images and text.
Specifically, the step of generating an image feature representation and a text feature representation based on a pre-trained OSCAR model comprises:
1) And acquiring an image sample in the training set, extracting regional visual features and regional position features of the image sample, and linearly combining the regional visual features and the regional position features to obtain the image embedding.
Wherein the image sample is divided into n object regions.
In one example, the image embedding corresponding to each image sample can be obtained by extracting the region visual features and the region position features of the image using a Faster R-CNN model pre-trained on the Visual Genome data set, and linearly combining the region visual features and the region position features through a linear projection.
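A minimal sketch of this linear combination is given below; the feature dimensions and the 6-dimensional box-position encoding are illustrative assumptions, and the region features are assumed to have already been produced by the pre-trained detector.

```python
# Sketch of the image-embedding step; assumes region visual features and
# box-position features were already extracted by a pre-trained Faster R-CNN.
import torch
import torch.nn as nn

class ImageEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, pos_dim=6, embed_dim=768):
        super().__init__()
        # Linear projection combining region visual and position features
        self.proj = nn.Linear(visual_dim + pos_dim, embed_dim)

    def forward(self, region_feats, region_pos):
        # region_feats: (n, visual_dim) visual features of the n object regions
        # region_pos:   (n, pos_dim) normalized box coordinates/size features
        return self.proj(torch.cat([region_feats, region_pos], dim=-1))
```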
2) Acquiring the text samples in the training set, dividing each text sample into a plurality of tokens by a tokenization technique, and acquiring the text embedding corresponding to each token based on the OSCAR-base model.
In the embodiment of the invention, a given text sample $c$ is first divided into $z$ tokens by a tokenization technique, namely $c = \{o_1, o_2, \ldots, o_z\}$, and the text embedding corresponding to each token is acquired through the OSCAR-base model:

$$E_{tok_i} = \text{OSCAR-base}(o_i)$$

wherein $o_i$ represents the $i$-th token of the text sample. Thus, the text embedding corresponding to the text sample is represented as $E_c = \{E_{tok_1}, E_{tok_2}, \ldots, E_{tok_z}\}$.
3) Based on image embedding and text embedding, a joint feature representation is generated by adopting an attention mechanism, and an image feature representation and a text feature representation are generated through average pooling.
In this embodiment, the acquired image embedding and text embedding are fed into the single Transformer model in the OSCAR visual-language model to obtain a joint feature representation of the image and text; then the local features of the image and the local features of the text are mapped into lower-dimensional global features by average pooling, which retains the average information of the features, so as to generate the image feature representation and the text feature representation. The Transformer model captures complex relationships between image and text elements through the attention mechanism, and obtains the joint feature representation of an image-text pair based on the interrelationship of the image and the text.
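A minimal sketch of this step follows; the two-layer encoder merely stands in for the OSCAR Transformer backbone, and its hyperparameters are illustrative assumptions rather than the pre-trained model itself.

```python
# Sketch: a Transformer encoder produces joint features over the concatenated
# image and text embeddings; average pooling then yields the global image
# feature representation v and text feature representation c.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,  # stand-in for the OSCAR backbone, not the real model
)

def global_features(image_emb, text_emb):
    # image_emb: (n, 768) region embeddings; text_emb: (z, 768) token embeddings
    joint = encoder(torch.cat([image_emb, text_emb], dim=0).unsqueeze(0))[0]
    n = image_emb.size(0)
    v = joint[:n].mean(dim=0)  # average pooling over image region features
    c = joint[n:].mean(dim=0)  # average pooling over text token features
    return v, c
```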
Step S103, taking each sample in the training set as an anchor sample, and generating a plurality of negative samples of different difficulties corresponding to the anchor sample based on the image feature representation and the text feature representation. The generated negative samples and the anchor sample form the generated negative sample pairs.
In the embodiment of the invention, considering that sample diversity affects the retrieval performance of the image-text retrieval model during training, a negative sample synthesis module is designed to generate negative samples of different difficulties, so that the model is trained with challenging negative samples and its generalization capability is improved.
The step of generating a plurality of negative samples with different difficulties corresponding to the anchor point samples comprises the following steps:
1) And selecting a sample as an anchor sample q, wherein the sample is an image sample or a text sample.
In the following embodiments, an anchor point sample q is taken as an image sample for specific explanation.
2) Based on the anchor sample $q$, global semantic clustering is performed on each sample in the training set to obtain a negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, wherein $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a negative sample set of $N$ semantically similar negative samples, $x_{ij}$ represents the $j$-th negative sample in the negative sample set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$.
Specifically, negative samples that do not match the anchor sample are selected within a mini-batch of the training set, and the k-means algorithm is run on these negative samples to semantically divide them into a plurality of different negative sample sets; these negative sample sets form the final negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, in which each element represents a set of semantically similar negative samples. The number of negative sample sets is determined by the parameter $k$, which is typically specified before executing the algorithm, as in the sketch below.
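A minimal sketch of this clustering step, assuming the in-batch negative features have been stacked into a NumPy array:

```python
# Sketch of global semantic clustering: in-batch negatives are partitioned
# into k semantically similar negative sample sets with k-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_negatives(neg_feats, k=4):
    # neg_feats: (num_negatives, dim) feature vectors of in-batch negatives
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(neg_feats)
    return [neg_feats[labels == i] for i in range(k)]  # cluster set G = {g_1..g_k}
```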
3) And calculating the similarity between each negative sample and the anchor point sample q and the corresponding weight based on the kernel function, and carrying out weighted average to obtain a plurality of negative samples with different difficulties.
In the embodiment of the invention, the kernel function is a Gaussian radial basis function. Specifically, the step of calculating the similarity and the corresponding weight between each negative sample and the anchor point sample q based on the kernel function, and performing weighted average to obtain a plurality of negative samples with different difficulties includes:
1) The similarity between each negative sample and the anchor sample is calculated based on the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

wherein $k$ represents the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\sigma$ is a width parameter, and $\lVert \cdot \rVert$ denotes a norm distance.

2) The weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample is calculated according to the following formula:

$$J(W) = \min \lVert X - W \rVert$$

wherein $J(W)$ is the cost function representing the error in the least square method, and $W$ is the weight matrix to be optimized; $X$ represents the input data matrix, each row representing a sample and each column a feature; $\lVert \cdot \rVert$ denotes a norm distance.

This embodiment optimizes the weight matrix by the least square method, whose aim is to adjust the weight matrix $W$ so that the error $J(W)$ is minimized.

3) The generated negative sample is obtained through a weighted average:

$$\hat{x} = \frac{\sum_{n=1}^{N} W_n x_{in}}{\sum_{n=1}^{N} W_n}$$

wherein $\hat{x}$ represents the generated negative sample corresponding to the anchor sample.
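The synthesis step can be sketched as follows; since the least-squares fit of the weight matrix is only specified symbolically above, this sketch substitutes the normalized RBF similarities for the fitted weights $W_n$, which is an assumption of the illustration.

```python
# Sketch of synthesizing a negative sample from one cluster g_i: Gaussian
# radial-basis similarities to the anchor are turned into weights, and the
# weighted average of the cluster members is the generated negative.
import numpy as np

def synthesize_negative(anchor, cluster, sigma=1.0):
    # anchor: (dim,) feature of anchor sample q; cluster: (N, dim) negatives
    sims = np.exp(-np.sum((cluster - anchor) ** 2, axis=1) / (2 * sigma**2))
    weights = sims / sims.sum()  # stand-in for the least-squares weights W_n
    return weights @ cluster     # weighted average = generated negative x_hat
```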
It will be appreciated that if the anchor sample is an image sample $v$, the generated negative sample is a text negative sample $\hat{c}$; if the anchor sample is a text sample $c$, the generated negative sample is an image negative sample $\hat{v}$.
Step S104, calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair.
A positive sample is matched with the anchor sample; the positive sample and the anchor sample form a positive sample pair, and the positive similarity between the image and the text in the positive sample pair is calculated. A negative sample is not matched with the anchor sample; the negative sample and the anchor sample form a negative sample pair, and the negative similarities between the image and the text in the negative sample pairs and the generated negative sample pairs are calculated. The positive similarity and the negative similarities together constitute the similarity set.
Step S105, calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain the trained OSCAR image-text retrieval model.
The embodiment of the invention provides a brand-new loss function based on InfoCMR for contrasting positive and negative samples from different sources; the loss function may be expressed, for example, as:

$$\mathcal{L} = -\log\frac{\exp\left(s_{vc+}/\tau\right)}{\sum_{s \in S_{vc}} \exp\left(s/\tau\right)} - \log\frac{\exp\left(s_{cv+}/\tau\right)}{\sum_{s \in S_{cv}} \exp\left(s/\tau\right)} + \hat{P}_v + \hat{P}_c$$

wherein $v$ represents an image feature representation and $c$ represents a text feature representation; $s_{vc+}$ represents the positive similarity when the anchor sample is an image sample, and $s_{cv+}$ represents the positive similarity when the anchor sample is a text sample; $S_{vc}$ represents the set of positive and negative similarities when the anchor sample is an image sample, and $S_{cv}$ represents the set of positive and negative similarities when the anchor sample is a text sample; $\hat{P}_v$ and $\hat{P}_c$ represent penalty terms; $\tau$ is a hyperparameter; and $\lvert \cdot \rvert$ denotes the set size.
The additional penalty terms are introduced in order to mitigate the risk of model overfitting. $Z$ Gaussian noise vectors are randomly sampled from a Gaussian distribution, each having the same dimension as the anchor sample's vector in the embedding space; these Gaussian noise vectors form high-confidence negative sample pairs with each sample in the batch to help smooth the representation space. It should be noted that these Gaussian noise vectors do not participate in forming positive sample pairs.
The loss function designed by the invention integrates the information among the positive samples, the negative samples and the generated negative samples, further reducing the heterogeneity gap in image-text matching. Meanwhile, the additional penalty terms added to the loss function form high-confidence negative sample pairs from randomly sampled Gaussian noise vectors, which reduces the risk of overfitting, helps smooth the representation space, and improves the generalization capability of the model.
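A minimal sketch of such a loss is given below. Since the patented loss is specified only symbolically, this InfoNCE-style version with cosine similarities and Gaussian-noise penalty negatives is an approximation, and the `tau` and `num_noise` values are illustrative assumptions.

```python
# Sketch of the contrastive loss: the positive pair is contrasted against
# real negatives, synthesized negatives, and random Gaussian-noise vectors
# that act as high-confidence penalty negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(v, c, text_negs, image_negs, tau=0.07, num_noise=8):
    # v, c: (dim,) matched global image / text features (the positive pair)
    # text_negs / image_negs: (K, dim) real plus synthesized negatives
    noise = F.normalize(torch.randn(num_noise, v.numel()), dim=-1)

    def one_side(anchor, positive, negatives):
        s_pos = F.cosine_similarity(anchor, positive, dim=0) / tau
        s_neg = F.cosine_similarity(
            anchor.unsqueeze(0), torch.cat([negatives, noise]), dim=1) / tau
        # negative log-softmax of the positive over {positive} U negatives
        return -(s_pos - torch.logsumexp(torch.cat([s_pos.view(1), s_neg]), 0))

    return one_side(v, c, text_negs) + one_side(c, v, image_negs)
```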
According to the training method for the OSCAR-based image-text retrieval model provided above, the visual-language pre-training model OSCAR is used to extract features from image samples and text samples; challenging negative samples are generated by the negative sample synthesis module, increasing the matching difficulty between images and texts; a loss function is designed using the positive similarity between the image and the text in each positive sample pair and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair; and the target OSCAR model is obtained by training with this brand-new loss function, which improves the generalization capability of the image-text retrieval model and, in turn, the efficiency and accuracy of its image-text retrieval.
Referring to fig. 2, a flowchart of an image-text searching method implemented by using an image-text searching model trained by the above method according to an embodiment of the present invention is shown, where the method includes:
acquiring a target text and a target image to be retrieved;
extracting characteristics of a target text based on a text encoder in the image-text retrieval model to obtain text characteristic representation;
extracting features of the target image based on an image encoder in the image-text retrieval model to obtain image feature representation;
and determining an image retrieval result of the target text in the target image based on the text feature representation and the image feature representation, and/or determining a text retrieval result of the target image in the target text.
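A minimal sketch of these retrieval steps is shown below for the text-to-image direction; `text_encoder` and `image_encoder` stand for the fine-tuned OSCAR encoders and, along with the cosine-similarity ranking, are assumptions of this illustration.

```python
# Sketch of text-to-image retrieval: encode all candidate images once, then
# rank them by cosine similarity against the encoded query text.
import torch
import torch.nn.functional as F

def retrieve_images(query_text, images, text_encoder, image_encoder, top_k=5):
    with torch.no_grad():
        c = F.normalize(text_encoder(query_text), dim=-1)     # (dim,) text feature
        V = F.normalize(
            torch.stack([image_encoder(im) for im in images]), dim=-1)
        scores = V @ c                                        # (num_images,)
    return scores.topk(min(top_k, len(images))).indices.tolist()  # ranked indices
```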
By using the image-text retrieval method provided by the embodiment of the invention, the efficiency and the accuracy of image-text retrieval can be improved.
Referring to fig. 3, a structural block diagram of a training device for an OSCAR-based image-text retrieval model according to an embodiment of the present invention is provided, where the device includes:
a data acquisition module 310 for acquiring a training set comprising a plurality of image-text sample pairs;
the feature extraction module 320 is configured to input a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and perform feature extraction to generate an image feature representation and a text feature representation;
the negative sample synthesis module 330 is configured to generate a plurality of negative samples with different difficulties corresponding to the anchor samples based on the image feature representation and the text feature representation by using each sample in the training set as the anchor sample; the generated negative sample and the anchor point sample form a generated negative sample pair;
a similarity calculation module 340, configured to calculate the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and a contrast loss calculation module 350, configured to calculate a loss function based on the positive similarity and the negative similarity, and to fine-tune the pre-training model OSCAR through the loss function to obtain the trained OSCAR image-text retrieval model.
Referring to fig. 4, a schematic structural diagram of a computer device according to an embodiment of the present invention may include a processor 20, a memory 21, and a bus, and may further include a computer program stored in the memory 21 and executable on the processor 20.
The memory 21 includes at least one type of readable storage medium, which includes flash memory, a removable hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 21 may in some embodiments be an internal storage unit of a computer device, such as a removable hard disk of the computer device. The memory 21 may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on a computer device. Further, the memory 21 may also include both internal storage units and external storage devices of the computer device. The memory 21 may be used not only for storing application software installed in a computer device and various types of data, but also for temporarily storing data that has been output or is to be output.
The processor 20 may in some embodiments be comprised of integrated circuits, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functionality, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, a combination of various control chips, and the like. The processor 20 is a Control Unit (Control Unit) of the computer device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the computer device and processes data by running or executing programs or modules stored in the memory 21, and calling data stored in the memory 21.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 21 and the at least one processor 20 etc.
Fig. 4 shows only a computer device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 is not limiting of the computer device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
For example, although not shown, the computer device may also include a power source (such as a battery) for powering the various components, preferably the power source may be logically connected to the at least one processor 20 via a power management device, such that charge management, discharge management, and power consumption management functions are performed by the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The computer device may also include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described in detail herein.
Further, the computer device may also include a network interface, which may optionally include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the computer device and other computer devices.
The computer device may optionally further comprise a user interface, which may be a Display, an input unit such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the computer device and for displaying a visual user interface.
It should be understood that the above-described embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The modules/units integrated with the computer device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, implements the method described above.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names, not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.
Claims (9)
1. The training method of the image-text retrieval model based on the OSCAR is characterized by comprising the following steps of:
acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to obtain an image feature representation and a text feature representation;
taking each sample in the training set as an anchor sample, and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the visual-language pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
2. The training method of claim 1, wherein the generating a plurality of negative samples of different difficulty corresponding to the anchor samples based on the image feature representation and the text feature representation using each sample in the training set as an anchor sample comprises:
selecting a sample as the anchor point sample q, wherein the sample is an image sample or a text sample;
based on the anchor sample $q$, performing global semantic clustering on each sample in the training set to obtain a negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, wherein $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a negative sample set of $N$ semantically similar negative samples, $x_{ij}$ represents the $j$-th negative sample in the negative sample set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$;
and calculating the similarity between each negative sample and the anchor point sample q and the corresponding weight based on a kernel function, and carrying out weighted average to obtain a plurality of negative samples with different difficulties.
3. The training method according to claim 2, wherein the calculating, based on the kernel function, of the similarity between each negative sample and the anchor sample $q$ and the corresponding weight, and the performing of a weighted average to obtain a plurality of negative samples of different difficulties, comprises:

calculating the similarity between each negative sample and the anchor sample based on the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

wherein $k$ represents the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\sigma$ is a width parameter, and $\lVert \cdot \rVert$ denotes a norm distance;

calculating the weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample according to the following formula:

$$J(W) = \min \lVert X - W \rVert$$

wherein $J(W)$ is the cost function representing the error in the least square method, and $W$ is the weight matrix to be optimized; $X$ represents the input data matrix, each row representing a negative sample and each column a feature; $W_n$ is a weight value in the weight matrix $W$; $\lVert \cdot \rVert$ denotes the error norm;

and obtaining the generated negative sample through a weighted average:

$$\hat{x} = \frac{\sum_{n=1}^{N} W_n x_{in}}{\sum_{n=1}^{N} W_n}$$

wherein $\hat{x}$ represents the generated negative sample corresponding to the anchor sample.
4. The training method according to claim 1, wherein the loss function is expressed as:

$$\mathcal{L} = -\log\frac{\exp\left(s_{vc+}/\tau\right)}{\sum_{s \in S_{vc}} \exp\left(s/\tau\right)} - \log\frac{\exp\left(s_{cv+}/\tau\right)}{\sum_{s \in S_{cv}} \exp\left(s/\tau\right)} + \hat{P}_v + \hat{P}_c$$

wherein $v$ represents an image feature representation and $c$ represents a text feature representation; $s_{vc+}$ represents the positive similarity when the anchor sample is an image sample, and $s_{cv+}$ represents the positive similarity when the anchor sample is a text sample; $S_{vc}$ represents the set of positive and negative similarities when the anchor sample is an image sample, and $S_{cv}$ represents the set of positive and negative similarities when the anchor sample is a text sample; $\hat{P}_v$ and $\hat{P}_c$ represent penalty terms; $\tau$ is a hyperparameter; and $\lvert \cdot \rvert$ denotes the set size.
5. The training method of claim 1, wherein the inputting of the plurality of image-text sample pairs in the training set into the pre-training model OSCAR for feature extraction to generate an image feature representation and a text feature representation comprises:
acquiring an image sample in the training set, extracting regional visual characteristics and regional position characteristics of the image sample, and linearly combining the regional visual characteristics and the regional position characteristics to obtain image embedding; the image sample includes n object regions;
obtaining the text samples in the training set, dividing each text sample into a plurality of tokens by a tokenization technique, and obtaining the text embedding corresponding to each token based on the OSCAR-base model;
and based on the image embedding and the text embedding, generating a joint feature representation using an attention mechanism, and generating the image feature representation and the text feature representation by average pooling.
6. A method for realizing image-text retrieval using an OSCAR image-text retrieval model, the OSCAR image-text retrieval model being trained by the training method according to any one of claims 1 to 5, the method comprising:
acquiring a target text and a target image to be retrieved;
extracting features of the target text based on a text encoder in the image-text retrieval model to obtain text feature representation;
extracting features of the target image based on an image encoder in the image-text retrieval model to obtain image feature representation;
and determining an image retrieval result of the target text in the target image based on the text feature representation and the image feature representation, and/or determining a text retrieval result of the target image in the target text.
7. An OSCAR-based graphic retrieval model training device, the device comprising:
the data acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
the feature extraction module is used for inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to obtain an image feature representation and a text feature representation;
the negative sample synthesis module is used for taking each sample in the training set as an anchor sample and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
the similarity calculation module is used for calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and the contrast loss calculation module is used for calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
8. A computer device, the computer device comprising a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the method of any one of claims 1 to 6.
9. A computer readable storage medium storing at least one instruction for execution by a processor to implement the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311395517.0A CN117390213A (en) | 2023-10-26 | 2023-10-26 | Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311395517.0A CN117390213A (en) | 2023-10-26 | 2023-10-26 | Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117390213A true CN117390213A (en) | 2024-01-12 |
Family
ID=89435561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311395517.0A Pending CN117390213A (en) | 2023-10-26 | 2023-10-26 | Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117390213A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118211126A (en) * | 2024-05-22 | 2024-06-18 | 国网山东省电力公司蒙阴县供电公司 | Training method and device for photovoltaic power generation device fault prediction model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |