CN117390213A - Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval - Google Patents
- Publication number
- CN117390213A (application number CN202311395517.0A)
- Authority
- CN
- China
- Prior art keywords
- sample
- image
- text
- negative
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/538—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Library & Information Science (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a training method for an OSCAR-based image-text retrieval model and a method for realizing image-text retrieval. The training method comprises the following steps: acquiring a training set; inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks and performing feature extraction to obtain image feature representations and text feature representations; taking each sample in the training set as an anchor sample, and generating a plurality of negative samples of different difficulties corresponding to the anchor sample based on the image feature representations and the text feature representations; calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair; and calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain the trained OSCAR image-text retrieval model. This scheme improves the generalization capability of the model as well as the accuracy and efficiency of its image-text retrieval.
Description
Technical Field
The invention relates to the technical field of information retrieval, and in particular to a training method for an OSCAR-based image-text retrieval model and a method for realizing image-text retrieval.
Background
The purpose of image-text retrieval is to associate a given picture with its corresponding textual description, thereby achieving matching between the image and the text. Image-text retrieval plays a key role in a number of important cross-modal tasks, such as semantic image retrieval, image captioning and visual question answering. However, image-text matching faces some important challenges, chiefly the heterogeneity gap, which refers to the inconsistency between the feature representations of image and text data from different modalities, and the semantic gap, which refers to the misalignment that occurs when capturing cross-modal correspondences between images and text.
Currently, many studies extract image and text features with pre-trained modules such as convolutional neural networks and recurrent neural networks in order to bridge the heterogeneity gap. However, the feature extractors in these pre-trained modules have not undergone dedicated image-text training or fine-tuning on the data, so they cannot achieve high-quality image or text embeddings. Another common approach to image-text matching uses a triplet loss to encourage the model to give positive image-text pairs a higher similarity score than negative image-text pairs. However, existing loss functions do not adequately account for hard negative samples, which is one of the main reasons for inaccurate image-text matching. Some studies have shown that increasing the batch size to obtain more negative samples causes a dramatic increase in computational complexity, while the returns in performance improvement gradually diminish.
Currently, among models for visual-language tasks, OSCAR is very powerful: pre-trained on millions of image-text pairs, it jointly processes images and text to obtain meaningful feature representations, captures the intricate associations between text and images, and learns more discriminative image-text embeddings. The OSCAR model has good learning and understanding capabilities for image and text feature representations, but its generalization ability remains relatively weak.
Therefore, the present invention constructs a new image-text retrieval model based on the OSCAR model, so as to improve the generalization capability of the model and improve the accuracy and efficiency of its image-text retrieval.
Disclosure of Invention
The invention aims to provide a training method of an OSCAR-based image-text retrieval model and a method for realizing image-text retrieval, which can improve the generalization capability of the model and improve the accuracy and efficiency of image-text retrieval of the model.
In order to achieve the above purpose, the present invention provides the following technical solutions:
In a first aspect, the present invention provides a training method for an OSCAR-based image-text retrieval model, the method comprising:
acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to generate an image feature representation and a text feature representation;
taking each sample in the training set as an anchor sample, and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
Further, the generating, with each sample in the training set as an anchor sample, a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image feature representation and the text feature representation includes:
selecting a sample as the anchor point sample q, wherein the sample is an image sample or a text sample;
based on the anchor sample $q$, performing global semantic clustering on each sample in the training set to obtain a negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, wherein $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a negative sample set of $N$ semantically similar negative samples, $x_{ij}$ represents the $j$-th negative sample in the negative sample set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$;
and calculating the similarity between each negative sample and the anchor point sample q and the corresponding weight based on a kernel function, and carrying out weighted average to obtain a plurality of negative samples with different difficulties.
Further, the calculating, based on a kernel function, of the similarity between each negative sample and the anchor sample $q$ and the corresponding weight, and the performing of a weighted average to obtain a plurality of negative samples of different difficulties, comprises:

calculating the similarity between each negative sample and the anchor sample based on the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

wherein $k$ represents the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\sigma$ is a width parameter, and $\lVert \cdot \rVert$ denotes a norm distance;

calculating the weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample according to the following formula:

$$J(W) = \min \lVert X - W \rVert$$

wherein $J(W)$ is the cost function representing the error in the least square method, and $W$ is the weight matrix to be optimized; $X$ represents the input data matrix, each row representing a negative sample and each column a feature; $W_n$ is a weight value in the weight matrix $W$; $\lVert \cdot \rVert$ denotes the error norm;

obtaining the generated negative sample through a weighted average:

$$\hat{x} = \frac{\sum_{n=1}^{N} W_n x_{in}}{\sum_{n=1}^{N} W_n}$$

wherein $\hat{x}$ represents the generated negative sample corresponding to the anchor sample.
Further, the loss function may be expressed, for example, in the following form:

$$\mathcal{L} = -\log\frac{\exp\left(s_{vc+}/\tau\right)}{\sum_{s \in S_{vc}} \exp\left(s/\tau\right)} - \log\frac{\exp\left(s_{cv+}/\tau\right)}{\sum_{s \in S_{cv}} \exp\left(s/\tau\right)} + \hat{P}_v + \hat{P}_c$$

wherein $v$ represents an image feature representation and $c$ represents a text feature representation; $s_{vc+}$ represents the positive similarity when the anchor sample is an image sample, and $s_{cv+}$ represents the positive similarity when the anchor sample is a text sample; $S_{vc}$ represents the set of positive and negative similarities when the anchor sample is an image sample, and $S_{cv}$ represents the set of positive and negative similarities when the anchor sample is a text sample; $\hat{P}_v$ and $\hat{P}_c$ represent penalty terms; $\tau$ is a hyperparameter; and $\lvert \cdot \rvert$ denotes the set size.
Further, the inputting of the plurality of image-text sample pairs in the training set into the pre-training model OSCAR for feature extraction to generate an image feature representation and a text feature representation comprises:
acquiring an image sample in the training set, extracting regional visual characteristics and regional position characteristics of the image sample, and linearly combining the regional visual characteristics and the regional position characteristics to obtain image embedding; the image sample includes n object regions;
obtaining the text samples in the training set, dividing each text sample into a plurality of tokens by a tokenization technique, and obtaining the text embedding corresponding to each token based on the OSCAR-base model;
and based on the image embedding and the text embedding, generating a joint feature representation using an attention mechanism, and generating the image feature representation and the text feature representation by average pooling.
In a second aspect, the present invention also provides a method for realizing image-text retrieval using an OSCAR image-text retrieval model, wherein the OSCAR image-text retrieval model is obtained by training with the training method according to any one of claims 1 to 5, the method comprising:
acquiring a target text and a target image to be retrieved;
extracting features of the target text based on a text encoder in the image-text retrieval model to obtain text feature representation;
extracting features of the target image based on an image encoder in the image-text retrieval model to obtain image feature representation;
and determining an image retrieval result of the target text in the target image based on the text feature representation and the image feature representation, and/or determining a text retrieval result of the target image in the target text.
In a third aspect, the present invention further provides an OSCAR-based image-text retrieval model training device, where the device includes:
the data acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
the feature extraction module is used for inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to generate an image feature representation and a text feature representation;
the negative sample synthesis module is used for taking each sample in the training set as an anchor sample and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
the similarity calculation module is used for calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and the contrast loss calculation module is used for calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
In a fourth aspect, the present invention also provides a computer device comprising a processor and a memory; the memory stores at least one instruction for execution by the processor to implement a method as described in any of the above.
In a fifth aspect, the invention also provides a computer readable storage medium storing at least one instruction for execution by a processor to implement a method as described in any one of the above.
The invention has the beneficial effects that: in the training method for the OSCAR-based image-text retrieval model provided by the invention, the visual-language pre-training model OSCAR is used to extract features from image samples and text samples; challenging negative samples are generated by a negative sample synthesis module, increasing the matching difficulty between images and texts; a loss function is designed using the positive similarity between the image and the text in each positive sample pair and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair; and the target OSCAR model is obtained by training with this brand-new loss function. This improves the generalization capability of the image-text retrieval model and, in turn, the efficiency and accuracy of its image-text retrieval.
The foregoing description is only an overview of the present invention, and is intended to provide a better understanding of the present invention, as it is embodied in the following description, with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
Fig. 1 is a schematic flow chart of an OSCAR-based image-text retrieval model training method provided by an embodiment of the present invention;
fig. 2 is a schematic flow chart of a method for implementing image-text retrieval according to an embodiment of the present invention;
FIG. 3 is a block diagram of an OSCAR-based image-text retrieval model training device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. It is apparent that the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without making inventive effort shall fall within the protection scope of the present invention.
In addition, the term "and/or" is merely an association relationship describing an association object, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
The embodiment of the application provides an OSCAR-based image-text retrieval model training method, and an execution subject of the training method includes, but is not limited to, one of a server, a terminal and the like which can be configured to execute the method provided by the embodiment of the application.
Referring to fig. 1, a flow chart of an OSCAR-based image-text retrieval model training method according to an embodiment of the present invention is shown. In this embodiment, the training method includes:
step S101, a training set is acquired, the training set including a plurality of image-text sample pairs.
In the embodiment of the invention, the data set can be acquired from a designated open-source corpus, and a large number of image-text pairs can be collected from designated websites using a Python script with data-crawling capability.
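As a non-limiting illustration of this data-collection step, the following Python sketch downloads image-caption pairs with such a script; the endpoint URL and the `image_url`/`caption` JSON field names are hypothetical placeholders, not part of the invention.

```python
# Hypothetical sketch of collecting image-text pairs; the endpoint and the
# JSON field names ("image_url", "caption") are illustrative placeholders.
import os
import requests

def fetch_image_text_pairs(api_url, out_dir, limit=1000):
    os.makedirs(out_dir, exist_ok=True)
    records = requests.get(api_url, params={"limit": limit}, timeout=30).json()
    pairs = []
    for i, rec in enumerate(records):
        image_path = os.path.join(out_dir, f"{i}.jpg")
        with open(image_path, "wb") as f:
            f.write(requests.get(rec["image_url"], timeout=30).content)
        pairs.append((image_path, rec["caption"]))  # one image-text sample pair
    return pairs
```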
Step S102, inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to obtain an image feature representation and a text feature representation.
It will be appreciated that the pre-trained OSCAR visual-language model has been pre-trained on millions of image-text pairs, enabling joint processing of images and text to obtain meaningful feature representations, capturing the intricate associations between text and images, and learning more discriminative image-text embeddings. That is, the OSCAR model has good learning and understanding capabilities for image and text feature representations, enabling it to extract richer feature information from images and text.
Specifically, the step of generating an image feature representation and a text feature representation based on a pre-trained OSCAR model comprises:
1) And acquiring an image sample in the training set, extracting regional visual features and regional position features of the image sample, and linearly combining the regional visual features and the regional position features to obtain the image embedding.
Wherein the image sample is divided into n object regions.
In one example, the image embedding corresponding to each image sample can be obtained by extracting the region visual features and the region position features of the image using a Faster R-CNN model pre-trained on the Visual Genome data set, and linearly combining the region visual features and the region position features through a linear projection.
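A minimal sketch of this linear combination is given below; the feature dimensions and the 6-dimensional box-position encoding are illustrative assumptions, and the region features are assumed to have already been produced by the pre-trained detector.

```python
# Sketch of the image-embedding step; assumes region visual features and
# box-position features were already extracted by a pre-trained Faster R-CNN.
import torch
import torch.nn as nn

class ImageEmbedding(nn.Module):
    def __init__(self, visual_dim=2048, pos_dim=6, embed_dim=768):
        super().__init__()
        # Linear projection combining region visual and position features
        self.proj = nn.Linear(visual_dim + pos_dim, embed_dim)

    def forward(self, region_feats, region_pos):
        # region_feats: (n, visual_dim) visual features of the n object regions
        # region_pos:   (n, pos_dim) normalized box coordinates/size features
        return self.proj(torch.cat([region_feats, region_pos], dim=-1))
```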
2) Acquiring the text samples in the training set, dividing each text sample into a plurality of tokens by a tokenization technique, and acquiring the text embedding corresponding to each token based on the OSCAR-base model.
In the embodiment of the invention, a given text sample $c$ is first divided into $z$ tokens by a tokenization technique, namely $c = \{o_1, o_2, \ldots, o_z\}$, and the text embedding corresponding to each token is acquired through the OSCAR-base model:

$$E_{tok_i} = \text{OSCAR-base}(o_i)$$

wherein $o_i$ represents the $i$-th token of the text sample. Thus, the text embedding corresponding to the text sample is represented as $E_c = \{E_{tok_1}, E_{tok_2}, \ldots, E_{tok_z}\}$.
3) Based on image embedding and text embedding, a joint feature representation is generated by adopting an attention mechanism, and an image feature representation and a text feature representation are generated through average pooling.
In this embodiment, the acquired image embedding and text embedding are fed into the single Transformer model in the OSCAR visual-language model to obtain a joint feature representation of the image and text; then the local features of the image and the local features of the text are mapped into lower-dimensional global features by average pooling, which retains the average information of the features, so as to generate the image feature representation and the text feature representation. The Transformer model captures complex relationships between image and text elements through the attention mechanism, and obtains the joint feature representation of an image-text pair based on the interrelationship of the image and the text.
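A minimal sketch of this step follows; the two-layer encoder merely stands in for the OSCAR Transformer backbone, and its hyperparameters are illustrative assumptions rather than the pre-trained model itself.

```python
# Sketch: a Transformer encoder produces joint features over the concatenated
# image and text embeddings; average pooling then yields the global image
# feature representation v and text feature representation c.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,  # stand-in for the OSCAR backbone, not the real model
)

def global_features(image_emb, text_emb):
    # image_emb: (n, 768) region embeddings; text_emb: (z, 768) token embeddings
    joint = encoder(torch.cat([image_emb, text_emb], dim=0).unsqueeze(0))[0]
    n = image_emb.size(0)
    v = joint[:n].mean(dim=0)  # average pooling over image region features
    c = joint[n:].mean(dim=0)  # average pooling over text token features
    return v, c
```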
Step S103, taking each sample in the training set as an anchor sample, and generating a plurality of negative samples of different difficulties corresponding to the anchor sample based on the image feature representation and the text feature representation. The generated negative samples and the anchor sample form the generated negative sample pairs.
In the embodiment of the invention, considering that sample diversity affects the retrieval performance of the image-text retrieval model during training, a negative sample synthesis module is designed to generate negative samples of different difficulties, so that the model is trained with challenging negative samples and its generalization capability is improved.
The step of generating a plurality of negative samples with different difficulties corresponding to the anchor point samples comprises the following steps:
1) And selecting a sample as an anchor sample q, wherein the sample is an image sample or a text sample.
In the following embodiments, an anchor point sample q is taken as an image sample for specific explanation.
2) Based on the anchor sample $q$, global semantic clustering is performed on each sample in the training set to obtain a negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, wherein $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a negative sample set of $N$ semantically similar negative samples, $x_{ij}$ represents the $j$-th negative sample in the negative sample set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$.
Specifically, negative samples that do not match the anchor sample are selected within a mini-batch of the training set, and the k-means algorithm is run on these negative samples to semantically divide them into a plurality of different negative sample sets; these negative sample sets form the final negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, in which each element represents a set of semantically similar negative samples. The number of negative sample sets is determined by the parameter $k$, which is typically specified before executing the algorithm, as in the sketch below.
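A minimal sketch of this clustering step, assuming the in-batch negative features have been stacked into a NumPy array:

```python
# Sketch of global semantic clustering: in-batch negatives are partitioned
# into k semantically similar negative sample sets with k-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_negatives(neg_feats, k=4):
    # neg_feats: (num_negatives, dim) feature vectors of in-batch negatives
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(neg_feats)
    return [neg_feats[labels == i] for i in range(k)]  # cluster set G = {g_1..g_k}
```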
3) And calculating the similarity between each negative sample and the anchor point sample q and the corresponding weight based on the kernel function, and carrying out weighted average to obtain a plurality of negative samples with different difficulties.
In the embodiment of the invention, the kernel function is a Gaussian radial basis function. Specifically, the step of calculating the similarity and the corresponding weight between each negative sample and the anchor point sample q based on the kernel function, and performing weighted average to obtain a plurality of negative samples with different difficulties includes:
1) The similarity between each negative sample and the anchor sample is calculated based on the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

wherein $k$ represents the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\sigma$ is a width parameter, and $\lVert \cdot \rVert$ denotes a norm distance.

2) The weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample is calculated according to the following formula:

$$J(W) = \min \lVert X - W \rVert$$

wherein $J(W)$ is the cost function representing the error in the least square method, and $W$ is the weight matrix to be optimized; $X$ represents the input data matrix, each row representing a sample and each column a feature; $\lVert \cdot \rVert$ denotes a norm distance.

This embodiment optimizes the weight matrix by the least square method, whose aim is to adjust the weight matrix $W$ so that the error $J(W)$ is minimized.

3) The generated negative sample is obtained through a weighted average:

$$\hat{x} = \frac{\sum_{n=1}^{N} W_n x_{in}}{\sum_{n=1}^{N} W_n}$$

wherein $\hat{x}$ represents the generated negative sample corresponding to the anchor sample.
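The synthesis step can be sketched as follows; since the least-squares fit of the weight matrix is only specified symbolically above, this sketch substitutes the normalized RBF similarities for the fitted weights $W_n$, which is an assumption of the illustration.

```python
# Sketch of synthesizing a negative sample from one cluster g_i: Gaussian
# radial-basis similarities to the anchor are turned into weights, and the
# weighted average of the cluster members is the generated negative.
import numpy as np

def synthesize_negative(anchor, cluster, sigma=1.0):
    # anchor: (dim,) feature of anchor sample q; cluster: (N, dim) negatives
    sims = np.exp(-np.sum((cluster - anchor) ** 2, axis=1) / (2 * sigma**2))
    weights = sims / sims.sum()  # stand-in for the least-squares weights W_n
    return weights @ cluster     # weighted average = generated negative x_hat
```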
It will be appreciated that if the anchor sample is an image sample $v$, the generated negative sample is a text negative sample $\hat{c}$; if the anchor sample is a text sample $c$, the generated negative sample is an image negative sample $\hat{v}$.
Step S104, calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair.
A positive sample is matched with the anchor sample; the positive sample and the anchor sample form a positive sample pair, and the positive similarity between the image and the text in the positive sample pair is calculated. A negative sample is not matched with the anchor sample; the negative sample and the anchor sample form a negative sample pair, and the negative similarities between the image and the text in the negative sample pairs and the generated negative sample pairs are calculated. The positive similarity and the negative similarities together constitute the similarity set.
Step S105, calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain the trained OSCAR image-text retrieval model.
The embodiment of the invention provides a brand-new loss function based on InfoCMR for contrasting positive and negative samples from different sources; the loss function may be expressed, for example, as:

$$\mathcal{L} = -\log\frac{\exp\left(s_{vc+}/\tau\right)}{\sum_{s \in S_{vc}} \exp\left(s/\tau\right)} - \log\frac{\exp\left(s_{cv+}/\tau\right)}{\sum_{s \in S_{cv}} \exp\left(s/\tau\right)} + \hat{P}_v + \hat{P}_c$$

wherein $v$ represents an image feature representation and $c$ represents a text feature representation; $s_{vc+}$ represents the positive similarity when the anchor sample is an image sample, and $s_{cv+}$ represents the positive similarity when the anchor sample is a text sample; $S_{vc}$ represents the set of positive and negative similarities when the anchor sample is an image sample, and $S_{cv}$ represents the set of positive and negative similarities when the anchor sample is a text sample; $\hat{P}_v$ and $\hat{P}_c$ represent penalty terms; $\tau$ is a hyperparameter; and $\lvert \cdot \rvert$ denotes the set size.
The additional penalty terms are introduced in order to mitigate the risk of model overfitting. $Z$ Gaussian noise vectors are randomly sampled from a Gaussian distribution, each having the same dimension as the anchor sample's vector in the embedding space; these Gaussian noise vectors form high-confidence negative sample pairs with each sample in the batch to help smooth the representation space. It should be noted that these Gaussian noise vectors do not participate in forming positive sample pairs.
The loss function designed by the invention integrates the information among the positive samples, the negative samples and the generated negative samples, further reducing the heterogeneity gap in image-text matching. Meanwhile, the additional penalty terms added to the loss function form high-confidence negative sample pairs from randomly sampled Gaussian noise vectors, which reduces the risk of overfitting, helps smooth the representation space, and improves the generalization capability of the model.
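A minimal sketch of such a loss is given below. Since the patented loss is specified only symbolically, this InfoNCE-style version with cosine similarities and Gaussian-noise penalty negatives is an approximation, and the `tau` and `num_noise` values are illustrative assumptions.

```python
# Sketch of the contrastive loss: the positive pair is contrasted against
# real negatives, synthesized negatives, and random Gaussian-noise vectors
# that act as high-confidence penalty negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(v, c, text_negs, image_negs, tau=0.07, num_noise=8):
    # v, c: (dim,) matched global image / text features (the positive pair)
    # text_negs / image_negs: (K, dim) real plus synthesized negatives
    noise = F.normalize(torch.randn(num_noise, v.numel()), dim=-1)

    def one_side(anchor, positive, negatives):
        s_pos = F.cosine_similarity(anchor, positive, dim=0) / tau
        s_neg = F.cosine_similarity(
            anchor.unsqueeze(0), torch.cat([negatives, noise]), dim=1) / tau
        # negative log-softmax of the positive over {positive} U negatives
        return -(s_pos - torch.logsumexp(torch.cat([s_pos.view(1), s_neg]), 0))

    return one_side(v, c, text_negs) + one_side(c, v, image_negs)
```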
According to the training method for the OSCAR-based image-text retrieval model provided above, the visual-language pre-training model OSCAR is used to extract features from image samples and text samples; challenging negative samples are generated by the negative sample synthesis module, increasing the matching difficulty between images and texts; a loss function is designed using the positive similarity between the image and the text in each positive sample pair and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair; and the target OSCAR model is obtained by training with this brand-new loss function, which improves the generalization capability of the image-text retrieval model and, in turn, the efficiency and accuracy of its image-text retrieval.
Referring to fig. 2, a flowchart of an image-text searching method implemented by using an image-text searching model trained by the above method according to an embodiment of the present invention is shown, where the method includes:
acquiring a target text and a target image to be retrieved;
extracting characteristics of a target text based on a text encoder in the image-text retrieval model to obtain text characteristic representation;
extracting features of the target image based on an image encoder in the image-text retrieval model to obtain image feature representation;
and determining an image retrieval result of the target text in the target image based on the text feature representation and the image feature representation, and/or determining a text retrieval result of the target image in the target text.
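A minimal sketch of these retrieval steps is shown below for the text-to-image direction; `text_encoder` and `image_encoder` stand for the fine-tuned OSCAR encoders and, along with the cosine-similarity ranking, are assumptions of this illustration.

```python
# Sketch of text-to-image retrieval: encode all candidate images once, then
# rank them by cosine similarity against the encoded query text.
import torch
import torch.nn.functional as F

def retrieve_images(query_text, images, text_encoder, image_encoder, top_k=5):
    with torch.no_grad():
        c = F.normalize(text_encoder(query_text), dim=-1)     # (dim,) text feature
        V = F.normalize(
            torch.stack([image_encoder(im) for im in images]), dim=-1)
        scores = V @ c                                        # (num_images,)
    return scores.topk(min(top_k, len(images))).indices.tolist()  # ranked indices
```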
By using the image-text retrieval method provided by the embodiment of the invention, the efficiency and the accuracy of image-text retrieval can be improved.
Referring to fig. 3, a structural block diagram of a training device for an OSCAR-based image-text retrieval model according to an embodiment of the present invention is provided, where the device includes:
a data acquisition module 310 for acquiring a training set comprising a plurality of image-text sample pairs;
the feature extraction module 320 is configured to input a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and perform feature extraction to generate an image feature representation and a text feature representation;
the negative sample synthesis module 330 is configured to generate a plurality of negative samples with different difficulties corresponding to the anchor samples based on the image feature representation and the text feature representation by using each sample in the training set as the anchor sample; the generated negative sample and the anchor point sample form a generated negative sample pair;
a similarity calculation module 340, configured to calculate the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and a contrast loss calculation module 350, configured to calculate a loss function based on the positive similarity and the negative similarity, and to fine-tune the pre-training model OSCAR through the loss function to obtain the trained OSCAR image-text retrieval model.
Referring to fig. 4, a schematic structural diagram of a computer device according to an embodiment of the present invention may include a processor 20, a memory 21, and a bus, and may further include a computer program stored in the memory 21 and executable on the processor 20.
The memory 21 includes at least one type of readable storage medium, which includes flash memory, a removable hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 21 may in some embodiments be an internal storage unit of a computer device, such as a removable hard disk of the computer device. The memory 21 may in other embodiments also be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on a computer device. Further, the memory 21 may also include both internal storage units and external storage devices of the computer device. The memory 21 may be used not only for storing application software installed in a computer device and various types of data, but also for temporarily storing data that has been output or is to be output.
The processor 20 may in some embodiments be comprised of integrated circuits, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functionality, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, a combination of various control chips, and the like. The processor 20 is a Control Unit (Control Unit) of the computer device, connects various components of the entire electronic device using various interfaces and lines, and executes various functions of the computer device and processes data by running or executing programs or modules stored in the memory 21, and calling data stored in the memory 21.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 21 and the at least one processor 20 etc.
Fig. 4 shows only a computer device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 is not limiting of the computer device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
For example, although not shown, the computer device may also include a power source (such as a battery) for powering the various components, preferably the power source may be logically connected to the at least one processor 20 via a power management device, such that charge management, discharge management, and power consumption management functions are performed by the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The computer device may also include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described in detail herein.
Further, the computer device may also include a network interface, which may optionally include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the computer device and other computer devices.
The computer device may optionally further comprise a user interface, which may be a Display, an input unit such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the computer device and for displaying a visual user interface.
It should be understood that the above-described embodiments are for illustrative purposes only, and the scope of the patent application is not limited to this configuration.
The modules/units integrated with the computer device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. The computer readable storage medium may be volatile or nonvolatile. For example, the computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, implements the method described above.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names, not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.
Claims (9)
1. The training method of the image-text retrieval model based on the OSCAR is characterized by comprising the following steps of:
acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to obtain an image feature representation and a text feature representation;
taking each sample in the training set as an anchor sample, and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the visual-language pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
2. The training method of claim 1, wherein the generating a plurality of negative samples of different difficulty corresponding to the anchor samples based on the image feature representation and the text feature representation using each sample in the training set as an anchor sample comprises:
selecting a sample as the anchor point sample q, wherein the sample is an image sample or a text sample;
based on the anchor sample $q$, performing global semantic clustering on each sample in the training set to obtain a negative sample cluster set $G = \{g_1, g_2, \ldots, g_M\}$, wherein $g_i = \{x_{i1}, x_{i2}, \ldots, x_{iN}\}$ is a negative sample set of $N$ semantically similar negative samples, $x_{ij}$ represents the $j$-th negative sample in the negative sample set $g_i$, $i$ is any integer from 1 to $M$, and $j$ is any integer from 1 to $N$;
and calculating the similarity between each negative sample and the anchor point sample q and the corresponding weight based on a kernel function, and carrying out weighted average to obtain a plurality of negative samples with different difficulties.
3. The training method according to claim 2, wherein the calculating, based on the kernel function, of the similarity between each negative sample and the anchor sample $q$ and the corresponding weight, and the performing of a weighted average to obtain a plurality of negative samples of different difficulties, comprises:

calculating the similarity between each negative sample and the anchor sample based on the Gaussian radial basis function:

$$k(q, x_{in}) = \exp\left(-\frac{\lVert q - x_{in} \rVert^2}{2\sigma^2}\right)$$

wherein $k$ represents the similarity between the anchor sample $q$ and the negative sample $x_{in}$, $\sigma$ is a width parameter, and $\lVert \cdot \rVert$ denotes a norm distance;

calculating the weight $W_n$ corresponding to the similarity between each negative sample and the anchor sample according to the following formula:

$$J(W) = \min \lVert X - W \rVert$$

wherein $J(W)$ is the cost function representing the error in the least square method, and $W$ is the weight matrix to be optimized; $X$ represents the input data matrix, each row representing a negative sample and each column a feature; $W_n$ is a weight value in the weight matrix $W$; $\lVert \cdot \rVert$ denotes the error norm;

and obtaining the generated negative sample through a weighted average:

$$\hat{x} = \frac{\sum_{n=1}^{N} W_n x_{in}}{\sum_{n=1}^{N} W_n}$$

wherein $\hat{x}$ represents the generated negative sample corresponding to the anchor sample.
4. The training method according to claim 1, wherein the loss function is expressed as:

$$\mathcal{L} = -\log\frac{\exp\left(s_{vc+}/\tau\right)}{\sum_{s \in S_{vc}} \exp\left(s/\tau\right)} - \log\frac{\exp\left(s_{cv+}/\tau\right)}{\sum_{s \in S_{cv}} \exp\left(s/\tau\right)} + \hat{P}_v + \hat{P}_c$$

wherein $v$ represents an image feature representation and $c$ represents a text feature representation; $s_{vc+}$ represents the positive similarity when the anchor sample is an image sample, and $s_{cv+}$ represents the positive similarity when the anchor sample is a text sample; $S_{vc}$ represents the set of positive and negative similarities when the anchor sample is an image sample, and $S_{cv}$ represents the set of positive and negative similarities when the anchor sample is a text sample; $\hat{P}_v$ and $\hat{P}_c$ represent penalty terms; $\tau$ is a hyperparameter; and $\lvert \cdot \rvert$ denotes the set size.
5. The training method of claim 1, wherein the inputting of the plurality of image-text sample pairs in the training set into the pre-training model OSCAR for feature extraction to generate an image feature representation and a text feature representation comprises:
acquiring an image sample in the training set, extracting regional visual characteristics and regional position characteristics of the image sample, and linearly combining the regional visual characteristics and the regional position characteristics to obtain image embedding; the image sample includes n object regions;
obtaining the text samples in the training set, dividing each text sample into a plurality of tokens by a tokenization technique, and obtaining the text embedding corresponding to each token based on the OSCAR-base model;
and based on the image embedding and the text embedding, generating a joint feature representation using an attention mechanism, and generating the image feature representation and the text feature representation by average pooling.
6. A method for realizing image-text retrieval using an OSCAR image-text retrieval model, the OSCAR image-text retrieval model being trained by the training method according to any one of claims 1 to 5, the method comprising:
acquiring a target text and a target image to be retrieved;
extracting features of the target text based on a text encoder in the image-text retrieval model to obtain text feature representation;
extracting features of the target image based on an image encoder in the image-text retrieval model to obtain image feature representation;
and determining an image retrieval result of the target text in the target image based on the text feature representation and the image feature representation, and/or determining a text retrieval result of the target image in the target text.
7. An OSCAR-based graphic retrieval model training device, the device comprising:
the data acquisition module is used for acquiring a training set, wherein the training set comprises a plurality of image-text sample pairs;
the feature extraction module is used for inputting a plurality of image-text sample pairs in the training set into the pre-training model OSCAR for visual-language tasks, and performing feature extraction to obtain an image feature representation and a text feature representation;
the negative sample synthesis module is used for taking each sample in the training set as an anchor sample and generating a plurality of negative samples with different difficulties corresponding to the anchor sample based on the image characteristic representation and the text characteristic representation; the generated negative sample and the anchor point sample form a generated negative sample pair;
the similarity calculation module is used for calculating the positive similarity between the image and the text in each positive sample pair, and the negative similarity between the image and the text in each negative sample pair and each generated negative sample pair;
and the contrast loss calculation module is used for calculating a loss function based on the positive similarity and the negative similarity, and fine-tuning the pre-training model OSCAR through the loss function to obtain a trained OSCAR image-text retrieval model.
8. A computer device, the computer device comprising a processor and a memory; the memory stores at least one instruction for execution by the processor to implement the method of any one of claims 1 to 6.
9. A computer readable storage medium storing at least one instruction for execution by a processor to implement the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311395517.0A CN117390213A (en) | 2023-10-26 | 2023-10-26 | Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311395517.0A CN117390213A (en) | 2023-10-26 | 2023-10-26 | Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117390213A true CN117390213A (en) | 2024-01-12 |
Family
ID=89435561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311395517.0A Pending CN117390213A (en) | 2023-10-26 | 2023-10-26 | Training method of image-text retrieval model based on OSCAR and method for realizing image-text retrieval |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117390213A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118211126A (en) * | 2024-05-22 | 2024-06-18 | 国网山东省电力公司蒙阴县供电公司 | Training method and device for photovoltaic power generation device fault prediction model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |