CN112148839A - Image-text matching method and device and storage medium - Google Patents
Image-text matching method and device and storage medium
- Publication number
- CN112148839A (application CN202011052223.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- vector
- training
- coding
- Prior art date
- Legal status: Pending (the status is an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The disclosure relates to an image-text matching method and device and a storage medium. The image-text matching method includes: acquiring an image to be subjected to image-text matching; inputting the image into a pre-trained image-text coding model and encoding it to obtain an image vector; determining, from pre-stored text vectors, text vectors similar to the image vector, where the image-text coding model also comprises a text coding sub-network used to encode text during training and the pre-stored text vectors are produced by that sub-network encoding preset texts; and determining the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image. The method improves the image-text matching efficiency of the matching server and reduces its system latency.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for matching images and texts, and a storage medium.
Background
Multi-modal retrieval is a retrieval mode that enables data retrieval across modalities. For example, a user may input an image to search for descriptive text matching that image, or input a text to search for the images the text describes.
Taking text retrieval from an image as an example, the server first performs a coarse search of a text library for candidate texts highly associated with the input image, then encodes each candidate text with a text encoding model to obtain its text vector, and finally determines the degree of match with the input image from those text vectors.
Matching images and texts in this way requires the server to perform several steps (preliminary retrieval, encoding, and match-degree calculation), so its processing efficiency is low.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, and a storage medium for matching images and texts.
According to a first aspect of the embodiments of the present disclosure, there is provided an image-text matching method, including: acquiring an image to be subjected to image-text matching; inputting the image into a pre-trained image-text coding model and coding the image to obtain an image vector; determining text vectors similar to the image vector from pre-stored text vectors, where the image-text coding model further comprises a text coding sub-network used for coding text during training, and the pre-stored text vectors are determined by the text coding sub-network encoding preset texts; and determining the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image.
In one example, determining a text vector similar to the image vector from pre-stored text vectors includes:
determining the cosine distance between the image vector and each of the pre-stored text vectors;
obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and
determining those text vectors as the text vectors similar to the image vector.
In one example, the pre-stored text vectors are determined by the text coding sub-network encoding preset texts as follows: calling the image-text coding model; inputting the preset texts into the image-text coding model; and encoding the input preset texts through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vectors.
In an example, the image-text coding model is trained based on a first training sample pair and a second training sample pair; the first training sample pair includes the preset text and an image sample that matches the preset text, and the second training sample pair includes the preset text and an image sample that does not match the preset text.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for training an image-text coding model, the method comprising: determining preset text samples; determining images matching the preset text samples to obtain first training sample pairs, and determining images not matching the preset text samples to obtain second training sample pairs; and training on the first and second training sample pairs to obtain the image-text coding model.
In an example, the image-text coding model includes an image coding sub-network and a text coding sub-network, and training on the first and second training sample pairs proceeds as follows: extracting the image vectors of the image samples in the first and second training sample pairs through the image coding sub-network, and the text vectors of the text samples through the text coding sub-network; determining a first cosine distance (between the text vector and the image vector in a first training sample pair) and a second cosine distance (between the text vector and the image vector in a second training sample pair); and adjusting the training parameters of both sub-networks according to the first cosine distance, the second cosine distance, and the loss function until the loss value is satisfied. The resulting image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
According to a third aspect of the embodiments of the present disclosure, there is provided an image-text matching apparatus including: an acquisition unit configured to acquire an image to be subjected to image-text matching; a processing unit configured to input the image into a pre-trained image-text coding model and code the image to obtain an image vector; and a determination unit configured to determine text vectors similar to the image vector from pre-stored text vectors, where the image-text coding model further comprises a text coding sub-network used for coding text during training, and the pre-stored text vectors are determined by the text coding sub-network encoding preset texts. The determination unit is further configured to determine the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image.
In one example, the determination unit determines a text vector similar to the image vector from pre-stored text vectors in the following manner: determining the cosine distance between the image vector and each of the pre-stored text vectors; obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and determining those text vectors as the text vectors similar to the image vector.
In one example, the determination unit determines the pre-stored text vector by having the text coding sub-network encode a preset text in the following manner: calling the image-text coding model; inputting the preset text into the image-text coding model; and coding the input preset text through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vector.
In an example, the image-text coding model is trained based on a first training sample pair and a second training sample pair; the first training sample pair includes the preset text and an image sample that matches the preset text, and the second training sample pair includes the preset text and an image sample that does not match the preset text.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a training apparatus for an image-text coding model, the training apparatus comprising: a determining unit configured to determine preset text samples, determine images matching the preset text samples to obtain first training sample pairs, and determine images not matching the preset text samples to obtain second training sample pairs; and a training unit configured to train the image-text coding model based on the first and second training sample pairs.
In an example, the image-text coding model comprises an image coding sub-network and a text coding sub-network, and the training unit trains the model on the first and second training sample pairs as follows: extracting the image vectors of the image samples in the first and second training sample pairs through the image coding sub-network, and the text vectors of the text samples through the text coding sub-network; determining a first cosine distance (between the text vector and the image vector in a first training sample pair) and a second cosine distance (between the text vector and the image vector in a second training sample pair); and adjusting the training parameters of both sub-networks according to the first cosine distance, the second cosine distance, and the loss function until the loss value is satisfied. The resulting image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
According to a fifth aspect of the present disclosure, there is provided an image-text matching device, including: a memory configured to store instructions; and a processor configured to invoke the instructions to perform the image-text matching method of the first aspect or any example of the first aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the image-text matching method of the first aspect or any example of the first aspect.
According to a seventh aspect of the present disclosure, there is provided an image-text coding model training device, including: a memory configured to store instructions; and a processor configured to invoke the instructions to perform the image-text coding model training method of the second aspect or any example of the second aspect.
According to an eighth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the image-text coding model training method of the second aspect or any example of the second aspect.
The technical solutions provided by the embodiments of the disclosure may have the following beneficial effects: the image-text matching server acquires an image to be matched, inputs it into a pre-trained image-text coding model to encode it into an image vector, determines text vectors similar to that image vector among pre-stored text vectors, and determines the texts corresponding to a preset number of similar text vectors as the texts matching the image. The server therefore neither searches the preset texts for candidate texts matching the user's image nor encodes candidate texts in real time to obtain text vectors, which improves its image-text matching efficiency and reduces its system latency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating an image-text matching method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating an image-text matching method according to an exemplary embodiment.
FIG. 3 is a flowchart illustrating training of an image-text coding model according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating an image-text matching apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating an image-text coding model training apparatus according to an exemplary embodiment.
FIG. 6 is a block diagram illustrating an apparatus in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The technical solutions of the exemplary embodiments of the disclosure can be applied to an image-text matching system, in application scenarios where texts matching an image are returned according to the image a user inputs to the system. In such a scenario, the image-text matching system may include a terminal for inputting the image to be matched and a server that performs text matching on that image, i.e., the image-text matching server. The user terminal includes, but is not limited to, fixed or mobile electronic devices such as smartphones, tablet computers, notebook computers, desktop computers, and e-book readers. The image-text matching server may be an independent application service device or a service cluster formed by several servers; in practice it may be a cloud server, a cloud host, a virtualization center, or the like.
In the related art, with multi-modal retrieval, for example retrieving descriptive text matching an input image, the user inputs the image on a terminal. After receiving it, the image-text matching server must search a text library for candidate texts matching the image, encode those candidate texts with a network model to obtain their text vectors, determine from those vectors which candidate texts match the input image, and feed the matching texts back to the terminal for the user to select.
The image-text matching server can only accomplish matching through all of these steps, so it is inefficient and suffers significant system latency.
The embodiment of the disclosure provides an image-text matching method. In this method, the image-text matching server acquires an image to be matched, inputs it into a pre-trained image-text coding model to encode it into an image vector, determines text vectors similar to that image vector among pre-stored text vectors, and determines the texts corresponding to a preset number of similar text vectors as the texts matching the image. The server therefore neither searches the preset texts for candidate texts matching the user's image nor encodes candidate texts in real time to obtain text vectors, which improves its image-text matching efficiency and reduces its system latency.
Fig. 1 is a flowchart illustrating an image-text matching method according to an exemplary embodiment; as shown in fig. 1, the method comprises the following steps.
In step S11, an image to be subjected to image-text matching is acquired.
In the present disclosure, the image to be matched may be one or more images input by a user, or image frames extracted from a video the user inputs.
In step S12, the image is input to a pre-trained image-text coding model, and the image is coded to obtain an image vector.
The image-text coding model in this disclosure can encode input images and/or texts and output the corresponding image vectors and/or text vectors.
Specifically, the image to be matched is input into the image-text coding model and encoded by its image coding sub-network, yielding the image vector of that image.
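As an illustrative sketch only (not the claimed implementation), the image coding sub-network could be realized as follows in PyTorch/torchvision. The ResNet-50 backbone, the 512-dimensional embedding size, and the file name query.jpg are assumptions for this sketch; the disclosure names ResNet and VGG as possible backbones only later, in the training description.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Image coding sub-network: a ResNet backbone whose classifier head is
# replaced by a projection into the shared image-text embedding space.
# ResNet-50, the 512-dim embedding, and "query.jpg" are assumptions.
class ImageEncoder(torch.nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        backbone.fc = torch.nn.Identity()      # drop the 1000-class head
        self.backbone = backbone
        self.proj = torch.nn.Linear(2048, embed_dim)

    def forward(self, pixels):
        feats = self.backbone(pixels)          # (batch, 2048)
        vec = self.proj(feats)                 # (batch, embed_dim)
        # L2-normalize so cosine similarity reduces to a dot product.
        return torch.nn.functional.normalize(vec, dim=-1)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

encoder = ImageEncoder().eval()
with torch.no_grad():
    image = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)
    image_vector = encoder(image)              # the image vector of step S12
```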
In step S13, a text vector similar to the image vector is determined from the text vectors stored in advance.
In practical applications, images and texts come from two heterogeneous spaces. To measure their similarity directly, both can be mapped into a single space, where the similarity between an image and a text is measured with the image vector and the text vector.
Based on the principle of vector similarity, in one embodiment the disclosure may retrieve text vectors similar to the image vector from the pre-stored text vectors and determine the texts corresponding to a preset number of those vectors as the texts matching the image.
The pre-stored text vectors are obtained by encoding preset texts with the text coding sub-network included in the image-text coding model.
In one embodiment, the text coding sub-network may be, for example, a neural network using Bidirectional Encoder Representations from Transformers (BERT).
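A minimal sketch of such a BERT-based text coding sub-network, assuming the Hugging Face transformers library; the bert-base-chinese checkpoint, the [CLS]-token pooling, and the 512-dimensional projection are illustrative assumptions, and in the trained model the projection would be learned jointly with the image coding sub-network.

```python
import torch
from transformers import BertModel, BertTokenizer

# Text coding sub-network built on BERT; checkpoint and pooling choices
# are assumptions for illustration only.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()
proj = torch.nn.Linear(bert.config.hidden_size, 512)

def encode_texts(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    cls = out.last_hidden_state[:, 0]          # [CLS] representation
    return torch.nn.functional.normalize(proj(cls), dim=-1)

text_vectors = encode_texts(["A poem about mountains and rivers."])
```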
In step S14, text corresponding to a preset number of text vectors similar to the image vector is determined as text matching the image.
According to the method and the device of the present disclosure, after the image to be matched is obtained, text vectors similar to its image vector can be retrieved from the pre-stored text vectors according to that image vector, and the texts corresponding to a preset number of similar text vectors are returned to the user for selection.
For example, a user inputs a landscape image. The image-text coding model encodes it into an image vector, text vectors similar to that vector are retrieved from the pre-stored text vectors, and the texts corresponding to a preset number of similar vectors, such as classical poems and elegant prose, are returned to the user for selection. Returning such texts greatly shortens the time the user spends matching words to the scenery, and the user can also apply the returned texts to further image processing at a later stage.
In an exemplary embodiment of the disclosure, the image-text matching server acquires the image to be matched, inputs it into the pre-trained image-text coding model to encode it into an image vector, determines text vectors similar to that image vector among the pre-stored text vectors, and determines the texts corresponding to a preset number of similar text vectors as the texts matching the image. The server therefore neither searches the preset texts for candidate texts matching the user's image nor encodes candidate texts in real time to obtain text vectors, which improves its image-text matching efficiency and reduces its system latency.
Fig. 2 is a flowchart illustrating an image-text matching method according to an exemplary embodiment; as shown in fig. 2, the method comprises the following steps.
In step S21, an image to be subjected to image-text matching is acquired.
In step S22, the image is input to a pre-trained image-text coding model, and the image is coded to obtain an image vector.
In step S23, the cosine distance between the image vector and each pre-stored text vector is determined; the text vectors whose cosine distance to the image vector is greater than a set distance threshold are obtained from the calculated distances; and those text vectors are determined to be the text vectors similar to the image vector.
Based on the principle of vector similarity, in one embodiment the similarity between vectors may be measured by the cosine distance between two vectors, i.e., the cosine of the angle between them (their cosine similarity). When the cosine distance between two vectors approaches 1, the vectors are similar; when it approaches 0, they are dissimilar.
Given these characteristics, a distance threshold for similarity to the image vector can be set, for example 0.98: any pre-stored text vector whose cosine distance to the image vector exceeds 0.98 is determined to be a text vector similar to the image vector. The texts corresponding to a preset number of such text vectors may then be determined as the texts matching the image.
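A minimal sketch of this thresholding step, assuming all vectors are L2-normalized so that the cosine distance above reduces to a dot product; the 0.98 threshold and the preset number top_k follow the example.

```python
import numpy as np

# Step S23 as a sketch: cosine similarity between the image vector and
# every pre-stored text vector, then thresholding at 0.98.
def similar_texts(image_vec, text_vecs, texts, threshold=0.98, top_k=5):
    sims = text_vecs @ image_vec                  # one similarity per text
    keep = np.where(sims > threshold)[0]          # above-threshold vectors
    keep = keep[np.argsort(-sims[keep])][:top_k]  # best `top_k` of them
    return [(texts[i], float(sims[i])) for i in keep]
```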
For example, an image of mountains and rivers input by a user is obtained. The image-text coding model is called to encode it into an image vector; the cosine distance between each pre-stored text vector and this image vector is determined; the text vectors whose cosine distance exceeds 0.98 are obtained and determined to be similar to the image vector; and the texts corresponding to a preset number of those vectors, such as classical poems about mountains and rivers and elegant prose describing them, are returned to the user for selection. Returning texts matched to the input image greatly shortens the time the user spends finding matching words, and the returned texts can also support further image processing at a later stage.
For another example, a video input by a user is obtained. Because adjacent video frames differ little in content, the input video may be sampled every preset number of frames to obtain image frames. The image-text coding model is called to encode each frame into an image vector; the cosine distance between each pre-stored text vector and each frame's image vector is determined; the text vectors whose cosine distance exceeds 0.98 are determined to be similar to that frame's vector; and the texts corresponding to a preset number of similar vectors, such as elegant prose related to the frames, are returned to the user for selection.
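A sketch of the frame sampling described here, assuming OpenCV for video decoding (the disclosure does not name a library); every_n is the preset sampling interval.

```python
import cv2

# Sample every N-th frame from the input video.
def sample_frames(path, every_n=30):
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                 # end of video
            break
        if idx % every_n == 0:     # keep one frame per sampling interval
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```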
In step S24, text corresponding to a preset number of text vectors similar to the image vector is determined as text matching the image.
In an exemplary embodiment of the disclosure, the image-text matching server acquires the image to be matched, encodes it into an image vector by calling the pre-trained image-text coding model, determines the cosine distance between each pre-stored text vector and the image vector, and determines the text vectors whose cosine distance exceeds the set distance threshold as the text vectors similar to the image vector. The server therefore neither searches the text library for candidate texts matching the user's image nor encodes candidate texts in real time, which improves its image-text matching efficiency and reduces the processing latency of matching.
To improve the accuracy of image-text matching and reduce its processing latency, in one implementation the disclosure may train the image-text coding model in an end-to-end manner.
FIG. 3 is a flowchart illustrating training of the image-text coding model according to an exemplary embodiment; as shown in FIG. 3, training includes the following steps.
In step S31, a first training sample pair and a second training sample pair are acquired.
In one embodiment, the disclosure may train the image-text matching model based on first training sample pairs and second training sample pairs. A first training sample pair comprises a preset text and an image sample matching it; a second training sample pair comprises a preset text and an image sample not matching it. For example, 70% of the sample pairs may be used as the training data set and the remaining 30% as the test data set for verifying the model, as sketched below.
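A minimal sketch of that 70/30 split; the pairs list of (text, image, label) tuples and the fixed random seed are assumptions for illustration.

```python
import random

# Split the sample pairs into a 70% training set and a 30% test set.
def split_pairs(pairs, train_frac=0.7, seed=0):
    pairs = pairs[:]                       # don't shuffle the caller's list
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]        # training set, test set
```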
For convenience of description, a training sample pair composed of a preset text sample and an image sample matching that text is called a first training sample pair, and a training sample pair composed of a preset text and an image sample not matching it is called a second training sample pair.
In one embodiment, the first training sample pair and the second training sample pair in the present disclosure are determined, for example, by:
acquiring preset texts, where a preset text may be a sentence or a semantically complete combination of sentences. The sentences may be in any natural language and may include large bodies of text such as Tang and Song dynasty poetry, song lyrics, famous quotations, and classic film lines.
constructing images that match the preset texts to obtain first training sample pairs, and constructing images that do not match the preset texts to obtain second training sample pairs.
In step S32, the image-text matching model is trained and optimized.
In the present disclosure, the image-text matching model may be a model comprising an image coding sub-network and a text coding sub-network, where the image coding sub-network may be a ResNet or a VGG network, and the text coding sub-network may be, for example, a neural network using Bidirectional Encoder Representations from Transformers (BERT).
To avoid omissions and incomplete results when searching a text library for candidate texts matching the user's image, and to avoid the processing delay that such a search introduces, in one embodiment the image-text matching model may be trained end to end: the image coding sub-network and the text coding sub-network are trained jointly, so that matching of the image is performed from a global perspective. This improves the accuracy and processing efficiency of image-text matching and reduces the matching server's processing latency.
Since a first training sample pair consists of a text from the text library and an image sample matching that text, the cosine distance between its text vector and image vector should approach 1; since a second training sample pair consists of a text from the text library and an image sample not matching it, the cosine distance between its text vector and image vector should approach 0.
Therefore, when the image-text matching model comprising the image coding sub-network and the text coding sub-network is trained end to end, the first and second training sample pairs are input into the model; the image samples in both kinds of pairs are encoded by the image coding sub-network, which outputs the image vectors of the image samples, and the text samples are encoded by the text coding sub-network, which outputs the text vectors of the text samples.
Then, based on the image vectors of the image samples and the text vectors of the text samples, the cosine distance between the text vector and the image vector is determined for each first training sample pair and for each second training sample pair.
For convenience, this disclosure refers to the cosine distance between the text vector and the image vector in a first training sample pair as the first cosine distance, and the cosine distance between the text vector and the image vector in a second training sample pair as the second cosine distance.
The parameters of the image-text coding model, i.e., the training parameters of the image coding sub-network and the text coding sub-network it comprises, are adjusted according to the first cosine distance, the second cosine distance, and the loss function until the model satisfies the condition that the first cosine distance is greater than a preset first distance threshold and the second cosine distance is smaller than a preset second distance threshold.
For example, let the preset first distance threshold be 0.97 and the preset second distance threshold be 0.02. The training parameters of the image coding sub-network and the text coding sub-network are adjusted with a cross-entropy loss function until the image vectors and text vectors output by the model satisfy a first cosine distance greater than 0.97 for first training sample pairs and a second cosine distance smaller than 0.02 for second training sample pairs. A well-trained image-text coding model is thus obtained.
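A sketch of one joint training step under stated assumptions: both encoders output L2-normalized vectors, text inputs arrive in whatever tensor form the text encoder accepts, and the cross-entropy loss of the example is realized as binary cross-entropy on the cosine similarity, with target 1.0 for first (matching) pairs and 0.0 for second (non-matching) pairs.

```python
import torch

# One end-to-end training step over a batch of sample pairs.
def training_step(image_encoder, text_encoder, optimizer,
                  images, text_inputs, labels):
    img_vecs = image_encoder(images)            # (batch, dim), normalized
    txt_vecs = text_encoder(text_inputs)        # (batch, dim), normalized
    cos = (img_vecs * txt_vecs).sum(dim=-1)     # cosine similarity per pair
    # Clamp into (0, 1) so the similarity can act as a probability.
    prob = cos.clamp(1e-6, 1 - 1e-6)
    loss = torch.nn.functional.binary_cross_entropy(prob, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```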
In step S33, the test data set is input into the trained image-text matching model for verification, so as to obtain a verified image-text matching model.
Thus, after the image-text coding model is trained, all preset texts are encoded with the text coding sub-network included in the trained model. When a user later needs to match texts to an image, the trained model encodes the image to be matched into an image vector; text vectors similar to that vector can then be retrieved directly from the pre-stored text vectors based on cosine distance, and the texts corresponding to a preset number of similar text vectors are determined as the texts matching the image and output for the user to select.
In addition, to retrieve text vectors similar to the image vector quickly, after the text coding sub-network has encoded all preset texts into text vectors, a vector retrieval library may be built over the encoded text vectors with a similarity-search engine such as faiss. Because faiss supports billion-scale retrieval, a faiss-based vector retrieval library can greatly improve retrieval efficiency while preserving search accuracy, further improving image-text matching efficiency.
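A sketch of building and querying such a faiss retrieval library. The flat inner-product index, the 512-dimensional vectors, and the text_vectors.npy file are assumptions; with L2-normalized vectors the inner product equals the cosine similarity, which is why IndexFlatIP suffices here.

```python
import faiss
import numpy as np

# Build the vector retrieval library once, offline.
dim = 512
text_vecs = np.load("text_vectors.npy").astype("float32")   # (N, dim)
index = faiss.IndexFlatIP(dim)
index.add(text_vecs)                        # register all pre-stored vectors

# Query it online with the image vector of the image to be matched.
def retrieve(image_vec, top_k=5):
    query = image_vec.reshape(1, -1).astype("float32")
    sims, ids = index.search(query, top_k)  # similarities and row indices
    return list(zip(ids[0].tolist(), sims[0].tolist()))
```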
In an exemplary embodiment of the present disclosure, an end-to-end image-text matching model comprising an image coding sub-network and a text coding sub-network is trained so that, for first training sample pairs, the cosine distance between the text vector and the image vector is greater than a preset first distance threshold and, for second training sample pairs, it is smaller than a preset second distance threshold. To match texts to an acquired image, the image only needs to be input into the image-text coding model and encoded into an image vector; text vectors similar to it are then determined among all the text vectors that the text coding sub-network has produced from the preset texts. There is no need to first select candidate texts from a text retrieval library, encode them in real time, and match their vectors against the image vector. This improves the server's image-text matching efficiency and reduces the processing latency of matching.
Based on the same concept, an embodiment of the disclosure further provides an image-text matching apparatus.
It is understood that, to implement the above functions, the image-text matching apparatus provided in the embodiments of the disclosure includes corresponding hardware structures and/or software modules for performing the respective functions. Combined with the exemplary units and algorithm steps disclosed herein, the disclosed embodiments can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
Fig. 4 is a block diagram illustrating an image-text matching apparatus according to an exemplary embodiment. Referring to fig. 4, the image-text matching apparatus 400 comprises an acquisition unit 401, a processing unit 402 and a determination unit 403.
The acquiring unit 401 is configured to acquire an image to be subjected to image-text matching; the processing unit 402 is configured to input the image into a pre-trained image-text coding model and code the image to obtain an image vector; the determining unit 403 is configured to determine text vectors similar to the image vector from pre-stored text vectors, where the image-text coding model further comprises a text coding sub-network used for coding text in the training process, and the pre-stored text vectors are determined by the text coding sub-network encoding preset texts; the determining unit 403 is further configured to determine the texts corresponding to a preset number of text vectors similar to the image vector as the texts matching the image.
In one example, the determining unit 403 determines a text vector similar to the image vector from pre-stored text vectors in the following manner: determining the cosine distance between the image vector and each of the pre-stored text vectors; obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and determining those text vectors as the text vectors similar to the image vector.
In one example, the determining unit 403 determines the pre-stored text vector by having the text coding sub-network encode a preset text in the following manner: calling the image-text coding model; inputting the preset text into the image-text coding model; and coding the input preset text through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vector.
In an example, the image-text coding model is trained based on a first training sample pair and a second training sample pair; the first training sample pair includes the preset text and an image sample that matches the preset text, and the second training sample pair includes the preset text and an image sample that does not match the preset text.
Fig. 5 is a block diagram illustrating an image-text coding model training apparatus according to an exemplary embodiment. Referring to fig. 5, the training apparatus 500 comprises a determination unit 501 and a training unit 502.
The determining unit 501 is configured to determine preset text samples, determine images matching the preset text samples to obtain first training sample pairs, and determine images not matching the preset text samples to obtain second training sample pairs; the training unit 502 is configured to train an image-text coding model based on the first and second training sample pairs.
In an example, the image-text coding model includes an image coding sub-network and a text coding sub-network, and the training unit 502 trains the model on the first and second training sample pairs as follows: extracting the image vectors of the image samples in the first and second training sample pairs through the image coding sub-network, and the text vectors of the text samples through the text coding sub-network; determining a first cosine distance (between the text vector and the image vector in a first training sample pair) and a second cosine distance (between the text vector and the image vector in a second training sample pair); and adjusting the training parameters of both sub-networks according to the first cosine distance, the second cosine distance, and the loss function until the loss value is satisfied. The resulting image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating an apparatus 600 for image-text matching according to an exemplary embodiment. For example, the apparatus 600 may be provided as a server. Referring to fig. 6, the apparatus 600 includes a processing component 622, which further includes one or more processors, and memory resources, represented by memory 632, for storing instructions executable by the processing component 622, such as application programs. The application programs stored in memory 632 may include one or more modules, each corresponding to a set of instructions. The processing component 622 is configured to execute the instructions to perform the image-text matching method described above.
The apparatus 600 may also include a power component 626 configured to perform power management of the apparatus 600, a wired or wireless network interface 660 configured to connect the apparatus 600 to a network, and an input/output (I/O) interface 668. The apparatus 600 may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
It is understood that "a plurality" in this disclosure means two or more, and other words are analogous. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure.
It will be further understood that, unless otherwise specified, "connected" includes direct connections between the two without the presence of other elements, as well as indirect connections between the two with the presence of other elements.
It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (16)
1. A method for matching images and texts, the method comprising:
acquiring an image to be subjected to image-text matching;
inputting the image into a pre-trained image-text coding model, and coding the image to obtain an image vector;
determining a text vector similar to the image vector from pre-stored text vectors;
the image-text coding model further comprises a text coding sub-network used for coding text in the training process, and the pre-stored text vector is determined by the text coding sub-network encoding a preset text;
and determining the texts corresponding to the preset number of text vectors similar to the image vectors as texts matched with the image.
2. The image-text matching method according to claim 1, wherein determining a text vector similar to the image vector from pre-stored text vectors comprises:
determining the cosine distance between the image vector and each of the pre-stored text vectors;
obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and
determining the text vectors whose cosine distance is greater than the set distance threshold as the text vectors similar to the image vector.
3. The image-text matching method according to claim 1, wherein the pre-stored text vector is determined by the text coding sub-network encoding a preset text, comprising:
and inputting the preset text into the image-text coding model, and coding the input preset text through a text coding sub-network included in the image-text coding model to obtain the pre-stored text vector.
4. The method of claim 1, wherein the image-text coding model is trained on a first training sample pair and a second training sample pair, the first training sample pair including the preset text and an image sample matching the preset text, and the second training sample pair including the preset text and an image sample not matching the preset text.
5. A method for training an image-text coding model, the method comprising:
determining a preset text sample;
determining images matched with the preset text samples to obtain a first training sample pair, and determining images not matched with the preset text samples to obtain a second training sample pair;
and training based on the first training sample pair and the second training sample pair to obtain an image-text coding model.
6. The method of claim 5, wherein the image-text coding model comprises an image coding sub-network and a text coding sub-network, and wherein the training based on the first training sample pair and the second training sample pair to obtain the image-text coding model comprises:
respectively extracting image vectors of the image samples in the first training sample pair and the second training sample pair through an image coding sub-network, and
respectively extracting text vectors of the text samples in the first training sample pair and the second training sample pair through a text coding sub-network;
determining a first cosine distance and a second cosine distance based on the image vector of the image sample and the text vector of the text sample, wherein the first cosine distance is the cosine distance between the text vector and the image vector in the first training sample pair, and the second cosine distance is the cosine distance between the text vector and the image vector in the second training sample pair;
adjusting training parameters of the image coding sub-network and the text coding sub-network according to the first cosine distance, the second cosine distance, and a loss function to obtain the image-text coding model satisfying the loss value;
wherein the image-text coding model makes the first cosine distance greater than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
7. An apparatus for matching graphics and text, the apparatus comprising:
an acquisition unit configured to acquire an image to be subjected to image-text matching;
the processing unit is configured to input the image into a pre-trained image-text coding model, and code the image to obtain an image vector;
a determination unit configured to determine a text vector similar to the image vector from among pre-stored text vectors;
the image-text coding model further comprises a text coding sub-network used for coding text in the training process, and the pre-stored text vector is determined by the text coding sub-network encoding a preset text;
the determining unit is further configured to determine a text corresponding to a preset number of text vectors similar to the image vector as a text matching the image.
8. The image-text matching apparatus according to claim 7, wherein the determination unit determines a text vector similar to the image vector from pre-stored text vectors in the following way:
determining the cosine distance between the image vector and each of the pre-stored text vectors;
obtaining, from the calculated cosine distances, the text vectors whose cosine distance to the image vector is greater than a set distance threshold; and
determining the text vectors whose cosine distance is greater than the set distance threshold as the text vectors similar to the image vector.
9. The image-text matching apparatus according to claim 7, wherein the determination unit determines the pre-stored text vectors by encoding a preset text through the text coding sub-network in the following manner:
calling the image-text coding model; and
inputting the preset text into the image-text coding model, and coding the input preset text through the text coding sub-network included in the image-text coding model to obtain the pre-stored text vectors.
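A sketch of the offline step of claim 9, assuming hypothetical `text_encoder` and `tokenize` stand-ins for the text coding sub-network and its preprocessing:

```python
import torch

@torch.no_grad()
def precompute_text_vectors(preset_texts, text_encoder, tokenize):
    # Encode each preset text once (encoder output assumed shape (1, D));
    # the stacked result is saved as the pre-stored text vectors.
    vectors = [text_encoder(tokenize(t)).squeeze(0) for t in preset_texts]
    return torch.stack(vectors)  # (N, D)
```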
10. The apparatus according to claim 7, wherein the image-text coding model is trained on a first training sample pair and a second training sample pair, the first training sample pair comprising the preset text and an image sample matching the preset text, and the second training sample pair comprising the preset text and an image sample not matching the preset text.
11. An apparatus for training an image-text coding model, the apparatus comprising:
a determination unit configured to determine a preset text sample, to determine images matching the preset text sample to obtain a first training sample pair, and to determine images not matching the preset text sample to obtain a second training sample pair;
a training unit configured to train, based on the first training sample pair and the second training sample pair, to obtain an image-text coding model.
12. The apparatus of claim 11, wherein the image-text coding model comprises an image coding sub-network and a text coding sub-network, and wherein the training unit is configured to train the image-text coding model based on the first training sample pair and the second training sample pair in the following manner:
extracting, through the image coding sub-network, image vectors of the image samples in the first training sample pair and the second training sample pair, and
extracting, through the text coding sub-network, text vectors of the text samples in the first training sample pair and the second training sample pair;
determining a first cosine distance and a second cosine distance based on the image vectors of the image samples and the text vectors of the text samples, wherein the first cosine distance is the cosine distance between the text vector and the image vector in the first training sample pair, and the second cosine distance is the cosine distance between the text vector and the image vector in the second training sample pair;
adjusting training parameters of the image coding sub-network and the text coding sub-network according to the first cosine distance, the second cosine distance and a loss function, to obtain an image-text coding model whose loss value satisfies the training requirement;
wherein the image-text coding model makes the first cosine distance larger than a preset first distance threshold and the second cosine distance smaller than a preset second distance threshold.
13. An image-text matching device, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the image-text matching method of any one of claims 1-4.
14. A non-transitory computer readable storage medium having instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the image-text matching method of any one of claims 1-4.
15. A device for training an image-text coding model, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the teletext code model training method of any one of claims 5-6.
16. A non-transitory computer readable storage medium having instructions that, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the image-text coding model training method of any one of claims 5-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011052223.4A | 2020-09-29 | 2020-09-29 | Image-text matching method and device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011052223.4A | 2020-09-29 | 2020-09-29 | Image-text matching method and device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112148839A (en) | 2020-12-29 |
Family
ID=73894229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011052223.4A (CN112148839A, pending) | Image-text matching method and device and storage medium | 2020-09-29 | 2020-09-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112148839A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106997387A (en) * | 2017-03-28 | 2017-08-01 | 中国科学院自动化研究所 | Multi-modal automatic abstracting based on text-image matching |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113343664A (en) * | 2021-06-29 | 2021-09-03 | 京东数科海益信息科技有限公司 | Method and device for determining matching degree between image texts |
CN113343664B (en) * | 2021-06-29 | 2023-08-08 | 京东科技信息技术有限公司 | Method and device for determining matching degree between image texts |
CN113642673A (en) * | 2021-08-31 | 2021-11-12 | 北京字跳网络技术有限公司 | Image generation method, device, equipment and storage medium |
CN113642673B (en) * | 2021-08-31 | 2023-12-22 | 北京字跳网络技术有限公司 | Image generation method, device, equipment and storage medium |
CN114357228A (en) * | 2021-12-17 | 2022-04-15 | 有米科技股份有限公司 | Data processing method and device for creative document generation |
WO2023173547A1 (en) * | 2022-03-16 | 2023-09-21 | 平安科技(深圳)有限公司 | Text image matching method and apparatus, device, and storage medium |
CN115880697A (en) * | 2023-02-07 | 2023-03-31 | 天翼云科技有限公司 | Image searching method and device, readable storage medium and electronic equipment |
CN115880697B (en) * | 2023-02-07 | 2024-01-09 | 天翼云科技有限公司 | Image searching method and device, readable storage medium and electronic equipment |
Similar Documents
Publication | Title |
---|---|
CN112148839A (en) | Image-text matching method and device and storage medium |
CN113313022B (en) | Training method of character recognition model and method for recognizing characters in image |
EP3885966B1 (en) | Method and device for generating natural language description information |
KR102124466B1 (en) | Apparatus and method for generating conti for webtoon |
WO2023065731A1 (en) | Method for training target map model, positioning method, and related apparatuses |
CN110990533B (en) | Method and device for determining standard text corresponding to query text |
CN114861889B (en) | Deep learning model training method, target object detection method and device |
US20230143452A1 (en) | Method and apparatus for generating image, electronic device and storage medium |
CN110097010A (en) | Picture and text detection method, device, server and storage medium |
CN110263218B (en) | Video description text generation method, device, equipment and medium |
CN110781413A (en) | Interest point determining method and device, storage medium and electronic equipment |
US20230215203A1 (en) | Character recognition model training method and apparatus, character recognition method and apparatus, device and storage medium |
CN105989067A (en) | Method for generating text abstract from image, user equipment and training server |
CN115269913A (en) | Video retrieval method based on attention fragment prompt |
CN114973229B (en) | Text recognition model training, text recognition method, device, equipment and medium |
CN117315249A (en) | Image segmentation model training and segmentation method, system, equipment and medium |
CN115687664A (en) | Chinese image-text retrieval method and data processing method for Chinese image-text retrieval |
CN114758330A (en) | Text recognition method and device, electronic equipment and storage medium |
CN115098722B (en) | Text and image matching method and device, electronic equipment and storage medium |
CN116340479A (en) | Knowledge base construction method, data retrieval method, device and cloud equipment |
CN114299074A (en) | Video segmentation method, device, equipment and storage medium |
CN110209878B (en) | Video processing method and device, computer readable medium and electronic equipment |
CN118155270B (en) | Model training method, face recognition method and related equipment |
CN116383428B (en) | Graphic encoder training method, graphic matching method and device |
CN113722444B (en) | Text processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |