CN114743029A - Image text matching method - Google Patents

Image text matching method

Info

Publication number
CN114743029A
CN114743029A
Authority
CN
China
Prior art keywords
image
text
similarity
matching
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210394752.5A
Other languages
Chinese (zh)
Inventor
乔壮
金福生
袁野
王国仁
马波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210394752.5A
Publication of CN114743029A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an image-text matching method comprising the following steps: extracting multi-scale image features and text features with a deep neural network; calculating global similarity at multiple scales using a self-attention mechanism; performing node-level matching and structure-level matching and calculating local similarity at multiple scales from the structure-level matching results; calculating the final total similarity from the computed global and local similarities at the multiple scales; training the model by supervised learning and updating the model parameters according to a loss function to obtain a trained model; and using the trained model to calculate the final total similarity between an input image and text and to obtain the matched text or image according to that similarity. Once network training is finished, the method finds the matching image or text more quickly, and it obtains more accurate results by exploiting the multi-scale characteristics of the raw data and a more comprehensive similarity measure.

Description

Image text matching method
Technical Field
The invention relates to the technical field of intelligent matching, in particular to a method for matching one modality with the other in multimodal image-text data.
Background
In daily life, and especially when accessing Internet resources, images and text are two very important modalities. Image search usually depends on a textual description, and matching an image to text with similar semantic information also guides the search for similar images, so image-text matching — that is, computing the similarity between an image and a text — is very important. Traditional methods must store a large amount of relational data in addition to the data itself, cope poorly with dynamically changing data, and take a long time for each computation, making them unsuitable for scenarios with real-time requirements. Deep-learning-based image-text matching methods extract features from the image and the text separately, align the features, and then compute their similarity; after supervised training on a large data set, the resulting model can quickly and accurately find the text matching an image, or the image matching a text, in the data to be processed. Many deep-learning techniques for image-text matching already exist, but they generally fail to balance global and local matching, and little attention has been paid to the case where the key targets in an image differ greatly in size.
Therefore, how to provide an image-text matching method with multi-scale features and a more comprehensive similarity measure is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method that extracts image features and text features based on deep learning and multi-modal learning, and performs both global and local similarity matching.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for matching image texts comprises a model building step and a model matching step:
the step of constructing the model comprises:
s1, extracting multi-scale image features and text features respectively by using a deep neural network, wherein the image and the text are data with matching labels;
s2, calculating the global similarity of the image multi-scale features and the text multi-scale features in multiple scales by using a self-attention mechanism method;
s3, sequentially performing node level matching and structure level matching between the image multi-scale features and the text multi-scale features, and calculating local similarity of multiple scales by using a structure level matching result;
s4, calculating the final total similarity according to the calculated global similarity and local similarity at the multiple scales;
s5, repeating S2-S4 to train the supervised learning model, and updating model parameters according to the loss function to obtain a trained model;
model matching:
and S6, calculating the final total similarity of the input image and the text by using the trained model, and obtaining the matched text or image according to the final total similarity.
Preferably, the S1 includes the image multi-scale feature extraction step:
inputting the image into a Faster R-CNN model, and identifying to obtain detection frames of a plurality of image areas;
coding the image region corresponding to the detection frame by using a ResNet model to obtain the extracted visual features;
and passing the visual features through fully connected layers whose output dimensionalities span N scales, the output matrices being the multi-scale image features, where N ∈ ℕ+.
Preferably, the S1 includes a text multi-scale feature extraction step:
by endowing each word with a unique serial number, performing word-level tokenization on the sentence to obtain a sentence vector;
embedding the sentence vectors into a multidimensional vector space and passing them through bidirectional GRUs to obtain text feature vectors; there are N bidirectional GRUs, which output text feature vectors at N scales.
Preferably, the node-level matching in S3 is a step of matching the image with a word in a sentence, and includes:
fusing the image features T with the text features V, and then calculating a softmax function along the visual axis to obtain a similarity matrix, namely W = softmax(λTVᵀ), where λ is a temperature parameter;
integrating all image features into weighted image features by taking the similarity matrix as a weight parameter;
and calculating the similarity of the weighted image features and the text features by using a cosine similarity measurement method to obtain a node-level matching result.
Preferably, the structure-level matching in S3 is a step of matching the image with a phrase in a sentence, and includes:
analyzing semantic dependency among words in a text sentence, constructing text characteristics into an undirected sparse graph, and recording word nodes with dependency as adjacent nodes;
calculating to obtain a similarity matrix between word representations, and obtaining the weight of edges in the undirected sparse graph;
and integrating the neighborhood matching vectors represented by the similarity matrix by using a graph convolution neural network, and updating the nodes in the similarity matrix.
Preferably, in S3, calculating the local similarity using the structure-level matching result specifically comprises: feeding the result of the graph convolutional neural network into a fully connected layer and computing the l2 norm to obtain the local similarity.
Preferably, the S4 includes:
splicing the global similarity and the local similarity into a matrix, and sending the matrix into a full-connection layer to obtain the total similarity under one scale;
and carrying out weighted summation on the total similarity under multiple scales to obtain the final total similarity.
Preferably, the model output of S6 is a matrix whose number of rows equals the number of images (or texts) and whose number of columns equals the number of texts (or images); each value in the matrix is the similarity computed between the image (or text) represented by its row and the text (or image) represented by its column.
Through the above technical scheme, compared with the prior art, the invention has the following beneficial effects:
Once network training is finished, the method finds the matching image or text more quickly; and compared with existing deep-learning techniques, it obtains more accurate results by exploiting the multi-scale features of the raw data and a more comprehensive similarity measure.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
fig. 1 is a flowchart of a method for matching image texts according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of training data provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of image region target detection according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the text results matched for given images according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the image results matched for given texts according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to the attached figure 1, the invention provides an image text matching method, which is based on multi-scale feature extraction and multi-scale similarity calculation and specifically comprises a model building step and a model matching step:
the model building step comprises:
and S1, extracting multi-scale image features and text features respectively by using the deep neural network, wherein the image and the text are data with matching labels.
In the specific implementation of this step, the purpose of the matching labels between images and texts is to establish a training data set that associates each training image with the training texts similar to it, and this data set is used for supervised training of the model; each image is paired with 5 descriptive sentences, and an example of corresponding images and texts is shown in FIG. 2.
In one embodiment, the image multi-scale feature extraction step is as follows:
inputting the image into a Faster R-CNN model, and identifying to obtain a plurality of detection frames of the image area;
coding the image area corresponding to the detection frame by using a ResNet model to obtain the extracted visual characteristics;
and passing the visual features through fully connected layers whose output dimensionalities span N scales, the output matrices being the multi-scale image features, where N ∈ ℕ+.
When the method is specifically executed: for image features, 36 image regions covering entities and attributes are identified with a Faster R-CNN model pre-trained on Visual Genome. As shown in FIG. 3, the upper two subgraphs are the original image and all identified image regions, and the lower six subgraphs are six selected samples; the identified regions include entities such as shadows and hair as well as attributes such as colors. The image regions corresponding to the resulting 36 detection boxes are then encoded with a ResNet model to obtain the extracted visual features. To obtain multi-scale features, the visual features are passed through fully connected layers with output dimensions of 128, 256, 512 and 1024 (i.e., four scales), and the output matrices are the multi-scale image features.
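To make this step concrete, the following minimal PyTorch sketch projects pre-extracted region features through one fully connected layer per scale. It is an illustration rather than the patented implementation: the input dimension 2048, the class and variable names, and the random tensors standing in for real Faster R-CNN/ResNet region features are all assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleImageFeatures(nn.Module):
    def __init__(self, in_dim=2048, scales=(128, 256, 512, 1024)):
        super().__init__()
        # One fully connected layer per output scale.
        self.projections = nn.ModuleList([nn.Linear(in_dim, d) for d in scales])

    def forward(self, region_features):
        # region_features: (batch, 36, in_dim) — 36 detected regions per image.
        return [proj(region_features) for proj in self.projections]

# Random stand-ins for the encoded detection-box features.
regions = torch.randn(2, 36, 2048)
multi_scale = MultiScaleImageFeatures()(regions)
print([f.shape for f in multi_scale])  # [(2, 36, 128), (2, 36, 256), (2, 36, 512), (2, 36, 1024)]
```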
In one embodiment, the text multi-scale feature extraction step is as follows:
by endowing each word with a unique serial number, performing word-level tokenization on the sentence to obtain a sentence vector;
embedding the sentence vectors into a multidimensional vector space and then obtaining text feature vectors through a bidirectional GRU; the number of the bidirectional GRUs is N, and text feature vectors of N scales are output.
In specific execution: for text features, the sentence first needs to be tokenized at the word level. All words in the data set are counted, each word is assigned a unique number, and a word-number index table is built. A sentence can then be encoded as a matrix of numbers by replacing its words with their numbers; because an RNN network is used subsequently, begin and end tokens also need to be added at the start and end of each sentence. Each sentence vector is embedded into a 300-dimensional vector space and passed through a bidirectional GRU (Gated Recurrent Unit) to obtain two encoded vectors, and the text feature vector is obtained by averaging the two. To match the multi-scale image features, there are 4 bidirectional GRU networks, again with output dimensions of 128, 256, 512 and 1024.
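The text branch can be sketched in the same spirit: word indices are embedded into a 300-dimensional space and passed through one bidirectional GRU per scale, with the two directions averaged. The vocabulary size, padding index, and all names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleTextFeatures(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, scales=(128, 256, 512, 1024)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One bidirectional GRU per output scale.
        self.grus = nn.ModuleList(
            [nn.GRU(embed_dim, d, batch_first=True, bidirectional=True) for d in scales]
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices.
        embedded = self.embedding(token_ids)
        features = []
        for gru in self.grus:
            out, _ = gru(embedded)            # (batch, seq_len, 2 * d)
            fwd, bwd = out.chunk(2, dim=-1)   # split forward / backward directions
            features.append((fwd + bwd) / 2)  # average the two directions
        return features

tokens = torch.randint(1, 10000, (2, 12))
print([f.shape for f in MultiScaleTextFeatures()(tokens)])
```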
And S2, calculating the global similarity of the image multi-scale features and the text multi-scale features in multiple scales by using a self-attention mechanism method.
In one embodiment, for the visual and textual features, the global feature vectors are first computed separately using the self-attention mechanism. The most common way to compute the similarity of two vectors is the cosine similarity measure, but in order to let the model learn more complex matches, a parameterized similarity calculation is used, namely
s(x, y) = W|x − y|² / ‖|x − y|²‖₂
where x and y represent the visual and textual features respectively, W is a learnable parameter, |·|² denotes the element-wise square, and ‖·‖₂ denotes the l2 norm; the result obtained is the global similarity.
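Assuming the parameterized similarity takes the form written above, with W realized as a learnable linear layer, a minimal sketch is the following; the feature and output dimensions and all names are illustrative, and the self-attention pooling that produces the global vectors is not shown.

```python
import torch
import torch.nn as nn

class ParameterizedSimilarity(nn.Module):
    def __init__(self, feat_dim=256, sim_dim=256):
        super().__init__()
        self.W = nn.Linear(feat_dim, sim_dim, bias=False)  # learnable parameter W

    def forward(self, x, y):
        # x, y: (batch, feat_dim) global visual / textual vectors
        # (obtained elsewhere via self-attention pooling).
        diff_sq = (x - y) ** 2                                              # element-wise square
        diff_sq = diff_sq / diff_sq.norm(dim=-1, keepdim=True).clamp(min=1e-8)  # l2 normalization
        return self.W(diff_sq)                                              # global similarity representation

sim = ParameterizedSimilarity()(torch.randn(4, 256), torch.randn(4, 256))
print(sim.shape)  # torch.Size([4, 256])
```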
And S3, sequentially carrying out node level matching and structure level matching between the image multi-scale features and the text multi-scale features, and calculating the local similarity of multiple scales by using the structure level matching result.
The local similarity calculation can be divided into two parts of node level matching and structure level matching.
In one embodiment, the node level matching is a step of matching an image with a word in a sentence, including:
fusing the image features T with the text features V, and then calculating a softmax function along the visual axis to obtain a similarity matrix, namely W = softmax(λTVᵀ), where λ is a temperature parameter; the resulting similarity matrix measures the degree of similarity between each image feature and each text feature;
integrating all image features into weighted image features by taking the similarity matrix as a weight parameter;
and calculating the similarity of the weighted image features and the text features by using a cosine similarity measurement method to obtain a node-level matching result.
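A minimal sketch of the node-level matching steps above, read as a common cross-attention formulation: a softmax-normalized similarity matrix weights the image-region features for each word, and the pooled (weighted) image features are compared with the word features by cosine similarity. The temperature value, tensor shapes, and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def node_level_matching(image_feats, text_feats, lam=9.0):
    # image_feats: (n_regions, d) region features; text_feats: (n_words, d) word features.
    attn = F.softmax(lam * text_feats @ image_feats.t(), dim=-1)    # (n_words, n_regions) weights
    weighted_image = attn @ image_feats                             # weighted image feature per word
    return F.cosine_similarity(weighted_image, text_feats, dim=-1)  # node-level matching per word

scores = node_level_matching(torch.randn(36, 256), torch.randn(12, 256))
print(scores.shape)  # torch.Size([12])
```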
In one embodiment, the structure-level matching is a step of matching the image with phrases in the sentence, and the structure-level matching takes the result of the node-level matching as input to learn fine-grained phrase correspondence. The method comprises the following steps:
analyzing semantic dependencies among the words of a sentence using the Stanford CoreNLP toolkit, constructing the text features into an undirected sparse graph, and recording word nodes with a dependency relationship as adjacent nodes; the adjacency relations, with each node also adjacent to itself, are represented by an adjacency matrix.
The similarity matrix between word representations can be calculated using the softmax function and is denoted s_ij = softmax(λ·u_i·u_j), where u_i and u_j are the representations of words i and j respectively; this gives the weights of the edges in the undirected sparse graph, and the edge weights can be obtained through l2-norm computation;
and integrating the neighborhood matching vectors represented by the similarity matrix by using a graph convolution neural network, and updating the nodes in the similarity matrix.
In one embodiment, calculating the local similarity using the structure-level matching result specifically includes: feeding the result of the graph convolutional neural network into a fully connected layer and computing the l2 norm to obtain the local similarity.
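Under the assumption of a single graph-convolution layer, the structure-level matching and local-similarity steps can be sketched as follows. The adjacency used in the example (self-loops only), the temperature, and all class and variable names are placeholders; a real pipeline would obtain the adjacency from a dependency parser such as Stanford CoreNLP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureLevelMatching(nn.Module):
    def __init__(self, dim=256, lam=9.0):
        super().__init__()
        self.lam = lam
        self.gcn = nn.Linear(dim, dim)   # shared weight of one graph-convolution layer
        self.fc = nn.Linear(dim, 1)      # fully connected layer applied to updated nodes

    def forward(self, U, adjacency):
        # U: (n_words, dim) word representations carrying node-level matching information;
        # adjacency: (n_words, n_words) 0/1 dependency graph including self-loops.
        logits = self.lam * (U @ U.t())                          # pairwise word affinities
        logits = logits.masked_fill(adjacency == 0, float('-inf'))
        S = F.softmax(logits, dim=-1)                            # edge weights s_ij over neighbors
        updated = F.relu(self.gcn(S @ U))                        # integrate neighborhood vectors
        return self.fc(updated).norm(dim=0)                      # fully connected layer + l2 norm

U = torch.randn(12, 256)
A = torch.eye(12)                                                # self-loops only, as a placeholder
print(StructureLevelMatching()(U, A).shape)                      # torch.Size([1])
```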
And S4, calculating the final total similarity according to the calculated global similarity and the local similarity of the multiple scales.
In one embodiment, the global similarity and the local similarity are concatenated into a matrix and fed into a fully connected layer to obtain the total similarity at one scale; the concatenation rule is that a (1, n)-dimensional vector is appended to the (n, n)-dimensional matrix, giving an (n+1, n)-dimensional matrix. The total similarities at the multiple scales are then weighted and summed to obtain the final total similarity. In this embodiment, the weights assigned to the four scales are 0.125, 0.25, 0.5 and 1, respectively.
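The aggregation across scales can be sketched as below, using the scale weights 0.125, 0.25, 0.5 and 1 mentioned above. Because the exact reduction performed by the fully connected layer over the concatenated similarity matrix is not spelled out here, the row-mean used in the sketch is an assumption, as are all dimensions and names.

```python
import torch
import torch.nn as nn

class TotalSimilarity(nn.Module):
    def __init__(self, sim_dim=256, scale_weights=(0.125, 0.25, 0.5, 1.0)):
        super().__init__()
        self.scale_weights = scale_weights
        # One fully connected head per scale, mapping a similarity representation to a score.
        self.heads = nn.ModuleList([nn.Linear(sim_dim, 1) for _ in scale_weights])

    def forward(self, per_scale_similarities):
        # per_scale_similarities: list of (m, sim_dim) matrices, one per scale, obtained by
        # stacking the global similarity on top of the local similarities at that scale.
        total = 0.0
        for w, head, sims in zip(self.scale_weights, self.heads, per_scale_similarities):
            total = total + w * head(sims.mean(dim=0))   # per-scale total, weighted and summed
        return total

sims = [torch.randn(13, 256) for _ in range(4)]
print(TotalSimilarity()(sims))   # final total similarity (1-element tensor)
```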
And S5, repeating S2-S4 to train the supervised learning model, and updating model parameters according to the loss function to obtain the trained model.
The loss function employed in this embodiment is a triplet loss. When the query is an image I, a matched text T and an unmatched text T' are sampled; symmetrically, when the query is a text T, a matched image I and an unmatched image I' are sampled. To minimize the loss function — so that matched image-text pairs receive high similarity scores and unmatched pairs receive low scores — the loss function is designed as follows:
L(I, T) = [α − s(I, T) + s(I, T')]₊ + [α − s(I, T) + s(I', T)]₊
s(·,·) denotes the similarity calculation model established by this embodiment of the invention, [·]₊ denotes the operation max(·, 0), and α is a margin value of 0.2. The method adopted by this embodiment is supervised learning, so a large amount of supervised data — images and texts with matching labels — is required during training. Each time steps two to four are repeated, a batch of data sized to fit the GPU memory of the training machine is processed, and the model parameters are updated by back-propagating the loss function; sending all the data through the network once is recorded as one round (epoch), and training stops when the specified number of rounds is reached. In this embodiment, the number of images per training batch is set to 64, training runs for a total of 40 rounds, and a dynamic learning rate is used to prevent unstable learning: the learning rate is set to 0.002 for the first 30 rounds and 0.0002 for the last 10 rounds. After each round of training, the model is validated on a small labeled data set (the validation set), and after training ends the model that performs best on the validation set is selected as the final training result.
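A minimal sketch of the triplet loss above, computed over a batch of matched image-text pairs: the diagonal of the batch similarity matrix holds the matched pairs, and the hardest unmatched sample in the batch serves as the negative. The hardest-negative choice is an assumption — the embodiment only states that unmatched samples are drawn — and the margin follows the value α = 0.2 given above.

```python
import torch

def triplet_loss(sim_matrix, margin=0.2):
    # sim_matrix: (batch, batch) similarities; diagonal entries are the matched pairs.
    batch = sim_matrix.size(0)
    positives = sim_matrix.diag()
    mask = torch.eye(batch, dtype=torch.bool, device=sim_matrix.device)
    # Hardest unmatched text per image, and hardest unmatched image per text.
    neg_text = sim_matrix.masked_fill(mask, float('-inf')).max(dim=1).values
    neg_image = sim_matrix.masked_fill(mask, float('-inf')).max(dim=0).values
    loss_i = (margin - positives + neg_text).clamp(min=0)   # [α - s(I,T) + s(I,T')]+
    loss_t = (margin - positives + neg_image).clamp(min=0)  # [α - s(I,T) + s(I',T)]+
    return (loss_i + loss_t).mean()

print(triplet_loss(torch.randn(8, 8)))
```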
Model matching:
and S6, calculating the final total similarity of the input image and the text by using the trained model, and obtaining the matched text or image according to the final total similarity.
In one embodiment, the model is tested to simulate its behavior in actual use. Images and texts outside the training set are selected as model input, and the model outputs a matrix whose number of rows equals the number of images (or texts) and whose number of columns equals the number of texts (or images); each value in the matrix is the similarity computed between the image (or text) represented by its row and the text (or image) represented by its column. To find the text matching a given image, the row corresponding to that image can be sorted, and the text represented by the column with the largest value is the computed best match; the image matching a given text can be computed in the same way.
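A minimal sketch of how the output similarity matrix can be used for retrieval: sorting the row of a given image yields its best-matching texts, and sorting the column of a given text yields its best-matching images. The top-k interface and names are illustrative.

```python
import torch

def best_texts_for_image(sim_matrix, image_idx, top_k=5):
    # sim_matrix: (n_images, n_texts) final total similarities from the trained model.
    scores, text_ids = sim_matrix[image_idx].topk(top_k)     # rank the row of that image
    return scores, text_ids

def best_images_for_text(sim_matrix, text_idx, top_k=5):
    scores, image_ids = sim_matrix[:, text_idx].topk(top_k)  # rank the column of that text
    return scores, image_ids

sims = torch.randn(6, 10)
print(best_texts_for_image(sims, image_idx=0, top_k=3))
print(best_images_for_text(sims, text_idx=4, top_k=3))
```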
In order to demonstrate the superiority of the matching effect of the invention, a comparison is given below using concrete image-text matching results:
FIG. 4 shows two examples of matching texts to images; the five sentences under each image are the texts computed to be most similar to it. Using the matching results of the present invention, the correct text receives the highest similarity for both images, and even the incorrect fifth sentence of the second image shows that the invention successfully captures the entity "dog".
FIG. 5 shows two examples of matching images to texts. Because each sentence in the data-set labels corresponds to only one image, only the image corresponding to the label is marked as correct and the remaining four images are marked as errors; nevertheless, the "toilet", "bat", "dirty" or "wire bottom" in the remaining images can be seen to correspond to the query texts.
The image-text matching method provided by the present invention has been described in detail above. A specific example is used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined in this embodiment may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for matching image texts is characterized by comprising a model building step and a model matching step:
the step of constructing the model comprises:
s1, extracting multi-scale image features and text features respectively by using a deep neural network, wherein the image and the text are data with matching labels;
s2, calculating the global similarity of the image multi-scale features and the text multi-scale features in multiple scales by using a self-attention mechanism method;
s3, sequentially performing node level matching and structure level matching between the image multi-scale features and the text multi-scale features, and calculating local similarity of multiple scales by using a structure level matching result;
s4, calculating the final total similarity according to the calculated global similarity and local similarity at the multiple scales;
s5, repeating S2-S4 to train the supervised learning model, and updating model parameters according to the loss function to obtain a trained model;
model matching:
and S6, calculating the final total similarity of the input image and the text by using the trained model, and obtaining the matched text or image according to the final total similarity.
2. The method for image text matching according to claim 1, wherein said S1 includes the steps of image multi-scale feature extraction:
inputting the image into a Faster R-CNN model, and identifying to obtain detection frames of a plurality of image areas;
coding the image region corresponding to the detection frame by using a ResNet model to obtain the extracted visual features;
and passing the visual features through fully connected layers whose output dimensionalities span N scales, the output matrices being the multi-scale image features, where N ∈ ℕ+.
3. The method for image text matching according to claim 1, wherein said S1 includes a text multi-scale feature extraction step:
by endowing each word with a unique serial number, performing word-level tokenization on the sentence to obtain a sentence vector;
embedding the sentence vectors into a multidimensional vector space and passing them through bidirectional GRUs to obtain text feature vectors; there are N bidirectional GRUs, which output text feature vectors at N scales.
4. The image text matching method according to claim 1, wherein the node level matching in S3 is a step of matching an image with a word in a sentence, and comprises:
fusing the image features T with the text features V, and then calculating a softmax function along the visual axis to obtain a similarity matrix, namely W = softmax(λTVᵀ), where λ is a temperature parameter;
integrating all image features into weighted image features by taking the similarity matrix as a weight parameter;
and calculating the similarity of the weighted image features and the text features by using a cosine similarity measurement method to obtain a node-level matching result.
5. The method for matching image texts according to claim 1, wherein the structure level matching in S3 is a step of matching an image with a phrase in a sentence, and includes:
analyzing semantic dependency among words in a text sentence, constructing text characteristics into an undirected sparse graph, and recording word nodes with dependency as adjacent nodes;
calculating to obtain a similarity matrix between word representations, and obtaining the weight of an edge in an undirected sparse graph;
and integrating the neighborhood matching vectors represented by the similarity matrix by using a graph convolution neural network, and updating the nodes in the similarity matrix.
6. The image text matching method according to claim 1, wherein the calculating of the local similarity using the structure-level matching result in S3 specifically includes: feeding the result of the graph convolutional neural network into a fully connected layer and computing the l2 norm to obtain the local similarity.
7. The method for image text matching according to claim 1, wherein said S4 includes:
splicing the global similarity and the local similarity into a matrix, and sending the matrix to a full connection layer to obtain the total similarity under one scale;
and carrying out weighted summation on the total similarity under multiple scales to obtain the final total similarity.
8. The method for image text matching according to claim 1, wherein the model output of S6 is a matrix whose number of rows equals the number of images (or texts) and whose number of columns equals the number of texts (or images); each value in the matrix is the similarity computed between the image (or text) represented by its row and the text (or image) represented by its column.
CN202210394752.5A 2022-04-14 2022-04-14 Image text matching method Pending CN114743029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210394752.5A CN114743029A (en) 2022-04-14 2022-04-14 Image text matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210394752.5A CN114743029A (en) 2022-04-14 2022-04-14 Image text matching method

Publications (1)

Publication Number Publication Date
CN114743029A true CN114743029A (en) 2022-07-12

Family

ID=82281802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210394752.5A Pending CN114743029A (en) 2022-04-14 2022-04-14 Image text matching method

Country Status (1)

Country Link
CN (1) CN114743029A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061250A1 (en) * 2015-08-28 2017-03-02 Microsoft Technology Licensing, Llc Discovery of semantic similarities between images and text
US20210012150A1 (en) * 2019-07-11 2021-01-14 Xidian University Bidirectional attention-based image-text cross-modal retrieval method
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024045866A1 (en) * 2022-08-31 2024-03-07 Huawei Technologies Co., Ltd. System and method for cross-modal interaction based on pre-trained model
CN115730878A (en) * 2022-12-15 2023-03-03 广东省电子口岸管理有限公司 Goods import and export inspection management method based on data identification
CN115730878B (en) * 2022-12-15 2024-01-12 广东省电子口岸管理有限公司 Cargo import and export checking management method based on data identification

Similar Documents

Publication Publication Date Title
WO2023093574A1 (en) News event search method and system based on multi-level image-text semantic alignment model
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN108628828A (en) A kind of joint abstracting method of viewpoint and its holder based on from attention
CN111401928B (en) Method and device for determining semantic similarity of text based on graph data
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN114743029A (en) Image text matching method
CN115080766B (en) Multi-modal knowledge graph characterization system and method based on pre-training model
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN109933792A (en) Viewpoint type problem based on multi-layer biaxially oriented LSTM and verifying model reads understanding method
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN112632250A (en) Question and answer method and system under multi-document scene
Yang et al. Ensemble sentiment analysis method based on R-CNN and C-RNN with fusion gate
CN114254093A (en) Multi-space knowledge enhanced knowledge graph question-answering method and system
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
CN110874392A (en) Text network information fusion embedding method based on deep bidirectional attention mechanism
CN114036246A (en) Commodity map vectorization method and device, electronic equipment and storage medium
CN116720519B (en) Seedling medicine named entity identification method
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN116386895B (en) Epidemic public opinion entity identification method and device based on heterogeneous graph neural network
US20230153335A1 (en) Searchable data structure for electronic documents
CN116680407A (en) Knowledge graph construction method and device
CN115859963A (en) Similarity judgment method and system for new word and semantic recommendation
Xu et al. Research on Depression Tendency Detection Based on Image and Text Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination