CN111428801A - Image-text matching method for improving alternate updating of fusion layer and loss function - Google Patents


Info

Publication number
CN111428801A
Authority
CN
China
Prior art keywords
features
fusion
text
image
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010236904.XA
Other languages
Chinese (zh)
Other versions
CN111428801B (en)
Inventor
程述立
汪烈军
杜安钰
王德鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University
Original Assignee
Xinjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University filed Critical Xinjiang University
Priority to CN202010236904.XA priority Critical patent/CN111428801B/en
Publication of CN111428801A publication Critical patent/CN111428801A/en
Application granted granted Critical
Publication of CN111428801B publication Critical patent/CN111428801B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides an image-text matching method for improving alternate updating of a fusion layer and a loss function, which comprises the following steps: establishing a fusion layer, and extracting fused features between the image feature data and the text feature data through the fusion layer; embedding and encoding the image features and the text features based on the similarity of the fused features to the image features and the text features respectively; and optimizing the original loss and the fusion loss through an alternating update strategy based on an original loss function composed of the original features and a fusion loss function composed of the fused features. The method considers robust feature extraction and model optimization simultaneously, introduces a fusion layer to obtain robust features, and optimizes the network parameters by alternately updating the loss functions defined before and after fusion.

Description

Image-text matching method for improving alternate updating of fusion layer and loss function
Technical Field
The invention relates mainly to the intersection of computer vision and natural language processing, is applicable to cross-modal retrieval tasks in major search engines, and in particular relates to an image-text matching method for improving alternate updating of a fusion layer and a loss function.
Background
With the explosive growth of multimedia data from heterogeneous search engines and social media, image-text matching has become a dominant approach to cross-modal retrieval in recent years.
Unlike single-modality retrieval tasks such as image retrieval and text retrieval, the image-text matching task focuses on data of two modalities simultaneously and tries to find the matching relationship between image data and the corresponding text data. The ultimate goal of image-text matching is to build a bridge connecting the image and text modalities. Through this bridge, images can retrieve the corresponding texts (I2T) and texts can retrieve the corresponding images (T2I). Image-text matching requires finding the relationship between image features and text semantics. Since image and text data are two different data forms, designing a compact, robust and efficient image-text matching method is a great challenge.
To solve this problem, existing methods can be divided into two types according to the modeling manner. The first is based on the idea of classification: the matching problem is solved by optimizing a logistic regression loss, a match is labeled +1 and a mismatch is labeled -1, so that matching is converted into a binary classification problem. However, this approach is not sufficient for the complex multimodal problem and it is difficult to obtain good results; the idea does not meet the essential requirements of the image-text matching problem. The second method is based on the idea of embedding, i.e. the data of both modalities are embedded into a common representation space, and the degree of matching between an image and a text is then described by the Euclidean or cosine distance. Specifically, image features and text features are first encoded, and then a triplet ranking loss is optimized such that the distance between mutually matching image and text features is smaller than the distance between unmatched image and text features (a minimal sketch of such a loss is given below). However, finding a suitable common space for both image and text modality data is not easy. The complexity of this approach is typically high and a large amount of computational resources is required for training. Moreover, current embedding methods often ignore the relationship between image and text features and therefore do not construct the common space well.
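For illustration only, the following is a minimal sketch of the kind of bidirectional triplet ranking loss referred to above, written in PyTorch. It is not part of the claimed method; the margin value, the use of all in-batch mismatched pairs as negatives, and the tensor names are assumptions.

```python
import torch


def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge-based bidirectional ranking loss over a batch.

    img_emb, txt_emb: (B, D) L2-normalized embeddings where the i-th image
    matches the i-th sentence; every other pair in the batch is a negative.
    """
    scores = img_emb @ txt_emb.t()            # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)           # similarities of matching pairs

    # image as query, mismatched sentences as negatives
    cost_i2t = (margin + scores - pos).clamp(min=0)
    # sentence as query, mismatched images as negatives
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)

    # do not penalize the matching pairs themselves
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()
```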
Disclosure of Invention
In order to overcome the defects of the prior art, the invention, proceeding from practical application and building on the prior art, provides an image-text matching method for improving alternate updating of a fusion layer and a loss function.
The technical scheme of the invention is as follows:
the image-text matching method for improving alternate updating of a fusion layer and a loss function is characterized by comprising the following steps:
establishing a fusion layer, and extracting fused features between the image feature data and the text feature data through the fusion layer;
embedding and encoding the image features and the text features based on the similarity of the fused features to the image features and the text features respectively;
and optimizing the original loss and the fusion loss through an alternating update strategy based on an original loss function composed of the original features and a fusion loss function composed of the fused features.
Further, the extracting, through the fusion layer, of fused features between the image feature data and the text feature data includes:
extracting image features based on a Faster R-CNN model and a ResNet-101 model, and extracting text features based on a bi-GRU model.
Further, the embedding and encoding of the image features and the text features based on the similarity of the fused features to the image features and the text features respectively includes:
re-encoding the image features or the text features based on the distance of the image features and the text features from the fused features, such that in the encoding stage the distance between matched features is smaller than the distance between unmatched features.
Further, the extraction of the fusion features and the embedding of the features specifically include:
characterizing the image features as
X = {x_1, x_2, ..., x_n}
wherein x_i represents the encoding of a region within the image; representing the text features as
Y = {y_1, y_2, ..., y_m}
wherein y_i represents the encoding of a word in the sentence; and concatenating the image features and the text features to generate the fused feature:
F = X || Y
determining the effect of the fused feature on the image features and the text features based on cosine evaluation scores:
Score_im = cos(F, X)    (1)
Score_txt = cos(F, Y)    (2)
wherein Score_im represents the cosine evaluation score of the fused feature with respect to the image features, and Score_txt represents the cosine evaluation score of the fused feature with respect to the text features;
multiplying the fusion features by the cosine evaluation scores to respectively obtain preliminary fusion features related to the image and the text features, wherein the preliminary fusion features form final fusion features after passing through a normalization layer and a full-connection layer:
Fusion_to_image=dn(In(F*Score_im)) (3)
Fusion_to_text=dn(Ln(F*Score_txt)) (4)
wherein Fusion_to_image represents the fused feature for the image, and Fusion_to_text represents the fused feature for the text.
Further, the fused feature for the image and the fused feature for the text are concatenated with the image features and the text features respectively, and the embedded features are obtained through one fully connected layer and one normalization layer:
EmbeddingText=Ln(dn(Fusion_to_text||Y)) (5)
EmbeddingImage=In(dn(Fusion_to_image||X)) (6)
wherein EmbeddingText represents the text embedding feature, and EmbeddingImage represents the image embedding feature.
Further, after the feature embedding is completed, model optimization is performed, and the gradient is updated alternately using the fusion loss function and the original loss function, so that the fusion loss can descend continuously and effectively, specifically comprising:
the fusion loss function is denoted FL(x_i, y_i) and the original loss function is denoted OL(x_i, y_i); the gradient used at training step t is expressed as:
g_t = ∇FL(x_i, y_i), when t is even
g_t = ∇OL(x_i, y_i), when t is odd
wherein ∇FL(x_i, y_i) represents the gradient of the fusion loss, ∇OL(x_i, y_i) represents the gradient of the original loss, and t is the number of iteration steps during training; when the step number is even, the fusion loss function is used; when the step number is odd, the original loss function is used; the original loss is composed of image features and text features directly or indirectly related to the common space, the fusion loss is composed of the finally embedded features, and the update driven by the fusion loss is expressed as a gradient-descent step:
θ_{t+1} = θ_t - η·∇FL(x_i, y_i)
wherein θ_t denotes the network parameters at step t and η denotes the learning rate.
the invention has the beneficial effects that:
the present invention uses a fusion layer to reduce the difference between the two modality data of image and text and respect the respective characteristics of the image feature and the text feature. Extracting the relation between image features and text features based on a fusion layer, emphasizing the difference between matched features and unmatched features, extracting the image features by using a Faster R-CNN and ResNet-101 model, and extracting the text features by using a bi-GRU; then, the image features and the text features are input into the fusion layer to extract the fusion features and embed the image features and the text features, finally, a unique gradient updating method is designed to optimize the original triple loss function and the triple loss function after fusion, the mode belongs to an alternative updating strategy, and experimental cases prove the effectiveness of the method.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a block diagram of the matching system of the present invention;
FIG. 3 is a schematic view of the structure of the fusion layer of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings and specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and these equivalents also fall within the scope of the present application.
FIGS. 1, 2 and 3 show the flow and system structure of the image-text matching method for improving alternate updating of a fusion layer and a loss function provided by the invention.
In the image-text matching method, the essence of constructing a reasonable common space is to construct a multimodal feature representation space in which the distance between mutually matching features is small and the distance between unmatched features is large. In order to establish such a common space, the invention designs a fusion layer. The fusion layer first extracts fused features between the image data and the text data, which include the relationship features between the image and the text. The image or text features are then re-encoded according to their distance from the fused features, so that in the encoding stage the distance between matched features is smaller than the distance between unmatched features.
In order to address the computational complexity of the embedding approach, the invention designs a new and simple loss-function update method, which makes the network model easier to train and improves the robustness of the algorithm. Specifically, the invention retains both the original embedded features obtained without the fusion layer and the fused embedded features produced after the fusion layer. Although the fused embedded features describe the data more accurately than the original embedded features, they depend on three kinds of features, whereas the original features depend on only two; from the neural-network perspective, the triplet ranking loss composed of the fused features (the fusion loss) is therefore more complicated than the triplet ranking loss composed of the original features (the original loss). The invention therefore updates alternately with the fusion loss and the original loss, so that the complex problem is better solved with the help of the simpler one.
The fusion layer and loss function optimization method of the present invention will be described in detail below.
The present invention fuses the text features and the image features using a fusion layer to form the fused features, and then embeds and encodes the image features and the text features according to the similarity of the fused features to the image features and the text features respectively. Through the fused features, the gap between the multimodal data is reduced and a reasonable common space is established. Finally, the invention optimizes the original loss and the fusion loss through an alternating update strategy; the model structure of the invention is shown in FIG. 2.
FIG. 2 presents an overview of the model. The invention embeds image features and text features using a fusion layer and alternately updates the network with two loss functions.
In the feature extraction stage, the invention adopts the Faster R-CNN and ResNet-101 models to extract image features and uses a bi-GRU to extract text features (an illustrative text-encoder sketch is given below). The invention then inputs the image features and the text features into the fusion layer to extract the fused features and to embed the image features and the text features. The detailed structure of the fusion layer is shown in FIG. 3.
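The patent does not give the extraction code; purely as an illustration, a minimal bi-GRU sentence encoder of the kind referred to above might look as follows. The vocabulary size, embedding dimension, and the averaging of the forward and backward GRU states are assumptions, not details fixed by the invention.

```python
import torch
import torch.nn as nn


class BiGRUTextEncoder(nn.Module):
    """Encodes a tokenized sentence into per-word features Y = {y_1, ..., y_m}."""

    def __init__(self, vocab_size, word_dim=300, feat_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # bidirectional GRU; forward and backward states are averaged so that
        # each word feature keeps dimension feat_dim
        self.gru = nn.GRU(word_dim, feat_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):               # tokens: (B, m) word indices
        x = self.embed(tokens)               # (B, m, word_dim)
        out, _ = self.gru(x)                 # (B, m, 2 * feat_dim)
        fwd, bwd = out.chunk(2, dim=-1)
        return (fwd + bwd) / 2               # (B, m, feat_dim) word encodings
```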
FIG. 3 shows the structure of the fusion layer. This layer performs both the extraction of the fused features and the embedding of the features. Concat denotes the feature concatenation operation and Sim denotes the cosine similarity.
The present invention represents the image feature set as
X = {x_1, x_2, ..., x_n}
wherein x_i represents the encoding of a region within the image, and the text feature set as
Y = {y_1, y_2, ..., y_m}
wherein y_i represents the encoding of a word in the sentence. The present invention concatenates the image features and the text features to generate the fused feature
F = X || Y
The invention then needs to embed the image features and the text features based on the fused feature. The aim is not only to narrow the gap between the multimodal data but also to preserve the expressive characteristics of the image features and the text features. Therefore, the present invention determines the effect of the fused feature on the image and text features through the Cosine Evaluation Score (CES):
Score_im = cos(F, X)    (1)
Score_txt = cos(F, Y)    (2)
score im represents CES with fused features to image features and Score txt represents CES with fused features to text features. The fused features are then multiplied by CES to obtain preliminary fused features for the image and text features, respectively. The preliminary fused features form the final fused features of the invention after passing through the normalization layer and the full link layer:
Fusion_to_image=dn(In(F*Score_im)) (3)
Fusion_to_text=dn(Ln(F*Score_txt)) (4)
the present invention represents the fused features of the image as Fusion _ to _ image and the text as Fusion _ to _ text to prevent the gradient from disappearing, the present invention uses normalization layers, i.e., L ayeernorm (L n) and instancenorm (in)
EmbeddingText=Ln(dn(Fusion_to_text||Y)) (5)
EmbeddingImage=In(dn(Fusion_to_image||X)) (6)
EmbeddingText denotes the embedded text features and EmbeddingImage denotes the embedded image features.
Finally, Fusion_to_text and Fusion_to_image are concatenated with the text feature (Y) and the image feature (X) respectively, and the embedded features of the invention are obtained through one fully connected layer and one normalization layer, as in equations (5) and (6). A minimal sketch of the full fusion layer is given below.
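The following sketch is illustrative only. It assumes the image and text features have already been pooled to single vectors of a common dimension d, that the concatenated feature F is projected back to d by a linear layer before the cosine evaluation scores are computed (the patent does not spell out these dimensionalities), and it uses LayerNorm for both branches instead of InstanceNorm for the image branch, purely to keep the example simple.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionLayer(nn.Module):
    """Sketch of the fusion layer: concatenate, score, re-weight, re-embed."""

    def __init__(self, dim=1024):
        super().__init__()
        self.reduce = nn.Linear(2 * dim, dim)      # project F = X || Y back to dim (assumption)
        self.fc_img = nn.Linear(dim, dim)          # dn(...) for the image branch
        self.fc_txt = nn.Linear(dim, dim)          # dn(...) for the text branch
        self.fc_emb_img = nn.Linear(2 * dim, dim)  # embedding after re-concatenation
        self.fc_emb_txt = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)              # stands in for Ln / In

    def forward(self, x, y):                       # x, y: (B, dim) pooled features
        f = self.reduce(torch.cat([x, y], dim=-1))               # fused feature F

        score_im = F.cosine_similarity(f, x, dim=-1).unsqueeze(-1)   # eq. (1)
        score_txt = F.cosine_similarity(f, y, dim=-1).unsqueeze(-1)  # eq. (2)

        fusion_to_image = self.fc_img(self.norm(f * score_im))   # eq. (3)
        fusion_to_text = self.fc_txt(self.norm(f * score_txt))   # eq. (4)

        emb_text = self.norm(self.fc_emb_txt(torch.cat([fusion_to_text, y], -1)))    # eq. (5)
        emb_image = self.norm(self.fc_emb_img(torch.cat([fusion_to_image, x], -1)))  # eq. (6)
        return emb_image, emb_text
```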
In the optimization stage, the optimization problem of the neural network often becomes very difficult due to the presence of saddle points and local minima. The fusion loss is denoted FL(x_i, y_i) and the original loss is denoted OL(x_i, y_i). The gradient used by the invention at training step t can then be expressed as:
g_t = ∇FL(x_i, y_i), when t is even
g_t = ∇OL(x_i, y_i), when t is odd
where ∇FL(x_i, y_i) represents the gradient of the fusion loss, ∇OL(x_i, y_i) represents the gradient of the original loss, and t is the number of iteration steps during training. When the step number is even, the invention uses the fusion loss; when the step number is odd, the invention uses the original loss. The fusion loss is composed of the features that the invention finally embeds and is directly related to the composition of the common space of the invention. The original loss is composed of image features and text features that are directly or indirectly related to the common space of the invention. In effect there are two loss functions describing the composition of the same common space, but from different angles. The update driven by the fusion loss can be expressed as a gradient-descent step:
θ_{t+1} = θ_t - η·∇FL(x_i, y_i)
where θ_t denotes the network parameters at step t and η denotes the learning rate.
the loss function optimization method of the invention can enable the fusion loss to use the gradient information of the original loss when gradient descent is executed. Since both penalty functions describe the same problem, their optimization directions (i.e. gradient descent directions) are the same and the weight parameters can also be shared. Therefore, the performance of the model of the invention can be improved more stably during the training process.
The method provided by the invention is verified on Flickr30K and MSCOCO data sets.
The Flickr30K and MSCOCO datasets are widely used for image text matching and image retrieval tasks.
Flickr30K contains 31,000 images collected from the Flickr website, each with five corresponding sentences. The invention uses 1,000 images for validation, 1,000 images for testing, and the rest for training. MS-COCO contains 123,287 images, each corresponding to five textual descriptions; 113,287 images are used as the training set, 5,000 as the validation set and 5,000 as the test set. Experiments demonstrate that the method of the present invention has certain advantages over the traditional methods.
The task of image-text matching concerns the similarity between matching samples and the difference between non-matching samples, so the present invention focuses on the relationship between image and text features. The present invention uses a fusion layer to reduce the difference between the image and text modality data while respecting the respective characteristics of the image features and the text features. The relationship between the image features and the text features is extracted through the fusion layer, and the difference between matched and unmatched features is emphasized; the image features are extracted with the Faster R-CNN and ResNet-101 models and the text features with a bi-GRU. The image features and the text features are then input into the fusion layer to extract the fused features and to embed the image features and the text features. Finally, the invention designs a dedicated gradient update method to optimize the original triplet loss function and the post-fusion triplet loss function, which constitutes an alternating update strategy. Results on the Flickr30K and MSCOCO datasets show the superiority of the invention.

Claims (6)

1. An image-text matching method for improving alternate updating of a fusion layer and a loss function, characterized by comprising the following steps:
establishing a fusion layer, and extracting fused features between the image feature data and the text feature data through the fusion layer;
embedding and encoding the image features and the text features based on the similarity of the fused features to the image features and the text features respectively;
and optimizing the original loss and the fusion loss through an alternating update strategy based on an original loss function composed of the original features and a fusion loss function composed of the fused features.
2. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 1, wherein the extracting, through the fusion layer, of fused features between the image feature data and the text feature data comprises:
extracting image features based on a Faster R-CNN model and a ResNet-101 model, and extracting text features based on a bi-GRU model.
3. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 1, wherein the embedding and encoding of the image features and the text features based on the similarity of the fused features to the image features and the text features respectively comprises:
re-encoding the image features or the text features based on the distance of the image features and the text features from the fused features, such that in the encoding stage the distance between matched features is smaller than the distance between unmatched features.
4. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 1, wherein the extraction of the fused features and the embedding of the features specifically comprise:
characterizing the image features as
X = {x_1, x_2, ..., x_n}
wherein x_i represents the encoding of a region within the image; representing the text features as
Y = {y_1, y_2, ..., y_m}
wherein y_i represents the encoding of a word in the sentence; and concatenating the image features and the text features to generate the fused feature:
F = X || Y
determining the effect of the fused feature on the image features and the text features based on cosine evaluation scores:
Score_im = cos(F, X)    (1)
Score_txt = cos(F, Y)    (2)
wherein Score_im represents the cosine evaluation score of the fused feature with respect to the image features, and Score_txt represents the cosine evaluation score of the fused feature with respect to the text features;
multiplying the fusion features by the cosine evaluation scores to respectively obtain preliminary fusion features related to the image and the text features, wherein the preliminary fusion features form final fusion features after passing through a normalization layer and a full-connection layer:
Fusion_to_image=dn(In(F*Score_im)) (3)
Fusion_to_text=dn(Ln(F*Score_txt)) (4)
wherein Fusion_to_image represents the fused feature for the image, and Fusion_to_text represents the fused feature for the text.
5. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 4, wherein the fused feature for the image and the fused feature for the text are concatenated with the image features and the text features respectively, and the embedded features are obtained through one fully connected layer and one normalization layer:
EmbeddingText=Ln(dn(Fusion_to_text||Y)) (5)
EmbeddingImage=In(dn(Fusion_to_image||X)) (6)
wherein EmbeddingText represents the text embedding feature, and EmbeddingImage represents the image embedding feature.
6. The image-text matching method for improving alternate updating of a fusion layer and a loss function according to claim 5, wherein after the feature embedding is completed, model optimization is performed, and the gradient is updated alternately using the fusion loss function and the original loss function, so that the fusion loss can descend continuously and effectively, specifically comprising:
the fusion loss function is denoted FL(x_i, y_i) and the original loss function is denoted OL(x_i, y_i); the gradient used at training step t is expressed as:
g_t = ∇FL(x_i, y_i), when t is even
g_t = ∇OL(x_i, y_i), when t is odd
wherein ∇FL(x_i, y_i) represents the gradient of the fusion loss, ∇OL(x_i, y_i) represents the gradient of the original loss, and t is the number of iteration steps during training; when the step number is even, the fusion loss function is used; when the step number is odd, the original loss function is used; the original loss is composed of image features and text features directly or indirectly related to the common space, the fusion loss is composed of the finally embedded features, and the update driven by the fusion loss is expressed as a gradient-descent step:
θ_{t+1} = θ_t - η·∇FL(x_i, y_i)
wherein θ_t denotes the network parameters at step t and η denotes the learning rate.
CN202010236904.XA 2020-03-30 2020-03-30 Image-text matching method for improving alternate updating of fusion layer and loss function Active CN111428801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010236904.XA CN111428801B (en) 2020-03-30 2020-03-30 Image-text matching method for improving alternate updating of fusion layer and loss function


Publications (2)

Publication Number Publication Date
CN111428801A (en) 2020-07-17
CN111428801B CN111428801B (en) 2022-09-27

Family

ID=71551678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010236904.XA Active CN111428801B (en) 2020-03-30 2020-03-30 Image-text matching method for improving alternate updating of fusion layer and loss function

Country Status (1)

Country Link
CN (1) CN111428801B (en)



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080002893A1 (en) * 2006-06-29 2008-01-03 Luc Vincent Recognizing text in images
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
CN107871014A (en) * 2017-11-23 2018-04-03 清华大学 A kind of big data cross-module state search method and system based on depth integration Hash
CN109215097A (en) * 2018-08-08 2019-01-15 深圳市唯特视科技有限公司 A kind of single image text condition embedding grammar based on end to end joint study
CN109711529A (en) * 2018-11-13 2019-05-03 中山大学 A kind of cross-cutting federal learning model and method based on value iterative network
CN110147457A (en) * 2019-02-28 2019-08-20 腾讯科技(深圳)有限公司 Picture and text matching process, device, storage medium and equipment
CN110222560A (en) * 2019-04-25 2019-09-10 西北大学 A kind of text people search's method being embedded in similitude loss function
CN110298395A (en) * 2019-06-18 2019-10-01 天津大学 A kind of picture and text matching process based on three mode confrontation network
CN110298338A (en) * 2019-06-20 2019-10-01 北京易道博识科技有限公司 A kind of file and picture classification method and device
CN110472002A (en) * 2019-08-14 2019-11-19 腾讯科技(深圳)有限公司 A kind of text similarity acquisition methods and device
CN110889003A (en) * 2019-11-20 2020-03-17 中山大学 Vehicle image fine-grained retrieval system based on text
CN110909673A (en) * 2019-11-21 2020-03-24 河北工业大学 Pedestrian re-identification method based on natural language description

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
DEPENG WANG et al.: "Fusion layer attention for image-text matching", Neurocomputing *
KUANG-HUEI LEE et al.: "Stacked Cross Attention for Image-Text Matching", European Conference on Computer Vision *
NIDHI GOEL et al.: "Weighted semantic fusion of text and content for image retrieval", IEEE *
WANG, T. et al.: "Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking", Proceedings of the 27th ACM International Conference on Multimedia *
YU Liyan (於利艳): "Research on image-text matching methods based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology *
WANG Depeng (王德鹏): "Research on image-text matching algorithms", China Masters' Theses Full-text Database, Information Science and Technology *
HAO Zhifeng et al. (郝志峰等): "Multi-level image feature fusion algorithm for image-text matching tasks", Application Research of Computers *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861882A (en) * 2021-03-10 2021-05-28 齐鲁工业大学 Image-text matching method and system based on frequency self-adaption
CN113342168A (en) * 2021-06-10 2021-09-03 中国水利水电第七工程局有限公司 Multi-mode intelligent large-scale equipment mounting and dismounting training system
CN113342168B (en) * 2021-06-10 2023-09-22 中国水利水电第七工程局有限公司 Multi-mode intelligent large-scale equipment installation and disassembly training system

Also Published As

Publication number Publication date
CN111428801B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
CN111160264B (en) Cartoon character identity recognition method based on generation countermeasure network
CN113065577A (en) Multi-modal emotion classification method for targets
CN110298395B (en) Image-text matching method based on three-modal confrontation network
CN111177366A (en) Method, device and system for automatically generating extraction type document abstract based on query mechanism
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN111428801B (en) Image-text matching method for improving alternate updating of fusion layer and loss function
CN115982350A (en) False news detection method based on multi-mode Transformer
CN115116066A (en) Scene text recognition method based on character distance perception
CN114791958B (en) Zero sample cross-modal retrieval method based on variational self-encoder
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN115687571A (en) Depth unsupervised cross-modal retrieval method based on modal fusion reconstruction hash
CN116450834A (en) Archive knowledge graph construction method based on multi-mode semantic features
CN115238690A (en) Military field composite named entity identification method based on BERT
CN114969458A (en) Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN114817596A (en) Cross-modal image-text retrieval method integrating semantic similarity embedding and metric learning
CN113807307A (en) Multi-mode joint learning method for video multi-behavior recognition
CN116561592B (en) Training method of text emotion recognition model, text emotion recognition method and device
CN115828931B (en) Chinese and English semantic similarity calculation method for paragraph level text
CN114416914B (en) Processing method based on picture question and answer
CN115984842A (en) Multi-mode-based video open tag extraction method
CN113902764A (en) Semantic-based image-text cross-modal retrieval method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant