WO2020143137A1

WO2020143137A1 - Multi-step self-attention cross-media retrieval method based on restricted text space and system

Info

Publication number: WO2020143137A1
Application number: PCT/CN2019/085771
Authority: WO
Inventors: 王文敏; 余政
Original assignee: 北京大学深圳研究生院
Priority date: 2019-01-07
Filing date: 2019-05-07
Publication date: 2020-07-16
Also published as: CN109783657A; CN109783657B

Abstract

Provided are a multi-step self-attention cross-media retrieval method based on restricted text space and a retrieval system, first constructing the restricted text space of the relatively fixed vocabulary, then converting the unrestricted text space into the restricted text space, comprising: extracting the image features and text features of the restricted text space through a feature extraction network, the features comprising the global features, the regional feature set and the associated features; sending the extracted features into a feature mapping network, and extracting object-level sharing information between the image and the text through the multi-step self-attention mechanism, then collecting the useful information at each moment through a similarity measurement network to measure the similarity between the image and the text, and calculating the triple loss function; therefore, the multi-step self-attention cross-media retrieval based on the restricted text space is realized. The method and the system greatly improve the cross-media retrieval recall rate by introducing the multi-step self-attention mechanism and the associated features.

Description

Multi-step self-attention cross-media retrieval method and system based on restricted text space

Technical field

The invention relates to the technical field of computer vision and information retrieval, in particular to a multi-step self-attention cross-media retrieval method and system based on a limited text space.

Background technique

In recent years, with the rapid development of information technology, multimedia data on the Internet is getting richer and richer, and multimedia data of different modalities (text, images, audio, video, etc.) can be used to express similar content. In order to meet the increasing multimedia retrieval needs of users, cross-media retrieval tasks have been proposed to find a homogeneous semantic space (public space, text space, image space), making the similarity between the underlying heterogeneous multimedia data Can be measured directly. More precisely, the core problem of this cross-media retrieval task can be subdivided into two sub-problems.

The first sub-problem is how to learn to get an effective representation of the underlying features. In the field of cross-media retrieval, most traditional methods only represent images and text through global features, such as the output of the last fully connected layer of a convolutional neural network (CNN) or the hidden layer output of the last time of a recurrent neural network (RNN). Global features contain more redundant information, also known as exclusive information within the modal. This information exists only within the modalities and is not shared among the modalities. This also leads to a decline in the quality of cross-media retrieval. Therefore, some scholars try to extract the local features of image and text (image object area, text word), and then find the shared information between the two through the attention mechanism, thereby reducing the impact of redundant features. However, most existing methods based on the attention mechanism only consider the shared information at the object level between the image and the text, and do not consider the interactive information between the objects.

The second sub-problem is how to find a suitable isomorphic feature space. There are roughly three options for isomorphic space, namely public space, text space and image space. Existing methods usually nonlinearly map heterogeneous features to a latent common space, so that the similarity between different modal data can be directly measured. However, compared with the pixel-based features of images, text features are easier to understand by humans and the information transmitted is more accurate. For example, given an image, the human brain will first condense description sentences based on its content, and then retrieve semantically similar text based on these descriptions. Therefore, in order to simulate the cognitive mode of the human brain, this method explores the feasibility of cross-media retrieval in the text space. Existing cross-media retrieval methods based on text space do not consider the cognitive process of the human brain for images, and most of them use Word2Vec space as the final text space. The characteristics of the image in this space are obtained by combining the category information of the objects in the image. Therefore, this feature will lose the rich interactive information contained in the image. This also shows that for cross-media retrieval, the Word2Vec space is not an effective text feature space.

The text space is essentially a vector space, composed of a series of different Chinese characters and words. For Chinese, there is no accurate number of Chinese characters, about 100,000 (Beijing Guoan Consulting Equipment Co., Ltd. has 91251 Chinese characters from the source of Chinese character database income). At the same time, a large number of new words have emerged to keep the size of the text space growing. In addition to Chinese, similar situations also appear in other languages including English. According to incomplete statistics, the number of existing English words has exceeded one million, and is still growing at a rate of several thousand every year. Therefore, natural language is divergent in nature. Based on this divergent characteristic, it is almost impossible to construct a complete unrestricted text space.

However, in most cases, people only need to master some of the Chinese characters and words to meet their daily needs. For example, many English linguists believe that about 3650 of the most basic common English vocabulary can complete more than 95% of the tasks of expressing thoughts and communication; the "Common Dictionary of Modern Chinese" jointly released by the original National Education Commission in November 1987 proposed that There are 2500 commonly used characters in modern Chinese, accounting for more than 99% of the daily use of Chinese.

In recent years, the attention mechanism has begun to attract more and more researchers' attention. Attention mechanism was initially used in "sequence-sequence" model, such as machine translation and image caption. It contains three commonly used forms: 1) additive attention mechanism, 2) product attention mechanism and 3) self-attention mechanism. If the additive or product self-attention mechanism is used in the cross-media retrieval algorithm, the key information of the image and text cannot be fixed, which results in the uncertainty of the image and text encoding and affects the practical application value of the algorithm. For example, given a data set containing 10 images and 10 texts corresponding to the images one by one, then the additive or product self-attention mechanism will generate 10 different types of focus information for each image and each text (Corresponding to 10 texts and 10 images, respectively), that is, the key information of an image (text) is determined by the corresponding text (image). However, considering the practical application value of cross-media retrieval algorithms, the model must ensure the uniqueness of the encoding of images and text. Therefore, the self-attention mechanism is more suitable for cross-media retrieval. The self-attention mechanism can guide the image and text to find the key information inside the data, and ensure it is fixed.

Summary of the invention

In order to overcome the above problems in the prior art, the present invention proposes a multi-step self-attention cross-media retrieval method and retrieval system based on limited text space. This method learns a limited text space by simulating human brain cognition, and introduces a multi-step self-attention mechanism and associated features, which greatly improves the retrieval recall rate. In addition to objective evaluation indicators (retrieval recall rate), the present invention also builds an online retrieval Demo system. By entering text or uploading images, the Demo can return the corresponding search results, thereby further verifying the effectiveness of the present invention.

In the present invention, the restricted text space refers to a text space with a relatively fixed vocabulary, which is relative to an unrestricted text space. The present invention constructs a restricted text space of a relatively fixed vocabulary, and then converts the unrestricted text space into a restricted text space, thereby ensuring the convergence of the algorithm. The ability to understand based on limited text space is affected by the size of the vocabulary, that is, the larger the vocabulary, the stronger the understanding, and the smaller the vocabulary, the weaker the understanding. After experiments, it was found that the number of words around 3000 can already meet the basic needs of cross-media retrieval, and blindly increasing the number of words will not only not improve the retrieval performance, but also increase the complexity of the algorithm in time and space. The present invention extracts the interaction information between objects through image captioning (image captioning), which is also referred to as correlation information (relation information). The image caption model is essentially an "encoding-decoding" model, that is, given an input image, it will first encode it into a feature vector through the encoder, and then translate the feature vector into an appropriate description through the decoder text. Since the generated description text contains not only the object category information (nouns) in the image, but also the interaction information (verbs, adjectives) between the objects, the related information can be represented by the feature vector generated by the encoder. The representative algorithm of the image caption task is NIC (Neural Image Captioning).

The method of the present invention is used to extract regional features (image object regions, text words) of images and text, and find shared information between the two through a multi-step self-attention mechanism, thereby reducing the interference of redundant information. In addition to the regional features of images and text, the present invention regards the global features of the two as the global prior knowledge of the multi-step self-attention mechanism, which is used to achieve the rapid positioning of key information, and can achieve more at a faster training speed. Good experiment results.

Aiming at the problem of how to find a suitable isomorphic feature space, the present invention maps the underlying features of the image to a "restricted text space", which contains not only the category information of objects, but also the rich interaction information between objects.

The multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention contains a total of three modules, namely a feature extraction network, a feature mapping network and a similarity measurement network. For the first sub-problem (how to learn to obtain an effective low-level feature representation), the feature extraction network is used to extract global features, regional features, and associated features of images and text. The extraction of related features is realized by the representative algorithm NIC of the image caption model; for the second sub-problem (how to find a suitable isomorphic feature space), the feature mapping network is used to learn to get the restricted text space. With the help of the multi-step self-attention mechanism, the feature map network can selectively focus on part of the shared information at different times, and extract the object-level features of images and text by summarizing the useful information at each time. In addition, it also fuses the object-level features of the image with the associated features through the feature fusion layer and maps it to the restricted text space. In order to obtain better experimental results at a faster training speed, the present invention regards the global features of images and text as global prior knowledge of a multi-step self-attention mechanism, which is used to achieve the rapid positioning of key information. Finally, the similarity measurement network measures the final similarity between images and text by summarizing useful information at each moment. The present invention achieves a better recall rate result in the cross-media retrieval classic data set, and also achieves a good performance from a subjective perspective.

For the online retrieval Demo system, the present invention is implemented by MVC (Model View Controller) model design. Among them, Model corresponds to the multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention, which is the core sorting algorithm; View corresponds to the front-end page, which is used to realize the input of queries (images or text) and the display of retrieval results ; Controller corresponds to the background controller, used to read query input from the front end and send data to the core sorting algorithm.

The technical solutions provided by the present invention are:

Multi-step self-attention cross-media retrieval method based on restricted text space, including feature extraction network, feature mapping network and similarity measurement network; feature extraction network is used to extract global features, regional feature sets and associated features of images and text; Features are further fed into the feature mapping network, and as many object-level shared information as possible between images and text is extracted through a multi-step self-attention mechanism. Because the multi-step self-attention mechanism does not consider the interaction information between different objects, the feature mapping network fuses the shared features of the object level and the associated features through the feature fusion layer, and maps them to the restricted text space; finally, the similarity measurement network Measure the final similarity between the image and the text by summing up the useful information at each moment, and calculate the triple loss function; thereby achieving multi-step self-attention cross-media retrieval based on limited text space;

Specifically, assume that the data set D = {D ₁ , D ₂ ,..., D _I } has a total of I samples, and each sample D _i includes a picture i and a description text s, that is, D _i =(i,s) , Each piece of text is composed of multiple (such as 5) sentences, and each sentence independently describes the matching picture; the data set is used to learn the restricted text space; for the data set D, the specific implementation steps of the present invention as follows:

1) Extract the regional features of the image and text in D through the feature extraction network.

For images, the pre-trained VGG (Neural Network Structure proposed by Visual Geometry Group) is used to extract the global features of the image and the regional feature set of the image; NIC is used to extract the related features that contain rich interactive information between objects. For the text, the present invention uses a bidirectional LSTM (Bidirectional Long Short Term Memory networks) network to extract the global features of the text and the set of regional features of the text. The bidirectional LSTM network is not pre-trained, and its parameters are updated synchronously with the parameters of the feature map network;

2) Send the feature extracted in step 1) to the feature map network.

First, focus on object-level shared information between the image and text area features as much as possible through the multi-step self-attention mechanism; second, achieve the fusion of object-level shared features and related features through the feature fusion layer, and map to the restricted Text space. In order to obtain better experimental results at a faster training speed, the present invention regards the global features of images and text as global prior knowledge of a multi-step self-attention mechanism, which is used to quickly locate key information;

3) The similarity measurement network measures the final similarity between the image and the text by summarizing the useful information at each moment, and calculates the triple loss function.

4) Finally, the present invention updates the network parameters by optimizing the triplet loss function.

Among them, the similarity measurement function is defined as:

sim(v,u)=v·u

Among them, v and u represent the characteristics of the image and text in the restricted text space; the similarity ^{sk of the} two at ^k is calculated by Equation 7:

s ^k = v ^k · u ^k (Equation 7)

By summarizing the useful information at time K, the final similarity S between the image and the text is measured and expressed as Equation 8:

5) Calculate the triple loss function, and update the network parameters by optimizing the triple loss function;

The triple loss function is expressed as Equation 9:

Where s _p is the p-th unmatched text of the input image i; _ip is the p-th unmatched image of the input text s; m is the minimum distance interval, and the value is 0.3; sim(v, t) is the similarity Measurement function.

During the specific implementation of the present invention, the effectiveness of the present invention is further verified by implementing an online multi-step self-attention cross-media retrieval Demo system based on limited text space. Among them, the front-end page is implemented by HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript; the back-end controller is implemented by Tornado tool.

Compared with the prior art, the beneficial effects of the present invention are:

The invention provides a multi-step self-attention cross-media retrieval method based on limited text space, which includes a feature extraction network, a feature mapping network and a similarity measurement network. The feature extraction network is used to extract global features, regional feature sets, and associated features of images and text; secondly, features are further fed into the feature mapping network, and as many objects as possible are extracted between the image and text through a multi-step self-attention mechanism Level of shared information. Because it does not take into account the interaction information between different objects, the feature mapping network fuses the shared features of the object level with the associated features through the feature fusion layer and maps them to the restricted text space. In order to obtain better experimental results at a faster training speed, the present invention regards the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, which is used to achieve the rapid positioning of key information; finally, similar The sex measurement network measures the final similarity between the image and the text by summarizing the useful information at each moment, and calculates the triple loss function. In addition to objective evaluation indicators (retrieval recall rate), the present invention additionally builds an online retrieval demo. By entering text or uploading images, the Demo can return the corresponding search results, thereby verifying the effectiveness of the present invention from a subjective perspective. Specifically, the present invention has the following technical advantages:

(1) Based on the restricted text space, the present invention proposes a novel feature mapping network by means of a multi-step self-attention mechanism. It can selectively focus on some shared information at different times, and measure the final similarity between images and text by summing up useful information at each time;

(2) The present invention extracts the correlation feature of the rich interactive information between different objects contained in the image through the image caption model, which is used to make up for the defect of sharing information at the object level;

(3) In order to obtain better experimental results at a faster training speed, the present invention regards the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, which is used to achieve the rapid positioning of key information.

(4) In addition to objective evaluation indexes (retrieval recall rate), the present invention additionally builds an online retrieval demo. By entering text or uploading images, the Demo can return the corresponding search results, thereby verifying the effectiveness of the present invention from a subjective perspective.

BRIEF DESCRIPTION

The present invention has 6 drawings in total, in which:

Figure 1 defines the concept of shared information and related information at the object level;

Given two different image-text pairs, the shared information at the object level between the two images and text is similar, such as "man", "surfboard" and "wave". However, the interaction information between objects is different, such as how men surf ("jump down" vs "swipe towards").

2 is a flow block diagram of the method provided by the present invention;

A and B represent image and text processing branches respectively; for images, CNN (Convolutional Neural Network, Convolutional Neural Network) is a 19-layer VGG model;

Represents the regional feature set of image i;

Is the associated feature extracted by the image caption model NIC; v _global is the global feature of the image;

Image sharing feature representing time k;

Image context information representing time k; feature fusion layer fusion

Associated features

And map to the restricted text space, so as to get the image feature output v ^k at time ^k ; for text, BLSTM is a bidirectional LSTM network;

Represents the regional feature set of the text s; u _global is the global feature of the text;

Text context information representing time k. S is the final similarity between the image and the text.

Figure 3 is the structure of the feature mapping network of the present invention;

C and D represent the self-attention mechanism of text and image respectively; the attention layer is used to calculate the feature weights of different regions of the image and text (

with

); The weighted average layer performs weighted averaging on the regional feature sets of images and text by different weights to obtain the shared features (v ^k and u ^k ) at the current moment;

Indicates that the context information is updated through the identity connection (dashed line).

Figure 4 shows the effect of global prior knowledge on the model convergence speed under the Flickr8K data set;

Among them, “MSAN with prior” means a model that introduces global prior knowledge, and “MSAN w/oprior” means a model that does not use global prior knowledge.

Figures 5-6 show the main pages of the online search Demo, which are the text search image page and the image search text page screenshot, respectively.

detailed description

The present invention will be further described below by way of examples with reference to the accompanying drawings, but does not limit the scope of the present invention in any way.

The invention provides a multi-step self-attention cross-media retrieval method based on limited text space, which includes a feature extraction network, a feature mapping network and a similarity measurement network. The feature extraction network is used to extract global features, regional feature sets, and associated features of images and text; secondly, features are further fed into the feature mapping network, and as many objects as possible are extracted between the image and text through a multi-step self-attention mechanism Level of shared information. However, it does not consider the interactive information between different objects. As shown in Figure 1, for two different image-text pairs, the shared information at the object level between the two images and text is similar, such as "man", "surfboard" and "wave". However, the interactive information between objects is different, such as how men surf ("jump down" and "swipe towards"). Therefore, the feature mapping network fuses the shared features of the object level with the associated features through the feature fusion layer and maps them to the restricted text space. In order to obtain better experimental results at a faster training speed, the present invention regards the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, which is used to achieve the rapid positioning of key information; finally, similar The sex measurement network measures the final similarity between the image and the text by summarizing the useful information at each moment, and calculates the triple loss function. In addition to objective evaluation indicators (retrieval recall rate), the present invention additionally builds an online retrieval demo. By entering text or uploading images, the Demo can return the corresponding search results, thereby verifying the effectiveness of the present invention from a subjective perspective. Next, we will describe in detail the principle and structure of feature extraction network, feature mapping network, similarity measurement network and online retrieval Demo.

1. Feature extraction network

As shown in Part A of Figure 2, given the input image i, the output of the last fully connected layer of VGG is used to extract the 4096-dimensional global feature v _{global of the image} . Since the multi-layer convolution and pooling operations are equivalent to extracting the features of the image region, the present invention uses the output of the last pooling layer (pool5) of VGG as the feature set of each region of the image

The output of this layer contains 512 feature maps (feature map), and the size of each feature map is 7×7. In other words, the total number of image areas is 49, and each area is represented by a 512-dimensional feature vector. For the associated features, the present invention adopts the representative algorithm NIC of the image caption task, which is used to extract 512-dimensional associated features containing rich interaction information between objects

During the training process, the parameters of VGG and NIC are fixed. VGG is pre-trained by ImageNet; NIC is pre-trained by cross-media retrieval data set.

For the text s=(s ₀ ,s ₁ ,...,s _N ), we use a bidirectional LSTM network to extract the features of each word:

Where x _t represents the input word at time t;

with

Represent the output of the hidden layer of forward LSTM and backward LSTM at time t respectively;

Represents the d-dimensional feature output of the currently input word. Therefore, as shown in Part B of Figure 2, the set of regional features of the text can be expressed as

The global feature u _global can be regarded as the output of the d-dimensional hidden layer at the last moment of the bidirectional LSTM network. Among them, the dimension d not only represents the feature dimension of the text, but also represents the dimension of the restricted text space. During the experiment, the value of d is 1024.

2. Feature Map Network

For image and text, the feature mapping network uses visual self-attention mechanism and text self-attention mechanism, as shown in Figure 3.

1) Visual self-attention mechanism

As shown in part D of FIG. 3, the regional feature set of a given image i

Image sharing features at time k

Extracted by formula 2:

among them,

Context information representing the image at time k-1;

Represents the feature weight of the nth block in image i;

It is obtained by weighted average of the features of different image regions; visual self-attention function

Used to calculate the weight of each image area;

with

Represents the trainable parameters of the visual self-attention function, the size is 512 × 512.

Next, feature fusion layer fusion

Associated features

And map to the restricted text space, so as to get the image feature output v ^k at time ^k :

Among them, W ^k is the

The parameters of the fully connected layer mapped to the restricted text space are 512×1024; BN represents the batch normalization layer; ReLU represents the activation function. v ^k contains not only the image-sharing features at the object level, but also rich correlation features between objects.

2) Text self-attention mechanism

As shown in Part C of Figure 3, the set of word features for a given text

The text sharing feature u k at time ^k is calculated by Equation 4:

among them,

Context information representing the k-1 time of the text;

Represents the feature weight of the nth word in the text s; u ^{k is} obtained by weighted average of the features of different words; text self-attention function

Used to calculate the weight of each word feature;

with

Represents the trainable parameters of the text self-attention function, and the size is 1024×512.

3) Context information

Context information mentioned in steps 1) and 2)

with

Ability to encode information that has been followed by the self-attention network. Inspired by the identity connection of ResNet (deep residual network), the update formula for defining context information in the present invention is shown as Equation 5:

Where k∈{1,...,K}, V_att and T_att represent visual self-attention and text self-attention functions, respectively. Identical connections can control the flow of contextual information in the network and retain useful information.

In order to obtain better experimental results at a faster training speed, the present invention uses the initial context information

with

Initialized to global features of images and text, as shown in Equation 6:

Among them, v _global and u _global represent global features of images and text, respectively, and can also be called global prior knowledge. At this time, the global feature can be regarded as the global reference information of the multi-step self-attention mechanism, which is used to quickly locate the key information.

Finally, the present invention implements a multi-step self-attention mechanism step by step at time K so that it can find as much shared information between images and text as possible at any time k. For different data sets, the value of K is different. On the Flickr8K data set, K is set to 1; on the Flickr30K and MSCOCO data sets, K is set to 2. The specific experimental results are shown in the subsequent experimental analysis section. The parameter K represents the total number of cycles of the multi-step self-attention mechanism. It can also be expanded in time, which can be seen as a multi-step self-attention mechanism in turn at different times k.

3. Similarity measurement network

The invention defines a similarity measurement function sim(v,u)=v·u, where v and u represent the features of the image and text in the restricted text space, respectively. The similarity s ^k between the two at time ^k can be obtained by Equation 7:

s ^k = v ^k · u ^k (7)

Then, measure the final similarity S between the image and the text by summarizing the useful information at time K:

Finally, the triple loss function is used to update the network parameters, as shown in Equation 9.

Where s _p is the p-th unmatched text of the input image i; _ip is the p-th unmatched image of the input text s; m is the minimum distance interval, and the value is 0.3; sim(v, t) is the similarity Metric function; unmatched samples are randomly selected from the data set in each training cycle. During the training process, we updated the network parameters through the Adam optimizer, and the fixed learning rate was 0.0002 in the first ten iterations. As the training progressed, the learning rate decreased to 0.00002 in the last ten iterations.

4. Online Demo

The realization of online search Demo is mainly realized by Tornado tool. Tornado is an open source version of web server software that can handle thousands of connections per second, and is quite fast. Therefore, Tornado is an ideal framework for real-time web services.

Tornado functions as a controller in the MVC framework. Its functions include: 1) query reading; 2) extract the features of the query; 3) extract the features of all the data to be retrieved in the database; 4) send the data to the model. In order to ensure Demo's response speed, all the features of the data to be retrieved in the database have been pre-loaded into the memory.

The multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention is equivalent to the model in the MVC framework, and is also called the core sorting algorithm. Its main task is to find similar query data quickly and accurately and send it to the controller. In the case of a small amount of data, the easiest way is linear scanning, which is to calculate the distance between each sample and the query in the data set in turn. However, as the amount of data continues to increase, the time consumption of linear scanning also gradually increases, and the response speed of Demo will also slow down. Since the actual data generally shows a cluster-like clustering form, we first establish a clustering center through a clustering algorithm (such as K-means), and then compare all the data in the cluster by finding the clustering center closest to the query Get similar data. Based on this principle, we chose Faiss, Facebook's open source framework, to achieve accurate and fast queries. Faiss is a framework that provides efficient similarity search and clustering for dense vectors. Before querying, Faiss needs to cluster all the data in the data set to form different data clusters.

Finally, the front-end view (View) in the MVC framework is equivalent to the search page in mainstream search engines, mainly through HTML, CSS and JavaScript technologies. The online search Demo contains three pages: the main page, the text search image page (Figure 5) and the image search text page (Figure 6). The main page contains a text input box, camera icon and "Search" button. The user first enters text through the text input box or uploads an image by clicking the camera icon, and then clicks the "Search" button to start the search. For an input text "Arestaurant has modern tables", Figure 5 shows the results of the corresponding text retrieval image; for an image named "COCO_train2014_000000000049.jpg", Figure 6 shows the corresponding image retrieval text result. The search results are displayed in order of relevance, that is, from top to bottom, from left to right, and the relevance of the sample decreases. In order to ensure the beauty of the search results display page, the search box in Figure 5 and Figure 6 has been moved to the upper left corner, the function remains unchanged.

Tables 1 to 3 show the recall rate results of the present invention on the Flickr8K, Flickr30K and MSCOCO data sets. Img2Txt represents image-to-text retrieval, and Txt2Img represents text-to-image retrieval. In order to evaluate the retrieval effect, we followed the standard ranking metrics and used Recall@K. Recall@K measures the retrieval accuracy by calculating the probability that the correctly matched data is ranked in the top K (K=1, 5, 10) retrieval results; the larger the value of Recall@K, the more accurate the retrieval results. The figure lists the effects of the present invention compared with other existing advanced algorithms, including NIC (Neural Image Captioning), m-CNN _ENS (Multimodal Convolutional Neural Networks), HM-LSTM (Hierarchical Multimodal LSTM), LTS (Limited Text Space ), DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embeddings), VSE++ (Improving Visual-Semantic Embeddings), sm-LSTM (Selective Multimodal LSTM). In addition, we designed three comparative models based on the present invention:

●MSAN-obj does not use associated features

Only the object-level shared information between images and text is considered;

● MSAN-glob does not use a multi-step self-attention mechanism, only expressing images and text through global features;

●MSAN includes related features

And a complete model of a multi-step self-attention mechanism.

Table 1 Example of recall rate results on Flickr8K dataset

Table 2 Example of recall rate results on Flickr30K dataset

Table 3 Example of recall rate results on the MSCOCO dataset

As can be seen from Tables 1 to 3, MSAN has achieved the best results based on the VGG feature at this stage compared with DSPE, HM-LSTM, DAN and other better performing methods. In addition, MSAN has better experimental results than MSAN-obj and MSAN-glob, proving the effectiveness of the multi-step self-attention mechanism and associated features.

Table 4 The effect of different K values on the cross-media retrieval effect of the embodiment

Table 4 shows the effect of the cycle number K of the multi-step self-attention mechanism on the Flickr8K and Flickr30K datasets on the experimental results. From the table, we can see that when K=1, 2, MSAN achieved the best experimental results on the Flickr8K and Flickr30K data sets, respectively. The larger the value of K, the more parameters required for the multi-step self-attention mechanism, and the more likely it is to cause overfitting, thereby reducing the retrieval recall rate. Therefore, on the Flickr8K dataset, K is set to 1; on the Flickr30K and MSCOCO datasets, K is set to 2.

Table 5 The influence of global prior knowledge on the recall rate results of the examples

Table 5 shows the influence of global prior knowledge on experimental results. We designed two comparison models: "MSAN with priority" and "MSAN w/oprior". Among them, "MSAN with prior" indicates the MSAN model using global prior knowledge, and "MSAN/prior" indicates the MSAN model without global prior knowledge. It can be seen from Table 5 that the retrieval recall rate of "MSAN with priority" is higher than that of "MSAN w/oprior", thus verifying the effectiveness of the global prior knowledge. Figure 4 shows the trend graph of the loss function of the "MSAN with priority" and "MSAN w/oprior" models under the Flickr8K data set. Among them, "MSAN with priority" has a faster convergence rate than "MSAN w/oprior", and the loss function when the model converges is smaller. Therefore, due to the introduction of global prior knowledge, the present invention can achieve better retrieval results at a faster convergence speed.

Figures 5 and 6 show the text retrieval image and image retrieval text of Demo online retrieval, respectively. From a subjective point of view, although the displayed results do not necessarily include true matching samples, the multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention can still find results as similar as possible to the query, satisfying people Demand. This also validates the effectiveness of the present invention from a subjective perspective.

It should be noted that the purpose of the disclosed embodiments is to help further understand the present invention, but those skilled in the art can understand that various replacements and modifications are possible without departing from the spirit and scope of the present invention and the appended claims. of. Therefore, the present invention should not be limited to the contents disclosed in the embodiments, and the scope of protection claimed by the present invention is subject to the scope defined by the claims.

Claims

A multi-step self-attention cross-media retrieval method based on restricted text space, first constructs a restricted text space, and then converts an unrestricted text space into a restricted text space, where the restricted text space refers to A relatively fixed vocabulary text space; the search method includes:

Extract image features and text features through a feature extraction network, where the features include global features, regional feature sets, and associated features;

Send the extracted features to the feature mapping network, and extract the object-level shared feature information between the image and the text through a multi-step self-attention mechanism;

The feature mapping network fuses the shared features of the object level with the associated features through the feature fusion layer and maps them to the restricted text space;

Then, through the similarity measurement network, the useful information at each moment is summarized, the similarity between the image and the text is measured, and the triple loss function is calculated;

Thus, multi-step self-attention cross-media retrieval based on limited text space is realized.
The multi-step self-attention cross-media retrieval method based on the restricted text space according to claim 1, characterized in that the restricted text space is represented by a data set D, and let the data set D = {D 1 , D 2 ,..., D I } A total of I samples, each sample D i includes a picture i and a description text s, that is D i = (i, s), each text consists of multiple sentences, each sentence describes the phase independently Matching pictures; the multi-step self-attention cross-media retrieval method based on restricted text space includes the following steps:

1) Extract the regional features of the image and text in D through the feature extraction network;

For the image, the global features of the image and the set of regional features of the image are extracted by the pre-trained neural network structure VGG; the associated features of the interactive information between the objects are extracted by the image caption model method NIC;

For text, the unpre-trained bidirectional long-term short-term memory recurrent neural network LSTM is used to extract the global features of the text and the regional feature set of the text; the parameters of the LSTM and the parameters of the feature mapping network are updated synchronously;

2) Send the feature extracted in step 1) to the feature map network;

First, pay attention to the shared information at the object level between the image and text area features through a multi-step self-attention mechanism;

Second, through the feature fusion layer to achieve the fusion of shared features and related features at the object level, and map to the restricted text space;

Use the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism to achieve the rapid positioning of key information;

3) The similarity measurement network summarizes useful information at each moment to measure the final similarity between the image and the text; the similarity measurement function is defined as:

sim(v,u)=v·u

Among them, v and u represent the characteristics of the image and text in the restricted text space; the similarity sk of the two at k is calculated by Equation 7:

s k = v k · u k (Equation 7)

By summarizing the useful information at time K, the final similarity S between the image and the text is measured and expressed as Equation 8:

4) Calculate the triple loss function, and update the network parameters by optimizing the triple loss function;

The triple loss function is expressed as Equation 9:

Where s p is the p-th unmatched text of the input image i; ip is the p-th unmatched image of the input text s; m is the minimum distance interval, and the value is 0.3; sim(v, t) is the similarity Measurement function.
The multi-step self-attention cross-media retrieval method based on limited text space according to claim 2, wherein in step 1), for text s=(s 0 ,s 1 ,...,s N ), bidirectional The LSTM network extracts the features of each word, specifically expressed as Equation 1:

Among them, xt represents the input word at time t;
with
Represent the output of the hidden layer of forward LSTM and backward LSTM at time t respectively;
D-dimensional feature output representing the current input word;

The set of regional features of the text is expressed as
The output of the d-dimensional hidden layer at the last moment of the bidirectional LSTM network is taken as the global feature u global ; where the dimension d is both the feature dimension of the text and the dimension of the restricted text space.
The multi-step self-attention cross-media retrieval method based on limited text space according to claim 2, characterized in that in step 1), the input image v is used to extract the 4096-dimensional global features of the image using the output of the last fully connected layer of VGG ,, denoted as v global ; the output of the pooling layer pool5 of VGG as the feature set of each region of the image
The output of this layer contains 512 feature maps, the size of each feature map is 7×7, the total number of image regions is 49, and each region is represented by a 512-dimensional feature vector.
The multi-step self-attention cross-media retrieval method based on limited text space according to claim 4, characterized in that NIC is used to extract interactive information between objects to obtain 512-dimensional correlation features
During the training of NIC, the parameters of VGG and NIC are fixed.
The multi-step self-attention cross-media retrieval method based on limited text space according to claim 1, characterized in that the feature mapping network adopts a visual self-attention mechanism for images; the specific operations are as follows:

Set of regional features for a given image i
Extract the image sharing feature at time k by Equation 2

among them,
Context information representing the image at time k-1;
Represents the feature weight of the nth block in image i;
It is obtained by weighted average of the features of different image regions; visual self-attention function
Used to calculate the weight of each image area;
with
Representable trainable parameters of visual self-attention function;

Fusion using feature fusion layers
Associated features
And map to the restricted text space, so as to get the image feature output v k at time k , expressed as Equation 3:

Among them, W k is the
The parameters of the fully connected layer mapped to the restricted text space; BN represents the batch normalization layer; ReLU represents the activation function; v k contains both the image sharing features at the object level and the associated features between objects.
The multi-step self-attention cross-media retrieval method based on limited text space according to claim 1, characterized in that the feature mapping network adopts a text self-attention mechanism for text; the specific operations are as follows:

Set of word features for a given text s
The text sharing feature u k at time k is calculated by Equation 4:

among them,
Context information representing the k-1 time of the text;
Represents the feature weight of the nth word in the text s; u k is obtained by weighted average of the features of different words; text self-attention function
Used to calculate the weight of each word feature;
with
Represents the trainable parameters of the text self-attention function.
The multi-step self-attention cross-media retrieval method based on the restricted text space according to claim 6 or 7, wherein the context information is used
with
Encode the information that the self-attention network has paid attention to; the update formula that specifically defines the context information is shown in Equation 5:

Among them, k∈{1,…,K}, K represents the total number of cycles of the multi-step self-attention mechanism; V_att and T_att represent the visual self-attention and text self-attention functions, respectively.
The multi-step self-attention cross-media retrieval method based on limited text space according to claim 8, wherein the global features of the image and the text are used as initial context information, respectively
with
As formula 6:

Among them, v global and u global respectively represent the global features of the image and text, that is, global prior knowledge; global features are used as global reference information for the multi-step self-attention mechanism, which is used to quickly locate key information.
A multi-step self-attention cross-media retrieval system based on limited text space implemented by the multi-step self-attention cross-media retrieval method according to claim 1 or 2, using a model-view-controller MVC framework, in which the Model adopts the multi-step self-attention cross-media retrieval method based on limited text space as the core sorting algorithm; the View corresponds to the front-end page, which is used to realize the input of query images or text and the display of retrieval results ; Controller Controller corresponds to the background controller, which is used to read the query input from the front end and send data to the core sorting algorithm.
The multi-step self-attention cross-media retrieval system based on limited text space according to claim 10, wherein the front-end page is implemented by hypertext markup language HTML, cascading style sheet CSS and JavaScript; the background control The device is realized by Tornado tool.