WO2020143137A1 - Multi-step self-attention cross-media retrieval method and system based on restricted text space - Google Patents

Multi-step self-attention cross-media retrieval method and system based on restricted text space

Info

Publication number
WO2020143137A1
WO2020143137A1 (application PCT/CN2019/085771, CN2019085771W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
features
image
feature
attention
Prior art date
Application number
PCT/CN2019/085771
Other languages
French (fr)
Chinese (zh)
Inventor
王文敏 (WANG Wenmin)
余政 (YU Zheng)
Original Assignee
北京大学深圳研究生院 (Peking University Shenzhen Graduate School)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学深圳研究生院 (Peking University Shenzhen Graduate School)
Publication of WO2020143137A1 publication Critical patent/WO2020143137A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 Querying
    • G06F 16/435 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • The invention relates to the technical fields of computer vision and information retrieval, and in particular to a multi-step self-attention cross-media retrieval method and system based on a restricted text space.
  • The first sub-problem is how to learn an effective low-level feature representation.
  • In the field of cross-media retrieval, most traditional methods represent images and text only through global features, such as the output of the last fully connected layer of a convolutional neural network (CNN) or the final hidden-layer output of a recurrent neural network (RNN).
  • Global features contain considerable redundant information, also called modality-exclusive information. Such information exists only inside a modality and is not shared across modalities, which degrades the quality of cross-media retrieval. Some researchers therefore extract local features of images and text (image object regions, text words) and use an attention mechanism to find the information shared between the two, thereby reducing the influence of redundant features.
  • However, most existing attention-based methods consider only the object-level information shared between images and text and ignore the interaction information between objects.
  • The second sub-problem is how to find a suitable isomorphic feature space.
  • There are roughly three candidates for the isomorphic space: the common (public) space, the text space and the image space.
  • Existing methods usually map the heterogeneous features nonlinearly into a latent common space, so that the similarity between data of different modalities can be measured directly.
  • However, compared with the pixel-based features of images, text features are easier for humans to understand and convey more precise information. For example, given an image, the human brain first condenses its content into descriptive sentences and then retrieves semantically similar text based on those descriptions. To simulate this cognitive process of the human brain, the method explores the feasibility of performing cross-media retrieval in the text space.
  • The text space is essentially a vector space composed of a large number of distinct characters and words.
  • For Chinese, there is no exact count of Chinese characters; the number is roughly 100,000 (the character database of Beijing Guoan Consulting Equipment Company contains 91,251 characters with documented sources).
  • At the same time, a constant stream of newly coined words keeps the text space growing.
  • Similar situations occur in other languages, including English. According to incomplete statistics, the number of existing English words already exceeds one million and is still growing by several thousand per year. Natural language is therefore divergent in nature, and because of this divergence it is practically impossible to construct a complete, unrestricted text space.
  • The attention mechanism was first applied in sequence-to-sequence models, such as machine translation and image captioning. It has three common forms: 1) additive attention, 2) multiplicative (product) attention and 3) self-attention. If additive or product attention is used in a cross-media retrieval algorithm, the key information attended to in an image or a text cannot be fixed, which makes the image and text encodings non-deterministic and reduces the practical value of the algorithm.
  • For example, given a data set containing 10 images and 10 texts matched one-to-one with the images, additive or product attention generates 10 different sets of focus information for each image and each text (corresponding to the 10 texts and 10 images, respectively); that is, the key information of an image (text) is determined by its paired text (image).
  • Considering the practical application of cross-media retrieval, the model must guarantee a unique encoding for each image and each text; the self-attention mechanism is therefore better suited to cross-media retrieval.
  • The self-attention mechanism lets images and text locate the key information inside their own data and keeps that information fixed.
  • To overcome the above problems in the prior art, the present invention proposes a multi-step self-attention cross-media retrieval method and retrieval system based on a restricted text space.
  • The method learns a restricted text space by simulating human cognition and introduces a multi-step self-attention mechanism and correlation features, which greatly improves the retrieval recall rate.
  • In addition to the objective evaluation metric (retrieval recall), the present invention also builds an online retrieval demo system. By entering text or uploading an image, the demo returns the corresponding retrieval results, further verifying the effectiveness of the invention.
  • In the present invention, the restricted text space is a text space with a relatively fixed vocabulary, as opposed to an unrestricted text space.
  • The invention constructs a restricted text space with a relatively fixed vocabulary and converts the unrestricted text space into this restricted text space, thereby guaranteeing the convergence of the algorithm.
  • The comprehension ability of a restricted text space depends on the vocabulary size: the larger the vocabulary, the stronger the comprehension, and the smaller the vocabulary, the weaker. Experiments show that a vocabulary of roughly 3,000 words already satisfies the basic needs of cross-media retrieval; blindly enlarging the vocabulary does not improve retrieval performance and only increases the time and space complexity of the algorithm.
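  • As a hedged illustration of how such a relatively fixed vocabulary could be built from a caption corpus, the following Python sketch keeps the most frequent words and maps everything else to an unknown token; the function names, special tokens and vocabulary size are illustrative assumptions, not details taken from the patent.

```python
from collections import Counter
import re

def build_restricted_vocab(captions, vocab_size=3000, specials=("<pad>", "<unk>")):
    """Build a relatively fixed vocabulary from a caption corpus.

    Words outside the `vocab_size` most frequent ones map to <unk>, which is one
    simple way to convert an unrestricted text space into a restricted one.
    """
    counter = Counter()
    for caption in captions:
        counter.update(re.findall(r"[a-z']+", caption.lower()))
    keep = [w for w, _ in counter.most_common(vocab_size - len(specials))]
    itos = list(specials) + keep
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos

def encode(caption, stoi):
    unk = stoi["<unk>"]
    return [stoi.get(w, unk) for w in re.findall(r"[a-z']+", caption.lower())]

# Toy usage
stoi, _ = build_restricted_vocab(["a man rides a wave on a surfboard",
                                  "a man jumps off his surfboard"])
print(encode("a man surfs a big wave", stoi))
```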
  • The present invention extracts the interaction information between objects through an image captioning model; this information is also referred to as correlation (relation) information.
  • The image captioning model is essentially an encoder-decoder model: given an input image, the encoder first encodes it into a feature vector, and the decoder then translates that feature vector into an appropriate description text. Because the generated description contains not only the object categories in the image (nouns) but also the interactions between the objects (verbs, adjectives), the correlation information can be represented by the feature vector produced by the encoder.
  • The representative algorithm for the image captioning task is NIC (Neural Image Captioning).
  • The method of the present invention extracts the regional features of images and text (image object regions, text words) and finds the information shared between the two through a multi-step self-attention mechanism, thereby reducing the interference of redundant information.
  • In addition to the regional features, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly, and thereby achieves better experimental results at a faster training speed.
  • For the problem of finding a suitable isomorphic feature space, the present invention maps the low-level image features into a "restricted text space" that contains not only the category information of objects but also the rich interaction information between objects.
  • The multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention contains three modules: a feature extraction network, a feature mapping network and a similarity measurement network.
  • The feature extraction network extracts the global features, regional features and correlation features of images and text.
  • The correlation features are extracted by NIC, the representative image captioning algorithm.
  • The feature mapping network is used to learn the restricted text space.
  • With the help of the multi-step self-attention mechanism, the feature mapping network can selectively focus on part of the shared information at different steps and extracts the object-level features of images and text by aggregating the useful information of every step.
  • It also fuses the object-level image features with the correlation features through a feature fusion layer and maps them into the restricted text space.
  • To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly.
  • The similarity measurement network measures the final similarity between an image and a text by aggregating the useful information of every step. The present invention achieves good recall results on classic cross-media retrieval data sets and also performs well from a subjective perspective.
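  • The overall data flow through these three modules can be sketched as follows; this is a structural outline only, and the module interfaces (encode_image, encode_text, the per-step feature lists) are illustrative assumptions rather than the patent's actual API.

```python
def retrieval_score(image, caption, feature_net, mapping_net, similarity_net):
    """Structural sketch of the three-module pipeline described above."""
    # 1) Feature extraction: global, regional and correlation features.
    image_feats = feature_net.encode_image(image)   # e.g. {"global", "regions", "correlation"}
    text_feats = feature_net.encode_text(caption)   # e.g. {"global", "regions"}

    # 2) Feature mapping: K self-attention steps into the restricted text space,
    #    yielding one image feature v_k and one text feature u_k per step.
    v_steps, u_steps = mapping_net(image_feats, text_feats)

    # 3) Similarity measurement: aggregate the per-step similarities into S.
    return similarity_net(v_steps, u_steps)
```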
  • The online retrieval demo system is designed and implemented with the MVC (Model-View-Controller) framework.
  • The Model corresponds to the multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention and is the core ranking algorithm.
  • The View corresponds to the front-end page, used to input queries (images or text) and display the retrieval results.
  • The Controller corresponds to the back-end controller, used to read the query input from the front end and send data to the core ranking algorithm.
  • The multi-step self-attention cross-media retrieval method based on the restricted text space comprises a feature extraction network, a feature mapping network and a similarity measurement network. The feature extraction network extracts the global features, regional feature sets and correlation features of images and text; these features are then fed into the feature mapping network, which extracts as much object-level shared information between images and text as possible through a multi-step self-attention mechanism.
  • Because the multi-step self-attention mechanism does not consider the interaction information between different objects, the feature mapping network fuses the object-level shared features with the correlation features through a feature fusion layer and maps them into the restricted text space. Finally, the similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function, thereby realizing multi-step self-attention cross-media retrieval based on the restricted text space.
  • Specifically, assume the data set D = {D_1, D_2, ..., D_I} contains I samples, where each sample D_i consists of an image i and a description text s, i.e. D_i = (i, s). Each text is composed of several (e.g. 5) sentences, and each sentence independently describes the matching image. The data set is used to learn the restricted text space. For the data set D, the specific implementation steps of the present invention are as follows:
  • In step 1), the regional features of the images and text in D are extracted through the feature extraction network. For images, the pre-trained VGG network (the convolutional architecture proposed by the Visual Geometry Group) is used to extract the global image feature and the set of image region features, and NIC is used to extract the correlation feature that encodes the rich interaction information between objects.
  • For text, the present invention uses a bidirectional LSTM (Bidirectional Long Short-Term Memory) network to extract the global text feature and the set of text region (word) features.
  • The bidirectional LSTM network is not pre-trained; its parameters are updated together with the parameters of the feature mapping network.
  • In step 2), the features extracted in step 1) are fed into the feature mapping network: the multi-step self-attention mechanism first attends to as much object-level shared information between the image and text regional features as possible, and a feature fusion layer then fuses the object-level shared features with the correlation features and maps them into the restricted text space.
  • To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly.
  • In step 3), the similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function.
  • In step 4), the present invention updates the network parameters by optimizing the triplet loss function.
  • The similarity measurement function is defined as sim(v, u) = v · u.
  • Here v and u denote the features of the image and the text in the restricted text space; the similarity s_k of the two at step k is computed by Equation 7: s_k = v_k · u_k.
  • The final similarity S between the image and the text is measured by aggregating the useful information of the K steps, expressed as Equation 8: S = Σ_k s_k (the sum of s_k over the K steps).
  • The triplet loss function (Equation 9) is a bidirectional ranking loss of the form L = Σ_p max(0, m - sim(v, u) + sim(v, u_p)) + Σ_p max(0, m - sim(u, v) + sim(u, v_p)), where u_p is the feature of s_p, the p-th non-matching text for the input image i, v_p is the feature of i_p, the p-th non-matching image for the input text s, and m = 0.3 is the minimum distance margin.
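  • As a hedged illustration, the PyTorch-style sketch below computes the per-step similarities s_k = v_k · u_k, their sum S, and a bidirectional triplet ranking loss with margin m = 0.3 over a mini-batch; treating every other item in the batch as a non-matching sample is an assumption made for this example, since the patent only defines the loss over non-matching pairs in general.

```python
import torch

def similarity(v_steps, u_steps):
    """v_steps, u_steps: lists of K tensors of shape (batch, d) in the restricted
    text space. Returns the (batch, batch) matrix of final similarities
    S = sum_k v_k . u_k between every image and every text in the batch."""
    S = torch.zeros(v_steps[0].size(0), u_steps[0].size(0))
    for v_k, u_k in zip(v_steps, u_steps):
        S = S + v_k @ u_k.t()            # s_k = v_k · u_k for all pairs
    return S

def triplet_loss(S, margin=0.3):
    """Bidirectional ranking loss: matched pairs lie on the diagonal of S and
    every off-diagonal entry is treated as a non-matching pair."""
    batch = S.size(0)
    pos = S.diag().view(batch, 1)
    cost_i2t = (margin - pos + S).clamp(min=0)      # image -> text direction
    cost_t2i = (margin - pos.t() + S).clamp(min=0)  # text -> image direction
    mask = torch.eye(batch, dtype=torch.bool)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()

# Toy usage with K = 2 steps, batch of 4, d = 1024
v_steps = [torch.randn(4, 1024) for _ in range(2)]
u_steps = [torch.randn(4, 1024) for _ in range(2)]
loss = triplet_loss(similarity(v_steps, u_steps))
```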
  • In a specific implementation, the effectiveness of the present invention is further verified by building an online multi-step self-attention cross-media retrieval demo system based on the restricted text space.
  • The front-end page is implemented with HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript; the back-end controller is implemented with the Tornado framework.
  • The invention provides a multi-step self-attention cross-media retrieval method based on a restricted text space, comprising a feature extraction network, a feature mapping network and a similarity measurement network.
  • The feature extraction network extracts the global features, regional feature sets and correlation features of images and text; the features are then fed into the feature mapping network, which extracts as much object-level shared information between images and text as possible through a multi-step self-attention mechanism. Because this mechanism does not consider the interaction information between different objects, the feature mapping network fuses the object-level shared features with the correlation features through a feature fusion layer and maps them into the restricted text space.
  • To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly; finally, the similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function.
  • In addition to the objective evaluation metric (retrieval recall), the present invention builds an online retrieval demo. By entering text or uploading an image, the demo returns the corresponding retrieval results, verifying the effectiveness of the invention from a subjective perspective.
  • Specifically, the present invention has the following technical advantages:
  • (1) Based on the restricted text space, the present invention proposes a novel feature mapping network built on a multi-step self-attention mechanism; it can selectively focus on part of the shared information at different steps and measures the final similarity between images and text by aggregating the useful information of every step;
  • (2) the present invention extracts, through the image captioning model, correlation features encoding the rich interaction information between the different objects in an image, compensating for the limitation of object-level shared information;
  • (3) to achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly;
  • (4) in addition to the objective evaluation metric (retrieval recall), the present invention builds an online retrieval demo: by entering text or uploading an image, the demo returns the corresponding retrieval results, verifying the effectiveness of the invention from a subjective perspective.
  • Figure 1 illustrates the concepts of object-level shared information and correlation information.
  • Given two different image-text pairs, the object-level information shared between the image and the text is similar in both pairs, for example "man", "surfboard" and "wave".
  • However, the interaction information between the objects differs, for example how the man surfs ("jumps off" vs. "paddles towards").
  • Figure 2 is a flow block diagram of the method. A and B denote the image and text processing branches, respectively.
  • For images, CNN (convolutional neural network) denotes a 19-layer VGG model that produces the set of region features of image i; NIC is the image captioning model that produces the correlation feature; v_global is the global image feature. At each step k, the attention branch produces the image shared feature and the image context information, and the feature fusion layer fuses the shared feature with the correlation feature and maps the result into the restricted text space, giving the image feature output v_k at step k.
  • For text, BLSTM denotes a bidirectional LSTM network that produces the set of region (word) features of text s; u_global is the global text feature, and the text branch maintains the text context information at step k.
  • S is the final similarity between the image and the text.
  • Figure 3 is the structure of the feature mapping network of the present invention.
  • C and D denote the self-attention mechanisms of the text and the image, respectively. The attention layer computes the feature weights of the different regions of the image and of the words of the text; the weighted-average layer takes a weighted average of the regional feature sets of the image and the text with these weights to obtain the shared features (v_k and u_k) at the current step; the context information is updated through the identity connections (dashed lines).
  • Figure 4 shows the effect of global prior knowledge on the model convergence speed on the Flickr8K data set.
  • "MSAN with prior" denotes the model that introduces global prior knowledge;
  • "MSAN w/o prior" denotes the model that does not use global prior knowledge.
  • Figures 5 and 6 show the main pages of the online retrieval demo: screenshots of the text-to-image retrieval page and the image-to-text retrieval page, respectively.
  • The invention provides a multi-step self-attention cross-media retrieval method based on a restricted text space, comprising a feature extraction network, a feature mapping network and a similarity measurement network.
  • The feature extraction network extracts the global features, regional feature sets and correlation features of images and text; the features are then fed into the feature mapping network, which extracts as much object-level shared information between images and text as possible through a multi-step self-attention mechanism. However, this mechanism does not consider the interaction information between different objects. As shown in Figure 1, for two different image-text pairs, the object-level information shared between the image and the text is similar, for example "man", "surfboard" and "wave", while the interaction information between the objects differs.
  • The feature mapping network therefore fuses the object-level shared features with the correlation features through a feature fusion layer and maps them into the restricted text space.
  • To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly; finally, the similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function.
  • In addition to the objective evaluation metric (retrieval recall), the present invention builds an online retrieval demo. By entering text or uploading an image, the demo returns the corresponding retrieval results, verifying the effectiveness of the invention from a subjective perspective.
  • The output of the last fully connected layer of VGG is used as the 4096-dimensional global image feature v_global. Because the stacked convolution and pooling operations are equivalent to extracting features of image regions, the present invention uses the output of the last pooling layer (pool5) of VGG as the set of image region features.
  • The output of this layer contains 512 feature maps, each of size 7 × 7; in other words, the image is divided into 49 regions and each region is represented by a 512-dimensional feature vector.
  • The present invention adopts NIC, the representative image captioning algorithm, to extract a 512-dimensional correlation feature containing the rich interaction information between objects.
  • VGG is pre-trained on ImageNet.
  • NIC is pre-trained on the cross-media retrieval data set.
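  • A hedged torchvision sketch of this image-side extraction is given below: the 4096-dimensional global feature taken from the fully connected part of VGG-19, and the pool5 output reshaped into 49 region features of 512 dimensions each. The NIC encoder is stubbed out, since the patent does not specify its implementation details.

```python
import torch
import torchvision.models as models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).eval()

def extract_image_features(image_batch):
    """image_batch: (B, 3, 224, 224) ImageNet-normalized images.

    Returns the 4096-d global feature (output of the second 4096-d fc layer)
    and the 49 x 512 region feature set taken from the last pooling layer."""
    with torch.no_grad():
        pool5 = vgg.features(image_batch)              # (B, 512, 7, 7)
        regions = pool5.flatten(2).transpose(1, 2)     # (B, 49, 512) region features
        x = vgg.avgpool(pool5).flatten(1)
        v_global = vgg.classifier[:5](x)               # (B, 4096) global feature
    return v_global, regions

# The 512-d correlation feature would come from the encoder of a pre-trained NIC
# captioning model; a zero stub stands in for it here as a placeholder assumption.
def nic_correlation_feature(image_batch):
    return torch.zeros(image_batch.size(0), 512)
```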
  • For text, x_t denotes the input word at time t, and the hidden-layer outputs of the forward LSTM and the backward LSTM at time t are combined into the d-dimensional feature of the current word. As shown in part B of Figure 2, the set of these word features forms the regional feature set of the text, and the global feature u_global is taken as the d-dimensional hidden-layer output at the last time step of the bidirectional LSTM. The dimension d is both the feature dimension of the text and the dimension of the restricted text space; in the experiments, d = 1024.
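  • A minimal PyTorch sketch of such a bidirectional LSTM text encoder with d = 1024 follows; merging the forward and backward states by summation is an assumption made for this example, since the patent does not state how the two directions are combined.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bidirectional LSTM producing per-word region features and a global feature."""

    def __init__(self, vocab_size, d=1024, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # hidden_size = d so each direction outputs a d-dimensional state
        self.blstm = nn.LSTM(embed_dim, d, batch_first=True, bidirectional=True)
        self.d = d

    def forward(self, word_ids):
        # word_ids: (B, T) indices in the restricted vocabulary
        h, _ = self.blstm(self.embed(word_ids))   # (B, T, 2d)
        fwd, bwd = h[..., :self.d], h[..., self.d:]
        words = fwd + bwd                         # (B, T, d) word (region) features
        u_global = words[:, -1, :]                # d-dim output at the last time step
        return words, u_global

encoder = TextEncoder(vocab_size=3000)
words, u_global = encoder(torch.randint(0, 3000, (2, 12)))
```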
  • The feature mapping network uses a visual self-attention mechanism and a text self-attention mechanism, as shown in Figure 3.
  • In the image branch, the visual self-attention function takes the image context information at step k-1 and computes the feature weight of each region of image i; its trainable parameter matrices have size 512 × 512. The image shared feature at step k is obtained as the weighted average of the region features with these weights.
  • W_k denotes the parameters of the fully connected layer that maps the fused feature into the restricted text space, with size 512 × 1024; BN denotes the batch normalization layer and ReLU the activation function. The resulting v_k therefore contains not only the object-level image shared features but also the rich correlation features between objects.
  • In the text branch, the text self-attention function takes the text context information at step k-1 and computes the feature weight of each word of text s; its trainable parameter matrices have size 1024 × 512. u_k is obtained as the weighted average of the word features with these weights.
  • V_att and T_att denote the visual self-attention function and the text self-attention function, respectively.
  • The identity connections control the flow of context information through the network and retain the useful information.
  • To achieve better experimental results at a faster training speed, the present invention initializes the initial context information of the image branch and the text branch to the global features of the image and the text, respectively, as shown in Equation 6.
  • v_global and u_global denote the global features of the image and the text, respectively, and can also be called global prior knowledge.
  • The global features serve as the global reference information of the multi-step self-attention mechanism and are used to locate the key information quickly.
  • The present invention executes the self-attention mechanism step by step over K steps, so that at each step k it can find as much shared information between the image and the text as possible.
  • The value of K differs across data sets: on the Flickr8K data set, K is set to 1; on the Flickr30K and MSCOCO data sets, K is set to 2.
  • The specific experimental results are shown in the subsequent experimental analysis section.
  • The parameter K is the total number of iterations of the multi-step self-attention mechanism; unrolled in time, it can be viewed as applying the self-attention mechanism in turn at the different steps k.
  • The similarity s_k between the image and the text at step k is then obtained by Equation 7: s_k = v_k · u_k.
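  • The following compact PyTorch sketch shows one possible reading of this multi-step mechanism: the contexts are initialized with the global features, each step computes self-attention weights, takes a weighted average, fuses the image side with the correlation feature, maps it into the 1024-dimensional restricted text space and accumulates s_k, and the contexts are updated through identity connections. The attention scoring function, the fusion operation and the context update are not fully specified in the patent text, so the softmax scoring, additive fusion and additive update used here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Scores each region feature against the current context (assumed form)."""

    def __init__(self, region_dim, ctx_dim, hidden=512):
        super().__init__()
        self.w_region = nn.Linear(region_dim, hidden, bias=False)
        self.w_ctx = nn.Linear(ctx_dim, hidden, bias=False)

    def forward(self, regions, ctx):
        # regions: (B, N, region_dim), ctx: (B, ctx_dim)
        scores = (self.w_region(regions) * self.w_ctx(ctx).unsqueeze(1)).sum(-1)
        alpha = F.softmax(scores, dim=1)               # weight of each region / word
        return (alpha.unsqueeze(-1) * regions).sum(1)  # weighted average

class FeatureMappingNetwork(nn.Module):
    def __init__(self, d=1024, img_dim=512, K=2):
        super().__init__()
        self.K = K
        self.v_att = SelfAttention(img_dim, img_dim)       # visual branch
        self.t_att = SelfAttention(d, d)                   # text branch
        self.fuse = nn.Linear(img_dim, img_dim)            # feature fusion layer (assumed form)
        self.to_text_space = nn.Sequential(                # W_k: 512 -> 1024, BN, ReLU
            nn.Linear(img_dim, d), nn.BatchNorm1d(d), nn.ReLU())
        self.img_ctx_init = nn.Linear(4096, img_dim)       # project v_global to 512-d context

    def forward(self, img_regions, v_global, correlation, txt_words, u_global):
        img_ctx = self.img_ctx_init(v_global)  # global prior knowledge as initial context
        txt_ctx = u_global
        S = 0.0
        for _ in range(self.K):
            v_shared = self.v_att(img_regions, img_ctx)    # object-level image feature
            u_k = self.t_att(txt_words, txt_ctx)           # object-level text feature
            v_k = self.to_text_space(self.fuse(v_shared + correlation))
            S = S + (v_k * u_k).sum(-1)                    # s_k = v_k · u_k
            img_ctx = img_ctx + v_shared                   # identity-connection update (assumed)
            txt_ctx = txt_ctx + u_k
        return S

# Toy usage
net = FeatureMappingNetwork()
S = net(torch.randn(2, 49, 512), torch.randn(2, 4096), torch.randn(2, 512),
        torch.randn(2, 12, 1024), torch.randn(2, 1024))
```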
  • Tornado is an open-source web server framework that can handle thousands of connections per second and is therefore an ideal framework for real-time web services.
  • Tornado plays the role of the controller in the MVC framework. Its functions include: 1) reading the query; 2) extracting the features of the query; 3) extracting the features of all the data to be retrieved in the database; 4) sending the data to the model. To guarantee the demo's response speed, the features of all the data to be retrieved in the database are pre-loaded into memory.
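  • A hedged Tornado sketch of such a controller endpoint follows; the handler name, route and the ranking-model interface are illustrative assumptions rather than details from the patent.

```python
import tornado.ioloop
import tornado.web

class SearchHandler(tornado.web.RequestHandler):
    """Controller: read the query from the front end and pass it to the ranking model."""

    def initialize(self, model, database_features):
        self.model = model                          # core ranking algorithm (the "Model" in MVC)
        self.database_features = database_features  # pre-loaded features of all candidates

    def post(self):
        query_text = self.get_body_argument("query", default="")
        results = self.model.rank(query_text, self.database_features)
        self.write({"results": results})

def make_app(model, database_features):
    return tornado.web.Application([
        (r"/search", SearchHandler,
         dict(model=model, database_features=database_features)),
    ])

# app = make_app(model, database_features); app.listen(8888)
# tornado.ioloop.IOLoop.current().start()
```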
  • The multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention corresponds to the model in the MVC framework and is also called the core ranking algorithm. Its main task is to find the data similar to the query quickly and accurately and send it to the controller. For a small amount of data, the simplest approach is a linear scan, i.e. computing the distance between the query and every sample in the data set in turn. However, as the amount of data grows, the time consumed by the linear scan grows as well and the demo's response slows down.
  • In a preferred embodiment, the present invention uses Faiss, Facebook's open-source framework, to achieve accurate and fast queries. Faiss provides efficient similarity search and clustering for dense vectors. Before querying, Faiss clusters all the data in the data set into different data clusters.
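  • A minimal Faiss sketch of this kind of clustering-based index over the candidate features is shown below; the index type (IVF over an inner-product quantizer), the number of clusters and the other parameters are illustrative assumptions.

```python
import numpy as np
import faiss

d = 1024                                                 # dimension of the restricted text space
database = np.random.rand(10000, d).astype("float32")    # features of the data to be retrieved
query = np.random.rand(1, d).astype("float32")           # feature of the query

# IVF index: cluster the database into nlist cells, then search only a few cells per query.
nlist = 100
quantizer = faiss.IndexFlatIP(d)                         # inner product matches sim(v, u) = v · u
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(database)                                    # clustering performed before querying
index.add(database)

index.nprobe = 10                                        # number of clusters visited per query
scores, ids = index.search(query, 10)                    # top-10 most similar candidates
```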
  • The online retrieval demo contains three pages: the main page, the text-to-image retrieval page (Figure 5) and the image-to-text retrieval page (Figure 6).
  • The main page contains a text input box, a camera icon and a "Search" button. The user first enters text in the input box or uploads an image by clicking the camera icon, and then clicks the "Search" button to start the retrieval.
  • Figure 5 shows the results of the corresponding text-to-image retrieval.
  • Figure 6 shows the image-to-text retrieval results for an image named "COCO_train2014_000000000049.jpg".
  • The retrieval results are displayed in order of relevance, i.e. the relevance of the samples decreases from top to bottom and from left to right.
  • The search box in Figures 5 and 6 has been moved to the upper-left corner; its function is unchanged.
  • Tables 1 to 3 show the recall results of the present invention on the Flickr8K, Flickr30K and MSCOCO data sets.
  • Img2Txt denotes image-to-text retrieval.
  • Txt2Img denotes text-to-image retrieval.
  • MSAN-obj does not use the correlation features and considers only the object-level shared information between images and text;
  • MSAN-glob does not use the multi-step self-attention mechanism and represents images and text only through global features;
  • MSAN is the complete model, including the correlation features and the multi-step self-attention mechanism.
  • Compared with DSPE, HM-LSTM, DAN and other well-performing methods, MSAN achieves the best results among current VGG-feature-based approaches.
  • MSAN achieves better experimental results than MSAN-obj and MSAN-glob, proving the effectiveness of the multi-step self-attention mechanism and the correlation features.
  • Table 5 shows the influence of global prior knowledge on the experimental results.
  • "MSAN with prior" denotes the MSAN model that uses global prior knowledge.
  • "MSAN w/o prior" denotes the MSAN model without global prior knowledge. Table 5 shows that the retrieval recall of "MSAN with prior" is higher than that of "MSAN w/o prior", which verifies the effectiveness of the global prior knowledge.
  • Figure 4 shows the loss curves of the "MSAN with prior" and "MSAN w/o prior" models on the Flickr8K data set.
  • "MSAN with prior" converges faster than "MSAN w/o prior" and reaches a smaller loss at convergence. Therefore, thanks to the introduction of global prior knowledge, the present invention achieves better retrieval results at a faster convergence speed.
  • Figures 5 and 6 show the text-to-image and image-to-text retrieval results of the online demo, respectively. From a subjective point of view, even when the displayed results do not contain the true matching samples, the multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention still finds results as similar as possible to the query, satisfying users' needs. This also validates the effectiveness of the present invention from a subjective perspective.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a multi-step self-attention cross-media retrieval method based on a restricted text space and a retrieval system. A restricted text space with a relatively fixed vocabulary is first constructed, and the unrestricted text space is converted into the restricted text space. The method comprises: extracting the image features and text features of the restricted text space through a feature extraction network, the features comprising global features, regional feature sets and correlation (associated) features; feeding the extracted features into a feature mapping network and extracting object-level shared information between the image and the text through a multi-step self-attention mechanism; then aggregating the useful information of every step through a similarity measurement network to measure the similarity between the image and the text, and computing a triplet loss function; thereby realizing multi-step self-attention cross-media retrieval based on the restricted text space. By introducing the multi-step self-attention mechanism and the correlation features, the method and system greatly improve the cross-media retrieval recall rate.

Description

Multi-step self-attention cross-media retrieval method and system based on restricted text space

Technical field

The invention relates to the technical fields of computer vision and information retrieval, and in particular to a multi-step self-attention cross-media retrieval method and system based on a restricted text space.

Background art
In recent years, with the rapid development of information technology, multimedia data on the Internet has become increasingly rich, and multimedia data of different modalities (text, images, audio, video, etc.) can be used to express similar content. To meet users' growing multimedia retrieval needs, the cross-media retrieval task has been proposed: finding a homogeneous semantic space (common space, text space or image space) in which the similarity between the underlying heterogeneous multimedia data can be measured directly. More precisely, the core problem of the cross-media retrieval task can be subdivided into two sub-problems.
The first sub-problem is how to learn an effective low-level feature representation. In the field of cross-media retrieval, most traditional methods represent images and text only through global features, such as the output of the last fully connected layer of a convolutional neural network (CNN) or the final hidden-layer output of a recurrent neural network (RNN). Global features contain considerable redundant information, also called modality-exclusive information. Such information exists only inside a modality and is not shared across modalities, which degrades the quality of cross-media retrieval. Some researchers therefore extract local features of images and text (image object regions, text words) and use an attention mechanism to find the information shared between the two, thereby reducing the influence of redundant features. However, most existing attention-based methods consider only the object-level information shared between images and text and ignore the interaction information between objects.
The second sub-problem is how to find a suitable isomorphic feature space. There are roughly three candidates: the common (public) space, the text space and the image space. Existing methods usually map the heterogeneous features nonlinearly into a latent common space, so that the similarity between data of different modalities can be measured directly. However, compared with the pixel-based features of images, text features are easier for humans to understand and convey more precise information. For example, given an image, the human brain first condenses its content into descriptive sentences and then retrieves semantically similar text based on those descriptions. To simulate this cognitive process of the human brain, the method explores the feasibility of performing cross-media retrieval in the text space. Existing text-space-based cross-media retrieval methods do not consider the human brain's cognitive process for images; most of them adopt the Word2Vec space as the final text space, in which an image is represented by combining the category information of the objects it contains. Such a representation loses the rich interaction information contained in the image, which shows that for cross-media retrieval the Word2Vec space is not an effective text feature space.
The text space is essentially a vector space composed of a large number of distinct characters and words. For Chinese, there is no exact count of Chinese characters; the number is roughly 100,000 (the character database of Beijing Guoan Consulting Equipment Company contains 91,251 characters with documented sources). At the same time, a constant stream of newly coined words keeps the text space growing. Besides Chinese, similar situations occur in other languages, including English. According to incomplete statistics, the number of existing English words already exceeds one million and is still growing by several thousand per year. Natural language is therefore divergent in nature, and because of this divergence it is practically impossible to construct a complete, unrestricted text space.
However, in most cases people only need to master a portion of these characters and words to meet their daily needs. For example, many English linguists believe that roughly 3,650 of the most basic common English words can cover more than 95% of the tasks of expressing ideas and communicating; the "Dictionary of Commonly Used Modern Chinese Characters", jointly issued by the former State Education Commission in November 1987, states that modern Chinese has 2,500 commonly used characters, covering more than 99% of everyday Chinese usage.
In recent years, the attention mechanism has attracted more and more researchers. It was first applied in sequence-to-sequence models, such as machine translation and image captioning, and has three common forms: 1) additive attention, 2) multiplicative (product) attention and 3) self-attention. If additive or product attention is used in a cross-media retrieval algorithm, the key information attended to in an image or a text cannot be fixed, which makes the image and text encodings non-deterministic and reduces the practical value of the algorithm. For example, given a data set containing 10 images and 10 texts matched one-to-one with the images, additive or product attention generates 10 different sets of focus information for each image and each text (corresponding to the 10 texts and 10 images, respectively); that is, the key information of an image (text) is determined by its paired text (image). Considering the practical application value of cross-media retrieval algorithms, however, the model must guarantee a unique encoding for each image and each text. The self-attention mechanism is therefore better suited to cross-media retrieval: it lets images and text locate the key information inside their own data and keeps that information fixed.
Summary of the invention

To overcome the above problems in the prior art, the present invention proposes a multi-step self-attention cross-media retrieval method and retrieval system based on a restricted text space. The method learns a restricted text space by simulating human cognition and introduces a multi-step self-attention mechanism and correlation features, which greatly improves the retrieval recall rate. In addition to the objective evaluation metric (retrieval recall), the present invention also builds an online retrieval demo system. By entering text or uploading an image, the demo returns the corresponding retrieval results, further verifying the effectiveness of the invention.
In the present invention, the restricted text space refers to a text space with a relatively fixed vocabulary, as opposed to an unrestricted text space. The invention constructs a restricted text space with a relatively fixed vocabulary and converts the unrestricted text space into this restricted text space, thereby guaranteeing the convergence of the algorithm. The comprehension ability of a restricted text space depends on the vocabulary size: the larger the vocabulary, the stronger the comprehension, and the smaller the vocabulary, the weaker. Experiments show that a vocabulary of roughly 3,000 words already satisfies the basic needs of cross-media retrieval; blindly increasing the number of words does not improve retrieval performance and only increases the time and space complexity of the algorithm. The present invention extracts the interaction information between objects, also called correlation (relation) information, through an image captioning model. The image captioning model is essentially an encoder-decoder model: given an input image, the encoder first encodes it into a feature vector, and the decoder then translates that feature vector into an appropriate description text. Because the generated description contains not only the object categories in the image (nouns) but also the interactions between the objects (verbs, adjectives), the correlation information can be represented by the feature vector produced by the encoder. The representative algorithm for the image captioning task is NIC (Neural Image Captioning).
The method of the present invention extracts the regional features of images and text (image object regions, text words) and finds the information shared between the two through a multi-step self-attention mechanism, thereby reducing the interference of redundant information. In addition to the regional features, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly, and thereby achieves better experimental results at a faster training speed.
For the problem of finding a suitable isomorphic feature space, the present invention maps the low-level image features into a "restricted text space", which contains not only the category information of objects but also the rich interaction information between objects.
The multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the present invention contains three modules: a feature extraction network, a feature mapping network and a similarity measurement network. For the first sub-problem (how to learn an effective low-level feature representation), the feature extraction network extracts the global features, regional features and correlation features of images and text; the correlation features are extracted by NIC, the representative image captioning algorithm. For the second sub-problem (how to find a suitable isomorphic feature space), the feature mapping network is used to learn the restricted text space. With the help of the multi-step self-attention mechanism, it can selectively focus on part of the shared information at different steps and extracts the object-level features of images and text by aggregating the useful information of every step. In addition, it fuses the object-level image features with the correlation features through a feature fusion layer and maps them into the restricted text space. To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly. Finally, the similarity measurement network measures the final similarity between images and text by aggregating the useful information of every step. The present invention achieves good recall results on classic cross-media retrieval data sets and also performs well from a subjective perspective.
For the online retrieval demo system, the present invention is designed and implemented with the MVC (Model-View-Controller) framework. The Model corresponds to the proposed multi-step self-attention cross-media retrieval method based on the restricted text space and is the core ranking algorithm; the View corresponds to the front-end page, used to input queries (images or text) and display the retrieval results; the Controller corresponds to the back-end controller, used to read the query input from the front end and send data to the core ranking algorithm.
The technical solution provided by the present invention is:
A multi-step self-attention cross-media retrieval method based on a restricted text space, comprising a feature extraction network, a feature mapping network and a similarity measurement network. The feature extraction network extracts the global features, regional feature sets and correlation features of images and text; these features are then fed into the feature mapping network, which extracts as much object-level shared information between images and text as possible through a multi-step self-attention mechanism. Because the multi-step self-attention mechanism does not consider the interaction information between different objects, the feature mapping network fuses the object-level shared features with the correlation features through a feature fusion layer and maps them into the restricted text space. Finally, the similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function, thereby realizing multi-step self-attention cross-media retrieval based on the restricted text space.
Specifically, assume the data set D = {D_1, D_2, ..., D_I} contains I samples, where each sample D_i consists of an image i and a description text s, i.e. D_i = (i, s). Each text is composed of several (e.g. 5) sentences, and each sentence independently describes the matching image. The data set is used to learn the restricted text space. For the data set D, the specific implementation steps of the present invention are as follows:
1) Extract the regional features of the images and text in D through the feature extraction network.
For images, the pre-trained VGG network (the convolutional architecture proposed by the Visual Geometry Group) is used to extract the global image feature and the set of image region features; NIC is used to extract the correlation feature containing the rich interaction information between objects. For text, the present invention uses a bidirectional LSTM (Bidirectional Long Short-Term Memory) network to extract the global text feature and the set of text region features. The bidirectional LSTM network is not pre-trained; its parameters are updated together with the parameters of the feature mapping network.
2) Feed the features extracted in step 1) into the feature mapping network.
First, the multi-step self-attention mechanism attends to as much object-level shared information between the image and text regional features as possible; second, a feature fusion layer fuses the object-level shared features with the correlation features and maps them into the restricted text space. To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly.
3) The similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function.
4) Finally, the present invention updates the network parameters by optimizing the triplet loss function.
The similarity measurement function is defined as:
sim(v, u) = v · u
where v and u denote the features of the image and the text in the restricted text space, respectively; the similarity s_k of the two at step k is computed by Equation 7:
s_k = v_k · u_k      (Equation 7)
By aggregating the useful information of the K steps, the final similarity S between the image and the text is measured, expressed as Equation 8:
S = Σ_{k=1}^{K} s_k      (Equation 8)
5) Compute the triplet loss function and update the network parameters by optimizing it.
The triplet loss function is expressed as Equation 9:
L = Σ_p max(0, m - sim(v, u) + sim(v, u_p)) + Σ_p max(0, m - sim(u, v) + sim(u, v_p))      (Equation 9)
where s_p is the p-th non-matching text for the input image i, i_p is the p-th non-matching image for the input text s (u_p and v_p denote their features in the restricted text space), m is the minimum distance margin with value 0.3, and sim(·,·) is the similarity measurement function.
In a specific implementation, the effectiveness of the present invention is further verified by building an online multi-step self-attention cross-media retrieval demo system based on the restricted text space. The front-end page is implemented with HyperText Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript; the back-end controller is implemented with the Tornado framework.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The invention provides a multi-step self-attention cross-media retrieval method based on a restricted text space, comprising a feature extraction network, a feature mapping network and a similarity measurement network. The feature extraction network extracts the global features, regional feature sets and correlation features of images and text; the features are then fed into the feature mapping network, which extracts as much object-level shared information between images and text as possible through a multi-step self-attention mechanism. Because this mechanism does not consider the interaction information between different objects, the feature mapping network fuses the object-level shared features with the correlation features through a feature fusion layer and maps them into the restricted text space. To achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly. Finally, the similarity measurement network measures the final similarity between the image and the text by aggregating the useful information of every step and computes the triplet loss function. In addition to the objective evaluation metric (retrieval recall), the present invention builds an online retrieval demo; by entering text or uploading an image, the demo returns the corresponding retrieval results, verifying the effectiveness of the invention from a subjective perspective. Specifically, the present invention has the following technical advantages:
(1) Based on the restricted text space, the present invention proposes a novel feature mapping network built on a multi-step self-attention mechanism. It can selectively focus on part of the shared information at different steps and measures the final similarity between images and text by aggregating the useful information of every step;
(2) the present invention extracts, through the image captioning model, correlation features encoding the rich interaction information between the different objects in an image, compensating for the limitation of object-level shared information;
(3) to achieve better experimental results at a faster training speed, the present invention treats the global features of images and text as the global prior knowledge of the multi-step self-attention mechanism, used to locate key information quickly;
(4) in addition to the objective evaluation metric (retrieval recall), the present invention builds an online retrieval demo: by entering text or uploading an image, the demo returns the corresponding retrieval results, verifying the effectiveness of the invention from a subjective perspective.
Brief description of the drawings

The present invention has 6 drawings in total:
Figure 1 defines the concepts of object-level shared information and correlation information.
Given two different image-text pairs, the object-level information shared between the image and the text is similar in both pairs, for example "man", "surfboard" and "wave". However, the interaction information between the objects differs, for example how the man surfs ("jumps off" vs. "paddles towards").
Figure 2 is a flow block diagram of the method provided by the present invention.
A and B denote the image and text processing branches, respectively. For images, the CNN (convolutional neural network) is a 19-layer VGG model; {v_n^i} denotes the regional feature set of image i, v_cap^i is the associated feature extracted by the image captioning model NIC, and v_global is the global image feature; v_k^att denotes the object-level image shared feature at step k and c_k^v the image context information at step k. The feature fusion layer fuses v_k^att with the associated feature v_cap^i and maps the result into the restricted text space, yielding the image feature output v_k at step k. For text, BLSTM is a bidirectional LSTM network; {u_n^s} denotes the regional (word-level) feature set of text s, u_global is the global text feature, and c_k^u is the text context information at step k. S is the final similarity between the image and the text.
Figure 3 shows the structure of the feature mapping network of the invention.
C and D denote the text and visual self-attention mechanisms, respectively. The attention layer computes the feature weights of the different regions of the image and of the text (α_{k,n}^v and α_{k,n}^u); the weighted-average layer averages the regional feature sets of image and text with these weights to obtain the shared features of the current step (v_k and u_k); the dashed identity connection updates the context information.
Figure 4 shows the effect of global prior knowledge on the convergence speed of the model on the Flickr8K dataset.
Here, "MSAN with prior" denotes the model that uses global prior knowledge, and "MSAN w/o prior" the model that does not.
Figures 5 and 6 show the main pages of the online retrieval demo: screenshots of the text-to-image retrieval page and of the image-to-text retrieval page, respectively.
DETAILED DESCRIPTION
The invention is further described below by way of embodiments with reference to the drawings, without limiting the scope of the invention in any way.
The invention provides a multi-step self-attention cross-media retrieval method based on a restricted text space, comprising a feature extraction network, a feature mapping network and a similarity measurement network. The feature extraction network extracts the global features, regional feature sets and associated features of images and text. These features are then fed into the feature mapping network, which uses a multi-step self-attention mechanism to extract as much object-level shared information between image and text as possible. However, this mechanism does not capture the interaction information between different objects. As shown in Figure 1, for two different image-text pairs the object-level shared information between image and text is similar, e.g. "man", "surfboard" and "wave", while the interaction information between the objects differs, e.g. how the man surfs ("jumps off" vs. "paddles towards"). The feature mapping network therefore fuses the object-level shared features with the associated features through a feature fusion layer and maps the result into the restricted text space. To obtain better experimental results at a faster training speed, the invention treats the global features of images and text as global prior knowledge for the multi-step self-attention mechanism, enabling rapid localization of key information. Finally, the similarity measurement network measures the final similarity between image and text by aggregating the useful information of each step, and computes a triplet loss function. In addition to the objective evaluation metric (retrieval recall), the invention also builds an online retrieval demo: given an input text or an uploaded image, the demo returns the corresponding retrieval results, verifying the effectiveness of the invention from a subjective perspective. The principles and structure of the feature extraction network, the feature mapping network, the similarity measurement network and the online retrieval demo are described in detail below.
1. Feature Extraction Network
As shown in part A of Figure 2, given an input image i, the output of the last fully connected layer of VGG is used as the 4096-dimensional global image feature v_global. Since stacked convolution and pooling operations amount to extracting features of image regions, the invention takes the output of the last pooling layer of VGG (pool5) as the regional feature set of the image, {v_1^i, …, v_49^i}. This layer outputs 512 feature maps of size 7×7; that is, the image is divided into 49 regions, each represented by a 512-dimensional feature vector. For the associated feature, the invention adopts NIC, a representative image captioning algorithm, to extract a 512-dimensional associated feature v_cap^i that encodes the rich interaction information between objects. During training, the parameters of VGG and NIC are fixed: VGG is pre-trained on ImageNet, and NIC is pre-trained on the cross-media retrieval dataset.
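The image branch described above can be sketched in a few lines of PyTorch. This is an illustrative sketch rather than the implementation of the invention: it assumes torchvision's pre-trained VGG-19 and a 224×224 input (so that pool5 yields 512 maps of size 7×7), and it omits the NIC captioning branch that produces the 512-dimensional associated feature.

    import torch
    import torchvision.models as models

    # Illustrative sketch of the image feature extractor (not the original code).
    vgg = models.vgg19(pretrained=True).eval()

    region_extractor = vgg.features                  # conv/pool stack, ends at pool5
    global_extractor = torch.nn.Sequential(          # 4096-d fc activation (classifier minus last layer)
        vgg.avgpool,
        torch.nn.Flatten(),
        *list(vgg.classifier.children())[:-1],
    )

    with torch.no_grad():
        image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
        fmap = region_extractor(image)               # (1, 512, 7, 7)
        regions = fmap.flatten(2).transpose(1, 2)    # (1, 49, 512): regional feature set
        v_global = global_extractor(fmap)            # (1, 4096): global image feature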
For a text s = (s_0, s_1, …, s_N), a bidirectional LSTM network is used to extract the feature of each word:

  h_t^f = LSTM_f(x_t, h_{t−1}^f),   h_t^b = LSTM_b(x_t, h_{t+1}^b),   u_t^s = g(h_t^f, h_t^b)    (1)

where x_t is the input word at time t; h_t^f and h_t^b are the hidden-layer outputs of the forward and backward LSTM at time t; and u_t^s is the d-dimensional feature of the current word, obtained by combining the two hidden states through g(·). Therefore, as shown in part B of Figure 2, the regional feature set of the text can be written as {u_1^s, …, u_N^s}, and the global feature u_global is taken as the d-dimensional hidden-layer output of the bidirectional LSTM at the last time step. The dimension d is both the feature dimension of the text and the dimension of the restricted text space; in the experiments d is set to 1024.
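A corresponding sketch of the text branch is given below. The way the two directions are combined into one d-dimensional word feature, and the vocabulary and embedding sizes, are assumptions made for illustration; the exact combination appears only as a formula image in the original filing.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Sketch of the bidirectional-LSTM text branch (d = 1024).
        Averaging the two directions and taking the last step as the global
        feature are assumptions for illustration."""
        def __init__(self, vocab_size=10000, embed_dim=300, d=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, d, batch_first=True, bidirectional=True)
            self.d = d

        def forward(self, tokens):                      # tokens: (B, N) word ids
            h, _ = self.lstm(self.embed(tokens))        # (B, N, 2d)
            fwd, bwd = h[..., :self.d], h[..., self.d:]
            words = (fwd + bwd) / 2                     # (B, N, d): word feature set
            u_global = words[:, -1, :]                  # last-step output as global feature
            return words, u_global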
2. Feature Mapping Network
For images and text, the feature mapping network uses a visual self-attention mechanism and a text self-attention mechanism, respectively, as shown in Figure 3.
1) Visual self-attention mechanism
As shown in part D of Figure 3, given the regional feature set {v_1^i, …, v_49^i} of image i, the object-level image shared feature v_k^att at step k is obtained by Equation 2:

  α_{k,n}^v = V_att(v_n^i, c_{k−1}^v),   v_k^att = Σ_n α_{k,n}^v · v_n^i    (2)

where c_{k−1}^v is the image context information at step k−1; α_{k,n}^v is the feature weight of the n-th region of image i; v_k^att is obtained as the weighted average of the features of the different image regions; and the visual self-attention function V_att computes the weight of each image region, with two trainable parameter matrices, each of size 512×512.
Next, the feature fusion layer fuses v_k^att with the associated feature v_cap^i and maps the result into the restricted text space, yielding the image feature output v_k at step k:

  v_k = ReLU(BN(W_k · Φ(v_k^att, v_cap^i)))    (3)

where Φ(·,·) denotes the feature fusion layer, W_k is the fully connected layer parameter of size 512×1024 that maps the fused feature into the restricted text space, BN is the batch normalization layer, and ReLU is the activation function. v_k thus contains not only the object-level image shared features but also the rich association features between objects.
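A possible form of the visual self-attention step and the fusion layer (Equations 2 and 3) is sketched below. The tanh-bilinear scoring function, the softmax normalization of the weights and the element-wise sum used as the fusion Φ are assumptions made for illustration; only the parameter shapes (two 512×512 attention matrices and a 512→1024 projection W_k followed by batch normalization and ReLU) follow the description above.

    import torch
    import torch.nn as nn

    class VisualAttentionStep(nn.Module):
        """One illustrative step of visual self-attention plus fusion (Eqs. 2-3).
        Scoring and fusion forms are assumptions; parameter shapes follow the text."""
        def __init__(self, feat_dim=512, text_dim=1024):
            super().__init__()
            self.w_region = nn.Linear(feat_dim, feat_dim, bias=False)   # 512x512
            self.w_context = nn.Linear(feat_dim, feat_dim, bias=False)  # 512x512
            self.fuse = nn.Linear(feat_dim, text_dim)                   # W_k: 512 -> 1024
            self.bn = nn.BatchNorm1d(text_dim)

        def forward(self, regions, context, assoc):
            # regions: (B, 49, 512), context: (B, 512), assoc: (B, 512)
            scores = torch.tanh(self.w_region(regions)) @ \
                     torch.tanh(self.w_context(context)).unsqueeze(-1)  # (B, 49, 1)
            alpha = torch.softmax(scores, dim=1)                        # region weights
            shared = (alpha * regions).sum(dim=1)                       # weighted average, (B, 512)
            v_k = torch.relu(self.bn(self.fuse(shared + assoc)))        # fuse and map to text space
            return v_k, shared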
2) Text self-attention mechanism
As shown in part C of Figure 3, given the word feature set {u_1^s, …, u_N^s} of text s, the text shared feature u_k at step k is computed by Equation 4:

  α_{k,n}^u = T_att(u_n^s, c_{k−1}^u),   u_k = Σ_n α_{k,n}^u · u_n^s    (4)

where c_{k−1}^u is the text context information at step k−1; α_{k,n}^u is the feature weight of the n-th word of text s; u_k is obtained as the weighted average of the features of the different words; and the text self-attention function T_att computes the weight of each word feature, with two trainable parameter matrices, each of size 1024×512.
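Under the same assumptions, the text self-attention step of Equation 4 differs only in its dimensions (1024-dimensional word features, two 1024×512 parameter matrices); a sketch:

    import torch
    import torch.nn as nn

    class TextAttentionStep(nn.Module):
        """Illustrative text self-attention step (Eq. 4); scoring form is assumed."""
        def __init__(self, word_dim=1024, hid=512):
            super().__init__()
            self.w_word = nn.Linear(word_dim, hid, bias=False)      # 1024x512
            self.w_context = nn.Linear(word_dim, hid, bias=False)   # 1024x512

        def forward(self, words, context):
            # words: (B, N, 1024), context: (B, 1024)
            scores = torch.tanh(self.w_word(words)) @ \
                     torch.tanh(self.w_context(context)).unsqueeze(-1)  # (B, N, 1)
            alpha = torch.softmax(scores, dim=1)                        # word weights
            u_k = (alpha * words).sum(dim=1)                            # weighted average, (B, 1024)
            return u_k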
3) Context information
The context information c_k^v and c_k^u mentioned in steps 1) and 2) encodes the information that the self-attention network has already attended to. Inspired by the identity connection of ResNet (deep residual network), the invention defines the update of the context information as Equation 5:

  c_k^v = c_{k−1}^v + V_att({v_n^i}, c_{k−1}^v),   c_k^u = c_{k−1}^u + T_att({u_n^s}, c_{k−1}^u)    (5)

where k ∈ {1, …, K}, and V_att and T_att denote the visual self-attention and text self-attention functions, respectively. The identity connection controls the flow of context information through the network and retains useful information.
To obtain better experimental results at a faster training speed, the invention initializes the context information c_0^v and c_0^u with the global features of the image and the text, as shown in Equation 6:

  c_0^v = v_global,   c_0^u = u_global    (6)

where v_global and u_global denote the global features of the image and the text, also referred to as global prior knowledge. The global features then serve as global reference information for the multi-step self-attention mechanism and enable rapid localization of key information.
Finally, the invention carries out the multi-step self-attention mechanism over K steps, so that at every step k it can find as much shared information between image and text as possible. The value of K differs between datasets: on Flickr8K, K is set to 1; on Flickr30K and MSCOCO, K is set to 2. The corresponding experimental results are given in the experimental analysis below. The parameter K is the total number of iterations of the multi-step self-attention mechanism; unrolled in time, it can be viewed as applying the self-attention step successively at steps k = 1, …, K.
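The K-step loop with the context updates of Equations 5 and 6 can then be wired together as in the following sketch, which reuses the two attention modules above and assumes that v_global has already been projected to 512 dimensions so that it can serve as the initial image context.

    def multi_step_attention(regions, assoc, words, v_global, u_global,
                             v_step, t_step, K=2):
        """Illustrative K-step loop (Eqs. 5-6) using a VisualAttentionStep
        (v_step) and a TextAttentionStep (t_step) as sketched above."""
        c_v, c_u = v_global, u_global          # Eq. 6: global priors as initial contexts
        v_list, u_list = [], []
        for _ in range(K):
            v_k, shared = v_step(regions, c_v, assoc)
            u_k = t_step(words, c_u)
            v_list.append(v_k)
            u_list.append(u_k)
            c_v = c_v + shared                 # Eq. 5: identity-connection update (image)
            c_u = c_u + u_k                    # Eq. 5: identity-connection update (text)
        return v_list, u_list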
3. Similarity Measurement Network
The invention defines a similarity function sim(v, u) = v · u, where v and u denote the image and text features in the restricted text space. The similarity s_k at step k is obtained by Equation 7:

  s_k = v_k · u_k    (7)

The final similarity S between image and text is then measured by aggregating the useful information of the K steps:

  S = Σ_{k=1}^{K} s_k    (8)

Finally, a triplet loss function is used to update the network parameters, as in Equation 9:

  L = Σ_p [ max(0, m − sim(v, u) + sim(v, u_{s_p})) + max(0, m − sim(v, u) + sim(v_{i_p}, u)) ]    (9)

where s_p is the p-th non-matching text of the input image i; i_p is the p-th non-matching image of the input text s; m is the minimum margin, set to 0.3; and sim(·,·) is the similarity function. The non-matching samples are randomly drawn from the dataset in every training epoch. During training, the network parameters are updated with the Adam optimizer; the learning rate is fixed at 0.0002 for the first ten iterations and, as training progresses, lowered to 0.00002 in the last ten iterations.
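Equations 7-9 can be sketched as follows; the aggregation of the per-step similarities by summation and the exact shape of the negative-sample terms are assumptions, since the corresponding formulas are images in the original filing.

    import torch

    def similarity(v_list, u_list):
        """Eqs. 7-8: dot-product similarity per step, aggregated over K steps
        (aggregation by summation is an assumption)."""
        return sum((v * u).sum(dim=-1) for v, u in zip(v_list, u_list))

    def triplet_loss(s_pos, s_neg_text, s_neg_image, m=0.3):
        """Eq. 9 sketch: bidirectional ranking loss with margin m = 0.3.
        s_pos:       similarity of the matching image-text pair, shape (B,)
        s_neg_text:  similarities of the image with sampled non-matching texts, (B, P)
        s_neg_image: similarities of the text with sampled non-matching images, (B, P)"""
        loss_t = torch.clamp(m - s_pos.unsqueeze(1) + s_neg_text, min=0).sum(dim=1)
        loss_i = torch.clamp(m - s_pos.unsqueeze(1) + s_neg_image, min=0).sum(dim=1)
        return (loss_t + loss_i).mean()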
4. Online Retrieval Demo
The online retrieval demo is implemented mainly with the Tornado toolkit. Tornado is an open-source web server framework that can handle thousands of connections per second at high speed, which makes it an ideal framework for real-time web services.
Tornado plays the role of the controller in the MVC framework. Its tasks are: 1) reading the query; 2) extracting the feature of the query; 3) extracting the features of all data to be retrieved in the database; 4) sending the data to the model. To guarantee the response speed of the demo, the features of all data to be retrieved are pre-loaded into memory.
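A minimal Tornado handler illustrating this controller role might look as follows; encode_text and rank are hypothetical stand-ins for the feature-extraction and core-ranking steps, not functions of the original system.

    import json
    import tornado.ioloop
    import tornado.web

    def encode_text(query):
        # placeholder for the text branch of the retrieval model
        return query

    def rank(feature, top_k=20):
        # placeholder for the Faiss-backed core ranking step sketched below
        return []

    class SearchHandler(tornado.web.RequestHandler):
        def get(self):
            query = self.get_argument("q", "")          # 1) read the query
            feature = encode_text(query)                # 2) extract its feature
            results = rank(feature, top_k=20)           # 3)-4) hand off to the Model, get ranked ids
            self.write(json.dumps({"results": results}))

    def make_app():
        return tornado.web.Application([(r"/search", SearchHandler)])

    if __name__ == "__main__":
        make_app().listen(8888)
        tornado.ioloop.IOLoop.current().start()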
The multi-step self-attention cross-media retrieval method based on the restricted text space proposed by the invention corresponds to the model in the MVC framework and is also called the core ranking algorithm. Its main task is to find the data similar to the query quickly and accurately and send it to the controller. With a small amount of data, the simplest approach is a linear scan, i.e. computing the distance between the query and every sample in the dataset in turn. As the amount of data grows, however, the time cost of a linear scan increases and the response of the demo slows down. Since real data usually forms clusters, we first build cluster centres with a clustering algorithm (such as K-means), then find the cluster centre closest to the query and compare only the data within that cluster to obtain similar results. Based on this principle, we use Faiss, Facebook's open-source framework for efficient similarity search and clustering of dense vectors, to achieve accurate and fast queries. Before querying, Faiss clusters all the data in the dataset into different data clusters.
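A minimal Faiss sketch of this clustered search is shown below; the number of clusters, the gallery size and the use of inner-product similarity (matching the dot-product similarity of the model) are illustrative assumptions.

    import numpy as np
    import faiss

    d = 1024                                   # dimensionality of the restricted text space
    nlist = 100                                # number of K-means clusters (an assumed value)

    xb = np.random.rand(10000, d).astype("float32")   # stand-in for pre-computed gallery features
    xq = np.random.rand(1, d).astype("float32")       # stand-in for the query feature

    quantizer = faiss.IndexFlatIP(d)           # inner product matches the dot-product similarity
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(xb)                            # cluster the gallery into nlist cells
    index.add(xb)
    index.nprobe = 8                           # number of nearest cells to visit per query
    scores, ids = index.search(xq, 20)         # top-20 most similar items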
Finally, the front-end view of the MVC framework corresponds to the search page of mainstream search engines and is implemented mainly with HTML, CSS and JavaScript. The online retrieval demo contains three pages: the main page, the text-to-image retrieval page (Figure 5) and the image-to-text retrieval page (Figure 6). The main page contains a text input box, a camera icon and a "Search" button. The user either types a text in the input box or uploads an image by clicking the camera icon, and then clicks "Search" to start the retrieval. For the input text "A restaurant has modern wooden tables and chairs", Figure 5 shows the corresponding text-to-image results; for an image named "COCO_train2014_000000000049.jpg", Figure 6 shows the corresponding image-to-text results. The results are displayed in order of relevance, decreasing from top to bottom and from left to right. To keep the result pages clean, the search box in Figures 5 and 6 is moved to the upper-left corner; its function is unchanged.
Tables 1-3 give the recall results of the invention on the Flickr8K, Flickr30K and MSCOCO datasets, where Img2Txt denotes image-to-text retrieval and Txt2Img denotes text-to-image retrieval. To evaluate retrieval quality we follow the standard ranking metric Recall@K, which measures retrieval accuracy as the probability that the correctly matched item appears in the top K (K = 1, 5, 10) retrieved results; the larger Recall@K, the more accurate the retrieval (a short computation sketch is given after the model list below). The tables compare the invention with existing state-of-the-art algorithms, including NIC (Neural Image Captioning), m-CNN ENS (Multimodal Convolutional Neural Networks), HM-LSTM (Hierarchical Multimodal LSTM), LTS (Limited Text Space), DAN (Dual Attention Networks), DSPE (Deep Structure-Preserving Image-Text Embeddings), VSE++ (Improving Visual-Semantic Embeddings) and sm-LSTM (Selective Multimodal LSTM). In addition, three comparison models are designed on the basis of the invention:
● MSAN-obj does not use the associated feature v_cap^i and only considers the object-level shared information between image and text;
● MSAN-glob does not use the multi-step self-attention mechanism and represents images and text by global features only;
● MSAN is the complete model, including both the associated feature v_cap^i and the multi-step self-attention mechanism.
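As referenced above, the Recall@K metric can be computed with a short routine such as the following sketch, which assumes a square similarity matrix in which the correct match of query i is candidate i.

    import numpy as np

    def recall_at_k(sim, ks=(1, 5, 10)):
        """Recall@K for a similarity matrix sim[i, j] between query i and
        candidate j, where the correct match of query i is candidate i."""
        ranks = []
        for i, row in enumerate(sim):
            order = np.argsort(-row)                 # candidates sorted by decreasing similarity
            ranks.append(int(np.where(order == i)[0][0]))
        ranks = np.asarray(ranks)
        return {k: float(np.mean(ranks < k)) for k in ks}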
Table 1. Recall results of the embodiment on the Flickr8K dataset (table rendered as an image in the original document).
Table 2. Recall results of the embodiment on the Flickr30K dataset (table rendered as an image in the original document).
Table 3. Recall results of the embodiment on the MSCOCO dataset (table rendered as an image in the original document).
As Tables 1-3 show, MSAN achieves the best results among current VGG-feature-based methods, outperforming strong baselines such as DSPE, HM-LSTM and DAN. Moreover, MSAN outperforms both MSAN-obj and MSAN-glob, demonstrating the effectiveness of the multi-step self-attention mechanism and of the associated features.
Table 4. Effect of different values of K on the cross-media retrieval performance of the embodiment (table rendered as an image in the original document).
Table 4 shows how the number of iterations K of the multi-step self-attention mechanism affects the results on the Flickr8K and Flickr30K datasets. MSAN achieves its best results with K = 1 on Flickr8K and K = 2 on Flickr30K. The larger K is, the more parameters the multi-step self-attention mechanism requires and the more likely overfitting becomes, which lowers the retrieval recall. Therefore K is set to 1 on Flickr8K and to 2 on Flickr30K and MSCOCO.
Table 5. Effect of global prior knowledge on the recall results of the embodiment (table rendered as an image in the original document).
Table 5 shows the effect of global prior knowledge on the experimental results. Two comparison models are designed: "MSAN with prior", the MSAN model that uses global prior knowledge, and "MSAN w/o prior", the MSAN model that does not. As Table 5 shows, the retrieval recall of "MSAN with prior" is higher than that of "MSAN w/o prior", confirming the effectiveness of global prior knowledge. Figure 4 plots the loss curves of the two models on the Flickr8K dataset: "MSAN with prior" converges faster than "MSAN w/o prior" and reaches a lower loss at convergence. Thanks to the introduction of global prior knowledge, the invention therefore achieves better retrieval results with faster convergence.
Figures 5 and 6 show the text-to-image and image-to-text results of the online retrieval demo, respectively. From a subjective point of view, even though the displayed results do not necessarily contain the true matching sample, the proposed multi-step self-attention cross-media retrieval method based on the restricted text space still finds results that are as similar to the query as possible and satisfy the user's need, which verifies the effectiveness of the invention from a subjective perspective.
It should be noted that the embodiments are disclosed to help further understand the invention, but those skilled in the art will understand that various replacements and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the embodiments, and the scope of protection claimed by the invention is defined by the claims.

Claims (11)

  1. A multi-step self-attention cross-media retrieval method based on a restricted text space, in which a restricted text space is first constructed and an unrestricted text space is then converted into the restricted text space, the restricted text space being a text space with a relatively fixed vocabulary; the retrieval method comprises:
    extracting image features and text features through a feature extraction network, the features including global features, regional feature sets and associated features;
    feeding the extracted features into a feature mapping network, and extracting the object-level shared feature information between image and text through a multi-step self-attention mechanism;
    fusing, by the feature mapping network, the object-level shared features with the associated features through a feature fusion layer, and mapping them into the restricted text space;
    aggregating, by a similarity measurement network, the useful information of each step, measuring the similarity between image and text, and computing a triplet loss function;
    thereby realizing multi-step self-attention cross-media retrieval based on the restricted text space.
  2. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 1, wherein the restricted text space is represented by a dataset D; let the dataset D = {D_1, D_2, …, D_I} contain I samples, each sample D_i comprising a picture i and a piece of descriptive text s, i.e. D_i = (i, s), each text consisting of several sentences, each of which independently describes the matching picture; the multi-step self-attention cross-media retrieval method based on a restricted text space comprises the following steps:
    1) extracting the regional features of the images and texts in D through the feature extraction network;
    for an image, extracting the global feature and the regional feature set of the image through the pre-trained neural network VGG, and extracting the associated feature of the interaction information between objects through the image captioning model NIC;
    for a text, extracting the global feature and the regional feature set of the text with a bidirectional long short-term memory network LSTM that is not pre-trained, the parameters of the LSTM being updated synchronously with the parameters of the feature mapping network;
    2) feeding the features extracted in step 1) into the feature mapping network;
    first, attending to the object-level shared information between the image and text regional features through the multi-step self-attention mechanism;
    second, fusing the object-level shared features and the associated features through the feature fusion layer, and mapping them into the restricted text space;
    taking the global features of image and text as the global prior knowledge of the multi-step self-attention mechanism for rapid localization of key information;
    3) aggregating the useful information of each step through the similarity measurement network to measure the final similarity between image and text; the similarity function is defined as:
    sim(v, u) = v · u
    where v and u denote the image and text features in the restricted text space; the similarity s_k at step k is computed by Equation 7:
    s_k = v_k · u_k    (Equation 7)
    and the final similarity S between image and text is measured by aggregating the useful information of the K steps, expressed as Equation 8:
    S = Σ_{k=1}^{K} s_k    (Equation 8)
    4) computing the triplet loss function, and updating the network parameters by optimizing the triplet loss function;
    the triplet loss function is expressed as Equation 9:
    L = Σ_p [ max(0, m − sim(v, u) + sim(v, u_{s_p})) + max(0, m − sim(v, u) + sim(v_{i_p}, u)) ]    (Equation 9)
    where s_p is the p-th non-matching text of the input image i; i_p is the p-th non-matching image of the input text s; m is the minimum margin, set to 0.3; and sim(v, u) is the similarity function.
  3. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 2, wherein in step 1), for a text s = (s_0, s_1, …, s_N), a bidirectional LSTM network is used to extract the feature of each word, expressed as Equation 1:
    h_t^f = LSTM_f(x_t, h_{t−1}^f),   h_t^b = LSTM_b(x_t, h_{t+1}^b),   u_t^s = g(h_t^f, h_t^b)    (Equation 1)
    where x_t is the input word at time t; h_t^f and h_t^b are the hidden-layer outputs of the forward and backward LSTM at time t; and u_t^s is the d-dimensional feature output of the current word;
    the regional feature set of the text is expressed as {u_1^s, …, u_N^s}, and the d-dimensional hidden-layer output of the bidirectional LSTM at the last time step is taken as the global feature u_global; the dimension d is both the feature dimension of the text and the dimension of the restricted text space.
  4. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 2, wherein in step 1), given an input image, the output of the last fully connected layer of VGG is used to extract the 4096-dimensional global feature of the image, denoted v_global; the output of the last pooling layer pool5 of VGG is taken as the regional feature set {v_1^i, …, v_49^i} of the image; this layer outputs 512 feature maps, each of size 7×7, so the total number of image regions is 49 and each region is represented by a 512-dimensional feature vector.
  5. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 4, wherein NIC is used to extract the interaction information between objects, yielding a 512-dimensional associated feature v_cap^i; during the training of NIC, the parameters of VGG and NIC are fixed.
  6. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 1, wherein the feature mapping network uses a visual self-attention mechanism for images, specifically:
    given the regional feature set {v_1^i, …, v_49^i} of image i, extracting the image shared feature v_k^att at step k by Equation 2:
    α_{k,n}^v = V_att(v_n^i, c_{k−1}^v),   v_k^att = Σ_n α_{k,n}^v · v_n^i    (Equation 2)
    where c_{k−1}^v is the context information of the image at step k−1; α_{k,n}^v is the feature weight of the n-th region of image i; v_k^att is obtained by a weighted average of the features of the different image regions; and the visual self-attention function V_att computes the weight of each image region and has two trainable parameter matrices;
    fusing v_k^att with the associated feature v_cap^i through the feature fusion layer and mapping the result into the restricted text space, thereby obtaining the image feature output v_k at step k, expressed as Equation 3:
    v_k = ReLU(BN(W_k · Φ(v_k^att, v_cap^i)))    (Equation 3)
    where Φ(·,·) denotes the feature fusion layer, W_k is the fully connected layer parameter that maps the fused feature into the restricted text space, BN is the batch normalization layer, and ReLU is the activation function; v_k contains both the object-level image shared features and the association features between objects.
  7. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 1, wherein the feature mapping network uses a text self-attention mechanism for texts, specifically:
    given the word feature set {u_1^s, …, u_N^s} of text s, computing the text shared feature u_k at step k by Equation 4:
    α_{k,n}^u = T_att(u_n^s, c_{k−1}^u),   u_k = Σ_n α_{k,n}^u · u_n^s    (Equation 4)
    where c_{k−1}^u is the context information of the text at step k−1; α_{k,n}^u is the feature weight of the n-th word of text s; u_k is obtained by a weighted average of the features of the different words; and the text self-attention function T_att computes the weight of each word feature and has two trainable parameter matrices.
  8. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 6 or 7, wherein the context information c_k^v and c_k^u is used to encode the information that the self-attention network has already attended to; the update of the context information is defined as Equation 5:
    c_k^v = c_{k−1}^v + V_att({v_n^i}, c_{k−1}^v),   c_k^u = c_{k−1}^u + T_att({u_n^s}, c_{k−1}^u)    (Equation 5)
    where k ∈ {1, …, K}, K is the total number of iterations of the multi-step self-attention mechanism, and V_att and T_att denote the visual self-attention and text self-attention functions, respectively.
  9. The multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 8, wherein the global features of the image and the text are taken as the initial context information c_0^v and c_0^u, respectively, as in Equation 6:
    c_0^v = v_global,   c_0^u = u_global    (Equation 6)
    where v_global and u_global denote the global features of the image and the text, i.e. the global prior knowledge; the global features serve as the global reference information of the multi-step self-attention mechanism for rapid localization of key information.
  10. A multi-step self-attention cross-media retrieval system based on a restricted text space implemented with the multi-step self-attention cross-media retrieval method based on a restricted text space according to claim 1 or 2, using a model-view-controller (MVC) framework, wherein the Model uses the multi-step self-attention cross-media retrieval method based on a restricted text space as the core ranking algorithm; the View corresponds to the front-end pages used for inputting query images or texts and displaying the retrieval results; and the Controller corresponds to the back-end controller used for reading the query input from the front end and sending data to the core ranking algorithm.
  11. The multi-step self-attention cross-media retrieval system based on a restricted text space according to claim 10, wherein the front-end pages are implemented with the hypertext markup language HTML, cascading style sheets CSS and JavaScript, and the back-end controller is implemented with the Tornado toolkit.
PCT/CN2019/085771 2019-01-07 2019-05-07 Multi-step self-attention cross-media retrieval method based on restricted text space and system WO2020143137A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910011678.2A CN109783657B (en) 2019-01-07 2019-01-07 Multi-step self-attention cross-media retrieval method and system based on limited text space
CN201910011678.2 2019-01-07

Publications (1)

Publication Number Publication Date
WO2020143137A1 true WO2020143137A1 (en) 2020-07-16

Family

ID=66499980

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085771 WO2020143137A1 (en) 2019-01-07 2019-05-07 Multi-step self-attention cross-media retrieval method based on restricted text space and system

Country Status (2)

Country Link
CN (1) CN109783657B (en)
WO (1) WO2020143137A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189249B (en) * 2019-05-24 2022-02-18 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN110765286A (en) * 2019-09-09 2020-02-07 卓尔智联(武汉)研究院有限公司 Cross-media retrieval method and device, computer equipment and storage medium
CN110706302B (en) * 2019-10-11 2023-05-19 中山市易嘀科技有限公司 System and method for synthesizing images by text
CN111209961B (en) * 2020-01-03 2020-10-09 广州海洋地质调查局 Method for identifying benthos in cold spring area and processing terminal
CN111291551B (en) * 2020-01-22 2023-04-18 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN111782921A (en) * 2020-03-25 2020-10-16 北京沃东天骏信息技术有限公司 Method and device for searching target
CN111914113A (en) * 2020-08-07 2020-11-10 大连理工大学 Image retrieval method and related device
CN112949415B (en) * 2021-02-04 2023-03-24 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium
CN113392254A (en) * 2021-03-29 2021-09-14 西安理工大学 Image text retrieval method based on context awareness
CN113220919B (en) * 2021-05-17 2022-04-22 河海大学 Dam defect image text cross-modal retrieval method and model
CN113204675B (en) * 2021-07-07 2021-09-21 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN113449808B (en) * 2021-07-13 2022-06-21 广州华多网络科技有限公司 Multi-source image-text information classification method and corresponding device, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7788099B2 (en) * 2007-04-09 2010-08-31 International Business Machines Corporation Method and apparatus for query expansion based on multimodal cross-vocabulary mapping
CN101303694A (en) * 2008-04-30 2008-11-12 浙江大学 Method for implementing decussation retrieval between mediums through amalgamating different modality information
US9311544B2 (en) * 2012-08-24 2016-04-12 Jeffrey T Haley Teleproctor reports use of a vehicle and restricts functions of drivers phone
CN108694200B (en) * 2017-04-10 2019-12-20 北京大学深圳研究生院 Cross-media retrieval method based on deep semantic space
CN108052512B (en) * 2017-11-03 2021-05-11 同济大学 Image description generation method based on depth attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140133759A1 (en) * 2012-11-14 2014-05-15 Nec Laboratories America, Inc. Semantic-Aware Co-Indexing for Near-Duplicate Image Retrieval
CN104462489A (en) * 2014-12-18 2015-03-25 北京邮电大学 Cross-modal retrieval method based on deep-layer models
CN107330100A (en) * 2017-07-06 2017-11-07 北京大学深圳研究生院 Combine the two-way search method of image text of embedded space based on multi views
CN108319686A (en) * 2018-02-01 2018-07-24 北京大学深圳研究生院 Antagonism cross-media retrieval method based on limited text space

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897974B (en) * 2020-08-12 2024-04-16 吉林大学 Heterogeneous knowledge graph learning method based on multilayer attention mechanism
CN111897974A (en) * 2020-08-12 2020-11-06 吉林大学 Heterogeneous knowledge graph learning method based on multilayer attention mechanism
CN112001166A (en) * 2020-08-24 2020-11-27 齐鲁工业大学 Intelligent question-answer sentence-to-semantic matching method and device for government affair consultation service
CN112001166B (en) * 2020-08-24 2023-10-17 齐鲁工业大学 Intelligent question-answer sentence semantic matching method and device for government affair consultation service
CN112084358A (en) * 2020-09-04 2020-12-15 中国石油大学(华东) Image-text matching method based on regional enhanced network with theme constraint
CN112084358B (en) * 2020-09-04 2023-10-27 中国石油大学(华东) Image-text matching method based on area strengthening network with subject constraint
CN112613451A (en) * 2020-12-29 2021-04-06 民生科技有限责任公司 Modeling method of cross-modal text picture retrieval model
CN112651448A (en) * 2020-12-29 2021-04-13 中山大学 Multi-modal emotion analysis method for social platform expression package
CN112651448B (en) * 2020-12-29 2023-09-15 中山大学 Multi-mode emotion analysis method for social platform expression package
CN112965968A (en) * 2021-03-04 2021-06-15 湖南大学 Attention mechanism-based heterogeneous data pattern matching method
CN112965968B (en) * 2021-03-04 2023-10-24 湖南大学 Heterogeneous data pattern matching method based on attention mechanism
CN113642630B (en) * 2021-08-10 2024-03-15 福州大学 Image description method and system based on double-path feature encoder
CN113642630A (en) * 2021-08-10 2021-11-12 福州大学 Image description method and system based on dual-path characteristic encoder
CN113704443A (en) * 2021-09-08 2021-11-26 天津大学 Dialog generation method fusing explicit and implicit personalized information
CN113704443B (en) * 2021-09-08 2023-10-13 天津大学 Dialog generation method integrating explicit personalized information and implicit personalized information
CN114201621A (en) * 2021-11-24 2022-03-18 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention
CN114201621B (en) * 2021-11-24 2024-04-02 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on graphic and text cooperative attention
CN114298159B (en) * 2021-12-06 2024-04-09 湖南工业大学 Image similarity detection method based on text fusion under label-free sample
CN114298159A (en) * 2021-12-06 2022-04-08 湖南工业大学 Image similarity detection method based on text fusion under label-free sample
CN114372163B (en) * 2021-12-09 2024-04-23 西安理工大学 Image retrieval method based on attention mechanism and feature fusion
CN114372163A (en) * 2021-12-09 2022-04-19 西安理工大学 Image retrieval method based on attention mechanism and feature fusion
CN114494813A (en) * 2021-12-24 2022-05-13 西北工业大学 Method for generating nominal expression based on intensive cross attention
CN114494813B (en) * 2021-12-24 2024-03-05 西北工业大学 Dense cross attention-based index expression generation method
CN114547235B (en) * 2022-01-19 2024-04-16 西北大学 Construction method of image text matching model based on priori knowledge graph
CN114547235A (en) * 2022-01-19 2022-05-27 西北大学 Method for constructing image text matching model based on prior knowledge graph
CN114625882A (en) * 2022-01-26 2022-06-14 西安理工大学 Network construction method for improving unique diversity of image text description
CN114625882B (en) * 2022-01-26 2024-04-16 西安理工大学 Network construction method for improving unique diversity of image text description
CN114821770A (en) * 2022-04-11 2022-07-29 华南理工大学 Text-to-image cross-modal pedestrian re-identification method, system, medium, and apparatus
CN114821770B (en) * 2022-04-11 2024-03-26 华南理工大学 Cross-modal pedestrian re-identification method, system, medium and device from text to image
CN114840705B (en) * 2022-04-27 2024-04-19 中山大学 Combined commodity retrieval method and system based on multi-mode pre-training model
CN114840705A (en) * 2022-04-27 2022-08-02 中山大学 Combined commodity retrieval method and system based on multi-mode pre-training model
CN115909317A (en) * 2022-07-15 2023-04-04 广东工业大学 Learning method and system for three-dimensional model-text joint expression
CN115757857A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater three-dimensional cross-modal combined retrieval method, storage medium and electronic equipment
CN115858848B (en) * 2023-02-27 2023-08-15 浪潮电子信息产业股份有限公司 Image-text mutual inspection method and device, training method and device, server and medium
CN115858848A (en) * 2023-02-27 2023-03-28 浪潮电子信息产业股份有限公司 Image-text mutual inspection method and device, training method and device, server and medium
CN116310425B (en) * 2023-05-24 2023-09-26 山东大学 Fine-grained image retrieval method, system, equipment and storage medium
CN116310425A (en) * 2023-05-24 2023-06-23 山东大学 Fine-grained image retrieval method, system, equipment and storage medium
CN117316369A (en) * 2023-08-24 2023-12-29 兰州交通大学 Automatic chest image diagnosis report generation method balancing cross-modal information
CN117316369B (en) * 2023-08-24 2024-05-07 兰州交通大学 Automatic chest image diagnosis report generation method balancing cross-modal information
CN116994069B (en) * 2023-09-22 2023-12-22 武汉纺织大学 Image analysis method and system based on multi-mode information
CN116994069A (en) * 2023-09-22 2023-11-03 武汉纺织大学 Image analysis method and system based on multi-mode information
CN117292442B (en) * 2023-10-13 2024-03-26 中国科学技术大学先进技术研究院 Cross-modal and cross-domain universal face forgery localization method
CN117292442A (en) * 2023-10-13 2023-12-26 中国科学技术大学先进技术研究院 Cross-modal and cross-domain universal face forgery localization method
CN117932099A (en) * 2024-03-21 2024-04-26 大连海事大学 Multi-modal image retrieval method based on modified text feedback

Also Published As

Publication number Publication date
CN109783657A (en) 2019-05-21
CN109783657B (en) 2022-12-30

Similar Documents

Publication Publication Date Title
WO2020143137A1 (en) Multi-step self-attention cross-media retrieval method based on restricted text space and system
CN108319686B (en) Adversarial cross-media retrieval method based on limited text space
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
Wiseman et al. Learning neural templates for text generation
US20220222920A1 (en) Content processing method and apparatus, computer device, and storage medium
WO2018195875A1 (en) Generating question-answer pairs for automated chatting
US11586810B2 (en) Generating responses in automated chatting
US8694303B2 (en) Systems and methods for tuning parameters in statistical machine translation
US20170185581A1 (en) Systems and methods for suggesting emoji
WO2019000326A1 (en) Generating responses in automated chatting
US20210168098A1 (en) Providing local service information in automated chatting
WO2018165932A1 (en) Generating responses in automated chatting
Li et al. Residual attention-based LSTM for video captioning
JP7335300B2 (en) Knowledge pre-trained model training method, apparatus and electronic equipment
Cai et al. Intelligent question answering in restricted domains using deep learning and question pair matching
US20230306205A1 (en) System and method for personalized conversational agents travelling through space and time
CN111581364B (en) Chinese intelligent question answering short text similarity calculation method oriented to the medical field
Perez-Martin et al. A comprehensive review of the video-to-text problem
CN111651661A (en) Image-text cross-media retrieval method
CN113934835A (en) Retrieval-based reply dialogue method and system combining keywords and semantic understanding representation
He et al. Hierarchical attention and knowledge matching networks with information enhancement for end-to-end task-oriented dialog systems
Wang et al. Image captioning based on deep learning methods: A survey
CN114817510B (en) Question and answer method, question and answer data set generation method and device
CN113222772B (en) Native personality dictionary construction method, native personality dictionary construction system, storage medium and electronic equipment
Meng et al. Tibetan Comment Text Sentiment Recognition Algorithm Based on Syllables

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 19909251
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 19909251
Country of ref document: EP
Kind code of ref document: A1