CN112256904A - Image retrieval method based on visual description sentences - Google Patents

Image retrieval method based on visual description sentences

Info

Publication number
CN112256904A
CN112256904A (application CN202010998165.8A)
Authority
CN
China
Prior art keywords
image
visual
map
statement
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010998165.8A
Other languages
Chinese (zh)
Inventor
聂为之
李杰思
刘安安
徐宁
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010998165.8A priority Critical patent/CN112256904A/en
Publication of CN112256904A publication Critical patent/CN112256904A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention discloses an image retrieval method based on visual description sentences, which comprises the following steps: based on a graph convolutional deep learning network, an information transfer mode for the nodes and edges of the visual knowledge graph representation is constructed to aggregate and update the features of each semantic unit; the aggregated and updated semantic unit features in the graph are encoded by a multi-layer long short-term memory network combined with an attention mechanism to generate image description sentences; under a reinforcement learning framework, reward-penalty functions based on the image description sentence are designed using the CIDEr score and graph similarity, and are used for feedback regulation and optimization of the "image-graph", "graph-sentence", and "image-graph-sentence" processes, so that finer-grained visual description sentences of the image are obtained and used for retrieval, and the target retrieval image corresponding to a query image is output. The invention improves the feasibility of text-based image retrieval on large-scale data sets.

Description

Image retrieval method based on visual description sentences
Technical Field
The invention relates to the field of image retrieval, in particular to an image retrieval method based on visual description sentences.
Background
In recent years, image retrieval has been a research focus in the field of computer vision, among which content-based image retrieval methods are the most popular [1][2]. However, content-based image retrieval methods mainly rely on low-level visual features of images, such as color, shape, and texture [3][4][5]; they cannot capture the high-level semantic information of the image, which conflicts with the way people usually judge image similarity, namely according to its semantic content. As a result, most content-based image retrieval systems cannot fully reflect or match the query intent, and a semantic gap exists between low-level features and high-level understanding. To reduce this semantic gap, handling content-based image retrieval by matching the visual elements of an image in natural-language form has attracted the attention of researchers [6][7], in which the visual description sentences of an image play a key role. A visual description sentence expands the image representation from a small set of category labels or keywords to a detailed sentence; it contains richer high-level semantic features of the image and can improve retrieval precision through longer and more targeted queries. However, efficiently and accurately generating multiple visual description sentences for an image remains a major challenge: it requires not only deep understanding and modeling of the image content, but also processing of the natural description language (including words, phrases, and sentences) to match the corresponding image content. Recently, many researchers have applied visual knowledge-graph theory to the generation of visual description sentences for images [8][9]. However, most of this work stays at the feed-forward deep modeling stage, i.e., a unidirectional modeling process of first encoding the "image" into a "graph" and then decoding the "graph" into a "sentence". Because image description is subjective, a visual knowledge graph can often be parsed into a variety of description sentences, so a feed-forward deep modeling algorithm alone cannot satisfy the requirement of diverse visual parsing [10].
Disclosure of Invention
The invention provides an image retrieval method based on visual description sentences. Unlike existing methods that perform image retrieval relying only on low-level visual features of the image, the invention performs retrieval using generated visual description sentences of the image. This can effectively reduce the semantic gap in image retrieval, better support refined and complex image queries, and automatically generate the corresponding image description, avoiding the drawbacks of manual labeling and improving the feasibility of text-based image retrieval on large-scale data sets. The method is described in detail as follows:
an image retrieval method based on visual descriptive sentences, the method comprising the steps of:
constructing, based on a graph convolutional deep learning network, an information transfer mode for the nodes and edges of the visual knowledge graph representation, so as to aggregate and update the features of each semantic unit;
encoding the aggregated and updated semantic unit features in the graph with a multi-layer long short-term memory network combined with an attention mechanism, so as to generate an image description sentence;
under a reinforcement learning framework, designing reward-penalty functions based on the image description sentence from the CIDEr score and the graph similarity, which are used for feedback regulation and optimization of the "image-graph" process, the "graph-sentence" process, and the "image-graph-sentence" process; obtaining finer-grained visual description sentences of the image, performing retrieval with them, and outputting the target retrieval image corresponding to the query image.
The reward-penalty functions are specifically as follows:

R_i(ĉ) = ω_c · CIDEr(ĉ) + ω_s · s_i,   i = 1, 2, 3

wherein ω_c and ω_s are trainable fusion weights; ĉ denotes the predicted description sentence of the image and CIDEr(ĉ) is the CIDEr score of the predicted description sentence; s_i ∈ {s_1, s_2, s_3} are the 3 similarity scores characterizing graph similarity; R_1, R_2 and R_3 are the 3 reward-penalty functions, where R_1 is used for optimization of the "image-graph-sentence" process, R_2 for individual optimization of the "image-graph" process, and R_3 for individual optimization of the "graph-sentence" process.
Further, the method further comprises:
taking the reward-penalty function as the reward mechanism and updating the network parameters with the policy gradient in reinforcement learning, the gradient of the loss function L_RL(θ) with respect to the parameter θ is calculated as follows:

∇_θ L_RL(θ) = -E_{ŷ~p_θ}[ R(ŷ) · ∇_θ log p_θ(ŷ) ]

wherein E_{ŷ~p_θ}[·] denotes the expectation of the reward objective function, R(ŷ) is the reward-penalty function defined above, and p_θ(ŷ) is the score (probability) assigned by the model to the sampled sentence ŷ.
The method further comprises: detecting visual entities and visual relations from the input image, and constructing the visual knowledge graph representation corresponding to the image.
The technical solution provided by the invention has the following beneficial effects:
1. The invention performs retrieval with the visual description sentences of images, uses a visual knowledge graph to better represent image semantics, introduces a feedback regulation mechanism, and uses reinforcement learning to improve the accuracy and diversity of the description sentences, thereby effectively reducing the semantic gap in image retrieval and improving retrieval efficiency and precision;
2. The method automatically generates the corresponding image descriptions without manual annotation, avoiding the drawbacks of manual labeling and improving the feasibility of text-based image retrieval on large-scale data sets.
Drawings
FIG. 1 is a flow chart of the image retrieval method based on visual description sentences;
FIG. 2 is a schematic diagram of the visual knowledge graph of an image;
FIG. 3 is a schematic diagram of the two-layer long short-term memory (LSTM) network encoding.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
To solve the problems described in the background art: cognitive neuroscience shows that the human visual system is formed by a large number of feed-forward and feedback connections, whereas content-based image retrieval is usually based on low-level visual feature representations, which cannot fully reflect or match the query intent and leave a semantic gap in retrieval. Therefore, the invention introduces a feedback regulation mechanism into the visual-graph-based description model, jointly explores feedback-type deep modeling optimization of the sentence-to-graph process, the graph-to-image process, and the overall process, and performs complex image retrieval based on the generated visual description sentences of the image. This effectively reduces the semantic gap in image retrieval, realizes accurate and diverse generation of visual description sentences for an image, and the sentences are further used for image retrieval, improving retrieval efficiency and accuracy and better supporting refined and complex image queries.
Example 1
An image retrieval method based on visual descriptive sentences, referring to fig. 1, the method comprises the following steps:
step 101: detecting visual entities and visual relations from the input image, and constructing the visual knowledge graph representation corresponding to the image;
step 102: based on a graph convolutional deep learning network, constructing an information transfer mode for the nodes and edges of the visual knowledge graph representation, so as to aggregate and update the features of each semantic unit;
step 103: encoding the aggregated and updated semantic unit features in the graph with a multi-layer long short-term memory network combined with an attention mechanism, so as to generate an image description sentence;
step 104: under a reinforcement learning framework, designing 3 reward-penalty functions based on the image description sentence from the CIDEr score and the graph similarity, which are respectively used for feedback regulation and optimization of the "image-graph" process, the "graph-sentence" process, and the overall "image-graph-sentence" process, obtaining finer-grained visual description sentences of the image;
step 105: performing retrieval based on the fine-grained visual description sentences to obtain the target retrieval image corresponding to the query image.
Example 2
The scheme of example 1 is further described below with reference to specific calculation formulas and examples, which are described in detail below:
201: the knowledge graph is represented as a tuple G = (N, E), where N and E are the sets of nodes and edges, respectively;
N contains three types of nodes: entity nodes o, attribute nodes a, and relation nodes r. To obtain the node representations, entities are first detected and classified with the Faster R-CNN (faster region-based convolutional neural network) object detector. Faster R-CNN consists of convolutional layers, an RPN (region proposal network) layer, a RoI Pooling (region-of-interest pooling) layer, and a classification-and-regression layer: the features of the input image are extracted by several convolutional layers and fed into the RPN, which is trained to generate a number of candidate region boxes; the region boxes are combined with the image features extracted by the convolutional layers, and after RoI Pooling, region-box classification and position regression are performed in the classification-and-regression layer to obtain the final object class and the precise position of the detection box.
Using a pre-trained Faster R-CNN, the invention selects at least 10 and at most 100 entities for each image and extracts their features with RoI Pooling; the attribute classification result of each visual entity is computed with a fully connected layer and a Softmax function, and the relation classification result between visual entities is computed with the MOTIFS model [11], yielding the feature representations u_o, u_a and u_r of the entity, attribute and relation nodes. Further, with the information of each node known, let o_i denote the i-th entity, r_ij the relation between entities o_i and o_j, and a_{i,l} the l-th attribute of entity o_i. The edges in E are defined as follows: if object o_i possesses attribute a_{i,l}, a directed edge from o_i to a_{i,l} is established; if a relation triple <o_i - r_ij - o_j> exists, two directed edges are established, from o_i to r_ij and from r_ij to o_j.
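For illustration only, the following is a minimal sketch (not part of the original disclosure) of how the graph tuple G = (N, E) could be assembled once the entity, attribute and relation features have been produced by the detection pipeline described above; the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class VisualKnowledgeGraph:
    entity_feats: Dict[int, list] = field(default_factory=dict)            # u_{o_i}
    attr_feats: Dict[Tuple[int, int], list] = field(default_factory=dict)  # u_{a_{i,l}}
    rel_feats: Dict[Tuple[int, int], list] = field(default_factory=dict)   # u_{r_ij}
    edges: List[Tuple[str, str]] = field(default_factory=list)             # directed edges in E

def build_graph(entities, attributes, relations):
    """entities: {i: feature}, attributes: {(i, l): feature},
    relations: {(i, j): feature} for detected triples <o_i - r_ij - o_j>."""
    g = VisualKnowledgeGraph(dict(entities), dict(attributes), dict(relations))
    for (i, l) in attributes:                     # directed edge o_i -> a_{i,l}
        g.edges.append((f"o{i}", f"a{i}_{l}"))
    for (i, j) in relations:                      # directed edges o_i -> r_ij -> o_j
        g.edges.append((f"o{i}", f"r{i}_{j}"))
        g.edges.append((f"r{i}_{j}", f"o{j}"))
    return g
```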
202: the proposed multi-modal graph convolution network specifically uses a four-function spatial graph convolution function fr、fs、faAnd foAnd calculating the complex semantic association among the nodes in the graph, wherein the four functions are all two-layer fc-ReLU structures with independent parameters, and carrying out context coding on the node characteristics of the graph to generate a semantic context-aware aggregation characteristic representation V.
The aggregated feature representation V contains three types of embeddings (embedded vectors): the relation embedding v_{r_ij} of relation node r_ij, the attribute embedding v_{a_i} of entity node o_i, and the entity embedding v_{o_i}.
Specifically, given a relation triple <o_i - r_ij - o_j> in the graph, the relation embedding v_{r_ij} is calculated as:

v_{r_ij} = f_r(u_{o_i}, u_{r_ij}, u_{o_j})   (1)

wherein u_{o_i} and u_{o_j} are the feature vector representations of entity nodes o_i and o_j respectively, and u_{r_ij} is the feature vector representation of the relation node r_ij between o_i and o_j.
Given all the attributes {a_{i,1}, ..., a_{i,N_{a_i}}} of entity node o_i in the graph, where N_{a_i} denotes the number of attributes possessed by o_i, the attribute embedding v_{a_i} is calculated as:

v_{a_i} = (1 / N_{a_i}) · Σ_{l=1}^{N_{a_i}} f_a(u_{o_i}, u_{a_{i,l}})   (2)

wherein u_{a_{i,l}} is the feature vector representation of the attribute node corresponding to o_i.
In the graph, because node o_i can act in a relation tuple either as the head entity ("subject") or as the tail entity ("object"), different functions are used to integrate the knowledge of all relation tuples containing o_i, and the entity embedding v_{o_i} is calculated as:

v_{o_i} = (1 / N_{r_i}) · [ Σ_{o_j ∈ sbj(o_i)} f_s(u_{o_i}, u_{r_ij}, u_{o_j}) + Σ_{o_k ∈ obj(o_i)} f_o(u_{o_k}, u_{r_ki}, u_{o_i}) ]   (3)

wherein, when node o_j ∈ sbj(o_i), o_j is the object and o_i is the subject; when node o_k ∈ obj(o_i), o_k is the subject and o_i is the object; N_{r_i} is the number of all relation triples in which o_i appears; sbj(o_i) is the node set for which o_i acts as the head entity, and obj(o_i) the node set for which o_i acts as the tail entity; f_s and f_o are the feature convolution transfer functions of the head entity and the tail entity respectively; u_{o_i}, u_{o_j} and u_{o_k} are the feature vector representations of entity nodes o_i, o_j and o_k, and u_{r_ij} and u_{r_ki} are the feature vector representations of the relation nodes r_ij (between o_i and o_j) and r_ki (between o_k and o_i).
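As an illustrative sketch only (the patent specifies the four functions only as two-layer fc-ReLU structures; the PyTorch modules, feature dimension d and argument ordering below are assumptions), equations (1)-(3) could be realized as follows:

```python
import torch
import torch.nn as nn

def fc_relu(in_dim, out_dim):
    """Two-layer fully connected block with ReLU; independent parameters per function."""
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU(),
                         nn.Linear(out_dim, out_dim), nn.ReLU())

class MultiModalGCN(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.f_r = fc_relu(3 * d, d)   # relation embedding, eq. (1)
        self.f_a = fc_relu(2 * d, d)   # attribute embedding, eq. (2)
        self.f_s = fc_relu(3 * d, d)   # o_i acting as subject, eq. (3)
        self.f_o = fc_relu(3 * d, d)   # o_i acting as object, eq. (3)

    def relation_embed(self, u_oi, u_rij, u_oj):                    # eq. (1)
        return self.f_r(torch.cat([u_oi, u_rij, u_oj], dim=-1))

    def attribute_embed(self, u_oi, u_attrs):                       # eq. (2); u_attrs: [N_a, d]
        pairs = torch.cat([u_oi.expand_as(u_attrs), u_attrs], dim=-1)
        return self.f_a(pairs).mean(dim=0)

    def entity_embed(self, u_oi, sbj_triples, obj_triples):         # eq. (3)
        """sbj_triples: list of (u_rij, u_oj) where o_i is the subject;
        obj_triples: list of (u_ok, u_rki) where o_i is the object."""
        msgs = [self.f_s(torch.cat([u_oi, r, o], dim=-1)) for r, o in sbj_triples]
        msgs += [self.f_o(torch.cat([o, r, u_oi], dim=-1)) for o, r in obj_triples]
        return torch.stack(msgs).mean(dim=0) if msgs else u_oi
```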
203: the following notation is defined to represent the operation of a long short-term memory (LSTM) network over a single time step:

h_t = LSTM(x_t, h_{t-1})   (4)

wherein x_t is the input vector of the LSTM and h_t is its output vector.
Specifically, first, at each time step t, the input word ω_t, the average-pooled feature v̄, and the previous output h_{t-1}^2 of the second-layer LSTM are concatenated so that maximum context information is collected as the input of the first-layer LSTM, as follows:

x_t^1 = [W_Σ ω_t ; v̄ ; h_{t-1}^2]   (5)

wherein W_Σ is the word embedding matrix for a vocabulary of size Σ, and v̄ is the result of average pooling over the aggregated feature representation V output by the multi-modal graph convolutional network in the previous step.
Further, at each time step t, a normalized attention distribution over the aggregated features V is generated according to the output h_t^1 of the first-layer LSTM, as follows:

a_{t,m} = ω_a^T tanh(W_v v_m + W_h h_t^1)
α_t = softmax(a_t)   (6)

wherein v_m is any one of the three types of embeddings in V, W_v, W_h and ω_a are learned parameters, and α_t is the resulting normalized attention distribution.
Based on the attention distribution, an attention-weighted sum over all embeddings in V is computed to obtain the new feature v̂_t = Σ_m α_{t,m} v_m. Then v̂_t and h_t^1 are combined as the input of the second-layer LSTM, as follows:

x_t^2 = [v̂_t ; h_t^1]   (7)
further, using the symbol y1:TTo represent a word sequence (y)1,...,yT) At each time step t, using the output of the second layer LSTM
Figure BDA0002693336820000065
The conditional distribution of possible output words is given by:
Figure BDA0002693336820000066
wherein, WpAnd bpAre the learning weights and biases. Finally, the distribution of the complete output sequence is calculated as the product of the conditional distributions:
Figure BDA0002693336820000067
wherein T is the total number of time steps.
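For illustration, the following is a minimal sketch of one decoding time step of this two-layer attention LSTM (equations (5)-(8)); it is not the original implementation, and the hidden sizes, module names and single-sample batch handling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerAttnDecoder(nn.Module):
    def __init__(self, vocab_size, d=1024, h=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)     # word embedding matrix W_Sigma
        self.lstm1 = nn.LSTMCell(d + d + h, h)       # takes the input of eq. (5)
        self.lstm2 = nn.LSTMCell(d + h, h)           # takes the input of eq. (7)
        self.W_v = nn.Linear(d, h, bias=False)
        self.W_h = nn.Linear(h, h, bias=False)
        self.w_a = nn.Linear(h, 1, bias=False)
        self.W_p = nn.Linear(h, vocab_size)          # output projection of eq. (8)

    def step(self, word_t, V, state1, state2):
        """word_t: LongTensor [1]; V: [M, d] aggregated embeddings;
        state1/state2: (h, c) tuples of shape [1, h] for the two LSTM layers."""
        v_bar = V.mean(dim=0, keepdim=True)                          # average pooling of V
        x1 = torch.cat([self.embed(word_t), v_bar, state2[0]], dim=-1)
        h1, c1 = self.lstm1(x1, state1)                              # eq. (5)
        a = self.w_a(torch.tanh(self.W_v(V) + self.W_h(h1))).squeeze(-1)
        alpha = F.softmax(a, dim=0)                                  # eq. (6)
        v_hat = (alpha.unsqueeze(-1) * V).sum(dim=0, keepdim=True)   # attention-weighted sum
        h2, c2 = self.lstm2(torch.cat([v_hat, h1], dim=-1), state2)  # eq. (7)
        log_p = F.log_softmax(self.W_p(h2), dim=-1)                  # eq. (8)
        return log_p, (h1, c1), (h2, c2)
```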
204: first, the CIDEr score CIDEr(c_i, S_i) of the sentence predicted by the model is calculated, where c_i and S_i = {s_i1, ..., s_im} are respectively the candidate description sentence and the reference description sentences of image I_i;
Let h_k(s_ij) denote the number of times the n-gram w_k appears in the reference sentence s_ij, and h_k(c_i) the number of times it appears in the candidate sentence c_i. The TF-IDF weight g_k(s_ij) of each n-gram w_k is calculated as:

g_k(s_ij) = [ h_k(s_ij) / Σ_{w_l ∈ Ω} h_l(s_ij) ] · log( |I| / Σ_{I_p ∈ I} min(1, Σ_q h_k(s_pq)) )   (10)

wherein Ω is the set of all n-grams, I is the set of all images in the dataset, w_l is any n-gram in Ω, h_l(s_ij) is the number of times w_l appears in the reference sentence s_ij, I_p is an arbitrary image in the dataset, and h_k(s_pq) is the number of times w_k appears in the reference sentence s_pq corresponding to I_p.
For n-grams of length n, the CIDEr_n score is calculated from the average cosine similarity between the candidate sentence c_i and the reference sentences S_i:

CIDEr_n(c_i, S_i) = (1/m) · Σ_j [ g^n(c_i) · g^n(s_ij) ] / ( ||g^n(c_i)|| · ||g^n(s_ij)|| )   (11)

wherein m is the number of sentences contained in S_i, g^n(c_i) is the vector of TF-IDF weights of all n-grams of length n in the candidate sentence c_i, and g^n(s_ij) is the vector of TF-IDF weights of all n-grams of length n in the reference sentence s_ij.
The scores for n-grams of different lengths are weighted and summed to calculate the total CIDEr score, where ω_n is a trade-off parameter:

CIDEr(c_i, S_i) = Σ_{n=1}^{N} ω_n · CIDEr_n(c_i, S_i)   (12)
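A minimal, simplified sketch of this CIDEr-style scoring (equations (10)-(12)) is given below for illustration; it omits the stemming and length penalty of the full metric, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of n-gram tuples in a token list."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def cider(candidate, references, all_refs, N=4, weights=None):
    """candidate: token list; references: list of token lists for this image;
    all_refs: list over the whole dataset, one list of reference token lists per image."""
    weights = weights or [1.0 / N] * N          # the trade-off parameters omega_n
    log_I = math.log(len(all_refs))             # log |I| term of eq. (10)
    total = 0.0
    for n in range(1, N + 1):
        # document frequency of each n-gram over all images (IDF denominator in eq. (10))
        df = Counter()
        for refs in all_refs:
            seen = set()
            for r in refs:
                seen.update(ngrams(r, n).keys())
            df.update(seen)

        def tfidf(counts):
            norm = sum(counts.values()) or 1
            return {g: (c / norm) * (log_I - math.log(max(df[g], 1)))
                    for g, c in counts.items()}

        g_c = tfidf(ngrams(candidate, n))
        sims = []
        for ref in references:                  # eq. (11): average cosine similarity
            g_r = tfidf(ngrams(ref, n))
            dot = sum(g_c.get(k, 0.0) * v for k, v in g_r.items())
            denom = (math.sqrt(sum(v * v for v in g_c.values()))
                     * math.sqrt(sum(v * v for v in g_r.values()))) or 1.0
            sims.append(dot / denom)
        total += weights[n - 1] * (sum(sims) / max(len(sims), 1))   # eq. (12)
    return total
```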
further, for matching image-sentence pairs, the features of two different modalities of the image and the sentence are mapped to the same space by using a knowledge graph for comparison. Knowledge-graph characterization G of known images1For model prediction description sentences and real description sentences of the images, through natural language processing, a sentence parser of Stanford is used for constructing corresponding knowledge graph representation G2And G3. Performing a map G1、G2And G3Comparing the nodes between every two, and calculating the node similarity to represent the similarity between the maps according to the following formula:
Figure BDA0002693336820000072
wherein the content of the first and second substances,
Figure BDA0002693336820000073
for node comparison function, sigma is sigmoid activation function, s1、s2And s3Are each G1And G2、G1And G3、G2And G3Normalized similarity score of (a).
The similarity scores are fused with the CIDEr score as follows:

R_i(ĉ) = ω_c · CIDEr(ĉ) + ω_s · s_i,   i = 1, 2, 3   (14)

wherein ω_c and ω_s are trainable fusion weights; ĉ denotes the predicted description sentence of the image and CIDEr(ĉ) is the CIDEr score of the predicted description sentence; s_i ∈ {s_1, s_2, s_3} are the 3 similarity scores defined above; R_1, R_2 and R_3 are the 3 reward-penalty functions, where R_1 is used for optimization of the "image-graph-sentence" process, R_2 for individual optimization of the "image-graph" process, and R_3 for individual optimization of the "graph-sentence" process.
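A minimal sketch of this fusion (equation (14)) is shown below for illustration; the weights omega_c and omega_s are the trainable fusion weights, passed in here as plain numbers, and the function name is hypothetical.

```python
def fused_rewards(cider_score, s1, s2, s3, omega_c=1.0, omega_s=1.0):
    """cider_score: CIDEr score of the predicted sentence; s1, s2, s3: graph
    similarity scores for (G1, G2), (G1, G3), (G2, G3); omega_c, omega_s: fusion weights.
    Returns (R1, R2, R3), the rewards used for the image-graph-sentence,
    image-graph and graph-sentence optimizations respectively."""
    return tuple(omega_c * cider_score + omega_s * s for s in (s1, s2, s3))
```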
In the optimization process, the reward-penalty function is used as the reward mechanism, and the policy gradient in reinforcement learning is used to update the network parameters; the gradient of the loss function L_RL(θ) with respect to the parameter θ is calculated as follows:

∇_θ L_RL(θ) = -E_{ŷ~p_θ}[ R(ŷ) · ∇_θ log p_θ(ŷ) ]   (15)

wherein E_{ŷ~p_θ}[·] denotes the expectation of the reward objective function, R(ŷ) is the reward-penalty function defined above, and p_θ(ŷ) is the score (probability) assigned by the model to the sampled sentence ŷ.
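For illustration, a minimal sketch of a policy-gradient loss consistent with equation (15) follows (a single sampled sentence, no reward baseline; whether the original training subtracts a baseline, as in self-critical sequence training, is not stated in the text).

```python
import torch

def rl_loss(log_probs, reward):
    """log_probs: tensor [T] of log p_theta(y_t | y_<t) for one sampled sentence;
    reward: scalar reward-penalty value R(y_hat) for that sentence.
    Minimizing -R * sum(log p) yields the gradient -R * grad log p_theta(y_hat),
    matching equation (15) with the expectation approximated by a single sample."""
    return -reward * log_probs.sum()
```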
205: let I = {i_1, i_2, ..., i_m} denote the dataset of the current image retrieval, and let c_q and c_{i_l} denote the visual description sentences of the query image q and of any candidate image i_l ∈ I. First, Word2vec is used to encode c_q and c_{i_l} into the corresponding vector representations V_q = {e_q1, e_q2, ..., e_qN} and V_{i_l} = {e_l1, e_l2, ..., e_lM}, where e_qn and e_lm are the word embeddings in c_q and c_{i_l} respectively. Further, using these encoded vector representations, the semantic similarity between images is measured by calculating the distance between the vectors; the smaller the distance, the more similar the images. Image retrieval thus becomes a distance metric over the embedding vector representations of the images' visual description sentences, defined as follows:
sim(q, i_l) = (1 / (N·M)) · Σ_{n=1}^{N} Σ_{m=1}^{M} ( e_qn · e_lm ) / ( ||e_qn|| · ||e_lm|| )   (16)

wherein N and M are the numbers of elements in the vector representations V_q and V_{i_l} respectively, e_qn ∈ V_q and e_lm ∈ V_{i_l}, and sim(q, i_l) scores the similarity between the query image and the candidate image.
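As an illustrative sketch (the averaged cosine form follows equation (16) as reconstructed above; the use of a Word2vec-style word-to-vector mapping and all function names are assumptions), retrieval can be ranked as follows:

```python
import numpy as np

def sentence_vectors(tokens, w2v):
    """tokens: word list; w2v: mapping word -> embedding vector.
    Assumes at least one token is in the Word2vec vocabulary."""
    return np.stack([w2v[t] for t in tokens if t in w2v])

def sentence_similarity(Vq, Vl):
    """Averaged pairwise cosine similarity between two sets of word embeddings, eq. (16)."""
    Vq = Vq / np.linalg.norm(Vq, axis=1, keepdims=True)
    Vl = Vl / np.linalg.norm(Vl, axis=1, keepdims=True)
    return float((Vq @ Vl.T).mean())

def retrieve(query_tokens, candidates, w2v, top_k=10):
    """candidates: list of (image_id, token_list) built from the dataset's generated
    visual description sentences; returns the top_k most similar images."""
    Vq = sentence_vectors(query_tokens, w2v)
    scored = [(img_id, sentence_similarity(Vq, sentence_vectors(toks, w2v)))
              for img_id, toks in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]
```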
After the similarity between the query image and all candidate images in the dataset has been calculated with the above formula, the candidate images are ranked according to their similarity scores, and the required number of target images with the highest similarity to the query image are retrieved.

References:
[1] Patel T, Gandhi S. A survey on context based similarity techniques for image retrieval[C]//2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA). IEEE, 2017: 219-223.
[2] Khawandi S, Abdallah F, Ismail A. A survey on Image Indexing and Retrieval based on Content Based Image[C]//2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon). IEEE, 2019: 222-225.
[3] Mahamuni C V, Wagh N B. Study of CBIR methods for retrieval of digital images based on colour and texture extraction[C]//2017 International Conference on Computer Communication and Informatics (ICCCI). IEEE, 2017: 1-7.
[4] Zhou W, Li H, Tian Q. Recent advance in content-based image retrieval: A literature survey[J]. arXiv preprint arXiv:1706.06064, 2017.
[5] Narayan R, Reddy S C, Narayan L, et al. The Study of Approaches of Content Based Image Retrieval[J]. 2019.
[6] Wei X, Qi Y, Liu J, et al. Image retrieval by dense caption reasoning[C]//2017 IEEE Visual Communications and Image Processing (VCIP). IEEE, 2017: 1-4.
[7] Hoxha G, Melgani F, Demir B. Retrieving Images with Generated Textual Descriptions[C]//IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2019: 5812-5815.
[8] Li X, Jiang S. Know more say less: Image captioning based on scene graphs[J]. IEEE Transactions on Multimedia, 2019, 21(8): 2117-2130.
[9] Yang X, Tang K, Zhang H, et al. Auto-encoding scene graphs for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 10685-10694.
[10] Cao C. Computational modeling and application research of the feedback mechanism in deep convolutional neural networks[D]. University of Science and Technology of China, 2018.
[11] Zellers R, Yatskar M, Thomson S, et al. Neural motifs: Scene graph parsing with global context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 5831-5840.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (4)

1. An image retrieval method based on visual descriptive sentences, characterized in that the method comprises the following steps:
constructing, based on a graph convolutional deep learning network, an information transfer mode for the nodes and edges of the visual knowledge graph representation, so as to aggregate and update the features of each semantic unit;
encoding the aggregated and updated semantic unit features in the graph with a multi-layer long short-term memory network combined with an attention mechanism, so as to generate an image description sentence;
under a reinforcement learning framework, designing reward-penalty functions based on the image description sentence from the CIDEr score and the graph similarity, which are used for feedback regulation and optimization of the "image-graph" process, the "graph-sentence" process, and the "image-graph-sentence" process; obtaining finer-grained visual description sentences of the image, performing retrieval with them, and outputting the target retrieval image corresponding to the query image.
2. The image retrieval method based on visual description sentences according to claim 1, wherein the reward-penalty functions are specifically:

R_i(ĉ) = ω_c · CIDEr(ĉ) + ω_s · s_i,   i = 1, 2, 3

wherein ω_c and ω_s are trainable fusion weights; ĉ denotes the predicted description sentence of the image and CIDEr(ĉ) is the CIDEr score of the predicted description sentence; s_i ∈ {s_1, s_2, s_3} are the 3 similarity scores characterizing graph similarity; R_1, R_2 and R_3 are the 3 reward-penalty functions, where R_1 is used for optimization of the "image-graph-sentence" process, R_2 for individual optimization of the "image-graph" process, and R_3 for individual optimization of the "graph-sentence" process.
3. An image retrieval method based on visual descriptive sentences according to claim 1 or 2, characterized in that the method further comprises:
taking the reward-penalty function as the reward mechanism and updating the network parameters with the policy gradient in reinforcement learning, the gradient of the loss function L_RL(θ) with respect to the parameter θ is calculated as follows:

∇_θ L_RL(θ) = -E_{ŷ~p_θ}[ R(ŷ) · ∇_θ log p_θ(ŷ) ]

wherein E_{ŷ~p_θ}[·] denotes the expectation of the reward objective function, R(ŷ) is the reward-penalty function defined above, and p_θ(ŷ) is the score (probability) assigned by the model to the sampled sentence ŷ.
4. The image retrieval method based on visual description sentences according to claim 1 or 2, characterized in that the method further comprises: detecting visual entities and visual relations from the input image, and constructing the visual knowledge graph representation corresponding to the image.
CN202010998165.8A 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences Pending CN112256904A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010998165.8A CN112256904A (en) 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010998165.8A CN112256904A (en) 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences

Publications (1)

Publication Number Publication Date
CN112256904A true CN112256904A (en) 2021-01-22

Family

ID=74231454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010998165.8A Pending CN112256904A (en) 2020-09-21 2020-09-21 Image retrieval method based on visual description sentences

Country Status (1)

Country Link
CN (1) CN112256904A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171283A (en) * 2017-12-31 2018-06-15 厦门大学 A kind of picture material automatic describing method based on structuring semantic embedding
CN111259724A (en) * 2018-11-30 2020-06-09 塔塔顾问服务有限公司 Method and system for extracting relevant information from image and computer program product
CN110674850A (en) * 2019-09-03 2020-01-10 武汉大学 Image description generation method based on attention mechanism
CN110991515A (en) * 2019-11-28 2020-04-10 广西师范大学 Image description method fusing visual context
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sun Xiaoling: "Research on Semantic Knowledge Extraction Methods for Domain-Specific Images", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989088A (en) * 2021-02-04 2021-06-18 西安交通大学 Visual relation example learning method based on reinforcement learning
CN112989088B (en) * 2021-02-04 2023-03-21 西安交通大学 Visual relation example learning method based on reinforcement learning
CN114020779A (en) * 2021-10-22 2022-02-08 上海卓辰信息科技有限公司 Self-adaptive optimization retrieval performance database and data query method
CN114020779B (en) * 2021-10-22 2022-07-22 上海卓辰信息科技有限公司 Self-adaptive optimization retrieval performance database and data query method
CN114677580A (en) * 2022-05-27 2022-06-28 中国科学技术大学 Image description method based on self-adaptive enhanced self-attention network
CN114677580B (en) * 2022-05-27 2022-09-30 中国科学技术大学 Image description method based on self-adaptive enhanced self-attention network
CN117648444A (en) * 2024-01-30 2024-03-05 广东省华南技术转移中心有限公司 Patent clustering method and system based on graph convolution attribute aggregation
CN117648444B (en) * 2024-01-30 2024-04-30 广东省华南技术转移中心有限公司 Patent clustering method and system based on graph convolution attribute aggregation

Similar Documents

Publication Publication Date Title
Wang et al. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN108038122B (en) Trademark image retrieval method
CN104899253B (en) Towards the society image across modality images-label degree of correlation learning method
CN106845411B (en) Video description generation method based on deep learning and probability map model
Yin et al. Region search based on hybrid convolutional neural network in optical remote sensing images
CN112256904A (en) Image retrieval method based on visual description sentences
CN111881677A (en) Address matching algorithm based on deep learning model
CN111291556A (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN109271539A (en) A kind of image automatic annotation method and device based on deep learning
CN114398491A (en) Semantic segmentation image entity relation reasoning method based on knowledge graph
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN108170823B (en) Hand-drawn interactive three-dimensional model retrieval method based on high-level semantic attribute understanding
Zhang et al. Hierarchical scene parsing by weakly supervised learning with image descriptions
CN114461890A (en) Hierarchical multi-modal intellectual property search engine method and system
CN114612767A (en) Scene graph-based image understanding and expressing method, system and storage medium
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
Barman et al. A graph-based approach for making consensus-based decisions in image search and person re-identification
Astolfi et al. Syntactic pattern recognition in computer vision: A systematic review
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Li et al. Caption generation from road images for traffic scene modeling
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN114969343B (en) Weak supervision text classification method combined with relative position information
Tian et al. Scene graph generation by multi-level semantic tasks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210122