CN115952306A - Image content retrieval method based on scene graph - Google Patents


Info

Publication number
CN115952306A
Authority
CN
China
Prior art keywords
graph
scene graph
vector
network
embedding
Prior art date
Legal status
Pending
Application number
CN202211550485.2A
Other languages
Chinese (zh)
Inventor
张智
李金星
王立鹏
尚晓兵
孙杰
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202211550485.2A
Publication of CN115952306A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image content retrieval method based on a scene graph. The scene-graph-based image retrieval method starts from the content of the whole image and focuses more on the visual relationships between targets. The visual relationships are composed in graph form, so the image is represented by a structured description; the retrieval task is completed using this structured form, the accuracy of image content retrieval is improved by the more detailed description, and the demand for fine-grained image retrieval tasks is met. The invention is not limited to retrieving the names and characteristics of single or multiple targets, but also retrieves the interaction relationships among multiple targets.

Description

Image content retrieval method based on scene graph
Technical Field
The invention belongs to the field of target detection and natural language processing, and particularly relates to an image content retrieval method based on a scene graph.
Background
With the explosive growth of information, a great deal of data of different modalities, such as text, images, audio and video, has appeared on the internet. Data of different modalities may describe the same subject, for example pictures and words on websites, or subtitles, audio and video in movies; such data is called multi-modal data. Traditional single-modal retrieval can no longer meet people's growing requirements, and cross-modal retrieval has become a popular research direction.
The key to the cross-modal image-text retrieval problem lies in how to bridge the semantic gap between heterogeneous data in order to measure the similarity between different modalities. The basic solution is to first map the input data of different modalities into a common subspace, then measure the similarity between them in that subspace, and output the final retrieval results after ranking by similarity. Early image-text retrieval mostly used traditional machine learning methods, including subspace learning and topic models; however, these methods are not end-to-end and have weak feature expression capability. Because deep learning has strong feature representation capability, can learn deep semantic features, and its end-to-end nature lets a machine perform feature extraction and selection automatically, more and more research integrates deep learning into the field of image-text retrieval. These methods, however, do not take fine-grained associations between images and text into account. In particular, the image-text matching task requires that only correctly matched samples be retrieved, while a large number of similar samples exist in a real retrieval library, so accurately identifying the correct samples among similar ones is a very challenging task. Thanks to advances in object detection and natural language processing, an image can be represented as a series of region features and a text as a series of word features, so the association between image and text can be established at the region-word level. The key problem of fine-grained image-text retrieval is how to extract suitable image regions and phrases or words and how to measure the correlation between them. In real datasets, interference from the image background, partial occlusion of objects, errors in manual text annotation, and incomplete annotation all increase the difficulty of fine-grained image-text retrieval.
Disclosure of Invention
The invention aims to provide an image content retrieval method based on a scene graph.
The purpose of the invention is realized by the following technical scheme:
a scene graph-based image content retrieval method specifically comprises the following steps:
Step one: scene graph embedding;
processing an input scene graph with a graph convolutional neural network to generate an embedding corresponding to each target node in the graph; extracting deep visual features of the image with a feature extraction network to obtain an overall feature map of the image; selecting a triple region according to the <subject, relation, object> to be detected and inferred; and processing the regional features of the targets to obtain predicted targets and predicted relations;
Step two: layout prediction model;
single-object embeddings are used as the input of the next stage of the network model, and the output of the second stage of the prediction model serves as a scene layout with object localization; the object embeddings are used to form a series of triple (<subject, predicate, object>) embedding vectors; each triple embedding vector is passed through a triple mask prediction network that marks whether a target object is the subject or the object, in order to mark the subject-predicate relations among multiple targets; the triple embedding vector is also passed through a triple regression network, in which the network is trained to locate the joint bounding box of the subject and the object; this box is defined as the bounding box enclosing the subject and the object;
Step three: target matching;
to query the database, the object embeddings learned from the scene graph are used to form a structured query; structured queries of various forms can express the same visual semantics; a database of 3100 visual relationships is extracted from the annotated test scene graphs of the COCO-Stuff dataset; the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S; and the top N pictures that meet the retrieval requirement are displayed for screening.
Further, the input scene graph in step 1 specifically includes:
the input scene graph describes multiple relations between the targets in the graph; given a set of object class sets C and a set of relationship sets R, a scene graph is a (O, E) tuple, O = { O = 1 ,...,o n Is a set of categories, o i ∈C,
Figure BDA0003981830350000021
Is a set of directional edges, the relationship form between a scene graph can be expressed as a set (o) i ,r,o j ),o i ,o j ∈O,r∈R。
Further, the scene graph embedding described in step 1 specifically includes:
For all objects o_i ∈ O and all edges (o_i, r, o_j) ∈ E, given input vectors v_i, v_r ∈ ℝ^(D_in), three graph convolution functions g_s, g_p and g_o are used to compute output vectors v'_i, v'_r ∈ ℝ^(D_out) for all nodes and edges. They take the triple of vectors (v_i, v_r, v_j) of one edge as input and output new vectors for the subject o_i, the predicate r and the object o_j, respectively.
The new edge vector is v'_r = g_p(v_i, v_r, v_j). The output vector v'_i of an object o_i should depend on the vectors v_j of all objects connected to o_i by graph edges and on the vectors v_r of those edges. Therefore, for each edge starting at o_i, g_s is used to compute a candidate vector, and all such candidate vectors are collected in the set V_i^s; likewise, g_o is used to compute the set V_i^o of candidate vectors for all edges terminating at o_i:
V_i^s = { g_s(v_i, v_r, v_j) : (o_i, r, o_j) ∈ E }
V_i^o = { g_o(v_j, v_r, v_i) : (o_j, r, o_i) ∈ E }
The output vector v'_i of object o_i is then computed as v'_i = h(V_i^s ∪ V_i^o), where h is a symmetric function that pools an input set of vectors into a single output vector.
Further, the layout prediction model in the second step is specifically:
object layout network accepts an embedded vector v i Shape D object o i And passes it to a mask regression network to predict a soft binary mask of shape M
Figure BDA0003981830350000031
A boundary regression network predicts the position of the boundary regression box
Figure BDA0003981830350000032
To embed vector v i And mask>
Figure BDA0003981830350000033
Multiplication results in mask embedding of shape D × M, which is then warped to the position of the bounding box using bilinear interpolation, giving the object layout, which is the sum of all the object layouts above.
Furthermore, the mask regression network is composed of several transposed convolutions, a nonlinear sigmoid activation function constrains the elements of the mask to the range (0, 1), and the box regression network is a multi-layer perceptron (MLP).
Further, in step three the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S, where the similarity is computed from d, the L2 distance between the query q and the retrieval result r_k at rank k.
The invention has the beneficial effects that:
the invention provides a content-based image retrieval method for retrieving a picture set conforming to description in a structured mode to solve the problem of fine-grained image retrieval. The image retrieval method based on the scene graph starts from the content of the whole image, and focuses more on the visual relationship between the targets. The visual relation is formed in a graph form (such as < man, play, football >), the image is displayed in a structural description form, the retrieval task is completed by utilizing the complex structural form, the accuracy of image content retrieval is improved through the more detailed description, and the requirement of people on fine-grained image retrieval task is also met.
The content-based image retrieval method of the invention retrieves the set of pictures matching the description through a structured query. The method is not limited to retrieving the names and characteristics of single or multiple targets; it also retrieves the interaction relationships among multiple targets. Image retrieval based on this kind of image description is therefore more practical and more challenging. The scene-graph-based image retrieval method starts from the overall content of the image, represents the image as a structured description, completes the retrieval task using this structured form, improves the accuracy of image content retrieval through the more detailed description, and meets the demand for fine-grained image retrieval. The invention provides a scene-graph-embedding-based method for image retrieval: visual relationships are extracted from the scene graph to form a structured description, where each visual relationship is a directed subgraph of the scene graph in which the subject and the object are nodes connected by a predicate relationship. The retrieval method performs well on long-tailed datasets. On the COCO-Stuff dataset, which has a long-tailed distribution, retrieval performance remains good even for medium- and low-frequency target objects. Importantly, exact matches can still be obtained even when the predicate is omitted, and even incorrect results usually have the correct predicate and are semantically similar.
Drawings
FIG. 1 is a sub-scenario diagram query graph of the present invention;
FIG. 2 is a network architecture diagram of the present invention;
FIG. 3 is a tuple mask diagram of the present invention;
FIG. 4 is a single graph convolution calculation of the present invention;
FIG. 5 is a scene layout flow diagram of the present invention;
fig. 6 is a diagram showing the image search result of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The present invention solves the image retrieval problem using scene graph embeddings learned from a scene layout prediction model. A scene graph is a structured data format for encoding semantic relationships between objects: the nodes of the graph represent the objects in the image, and the edges between nodes represent the relationships between them. Subgraphs are learned from the scene graph to represent the visual relationships in the image, and these subgraphs form a structured query database that is matched against the scene graphs of the massive image collection in the database, as shown in fig. 1. Each subgraph consists of a triple comprising the subject, the object and the visual relationship between them.
The specific implementation process of the invention is as follows:
1. Scene graph input
The model input is a scene graph, and the input scene graph describes the multiple relationships between the objects in the graph. Given a set of object categories C and a set of relationships R, a scene graph is a tuple (O, E), where O = {o_1, ..., o_n} is a set of objects with o_i ∈ C, and E ⊆ O × R × O is a set of directed edges; each relationship in the scene graph can be expressed as a triple (o_i, r, o_j) with o_i, o_j ∈ O and r ∈ R.
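As a non-limiting illustration, the minimal Python sketch below shows one way such a scene graph (O, E) could be stored in memory as category indices plus directed <subject, relation, object> triples; the class name SceneGraph and the fields objects and triples are illustrative assumptions and do not appear in the original description.

```python
# Hypothetical container for a scene graph (O, E): objects are category indices into C,
# and each directed edge is a (subject_index, relation_index, object_index) triple.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneGraph:
    objects: List[int]                   # o_i, given as indices into the object category set C
    triples: List[Tuple[int, int, int]]  # (i, r, j): directed edge o_i --r--> o_j, r indexes R

# Example: the relationship <man, play, football>, with assumed category ids
# man = 3, football = 17 and assumed predicate id play = 5.
sg = SceneGraph(objects=[3, 17], triples=[(0, 5, 1)])
```

Multiple such triples describe one image, and a query subgraph takes exactly the same form.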
2. Scene graph embedding
The scene graph can describe the image content in a concise and structured manner; it not only encodes the semantic and spatial information of a single object in the scene, but also represents the relationship between each pair of objects. In the scene graph, every node and edge is annotated with a vector of dimension D_in. The scene graph is the input of a graph convolutional neural network, which uniformly associates and compresses the information in the scene graph to produce a scene graph annotated with D_out-dimensional vectors. Each output vector is a function of its corresponding input neighborhood, so each graph convolution layer propagates information along the edges of the graph. A graph convolution layer applies the same function to all edges of the graph, allowing a single layer to operate on graphs of arbitrary shape. The graph convolutional neural network is a 5-layer perceptron model whose input and output layers are 128-dimensional and whose middle layer is 512-dimensional. The graph convolution network uniformly associates and compresses the information in the scene graph and converts the graph information into vector information, so that multiple scene graphs can be related on the same scale.
In particular, for all objects o_i ∈ O and all edges (o_i, r, o_j) ∈ E, given input vectors v_i, v_r ∈ ℝ^(D_in), three graph convolution functions g_s, g_p and g_o are used to compute output vectors v'_i, v'_r ∈ ℝ^(D_out) for all nodes and edges. They take the triple of vectors (v_i, v_r, v_j) of one edge as input and output new vectors for the subject o_i, the predicate r and the object o_j, respectively.
To compute the output vector of an edge, let v'_r = g_p(v_i, v_r, v_j). Updating the object vectors is more difficult because one object may participate in multiple relationships: the output vector v'_i of an object o_i should depend on the vectors v_j of all objects connected to o_i by graph edges and on the vectors v_r of those edges. For this reason, for each edge starting at o_i, g_s is used to compute a candidate vector, and all such candidate vectors are collected in the set V_i^s; likewise, g_o is used to compute the set V_i^o of candidate vectors for all edges terminating at o_i:
V_i^s = { g_s(v_i, v_r, v_j) : (o_i, r, o_j) ∈ E }
V_i^o = { g_o(v_j, v_r, v_i) : (o_j, r, o_i) ∈ E }
The output vector v'_i of object o_i is then computed as v'_i = h(V_i^s ∪ V_i^o), where h is a symmetric function that pools an input set of vectors into a single output vector. An example of the computation graph of a single graph convolution layer is shown in FIG. 4.
In implementation, the functions g_s, g_p and g_o are realized with a single network that concatenates its three input vectors, feeds them to a multi-layer perceptron (MLP), and computes the three output vectors with fully connected output heads. The pooling function h averages its input vectors and feeds the result to an MLP.
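A sketch of a single graph convolution layer following this description is given below; it assumes PyTorch, the layer sizes stated above (128-dimensional input and output vectors, a 512-dimensional hidden dimension), and average pooling for h followed by an MLP. The class and variable names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn

class GraphTripleConv(nn.Module):
    """One graph convolution layer: g_s, g_p, g_o share a single MLP with three output heads."""
    def __init__(self, d_in=128, d_hidden=512, d_out=128):
        super().__init__()
        self.d_hidden, self.d_out = d_hidden, d_out
        # single network applied to the concatenated triple (v_i, v_r, v_j)
        self.net = nn.Sequential(
            nn.Linear(3 * d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 2 * d_hidden + d_out))
        # pooling function h: average the candidate vectors, then apply an MLP
        self.pool_net = nn.Sequential(
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out))

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (O, d_in); pred_vecs: (T, d_in); edges: (T, 2) long tensor of (subject, object) indices
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        out = self.net(triples)
        cand_s = out[:, :self.d_hidden]                               # g_s candidates (outgoing edges)
        new_pred = out[:, self.d_hidden:self.d_hidden + self.d_out]  # g_p: new edge vectors v'_r
        cand_o = out[:, self.d_hidden + self.d_out:]                  # g_o candidates (incoming edges)

        # h: collect V_i^s and V_i^o per object and average them
        pooled = obj_vecs.new_zeros(obj_vecs.size(0), self.d_hidden)
        pooled = pooled.index_add(0, s_idx, cand_s).index_add(0, o_idx, cand_o)
        counts = obj_vecs.new_zeros(obj_vecs.size(0), 1)
        ones = pred_vecs.new_ones(pred_vecs.size(0), 1)
        counts = counts.index_add(0, s_idx, ones).index_add(0, o_idx, ones)
        new_obj = self.pool_net(pooled / counts.clamp(min=1))
        return new_obj, new_pred
```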
3. Layout prediction
Single-object embeddings are used as the input of the next stage of the network model. The output of the second stage of the prediction model serves as a scene layout with object localization. The object embeddings are then used to form a series of triple (<subject, predicate, object>) embedding vectors. Each triple embedding vector is passed through a triple mask prediction network that marks whether a target object is the subject or the object, in order to mark the subject-predicate relations among multiple targets. The triple embedding vector is also passed through a triple regression network, in which the network is trained to locate the joint bounding box of the subject and the object; this box is defined as the bounding box enclosing the subject and the object. As shown in fig. 2.
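The two triple-level heads just described could be sketched as follows. This is an assumed reading of the text in which the triple embedding is the concatenation of the subject, predicate and object embeddings, the mask prediction head outputs subject/object role scores, and the regression head outputs the joint bounding box; the layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleHeads(nn.Module):
    """Triple mask prediction network and triple regression network over <subject, predicate, object>."""
    def __init__(self, d=128, d_hidden=256):
        super().__init__()
        # marks whether a target object in the triple is the subject or the object
        self.mask_net = nn.Sequential(nn.Linear(3 * d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 2))
        # locates the joint bounding box (x0, y0, x1, y1) enclosing subject and object
        self.box_net = nn.Sequential(nn.Linear(3 * d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 4))

    def forward(self, v_subj, v_pred, v_obj):
        t = torch.cat([v_subj, v_pred, v_obj], dim=-1)  # the triple embedding vector
        return self.mask_net(t), self.box_net(t)
```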
The present invention processes an input scene graph using a series of graph convolutions, providing each object with an embedded vector that aggregates the information of all objects and relationships in the graph.
To generate an image layout, the representation must be moved from the graph domain to the image domain. To this end, a scene layout is computed from the object embedding vectors, giving a coarse two-dimensional structure of the image; the scene layout is computed by predicting a segmentation mask and a bounding box for each object with an object layout network. As shown in fig. 5.
The object layout network accepts the embedding vector v_i of shape D for object o_i and passes it to a mask regression network, which predicts a soft binary mask m̂_i of shape M, and to a box regression network, which predicts the position of the bounding box b̂_i. The mask regression network consists of several transposed convolutions, and a nonlinear sigmoid activation function constrains the elements of the mask to the range (0, 1). The box regression network is an MLP. As shown in fig. 3.
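A sketch of these two per-object heads is given below, under stated assumptions: PyTorch, D = 128, a mask resolution of M = 16, and a particular arrangement of transposed-convolution stages; the names are illustrative.

```python
import torch
import torch.nn as nn

class ObjectLayoutHeads(nn.Module):
    """Mask regression network (transposed convolutions + sigmoid) and box regression network (MLP)."""
    def __init__(self, d=128, m=16):
        super().__init__()
        layers, ch, size = [], d, 1
        while size < m:  # repeatedly double spatial resolution from 1x1 up to M x M
            layers += [nn.ConvTranspose2d(ch, max(ch // 2, 16), 4, stride=2, padding=1), nn.ReLU()]
            ch, size = max(ch // 2, 16), size * 2
        layers += [nn.Conv2d(ch, 1, kernel_size=1), nn.Sigmoid()]  # mask elements constrained to (0, 1)
        self.mask_net = nn.Sequential(*layers)
        # box regression MLP: predicts (x0, y0, x1, y1) in normalized image coordinates
        self.box_net = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 4))

    def forward(self, obj_vecs):
        # obj_vecs: (O, D) object embedding vectors produced by the graph convolution network
        masks = self.mask_net(obj_vecs.view(obj_vecs.size(0), -1, 1, 1)).squeeze(1)  # (O, M, M)
        boxes = self.box_net(obj_vecs)                                               # (O, 4)
        return masks, boxes
```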
The embedding vector v_i is multiplied by the mask m̂_i to obtain a mask embedding of shape D × M, which is then warped to the position of the bounding box using bilinear interpolation, giving the object layout. Finally, the scene layout is the sum of all these object layouts.
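The warping and summation step could look like the following sketch, which assumes boxes given in normalized [0, 1] coordinates and uses PyTorch's affine_grid/grid_sample to place each object's masked embedding inside its predicted box via bilinear interpolation; the function name and the output resolution are assumptions.

```python
import torch
import torch.nn.functional as F

def compose_layout(obj_vecs, masks, boxes, H=64, W=64):
    """Warp each object's masked embedding into its bounding box and sum into one scene layout."""
    # obj_vecs: (O, D); masks: (O, M, M); boxes: (O, 4) as (x0, y0, x1, y1) in [0, 1]
    n_obj, d = obj_vecs.shape
    mask_emb = obj_vecs.view(n_obj, d, 1, 1) * masks.unsqueeze(1)    # (O, D, M, M) mask embeddings

    x0, y0, x1, y1 = boxes.unbind(dim=1)
    w = (x1 - x0).clamp(min=1e-3)
    h = (y1 - y0).clamp(min=1e-3)
    # inverse affine map so that sampling an H x W canvas reads each patch only inside its box
    theta = torch.zeros(n_obj, 2, 3, device=obj_vecs.device, dtype=obj_vecs.dtype)
    theta[:, 0, 0] = 1.0 / w
    theta[:, 1, 1] = 1.0 / h
    theta[:, 0, 2] = -(x0 + x1 - 1.0) / w
    theta[:, 1, 2] = -(y0 + y1 - 1.0) / h

    grid = F.affine_grid(theta, size=(n_obj, d, H, W), align_corners=False)
    per_object = F.grid_sample(mask_emb, grid, align_corners=False)  # bilinear warp into each box
    return per_object.sum(dim=0, keepdim=True)                       # (1, D, H, W) scene layout
```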
4. Result matching
Structured queries of various forms can express the same visual semantics. The query database consists of 3100 visual relationships extracted from the annotated test scene graphs of the COCO-Stuff dataset. The retrieved images are ranked according to their respective embedding-space representations using a similarity measure S, where the similarity is computed from d, the L2 distance between the query q and the retrieval result r_k at rank k. The results are shown in FIG. 6.
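The ranking itself reduces to sorting by the L2 distance d; a minimal sketch is given below, showing only the distance-based ordering of the top-N results. The function and parameter names are assumptions.

```python
import torch

def rank_images(query_emb, db_embs, top_n=10):
    """Rank database images by L2 distance between embedding-space representations."""
    # query_emb: (D,) embedding of the structured <subject, predicate, object> query
    # db_embs:   (K, D) scene-graph embeddings of the images in the retrieval database
    d = torch.cdist(query_emb.unsqueeze(0), db_embs).squeeze(0)  # d(q, r_k) for every image
    return torch.argsort(d)[:top_n]                              # indices of the top-N retrieved images
```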
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A scene graph-based image content retrieval method is characterized in that: the method comprises the following specific steps:
Step one: scene graph embedding;
processing an input scene graph with a graph convolutional neural network to generate an embedding corresponding to each target node in the graph; extracting deep visual features of the image with a feature extraction network to obtain an overall feature map of the image; selecting a triple region according to the <subject, relation, object> to be detected and inferred; and processing the regional features of the targets to obtain predicted targets and predicted relations;
Step two: layout prediction model;
single-object embeddings are used as the input of the next stage of the network model, and the output of the second stage of the prediction model serves as a scene layout with object localization; the object embeddings are used to form a series of triple (<subject, predicate, object>) embedding vectors; each triple embedding vector is passed through a triple mask prediction network that marks whether a target object is the subject or the object, in order to mark the subject-predicate relations among multiple targets; the triple embedding vector is also passed through a triple regression network, in which the network is trained to locate the joint bounding box of the subject and the object; this box is defined as the bounding box enclosing the subject and the object;
Step three: target matching;
to query the database, the object embeddings learned from the scene graph are used to form a structured query; structured queries of various forms can express the same visual semantics; a database of 3100 visual relationships is extracted from the annotated test scene graphs of the COCO-Stuff dataset; the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S; and the top N pictures that meet the retrieval requirement are displayed for screening.
2. The method for retrieving image content based on scene graph as claimed in claim 1, wherein: the input scene graph in step 1 is specifically:
the input scene graph describes multiple relations between the targets in the graph; given a set of object class sets C and a set of relationship sets R, a scene graph is a (O, E) tuple, O = { O = 1 ,...,o n Is a set of categories, o i ∈C,
Figure FDA0003981830340000013
Is a set of directed edges, the relationship form between a scene graph can be expressed as a set (o) i ,r,o j ),o i ,o j ∈O,r∈R。
3. The method for retrieving image content based on scene graph as claimed in claim 1 or 2, wherein: the scene graph embedding in the step 1 specifically comprises:
For all objects o_i ∈ O and all edges (o_i, r, o_j) ∈ E, given input vectors v_i, v_r ∈ ℝ^(D_in), three graph convolution functions g_s, g_p and g_o are used to compute output vectors v'_i, v'_r ∈ ℝ^(D_out) for all nodes and edges; they take the triple of vectors (v_i, v_r, v_j) of one edge as input and output new vectors for the subject o_i, the predicate r and the object o_j, respectively;
the new edge vector is v'_r = g_p(v_i, v_r, v_j); the output vector v'_i of an object o_i should depend on the vectors v_j of all objects connected to o_i by graph edges and on the vectors v_r of those edges; for each edge starting at o_i, g_s is used to compute a candidate vector, all such candidate vectors being collected in the set V_i^s, and likewise g_o is used to compute the set V_i^o of candidate vectors for all edges terminating at o_i:
V_i^s = { g_s(v_i, v_r, v_j) : (o_i, r, o_j) ∈ E }
V_i^o = { g_o(v_j, v_r, v_i) : (o_j, r, o_i) ∈ E }
the output vector v'_i of object o_i is computed as v'_i = h(V_i^s ∪ V_i^o), where h is a symmetric function that pools an input set of vectors into a single output vector.
4. The method of claim 1, wherein the method comprises: the layout prediction model in the second step is specifically as follows:
object layout network accepts an embedded vector v i Shape D object o i And passes it to a mask regression network to predict a soft binary mask of shape M
Figure FDA0003981830340000021
A boundary regression network predicts the position of the boundary regression box
Figure FDA0003981830340000022
To embed vector v i And mask>
Figure FDA0003981830340000023
Multiplication results in mask embedding of shape D × M, which is then warped to the position of the bounding box using bilinear interpolation, giving the object layout, which is the sum of all the object layouts above.
5. The method of claim 4, wherein the method comprises: the mask regression network is composed of several transposed convolutions, a nonlinear sigmoid activation function constrains the elements of the mask to the range (0, 1), and the box regression network is a multi-layer perceptron (MLP).
6. The method of claim 1, wherein the method comprises: in step three, the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S, where the similarity is computed from d, the L2 distance between the query q and the retrieval result r_k at rank k.
CN202211550485.2A 2022-12-05 2022-12-05 Image content retrieval method based on scene graph Pending CN115952306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211550485.2A CN115952306A (en) 2022-12-05 2022-12-05 Image content retrieval method based on scene graph

Publications (1)

Publication Number Publication Date
CN115952306A true CN115952306A (en) 2023-04-11

Family

ID=87285186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211550485.2A Pending CN115952306A (en) 2022-12-05 2022-12-05 Image content retrieval method based on scene graph

Country Status (1)

Country Link
CN (1) CN115952306A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination