CN115952306A - Image content retrieval method based on scene graph - Google Patents


Info

Publication number
CN115952306A
Authority
CN
China
Prior art keywords
graph
scene graph
vector
network
embedding
Prior art date
Legal status
Pending
Application number
CN202211550485.2A
Other languages
Chinese (zh)
Inventor
张智
李金星
王立鹏
尚晓兵
孙杰
Current Assignee
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202211550485.2A
Publication of CN115952306A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an image content retrieval method based on a scene graph. The scene-graph-based image retrieval method starts from the content of the whole image and focuses more on the visual relationships between targets. The visual relationships are composed in graph form, so the image is represented by a structured description; the retrieval task is completed using this structured form, the accuracy of image content retrieval is improved by the more detailed description, and the demand for fine-grained image retrieval tasks is met. The invention is not limited to retrieving the names and characteristics of single or multiple targets, but also retrieves the interaction relationships among multiple targets.

Description

Image content retrieval method based on scene graph
Technical Field
The invention belongs to the field of target detection and natural language processing, and particularly relates to an image content retrieval method based on a scene graph.
Background
With the explosive growth of information, a great deal of data of different modalities, such as text, images, audio and video, has appeared on the internet. Data of different modalities may describe the same subject, for example pictures and words on websites, or subtitles, audio and video in movies; such data is called multi-modal data. Traditional single-modal retrieval can no longer meet people's growing requirements, and cross-modal retrieval has become a popular research direction.
The key to the cross-modal image-text retrieval problem lies in how to bridge the semantic gap between heterogeneous data in order to measure the similarity between different modalities. The basic solution is to first map the input data of different modalities into a common subspace, then measure the similarity between them in that subspace, and output the final retrieval results after ranking by similarity. Early image-text retrieval mostly used traditional machine learning methods, including subspace learning and topic models; however, these methods are not end-to-end and have weak feature expression capability. Because deep learning has strong feature representation capability, can learn deep semantic features, and its end-to-end nature lets a machine perform feature extraction and selection automatically, more and more research integrates deep learning into the field of image-text retrieval. These methods, however, do not take fine-grained associations between images and text into account. In particular, the image-text matching task requires that only correctly matched samples be retrieved, while a large number of similar samples exist in a real retrieval library, so accurately identifying the correct samples among similar ones is a very challenging task. Thanks to advances in object detection and natural language processing, an image can be represented as a series of region features and a text as a series of word features, so the association between image and text can be established at the region-word level. The key problem of fine-grained image-text retrieval is how to extract suitable image regions and phrases or words and how to measure the correlation between them. In real datasets, interference from the image background, partial occlusion of objects, errors in manual text annotation, and incomplete annotation all increase the difficulty of fine-grained image-text retrieval.
Disclosure of Invention
The invention aims to provide an image content retrieval method based on a scene graph.
The purpose of the invention is realized by the following technical scheme:
a scene graph-based image content retrieval method specifically comprises the following steps:
Step one: scene graph embedding;
processing an input scene graph with a graph convolutional neural network to generate an embedding corresponding to each target node in the graph; extracting deep visual features of the image with a feature extraction network to obtain an overall feature map of the image; selecting a triple region according to the <subject, relation, object> to be detected and inferred; and processing the regional features of the targets to obtain predicted targets and predicted relations;
Step two: layout prediction model;
single-object embeddings are used as the input of the next stage of the network model, and the output of the second stage of the prediction model serves as a scene layout with object localization; the object embeddings are used to form a series of triple (<subject, predicate, object>) embedding vectors; each triple embedding vector is passed through a triple mask prediction network that marks whether a target object is the subject or the object, in order to mark the subject-predicate relations among multiple targets; the triple embedding vector is also passed through a triple regression network, in which the network is trained to locate the joint bounding box of the subject and the object; this box is defined as the bounding box enclosing the subject and the object;
Step three: target matching;
to query the database, the object embeddings learned from the scene graph are used to form a structured query; structured queries of various forms can express the same visual semantics; a database of 3100 visual relationships is extracted from the annotated test scene graphs of the COCO-Stuff dataset; the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S; and the top N pictures that meet the retrieval requirement are displayed for screening.
Further, the input scene graph in step 1 specifically includes:
the input scene graph describes multiple relations between the targets in the graph; given a set of object class sets C and a set of relationship sets R, a scene graph is a (O, E) tuple, O = { O = 1 ,...,o n Is a set of categories, o i ∈C,
Figure BDA0003981830350000021
Is a set of directional edges, the relationship form between a scene graph can be expressed as a set (o) i ,r,o j ),o i ,o j ∈O,r∈R。
Further, the scene graph embedding described in step 1 specifically includes:
For all objects o_i ∈ O and all edges (o_i, r, o_j) ∈ E, given input vectors v_i, v_r ∈ ℝ^(D_in), three graph convolution functions g_s, g_p and g_o are used to compute output vectors v'_i, v'_r ∈ ℝ^(D_out) for all nodes and edges. They take the triple of vectors (v_i, v_r, v_j) of one edge as input and output new vectors for the subject o_i, the predicate r and the object o_j, respectively.
The new edge vector is v'_r = g_p(v_i, v_r, v_j). The output vector v'_i of an object o_i should depend on the vectors v_j of all objects connected to o_i by graph edges and on the vectors v_r of those edges. Therefore, for each edge starting at o_i, g_s is used to compute a candidate vector, and all such candidate vectors are collected in the set V_i^s; likewise, g_o is used to compute the set V_i^o of candidate vectors for all edges terminating at o_i:
V_i^s = { g_s(v_i, v_r, v_j) : (o_i, r, o_j) ∈ E }
V_i^o = { g_o(v_j, v_r, v_i) : (o_j, r, o_i) ∈ E }
The output vector v'_i of object o_i is then computed as v'_i = h(V_i^s ∪ V_i^o), where h is a symmetric function that pools an input set of vectors into a single output vector.
Further, the layout prediction model in the second step is specifically:
object layout network accepts an embedded vector v i Shape D object o i And passes it to a mask regression network to predict a soft binary mask of shape M
Figure BDA0003981830350000031
A boundary regression network predicts the position of the boundary regression box
Figure BDA0003981830350000032
To embed vector v i And mask>
Figure BDA0003981830350000033
Multiplication results in mask embedding of shape D × M, which is then warped to the position of the bounding box using bilinear interpolation, giving the object layout, which is the sum of all the object layouts above.
Furthermore, the mask regression network is composed of several transposed convolutions, a nonlinear sigmoid activation function constrains the elements of the mask to the range (0, 1), and the box regression network is a multi-layer perceptron (MLP).
Further, in step three the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S, where the similarity is computed from d, the L2 distance between the query q and the retrieval result r_k at rank k.
The invention has the beneficial effects that:
the invention provides a content-based image retrieval method for retrieving a picture set conforming to description in a structured mode to solve the problem of fine-grained image retrieval. The image retrieval method based on the scene graph starts from the content of the whole image, and focuses more on the visual relationship between the targets. The visual relation is formed in a graph form (such as < man, play, football >), the image is displayed in a structural description form, the retrieval task is completed by utilizing the complex structural form, the accuracy of image content retrieval is improved through the more detailed description, and the requirement of people on fine-grained image retrieval task is also met.
The content-based image retrieval method of the invention retrieves the set of pictures matching the description through a structured query. The method is not limited to retrieving the names and characteristics of single or multiple targets; it also retrieves the interaction relationships among multiple targets. Image retrieval based on this kind of image description is therefore more practical and more challenging. The scene-graph-based image retrieval method starts from the overall content of the image, represents the image as a structured description, completes the retrieval task using this structured form, improves the accuracy of image content retrieval through the more detailed description, and meets the demand for fine-grained image retrieval. The invention provides a scene-graph-embedding-based method for image retrieval: visual relationships are extracted from the scene graph to form a structured description, where each visual relationship is a directed subgraph of the scene graph in which the subject and the object are nodes connected by a predicate relationship. The retrieval method performs well on long-tailed datasets. On the COCO-Stuff dataset, which has a long-tailed distribution, retrieval performance remains good even for medium- and low-frequency target objects. Importantly, exact matches can still be obtained even when the predicate is omitted, and even incorrect results usually have the correct predicate and are semantically similar.
Drawings
FIG. 1 is a sub-scenario diagram query graph of the present invention;
FIG. 2 is a network architecture diagram of the present invention;
FIG. 3 is a tuple mask diagram of the present invention;
FIG. 4 is a single graph convolution calculation of the present invention;
FIG. 5 is a scene layout flow diagram of the present invention;
fig. 6 is a diagram showing the image search result of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The present invention solves the image retrieval problem using scene graph embeddings learned from a scene layout prediction model. A scene graph is a structured data format for encoding semantic relationships between objects: the nodes of the graph represent the objects in the image, and the edges between nodes represent the relationships between them. Subgraphs are learned from the scene graph to represent the visual relationships in the image, and these subgraphs form a structured query database that is matched against the scene graphs of the massive image collection in the database, as shown in fig. 1. Each subgraph consists of a triple comprising the subject, the object and the visual relationship between them.
The specific implementation process of the invention is as follows:
1. Scene graph input
The model input is a scene graph, and the input scene graph describes the multiple relationships between the objects in the graph. Given a set of object categories C and a set of relationships R, a scene graph is a tuple (O, E), where O = {o_1, ..., o_n} is a set of objects with o_i ∈ C, and E ⊆ O × R × O is a set of directed edges; each relationship in the scene graph can be expressed as a triple (o_i, r, o_j) with o_i, o_j ∈ O and r ∈ R.
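As a non-limiting illustration, the minimal Python sketch below shows one way such a scene graph (O, E) could be stored in memory as category indices plus directed <subject, relation, object> triples; the class name SceneGraph and the fields objects and triples are illustrative assumptions and do not appear in the original description.

```python
# Hypothetical container for a scene graph (O, E): objects are category indices into C,
# and each directed edge is a (subject_index, relation_index, object_index) triple.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneGraph:
    objects: List[int]                   # o_i, given as indices into the object category set C
    triples: List[Tuple[int, int, int]]  # (i, r, j): directed edge o_i --r--> o_j, r indexes R

# Example: the relationship <man, play, football>, with assumed category ids
# man = 3, football = 17 and assumed predicate id play = 5.
sg = SceneGraph(objects=[3, 17], triples=[(0, 5, 1)])
```

Multiple such triples describe one image, and a query subgraph takes exactly the same form.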
2. Scene graph embedding
The scene graph can describe the image content in a concise and structured manner; it not only encodes the semantic and spatial information of a single object in the scene, but also represents the relationship between each pair of objects. In the scene graph, every node and edge is annotated with a vector of dimension D_in. The scene graph is the input of a graph convolutional neural network, which uniformly associates and compresses the information in the scene graph to produce a scene graph annotated with D_out-dimensional vectors. Each output vector is a function of its corresponding input neighborhood, so each graph convolution layer propagates information along the edges of the graph. A graph convolution layer applies the same function to all edges of the graph, allowing a single layer to operate on graphs of arbitrary shape. The graph convolutional neural network is a 5-layer perceptron model whose input and output layers are 128-dimensional and whose middle layer is 512-dimensional. The graph convolution network uniformly associates and compresses the information in the scene graph and converts the graph information into vector information, so that multiple scene graphs can be related on the same scale.
In particular, for all objects o_i ∈ O and all edges (o_i, r, o_j) ∈ E, given input vectors v_i, v_r ∈ ℝ^(D_in), three graph convolution functions g_s, g_p and g_o are used to compute output vectors v'_i, v'_r ∈ ℝ^(D_out) for all nodes and edges. They take the triple of vectors (v_i, v_r, v_j) of one edge as input and output new vectors for the subject o_i, the predicate r and the object o_j, respectively.
To compute the output vector of an edge, let v'_r = g_p(v_i, v_r, v_j). Updating the object vectors is more difficult because one object may participate in multiple relationships: the output vector v'_i of an object o_i should depend on the vectors v_j of all objects connected to o_i by graph edges and on the vectors v_r of those edges. For this reason, for each edge starting at o_i, g_s is used to compute a candidate vector, and all such candidate vectors are collected in the set V_i^s; likewise, g_o is used to compute the set V_i^o of candidate vectors for all edges terminating at o_i:
V_i^s = { g_s(v_i, v_r, v_j) : (o_i, r, o_j) ∈ E }
V_i^o = { g_o(v_j, v_r, v_i) : (o_j, r, o_i) ∈ E }
The output vector v'_i of object o_i is then computed as v'_i = h(V_i^s ∪ V_i^o), where h is a symmetric function that pools an input set of vectors into a single output vector. An example of the computation graph of a single graph convolution layer is shown in FIG. 4.
In implementation, the functions g_s, g_p and g_o are realized with a single network that concatenates its three input vectors, feeds them to a multi-layer perceptron (MLP), and computes the three output vectors with fully connected output heads. The pooling function h averages its input vectors and feeds the result to an MLP.
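A sketch of a single graph convolution layer following this description is given below; it assumes PyTorch, the layer sizes stated above (128-dimensional input and output vectors, a 512-dimensional hidden dimension), and average pooling for h followed by an MLP. The class and variable names are illustrative and not taken from the patent.

```python
import torch
import torch.nn as nn

class GraphTripleConv(nn.Module):
    """One graph convolution layer: g_s, g_p, g_o share a single MLP with three output heads."""
    def __init__(self, d_in=128, d_hidden=512, d_out=128):
        super().__init__()
        self.d_hidden, self.d_out = d_hidden, d_out
        # single network applied to the concatenated triple (v_i, v_r, v_j)
        self.net = nn.Sequential(
            nn.Linear(3 * d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 2 * d_hidden + d_out))
        # pooling function h: average the candidate vectors, then apply an MLP
        self.pool_net = nn.Sequential(
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_out))

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (O, d_in); pred_vecs: (T, d_in); edges: (T, 2) long tensor of (subject, object) indices
        s_idx, o_idx = edges[:, 0], edges[:, 1]
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        out = self.net(triples)
        cand_s = out[:, :self.d_hidden]                               # g_s candidates (outgoing edges)
        new_pred = out[:, self.d_hidden:self.d_hidden + self.d_out]  # g_p: new edge vectors v'_r
        cand_o = out[:, self.d_hidden + self.d_out:]                  # g_o candidates (incoming edges)

        # h: collect V_i^s and V_i^o per object and average them
        pooled = obj_vecs.new_zeros(obj_vecs.size(0), self.d_hidden)
        pooled = pooled.index_add(0, s_idx, cand_s).index_add(0, o_idx, cand_o)
        counts = obj_vecs.new_zeros(obj_vecs.size(0), 1)
        ones = pred_vecs.new_ones(pred_vecs.size(0), 1)
        counts = counts.index_add(0, s_idx, ones).index_add(0, o_idx, ones)
        new_obj = self.pool_net(pooled / counts.clamp(min=1))
        return new_obj, new_pred
```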
3. Layout prediction
Single-object embeddings are used as the input of the next stage of the network model. The output of the second stage of the prediction model serves as a scene layout with object localization. The object embeddings are then used to form a series of triple (<subject, predicate, object>) embedding vectors. Each triple embedding vector is passed through a triple mask prediction network that marks whether a target object is the subject or the object, in order to mark the subject-predicate relations among multiple targets. The triple embedding vector is also passed through a triple regression network, in which the network is trained to locate the joint bounding box of the subject and the object; this box is defined as the bounding box enclosing the subject and the object. As shown in fig. 2.
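The two triple-level heads just described could be sketched as follows. This is an assumed reading of the text in which the triple embedding is the concatenation of the subject, predicate and object embeddings, the mask prediction head outputs subject/object role scores, and the regression head outputs the joint bounding box; the layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleHeads(nn.Module):
    """Triple mask prediction network and triple regression network over <subject, predicate, object>."""
    def __init__(self, d=128, d_hidden=256):
        super().__init__()
        # marks whether a target object in the triple is the subject or the object
        self.mask_net = nn.Sequential(nn.Linear(3 * d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 2))
        # locates the joint bounding box (x0, y0, x1, y1) enclosing subject and object
        self.box_net = nn.Sequential(nn.Linear(3 * d, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 4))

    def forward(self, v_subj, v_pred, v_obj):
        t = torch.cat([v_subj, v_pred, v_obj], dim=-1)  # the triple embedding vector
        return self.mask_net(t), self.box_net(t)
```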
The present invention processes an input scene graph using a series of graph convolutions, providing each object with an embedded vector that aggregates the information of all objects and relationships in the graph.
To generate an image layout, the representation must be moved from the graph domain to the image domain. To this end, a scene layout is computed from the object embedding vectors, giving a coarse two-dimensional structure of the image; the scene layout is computed by predicting a segmentation mask and a bounding box for each object with an object layout network. As shown in fig. 5.
The object layout network accepts the embedding vector v_i of shape D for object o_i and passes it to a mask regression network, which predicts a soft binary mask m̂_i of shape M, and to a box regression network, which predicts the position of the bounding box b̂_i. The mask regression network consists of several transposed convolutions, and a nonlinear sigmoid activation function constrains the elements of the mask to the range (0, 1). The box regression network is an MLP. As shown in fig. 3.
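A sketch of these two per-object heads is given below, under stated assumptions: PyTorch, D = 128, a mask resolution of M = 16, and a particular arrangement of transposed-convolution stages; the names are illustrative.

```python
import torch
import torch.nn as nn

class ObjectLayoutHeads(nn.Module):
    """Mask regression network (transposed convolutions + sigmoid) and box regression network (MLP)."""
    def __init__(self, d=128, m=16):
        super().__init__()
        layers, ch, size = [], d, 1
        while size < m:  # repeatedly double spatial resolution from 1x1 up to M x M
            layers += [nn.ConvTranspose2d(ch, max(ch // 2, 16), 4, stride=2, padding=1), nn.ReLU()]
            ch, size = max(ch // 2, 16), size * 2
        layers += [nn.Conv2d(ch, 1, kernel_size=1), nn.Sigmoid()]  # mask elements constrained to (0, 1)
        self.mask_net = nn.Sequential(*layers)
        # box regression MLP: predicts (x0, y0, x1, y1) in normalized image coordinates
        self.box_net = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 4))

    def forward(self, obj_vecs):
        # obj_vecs: (O, D) object embedding vectors produced by the graph convolution network
        masks = self.mask_net(obj_vecs.view(obj_vecs.size(0), -1, 1, 1)).squeeze(1)  # (O, M, M)
        boxes = self.box_net(obj_vecs)                                               # (O, 4)
        return masks, boxes
```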
The embedding vector v_i is multiplied by the mask m̂_i to obtain a mask embedding of shape D × M, which is then warped to the position of the bounding box using bilinear interpolation, giving the object layout. Finally, the scene layout is the sum of all these object layouts.
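The warping and summation step could look like the following sketch, which assumes boxes given in normalized [0, 1] coordinates and uses PyTorch's affine_grid/grid_sample to place each object's masked embedding inside its predicted box via bilinear interpolation; the function name and the output resolution are assumptions.

```python
import torch
import torch.nn.functional as F

def compose_layout(obj_vecs, masks, boxes, H=64, W=64):
    """Warp each object's masked embedding into its bounding box and sum into one scene layout."""
    # obj_vecs: (O, D); masks: (O, M, M); boxes: (O, 4) as (x0, y0, x1, y1) in [0, 1]
    n_obj, d = obj_vecs.shape
    mask_emb = obj_vecs.view(n_obj, d, 1, 1) * masks.unsqueeze(1)    # (O, D, M, M) mask embeddings

    x0, y0, x1, y1 = boxes.unbind(dim=1)
    w = (x1 - x0).clamp(min=1e-3)
    h = (y1 - y0).clamp(min=1e-3)
    # inverse affine map so that sampling an H x W canvas reads each patch only inside its box
    theta = torch.zeros(n_obj, 2, 3, device=obj_vecs.device, dtype=obj_vecs.dtype)
    theta[:, 0, 0] = 1.0 / w
    theta[:, 1, 1] = 1.0 / h
    theta[:, 0, 2] = -(x0 + x1 - 1.0) / w
    theta[:, 1, 2] = -(y0 + y1 - 1.0) / h

    grid = F.affine_grid(theta, size=(n_obj, d, H, W), align_corners=False)
    per_object = F.grid_sample(mask_emb, grid, align_corners=False)  # bilinear warp into each box
    return per_object.sum(dim=0, keepdim=True)                       # (1, D, H, W) scene layout
```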
4. Result matching
Structured queries of various forms can express the same visual semantics. The query database consists of 3100 visual relationships extracted from the annotated test scene graphs of the COCO-Stuff dataset. The retrieved images are ranked according to their respective embedding-space representations using a similarity measure S, where the similarity is computed from d, the L2 distance between the query q and the retrieval result r_k at rank k. The results are shown in FIG. 6.
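The ranking itself reduces to sorting by the L2 distance d; a minimal sketch is given below, showing only the distance-based ordering of the top-N results. The function and parameter names are assumptions.

```python
import torch

def rank_images(query_emb, db_embs, top_n=10):
    """Rank database images by L2 distance between embedding-space representations."""
    # query_emb: (D,) embedding of the structured <subject, predicate, object> query
    # db_embs:   (K, D) scene-graph embeddings of the images in the retrieval database
    d = torch.cdist(query_emb.unsqueeze(0), db_embs).squeeze(0)  # d(q, r_k) for every image
    return torch.argsort(d)[:top_n]                              # indices of the top-N retrieved images
```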
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A scene graph-based image content retrieval method is characterized in that: the method comprises the following specific steps:
Step one: scene graph embedding;
processing an input scene graph with a graph convolutional neural network to generate an embedding corresponding to each target node in the graph; extracting deep visual features of the image with a feature extraction network to obtain an overall feature map of the image; selecting a triple region according to the <subject, relation, object> to be detected and inferred; and processing the regional features of the targets to obtain predicted targets and predicted relations;
Step two: layout prediction model;
single-object embeddings are used as the input of the next stage of the network model, and the output of the second stage of the prediction model serves as a scene layout with object localization; the object embeddings are used to form a series of triple (<subject, predicate, object>) embedding vectors; each triple embedding vector is passed through a triple mask prediction network that marks whether a target object is the subject or the object, in order to mark the subject-predicate relations among multiple targets; the triple embedding vector is also passed through a triple regression network, in which the network is trained to locate the joint bounding box of the subject and the object; this box is defined as the bounding box enclosing the subject and the object;
Step three: target matching;
to query the database, the object embeddings learned from the scene graph are used to form a structured query; structured queries of various forms can express the same visual semantics; a database of 3100 visual relationships is extracted from the annotated test scene graphs of the COCO-Stuff dataset; the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S; and the top N pictures that meet the retrieval requirement are displayed for screening.
2. The method for retrieving image content based on scene graph as claimed in claim 1, wherein: the input scene graph in step 1 is specifically:
the input scene graph describes multiple relations between the targets in the graph; given a set of object class sets C and a set of relationship sets R, a scene graph is a (O, E) tuple, O = { O = 1 ,...,o n Is a set of categories, o i ∈C,
Figure FDA0003981830340000013
Is a set of directed edges, the relationship form between a scene graph can be expressed as a set (o) i ,r,o j ),o i ,o j ∈O,r∈R。
3. The method for retrieving image content based on scene graph as claimed in claim 1 or 2, wherein: the scene graph embedding in the step 1 specifically comprises:
For all objects o_i ∈ O and all edges (o_i, r, o_j) ∈ E, given input vectors v_i, v_r ∈ ℝ^(D_in), three graph convolution functions g_s, g_p and g_o are used to compute output vectors v'_i, v'_r ∈ ℝ^(D_out) for all nodes and edges; they take the triple of vectors (v_i, v_r, v_j) of one edge as input and output new vectors for the subject o_i, the predicate r and the object o_j, respectively;
the new edge vector is v'_r = g_p(v_i, v_r, v_j); the output vector v'_i of an object o_i should depend on the vectors v_j of all objects connected to o_i by graph edges and on the vectors v_r of those edges; for each edge starting at o_i, g_s is used to compute a candidate vector, all such candidate vectors being collected in the set V_i^s, and likewise g_o is used to compute the set V_i^o of candidate vectors for all edges terminating at o_i:
V_i^s = { g_s(v_i, v_r, v_j) : (o_i, r, o_j) ∈ E }
V_i^o = { g_o(v_j, v_r, v_i) : (o_j, r, o_i) ∈ E }
the output vector v'_i of object o_i is computed as v'_i = h(V_i^s ∪ V_i^o), where h is a symmetric function that pools an input set of vectors into a single output vector.
4. The method of claim 1, wherein the method comprises: the layout prediction model in the second step is specifically as follows:
object layout network accepts an embedded vector v i Shape D object o i And passes it to a mask regression network to predict a soft binary mask of shape M
Figure FDA0003981830340000021
A boundary regression network predicts the position of the boundary regression box
Figure FDA0003981830340000022
To embed vector v i And mask>
Figure FDA0003981830340000023
Multiplication results in mask embedding of shape D × M, which is then warped to the position of the bounding box using bilinear interpolation, giving the object layout, which is the sum of all the object layouts above.
5. The method of claim 4, wherein the method comprises: the mask regression network is composed of several transposed convolutions, a nonlinear sigmoid activation function constrains the elements of the mask to the range (0, 1), and the box regression network is a multi-layer perceptron (MLP).
6. The method of claim 1, wherein the method comprises: in step three, the retrieved images are ranked according to their respective embedding-space representations using a similarity measure S, where the similarity is computed from d, the L2 distance between the query q and the retrieval result r_k at rank k.
CN202211550485.2A 2022-12-05 2022-12-05 Image content retrieval method based on scene graph Pending CN115952306A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211550485.2A CN115952306A (en) 2022-12-05 2022-12-05 Image content retrieval method based on scene graph

Publications (1)

Publication Number Publication Date
CN115952306A true CN115952306A (en) 2023-04-11

Family

ID=87285186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211550485.2A Pending CN115952306A (en) 2022-12-05 2022-12-05 Image content retrieval method based on scene graph

Country Status (1)

Country Link
CN (1) CN115952306A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination