CN114677544A - Scene graph generation method, system and equipment based on global context interaction - Google Patents

Scene graph generation method, system and equipment based on global context interaction

Info

Publication number
CN114677544A
CN114677544A (application CN202210297025.7A)
Authority
CN
China
Prior art keywords
target
feature
global
vector
gru
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210297025.7A
Other languages
Chinese (zh)
Other versions
CN114677544B (en)
Inventor
罗敏楠
杨名帆
郑庆华
董怡翔
刘欢
秦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202210297025.7A
Publication of CN114677544A
Application granted
Publication of CN114677544B
Legal status: Active
Anticipated expiration

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene graph generation method, system and equipment based on global context interaction, comprising: 1) joint vector representation based on the fusion of multiple features, including object visual features, spatial coordinates and semantic labels; 2) global feature generation based on a bidirectional gated recurrent neural network; 3) an iterative message passing mechanism based on the global feature vectors; and 4) scene graph generation based on the target and relationship state representations. Compared with existing scene graph generation methods, the method makes full use of the global features of the image through context interaction and is more generally applicable. After the globally interacted context features are obtained, messages are passed between target pairs and their relationships, the latent relationships among targets are used to update the current states, and a more accurate scene graph is generated, which gives the method practical application value.

Description

Scene graph generation method, system and equipment based on global context interaction
Technical Field
The invention belongs to the field of computer vision and particularly relates to a scene graph generation method, system and equipment based on global context interaction.
Background
A scene graph composed of <subject-relation-object> triplets can describe the objects in an image and the scene structure relationships between pairs of objects. The scene graph has two main advantages. First, the <subject-relation-object> triplets of a scene graph carry structured semantic content and, compared with natural language text, have clear advantages in acquiring and processing fine-grained information. Second, a scene graph can fully represent the objects and scene structure relationships in an image and therefore has broad application prospects in many computer vision tasks. For example, in autonomous driving, using a scene graph for environment modelling can provide the decision system with more comprehensive environment information; in semantic image retrieval, the image provider models the scene structure of each image with a scene graph, so a user only needs to describe the main target or relationship to retrieve images that meet the requirement. Given the massive number of images and the real-time requirements of downstream tasks on scene graphs, generating scene graphs automatically with computers has gradually become a research hotspot and is of great significance to the field of image understanding.
Existing message-passing-based scene graph generation methods construct target nodes and relationship edges from the target detection results, update the states within a local subgraph using a recurrent neural network driven by a message passing mechanism, and use the features after message passing for relationship prediction. Such methods adopt a message passing mechanism built on a local-context idea: they ignore the implicit constraints among targets, take only the visual features of the target nodes as the initial state, and detect relationships solely through repeated exchange between the subject/object node features and the joint visual features. As a result, the model cannot take the overall structure of the image into account, global information plays no role in relationship prediction, and the prediction capability of the model is limited. In addition, existing methods do not exploit object coordinates, so the visual relationships between targets are not analysed from a spatial perspective. Aiming at these problems, the invention provides a scene graph generation method based on global context interaction. Existing scene graph generation methods include the following:
Prior art 1 proposes an image scene graph generation method that performs two-level relationship prediction by dividing relationships into parent classes and child classes and determines the precise relationship with a normalization function to generate the scene graph of the image.
Prior art 2 proposes a scene graph generation method based on a deep relational self-attention network. The method mainly includes: first, performing target detection on the input image to obtain labels, object box features and joint box features; then constructing target features and relative relationship features; and finally generating the visual scene graph with a deep neural network.
The scene graph generation method of prior art 1 does not consider making full use of the feature vectors through feature fusion; the method of prior art 2 does not use a message passing mechanism, does not consider the information interaction between a target pair and its relationship, and cannot update the states after context propagation. Moreover, neither method uses the implicit constraints that exist among all objects in the image to construct the context, so both have certain shortcomings.
Disclosure of Invention
The invention aims to provide a scene graph generation method, system and equipment based on global context interaction so as to solve the above problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
Compared with the prior art, the invention has the following technical effects:
Compared with feature representation methods that use only visual features to describe a target, the method makes full use of the target's visual features, category features and spatial coordinate information, thereby improving the relationship prediction performance of scene graph generation;
Compared with scene graph generation methods that use only local context interaction, the method uses a recurrent neural network to extract the global context of the image, realizes information interaction based on this global context and then performs message passing, fully achieving data interaction and information propagation.
Drawings
FIG. 1 is a block diagram of a scene graph generation method based on global context interaction according to the present invention.
FIG. 2 is a flow diagram of a feature fusion based joint representation of vectors.
Fig. 3 is a structural diagram of a bidirectional gated recurrent neural network BiGRU.
Fig. 4 is a flow diagram of the iterative message passing mechanism based on global feature vectors.
Fig. 5 is a diagram illustrating target detection results and the corresponding scene graph.
FIG. 6 is a graph of the results of the performance test of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples. It should be noted that the embodiments described herein are only for explaining the present invention, and are not intended to limit the present invention. Furthermore, the technical features related to the embodiments of the present invention may be combined with each other without conflict.
The specific implementation of the invention comprises target detection and feature vector fusion on the image, feature generation based on global context interaction, and message passing. FIG. 1 is a block diagram of the scene graph generation method based on global context interaction according to the present invention.
1. Object detection and feature vector fusion of images
Given an input image, the method uses the Faster-RCNN deep learning model for target detection, obtaining the target set O = (o_1, o_2, …, o_n), the corresponding visual feature set V = (v_1, v_2, …, v_n), the coordinate feature set B = (b_1, b_2, …, b_n), the pre-classification label set L = (l_1, l_2, …, l_n), and the visual features C = (c_{i→j}, i ≠ j) of the union box of each pair of target coordinates.
First, the invention uses a feature fusion method to jointly represent the spatial coordinate feature b_i and the visual feature vector v_i of each target. For a target o_i, the absolute position coordinates are b = (x_1, y_1, x_2, y_2), where (x_1, y_1) and (x_2, y_2) are the upper-left and lower-right coordinates of the rectangular regression box; they are converted into a relative position code b_i within the image by the following formula:
b_i = (x_1/wid, y_1/hei, x_2/wid, y_2/hei),
where wid denotes the original width of image I and hei denotes its original height.
Then, the relative position code b_i is expanded into a 128-dimensional feature s_i using a fully connected layer of the neural network:
s_i = σ(W_s b_i + b_s),
where σ denotes the ReLU activation function, and W_s and b_s are linear transformation parameters learned and adjusted automatically by the neural network. Meanwhile, a fully connected layer converts the target visual feature v_i from 4096 dimensions to 512 dimensions.
Then, the dimension-transformed relative position feature vector s_i and the visual feature v_i are concatenated and transformed to obtain a 512-dimensional fusion vector f_i of the target's visual and coordinate features:
f_i = σ(W_f [s_i, v_i] + b_f),
where [·] denotes the concatenation operation, σ denotes the ReLU activation function, and W_f and b_f are linear transformation parameters.
The above feature vector fusion process is shown in fig. 2.
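For illustration, the fusion step described above can be sketched in PyTorch roughly as follows; the dimensions (4 → 128 for the coordinate code, 4096 → 512 for the visual feature, 640 → 512 for the fused vector) follow the text, while the class, layer and variable names are assumptions rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the visual/coordinate feature fusion described above.
    Names and structure are illustrative; dimensions follow the text."""
    def __init__(self):
        super().__init__()
        self.spatial_fc = nn.Linear(4, 128)        # b_i -> s_i
        self.visual_fc = nn.Linear(4096, 512)      # v_i (4096-d) -> 512-d
        self.fuse_fc = nn.Linear(128 + 512, 512)   # [s_i, v_i] -> f_i

    def forward(self, boxes, visual, img_w, img_h):
        # boxes: (n, 4) absolute (x1, y1, x2, y2); visual: (n, 4096)
        scale = boxes.new_tensor([img_w, img_h, img_w, img_h])
        rel = boxes / scale                         # relative position code b_i
        s = torch.relu(self.spatial_fc(rel))        # s_i = ReLU(W_s b_i + b_s)
        v = torch.relu(self.visual_fc(visual))      # 512-d visual feature
        f = torch.relu(self.fuse_fc(torch.cat([s, v], dim=-1)))  # f_i
        return f
```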
2. Global feature generation based on bidirectional gated recurrent neural network
In the global feature generation process, the invention constructs a bidirectional gated recurrent neural network (BiGRU) and uses a zero vector as its initial state; the structure of the BiGRU is shown in FIG. 3. After the feature fusion vectors F = (f_1, f_2, …, f_n) of the target set are obtained, the targets are sorted from left to right by the first item (the x coordinate) of their relative coordinates, and the sequence is input into the BiGRU to obtain the global context target features γ = (γ_1, γ_2, …, γ_n). The specific generation steps are:
(1) Initialize a zero vector as the initial state of the BiGRU;
(2) At the two ends of the BiGRU, input the first and last feature fusion vectors of the target set, f_1 and f_n, respectively, generating the hidden states h_1^fwd and h_n^bwd corresponding to each direction and position;
(3) Input the feature vectors into the two ends of the BiGRU in order, generating the forward hidden states (h_1^fwd, …, h_n^fwd) and the backward hidden states (h_n^bwd, …, h_1^bwd);
(4) Fuse the forward and backward hidden states to obtain the context fusion state γ_i of each target.
Then, the invention uses Glove word embedding vectors to convert the pre-classification result L = (l_1, l_2, …, l_n) obtained during target detection into 128-dimensional target class feature vectors g_i.
Finally, the invention uses a fully connected layer of the neural network to fuse the global context target feature γ_i of each target with its class feature vector g_i, obtaining the global feature c_i of the target. The calculation is:
g_i = Glove(l_i),
c_i = σ(W_c [γ_i, g_i] + b_c),
where Glove(l_i) denotes encoding the pre-classification label of the target with the Glove method, [·] denotes the concatenation operation, and W_c and b_c are linear transformation parameters.
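Again for illustration only, a minimal sketch of this global-context step might look as follows, assuming 512-dimensional fused features, a BiGRU whose forward and backward outputs are concatenated, and a pre-trained 128-dimensional Glove embedding matrix supplied from outside; all names are placeholders rather than the patent's implementation.

```python
import torch
import torch.nn as nn

class GlobalContext(nn.Module):
    """Sketch of global feature generation: BiGRU over targets sorted by x,
    then fusion with 128-d Glove label embeddings. Names are illustrative."""
    def __init__(self, glove_weights):  # glove_weights: (num_classes, 128)
        super().__init__()
        self.bigru = nn.GRU(input_size=512, hidden_size=256,
                            bidirectional=True, batch_first=True)
        self.label_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.fuse_fc = nn.Linear(512 + 128, 512)   # [gamma_i, g_i] -> c_i

    def forward(self, f, labels, x_coords):
        # f: (n, 512) fused features; labels: (n,) class ids; x_coords: (n,)
        order = torch.argsort(x_coords)             # sort targets left to right
        gamma, _ = self.bigru(f[order].unsqueeze(0))  # (1, n, 512): fwd+bwd concat
        gamma = gamma.squeeze(0)
        gamma = gamma[torch.argsort(order)]          # restore original target order
        g = self.label_emb(labels)                   # g_i = Glove(l_i)
        c = torch.relu(self.fuse_fc(torch.cat([gamma, g], dim=-1)))  # c_i
        return c
```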
3. Iterative message passing mechanism based on the global feature vectors
The iterative message passing mechanism is divided into a message aggregation function and a state update function.
First, the invention constructs a message aggregation function. In the scene graph topology, nodes and edges represent, respectively, the subject and object targets and their relationships in a visual relation. During message passing, a single node or edge simultaneously receives information from multiple sources, so a pooling function needs to be designed to compute the weight of each part of the message, and the final incoming message is aggregated as the weighted sum of the messages. Depending on the recipient, an incoming message is either a message m_i^t received by a target node or a message m_{i→j}^t received by a relationship edge. Given the current hidden state h_i^t of a node GRU and the hidden state h_{i→j}^t of a relationship-edge GRU, the message passed into the i-th node at the t-th iteration is denoted m_i^t; it is computed from the hidden state h_i^t of the target GRU itself, the hidden states h_{i→j}^t of its out-degree edge GRUs, and the hidden states h_{j→i}^t of its in-degree edges, where i→j denotes that target i is the subject and target j is the object of the relation:
m_i^t = Σ_{j: i→j} σ(v_1^T [h_i^t, h_{i→j}^t]) · h_{i→j}^t + Σ_{j: j→i} σ(v_2^T [h_i^t, h_{j→i}^t]) · h_{j→i}^t.
Similarly, the message m_{i→j}^t aggregated at the t-th iteration by the relationship edge from the i-th target node to the j-th target node is composed of the relationship-edge GRU hidden state h_{i→j}^t from the previous iteration, the subject node GRU hidden state h_i^t, and the object node GRU hidden state h_j^t:
m_{i→j}^t = σ(w_1^T [h_i^t, h_{i→j}^t]) · h_i^t + σ(w_2^T [h_j^t, h_{i→j}^t]) · h_j^t.
Both m_i^t and m_{i→j}^t are therefore aggregated with adaptive weighting functions of the above form, where [·] denotes the concatenation operation, σ denotes the ReLU activation function, and w_1, w_2 and v_1, v_2 are learnable parameters.
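The adaptive weighted aggregation can be sketched as below; the pairing of the weight vectors v_1, v_2, w_1, w_2 with the hidden states follows the reconstruction above and is an assumption, as are all module and variable names.

```python
import torch
import torch.nn as nn

class MessagePooling(nn.Module):
    """Sketch of the adaptive weighted message aggregation. The weight
    vectors v1, v2 (node messages) and w1, w2 (edge messages) follow the
    parameter names in the text; the exact form is an assumption."""
    def __init__(self, dim=512):
        super().__init__()
        self.v1 = nn.Linear(2 * dim, 1)  # weights out-degree edge states for node messages
        self.v2 = nn.Linear(2 * dim, 1)  # weights in-degree edge states for node messages
        self.w1 = nn.Linear(2 * dim, 1)  # weights the subject state for edge messages
        self.w2 = nn.Linear(2 * dim, 1)  # weights the object state for edge messages

    def node_message(self, h_i, h_out, h_in):
        # h_i: (d,) node state; h_out/h_in: (k, d) states of out-/in-degree edges
        a_out = torch.relu(self.v1(torch.cat([h_i.expand_as(h_out), h_out], -1)))
        a_in = torch.relu(self.v2(torch.cat([h_i.expand_as(h_in), h_in], -1)))
        return (a_out * h_out).sum(0) + (a_in * h_in).sum(0)    # m_i

    def edge_message(self, h_ij, h_i, h_j):
        # h_ij: (d,) edge state; h_i, h_j: (d,) subject/object node states
        a_s = torch.relu(self.w1(torch.cat([h_i, h_ij], -1)))
        a_o = torch.relu(self.w2(torch.cat([h_j, h_ij], -1)))
        return a_s * h_i + a_o * h_j                             # m_{i->j}
```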
Secondly, the invention constructs a state update function: a target-node GRU and a relationship-edge GRU are constructed to store and update the hidden-state feature vectors h_i^t and h_{i→j}^t of the targets and of the relationships between targets. First, at t = 0, the GRU state of every target node and relationship edge is initialized to a zero vector; the global feature vector c_i of each target is used as the input of its target-node GRU, and the visual feature c_{i→j} of the union box of the two target coordinates is used as the input of the relationship-edge GRU, generating the hidden states h_i^0 and h_{i→j}^0 of the target nodes and relationship edges at the initial time.
In subsequent iterations, at each iteration t, each GRU, depending on whether it is a target GRU or a relationship GRU, takes its hidden state from the previous iteration, h_i^{t-1} or h_{i→j}^{t-1}, together with the incoming message of the previous iteration, m_i^{t-1} or m_{i→j}^{t-1}, as input, and generates a new hidden state h_i^t or h_{i→j}^t as output, from which the message aggregation function generates the messages for the next iteration:
h_i^t = GRU(h_i^{t-1}, m_i^{t-1}),
h_{i→j}^t = GRU(h_{i→j}^{t-1}, m_{i→j}^{t-1}).
Therefore, the specific steps of the whole message passing mechanism are as follows:
(1) Initialize the GRU states of all target nodes and relationship edges to zero vectors;
(2) Use the global feature vector c_i of each target as the input of its target-node GRU and the union-box visual feature c_{i→j} of the two target coordinates as the input of the relationship-edge GRU, generating the initial hidden states h_i^0 and h_{i→j}^0 of the target nodes and relationship edges;
(3) Compute the messages m_i^t and m_{i→j}^t received by each target and relationship using the message aggregation function;
(4) Combine the hidden states h_i^t and h_{i→j}^t with the received messages m_i^t and m_{i→j}^t, and update the states with the GRUs to obtain the states h_i^{t+1} and h_{i→j}^{t+1} at the next step;
(5) If the number of iterations has reached the set value, save the current target and relationship states h_i^T and h_{i→j}^T; otherwise, return to step (3).
The above message passing mechanism flow is shown in fig. 4.
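Putting the aggregation and update steps together, the iteration loop might be sketched as follows, reusing the MessagePooling module sketched above and nn.GRUCell for the target-node and relationship-edge GRUs; the data layout (a feature matrix plus a dictionary of union-box features keyed by edge) is purely illustrative and not taken from the patent.

```python
import torch

def message_passing(c, c_union, edges, pool, node_gru, edge_gru, n_iter=2):
    """Sketch of the iterative message passing loop.
    c: (n, d) global target features c_i; c_union: dict {(i, j): (d,) union-box feature};
    edges: list of (i, j) pairs; pool: MessagePooling; node_gru/edge_gru: nn.GRUCell(d, d)."""
    n, d = c.shape
    zero = c.new_zeros(d)
    # steps (1)-(2): zero initial GRU states, then feed c_i and c_{i->j} as the first inputs
    h_node = [node_gru(c[i].unsqueeze(0), zero.unsqueeze(0)).squeeze(0) for i in range(n)]
    h_edge = {e: edge_gru(c_union[e].unsqueeze(0), zero.unsqueeze(0)).squeeze(0) for e in edges}

    for _ in range(n_iter):
        # step (3): aggregate the incoming message of every node and edge
        m_node, m_edge = [], {}
        for i in range(n):
            h_out = torch.stack([h_edge[e] for e in edges if e[0] == i] or [zero])
            h_in = torch.stack([h_edge[e] for e in edges if e[1] == i] or [zero])
            m_node.append(pool.node_message(h_node[i], h_out, h_in))
        for (i, j) in edges:
            m_edge[(i, j)] = pool.edge_message(h_edge[(i, j)], h_node[i], h_node[j])
        # step (4): GRU state update with the aggregated messages as inputs
        h_node = [node_gru(m_node[i].unsqueeze(0), h_node[i].unsqueeze(0)).squeeze(0)
                  for i in range(n)]
        h_edge = {e: edge_gru(m_edge[e].unsqueeze(0), h_edge[e].unsqueeze(0)).squeeze(0)
                  for e in edges}
    # step (5): the final states are used for classification
    return h_node, h_edge
```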
4. Scene graph generation based on target and relationship state representation
The hidden states of the targets and relationships after updating through the message passing mechanism are regarded as the feature vectors of the targets and relationships; they are fed into a neural network, and the softmax function is used for category prediction, giving the category of each target and the relationship category between each pair of targets, and thus a scene graph that reflects the relationships between targets in the image.
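This final prediction step can likewise be sketched as two small classification heads over the final hidden states; the head names and the class counts (150 object classes, 50 relationship classes) are placeholders, not values given in the text.

```python
import torch
import torch.nn as nn

class SceneGraphHeads(nn.Module):
    """Sketch of the final prediction step: softmax classification of object
    categories and relationship categories from the final GRU states."""
    def __init__(self, dim=512, num_obj=150, num_rel=50):
        super().__init__()
        self.obj_head = nn.Linear(dim, num_obj)
        self.rel_head = nn.Linear(dim, num_rel)

    def forward(self, h_node, h_edge):
        # h_node: (n, d) final node states; h_edge: (m, d) final edge states
        obj_prob = torch.softmax(self.obj_head(h_node), dim=-1)   # per-target categories
        rel_prob = torch.softmax(self.rel_head(h_edge), dim=-1)   # per-pair predicates
        return obj_prob, rel_prob
```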
Given an input image, the target detection result and the corresponding scene graph are shown in fig. 5, and the performance test result of the model is shown in fig. 6.
In another embodiment of the present invention, a scene graph generating system based on global context interaction is provided, which can be used to implement the above scene graph generating method based on global context interaction, and specifically, the system includes:
a target detection module for performing target detection on the input image I to obtain a target set O ═ O1,o2,…,on) And a corresponding set of visual features V ═ V (V ═ V)1,v2,…,vn) And the coordinate feature set B ═ B1,b2,…,bn) And the pre-classified label set L ═ (L)1,l2,…,ln) And (C) combining the two target coordinates and collecting the visual characteristics C in the framei→j,i≠j);
A joint representation vector acquisition module for the target's visual and coordinate features, used to transform the absolute position coordinates of each target with a neural network to obtain the joint representation vector f_i of the target's visual and coordinate features;
A target global feature acquisition module for obtaining, from the feature fusion vectors F = (f_1, f_2, …, f_n), the global context target feature γ_i and its class feature vector g_i, and for fusing the global context target feature γ_i of each target with its class feature vector g_i using a neural network to obtain the global feature c_i of the target;
A scene graph acquisition module for initializing the hidden states h_i^0 and h_{i→j}^0 based on the global feature vector c_i of each target and the feature vector c_{i→j} of each relationship; initially computing the incoming message m_i^0 of each node and the incoming message m_{i→j}^0 of each edge and passing them iteratively; updating the hidden states h_i^t and h_{i→j}^t with recurrent neural networks and performing message aggregation to obtain the incoming messages m_i^t and m_{i→j}^t at each step t; and, when the set number of iterations is reached, generating from the final states of the target nodes and relationship edges a scene graph that reflects the relationships between targets in the image.
The division of the modules in the embodiments of the present invention is schematic and represents only one possible division of logical functions; in an actual implementation there may be other divisions. In addition, the functional modules in the embodiments of the present invention may be integrated into one processor, may exist separately in physical form, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
In yet another embodiment of the present invention, a computer device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. It is the computing and control core of the terminal and is specifically adapted to load and execute one or more instructions in a computer storage medium to implement the corresponding method flow or function; the processor of this embodiment of the invention can be used to run the scene graph generation method based on global context interaction.
The invention discloses a scene graph generation method based on global context interaction, which comprises the following steps: 1) joint vector representation based on the fusion of multiple features, including object visual features, spatial coordinates and semantic labels; 2) global feature generation based on a bidirectional gated recurrent neural network; 3) an iterative message passing mechanism based on the global feature vectors; and 4) scene graph generation based on the target and relationship state representations. Compared with existing scene graph generation methods, the method makes full use of the global features of the image through context interaction and is more generally applicable. After the globally interacted context features are obtained, messages are passed between target pairs and their relationships, the latent relationships among targets are used to update the current states, and a more accurate scene graph is generated, which gives the method practical application value.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments of the invention without departing from its spirit and scope, and such modifications and equivalents are to be covered by the claims.

Claims (10)

1. A scene graph generation method based on global context interaction, characterized by comprising:
performing target detection on the input image I to obtain the target set O = (o_1, o_2, …, o_n), the corresponding visual feature set V = (v_1, v_2, …, v_n), the coordinate feature set B = (b_1, b_2, …, b_n), the pre-classification label set L = (l_1, l_2, …, l_n), and the union-box visual features C = (c_{i→j}, i ≠ j) of each pair of target coordinates;
transforming the absolute position coordinates of each target with a neural network to obtain the joint representation vector f_i of the target's visual and coordinate features;
obtaining, from the feature fusion vectors F = (f_1, f_2, …, f_n), the global context target feature γ_i and its class feature vector g_i, and fusing the global context target feature γ_i of each target with its class feature vector g_i using a neural network to obtain the global feature c_i of the target;
initializing the hidden states h_i^0 and h_{i→j}^0 based on the global feature vector c_i of each target and the feature vector c_{i→j} of each relationship; initially computing the incoming message m_i^0 of each node and the incoming message m_{i→j}^0 of each edge and passing them iteratively; updating the hidden states h_i^t and h_{i→j}^t with recurrent neural networks and performing message aggregation to obtain the incoming messages m_i^t and m_{i→j}^t at each step t; and, when the set number of iterations is reached, generating from the final states of the target nodes and relationship edges a scene graph that reflects the relationships between targets in the image.
2. The method as claimed in claim 1, wherein a neural network is used to convert the absolute position coordinates of each target into a relative position code within the image and expand it into the relative position feature s_i, the visual feature v_i of the target is converted into 512 dimensions, and a feature fusion method is adopted to concatenate and transform the relative position feature vector s_i and the visual feature v_i to obtain the joint representation vector f_i of the target's visual and coordinate features.
3. The method as claimed in claim 2, wherein, in the feature-fusion-based joint vector representation, after target detection is performed on the input image I with the Faster-RCNN model, the absolute position coordinates of each target are converted into a relative position code b_i within the image; for a target o_i with coordinates (x_1, y_1, x_2, y_2), where (x_1, y_1) and (x_2, y_2) are the upper-left and lower-right coordinates of the rectangular regression box, the relative position code is calculated as:
b_i = (x_1/wid, y_1/hei, x_2/wid, y_2/hei),
where wid denotes the original width of image I and hei denotes its original height; then the relative position code b_i is expanded into a 128-dimensional feature s_i using a fully connected layer:
s_i = σ(W_s b_i + b_s),
where σ denotes the ReLU activation function, and W_s and b_s are linear transformation parameters learned and adjusted automatically by the neural network; meanwhile, the visual feature v_i obtained by target detection is dimension-transformed, converting the 4096-dimensional feature into 512 dimensions with a fully connected layer; then the dimension-transformed relative position feature vector s_i and the visual feature v_i are concatenated and transformed, finally obtaining the 512-dimensional fusion vector f_i of the target's visual and coordinate features:
f_i = σ(W_f [s_i, v_i] + b_f),
where [·] denotes the concatenation operation, σ denotes the ReLU activation function, and W_f and b_f are linear transformation parameters.
4. The method of claim 1, wherein, from the feature fusion vectors F = (f_1, f_2, …, f_n), the global context target features γ = (γ_1, γ_2, …, γ_n) are obtained with a bidirectional gated recurrent neural network (BiGRU); the classification result L = (l_1, l_2, …, l_n) obtained by the target detection module is used to obtain the class feature vector g_i of each target; and a neural network is used to fuse the global context target feature γ_i of each target with its class feature vector g_i, obtaining the global feature c_i of the target.
5. The method as claimed in claim 4, wherein, in the global feature generation process based on the bidirectional gated recurrent neural network, after the feature fusion vectors F = (f_1, f_2, …, f_n) of the target set are obtained, the targets are sorted from left to right by the x coordinate of their relative coordinates, and the sequence is input into the bidirectional gated recurrent neural network BiGRU to realize global context interaction, obtaining the global context target features γ = (γ_1, γ_2, …, γ_n);
subsequently, the classification result L = (l_1, l_2, …, l_n) from target detection is used to compute the Glove word embedding vector of each classification label, obtaining a 128-dimensional target class feature vector g_i; finally, the global context target feature γ_i of each target is fused with its class feature vector g_i to obtain the global feature c_i of the target, as shown by:
g_i = Glove(l_i),
c_i = σ(W_c [γ_i, g_i] + b_c),
where Glove(l_i) denotes encoding the pre-classification label of the target with the Glove method, [·] denotes the concatenation operation, and W_c and b_c are linear transformation parameters.
6. The method as claimed in claim 5, wherein γ_i is generated by the following specific steps:
(1) initializing a zero vector as the initial state of the BiGRU;
(2) at the two ends of the BiGRU, inputting the first and last feature fusion vectors of the target set, f_1 and f_n, respectively, generating the hidden states h_1^fwd and h_n^bwd corresponding to each direction and position;
(3) inputting the feature vectors into the two ends of the BiGRU in order, generating the forward hidden states (h_1^fwd, …, h_n^fwd) and the backward hidden states (h_n^bwd, …, h_1^bwd);
(4) fusing the forward and backward hidden states to obtain the context fusion state γ_i of each target.
7. The method for generating a scene graph based on global context interaction according to claim 1, wherein the iterative message passing mechanism based on the global feature vectors comprises two calculation functions, the construction of a message aggregation function and the construction of a state update function;
constructing the message aggregation function: given the hidden state h_i^t of the i-th target-node GRU and the hidden state h_{i→j}^t of the relationship-edge GRU from the i-th target node to the j-th target node, the message passed into the i-th node at the t-th iteration is denoted m_i^t; m_i^t is computed from the hidden state h_i^t of the target GRU itself, the hidden states h_{i→j}^t of its out-degree edge GRUs and the hidden states h_{j→i}^t of its in-degree edges, where i→j denotes that target i is the subject and target j is the object of the relation:
m_i^t = Σ_{j: i→j} σ(v_1^T [h_i^t, h_{i→j}^t]) · h_{i→j}^t + Σ_{j: j→i} σ(v_2^T [h_i^t, h_{j→i}^t]) · h_{j→i}^t;
similarly, the message m_{i→j}^t aggregated at the t-th iteration by the relationship edge from the i-th target node to the j-th target node is composed of the relationship-edge GRU hidden state h_{i→j}^t from the previous iteration, the subject node GRU hidden state h_i^t and the object node GRU hidden state h_j^t:
m_{i→j}^t = σ(w_1^T [h_i^t, h_{i→j}^t]) · h_i^t + σ(w_2^T [h_j^t, h_{i→j}^t]) · h_j^t,
where [·] denotes the concatenation operation, σ denotes the ReLU activation function, and w_1, w_2 and v_1, v_2 are learnable parameters;
constructing the state update function: a target-node GRU and a relationship-edge GRU are constructed to store and update the hidden-state feature vectors h_i^t and h_{i→j}^t of the targets and of the relationships between targets; first, at t = 0, the GRU state of every target node and relationship edge is initialized to a zero vector, the global feature vector c_i of each target is used as the input of its target-node GRU, and the union-box visual feature c_{i→j} of the two target coordinates is used as the input of its relationship-edge GRU, generating the initial hidden states h_i^0 and h_{i→j}^0 of the target nodes and relationship edges; in subsequent iterations, at each iteration t, each GRU, depending on whether it is a target GRU or a relationship GRU, takes its hidden state from the previous iteration, h_i^{t-1} or h_{i→j}^{t-1}, together with the incoming message of the previous iteration, m_i^{t-1} or m_{i→j}^{t-1}, as input, and generates a new hidden state h_i^t or h_{i→j}^t as output, from which the message aggregation function generates the messages for the next iteration:
h_i^t = GRU(h_i^{t-1}, m_i^{t-1}),
h_{i→j}^t = GRU(h_{i→j}^{t-1}, m_{i→j}^{t-1}).
8. The method for generating a scene graph based on global context interaction according to claim 1, wherein the iterative message passing mechanism based on the global feature vectors specifically comprises the following steps:
(1) initializing the GRU states of all target nodes and relationship edges to zero vectors;
(2) using the global feature vector c_i of each target as the input of its target-node GRU and the union-box visual feature c_{i→j} of the two target coordinates as the input of the relationship-edge GRU, generating the initial hidden states h_i^0 and h_{i→j}^0 of the target nodes and relationship edges;
(3) computing the messages m_i^t and m_{i→j}^t received by each target and relationship using the message aggregation function;
(4) combining the hidden states h_i^t and h_{i→j}^t with the received messages m_i^t and m_{i→j}^t, and updating the states with the GRUs to obtain the states h_i^{t+1} and h_{i→j}^{t+1} at the next step;
(5) if the number of iterations has reached the set value, saving the current target and relationship states h_i^T and h_{i→j}^T; otherwise, returning to step (3);
(6) after message passing is completed, feeding the final state vectors of the targets and relationships into a neural network to obtain a scene graph that reflects the relationships between targets in the image.
9. A scene graph generation system based on global context interaction, comprising:
a target detection module for performing target detection on the input image I to obtain a target set O ═ O1,o2,…,on) And a corresponding set of visual features V ═ V (V ═ V)1,v2,…,vn) And the coordinate feature set B ═ B1,b2,…,bn) And the pre-classified label set L ═ (L)1,l2,…,ln) And (C) combining the two target coordinates and collecting the visual characteristics C in the framei→j,i≠j);
A joint expression vector acquisition module of the target vision and coordinate characteristics, which is used for transforming the absolute position coordinates of each target by using a neural network to obtain a joint expression vector f of the target vision and coordinate characteristicsi
A target global feature obtaining module for obtaining (F) according to the feature fusion vector F1,f2,…,fn) Obtaining local context target characteristics gammaiAnd its class feature vector giUsing a neural network to map the global context target feature gamma of the targetiAnd its class feature vector giFusing to obtain the global feature c of the targeti
A scene graph acquisition module for acquiring a global feature vector c based on each targetiFeature vector c of each relationi→jInitialize its hidden state
Figure FDA0003563856320000053
Further, each node incoming message is initially calculated
Figure FDA0003563856320000054
Each side incoming message
Figure FDA0003563856320000055
And performing iterative transfer, and updating hidden state by using recurrent neural network
Figure FDA0003563856320000056
And carrying out message aggregation to obtain the incoming message of each time i
Figure FDA0003563856320000057
And generating a scene graph capable of reflecting the relation between the target and the target in the image by using the final states of the target node and the relation edge until the set iteration number is reached.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the scene graph generation method based on global context interaction according to any one of claims 1 to 8 when executing the computer program.
CN202210297025.7A 2022-03-24 2022-03-24 Scene graph generation method, system and equipment based on global context interaction Active CN114677544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210297025.7A CN114677544B (en) 2022-03-24 2022-03-24 Scene graph generation method, system and equipment based on global context interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210297025.7A CN114677544B (en) 2022-03-24 2022-03-24 Scene graph generation method, system and equipment based on global context interaction

Publications (2)

Publication Number Publication Date
CN114677544A (en) 2022-06-28
CN114677544B (en) 2024-08-16

Family

ID=82073908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210297025.7A Active CN114677544B (en) 2022-03-24 2022-03-24 Scene graph generation method, system and equipment based on global context interaction

Country Status (1)

Country Link
CN (1) CN114677544B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546589A (en) * 2022-11-29 2022-12-30 浙江大学 Image generation method based on graph neural network
CN118015522A (en) * 2024-03-22 2024-05-10 广东工业大学 Time transition regularization method and system for video scene graph generation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462282A (en) * 2020-04-02 2020-07-28 哈尔滨工程大学 Scene graph generation method
WO2020244287A1 (en) * 2019-06-03 2020-12-10 中国矿业大学 Method for generating image semantic description
CN113221613A (en) * 2020-12-14 2021-08-06 国网浙江宁海县供电有限公司 Power scene early warning method for generating scene graph auxiliary modeling context information
CN113627557A (en) * 2021-08-19 2021-11-09 电子科技大学 Scene graph generation method based on context graph attention mechanism
CN113836339A (en) * 2021-09-01 2021-12-24 淮阴工学院 Scene graph generation method based on global information and position embedding
KR20220025524A (en) * 2020-08-24 2022-03-03 경기대학교 산학협력단 System for generating scene graph using deep neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020244287A1 (en) * 2019-06-03 2020-12-10 中国矿业大学 Method for generating image semantic description
CN111462282A (en) * 2020-04-02 2020-07-28 哈尔滨工程大学 Scene graph generation method
KR20220025524A (en) * 2020-08-24 2022-03-03 경기대학교 산학협력단 System for generating scene graph using deep neural network
CN113221613A (en) * 2020-12-14 2021-08-06 国网浙江宁海县供电有限公司 Power scene early warning method for generating scene graph auxiliary modeling context information
CN113627557A (en) * 2021-08-19 2021-11-09 电子科技大学 Scene graph generation method based on context graph attention mechanism
CN113836339A (en) * 2021-09-01 2021-12-24 淮阴工学院 Scene graph generation method based on global information and position embedding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LAN Hong; LIU Qinyi: "Scene graph to image generation model based on graph attention network", Journal of Image and Graphics (中国图象图形学报), no. 08, 12 August 2020 (2020-08-12) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546589A (en) * 2022-11-29 2022-12-30 浙江大学 Image generation method based on graph neural network
CN115546589B (en) * 2022-11-29 2023-04-07 浙江大学 Image generation method based on graph neural network
CN118015522A (en) * 2024-03-22 2024-05-10 广东工业大学 Time transition regularization method and system for video scene graph generation

Also Published As

Publication number Publication date
CN114677544B (en) 2024-08-16

Similar Documents

Publication Publication Date Title
CN110084296B (en) Graph representation learning framework based on specific semantics and multi-label classification method thereof
Liang et al. Symbolic graph reasoning meets convolutions
US20210264190A1 (en) Image questioning and answering method, apparatus, device and storage medium
Li et al. Deep supervision with intermediate concepts
CN114677544B (en) Scene graph generation method, system and equipment based on global context interaction
Quilodrán-Casas et al. Digital twins based on bidirectional LSTM and GAN for modelling the COVID-19 pandemic
CN109858390A (en) The Activity recognition method of human skeleton based on end-to-end space-time diagram learning neural network
CN111460928B (en) Human body action recognition system and method
CN111462324B (en) Online spatiotemporal semantic fusion method and system
CN110138595A (en) Time link prediction technique, device, equipment and the medium of dynamic weighting network
CN110991532B (en) Scene graph generation method based on relational visual attention mechanism
CN111382868A (en) Neural network structure search method and neural network structure search device
CN111191526A (en) Pedestrian attribute recognition network training method, system, medium and terminal
CN112016601B (en) Network model construction method based on knowledge graph enhanced small sample visual classification
CN111241306B (en) Path planning method based on knowledge graph and pointer network
CN112395438A (en) Hash code generation method and system for multi-label image
Bajpai et al. Transfer of deep reactive policies for mdp planning
CN111814658A (en) Scene semantic structure chart retrieval method based on semantics
US20200160501A1 (en) Coordinate estimation on n-spheres with spherical regression
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
Oh et al. Hcnaf: Hyper-conditioned neural autoregressive flow and its application for probabilistic occupancy map forecasting
Zhong et al. 3d implicit transporter for temporally consistent keypoint discovery
CN112100376B (en) Mutual enhancement conversion method for fine-grained emotion analysis
CN117520209A (en) Code review method, device, computer equipment and storage medium
WO2023143121A1 (en) Data processing method and related device thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant