CN112990202B

CN112990202B - Scene graph generation method and system based on sparse representation

Info

Publication number: CN112990202B
Application number: CN202110497553.2A
Authority: CN
Inventors: 雷军; 杨亚洲; 周浩; 张军; 李硕豪; 王风雷; 刘盼; 于淼淼
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-05-08
Filing date: 2021-05-08
Publication date: 2021-08-06
Anticipated expiration: 2041-05-08
Also published as: CN112990202A

Abstract

The invention discloses a scene graph generation method and a scene graph generation system based on sparse representation, wherein the method comprises the following steps: carrying out target detection on the original image through a fast regional convolutional neural network to obtain a target region set; identifying all edges of the target pair as a foreground edge and a background edge through a preset relation measurement network, and constructing a sparse graph; synchronously learning nodes and edges on the sparse graph through a feature fusion and updating strategy based on a graph attention neural network, and identifying target types and relationships; and generating a scene graph according to the identified target type and relationship. The method can effectively filter the false relation, further effectively generate the sparse graph, reduce the computation complexity of the dense graph and improve the graph message transmission efficiency; meanwhile, the method can accurately extract features from the sparse graph, and further accurately generate the scene graph.

Description

Scene graph generation method and system based on sparse representation

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a scene graph generation method and system based on sparse representation.

Background

The generation of the scene graph plays an important role in the deep understanding of the visual scene. The scene graph is a refined semantic extraction of the target and the target relation in the real image, and is constructed by predicting the predefined target instance, the target attribute and the target-to-target relation, the interaction between the targets in the scene is represented by the structured language of the common triples, and the interaction can be represented by the form of the < subject-predicate-object > triples. In the scene graph, nodes are represented as target entities including category labels and bounding boxes, directed edges are represented as relationship categories between the subject and the object, and various attributes (such as color, material, and the like) of the target can be described and represented in the scene graph.

At present, the scene graph inference technology draws much attention due to the extraction of rich semantic information contained in target interaction. The rich scene graph semantic understanding can not only provide context clues for basic recognition tasks, but also have a wide prospect in various advanced visual applications, for example, the rich scene graph semantic is a key for improving image retrieval and various natural language-based image tasks, and also provides valuable information for applications such as visual question answering, image description, image generation and the like. Although conventional scenegraph generation methods have been empirically successful in many applications, the problems of high complexity of the computation of the dense map and imprecise pruning of the sparse map still remain.

Based on this, how to infer complex potential relationships among all targets and accurately extract a scene graph from an image is still a problem to be solved urgently at present.

Disclosure of Invention

The invention aims to provide a scene graph generation method and system based on sparse representation so as to realize reasonable reasoning of complex potential relations and accurately generate a scene graph.

In view of the above, in a first aspect, the present invention provides a scene graph generation method based on sparse representation, including:

carrying out target detection on the original image through a fast regional convolutional neural network to obtain a target region set;

identifying all edges of the target pair as a foreground edge and a background edge through a preset relation measurement network, and constructing a sparse graph;

synchronously learning nodes and edges on the sparse graph through a feature fusion and updating strategy based on a graph attention neural network, and identifying a target type and a target relationship;

and generating a scene graph according to the target type and the relation obtained by identification.

Preferably, the identifying all edges of the target pair as foreground edges and background edges through a preset relationship metric network and constructing a sparse map includes:

acquiring category characteristics, space characteristics and appearance characteristics of each target;

classifying edges of the target pair according to the class characteristics, the space characteristics and the appearance characteristics of the two targets in the target pair, and acquiring a classification result;

selecting according to the classification result

Strip foreground edge and front

A strip background edge, the structure comprises

A node and

sparse graph of edges.

Preferably, the classifying the edges of the target pair according to the class features, the spatial features and the appearance features of two targets in the target pair and obtaining a classification result includes:

respectively connecting the spatial features and the appearance features of the two targets in the target pair in series to generate a combined spatial feature and a combined appearance feature;

embedding prior statistical probability of a target class to construct a joint class characteristic of the target pair;

connecting the joint appearance characteristic, the joint space characteristic and the joint category characteristic in series to generate a logs characteristic;

and inputting the logs characteristics into a sigmoid classifier to obtain the edge probability of the target pair.

Preferably, the joint spatial features are:

wherein the content of the first and second substances,

for the purpose of the joint spatial feature,

in order to be a multi-layer sensor,

in order to operate in series with each other,

are respectively the target

And

the spatial characteristic of (a);

the joint appearance characteristics are:

，

wherein the content of the first and second substances,

for the purpose of the said joint appearance feature,

are respectively the target

And

the appearance characteristic of (a);

the prior statistical probability of the target category is:

，

wherein the content of the first and second substances,

a priori statistical probability for the object class defined as presence in the original image

Presence in case of a Category object

Probability of a class object;

the joint category characteristics are:

，

wherein the content of the first and second substances,

for the purpose of the joint class feature(s),

for the number of all the categories it is,

is a target of

Belong to the category

The probability of (a) of (b) being,

is a target of

Belong to the category

A probability of (a),

Are respectively the target

And

the class characteristics of (1).

Preferably, the synchronously learning the nodes and edges on the sparse graph and identifying the target type and the target relationship by the feature fusion and update strategy based on the graph attention neural network comprises:

fusing the appearance characteristic, the spatial characteristic and the category characteristic of each node in the sparse graph, and embedding the prior statistical probability of the category relationship to generate a node characteristic and an edge characteristic;

acquiring attention weights of nodes and edges through a graph attention neural network;

and updating the node features and the edge features according to the attention weights of the nodes and the edges, and classifying the targets and the relations according to the new node features and the new edge features.

Preferably, the fusing the appearance features, the spatial features and the category features of each node in the sparse graph, and embedding the prior statistical probability of the category relationship to generate the node features and the edge features, includes:

aggregating the appearance characteristics, the spatial characteristics and the category characteristics of each node in the sparse graph, and compressing through a coder and a decoder to obtain fusion characteristics;

obtaining an initialization node characteristic and an initialization edge characteristic according to the fusion characteristic;

embedding the prior statistical probability of the class relationship into the initialized node features and the initialized edge features to construct node features and edge features;

and distributing the node characteristics and the edge characteristics to corresponding nodes and edges in the sparse graph.

Preferably, the prior statistical probability of the class relationship is:

，

wherein the content of the first and second substances,

defining as a given node for said prior statistical probability

And

context exists

The probability of (d);

the node is characterized in that:

wherein the content of the first and second substances,

in order to be a feature of the node,

the number of all the relationships is the same,

as to the number of all the nodes,

initializing a node characteristic for the node;

the edge characteristics are as follows:

，

wherein the content of the first and second substances,

in order to be a feature of the edge,

the edge feature is initialized.

Preferably, the attention weights of the nodes and edges include an attention weight between nodes, an attention weight between nodes and edges, and an attention weight of edges;

the attention weight between the nodes is:

wherein the content of the first and second substances,

is a node

And

the weight of attention in between the two,

is an AND node

There is a set of nodes connecting the edges,

is the weight parameter to be learned,

are respectively nodes

And

the characteristics of the nodes of (a) are,

is an AND node

Adjacent node

The characteristics of the nodes of (a) are,

is a node

And node

Connecting edge of

The edge characteristics of (a) are set,

is the weight of the network;

the attention weight between the node and the edge is:

wherein the content of the first and second substances,

is a node

And edge

Attention weight in between;

the attention weight of the edge is:

wherein the content of the first and second substances,

are all nodes

And node

Connecting edge of

Attention weight of (1).

Preferably, the new node is characterized by:

wherein the content of the first and second substances,

for the purpose of the new node characteristic,

in order to be a sigmoid function,

is an AND node

There is a set of nodes connecting the edges,

is a node

And its adjacent node

The weight of attention in between the two,

is a node

And its adjacent node

Connecting edge of

The edge characteristics of (a) are set,

is a node

And the connecting edge

Attention weight in between;

the new edge characteristics are:

wherein the content of the first and second substances,

is the new edge feature.

In a second aspect, the present invention provides a scene graph generation system based on sparse representation, including:

the target area extraction module is used for carrying out target detection on the original image through a fast area convolution neural network to obtain a target area set;

the sparse graph constructing module is used for identifying all edges of the target pair as a foreground edge and a background edge through a preset relation measurement network and constructing a sparse graph;

the graph message transmission module is used for synchronously learning nodes and edges on the sparse graph through a feature fusion and updating strategy based on a graph attention neural network and identifying a target type and a target relationship;

and the scene graph generating module is used for generating a scene graph according to the target type and the relation obtained by identification.

According to the scene graph generation method and system based on sparse representation, all edges of the target pairs in the original image are classified into the foreground and the background through RelMN, and a sparse graph is constructed, so that the false relation can be effectively filtered, the sparse graph can be effectively generated, the computation complexity of the dense graph is reduced, and the graph message transmission efficiency is improved; furthermore, through a feature fusion and updating strategy based on the graph attention neural network, nodes and edges on the sparse graph are synchronously learned to obtain target features and relation features, and the target features and the relation features are used for target and relation classification, so that the features can be accurately extracted from the sparse graph, and the scene graph can be accurately generated.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a sparse representation-based scene graph generation method according to an embodiment of the present invention;

FIG. 2 is a flowchart of step S20 of the sparse representation-based scene graph generating method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating two classifications of foreground and background in RelMN in an embodiment of the present invention;

FIG. 4 is a flowchart of step S30 of the sparse representation-based scene graph generating method according to an embodiment of the present invention;

fig. 5 is a schematic block diagram of a scene graph generation system based on sparse representation according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In an embodiment, as shown in fig. 1, a scene graph generation method based on sparse representation is provided, which includes the following steps:

and step S10, carrying out target detection on the original image through the fast regional convolutional neural network to obtain a target region set.

In this embodiment, an original image is obtained, Fast regional Convolutional Neural Network (Fast Convolutional Neural Network) is used to perform target detection on the image, and a plurality of (a) are automatically extracted from the original image

A) target area

Obtaining a target area set

. Wherein the target region set

Each target region in (1)

Including position information and appearance characteristics of the target

And class probability

。

Understandably, the speed and accuracy of target detection can be improved by using the circumscribed matrix obtained in step S10 to cover most critical targets.

And step S20, identifying all edges of the target pair as a foreground edge and a background edge through a preset relation measurement network, and constructing a sparse graph.

In this embodiment, RelMN (Relational metric Network) is configured to identify all edges as foreground edges and background edges, and automatically select all foreground edges and part of background edges to construct a sparse graph. The RelMN is composed of three parts, namely multi-feature extraction, two-classification of foreground and background and sparse image generation.

Preferably, as shown in fig. 2, step S20 includes the steps of:

step S201, multi-feature extraction: obtaining the category characteristics of each target

Spatial characteristics of

And appearance characteristics

。

In step S201, based on the target area set

Each target areaDomain

Location information and class probability of included objects

Conversion to spatial features

And category characteristics

And further according to each target area

Appearance characteristics of the contained object

Transformed spatial features

And category characteristics

And obtaining multi-dimensional characteristics for detecting whether potential relations exist between the target pairs. Preferably, for spatial features

By amplitude and splicing will

Dimensional position coordinate conversion

Spatial features of dimensions

Wherein, in the step (A),

in order to be able to target the number of,

position coordinates of the dimension as a target area

Top left corner of the bounding box of the characterization

Coordinates and lower right corner

The coordinates, MPL, are the multi-layer perceptrons,

is a series operation. Similarly, for class probability transition

Class characteristics of dimension

。

Step S202, classification of foreground and background: according to the target pair

Class characteristics of two targets

Spatial characteristics of

And appearance characteristics

To the target pair

And classifying the edges and obtaining a classification result. Wherein the target pair

From two different targets

And

and (4) forming.

As shown in fig. 3, the two classification diagrams of the foreground and the background in RelMN, step S202 includes the following steps:

the method comprises the following steps: respectively couple the targets

Spatial characteristics of two targets

And appearance characteristics

Connected in series to generate joint space characteristics

And combined appearance features

. Wherein spatial features are combined

The calculation formula of (2) is as follows:

（1）

associative appearance feature

Is calculated byComprises the following steps:

（2）

step two, embedding prior statistical probability of target category

Constructing object pairs

Joint class characteristics of

. Wherein the prior statistical probability of the object class

Defined as being present in the original image

Presence in case of a Category object

Probability of class object, prior statistical probability

Can be expressed as:

（3）

equation (3), a priori statistical probability

Is the number of all categories.

And learning statistical co-occurrence knowledge among target categories based on the prior statistical probability and the category characteristics. The joint class characteristics

Can be expressed as:

（4）

in the formula (4), the first and second groups,

is a target of

Belong to the category

The probability of (a) of (b) being,

is a target of

Belong to the category

The probability of (c).

Step three, combining the appearance characteristics

Joining spatial features

And joint category features

Performing series connection to generate logs characteristics

. Wherein, logs characteristics

Is calculated byComprises the following steps:

（5）

step four, identifying the Logits characteristics

Inputting sigmoid classifier to obtain target pair

Edge probability of

. Where the edge probability

The calculation formula of (2) is as follows:

（6）

that is, in step S202, the position coordinates and the class probability of each object are determined

Respectively converted into spatial features

And category characteristics

Then, the spatial characteristics of two different targets are first determined

And appearance characteristics

Concatenate to generate joint spatial features

And combined appearance features

And constructing joint category features

Introducing prior statistical probability of classes, and then combining spatial features

Combined appearance features

And joint category features

Concatenate to generate logs features

And finally calculating output probability by using sigmoid classifier

And further, all the edges are divided into a foreground type and a background type.

Step S203, sparse map generation: selection according to classification result

Strip foreground edge and front

A strip background edge, the structure comprises

A node and

sparse graph of edges.

In step S203, according to the output of the sigmoid classifier, first all are selected

（

Not hyper-parametric and determined by sigmoid classifier) pairs of edges predicted targets as foreground. Secondly, automatically selecting front with high foreground probability

A strip background edge is constructed to contain

The sparse graph with the foreground edge and the background edge enhance the robustness of relation classification and reduce the risk background of eliminating real relations. Finally obtaining a product containing

A node and

sparse graph of edges.

In this embodiment, all the edges are divided into two types, namely foreground and background, by the RelMN to obtain the potential relationship between the target pairs, which is more reasonable and the constructed sparse graph is more reasonable than the potential relationship generated by the distance between the target pairs. In addition, the message transmission on the sparse graph can obviously reduce the calculation complexity, so that the message transmission is more accurate and effective.

And step S30, synchronously learning nodes and edges of the sparse graph through feature fusion and updating strategies based on the graph attention neural network, and identifying object classes and relationships.

In step S30, the feature fusion and update strategy based on the graph attention neural network includes three parts of node feature and edge feature generation, weight learning of node feature and edge feature, and object and relationship classification.

Preferably, as shown in fig. 4, step S30 includes the steps of:

step S301, generation of node features and edge features: appearance characteristics of each node in sparse graph

Spatial characteristics of

And category characteristics

Fusion is carried out, and prior statistical probability of class relation is embedded

Feature of construction node

And edge characteristics

。

In step S301, appearance features of each node in the sparse graph are first checked

Spatial characteristics of

And category characteristics

Performing polymerization and compressing by a coder-decoder to obtain a fusion characteristic

Then, the initialized node characteristics are obtained according to the fusion characteristics

And initializing edge features

. Wherein the node characteristics

The initialization process of (1) is as follows: direct through fusion feature

Initializing node characteristics, i.e.

(ii) a Edge feature

The initialization process of (1) is as follows: sequentially connecting the fusion features of the subject node and the object node, and performing dimension compression through the full connection layer, i.e.

Wherein the full connection layer

A Leaky Relu layer.

Further, the prior statistical probability of the class relation

Embedded to initialization node features

And initializing edge features

Feature of construction node

And edge characteristics

Finally, the node characteristics

And edge characteristics

And distributing to corresponding nodes and edges in the sparse graph. Wherein a priori statistical probabilities of class relationships

Is defined as a given node

And

context exists

Probability of (3), a priori statistical probability of class relationship

Can be expressed as:

（7）

in the formula (7), the first and second groups,

is a main body node

To the object node

Corresponding relation of (A) and

，

is the number of all relationships.

Node characteristics

The calculation formula of (2) is as follows:

（8）

edge feature

The calculation formula of (2) is as follows:

（9）

understandably, as part of the sparse graph, the prior statistical probability according to class relationships

And class probability

The inherent weights of the nodes and edges are constructed. The inherent weights of the node and the edge respectively reflect the node in the node set

And edges between other nodes and in edge sets

And other edge.

Step S302, learning the weight of the node characteristics and the edge characteristics: attention weights of nodes and edges are obtained through a graph attention neural network. Wherein, the attention weight of the node and the edge includes the attention weight between the node and the node, the attention weight between the node and the edge, and the attention weight of the edge.

For node message aggregation, the calculation formula of attention weight between nodes is as follows:

（10）

in the formula (10), the first and second groups,

is a node

And

the weight of attention in between the two,

is a node

There is a set of nodes connecting the edges,

is the weight parameter to be learned,

、

are respectively nodes

And

the characteristics of the nodes of (a) are,

is an AND node

Adjacent node

The characteristics of the nodes of (a) are,

is a node

And node

Connecting edge of

The edge characteristics of (a) are set,

is the weight of the network. Note that, in the formula (10), the node

And node

Are all main body nodes, and are all provided with a main body,

is the object node.

The formula for calculating the attention weight between a node and an edge is:

（11）

in the formula (11), the reaction mixture,

is a node

And edge

Attention weight in between. In the formula (11), the node

Is the object node.

For the aggregation of edge messages, the formula for calculating the attention weight of an edge is as follows:

（12）

in the formula (12), the first and second groups,

、

are all nodes

And node

Connecting edge of

Attention weight of (1). It should be noted that, in the formula (12), the node

Is a main body node, a node

Is the object node.

Understandably, as another aspect of the sparse Graph, Attention weights of nodes and edges are obtained through a GAT (Graph Attention neural Network), in combination with the category-based relationship in step S301Prior statistical probability of

And class probability

To extract new target features and relationship features.

Step S303, object and relationship classification: updating node characteristics based on attention weights of nodes and edges

And edge characteristics

And according to the new node characteristics

And edge characteristics

The objects and relationships are classified.

Specifically, the node characteristics are updated according to the hidden node characteristics, the adjacent node characteristics and the connection edge characteristics

And according to the new node characteristics

Classifying the target category; updating edge characteristics according to hidden edge characteristics, subject node characteristics and object node characteristics simultaneously

And according to the new edge characteristics

The relationships are classified. Wherein the new node characteristics

The calculation formula of (2) is as follows:

（13）

in the formula (13), the first and second groups,

is sigmoid function

Is an AND node

There is a set of nodes connecting the edges,

is a node

And its adjacent node

The weight of attention in between the two,

is a node

And its adjacent node

Connecting edge of

The edge characteristics of (a) are set,

is a node

And the connecting edge

Attention weight in between. Note that, in the formula (13), the node

Is the object node.

Novel edge features

The calculation formula of (2) is as follows:

（14）

note that, in the formula (13), the node

Is a main body node, a node

Is the object node.

In this embodiment, the statistical co-occurrence knowledge and the context clue in the data set are learned in a centralized manner based on the feature fusion and update strategy of the graph attention neural network, so as to obtain output features (including new node features and edge features), and further classify the targets and the relationships thereof according to the output features, so that the messages on the sparse graph can be effectively transmitted and integrated.

And step S40, generating a scene graph according to the identified object type and the relationship.

In step S40, the generated scene graph includes the positions of the objects, the categories of the objects, and the relationships between the objects, and the scene graph can be structurally represented as a set of triples, i.e. the triples

。

Wherein the content of the first and second substances,

is a target area set, each target area in the target area set

Containing coordinate position information of the object described by the bounding box.

Is a target set, each target area in the target area set

All correspond to a category label

And is and

，

a set of labels for all target categories.

Is a binary relation set, each relation in the binary relation set

Is a triple containing a subject node

Object node

And a subject node

And object node

The relationship between

And is and

is the complete set of relationships; wherein the main body node

Object node

From candidate regions

And candidate region

Corresponding category label

Is determined, i.e. is

。

As can be seen from the above, in the scene graph generation method based on sparse representation according to the embodiment, after the target region set is extracted from the original image through Fast R-CNN, all edges of the target pair in the original image are classified into two types, namely foreground and background, through RelMN, and a sparse graph is constructed, so that the false relationship can be effectively filtered, the sparse graph is further effectively generated, the computation complexity of the dense graph is reduced, and the graph message transmission efficiency is improved; and then synchronously learning nodes and edges on the sparse graph to obtain target characteristics and relationship characteristics through a characteristic fusion and updating strategy based on the graph attention neural network, and performing target and relationship classification by using the target characteristics and the relationship characteristics, so that the characteristics can be accurately extracted from the sparse graph, and the scene graph is accurately generated.

In an embodiment, a scene graph generation system based on sparse representation is provided, and the scene graph generation system based on sparse representation corresponds to the scene graph generation method based on sparse representation in the above embodiments one to one. As shown in fig. 5, the sparse representation-based scene graph generation system includes a target region extraction module 110, a sparse graph construction module 120, a graph message transmission module 130, and a scene graph generation module 140, and the detailed description of each functional model is as follows:

and the target area extraction module 110 is configured to perform target detection on the original image through a fast area convolutional neural network to obtain a target area set.

And a sparse graph constructing module 120, configured to identify all edges of the target pair as a foreground edge and a background edge through a preset relationship metric network, and construct a sparse graph.

And the graph message transmission module 130 is used for synchronously learning the nodes and edges of the sparse graph and identifying the target type and the target relation through a feature fusion and updating strategy based on the graph attention neural network.

And the scene graph generating module 140 is configured to generate a scene graph according to the identified object type and relationship.

Further, the sparse graph constructing module 120 includes a multi-feature extracting unit, a binary classifying unit, and a sparse graph generating unit, and the detailed description of each functional unit is as follows:

and the multi-feature extraction unit is used for acquiring the category feature, the spatial feature and the appearance feature of each target.

And the two classification units are used for classifying the edges of the target pairs according to the class characteristics, the space characteristics and the appearance characteristics of the two targets in the target pairs and acquiring the classification result.

A sparse graph generating unit for selecting the sparse graph according to the classification result

Strip foreground edge and front

A strip background edge, the structure comprises

A node and

sparse graph of edges.

Further, the classification unit includes a first joint subunit, a first knowledge embedding subunit, a second joint subunit and a classification subunit, and the detailed description of each functional subunit is as follows:

and the first joint subunit is used for respectively connecting the spatial features and the appearance features of the two targets in the target pair in series to generate joint spatial features and joint appearance features.

And the first knowledge embedding subunit is used for embedding the prior statistical probability of the target class to construct the joint class characteristics of the target pair.

And the second joint subunit is used for connecting the joint appearance characteristic, the joint space characteristic and the joint category characteristic in series to generate the logs characteristic.

And the classification subunit is used for inputting the logs characteristics into the sigmoid classifier to obtain the edge probability of the target pair.

Further, the graph message passing module 130 includes a node and edge feature generating unit, a weight learning unit, and a feature updating unit, and the detailed description of each functional unit is as follows:

and the node and edge feature generation unit is used for fusing the appearance feature, the spatial feature and the class feature of each node in the sparse graph, embedding the prior statistical probability of the class relationship, and generating the node feature and the edge feature.

And the weight learning unit is used for acquiring the attention weights of the nodes and the edges through the graph attention neural network.

And the feature updating unit is used for updating the node features and the edge features according to the attention weights of the nodes and the edges, and classifying the targets and the relations according to the new node features and the new edge features.

Further, the node and edge feature generation unit includes a feature fusion subunit, an initialization subunit, a second knowledge embedding subunit and a feature allocation subunit, and the detailed description of each functional subunit is as follows:

and the characteristic fusion subunit is used for aggregating the appearance characteristics, the spatial characteristics and the category characteristics of each node in the sparse graph and compressing the aggregated appearance characteristics, spatial characteristics and category characteristics through a coder and a decoder to obtain fusion characteristics.

And the initialization subunit is used for obtaining the initialized node characteristic and the initialized edge characteristic according to the fusion characteristic.

And the second knowledge embedding subunit is used for embedding the prior statistical probability of the class relationship into the initialized node feature and the initialized edge feature and constructing the node feature and the edge feature.

And the feature distribution subunit is used for distributing the node features and the edge features to corresponding nodes and edges in the sparse graph.

For specific limitations of the sparse representation-based scenegraph generation system, reference may be made to the above limitations of the sparse representation-based scenegraph generation method, and details thereof are not repeated here.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also technical features in the above embodiments or in different embodiments may be combined and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A scene graph generation method based on sparse representation is characterized by comprising the following steps:

identifying all edges of the target pair as a foreground edge and a background edge through a preset relation measurement network, and constructing a sparse graph; the method comprises the following steps:

selecting according to the classification result

Strip foreground edge and front

A strip background edge, the structure comprises

A node and

a sparse graph of edges;

synchronously learning nodes and edges on the sparse graph through a feature fusion and updating strategy based on a graph attention neural network, and identifying a target type and a target relationship; the method comprises the following steps:

updating the node features and the edge features according to the attention weights of the nodes and the edges, and classifying the targets and the relations according to the new node features and the new edge features;

2. The sparse representation-based scene graph generation method of claim 1, wherein the classifying edges of the target pair according to the class features, the spatial features and the appearance features of two targets in the target pair and obtaining the classification result comprises:

3. The sparse representation-based scene graph generation method of claim 2, wherein the joint spatial features are:

，

wherein the content of the first and second substances,

for the purpose of the joint spatial feature,

in order to be a multi-layer sensor,

in order to operate in series with each other,

、

are respectively the target

And

the spatial characteristic of (a);

the joint appearance characteristics are:

，

wherein the content of the first and second substances,

for the purpose of the said joint appearance feature,

、

are respectively the target

And

the appearance characteristic of (a);

the prior statistical probability of the target category is:

，

wherein the content of the first and second substances,

Presence in case of a Category object

Probability of a class object;

the joint category characteristics are:

，

wherein the content of the first and second substances,

for the purpose of the joint class feature(s),

for the number of all the categories it is,

is a target of

Belong to the category

The probability of (a) of (b) being,

is a target of

Belong to the category

The probability of (a) of (b) being,

、

are respectively the target

And

the class characteristics of (1).

4. The sparse representation-based scene graph generating method of claim 1, wherein the fusing appearance features, spatial features and class features of each node in the sparse graph and embedding prior statistical probabilities of class relations to generate node features and edge features comprises:

5. The sparse representation-based scenegraph generation method of claim 4, wherein the prior statistical probability of the class relationship is:

，

wherein the content of the first and second substances,

defining as a given node for said prior statistical probability

And

context exists

The probability of (d);

the node is characterized in that:

，

wherein the content of the first and second substances,

the node characteristics are the number of all relationships, the number of all nodes and the initialized node characteristics;

the edge characteristics are as follows:

，

wherein the content of the first and second substances,

in order to be a feature of the edge,

the edge feature is initialized.

6. The sparse representation based scene graph generation method of claim 1, wherein the attention weights of the nodes and edges comprise attention weights between nodes, attention weights between nodes and edges, and attention weights of edges;

the attention weight between the nodes is:

，

wherein, for the attention weights between and at the nodes,

a set of nodes having connecting edges with the nodes, which are weight parameters to be learned, node features of the nodes and the nodes respectively, node features of nodes adjacent to the nodes, edge features of the connecting edges of the nodes and the nodes, and weights of the network;

the attention weight between the node and the edge is:

，

wherein the content of the first and second substances,

is a node

And edge

Attention weight in between;

the attention weight of the edge is:

wherein the content of the first and second substances,

、

are all nodes

And node

Connecting edge of

Attention weight of (1).

7. The sparse representation-based scenegraph generation method of claim 6, wherein the new node features are:

，

wherein the content of the first and second substances,

for the purpose of the new node characteristic,

in order to be a sigmoid function,

is an AND node

There is a set of nodes connecting the edges,

is a node

And its adjacent node

The weight of attention in between the two,

is a node

And its adjacent node

Connecting edge of

The edge characteristics of (a) are set,

is a node

And the connecting edge

Attention weight in between;

the new edge characteristics are:

，

wherein the content of the first and second substances,

is the new edge feature.

8. A sparse representation-based scenegraph generation system, comprising:

the sparse graph constructing module is used for identifying all edges of the target pair as a foreground edge and a background edge through a preset relation measurement network and constructing a sparse graph; the sparse graph construction module comprises:

the multi-feature extraction unit is used for acquiring the category feature, the spatial feature and the appearance feature of each target;

the two classification units are used for classifying the edges of the target pairs according to the class characteristics, the space characteristics and the appearance characteristics of the two targets in the target pairs and acquiring classification results;

a sparse map generation unit for selecting the sparse map according to the classification result

Strip foreground edge and front

A strip background edge, the structure comprises

A node and

a sparse graph of edges;

the graph message transmission module is used for synchronously learning nodes and edges on the sparse graph through a feature fusion and updating strategy based on a graph attention neural network and identifying a target type and a target relationship; the graph messaging module includes:

the node and edge feature generation unit is used for fusing the appearance feature, the spatial feature and the category feature of each node in the sparse graph, embedding the prior statistical probability of the category relationship, and generating the node feature and the edge feature;

the weight learning unit is used for acquiring attention weights of the nodes and the edges through the graph attention neural network;

the feature updating unit is used for updating the node features and the edge features according to the attention weights of the nodes and the edges, and classifying the targets and the relations according to the new node features and the new edge features;