CN111340912A - Scene graph generation method and device and storage medium - Google Patents

Scene graph generation method and device, and storage medium

Info

Publication number: CN111340912A
Authority: CN (China)
Prior art keywords: matrix, feature, relation, image, predicate
Legal status: Granted; Active
Application number: CN202010104227.6A
Other languages: Chinese (zh)
Other versions: CN111340912B (en)
Inventors: 孙书洋 (Sun Shuyang), 周仪 (Zhou Yi), 李怡康 (Li Yikang), 欧阳万里 (Ouyang Wanli)
Current assignee: Beijing Sensetime Technology Development Co Ltd
Original assignee: Beijing Sensetime Technology Development Co Ltd
Application filed by Beijing Sensetime Technology Development Co Ltd; priority to CN202010104227.6A; granted and published as CN111340912B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/60 - Editing figures and text; Combining figures or text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks

Abstract

The disclosure provides a scene graph generation method and device, and a storage medium. The method includes: obtaining object features of an image through object detection; obtaining a first relationship feature and a second relationship feature between a first object and a second object in the image through relationship detection, wherein the second relationship feature is a subclass of the first relationship feature; and generating the scene graph based on the objects corresponding to the object features, a first predicate corresponding to the first relationship feature, and a second predicate corresponding to the second relationship feature. At least one of the objects having the relationship features belongs to the objects corresponding to the object features.

Description

Scene graph generation method and device and storage medium
Technical Field
The present disclosure relates to the field of image understanding, and in particular, to a method and an apparatus for generating a scene graph, and a storage medium.
Background
Scene graph generation is a fundamental image understanding task: given an image, it produces a graph whose nodes are the detected objects and whose edges are the relationships between those objects.
In linguistic terms, a phrase can be composed of a subject, a predicate, and an object. In the generated scene graph, a node is an object and corresponds to the category of the subject or the object, while an edge is the relationship between a pair of objects and corresponds to the category of the predicate. Scene graph generation therefore typically performs object detection first and then infers the relationship between each pair of objects, thereby generating the final scene graph.
In such approaches, both object detection and relationship detection are cast as classification problems, where each object is assigned a single class label and the classes are assumed to be mutually independent and orthogonal.
However, unlike object classes, the boundaries between predicate classes in relationship detection are fuzzy: different predicates can overlap semantically. For example, "sit next to" and "stand next to" are two different predicate labels, yet they express a similar spatial relationship ("next to"). Treating such semantically overlapping predicates as separate categories therefore makes it impossible to properly use the information about their internal connection.
Disclosure of Invention
The disclosure provides a scene graph generation method and device and a storage medium.
According to a first aspect of the embodiments of the present disclosure, a scene graph generation method is provided, the method including: obtaining object features of an image through object detection; obtaining a first relationship feature and a second relationship feature between a first object and a second object in the image through relationship detection, wherein the second relationship feature is a subclass of the first relationship feature; and generating the scene graph based on the objects corresponding to the object features, a first predicate corresponding to the first relationship feature, and a second predicate corresponding to the second relationship feature, where at least one of the objects having the relationship features belongs to the objects corresponding to the object features.
In some optional embodiments, the obtaining, by the object detection, an object feature of the image includes: acquiring basic object characteristics of the image; converting the basic object features into object feature vectors; and taking the object feature vector as the object feature of the image.
In some optional embodiments, the obtaining, by the relationship detection, a first relationship feature and a second relationship feature between a first object and a second object in the image includes: acquiring the basic relation characteristics of the image; determining a first relation characteristic matrix and a second relation characteristic matrix according to the basic relation characteristics; obtaining the first relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the first relation characteristic matrix; and obtaining the second relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the second relation characteristic matrix.
In some optional embodiments, the method further includes: optimizing the second relationship feature matrix, or optimizing both the first relationship feature matrix and the second relationship feature matrix, according to the first relationship feature matrix and the second relationship feature matrix. In a case where the second relationship feature matrix is optimized, obtaining the second relationship feature between the first object and the second object in the image according to the object features of the image and the second relationship feature matrix includes: obtaining the second relationship feature between the first object and the second object in the image according to the object feature vector and a third relationship feature matrix obtained by optimizing the second relationship feature matrix. In a case where the first relationship feature matrix is optimized, obtaining the first relationship feature between the first object and the second object in the image according to the object features of the image and the first relationship feature matrix includes: obtaining the first relationship feature between the first object and the second object in the image according to the object features of the image and a fourth relationship feature matrix obtained by optimizing the first relationship feature matrix.
In some optional embodiments, optimizing the second relationship feature matrix according to the first relationship feature matrix and the second relationship feature matrix includes: performing convolution, normalization, and activation-function processing in sequence on the first relationship feature matrix to obtain a first matrix, and performing convolution, normalization, and activation-function processing in sequence on the second relationship feature matrix to obtain a second matrix; determining, according to the first matrix and the second matrix, a first correlation matrix in the feature-channel dimension of the relationship features and a second correlation matrix in the sample dimension of the relationship features; inputting the first matrix and the first correlation matrix into a first residual neural network to obtain a first output value of the first residual neural network; inputting the first output value and the second correlation matrix into a second residual neural network to obtain a second output value of the second residual neural network; and adding the second output value to the second relationship feature matrix to obtain a third relationship feature matrix, i.e., the optimized second relationship feature matrix.
In some optional embodiments, after obtaining the first relationship feature and the second relationship feature between the first object and the second object in the image, the method further includes: determining, in a target predicate data set, the first predicate corresponding to the first relationship feature and the second predicate corresponding to the second relationship feature.
In some optional embodiments, the method further includes: acquiring a second predicate data set corresponding to the second relationship features; obtaining, according to the second predicate data set, implicit vectors corresponding to the predicates in the second predicate data set; clustering the predicates in the second predicate data set according to the distance between every two implicit vectors, and determining a first predicate data set corresponding to the first relationship features and the correspondence between the predicates in the second predicate data set and the predicates in the first predicate data set; and using the first predicate data set, the second predicate data set, and the correspondence as the target predicate data set.
In some optional embodiments, obtaining, according to the second predicate data set, the implicit vectors corresponding to the predicates in the second predicate data set includes: inputting the second predicate data set into a language-encoding neural network to obtain the implicit vectors, output by the language-encoding neural network, corresponding to the predicates in the second predicate data set; the language-encoding neural network is trained by taking the predicates in a sample predicate data set as input values and taking sample implicit vectors corresponding to the pre-labeled predicates in the sample predicate data set as supervision.
In some optional embodiments, the method further includes: predicting, through a normalization function and according to the object features, the first relationship feature, and the second relationship feature, the object corresponding to the object features, the first predicate corresponding to the first relationship feature, and the second predicate corresponding to the second relationship feature.
In some optional embodiments, the image includes an image reflecting the road conditions ahead of an automatic driving device during driving, and the scene graph is used to describe the relationship features between the traffic objects contained in the image, where the traffic objects include at least one of a traffic signal sign, a traffic signal light, a pedestrian, a non-motor vehicle, and a motor vehicle.
According to a second aspect of the embodiments of the present disclosure, a neural network is provided, which generates a scene graph by using the method of any one of the first aspect.
According to a third aspect of the embodiments of the present disclosure, there is provided a scene graph generating apparatus, including: an object feature obtaining module, configured to obtain object features of an image through object detection; a relationship feature obtaining module, configured to obtain a first relationship feature and a second relationship feature between a first object and a second object in the image through relationship detection, wherein the second relationship feature is a subclass of the first relationship feature; and a scene graph generating module, configured to generate the scene graph based on the object corresponding to the object features, a first predicate corresponding to the first relationship feature, and a second predicate corresponding to the second relationship feature, where at least one of the objects having the relationship features belongs to the objects corresponding to the object features.
In some optional embodiments, the object feature obtaining module comprises: the first obtaining submodule is used for obtaining the basic object characteristics of the image; a conversion submodule for converting the basic object features into object feature vectors; a first determining sub-module for taking the object feature vector as the object feature of the image.
In some optional embodiments, the relationship feature obtaining module comprises: the second obtaining submodule is used for obtaining the basic relation characteristics of the image; the second determining submodule is used for determining a first relation characteristic matrix and a second relation characteristic matrix according to the basic relation characteristics; a third determining submodule, configured to obtain the first relationship feature between the first object and the second object in the image according to the object feature of the image and the first relationship feature matrix; and the fourth determining submodule is used for obtaining the second relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the second relation characteristic matrix.
In some optional embodiments, the apparatus further includes an optimization module, configured to optimize the second relationship feature matrix, or optimize both the first relationship feature matrix and the second relationship feature matrix, according to the first relationship feature matrix and the second relationship feature matrix. In a case where the second relationship feature matrix is optimized, the fourth determining submodule is configured to obtain the second relationship feature between the first object and the second object in the image according to the object feature vector and a third relationship feature matrix obtained by optimizing the second relationship feature matrix. In a case where the first relationship feature matrix is optimized, the third determining submodule is configured to obtain the first relationship feature between the first object and the second object in the image according to the object features of the image and a fourth relationship feature matrix obtained by optimizing the first relationship feature matrix.
In some optional embodiments, the optimization module includes: a third obtaining submodule, configured to perform convolution, normalization, and activation-function processing in sequence on the first relationship feature matrix to obtain a first matrix, and to perform convolution, normalization, and activation-function processing in sequence on the second relationship feature matrix to obtain a second matrix; a fifth determining submodule, configured to determine, according to the first matrix and the second matrix, a first correlation matrix in the feature-channel dimension of the relationship features and a second correlation matrix in the sample dimension of the relationship features; a fourth obtaining submodule, configured to input the first matrix and the first correlation matrix into a first residual neural network to obtain a first output value of the first residual neural network; a fifth obtaining submodule, configured to input the first output value and the second correlation matrix into a second residual neural network to obtain a second output value of the second residual neural network; and a sixth determining submodule, configured to add the second output value to the second relationship feature matrix to obtain a third relationship feature matrix, i.e., the optimized second relationship feature matrix.
In some optional embodiments, the apparatus further comprises: a first determining module configured to determine, in a target predicate data set, the first predicate corresponding to the first relational feature and the second predicate corresponding to the second relational feature.
In some optional embodiments, the apparatus further comprises: an obtaining module, configured to obtain a second predicate data set corresponding to the plurality of second relationship characteristics; a second determining module, configured to obtain, according to the second predicate data set, an implicit vector corresponding to a predicate in the second predicate data set; a third determining module, configured to cluster the predicates in the second predicate data set according to a distance between each two implicit vectors, and determine a first predicate data set corresponding to multiple first relationship features and a correspondence between the predicates in the first predicate data set and the predicates in the second predicate data set; a fourth determining module, configured to use the first predicate data set, the second predicate data set, and the correspondence as the target predicate data set.
In some optional embodiments, the second determining module includes: a sixth obtaining submodule, configured to input the second predicate data set into a language-encoding neural network to obtain the implicit vectors, output by the language-encoding neural network, corresponding to the predicates in the second predicate data set; the language-encoding neural network is trained by taking the predicates in a sample predicate data set corresponding to the plurality of second relationship features as input values and taking sample implicit vectors corresponding to the pre-labeled predicates in the sample predicate data set as supervision.
In some optional embodiments, the apparatus further includes a prediction module, configured to predict, through a normalization function and according to the object features, the first relationship feature, and the second relationship feature, the object corresponding to the object features, the first predicate corresponding to the first relationship feature, and the second predicate corresponding to the second relationship feature.
In some optional embodiments, the image includes an image reflecting the road conditions ahead of an automatic driving device during driving, and the scene graph is used to describe the relationship features between the traffic objects contained in the image, where the traffic objects include at least one of a traffic signal sign, a traffic signal light, a pedestrian, a non-motor vehicle, and a motor vehicle.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the scene graph generating method according to any one of the first aspect.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a scene graph generating apparatus, including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to call executable instructions stored in the memory to implement the scene graph generation method of any one of the first aspect.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
In the embodiments of the present disclosure, object detection may be performed on an image to obtain object features of the image, and relationship detection may be performed on the image to obtain a first relationship feature and a second relationship feature between a first object and a second object in the image, where the second relationship feature is a subclass of the first relationship feature. The scene graph is then generated based on the object corresponding to the object features, a first predicate corresponding to the first relationship feature, and a second predicate corresponding to the second relationship feature, where at least one of the objects having the relationship features belongs to the objects corresponding to the object features. Because the generated scene graph represents both the first relationship feature and the second relationship feature between the first object and the second object, it contains more layers of information about the image, namely relationship features obtained at different granularities, which makes the understanding of the image more accurate and the generated scene graph more reasonable.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart of a method for generating a scene graph according to an exemplary embodiment of the present disclosure;
FIG. 2A is a schematic diagram of an image according to an exemplary embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a scene graph according to an exemplary embodiment of the present disclosure;
FIG. 2C is a schematic diagram of another scene graph according to an exemplary embodiment of the present disclosure;
FIG. 3 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of object bounding boxes according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 8 is a diagram illustrating a scenario in which a second relational feature matrix is optimized according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 10 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 11 is a flow chart of another method of generating a scene graph shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a network architecture of a neural network shown in accordance with an exemplary embodiment of the present disclosure;
FIG. 13 is a block diagram of a scene graph generation apparatus shown in accordance with an exemplary embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of a scene graph generating apparatus according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if," as used herein, may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
Because the boundaries between predicate categories in relationship detection are fuzzy, predicate information between objects in an image cannot be determined accurately; as a result, the generated scene graph cannot accurately reflect the relationship features between the objects, which reduces the accuracy of image understanding. For better image understanding, the present disclosure provides a scene graph generation method and apparatus, and a storage medium.
As shown in fig. 1, fig. 1 is a flowchart of a scene graph generation method according to an exemplary embodiment, which includes the following steps:
in step 101, object features of an image are obtained by object detection.
In the embodiment of the present disclosure, object detection acquires the object features of each object contained in the image. Here, an object is anything contained in the image, such as a person, an animal, or any other thing, and the object features may include the size, shape, category, and the like of the object in the image.
In step 102, a first relational feature and a second relational feature between a first object and a second object in the image are obtained through relational detection.
In the embodiment of the present disclosure, relationship detection obtains relationship features of different dimensions between two objects in the image. The two objects may be a first object and a second object in the image, and the relationship features of different dimensions may include a first relationship feature and a second relationship feature, where the second relationship feature is a subclass of the first relationship feature.
That is, the first relationship feature is effectively a parent class of the second relationship feature and describes a coarser relationship between the first object and the second object, while the second relationship feature, as a subclass of the first relationship feature, describes a more detailed relationship between them.
In step 103, the scene graph is generated based on the object corresponding to the object feature, the first predicate corresponding to the first relational feature, and the second predicate corresponding to the second relational feature.
In the embodiment of the present disclosure, at least one of the objects having the relationship characteristic belongs to the object corresponding to the object characteristic. For example, the object a and the object b have a first relational feature and a second relational feature, and when the object detection is performed on the image, the object corresponding to the obtained object feature includes the object a and/or the object b.
The first predicate is a predicate describing the first relationship feature, and the second predicate is a predicate describing the second relationship feature. For example, the first predicate may be "on …," and the corresponding second predicates may be "sit on …," "stand on …," and so on.
The finally generated scene graph can represent not only the first relationship feature between the first object and the second object but also the more refined second relationship feature between them.
For example, as shown in fig. 2A, the right half of the image contains three objects: a girl, a mother, and an apple. The first predicate between the girl and the apple includes "present," and the first predicate between the girl and the mother includes "close"; the scene graph describing the first relationship features between the objects is shown in fig. 2B. The second predicate between the girl and the apple includes "hold," and the second predicate between the girl and the mother includes "sit on"; these constitute the scene graph describing the second relationship features between the objects, shown in fig. 2C.
In the above embodiment, the generated scene graph represents both the first relationship feature and the second relationship feature between the first object and the second object in the image, where the second relationship feature is a subclass of the first relationship feature. The scene graph thus contains more layers of information about the image, namely relationship features obtained at different granularities, so that the image is understood more accurately and the generated scene graph is more reasonable.
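To make the two-level structure concrete, the following is a minimal sketch, not part of the patent, of how a scene graph whose edges carry both a coarse first predicate and a fine second predicate could be represented; all class names are illustrative, and the example edges reuse the predicates from the figure description above:

```python
# Illustrative sketch (not from the patent): a scene graph whose edges carry
# both a coarse (first) and a fine (second) predicate.
from dataclasses import dataclass, field

@dataclass
class Edge:
    subject: str
    obj: str
    first_predicate: str   # coarse-grained relationship, e.g. "close"
    second_predicate: str  # fine-grained refinement, e.g. "sit on"

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# Hypothetical edges for the image of fig. 2A
graph = SceneGraph(
    nodes=["girl", "mother", "apple"],
    edges=[
        Edge("girl", "apple", first_predicate="present", second_predicate="hold"),
        Edge("girl", "mother", first_predicate="close", second_predicate="sit on"),
    ],
)

for e in graph.edges:
    print(f"{e.subject} -[{e.first_predicate} / {e.second_predicate}]-> {e.obj}")
```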
In some alternative embodiments, such as shown in fig. 3, step 101 may include:
in step 101-1, the underlying object features of the image are acquired.
In the embodiment of the present disclosure, the object bounding box corresponding to each object contained in the image may first be determined according to the image features of the image. An object bounding box includes the main feature portion of the object: for a person it may include the face, and for other objects it may include the whole or at least part of the object. The object bounding box may also include portions at its edges that belong to other objects. Basic object features are then extracted from the image based on the object bounding boxes using the ROI Align (region-of-interest alignment) method.
The image features may include color features, texture features, shape features, and the like. A color feature is a global feature describing the surface color attributes of the objects in an image, and a texture feature is a global feature describing their surface texture attributes. Shape features have two types of representation: contour features, which concern the outer boundary of an object, and region features, which relate to the shape of an image region. In the embodiment of the present disclosure, the image features can be extracted by a pre-trained neural network, which may be, but is not limited to, VGGNet (Visual Geometry Group Network), GoogLeNet, and the like.
According to the image features of the image, the object bounding boxes corresponding to the objects contained in the image are extracted by a pre-trained RPN (Region Proposal Network). Object bounding boxes in an image are shown, for example, in fig. 4.
After the object bounding boxes are determined, ROI Align may be employed to obtain the basic object features. ROI Align is an improvement on the ROI Pooling (region-of-interest pooling) algorithm, which pools the region of a feature map corresponding to a bounding box into a fixed-size feature map for subsequent classification and bounding-box regression. Since the position of a bounding box is usually produced by model regression, its coordinates are generally floating-point numbers, whereas the pooled feature map requires a fixed size, so ROI Pooling involves two quantization steps. ROI Align cancels these quantization operations and instead uses bilinear interpolation to obtain image values at pixel positions with floating-point coordinates, turning the whole feature aggregation process into a continuous operation.
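As an illustration of the ROI Align behavior described above, the sketch below uses torchvision's off-the-shelf roi_align operator (the patent does not name a specific library, and the tensor shapes are illustrative): floating-point box coordinates are pooled to a fixed-size feature map via bilinear interpolation, without coordinate quantization.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)           # (batch, channels, H, W) from a backbone
boxes = torch.tensor([[0, 4.7, 10.2, 30.9, 45.1]])  # (batch_idx, x1, y1, x2, y2), float coords
pooled = roi_align(feature_map, boxes, output_size=(7, 7),
                   spatial_scale=1.0, sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([1, 256, 7, 7]): fixed size regardless of box size
```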
The basic object features may include some basic feature description of the object, such as the length, width, height, shape, etc. of the object in the image.
In step 101-2, the base object features are converted into object feature vectors.
In the embodiment of the present disclosure, the basic object features may be converted into object feature vectors by a linear transformation function.
In step 101-3, the object feature vector is taken as the object feature of the image.
In the embodiment of the present disclosure, the converted object feature vector is used as the object feature of the image.
In the above embodiment, the basic object features of the image may be obtained first, and then the basic object features may be converted into the object feature vectors, so that the object features of the image may be obtained quickly, and the usability is high.
In some alternative embodiments, such as shown in FIG. 5, step 102 may include:
in step 102-1, the underlying relational features of the image are obtained.
In the embodiment of the present disclosure, the image features of the image may first be obtained, and the object bounding box corresponding to each object contained in the image may be obtained based on the image features. A relationship bounding box between the first object and the second object is then obtained by merging their object bounding boxes; the relationship bounding box may be used to describe the positional relationship between the first object and the second object. The basic relationship features of the image are determined based on the relationship bounding box using the ROI Align method.
The method for extracting the image features and determining the object bounding box is the same as the method for extracting the image features and determining the object bounding box in step 101-1, and is not described herein again.
After the object bounding boxes are determined, merging of the object bounding boxes is required, and the relationship bounding boxes may be used to describe whether the object bounding boxes overlap, the relative positions of the object bounding boxes, whether the object bounding boxes include the same portion belonging to the same object, and so on.
For example, as shown in fig. 4, the object bounding boxes of the girl and the apple (the first object and the second object) are merged into a relationship bounding box between the two objects. From it one can tell that the two object bounding boxes do not overlap, that the girl's bounding box is located to the right of the apple's bounding box, and that both boxes include the girl's hand.
Further, after the relationship bounding box is determined, the ROI Align method may be employed to determine the basic relationship features of the image based on the relationship bounding box. The basic relationship features may be used to describe the possible relationships between the first object and the second object. For example, based on the relationship bounding box between the girl and the apple, the basic relationship can be characterized as the apple being in the girl's hand.
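A minimal sketch of the bounding-box merging step follows; the helper name and the coordinates are hypothetical, as the patent only specifies that the relationship bounding box is the combination of the two object bounding boxes:

```python
# Sketch (assumed helper, not from the patent): the relationship bounding box
# as the union of two object bounding boxes in (x1, y1, x2, y2) form.
def union_box(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))

girl_box = (120.0, 40.0, 300.0, 420.0)    # hypothetical coordinates
apple_box = (60.0, 200.0, 150.0, 280.0)
rel_box = union_box(girl_box, apple_box)  # covers both objects and the space between them
print(rel_box)  # (60.0, 40.0, 300.0, 420.0)
```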
In step 102-2, a first relational feature matrix and a second relational feature matrix are determined according to the basic relational features.
In the embodiment of the present disclosure, the first relational feature matrix is used to describe a first relational feature between the first object and the second object, that is, the first relational feature matrix is a feature matrix used to describe a coarse-grained relation between objects. The second relation feature matrix is used for describing a second relation feature between the first object and the second object, that is, the second relation feature matrix is a feature matrix used for describing a fine-grained relation between the objects.
In the embodiment of the present disclosure, convolutional layer 1 may perform convolution on the basic relationship features to obtain the first relationship feature matrix, and convolutional layer 2 may perform convolution on the basic relationship features to obtain the second relationship feature matrix, where convolutional layer 1 contains fewer convolution units than convolutional layer 2.
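A sketch of the two convolutional branches follows, under the stated assumption that convolutional layer 1 is shallower than convolutional layer 2; the kernel sizes, channel counts, and input shape are not given in the patent and are illustrative:

```python
import torch
import torch.nn as nn

base_rel_features = torch.randn(8, 256, 49)  # (N pairs, C channels, L locations), assumed shape

conv_layer_1 = nn.Conv1d(256, 256, kernel_size=1)  # fewer units -> coarse branch
conv_layer_2 = nn.Sequential(                      # more units  -> fine branch
    nn.Conv1d(256, 256, kernel_size=1), nn.ReLU(),
    nn.Conv1d(256, 256, kernel_size=1),
)

A = conv_layer_1(base_rel_features)  # first relationship feature matrix (coarse)
B = conv_layer_2(base_rel_features)  # second relationship feature matrix (fine)
print(A.shape, B.shape)
```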
In step 102-3, the first relation feature between the first object and the second object in the image is obtained according to the object feature of the image and the first relation feature matrix.
In the embodiment of the present disclosure, the object features are object feature vectors, and messages are passed between the object feature vectors and the first relationship feature matrix through an MPS (Message Passing) module. The MPS module then passes the object feature vectors and the first relationship feature matrix to an RI (Relationship Inference) module, which predicts the first relationship feature between the first object and the second object. Any method that can pass messages within the fault-tolerance range of the communication process can be applied to the MPS module.
The RI module may adopt any method capable of predicting relationship features from the object feature vectors and the first relationship feature matrix. For example, it may adopt a pre-trained neural network trained on sample images with first-relationship-feature labels, where training is complete when the output matches the content of the labels or falls within the fault-tolerance range. In this case, the object feature vectors and the first relationship feature matrix can be input directly into the trained neural network, which predicts the first relationship feature between the first object and the second object.
In step 102-4, the second relation feature between the first object and the second object in the image is obtained according to the object feature of the image and the second relation feature matrix.
In the embodiment of the present disclosure, messages may be passed between the object feature vectors and the second relationship feature matrix through another MPS module, and the second relationship feature between the first object and the second object may be predicted through another RI module, implemented in the same manner as the prediction of the first relationship feature and not repeated here.
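Since the internals of the MPS and RI modules are left open by the patent, the following is only one assumed minimal design: the MPS step fuses the subject and object feature vectors into the relation feature, and the RI step classifies the fused feature into predicate scores.

```python
import torch
import torch.nn as nn

class SimpleMPS(nn.Module):
    """Fuses subject/object vectors into the relation feature (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, subj_vec, obj_vec, rel_vec):
        return torch.relu(self.fuse(torch.cat([subj_vec, obj_vec, rel_vec], dim=-1)))

class SimpleRI(nn.Module):
    """Predicts predicate logits from the fused relation feature (assumed design)."""
    def __init__(self, dim, num_predicates):
        super().__init__()
        self.classifier = nn.Linear(dim, num_predicates)

    def forward(self, rel_vec):
        return self.classifier(rel_vec)

dim = 128
mps, ri = SimpleMPS(dim), SimpleRI(dim, num_predicates=50)
subj, obj, rel = torch.randn(1, dim), torch.randn(1, dim), torch.randn(1, dim)
logits = ri(mps(subj, obj, rel))  # scores over one predicate set
```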
In the above embodiment, the object bounding boxes may be determined according to the extracted image features of the image, the relationship bounding boxes between the objects may be determined by merging the object bounding boxes, and the first relationship feature and the second relationship feature of the first object and the second object in the image may be determined according to the relationship bounding boxes, so that the usability is high.
In some alternative embodiments, such as shown in fig. 6, the method may further include:
in step 104, the second relation feature matrix is optimized or the first relation feature matrix and the second relation feature matrix are optimized according to the first relation feature matrix and the second relation feature matrix.
In the embodiment of the present disclosure, in order to make the subsequently generated scene graph more reasonable and represent the first and second relationship features between the first object and the second object more accurately, the second relationship feature matrix may be optimized through an HGM (Hierarchical Guided Module) according to the first relationship feature matrix and the second relationship feature matrix. Alternatively, the first and second relationship feature matrices can both be optimized through the HGM. How the optimization is performed is described further below.
In the case of optimizing the second relational feature matrix, the step 102-4 may include:
and obtaining the second relation characteristic between the first object and the second object in the image according to a third relation matrix obtained after the object characteristic vector and the second relation characteristic matrix are optimized.
In this embodiment of the present disclosure, after the second relationship feature matrix is optimized, a third relationship feature matrix is obtained, and at this time, the second relationship feature between the first object and the second object in the image may be determined according to the object feature vector and the third relationship feature matrix.
Likewise, in the case of optimizing the first relational feature matrix, the step 102-3 may include:
and obtaining the first relation characteristic between the first object and the second object in the image according to the object characteristic of the image and a fourth relation characteristic matrix obtained after the first relation characteristic matrix is optimized.
In this embodiment of the present disclosure, after the first relationship feature matrix is optimized, a fourth relationship feature matrix is obtained, and at this time, the first relationship feature between the first object and the second object in the image may be determined according to the object feature vector and the fourth relationship feature matrix.
In the above embodiment, the second relation feature matrix may be further optimized, or the first relation feature matrix and the second relation feature matrix may be further optimized, so that the finally obtained scene graph is more accurate and reasonable.
In some alternative embodiments, such as shown in fig. 7, the process of optimizing the second relational feature matrix in step 104 may include:
in step 104-1, the first relation feature matrix is subjected to convolution processing, normalization processing and activation function processing in sequence to obtain a first matrix, and the second relation feature matrix is subjected to convolution processing, normalization processing and activation function processing in sequence to obtain a second matrix.
In the embodiment of the present disclosure, the HGM may have the structure shown in fig. 8, where each convolutional layer includes convolution, normalization, and activation-function operations. The first relationship feature matrix A (of dimension N × C × L) is passed through a convolutional layer to obtain the first matrix A_t (of dimension N × C1 × L); similarly, the second relationship feature matrix B (of dimension N × C × L) is passed through a convolutional layer to obtain the second matrix B_t (of dimension N × C2 × L).
In step 104-2, a first correlation matrix is determined in a feature channel dimension of the relational feature and a second correlation matrix is determined in a sample dimension of the relational feature based on the first matrix and the second matrix.
In the embodiment of the present disclosure, the first matrix A_t and the second matrix B_t are multiplied and then convolved to obtain the first correlation matrix C (of dimension N × C1) of the relationship features in the feature-channel dimension. The second matrix B_t is convolved, multiplied with the first matrix A_t, and downsampled to obtain the second correlation matrix S (of dimension N × C1) of the relationship features in the sample dimension.
In step 104-3, the first matrix and the first correlation matrix are input into a first residual neural network, and a first output value of the first residual neural network is obtained.
In the embodiment of the present disclosure, the first matrix A_t and the first correlation matrix C are input into the first residual neural network, which computes the product of A_t and C, adds it to A_t, and applies convolution to obtain the first output value M of the first residual neural network.
In step 104-4, the first output value and the second correlation matrix are input to a second residual neural network to obtain a second output value of the second residual neural network.
In the embodiment of the present disclosure, the first output value M and the second correlation matrix S are input into the second residual neural network, which computes the product of M and S, adds it to M, and applies convolution to obtain the second output value M' of the second residual neural network.
In step 104-5, the second output value and the second relation feature matrix are added to obtain a third relation feature matrix after the second relation feature matrix is optimized.
In the embodiment of the present disclosure, adding the second output value M' to the second relationship feature matrix B yields the optimized third relationship feature matrix B_out.
In the embodiment of the present disclosure, the process of optimizing the first relation feature matrix according to the first relation feature matrix and the second relation feature matrix is the same as the process of optimizing the second relation feature matrix, and is not described herein again.
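The following PyTorch sketch strings steps 104-1 through 104-5 together. The patent specifies the data flow but not the layer shapes, so the way the N × C1 correlation matrices are formed (the einsum and mean reductions) and the simplification C1 = C2 = C are assumptions made here for illustration only:

```python
import torch
import torch.nn as nn

class HGMSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        def conv_bn_act():  # "convolution, normalization and activation" (step 104-1)
            return nn.Sequential(nn.Conv1d(c, c, 1), nn.BatchNorm1d(c), nn.ReLU())
        self.proj_a, self.proj_b = conv_bn_act(), conv_bn_act()
        self.conv_c = nn.Conv1d(c, c, 1)  # forms the channel-dimension correlation
        self.conv_s = nn.Conv1d(c, c, 1)  # forms the sample-dimension correlation
        self.res1 = nn.Conv1d(c, c, 1)    # convolution inside the first residual network
        self.res2 = nn.Conv1d(c, c, 1)    # convolution inside the second residual network

    def forward(self, A, B):              # A, B: (N, C, L)
        At, Bt = self.proj_a(A), self.proj_b(B)
        # Step 104-2: correlation matrices (the exact reductions are assumptions)
        C_corr = self.conv_c(torch.einsum("ncl,ndl->ncd", At, Bt)).mean(-1)  # (N, C)
        S_corr = (At * self.conv_s(Bt)).mean(-1)                             # (N, C)
        # Step 104-3: product of A_t and C, plus A_t, then convolution -> M
        M = self.res1(At * C_corr.unsqueeze(-1) + At)
        # Step 104-4: product of M and S, plus M, then convolution -> M'
        M2 = self.res2(M * S_corr.unsqueeze(-1) + M)
        # Step 104-5: add M' to B -> optimized third relationship feature matrix
        return B + M2

A, B = torch.randn(4, 64, 49), torch.randn(4, 64, 49)
print(HGMSketch(64)(A, B).shape)  # torch.Size([4, 64, 49])
```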
In the above embodiment, the corresponding correlation matrices are determined in the feature-channel dimension and the sample dimension of the relationship features from the first and second relationship feature matrices, and the second relationship feature matrix is then optimized through two sequentially connected residual neural networks. This is simple to implement, highly usable, and improves the accuracy of the subsequently generated scene graph.
In some alternative embodiments, such as shown in fig. 9, after performing step 102, the method may further include:
in step 105, in a target predicate data set, the first predicate corresponding to the first relational feature and the second predicate corresponding to the second relational feature are determined.
In the embodiment of the present disclosure, the predetermined target predicate data set includes the predicates corresponding to the various second relationship features, the predicates corresponding to the various first relationship features, and the correspondence between the first relationship features and the second relationship features.
After the first relationship feature and the second relationship feature between the first object and the second object in the image are determined, the first predicate corresponding to the first relationship feature and the second predicate corresponding to the second relationship feature may be determined from the target predicate data set, so as to subsequently generate the scene graph.
In some alternative embodiments, such as shown in fig. 10, the method may further include:
in step 106, a second predicate data set corresponding to the plurality of second relational features is obtained.
In the present disclosure, the second predicate data set may be a commonly-used predicate data set that is set in advance for performing scene graph generation, in which predicates corresponding to a plurality of second relationship features are included, for example, the second predicate data set may be a Visual Genome data set.
In step 107, an implicit vector corresponding to the predicate in the second predicate data set is obtained from the second predicate data set.
In the embodiment of the present disclosure, the second predicate data set may be input into a pre-trained language-encoding neural network, and the implicit vectors corresponding to the predicates in the second predicate data set are obtained from the output of the network.
The language-encoding neural network can be trained by taking the sample predicate sets corresponding to the various second relationship features as input values and minimizing a loss function against the sample implicit-vector labels corresponding to the predicates in the sample predicate sets, adjusting the network parameters until the trained language-encoding neural network is obtained.
In step 108, the predicates in the second predicate data set are clustered according to the distance between each pair of implicit vectors, and a first predicate data set corresponding to a plurality of first relationship features and a correspondence between the predicate in the second predicate data set and the predicate in the first predicate data set are determined.
In the embodiment of the present disclosure, a clustering algorithm may be used to cluster the predicates in the second predicate data set. The clustering algorithm may be a hierarchical method (or another algorithm), and the first predicate data set may be determined by clustering according to the distance between every two implicit vectors. The first predicate data set includes the predicates corresponding to the various first relationship features. At the same time, the correspondence between the predicates in the first predicate data set and the predicates in the second predicate data set is determined. For example, the predicates "sit next to," "stand next to," and "lie next to" in the second predicate data set all correspond to the predicate "next to" in the first predicate data set.
In step 109, the first predicate data set, the second predicate data set, and the correspondence are used as the target predicate data set.
In the embodiment of the present disclosure, the target predicate data set includes the predicates in the first predicate data set corresponding to the various first relationship features, the predicates in the second predicate data set corresponding to the various second relationship features, and the correspondence between the predicates in the two sets.
In the above embodiment, starting from a preset second predicate data set corresponding to the various second relationship features, the target predicate data set, comprising the first predicate data set, the second predicate data set, and the correspondence between their predicates, can be determined quickly; this is easy to implement and highly usable.
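A compact sketch of steps 106 to 109 follows; the language-encoding neural network is replaced by a stand-in embedding table, and the cluster count and coarse predicate names are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Fine-grained (second) predicates; the random vectors stand in for the
# language-encoding network's implicit vectors (step 107).
fine_predicates = ["sit next to", "stand next to", "lie next to", "hold", "carry"]
implicit_vectors = np.random.rand(len(fine_predicates), 16)

# Step 108: hierarchical (agglomerative) clustering on the pairwise distances
# between implicit vectors; n_clusters=2 is illustrative.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(implicit_vectors)

# Step 109: each cluster becomes one coarse (first) predicate; the fine-to-coarse
# mapping is the "correspondence" stored in the target predicate data set.
coarse_names = {0: "coarse_0", 1: "coarse_1"}  # clusters would be named, e.g. "next to"
correspondence = {p: coarse_names[l] for p, l in zip(fine_predicates, labels)}
target_predicate_data_set = {
    "first_predicate_set": sorted(set(correspondence.values())),
    "second_predicate_set": fine_predicates,
    "correspondence": correspondence,
}
print(target_predicate_data_set)
```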
In some alternative embodiments, other means of creation may also be employed to determine the target predicate data set.
For example, the predicates in the second predicate data set can be comprehensively analyzed to construct the first predicate data set corresponding to the various first relationship features and the correspondence between the predicates in the first and second predicate data sets. A target predicate data set with a hierarchical structure can thus likewise be established, enabling more accurate understanding of language and scenes with high usability.
In some alternative embodiments, such as shown in fig. 11, the method may further include:
in step 110, an object corresponding to the object feature, a first predicate corresponding to the first relational feature, and a second predicate corresponding to the second relational feature are determined by a normalization function according to the object feature, the first relational feature, and the second relational feature.
In the embodiment of the present disclosure, the object feature of the image, the first relational feature between the first object and the second object, and the second relational feature may be normalized by a normalization function, for example, a softmax function, and the object included in the image, the first predicate corresponding to the first relational feature, and the second predicate corresponding to the second relational feature may be determined in the preset object set and the target predicate data set.
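A minimal sketch of this normalization step follows; the vocabulary sizes are illustrative, and the logits are assumed to come from the object-detection branch and the two RI modules:

```python
import torch
import torch.nn.functional as F

object_logits = torch.randn(1, 150)  # scores over the preset object categories
first_logits = torch.randn(1, 20)    # scores over the coarse (first) predicate set
second_logits = torch.randn(1, 50)   # scores over the fine (second) predicate set

# Softmax normalizes each head's scores; argmax picks the predicted category.
obj_id = F.softmax(object_logits, dim=-1).argmax(-1)
first_pred_id = F.softmax(first_logits, dim=-1).argmax(-1)
second_pred_id = F.softmax(second_logits, dim=-1).argmax(-1)
```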
In some alternative embodiments, the above method may also be used in a variety of scenarios requiring image understanding, and an automatic driving scenario is taken as an example for the following description.
In an automatic driving scene, the image may include an image reflecting the road conditions ahead of the automatic driving device during driving, and the scene graph is used to describe the relationship features between the traffic objects contained in the image, where the traffic objects include at least one of a traffic signal sign, a traffic signal light, a pedestrian, a non-motor vehicle, and a motor vehicle.
For example, when the automatic driving device acquires an image of a road condition ahead during driving, object characteristics of the image can be obtained by object detection of the image, and first and second relationship characteristics between a first object and a second object in the image can be obtained by relationship detection. The object corresponding to the object feature may include a traffic object, such as at least one of a traffic signal sign, a traffic light, a pedestrian, a non-motor vehicle, and a motor vehicle. The first relationship feature may be used to describe a coarse-grained relationship feature between the first traffic object and the second traffic object in the image, and the second relationship feature may be used to describe a fine-grained relationship feature between the first traffic object and the second traffic object in the image.
Further, the scene graph may be generated based on the traffic objects in the image, the first predicate corresponding to the first relationship feature, and the second predicate corresponding to the second relationship feature. For example, the scene graph corresponding to the first relationship features describes that pedestrian A (the first object) is beside pedestrian B (the second object) and that motor vehicle C is beside the traffic light, while the scene graph corresponding to the second relationship features describes that pedestrian A stands beside pedestrian B and that motor vehicle C is stopped under the traffic light.
In the above embodiment, image understanding may be performed on the images of the road conditions ahead acquired by the automatic driving device during driving, generating a scene graph that describes the relationship features between the traffic objects contained in the image. Such a scene graph reflects the multi-dimensional relationship features between traffic objects more accurately and reasonably, better serving the goal of automatic driving.
In an embodiment of the present disclosure, a neural network is further provided, and the neural network may adopt any one of the above methods to generate a scene graph.
The network architecture of the neural network is shown, for example, in fig. 12. An image on which image understanding is to be performed to generate a scene graph is input to the neural network. After extracting the image features of the image, the neural network extracts object bounding boxes of the objects in the image using an RPN, and obtains relationship bounding boxes (not shown in fig. 12) by merging pairs of object bounding boxes. Basic object features and basic relationship features of the image are then obtained from the object bounding boxes and the relationship bounding boxes by the ROI Align method.
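The merging of two object bounding boxes into a relationship bounding box is typically the smallest box enclosing both; the sketch below assumes the common (x1, y1, x2, y2) box format, which the disclosure does not specify:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format

def union_box(box_a: Box, box_b: Box) -> Box:
    # smallest axis-aligned box enclosing both object boxes,
    # used as the relationship bounding box for ROI Align
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return (min(ax1, bx1), min(ay1, by1), max(ax2, bx2), max(ay2, by2))
```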
For the basic object features, the neural network converts them into object feature vectors through a linear transformation function; these object feature vectors serve as the object features corresponding to the image.
For the basic relationship features, the neural network obtains a first relationship feature matrix through convolutional layer 1 and a second relationship feature matrix through convolutional layer 2, where convolutional layer 1 contains fewer convolution units than convolutional layer 2.
The neural network passes messages between the object feature vectors and the first relationship feature matrix through MPS module 1, and between the object feature vectors and the second relationship feature matrix through MPS module 2. RI module 1 then determines the first relationship feature from the object feature vectors and the first relationship feature matrix. The second relationship feature matrix is optimized by the HGM to obtain a third relationship feature matrix, and RI module 2 determines the second relationship feature from the object feature vectors and the third relationship feature matrix.
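As a rough sketch of this two-branch flow, the skeleton below strings the pieces together. The internal structure of the MPS (message passing) and RI (relation inference) modules is not detailed in the text, so they appear as opaque callables, and the channel counts are illustrative assumptions:

```python
import torch.nn as nn

FEATURE_DIM = 512
N_COARSE, N_FINE = 20, 50  # convolution units per branch (illustrative)

conv1 = nn.Conv2d(FEATURE_DIM, N_COARSE, kernel_size=1)  # coarse branch
conv2 = nn.Conv2d(FEATURE_DIM, N_FINE, kernel_size=1)    # fine branch

def relation_branches(base_rel_feats, obj_vecs, mps1, mps2, ri1, ri2, hgm):
    # first / second relationship feature matrices from the two conv layers
    first_matrix = conv1(base_rel_feats)
    second_matrix = conv2(base_rel_feats)
    # message passing between object vectors and each relation matrix
    obj_vecs, first_matrix = mps1(obj_vecs, first_matrix)
    obj_vecs, second_matrix = mps2(obj_vecs, second_matrix)
    first_rel = ri1(obj_vecs, first_matrix)          # coarse relation feature
    third_matrix = hgm(first_matrix, second_matrix)  # optimized fine matrix
    second_rel = ri2(obj_vecs, third_matrix)         # fine relation feature
    return first_rel, second_rel
```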
Further, based on the object feature vectors, the first relationship feature, and the second relationship feature, the object corresponding to the object features, the first predicate corresponding to the first relationship feature, and the second predicate corresponding to the second relationship feature are determined through a normalization function, and the scene graph is generated accordingly.
It should be noted that, when the neural network is trained, different predicate labels can be used to supervise the branch corresponding to the first relational feature and the branch corresponding to the second relational feature respectively. This improves the accuracy of the image analysis performed by the neural network, so that more reasonable and accurate scene graphs can be generated subsequently.
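A minimal sketch of this two-branch supervision, assuming a standard cross-entropy loss over each predicate label set and a simple weighted sum (the weighting is an assumption):

```python
import torch.nn.functional as F

def two_branch_loss(coarse_logits, fine_logits, coarse_labels, fine_labels,
                    alpha=1.0):
    # each branch is supervised with its own predicate labels:
    # coarse_labels index the first predicate set, fine_labels the second
    loss_coarse = F.cross_entropy(coarse_logits, coarse_labels)
    loss_fine = F.cross_entropy(fine_logits, fine_labels)
    return loss_fine + alpha * loss_coarse  # alpha is an assumed weighting
```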
Corresponding to the foregoing method embodiments, the present disclosure also provides embodiments of an apparatus.
As shown in fig. 13, fig. 13 is a block diagram of a scene graph generating apparatus according to an exemplary embodiment, the apparatus including: an object feature obtaining module 210, configured to obtain an object feature of the image through object detection; a relationship feature obtaining module 220, configured to obtain, through relationship detection, a first relationship feature and a second relationship feature between a first object and a second object in the image; wherein the subclass of the first relational feature includes the second relational feature; a scene graph generating module 230, configured to generate the scene graph based on an object corresponding to the object feature, a first predicate corresponding to the first relational feature, and a second predicate corresponding to the second relational feature; at least one of the objects with the relationship characteristics belongs to the object corresponding to the object characteristics.
In some optional embodiments, the object feature obtaining module comprises: the first obtaining submodule is used for obtaining the basic object characteristics of the image; a conversion submodule for converting the basic object features into object feature vectors; a first determining sub-module for taking the object feature vector as the object feature of the image.
In some optional embodiments, the relationship feature obtaining module comprises: the second obtaining submodule is used for obtaining the basic relation characteristics of the image; the second determining submodule is used for determining a first relation characteristic matrix and a second relation characteristic matrix according to the basic relation characteristics; a third determining submodule, configured to obtain the first relationship feature between the first object and the second object in the image according to the object feature of the image and the first relationship feature matrix; and the fourth determining submodule is used for obtaining the second relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the second relation characteristic matrix.
In some optional embodiments, the apparatus further comprises: an optimization module, configured to optimize the second relation characteristic matrix according to the first relation characteristic matrix and the second relation characteristic matrix, or to optimize both the first relation characteristic matrix and the second relation characteristic matrix. In a case where the second relation characteristic matrix is optimized, the fourth determining submodule is configured to obtain the second relation characteristic between the first object and the second object in the image according to the object characteristic vector and a third relation characteristic matrix obtained after the second relation characteristic matrix is optimized. In a case where the first relation characteristic matrix is optimized, the third determining submodule is configured to obtain the first relation characteristic between the first object and the second object in the image according to the object characteristic of the image and a fourth relation characteristic matrix obtained after the first relation characteristic matrix is optimized.
In some optional embodiments, the optimization module comprises: a third obtaining submodule, configured to sequentially perform convolution processing, normalization processing, and activation function processing on the first relation characteristic matrix to obtain a first matrix, and to sequentially perform convolution processing, normalization processing, and activation function processing on the second relation characteristic matrix to obtain a second matrix; a fifth determining submodule, configured to determine, according to the first matrix and the second matrix, a first correlation matrix in the feature channel dimension of the relation characteristics and a second correlation matrix in the sample dimension of the relation characteristics; a fourth obtaining submodule, configured to input the first matrix and the first correlation matrix into a first residual neural network to obtain a first output value of the first residual neural network; a fifth obtaining submodule, configured to input the first output value and the second correlation matrix into a second residual neural network to obtain a second output value of the second residual neural network; and a sixth determining submodule, configured to add the second output value and the second relation characteristic matrix to obtain a third relation characteristic matrix after the second relation characteristic matrix is optimized.
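By way of illustration, the following is a rough sketch of this optimization under stated assumptions: the relation features are treated as (N, C) matrices (N relation samples, C feature channels), the two correlation matrices are formed by inner products along the channel dimension and the sample dimension respectively, and each residual neural network is reduced to a single residual connection. None of these sizes or block structures are fixed by the disclosure:

```python
import torch
import torch.nn as nn

class OptimizationSketch(nn.Module):
    # Rough sketch of optimizing the second relation characteristic matrix.
    # Relation features are treated as (N, C) tensors; the block structure
    # and all sizes are illustrative assumptions.
    def __init__(self, channels: int):
        super().__init__()
        # convolution -> normalization -> activation for each input matrix
        self.branch1 = nn.Sequential(
            nn.Conv1d(channels, channels, 1), nn.BatchNorm1d(channels), nn.ReLU())
        self.branch2 = nn.Sequential(
            nn.Conv1d(channels, channels, 1), nn.BatchNorm1d(channels), nn.ReLU())
        self.res1 = nn.Linear(channels, channels)  # stand-in residual network 1
        self.res2 = nn.Linear(channels, channels)  # stand-in residual network 2

    def forward(self, first_matrix: torch.Tensor, second_matrix: torch.Tensor):
        # first_matrix, second_matrix: (N, C)
        m1 = self.branch1(first_matrix.t().unsqueeze(0)).squeeze(0).t()   # first matrix
        m2 = self.branch2(second_matrix.t().unsqueeze(0)).squeeze(0).t()  # second matrix
        corr_channel = (m1.t() @ m2).softmax(-1)  # (C, C), channel dimension
        corr_sample = (m1 @ m2.t()).softmax(-1)   # (N, N), sample dimension
        out1 = m1 @ corr_channel
        out1 = out1 + self.res1(out1)  # first residual network -> first output value
        out2 = corr_sample @ out1
        out2 = out2 + self.res2(out2)  # second residual network -> second output value
        return second_matrix + out2    # third relation characteristic matrix
```

Here `OptimizationSketch(channels)(first_matrix, second_matrix)` returns the optimized third matrix; the disclosure's HGM may of course differ in depth and wiring.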
In some optional embodiments, the apparatus further comprises: a first determining module configured to determine, in a target predicate data set, the first predicate corresponding to the first relational feature and the second predicate corresponding to the second relational feature.
In some optional embodiments, the apparatus further comprises: an obtaining module, configured to obtain a second predicate data set corresponding to the plurality of second relationship characteristics; a second determining module, configured to obtain, according to the second predicate data set, an implicit vector corresponding to a predicate in the second predicate data set; a third determining module, configured to cluster the predicates in the second predicate data set according to a distance between each two implicit vectors, and determine a first predicate data set corresponding to multiple first relationship features and a correspondence between the predicates in the first predicate data set and the predicates in the second predicate data set; a fourth determining module, configured to use the first predicate data set, the second predicate data set, and the correspondence as the target predicate data set.
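A minimal sketch of this construction, assuming an off-the-shelf embedding function stands in for the language-coded network's implicit vectors and agglomerative clustering stands in for the unspecified clustering method; the cluster count is likewise an assumption:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_target_predicate_sets(fine_predicates, embed, n_coarse=20):
    # Cluster fine-grained (second) predicates into coarse-grained (first)
    # ones. `embed` maps a predicate string to an implicit vector; any word
    # or sentence embedding can stand in for the language-coded network.
    vectors = np.stack([embed(p) for p in fine_predicates])
    labels = AgglomerativeClustering(n_clusters=n_coarse).fit_predict(vectors)
    # name each coarse predicate by the fine predicate nearest its centroid
    coarse_names = {}
    for c in range(n_coarse):
        members = [i for i, l in enumerate(labels) if l == c]
        centroid = vectors[members].mean(axis=0)
        rep = min(members, key=lambda i: np.linalg.norm(vectors[i] - centroid))
        coarse_names[c] = fine_predicates[rep]
    # correspondence between fine predicates and their coarse predicate
    fine_to_coarse = {p: coarse_names[labels[i]]
                      for i, p in enumerate(fine_predicates)}
    first_set = sorted(set(fine_to_coarse.values()))
    return first_set, list(fine_predicates), fine_to_coarse
```

The returned first predicate data set, second predicate data set, and correspondence together play the role of the target predicate data set.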
In some optional embodiments, the second determining module comprises: a sixth obtaining submodule, configured to input the second predicate data set into a language-coded neural network, so as to obtain an implicit vector output by the language-coded neural network and corresponding to the predicate in the second predicate data set; the language coding neural network is obtained by training by taking predicates in the sample predicate data set corresponding to the plurality of second relational features as input values and taking sample implicit vectors corresponding to predicates labeled in advance in the sample predicate data set as supervision.
In some optional embodiments, the apparatus further comprises: a fifth determining module, configured to predict, through a normalization function and according to the object feature, the first relational feature, and the second relational feature, the object corresponding to the object feature, the first predicate corresponding to the first relational feature, and the second predicate corresponding to the second relational feature.
In some optional embodiments, the image comprises an image used for reflecting the road condition ahead of the automatic driving device during driving, and the scene graph is used for describing the relationship characteristics between the traffic objects contained in the image; wherein the traffic object comprises at least one of a traffic signal sign, a traffic signal light, a pedestrian, a non-motor vehicle and a motor vehicle.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the disclosed solution. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present disclosure further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is configured to execute any one of the scene graph generation methods described above.
In some optional embodiments, the present disclosure provides a computer program product, comprising computer-readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the scene graph generation method provided in any of the above embodiments.
In some optional embodiments, the present disclosure further provides another computer program product for storing computer-readable instructions, where the instructions, when executed, cause a computer to perform the operations of the scene graph generation method provided in any of the above embodiments.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
The embodiment of the present disclosure further provides a scene graph generating apparatus, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to call the executable instructions stored in the memory to implement any of the above-mentioned scene graph generation methods.
Fig. 14 is a schematic hardware structure diagram of a scene graph generating apparatus according to an embodiment of the present disclosure. The scene graph generating device 310 includes a processor 311, and may further include an input device 312, an output device 313, and a memory 314. The input device 312, the output device 313, the memory 314, and the processor 311 are connected to each other via a bus.
The memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for storing instructions and data.
The input means are for inputting data and/or signals and the output means are for outputting data and/or signals. The output means and the input means may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more central processing units (CPUs); in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU. The memory is used to store the program code and data of the scene graph generating apparatus. The processor is used to call the program code and data in the memory and to execute the steps in the above method embodiments. For details, reference may be made to the description of the method embodiments, which is not repeated here.
It will be appreciated that fig. 14 shows only a simplified design of the scene graph generating apparatus. In practical applications, the scene graph generating apparatus may also include other necessary elements, including but not limited to any number of input/output devices, processors, controllers, memories, etc., and all scene graph generating apparatuses that can implement the embodiments of the present disclosure fall within the scope of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The above description is only exemplary of the present disclosure and should not be taken as limiting the disclosure, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (20)

1. A scene graph generation method is characterized by comprising the following steps:
obtaining object characteristics of the image through object detection;
obtaining a first relation characteristic and a second relation characteristic between a first object and a second object in the image through relation detection; wherein the subclass of the first relational feature includes the second relational feature;
generating the scene graph based on the object corresponding to the object characteristic, the first predicate corresponding to the first relational characteristic and the second predicate corresponding to the second relational characteristic; at least one of the objects with the relationship characteristics belongs to the object corresponding to the object characteristics.
2. The method according to claim 1, wherein the obtaining of the object feature of the image through object detection comprises:
acquiring basic object characteristics of the image;
converting the basic object features into object feature vectors;
and taking the object feature vector as the object feature of the image.
3. The method according to claim 1 or 2, wherein the obtaining of the first and second relational features between the first and second objects in the image by relational detection comprises:
acquiring the basic relation characteristics of the image;
determining a first relation characteristic matrix and a second relation characteristic matrix according to the basic relation characteristics;
obtaining the first relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the first relation characteristic matrix;
and obtaining the second relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the second relation characteristic matrix.
4. The method of claim 3, further comprising:
optimizing the second relation characteristic matrix according to the first relation characteristic matrix and the second relation characteristic matrix, or optimizing the first relation characteristic matrix and the second relation characteristic matrix;
in a case of optimizing the second relationship feature matrix, obtaining the second relationship feature between the first object and the second object in the image according to the object feature of the image and the second relationship feature matrix includes:
obtaining the second relation characteristic between the first object and the second object in the image according to the object characteristic vector and a third relation characteristic matrix obtained after the second relation characteristic matrix is optimized;
in a case that the first relation feature matrix is optimized, obtaining the first relation feature between the first object and the second object in the image according to the object feature of the image and the first relation feature matrix includes:
and obtaining the first relation characteristic between the first object and the second object in the image according to the object characteristic of the image and a fourth relation characteristic matrix obtained after the first relation characteristic matrix is optimized.
5. The method of claim 4, wherein optimizing the second relational feature matrix based on the first relational feature matrix and the second relational feature matrix comprises:
the first relation characteristic matrix is subjected to convolution processing, normalization processing and activation function processing in sequence to obtain a first matrix, and the second relation characteristic matrix is subjected to convolution processing, normalization processing and activation function processing in sequence to obtain a second matrix;
determining a first correlation matrix in the feature channel dimension of the relationship feature and a second correlation matrix in the sample dimension of the relationship feature according to the first matrix and the second matrix;
inputting the first matrix and the first correlation matrix into a first residual neural network to obtain a first output value of the first residual neural network;
inputting the first output value and the second correlation matrix into a second residual neural network to obtain a second output value of the second residual neural network;
and adding the second output value and the second relation characteristic matrix to obtain a third relation characteristic matrix after the second relation characteristic matrix is optimized.
6. The method of any one of claims 1-5, wherein after obtaining the first and second relational features between the first and second objects in the image, the method further comprises:
in a target predicate dataset, the first predicate corresponding to the first relational feature and the second predicate corresponding to the second relational feature are determined.
7. The method of claim 6, further comprising:
acquiring a second predicate data set corresponding to the plurality of second relation characteristics;
obtaining an implicit vector corresponding to a predicate in the second predicate data set according to the second predicate data set;
clustering the predicates in the second predicate data set according to the distance between every two implicit vectors, and determining a first predicate data set corresponding to multiple first relation characteristics and corresponding relations between predicates in the first predicate data set and predicates in the second predicate data set;
and taking the first predicate data set, the second predicate data set and the corresponding relation as the target predicate data set.
8. The method of claim 7, wherein obtaining an implicit vector corresponding to a predicate in the second predicate data set from the second predicate data set comprises:
inputting the second predicate data set into a language coding neural network to obtain implicit vectors which are output by the language coding neural network and correspond to predicates in the second predicate data set; the language coding neural network is obtained by training by taking predicates in the sample predicate data set corresponding to the plurality of second relational features as input values and taking sample implicit vectors corresponding to predicates labeled in advance in the sample predicate data set as supervision.
9. The method according to any one of claims 1-8, wherein the image comprises an image reflecting a road condition ahead of the automatic driving device during driving, and the scene graph is used for describing a relationship characteristic between traffic objects included in the image; wherein the traffic object comprises at least one of a traffic signal sign, a traffic signal light, a pedestrian, a non-motor vehicle and a motor vehicle.
10. A scene graph generation apparatus, comprising:
the object characteristic acquisition module is used for acquiring the object characteristics of the image through object detection;
the relation feature acquisition module is used for acquiring a first relation feature and a second relation feature between a first object and a second object in the image through relation detection; wherein the subclass of the first relational feature includes the second relational feature;
a scene graph generating module, configured to generate the scene graph based on an object corresponding to the object feature, a first predicate corresponding to the first relational feature, and a second predicate corresponding to the second relational feature; at least one of the objects with the relationship characteristics belongs to the object corresponding to the object characteristics.
11. The apparatus of claim 10, wherein the object feature obtaining module comprises:
the first obtaining submodule is used for obtaining the basic object characteristics of the image;
a conversion submodule for converting the basic object features into object feature vectors;
a first determining sub-module for taking the object feature vector as the object feature of the image.
12. The apparatus of claim 10 or 11, wherein the relational feature obtaining module comprises:
the second obtaining submodule is used for obtaining the basic relation characteristics of the image;
the second determining submodule is used for determining a first relation characteristic matrix and a second relation characteristic matrix according to the basic relation characteristics;
a third determining submodule, configured to obtain the first relationship feature between the first object and the second object in the image according to the object feature of the image and the first relationship feature matrix;
and the fourth determining submodule is used for obtaining the second relation characteristic between the first object and the second object in the image according to the object characteristic of the image and the second relation characteristic matrix.
13. The apparatus of claim 12, further comprising:
the optimization module is used for optimizing the second relation characteristic matrix according to the first relation characteristic matrix and the second relation characteristic matrix, or optimizing the first relation characteristic matrix and the second relation characteristic matrix;
in a case where the optimization module optimizes the second relation characteristic matrix, the fourth determining submodule is configured to:
obtain the second relation characteristic between the first object and the second object in the image according to the object characteristic vector and a third relation characteristic matrix obtained after the second relation characteristic matrix is optimized;
in a case where the optimization module optimizes the first relation characteristic matrix, the third determining submodule is configured to:
obtain the first relation characteristic between the first object and the second object in the image according to the object characteristic of the image and a fourth relation characteristic matrix obtained after the first relation characteristic matrix is optimized.
14. The apparatus of claim 13, wherein the optimization module comprises:
the third obtaining submodule is used for sequentially carrying out convolution processing, normalization processing and activation function processing on the first relation characteristic matrix to obtain a first matrix, and sequentially carrying out convolution processing, normalization processing and activation function processing on the second relation characteristic matrix to obtain a second matrix;
a fifth determining submodule, configured to determine, according to the first matrix and the second matrix, a first correlation matrix in a feature channel dimension of the relationship feature, and a second correlation matrix in a sample dimension of the relationship feature;
a fourth obtaining submodule, configured to input the first matrix and the first correlation matrix into a first residual neural network to obtain a first output value of the first residual neural network;
a fifth obtaining submodule, configured to input the first output value and the second correlation matrix into a second residual neural network to obtain a second output value of the second residual neural network;
and the sixth determining submodule is used for adding the second output value and the second relation characteristic matrix to obtain a third relation characteristic matrix after the second relation characteristic matrix is optimized.
15. The apparatus according to any one of claims 10-14, further comprising:
a first determining module configured to determine, in a target predicate data set, the first predicate corresponding to the first relational feature and the second predicate corresponding to the second relational feature.
16. The apparatus of claim 15, further comprising:
an obtaining module, configured to obtain a second predicate data set corresponding to the plurality of second relationship characteristics;
a second determining module, configured to obtain, according to the second predicate data set, an implicit vector corresponding to a predicate in the second predicate data set;
a third determining module, configured to cluster the predicates in the second predicate data set according to a distance between each two implicit vectors, and determine a first predicate data set corresponding to multiple first relationship features and a correspondence between the predicates in the first predicate data set and the predicates in the second predicate data set;
a fourth determining module, configured to use the first predicate data set, the second predicate data set, and the correspondence as the target predicate data set.
17. The apparatus of claim 16, wherein the second determining module comprises:
a sixth obtaining submodule, configured to input the second predicate data set into a language-coded neural network, so as to obtain an implicit vector output by the language-coded neural network and corresponding to the predicate in the second predicate data set; the language coding neural network is obtained by training by taking predicates in the sample predicate data set corresponding to the plurality of second relational features as input values and taking sample implicit vectors corresponding to predicates labeled in advance in the sample predicate data set as supervision.
18. The apparatus according to any one of claims 10-17, wherein the image comprises an image for reflecting a road condition ahead of the automatic driving device during driving, and the scene graph is used for describing a relationship characteristic between traffic objects included in the image; wherein the traffic object comprises at least one of a traffic signal sign, a traffic signal light, a pedestrian, a non-motor vehicle and a motor vehicle.
19. A computer-readable storage medium, characterized in that the storage medium stores a computer program for executing the scene graph generating method of any one of the above claims 1 to 9.
20. A scene graph generation apparatus, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the scenegraph generation method of any of claims 1-9.
CN202010104227.6A 2020-02-20 2020-02-20 Scene graph generation method and device and storage medium Active CN111340912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010104227.6A CN111340912B (en) 2020-02-20 2020-02-20 Scene graph generation method and device and storage medium


Publications (2)

Publication Number Publication Date
CN111340912A 2020-06-26
CN111340912B CN111340912B (en) 2022-12-23

Family

ID=71187151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010104227.6A Active CN111340912B (en) 2020-02-20 2020-02-20 Scene graph generation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN111340912B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150071539A1 (en) * 2013-09-09 2015-03-12 International Business Machines Corporation Invariant Relationship Characterization for Visual Objects
CN107798653A (en) * 2017-09-20 2018-03-13 北京三快在线科技有限公司 A kind of method of image procossing and a kind of device
CN108229272A (en) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 Vision relationship detection method and device and vision relationship detection training method and device
CN108229287A (en) * 2017-05-31 2018-06-29 北京市商汤科技开发有限公司 Image-recognizing method and device, electronic equipment and computer storage media
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN110084128A (en) * 2019-03-29 2019-08-02 安徽艾睿思智能科技有限公司 Scene chart generation method based on semantic space constraint and attention mechanism


Also Published As

Publication number Publication date
CN111340912B (en) 2022-12-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant