CN114266863B - 3D scene graph generation method, system, device and readable storage medium based on point cloud - Google Patents

3D scene graph generation method, system, device and readable storage medium based on point cloud

Info

Publication number
CN114266863B
CN114266863B (application number CN202111673556.3A; publication of application CN114266863A)
Authority
CN
China
Prior art keywords
point cloud
scene graph
point
box
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111673556.3A
Other languages
Chinese (zh)
Other versions
CN114266863A (en)
Inventor
魏平
危文文
王帅杰
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202111673556.3A
Publication of CN114266863A
Application granted
Publication of CN114266863B
Legal status: Active

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Generation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a point cloud based 3D scene graph generation method, system, device and readable storage medium. A 3D scene graph dataset is established based on semantic relations and geometric relations, extending a point cloud dataset with object relations, and basic point cloud features are then extracted from the 3D scene graph dataset. Point cloud object detection is performed on the extracted basic point cloud features to generate 3D bounding boxes. An initial scene graph generation network model is established and trained with the basic point cloud features and the corresponding generated 3D bounding boxes until a set maximum number of iterations is reached, yielding an end-to-end 3D scene graph generation model. Raw data of an indoor environment is organized into a standard point cloud format, and the 3D scene graph is generated with the scene graph generation network model.

Description

3D scene graph generation method, system, device and readable storage medium based on point cloud
Technical Field
The invention relates to the field of relation prediction in computer vision, and in particular to a point cloud based 3D scene graph generation method, system, device and readable storage medium.
Background
With the rapid development of deep convolutional neural networks, computers have achieved performance near or even beyond that of humans on some image-based tasks, such as image classification and 2D object detection. In practical applications, however, staying in 2D is far from sufficient for understanding the 3D world we live in. Technologies such as autonomous driving, robotics, VR and AR all urgently need the original 2D tasks to be extended to 3D. Yet there are few studies on scene graph generation in 3D environments; it is essentially a brand-new research direction in 3D vision. A 3D scene graph generation method that enables a deeper level of 3D scene understanding and promotes the further development of other downstream 3D vision tasks is therefore urgently needed.
Disclosure of Invention
The invention aims to provide a point cloud based 3D scene graph generation method, system, device and readable storage medium to address the need identified above.
A point cloud based 3D scene graph generation method establishes a 3D scene graph dataset based on semantic relations and geometric relations; extracts basic point cloud features from the 3D scene graph dataset; and performs point cloud object detection on the extracted basic point cloud features to generate 3D bounding boxes;
and establishes an initial scene graph generation network model, trains it with the basic point cloud features and the corresponding generated 3D bounding boxes until a set maximum number of iterations is reached to obtain a scene graph generation network model, organizes raw data of an indoor environment into a standard point cloud format, and generates a 3D scene graph with the scene graph generation network model.
The semantic relationships include identification, same set, and part of, and describe the internal relationships between objects; the geometric relationships include support, on, below, above, near, beside, pushed in, pushed out, and in, and describe the relative positional relationships between objects.
The input point cloud is downsampled with the farthest point sampling (FPS) algorithm; the nearest neighbor (KNN) algorithm then gathers, around each sampled point, the other points that form its cluster (the number of clusters equals the number of sampled points); finally a set of fully connected layers extracts features from each cluster, and the feature extracted from each cluster is taken as the feature of the corresponding sampled point. This operation is repeated 4 times, with the number of sampled points being N, 2048, 1024, 512 and 256 in turn.
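For illustration, below is a minimal NumPy sketch of this sampling-and-grouping step. The cluster size k = 32, the choice of starting point for FPS, and the toy array sizes are assumptions added for the example, and the set of fully connected layers that extracts a feature per cluster is omitted.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-selected set (FPS)."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = 0  # assumption: start from the first point
    for i in range(1, n_samples):
        # update every point's squared distance to the nearest selected point
        diff = points - points[selected[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = int(np.argmax(dist))
    return selected

def knn_group(points, centers, k):
    """For each sampled center, gather the indices of its k nearest points (one cluster per center)."""
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # (S, N)
    return np.argsort(d2, axis=1)[:, :k]                            # (S, k)

if __name__ == "__main__":
    cloud = np.random.rand(2048, 3).astype(np.float32)  # toy input point cloud
    idx = farthest_point_sampling(cloud, 1024)           # one downsampling stage
    clusters = knn_group(cloud, cloud[idx], k=32)        # one cluster per sampled point
    print(idx.shape, clusters.shape)                     # (1024,) (1024, 32)
```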
Further, feature interpolation is performed with inverse-distance-weighted KNN averaging, computed as

f^{(j)}(x) = \frac{\sum_{i=1}^{k} w_i(x)\, f_i^{(j)}}{\sum_{i=1}^{k} w_i(x)}, \qquad w_i(x) = \frac{1}{d(x, x_i)},

where d(·) is the Euclidean distance between two points, f^{(j)}(x) is the j-th dimensional feature of point x, and f_i^{(j)} is the j-th dimensional feature of the i-th of the k nearest neighbors of x.
Further, an eight-dimensional vector [cls, x, y, z, l, w, h, θ] is used to represent the 3D bounding box of an object based on the basic point cloud features, where cls is the class of the box, [x, y, z] are the center coordinates of the box, [l, w, h] are the length, width and height of the box, and θ is the rotation angle of the box in the horizontal direction.
Further, the basic point cloud features are input to a voting module, which predicts a coordinate offset Δx_i and a feature offset Δa_i for each point and finally outputs G = {g_i | g_i = [x_i + Δx_i; a_i + Δa_i], i = 1, …, M}; a candidate box generation module clusters the outputs of the voting module and obtains Q initial candidate boxes through a set of fully connected layers, then filters out overlapping candidate boxes through a 3D NMS module, finally yielding Q predicted bounding boxes.
Further, the basic point cloud features and the geometric attribute values of the 3D bounding boxes are taken as input; for each 3D bounding box, all points inside the bounding box are first queried and then sorted by their distance to the bounding box center; the sorted points and their corresponding features are encoded by a bidirectional LSTM network to capture the context between points, and then passed through a 2-layer MLP network to obtain the final bounding-box RoI feature.
A point cloud based 3D scene graph generation system, comprising:
the data processing module is used for establishing a 3D scene graph dataset based on the semantic relations and geometric relations, extracting basic point cloud features from the 3D scene graph dataset, and performing point cloud object detection on the extracted basic point cloud features to generate 3D bounding boxes;
the 3D scene graph generation module is used for training an initial scene graph generation network model with the basic point cloud features and the corresponding generated 3D bounding boxes until a set maximum number of iterations is reached to obtain a scene graph generation network model, organizing raw data of an indoor environment into a standard point cloud format, and generating a 3D scene graph with the scene graph generation network model.
A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the point cloud based 3D scene graph generation method when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of a point cloud based 3D scene graph generation method.
Compared with the prior art, the invention has the following beneficial technical effects:
according to the 3D scene graph generation method based on the point cloud, the 3D scene graph data set is established based on the semantic relation and the geometric relation, expansion of the point cloud data set in the object relation is achieved, and then basic point cloud characteristics are extracted from the 3D scene graph data set; performing point cloud target detection based on the generated basic point cloud characteristics to generate a 3D boundary box; the method comprises the steps of establishing a scene graph to generate an initial network model, training the scene graph by adopting basic point cloud characteristics and a corresponding generated 3D boundary frame to generate the initial network model until the set maximum iteration times are reached, obtaining an end-to-end 3D scene graph generation model, arranging original data of an indoor environment into a standard point cloud format, and generating the 3D scene graph by utilizing the scene graph generation network model.
The method is suitable for indoor scenes based on point clouds, and can efficiently and accurately detect various objects and generate a 3D scene graph.
Drawings
Fig. 1 is a flowchart of a method for generating a 3D scene graph according to an embodiment of the present invention.
Fig. 2 is a diagram of the number of instances per relationship class in the 3D scene graph dataset in an embodiment of the invention.
Fig. 3 is the Point RoI module according to an embodiment of the invention.
Fig. 4 is a visualization result of an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The point cloud based 3D scene graph generation method extends object relationships over a 3D point cloud dataset and provides an end-to-end 3D scene graph generation model. The dataset contains rich semantic and geometric relationships between objects and can promote research on the 3D scene graph generation task.
the embodiment of the invention comprises the following steps:
step 1: constructing a 3D scene graph dataset;
the invention adopts a 3D scene graph dataset with rich semantic relations and geometric relations, and the dataset is based on the ScanNet dataset, so that the relation among objects is expanded. The invention divides the object relationship into two types of semantic relationship and geometric relationship. Semantic relationships include identification, same set, and part of, to describe the internal relationships between objects; the geometric relationships include support, on, below, above, near, beside, pushed in, pushed out, and pushed in, which describe the relative positional relationship between objects. The data set employed in the embodiments of the present invention includes 1513 scan scenes in total, 20196 3D truth bounding boxes (13 per scene average) and 113708 relationship instances (75 per scene average).
Step 2: extracting basic point cloud features from the 3D scene graph dataset;
compared with the 2D format image, the 3D point cloud data is more sparse and irregular, and lacks texture information. Therefore, the conventional Convolutional Neural Network (CNN) cannot be directly used for feature extraction of the point cloud. In the present invention, only the point cloud data is input to the model, and RGB information (n×3) is not included. Firstly, downsampling an input point cloud by adopting a furthest point sampling algorithm (FPS), then, searching other point clouds taking a sampling point as a center by using a nearest neighbor algorithm (KNN) to form a plurality of point clusters (the number of the point clusters is consistent with that of the sampling point), finally, extracting features of the point clusters by adopting a group of full-connection layers, and taking the extracted features of each point cluster as the features of the corresponding sampling point. The above operation was repeated 4 times, and the number of sampling points was N, 2048, 1024, 512, and 256 in this order. And then, recovering the point cloud characteristics to 1024 point characteristics by using 2 up-sampling layers, wherein each up-sampling layer is connected with a corresponding down-sampling layer in the characteristic extraction module so as to realize characteristic interpolation. The KNN of inverse distance weighted average is adopted for characteristic interpolation, and the calculation formula is as follows:
where d (·) is the Euclidean distance of two points, f (j) (x) And as the j-th dimensional characteristic of the point x, obtaining 1024 multiplied by 256 basic point cloud characteristics after the step is finished.
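Below is a minimal NumPy sketch of this inverse-distance-weighted interpolation; the choice of k = 3 neighbors and the small epsilon guarding against division by zero are assumptions added for the example.

```python
import numpy as np

def idw_interpolate(query_xyz, known_xyz, known_feat, k=3, eps=1e-8):
    """Propagate features from 'known' (downsampled) points to 'query' points by
    inverse-distance-weighted averaging over the k nearest known points."""
    d = np.linalg.norm(query_xyz[:, None, :] - known_xyz[None, :, :], axis=-1)  # (Q, K)
    nn = np.argsort(d, axis=1)[:, :k]                   # indices of the k nearest known points
    nn_d = np.take_along_axis(d, nn, axis=1)            # (Q, k) distances
    w = 1.0 / (nn_d + eps)                              # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                   # normalize
    return (known_feat[nn] * w[..., None]).sum(axis=1)  # (Q, C) interpolated features

if __name__ == "__main__":
    up = idw_interpolate(np.random.rand(1024, 3), np.random.rand(256, 3), np.random.rand(256, 256))
    print(up.shape)  # (1024, 256)
```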
Step 3: performing point cloud object detection on the extracted basic point cloud features to generate 3D bounding boxes;
the input of the network structure corresponding to the step is the basic point cloud characteristics obtained in the step 2, the output is a 3D boundary box of the object and the category thereof, and the three-dimensional boundary box can be characterized by using an eight-dimensional vector [ cls, x, y, z, l, w, h, theta ]. Where cls is the class of the frame, [ x, y, z ] is the center coordinate of the frame, [ l, w, h ] is the length, width, height of the frame, and θ is the rotation angle of the frame along the horizontal direction.
First, the basic point cloud features are input to a voting module. Specifically, the input to the voting module can be written as F = {f_i | f_i = [x_i; a_i], i = 1, …, M}, where M is the number of points (1024), x_i ∈ R^3 and a_i ∈ R^D are the coordinates and feature of point f_i respectively, and D is the feature dimension. The voting module predicts a coordinate offset Δx_i and a feature offset Δa_i for each point and finally outputs G = {g_i | g_i = [x_i + Δx_i; a_i + Δa_i], i = 1, …, M}. The voting module is implemented as a 3-layer MLP network with ReLU activations.
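Below is a minimal PyTorch sketch of such a voting module; the hidden widths of 256 and the toy tensor shapes are assumptions, since the text only specifies a 3-layer MLP with ReLU activations that predicts a per-point coordinate offset and feature offset.

```python
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    """3-layer MLP that predicts a coordinate offset and a feature offset for every seed point."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3 + feat_dim),   # [Δx; Δa] per point
        )

    def forward(self, xyz, feat):
        # xyz: (M, 3) point coordinates x_i, feat: (M, D) point features a_i
        delta = self.mlp(torch.cat([xyz, feat], dim=-1))
        dx, da = delta[:, :3], delta[:, 3:]
        return xyz + dx, feat + da          # votes g_i = [x_i + Δx_i; a_i + Δa_i]

if __name__ == "__main__":
    vote = VotingModule()
    vx, va = vote(torch.rand(1024, 3), torch.rand(1024, 256))
    print(vx.shape, va.shape)  # torch.Size([1024, 3]) torch.Size([1024, 256])
```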
After the voting module, a candidate box generation module clusters the outputs of the voting module and obtains Q initial candidate boxes through a set of fully connected layers; overlapping candidate boxes are then filtered out by a 3D NMS module, finally yielding Q predicted bounding boxes.
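Below is a minimal NumPy sketch of greedy 3D non-maximum suppression over candidate boxes; it assumes axis-aligned boxes (the rotation angle θ is ignored) and an IoU threshold of 0.25, neither of which is stated in the text.

```python
import numpy as np

def nms_3d(boxes, scores, iou_thr=0.25):
    """Greedy 3D NMS over axis-aligned boxes [x, y, z, l, w, h]: keep the highest-scoring
    box and drop remaining candidates that overlap it above the IoU threshold."""
    def iou(a, b):
        lo = np.maximum(a[:3] - a[3:6] / 2, b[:3] - b[3:6] / 2)
        hi = np.minimum(a[:3] + a[3:6] / 2, b[:3] + b[3:6] / 2)
        inter = np.prod(np.clip(hi - lo, 0, None))
        return inter / (np.prod(a[3:6]) + np.prod(b[3:6]) - inter)

    order, keep = list(np.argsort(-scores)), []
    while order:
        i = order.pop(0)
        keep.append(int(i))
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep

if __name__ == "__main__":
    boxes = np.array([[0, 0, 0, 1, 1, 1], [0.1, 0, 0, 1, 1, 1], [3, 3, 3, 1, 1, 1]], dtype=float)
    print(nms_3d(boxes, np.array([0.9, 0.8, 0.7])))  # keeps boxes 0 and 2
```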
Step 4: generating a 3D scene graph from the basic point cloud features and the 3D bounding boxes;
the input of the initial network model structure generated by the scene graph corresponding to the step is the basic point cloud characteristics obtained in the step 2 and the 3D boundary box obtained in the step 3, and the input is the 3D scene graph;
the method comprises the following steps: it is quite simple to derive the RoI (Region of Interest) feature of the bounding box level from the image, since the input image is regular. The extracted point cloud features are spatially irregular because the distribution of these features is determined by the coordinates of their corresponding points. In order to obtain a feature representation of each 3D bounding box, the present invention proposes a method of treating the features of the points as a sequence—point RoI: the method takes the geometrical attribute values of the basic point cloud characteristics and the 3D bounding box as inputs; for each 3D bounding box, all points inside the bounding box are first queried, and then sorted by their distance from the center of the bounding box. These ordered points and their corresponding features are sent to a bi-directional LSTM network for encoding context information between the points, and then through a layer 2 MLP network to obtain the final bounding box RoI features, the concrete model structure of which is shown in fig. 3.
In relation prediction, the semantic class of an object matters, because some object pairs are naturally more related than others. In the present invention, word2vec is used to encode object categories into word vectors. The position of a bounding box is also important in 3D space, so the position parameters of the object are encoded by a 2-layer MLP.
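Below is a minimal PyTorch sketch of these two encodings; the class count, the 300-dimensional word vectors and the MLP widths are assumptions, and a learned embedding table stands in for pretrained word2vec vectors.

```python
import torch
import torch.nn as nn

class BoxEncoder(nn.Module):
    """Encodes the detected class with a word-vector table and the box geometry with a 2-layer MLP."""
    def __init__(self, n_classes=18, word_dim=300, pos_dim=64):
        super().__init__()
        self.word = nn.Embedding(n_classes, word_dim)  # would be initialized from word2vec vectors
        self.pos = nn.Sequential(nn.Linear(7, 128), nn.ReLU(), nn.Linear(128, pos_dim))

    def forward(self, cls_id, box_params):
        # cls_id: (B,) class indices; box_params: (B, 7) = [x, y, z, l, w, h, θ]
        return self.word(cls_id), self.pos(box_params)

if __name__ == "__main__":
    enc = BoxEncoder()
    w, p = enc(torch.tensor([3, 7]), torch.rand(2, 7))
    print(w.shape, p.shape)  # torch.Size([2, 300]) torch.Size([2, 64])
```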
One of the biggest differences between images and point clouds is that a point cloud provides accurate geometric position information, whereas the geometric information in an image is ambiguous. The detected positions of the 3D bounding boxes accurately reflect their locations in the real world. Accordingly, the invention constructs a localization graph from the geometric parameters of the bounding boxes, which is used in the attention module to learn the effect of the relative positions of different object pairs. The nodes of the graph are the detected 3D bounding boxes and the edges are the relative positions between bounding boxes, each represented by a 4-dimensional vector whose first dimension is the relative distance between the box centers and whose other three dimensions give the direction between them.
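Below is a minimal NumPy sketch of the edge features of such a localization graph; encoding the direction as a unit vector between box centers is an assumption consistent with, but not mandated by, the description above.

```python
import numpy as np

def localization_edges(centers, eps=1e-8):
    """Edge between every ordered pair of detected boxes: the first dimension is the
    relative distance between box centers, the other three give the direction."""
    n = centers.shape[0]
    edges = np.zeros((n, n, 4), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            v = centers[j] - centers[i]
            d = np.linalg.norm(v)
            edges[i, j, 0] = d                # relative distance
            edges[i, j, 1:] = v / (d + eps)   # direction (unit vector, assumed)
    return edges

if __name__ == "__main__":
    print(localization_edges(np.random.rand(5, 3)).shape)  # (5, 5, 4)
```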
the RoI features, word vectors and position vectors are cascaded, the feature representation of the object pair level is obtained through a bidirectional LSTM module, then the feature and the object positioning map feature are input into the attention module to obtain the final feature representation, the final 3D scene graph is obtained through a layer of softmax, and the visualization result is shown in fig. 4.
The model provided by the invention is trained end-to-end; the total number of training epochs is 100, i.e. the maximum number of iterations; the batch size is 8, and the loss function is cross entropy. The Adam optimizer is used for back propagation with an initial learning rate of 0.001; the learning rate is decayed at epochs [40, 60, 80] with corresponding decay rates [0.1, 0.1, 0.1]. For data augmentation, the input point cloud is randomly flipped in the horizontal direction; the point cloud is rotated by an angle uniformly distributed in [-5°, 5°]; and the point cloud is scaled by a factor uniformly distributed in [0.9, 1.1].
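Below is a minimal PyTorch/NumPy sketch of the stated optimization schedule and point cloud augmentation; the flip axis, the flip probability of 0.5 and the rotation about the vertical axis are assumptions, since the text only says the flip and rotation are in the horizontal direction.

```python
import numpy as np
import torch

def build_optimizer(model):
    """Adam with initial learning rate 0.001, decayed by 0.1 at epochs 40, 60 and 80."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40, 60, 80], gamma=0.1)
    return opt, sched

def augment(points):
    """Random horizontal flip, rotation uniform in [-5°, 5°], uniform scaling in [0.9, 1.1]."""
    if np.random.rand() < 0.5:                    # flip along the x axis (assumed)
        points[:, 0] = -points[:, 0]
    theta = np.radians(np.random.uniform(-5, 5))  # rotation about the vertical axis (assumed)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=points.dtype)
    return (points @ rot.T) * np.random.uniform(0.9, 1.1)

if __name__ == "__main__":
    opt, sched = build_optimizer(torch.nn.Linear(8, 8))
    print(augment(np.random.rand(1024, 3)).shape)  # (1024, 3)
```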
The detailed structural parameters of the model are shown in table 1.
In one embodiment of the present invention, a terminal device is provided, comprising a processor and a memory, the memory being used to store a computer program containing program instructions and the processor being used to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc.; it is the computation and control core of the terminal and is adapted to implement one or more instructions, in particular to load and execute one or more instructions so as to realize the corresponding method flow or function. The processor provided by the embodiment of the invention can be used to run the point cloud based 3D scene graph generation method.
A point cloud based 3D scene graph generation system, comprising:
the data processing module is used for establishing a 3D scene graph dataset based on the semantic relations and geometric relations, extracting basic point cloud features from the 3D scene graph dataset, and performing point cloud object detection on the extracted basic point cloud features to generate 3D bounding boxes;
the 3D scene graph generation module is used for training an initial scene graph generation network model with the basic point cloud features and the corresponding generated 3D bounding boxes until a set maximum number of iterations is reached to obtain a scene graph generation network model, organizing raw data of an indoor environment into a standard point cloud format, and generating a 3D scene graph with the scene graph generation network model.
In yet another embodiment of the present invention, a storage medium, specifically a computer readable storage medium (memory), is a memory device in the terminal device for storing programs and data. The computer readable storage medium includes a built-in storage medium in the terminal device, which provides storage space and stores the operating system of the terminal, and may also include an extended storage medium supported by the terminal device. The storage space also stores one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium may be a high-speed RAM or a non-volatile memory, such as at least one magnetic disk memory. The one or more instructions stored in the computer readable storage medium can be loaded and executed by the processor to implement the corresponding steps of the point cloud based 3D scene graph generation method in the above embodiments.
The model provided by the invention is suitable for indoor environments based on point clouds. In use, raw data of the indoor environment is organized into a standard point cloud format and then input to the trained model; after computation, the network outputs the 3D scene graph generation result. The method is suitable for point cloud based indoor scenes and can efficiently and accurately detect various objects and generate a 3D scene graph.
The method establishes the 3D scene graph dataset based on semantic and geometric relations; the dataset contains rich semantic and geometric relationships between objects and can promote research on the 3D scene graph generation task. The method and system can effectively predict the relationships between objects in a 3D scene.

Claims (6)

1. A point cloud based 3D scene graph generation method, characterized by comprising the following steps:
s1, establishing a 3D scene graph dataset based on semantic relations and geometric relations;
s2, extracting basic point cloud features from the 3D scene graph dataset;
downsampling an input point cloud with a farthest point sampling algorithm, then gathering, with a nearest neighbor algorithm, the other points centered on each sampled point to form a plurality of point clusters, finally extracting features of the point clusters with a set of fully connected layers, and taking the feature extracted from each point cluster as the feature of the corresponding sampled point;
the KNN of inverse distance weighted average is adopted for characteristic interpolation, and the calculation formula is as follows:
where d (·) is the Euclidean distance of two points, f (j) (x) The j-th dimensional characteristic of the point x;
s3, detecting a point cloud target based on the generated basic point cloud characteristics to generate a 3D boundary box;
using an eight-dimensional vector [cls, x, y, z, l, w, h, θ] to represent the 3D bounding box of an object based on the basic point cloud features, wherein cls is the class of the box, [x, y, z] are the center coordinates of the box, [l, w, h] are the length, width and height of the box, and θ is the rotation angle of the box in the horizontal direction;
inputting the basic point cloud features to a voting module, the voting module predicting a coordinate offset Δx_i and a feature offset Δa_i for each point and finally outputting G = {g_i | g_i = [x_i + Δx_i; a_i + Δa_i], i = 1, …, M}; a candidate box generation module clustering the outputs of the voting module and obtaining Q initial candidate boxes through a set of fully connected layers, then filtering out overlapping candidate boxes through a 3D NMS module, finally obtaining Q predicted bounding boxes;
and S4, establishing an initial scene graph generation network model, training it with the basic point cloud features and the corresponding generated 3D bounding boxes until a set maximum number of iterations is reached to obtain a scene graph generation network model, organizing raw data of an indoor environment into a standard point cloud format, and generating a 3D scene graph with the scene graph generation network model.
2. The point cloud based 3D scene graph generation method according to claim 1, wherein the semantic relationships include identification, same set, and part of, for describing the internal relationships between objects; and the geometric relationships include support, on, below, above, near, beside, pushed in, pushed out, and in, for describing the relative positional relationships between objects.
3. The point cloud based 3D scene graph generation method according to claim 1, wherein the basic point cloud features and the geometric attribute values of the 3D bounding boxes are taken as input; for each 3D bounding box, all points inside the bounding box are first queried and then sorted by their distance to the bounding box center; and the sorted points and their corresponding features are encoded by a bidirectional LSTM network to capture the context between points and then passed through a 2-layer MLP network to obtain the final bounding-box RoI feature.
4. A point cloud based 3D scene graph generation system, comprising:
the data processing module is used for establishing a 3D scene graph dataset based on the semantic relations and geometric relations, extracting basic point cloud features from the 3D scene graph dataset, and performing point cloud object detection on the extracted basic point cloud features to generate 3D bounding boxes;
downsampling an input point cloud with a farthest point sampling algorithm, then gathering, with a nearest neighbor algorithm, the other points centered on each sampled point to form a plurality of point clusters, finally extracting features of the point clusters with a set of fully connected layers, and taking the feature extracted from each point cluster as the feature of the corresponding sampled point;
the KNN of inverse distance weighted average is adopted for characteristic interpolation, and the calculation formula is as follows:
where d (·) is the Euclidean distance of two points, f (j) (x) The j-th dimensional characteristic of the point x;
the 3D scene graph generation module is used for training an initial scene graph generation network model with the basic point cloud features and the corresponding generated 3D bounding boxes until a set maximum number of iterations is reached to obtain a scene graph generation network model, organizing raw data of an indoor environment into a standard point cloud format, and generating a 3D scene graph with the scene graph generation network model;
using an eight-dimensional vector [cls, x, y, z, l, w, h, θ] to represent the 3D bounding box of an object based on the basic point cloud features, wherein cls is the class of the box, [x, y, z] are the center coordinates of the box, [l, w, h] are the length, width and height of the box, and θ is the rotation angle of the box in the horizontal direction;
inputting the basic point cloud features to a voting module, the voting module predicting a coordinate offset Δx_i and a feature offset Δa_i for each point and finally outputting G = {g_i | g_i = [x_i + Δx_i; a_i + Δa_i], i = 1, …, M}; a candidate box generation module clustering the outputs of the voting module and obtaining Q initial candidate boxes through a set of fully connected layers, then filtering out overlapping candidate boxes through a 3D NMS module, finally obtaining Q predicted bounding boxes.
5. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 3 when executing the computer program.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 3.
CN202111673556.3A 2021-12-31 2021-12-31 3D scene graph generation method, system, device and readable storage medium based on point cloud Active CN114266863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111673556.3A CN114266863B (en) 2021-12-31 2021-12-31 3D scene graph generation method, system, device and readable storage medium based on point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111673556.3A CN114266863B (en) 2021-12-31 2021-12-31 3D scene graph generation method, system, device and readable storage medium based on point cloud

Publications (2)

Publication Number Publication Date
CN114266863A CN114266863A (en) 2022-04-01
CN114266863B (en) 2024-02-09

Family

ID=80832302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111673556.3A Active CN114266863B (en) 2021-12-31 2021-12-31 3D scene graph generation method, system, device and readable storage medium based on point cloud

Country Status (1)

Country Link
CN (1) CN114266863B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493407A (en) * 2018-11-19 2019-03-19 腾讯科技(深圳)有限公司 Realize the method, apparatus and computer equipment of laser point cloud denseization
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493407A (en) * 2018-11-19 2019-03-19 腾讯科技(深圳)有限公司 Realize the method, apparatus and computer equipment of laser point cloud denseization
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lin Jinhua; Wang Yanjie. 3D Semantic Scene Restoration Network. Optics and Precision Engineering, 2018, (05), full text. *

Also Published As

Publication number Publication date
CN114266863A (en) 2022-04-01

Similar Documents

Publication Publication Date Title
Ye et al. 3d recurrent neural networks with context fusion for point cloud semantic segmentation
Liu et al. Fg-net: A fast and accurate framework for large-scale lidar point cloud understanding
Chen et al. Visibility-aware point-based multi-view stereo network
JP2020109659A (en) Learning of neural network for inferring editable feature tree
JP2020109660A (en) To form data set for inference of editable feature tree
WO2023015409A1 (en) Object pose detection method and apparatus, computer device, and storage medium
CN114067075A (en) Point cloud completion method and device based on generation of countermeasure network
CN115527036A (en) Power grid scene point cloud semantic segmentation method and device, computer equipment and medium
CN112528811A (en) Behavior recognition method and device
Vázquez‐Delgado et al. Real‐time multi‐window stereo matching algorithm with fuzzy logic
Wang et al. Instance segmentation of point cloud captured by RGB-D sensor based on deep learning
Schaub et al. 6-dof grasp detection for unknown objects
CN114266863B (en) 3D scene graph generation method, system, device and readable storage medium based on point cloud
Mukhaimar et al. Pl-net3d: Robust 3d object class recognition using geometric models
Li et al. CDMY: A lightweight object detection model based on coordinate attention
CN110490235B (en) Vehicle object viewpoint prediction and three-dimensional model recovery method and device facing 2D image
CN114022630A (en) Method, device and equipment for reconstructing three-dimensional scene and computer readable storage medium
CN116228850A (en) Object posture estimation method, device, electronic equipment and readable storage medium
Tesema et al. Point Cloud Completion: A Survey
CN114332509A (en) Image processing method, model training method, electronic device and automatic driving vehicle
Ouasfi et al. Mixing-denoising generalizable occupancy networks
Xiong et al. Self‐supervised depth completion with multi‐view geometric constraints
Yao et al. Fast 3D object segmentation using DBSCAN clustering based on supervoxel
Wan et al. IAN: Instance-Augmented Net for 3D Instance Segmentation
CN114049444B (en) 3D scene generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant