Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a graph convolution network system and a 3D object detection method based on the graph convolution network system, which are used for processing a 3D point cloud through graph convolution of a multi-scale attention mechanism and improving the precision of target detection of an image in a three-dimensional space.
In a first aspect, an embodiment of the present invention provides a graph convolution network system, including:
the shape semantic extraction module is used for receiving point cloud features of the image and modeling the geometric positions of the points in the point cloud features to obtain global semantic features;
the multilayer perceptron is connected with the shape semantic extraction module and is used for extracting multilevel semantic features by utilizing a multilayer graph convolution neural network based on the global semantic features and filtering the multilevel semantic features by using an attention mechanism;
the proposal generator is connected with the multilayer perceptron and used for summarizing the filtered multilevel semantic features and generating at least one primary proposal in a weighting way;
a proposal inference module, coupled to the proposal generator, for predicting a 3D bounding box and semantic categories of objects in the image using global semantic features and the at least one primary proposal.
In one embodiment, the shape semantic extraction module comprises:
a clustering-by-fast-search-and-find-of-density-peaks (CFDP) module, configured to receive the point cloud features and perform clustering on the feature points in the point cloud features by using the CFDP algorithm to obtain a plurality of clustering centers;
the k neighbor module is connected with the CFDP module and used for constructing a plurality of local areas related to the geometric positions of the points by using k neighbor relations according to the plurality of clustering centers;
and the attention aggregation module is used for adaptively aggregating the point characteristics of the clustering center and other points in the local area corresponding to the clustering center to obtain the global semantic characteristics.
In one embodiment, the attention aggregation module is configured to:
adaptively aggregate the clustering center and the other points in the local area corresponding to the clustering center, generating relative position information; and
construct an aggregation method using an attention mechanism according to all points in the local area corresponding to the clustering center:

f̃_i = Attn_{j∈N(i)}( g(x_i − x_j), h(f_i, f_j) )    (1)

wherein f̃_i represents the global semantic features of the image, g(·) represents a modeling function of the relative geometric position, h(·,·) represents a point feature processing function, f_i represents the point feature of the clustering center, f_j represents a point feature in the local area N(i) corresponding to the clustering center, and x_i and x_j respectively represent the location information of f_i and f_j.
In one embodiment, the multilayer perceptron comprises a multilayer graph convolution neural network and a plurality of adaptive aggregation modules, wherein the first layer of the graph convolution neural network is connected with the shape semantic extraction module, and an adaptive aggregation module is connected between every two adjacent layers of the graph convolution neural network;
the multilayer graph convolution neural network is used for extracting the multilevel semantic features;
the self-adaptive aggregation module is used for filtering the semantic features output by the previous layer of the graph convolution neural network by using an attention mechanism, and inputting the filtered semantic features into the next layer of the graph convolution neural network.
In one embodiment, the attention mechanism is the aggregation process represented by formula (1);
the adaptive aggregation module is configured to: for an aggregation center point, aggregate the other points in the local area corresponding to that point to update the feature of that point.
In an embodiment, the proposal generator is connected to each layer in the multi-layer graph convolution neural network, the proposal generator being configured to:
converting the filtered multilevel semantic features into the same feature space by using a voting module, wherein the voting module uses the following function:

z′ = z + Δz,  Δz = G(z)    (2)

wherein G represents the adaptive aggregation method designed for the multilayer perceptron, z represents the semantic features and relative positions before adaptive aggregation, Δz represents the offset of the semantic features and the offset of the relative positions obtained through G, and z′ represents the semantic features and relative positions after adaptive aggregation; and

fusing the multilevel semantic features by using the VoteNet method to generate the at least one primary proposal.
In one embodiment, the proposal inference module is for:
integrating all local information by using the formula

F′ = Φ(P, F)

wherein P represents the relative positions of all local information, F represents the at least one primary proposal, and F′ represents the integrated information; the integrating operation Φ comprises: integrating the feature information along the vertex direction and the channel direction, and considering the integration of relative positions among proposals and a Hadamard inner product operation; and

predicting the 3D bounding box and semantic category from the integrated information by using the VoteNet method.
In a second aspect, an embodiment of the present invention further provides a 3D object detection method based on a graph-convolution network system. The method comprises the following steps:
S10: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data is a point cloud feature of an image; and performing 3D bounding box labeling and semantic category labeling on each training data;
S20: constructing any graph convolution network system provided by the embodiments of the present invention;
S30: training the graph convolution network system by using the training data set;
S40: collecting the point cloud features of an image to be predicted, inputting the point cloud features of the image to be predicted into the trained graph convolution network system, and obtaining the 3D bounding box and semantic category of the object in the image to be predicted.
In one embodiment, in step S30, the objective optimization function used is:

L = L_vote + λ1·L_obj + λ2·L_box + λ3·L_cls

wherein L_vote represents the difference between the votes obtained during the training process and the truth values, L_obj is used for calculating whether the aggregated voting results relate to an object, L_box represents the difference between the predicted 3D bounding box and the annotated 3D bounding box, L_cls represents the cross-entropy loss between the predicted and labeled classes, and λ1, λ2 and λ3 are hyperparameters.
In an embodiment, the method further comprises:
S50: evaluating the performance of the 3D object detection method by using the mean average precision, and evaluating the adaptability of the 3D object detection method for detecting various 3D objects by using the coefficient of variation of the average precision.
In a third aspect, an embodiment of the present invention further provides a computer device. The device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the 3D object detection method based on the graph convolution network system provided by the embodiment of the invention is realized.
In a fourth aspect, an embodiment of the present invention further provides a storage medium on which a computer-readable program is stored, wherein the program, when executed, implements any 3D object detection method based on the graph convolution network system provided by the embodiments of the present invention.
The invention has the following beneficial effects: a fast-search clustering algorithm is used to obtain a better clustering effect, and attention aggregation is introduced so that the graph convolution neural network has better input features; in the graph convolution neural network with the multilayer perceptron, adaptive aggregation is used to obtain multilayer geometric features with a higher degree of abstraction; and the multilevel semantics are fully utilized, global semantic information is introduced, and the 3D bounding box and semantic categories are predicted. These operations effectively improve the final performance of the whole graph convolution network system, realize end-to-end 3D object detection based on a multi-scale attention mechanism, make full use of the geometric correspondence between shape semantics and 3D point cloud features, improve the precision of 3D object detection, and make the deep network more interpretable.
Detailed Description
The invention is further described with reference to the following figures and examples. The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
With the rapid development of 3D acquisition technology, 3D sensors are becoming increasingly common and inexpensive, including various types of 3D scanners, LiDAR, and RGB-D cameras (e.g., Kinect, RealSense, and the Apple depth camera). The 3D data acquired by these sensors can provide rich geometric, shape, and scale information. 3D data can typically be represented in different formats, including depth images, point clouds, meshes, and volumetric grids. The point cloud representation is a common form whose characteristic is that the original geometric information is retained in three-dimensional space without discretization. Point clouds are therefore the preferred representation for many scene-understanding applications, such as autonomous driving and robotics. Deep learning techniques dominate many areas of research, such as computer vision, speech recognition, and natural language processing. However, deep learning on 3D point clouds still faces significant challenges: small dataset sizes, and the high dimensionality and unstructured nature of 3D point clouds. On this basis, deep learning methods for processing 3D point clouds are the main subject of the following analysis.
As the application fields of graph convolution networks continue to expand, researchers have begun exploring how graph convolution neural networks can directly model the points in 3D point clouds. On the one hand, the ability to model local structure is crucial to the success of 3D object detection architectures, but local shape information is not well interpreted. Because of the wide variety of objects to be studied, the feature distributions required to detect different objects are not necessarily the same; in other words, multiple levels of semantics may be required to identify different objects. On the other hand, when the model extracts edge features between a cluster center and its neighboring points, the local geometric relationships between these points are obtained, and the shape information of the geometric structures between points in the local region needs to be captured in depth.
Example one
The embodiment provides a graph convolution network system for 3D object detection. The system comprises: a shape semantic extraction module, a multilayer perceptron, a proposal generator, and a proposal inference module.
The shape semantic extraction module is used for receiving point cloud features of the image and modeling the geometric positions of the points in the point cloud features to obtain global semantic features.
The multilayer perceptron is connected with the shape semantic extraction module and is used for extracting multilevel semantic features by utilizing a multilayer graph convolution neural network based on the global semantic features and filtering the multilevel semantic features by using an attention mechanism.
And the proposal generator is connected with the multilayer perceptron and used for summarizing the filtered multilevel semantic features and generating at least one primary proposal in a weighting manner.
A proposal inference module is coupled to the proposal generator for predicting a 3D bounding box and semantic categories of objects in the image using global semantic features and the at least one primary proposal.
In one embodiment, the shape semantic extraction module comprises: a CFDP module, a k-nearest neighbor module, and an attention aggregation module.
And the CFDP module is used for receiving the point cloud characteristics, clustering the characteristic points in the point cloud characteristics by using a CFDP algorithm and obtaining a plurality of clustering centers.
The k-nearest neighbor module is connected with the CFDP module and used for constructing a plurality of local areas related to the geometric positions of the points by using k-nearest neighbor relations according to the plurality of clustering centers.
And the attention aggregation module is used for adaptively aggregating point features of the clustering center and other points in the local area corresponding to the clustering center to obtain the global semantic features.
In one embodiment, the attention aggregation module is configured to: adaptively aggregate the clustering center and the other points in the local area corresponding to the clustering center, generating relative position information; and construct an aggregation method using an attention mechanism according to all points in the local area corresponding to the clustering center:

f̃_i = Attn_{j∈N(i)}( g(x_i − x_j), h(f_i, f_j) )    (1)

wherein f̃_i represents the global semantic features of the image, g(·) represents a modeling function of the relative geometric position, h(·,·) represents a point feature processing function, f_i represents the point feature of the clustering center, f_j represents a point feature in the local area N(i) corresponding to the clustering center, and x_i and x_j respectively represent the location information of f_i and f_j.
In one embodiment, the multilayer perceptron comprises the multilayer graph convolution neural network and a plurality of adaptive aggregation modules, wherein the first layer of the graph convolution neural network is connected with the shape semantic extraction module, and an adaptive aggregation module is connected between every two adjacent layers of the graph convolution neural network.
The multilayer graph convolution neural network is used for extracting the multilevel semantic features.
The self-adaptive aggregation module is used for filtering the semantic features output by the previous layer of the graph convolution neural network by using an attention mechanism, and inputting the filtered semantic features into the next layer of the graph convolution neural network.
In one embodiment, the attention mechanism is the aggregation process represented by formula (1).
The adaptive aggregation module is configured to: for an aggregation center point, aggregate the other points in the local area corresponding to that point to update the feature of that point.
In an embodiment, the proposal generator is connected to each layer in the multilayer graph convolution neural network.
The proposal generator is configured to: convert the filtered multilevel semantic features into the same feature space by using a voting module, wherein the voting module uses the following function:

z′ = z + Δz,  Δz = G(z)    (2)

wherein G represents the adaptive aggregation method designed for the multilayer perceptron, z represents the semantic features and relative positions before adaptive aggregation, Δz represents the offset of the semantic features and the offset of the relative positions obtained through G, and z′ represents the semantic features and relative positions after adaptive aggregation.
The multilevel semantic features are fused by using the VoteNet method to generate the at least one primary proposal.
In one embodiment, the proposal inference module is configured to: integrate all local information by using the formula

F′ = Φ(P, F)

wherein P represents the relative positions of all local information, F represents the at least one primary proposal, and F′ represents the integrated information; the integrating operation Φ comprises: integrating the feature information along the vertex direction and the channel direction, and considering the integration of relative positions among proposals and a Hadamard inner product operation; and predict the 3D bounding box and semantic category from the integrated information by using the VoteNet method.
Fig. 1 is a schematic structural diagram of a graph convolution network system according to an embodiment of the present invention, in which the main structures in the graph convolution network are shown, and the directions of the arrows indicate the directions of the main signal flows in the graph convolution network system. The specific structure and operating principle of the graph convolution network will be described in detail with reference to Fig. 1.
After the image point cloud characteristics are input into the graph convolution network system, the following processing flows are sequentially carried out.
(I) Shape semantic extraction: model the geometric positions of the points in the point cloud and highlight the importance of shape information in 3D object detection. The specific process is as follows:
1. Cluster the feature points in the point cloud by using the CFDP algorithm. Input the sample points and the minimum distance between two cluster categories, normalize the samples, and sort the sample points by density to generate a density graph of the sample points. Find the outliers in the density graph; these outliers are the center points of the cluster categories (called clustering centers). Determine the cluster category of each sample point in order of decreasing density, and then compute the maximum edge density value of each cluster category. Finally, determine the noise points according to the maximum edge density value and output the cluster category labels.
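The clustering step above follows the density-peaks idea (clustering by fast search and find of density peaks). A minimal pure-Python sketch, assuming a simple neighbor-count density and a rho·delta score for picking centers; the cutoff distance `d_c` and the number of centers are illustrative parameters, not the patent's exact settings:

```python
import math

def cfdp_cluster(points, d_c, n_centers):
    """Toy CFDP: rho_i counts neighbors within d_c; delta_i is the distance
    to the nearest point of higher density. Points with large rho * delta
    are taken as clustering centers; every remaining point, visited in
    decreasing-density order, joins its nearest higher-density neighbor."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    order = sorted(range(n), key=lambda i: -rho[i])          # densest first
    delta, nearest_higher = [0.0] * n, [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])                          # conventional choice
        else:
            nearest_higher[i] = min(order[:rank], key=lambda j: dist[i][j])
            delta[i] = dist[i][nearest_higher[i]]
    centers = sorted(range(n), key=lambda i: -(rho[i] * delta[i]))[:n_centers]
    label = [-1] * n
    for c_id, c in enumerate(centers):
        label[c] = c_id
    for i in order:                                          # decreasing density
        if label[i] == -1:
            label[i] = label[nearest_higher[i]]
    return centers, label
```

Outliers in the density graph (large delta despite moderate rho) surface naturally as centers here; noise handling via the maximum edge density value is omitted for brevity.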
2. Construct local areas related to the geometric positions of the points by using k-nearest-neighbor relations. For each clustering center obtained by the clustering, the k points closest to that clustering center are regarded as the local area corresponding to the clustering center. The same operation is performed on each clustering center in turn to obtain the local area corresponding to each clustering center.
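Step 2 can be sketched directly; `knn_local_regions` is a hypothetical helper name, and the brute-force sort stands in for whatever spatial index a real implementation would use:

```python
import math

def knn_local_regions(points, center_indices, k):
    """For each clustering center, take the k nearest other points as that
    center's local area, as in step 2 of the shape-semantic extraction flow."""
    regions = {}
    for c in center_indices:
        others = sorted((i for i in range(len(points)) if i != c),
                        key=lambda i: math.dist(points[c], points[i]))
        regions[c] = others[:k]
    return regions
```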
3. Aggregate the features through the attention aggregation module. Specifically, the point feature of the clustering center and those of the other points in the local area corresponding to the clustering center are adaptively aggregated, and relative position information is generated. An aggregation method using an attention mechanism is constructed according to all points in the local area corresponding to the clustering center. The function used for the aggregation process is:

f̃_i = Attn_{j∈N(i)}( g(x_i − x_j), h(f_i, f_j) )    (1)

wherein f̃_i represents the aggregated feature, g(·) represents a modeling function of the relative geometric position, h(·,·) represents a point feature processing function whose parameters f_i and f_j respectively represent the point feature of the clustering center and the feature of a point in the local area N(i) corresponding to the clustering center, and x_i and x_j respectively represent the location information of f_i and f_j.

(II) The neural-network-based multilayer perceptron builds on the shape semantic extraction and generates multilevel semantics hierarchically, using an adaptive aggregation module after each graph convolution operation. The specific process is as follows:
The features of each sampling point are aggregated and updated by using a Graph Convolutional Network (GCN) together with the method designed in flow (I). By applying the GCN to each point and its local neighborhood to obtain the local geometric structure, and stacking multiple graph layers with gradually enlarged neighborhoods to obtain features, the method gradually enlarges the receptive field of the convolution and abstracts progressively larger local regions, thereby extracting features hierarchically and preserving the geometric structure of the points along the hierarchy.
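The receptive-field growth described above can be seen even with an unlearned stand-in for the GCN layer; mean aggregation over the neighborhood is an assumption used purely to make the effect visible:

```python
def gcn_layer(features, neighbors):
    """One simplified graph-convolution layer: mean aggregation over each
    point's neighborhood (self included). A real GCN applies learned
    weights; mean pooling suffices to show receptive-field growth."""
    return [sum(features[j] for j in [i] + neighbors[i]) / (1 + len(neighbors[i]))
            for i in range(len(features))]

# On the path graph 0-1-2, one layer mixes only direct neighbors, but a
# second stacked layer lets point 2 be influenced by point 0 (2 hops away).
feats = [1.0, 0.0, 0.0]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
layer1 = gcn_layer(feats, nbrs)
layer2 = gcn_layer(layer1, nbrs)
```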
Fig. 2 is a schematic diagram of an operation of an adaptive aggregation module according to an embodiment of the present invention. As shown in fig. 2, the work flow of the adaptive aggregation module is as follows.
For an aggregation center point, the features of the other points in its corresponding local area are aggregated to update the feature of that point. In FIG. 2, p and q are used to represent the aggregation center and a point in its local area, respectively, and the edge vector q − p (the vector from point p to point q) represents the geometry between these two points.
The edge vector q − p can be decomposed into three orthogonal basis vectors. Based on this vector decomposition, the edge features between the two points can be projected onto the three fixed orthogonal basis vectors, with a direction-dependent weighting matrix applied to extract the features along each direction; the extracted features are then weighted in proportion to the angle between the edge vector and the basis vectors. This vector decomposition method reduces the variance of the absolute coordinates of the point cloud, enables the model to independently learn edge features along each basic direction, and aggregates according to the geometric relationship between the edge vectors and the basis vectors, so that the model can capture the geometric structure between points. Finally, the aggregated feature for the point is generated with a weighting via a learnable self-attention module.
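A sketch of the vector decomposition, assuming the three fixed orthogonal basis vectors are the coordinate axes and that the direction weights are simply |cos| of the angle between the edge vector and each basis vector (the patent applies learned direction-dependent weight matrices instead):

```python
import math

def decompose_edge(p, q, basis=((1, 0, 0), (0, 1, 0), (0, 0, 1))):
    """Project the edge vector q - p onto three fixed orthogonal basis
    vectors; the per-direction weights are proportional to the alignment
    (|cos theta|) between the edge and each basis direction."""
    edge = tuple(qc - pc for pc, qc in zip(p, q))
    norm = math.sqrt(sum(c * c for c in edge)) or 1.0
    components = [sum(e * b for e, b in zip(edge, bv)) for bv in basis]
    weights = [abs(c) / norm for c in components]
    return components, weights
```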
Through the adaptive aggregation module, the function of filtering the multilevel semantic features extracted by the multilayer graph convolution neural network with an attention mechanism is realized. It should be noted that the "filtering" is a weighted summation process and may include the following steps: for the other points in the local area of the aggregation center, assigning a different weight to the point feature of each point according to the correlation between that point and the aggregation center, where the assigned weight increases with the correlation; performing weighted summation of the point features of the other points in the local area of the aggregation center according to the assigned weights; and updating the point feature of the aggregation center with the weighted sum, i.e., using the weighted point feature as the point feature of the aggregation center. For example, a point with greater correlation to the aggregation center may be assigned a larger weight, while a point with little correlation may be assigned a small weight, or even a weight of 0; the subsequent weighted summation of the point features then achieves the effect of "filtering" out point features with little or no relevance.
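The weighted-summation "filtering" described above amounts to softmax attention over the neighbors; a minimal sketch, assuming a dot product as the correlation score:

```python
import math

def attention_filter(center_feat, neighbor_feats):
    """Score each neighbor by its correlation with the aggregation center
    (dot product), softmax the scores so more-relevant points get larger
    weights (irrelevant ones approach 0), and replace the center feature
    with the weighted sum of the neighbor features."""
    scores = [sum(c * n for c, n in zip(center_feat, nf)) for nf in neighbor_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]        # numerically stable softmax
    weights = [e / sum(exps) for e in exps]
    updated = [sum(w * nf[d] for w, nf in zip(weights, neighbor_feats))
               for d in range(len(center_feat))]
    return updated, weights
```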
(III) Summarize the multilayer semantics through the proposal generator to generate a primary proposal. The specific process is as follows:
1. Convert the filtered multilevel semantics obtained from flow (II) into the same feature space by using the voting module. The function used by the voting module is:

z′ = z + Δz,  Δz = G(z)    (2)

wherein G represents the adaptive aggregation method designed and used in flow (II), z represents the semantic features and relative positions before adaptive aggregation, Δz represents the offset of the semantic features and the offset of the relative positions obtained through G, and z′ represents the semantic features and relative positions after adaptive aggregation. By executing this improved voting module, multilayer semantic information of different sizes is converted into semantic information of the same size.
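The input-plus-predicted-offset form of the voting step can be sketched as follows; `offset_fn` stands in for the trained adaptive-aggregation network, which is an assumption of this sketch:

```python
def vote(features, positions, offset_fn):
    """Voting sketch: for each point, a learned function predicts an offset
    for the semantic feature and the relative position, and the vote is
    simply input plus offset (z' = z + delta_z, with delta_z = G(z))."""
    voted = []
    for f, x in zip(features, positions):
        df, dx = offset_fn(f, x)
        voted.append(([fi + d for fi, d in zip(f, df)],
                      [xi + d for xi, d in zip(x, dx)]))
    return voted
```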
2. Fuse the multilevel semantics by using the VoteNet method and generate a proposal. The voting results are retained through the Farthest Point Sampling (FPS) technique, and the VoteNet method (in which the number of points is set to 256 by default) is adopted to fuse the multilevel semantic information and predict a bounding box and a category; this prediction is called the primary proposal.
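Farthest Point Sampling, used above to retain the voting results, greedily keeps points that spread out over the set; a pure-Python sketch starting from index 0 (the starting index is arbitrary):

```python
import math

def farthest_point_sampling(points, m):
    """Keep m well-spread points: start from point 0, then repeatedly add
    the point whose distance to the chosen set is largest, updating each
    point's distance to the nearest chosen point as we go."""
    chosen = [0]
    min_d = [math.dist(points[0], p) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=lambda i: min_d[i])
        chosen.append(nxt)
        min_d = [min(md, math.dist(points[nxt], p)) for md, p in zip(min_d, points)]
    return chosen
```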
(IV) Predict the 3D bounding box and the semantic category from the global semantics and the primary proposal by using the proposal inference module. The specific process is as follows:
and (3) combining the global semantic information generated by the flow (I) and the primary proposal generated by the flow (III) by using a VoteNet method to finally generate a 3D boundary box and a semantic category, which are called as a final proposal. The "global semantic information" herein refers to an aggregate feature having global semantics, and may also be referred to as a "global semantic feature".
First, all local information is integrated by using the formula

F′ = Φ(P, F)

wherein P represents the relative positions of all local information, F represents all primary proposals, and F′ represents the integrated information. The integration operation Φ comprises integration of the feature information along the vertex direction and the channel direction, integration considering the relative positions between proposals, and a Hadamard inner product operation. Finally, the VoteNet method is used to predict the 3D bounding box and semantic category from the integrated information.
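A simplified sketch of the integration: proposal features and relative-position features are combined with a Hadamard (elementwise) product, then pooled along the vertex direction; the channel-direction pooling and the learned parts are omitted, so this only illustrates the two named operations:

```python
def integrate_proposals(F, P):
    """Combine proposal features F with relative-position features P by a
    Hadamard product, then average along the vertex direction (over the
    proposals) to get one integrated value per channel."""
    hadamard = [[f * p for f, p in zip(frow, prow)] for frow, prow in zip(F, P)]
    n = len(hadamard)
    vertex_pooled = [sum(row[d] for row in hadamard) / n for d in range(len(F[0]))]
    return hadamard, vertex_pooled
```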
In the embodiment of the application, each module can achieve the following beneficial effects:
(1) shape semantic extraction module
A. The points in the point cloud are clustered by using the CFDP algorithm. The classical k-means clustering algorithm cannot detect non-spherical data distributions, and the DBSCAN algorithm requires a density threshold to be set; the CFDP algorithm improves on these two classical methods by selecting the density maximum of each region and choosing the cluster categories according to density.
B. The center point of the local area and the other points of the local area are used, and the aggregated point features are controlled by the attention mechanism. This differs from prior 3D object detection methods, which use a max-pooling operation and therefore retain only a single piece of information; the invention makes full use of all the information through attention aggregation, increasing the accuracy of the prediction result while preserving the amount of information.
(2) Multilayer perceptron based on neural network
A. The invention does not adopt a U-shaped network structure that first down-samples and then up-samples; it only uses a hierarchical graph convolution neural network to generate multilevel semantics. This guarantees the computation speed and avoids the accuracy loss caused by noise introduced during up-sampling.
B. After the graph convolution neural network processes the point cloud, adaptive aggregation is used. The vector decomposition method reduces the variance of the absolute coordinates of the point cloud, the model learns edge features along the basic directions, and aggregation is performed according to the geometric relationship between the edges and the basis vectors, so that the model can capture the geometric structure between points. Adaptive aggregation acquires as much geometric information of the point cloud as possible, increases the amount of information at the center point, and obtains geometric features with a higher degree of abstraction.
(3) A proposal generator:
Different from prior methods (such as VoteNet) that apply only one feature map to predict objects, the invention generates multilevel semantics through the multilayer perceptron and converts them into the same feature space through the voting module; the voting module fully utilizes the large amount of information retained by the multilevel semantics, which can significantly improve the accuracy of the results.
(4) Proposal reasoning module
Through the structures in (1), (2) and (3), the local semantics of the multilevel structure are captured and fused, but the global semantics are not yet used in object detection. The inference module of the invention therefore merges the global semantics through a new graph convolution neural network, operates on the primary proposal, and finally generates a predicted bounding box and semantic class for output. The module combines local and global semantics to generate a more accurate bounding box.
In summary, in the graph convolution network system provided by the embodiment of the present invention, a fast-search clustering algorithm is used to obtain a better clustering effect, and attention aggregation is introduced so that the graph convolution neural network has better input features; in the graph convolution neural network with the multilayer perceptron, adaptive aggregation is used to obtain multilayer geometric features with a higher degree of abstraction; and the multilevel semantics are fully utilized, global semantic information is introduced, and the 3D bounding box and semantic categories are predicted. These operations not only produce good results in their respective modules, but also effectively improve the final performance of the whole graph convolution network system. In addition, the invention realizes end-to-end 3D object detection based on a multi-scale attention mechanism. Compared with existing 3D object detection techniques that pay no attention to shape semantics, the graph convolution network system of the invention makes full use of the geometric correspondence between shape semantics and 3D point cloud features, which not only improves the precision of 3D object detection but also makes the deep network more interpretable.
It should be noted that, in the foregoing embodiment, each included unit and each included module are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example two
The embodiment provides a 3D object detection method based on a graph convolution network system. The method is based on the graph convolution network system described in Example one. Fig. 3 is a flowchart of a 3D object detection method based on a graph convolution network system according to an embodiment of the present invention. As shown in Fig. 3, the method includes steps S10 to S40.
S10: acquiring a training data set, wherein the training data set comprises a plurality of training data, and each training data is a point cloud feature of an image; and performing 3D bounding box labeling and semantic category labeling on each training data.
S20: constructing the graph convolution network system according to any implementation of the first embodiment.
S30: and training the graph convolution network system by using the training data set.
S40: the method comprises the steps of collecting point cloud characteristics of an image to be predicted, inputting the point cloud characteristics of the image to be predicted into a trained graph convolution network system, and obtaining a 3D boundary box and a semantic category of an object in the image to be predicted.
In one embodiment, the objective optimization function used in step S30 is:

L = L_vote + λ1·L_obj + λ2·L_box + λ3·L_sem

wherein L_vote represents the difference between the votes obtained during training and the truth values; L_obj is used to calculate whether the aggregated voting results relate to an object; L_box represents the difference between the predicted 3D bounding box and the annotated 3D bounding box; L_sem represents the cross-entropy loss between the predicted and labeled classes; and λ1, λ2 and λ3 are hyper-parameters.
In an embodiment, the method further comprises:
S50: evaluating the performance of the 3D object detection method using the mean average precision, and evaluating the adaptability of the 3D object detection method for detecting various 3D objects using the coefficient of variation of the average precision.
In the embodiment of the present invention, steps S40 and S50 represent a specific process of performing 3D object detection by using the graph convolution network system, and may include the following steps:
(1) Image point cloud feature collection
In the point cloud feature collection stage, point cloud features are acquired with appropriate acquisition equipment according to actual application requirements.
(2) Shape semantic extraction module
In the shape semantic extraction stage, fast search clustering, the k-nearest-neighbor method, and attention aggregation are used to obtain aggregated features. For more technical details, see flow (1) in the first embodiment.
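As an illustrative sketch only (not the patented implementation), the construction of local regions around cluster centers by the k-nearest-neighbor relation can be expressed as follows; all function and variable names here are hypothetical, and a brute-force search is used for clarity:

```python
def knn_local_regions(centers, points, k):
    """For each cluster center, gather the k nearest points (by squared
    Euclidean distance) to form a local region tied to geometric position."""
    regions = []
    for cx, cy, cz in centers:
        # Rank all points by squared distance to this center.
        ranked = sorted(
            points,
            key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2,
        )
        regions.append(ranked[:k])
    return regions
```

In practice a spatial index (e.g. a k-d tree) would replace the brute-force sort, but the grouping semantics are the same.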
(3) Neural network point cloud feature extraction
In the neural-network point cloud feature extraction stage, according to actual application requirements, a multilayer graph convolution neural network and an adaptive aggregation module can be used to preserve the geometric structure of the points across the hierarchy. For more technical details, see flow (2) in the first embodiment.
(4) Proposal generator
In the proposal generation stage, the multilevel semantics are summarized to generate a primary proposal. For more technical details, see flow (3) in the first embodiment.
(5) Proposal reasoning module
In the proposal inference stage, 3D bounding boxes and semantic categories are predicted using the global semantics and the primary proposals. For more technical details, see flow (4) in the first embodiment.
(6) Graph convolution network method based on multi-scale attention mechanism
In the bounding-box generation stage (the generation stage of the whole graph convolution network system), an optimization objective function is established from object boundary information and object class information:

L = L_vote + λ1·L_obj + λ2·L_box + λ3·L_sem

wherein the loss function contains four terms in total, which are respectively:

L_vote represents the vote loss (Vote loss): the difference between the votes generated in flow (3) of the first embodiment and the truth values is calculated (a vote regression loss using the L1 distance);

L_obj represents the object loss (Object loss): whether the summarized voting result relates to an object is calculated (the distance between the object center of the 3D bounding box predicted in flow (4) of the first embodiment and the center of the real object is computed; a distance below the threshold is set to 0, otherwise to 1);

L_box represents the bounding-box loss (Box loss): the difference (a regression loss) between the 3D bounding box predicted in flow (4) and the actual 3D bounding box is calculated;

L_sem represents the semantic classification loss (Semantic classification loss): the cross-entropy loss between the semantic class predicted in flow (4) of the first embodiment and the true semantic class is calculated.
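A minimal sketch of how the four terms combine, under the assumption (consistent with the hyper-parameter values given in this embodiment) that the objective is a weighted sum; the function names and the thresholded object indicator are illustrative, not the patent's exact implementation:

```python
def object_indicator(pred_center, true_center, threshold):
    """0 if the predicted box center is within `threshold` of the real
    object center, otherwise 1 (the object-loss target described above)."""
    d = sum((p - t) ** 2 for p, t in zip(pred_center, true_center)) ** 0.5
    return 0.0 if d < threshold else 1.0


def total_loss(l_vote, l_obj, l_box, l_sem, lam1=0.5, lam2=1.0, lam3=0.1):
    """Weighted sum of the four loss terms; the lambda defaults follow the
    hyper-parameter settings given later in this embodiment."""
    return l_vote + lam1 * l_obj + lam2 * l_box + lam3 * l_sem
```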
The condition for training termination is generally set as a number of iterations; the number of iterations is commonly set to 50, 100, etc.
It should be noted that only the 3D bounding box and semantic category of each object need to be labeled in the training dataset. The real object-center distance (used to calculate L_obj) can be computed from the object's 3D bounding box. The actual 3D bounding box of an object (used to calculate L_box) and the semantic class of an object (used to calculate L_sem) are both existing annotation information in the training dataset. The votes are obtained by network learning, and the truth values are computed from the 3D point cloud: given an input point cloud containing N points with XYZ coordinates, the points are sampled, depth features are learned, and a subset of M points is output. These points are regarded as seed points (seed points), which are fixed; each seed independently generates a vote, and since these votes are also fixed they are called truth values for ease of understanding. The votes in L_vote are the votes obtained by network learning and, like the truth values, contain 3D coordinates and high-dimensional feature vectors; L_vote is the difference between the network-learned votes and the fixed truth values.
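As an illustrative sketch (hypothetical names, not the patent's exact code), the L1 vote regression loss over paired vote vectors — each a flat vector of 3D coordinates plus high-dimensional features — can be written as:

```python
def vote_l1_loss(pred_votes, true_votes):
    """Mean L1 distance between network-learned votes and fixed truth votes.
    Each vote is a flat sequence of 3D coordinates plus feature values."""
    total = 0.0
    count = 0
    for pred, true in zip(pred_votes, true_votes):
        for p, t in zip(pred, true):
            total += abs(p - t)
            count += 1
    return total / count
```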
λ1, λ2 and λ3 are hyper-parameters. In this embodiment, λ1 is set to 0.5, λ2 is set to 1, and λ3 is set to 0.1. The graph convolution network system can be trained on a GeForce RTX 2080Ti GPU; optimization is realized with stochastic gradient descent during training, with the initial learning rate set to a preset value, a batch size of 8, 120 training rounds, and weight decay. This embodiment may also be implemented in PyTorch.
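A minimal sketch of a single stochastic-gradient-descent update with weight decay, as used in the training described above; the learning-rate and weight-decay values here are placeholders, not the patent's settings:

```python
def sgd_step(params, grads, lr, weight_decay=0.0):
    """One SGD update: p <- p - lr * (g + weight_decay * p).
    Weight decay adds an L2 penalty gradient to each parameter."""
    return [p - lr * (g + weight_decay * p) for p, g in zip(params, grads)]
```

In a PyTorch implementation this corresponds to configuring `torch.optim.SGD` with `lr` and `weight_decay` and calling `optimizer.step()` each batch.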
(7) As the evaluation index of the method, the mean Average Precision (mAP), a general index for object detection work, can be selected to evaluate the performance of the compared frameworks. The coefficient of variation of the Average Precision (cvAP) can also be used to demonstrate the adaptability of the framework for detecting various 3D objects:

cvAP = sqrt( (1/C) · Σ_{c=1..C} (AP_c − mAP)² ) / mAP

wherein C represents the number of semantic categories of 3D objects. The lower the cvAP, the better the performance of the framework.
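The mAP and cvAP computations can be sketched as follows (an illustrative, hypothetical helper; cvAP is the population standard deviation of the per-category AP divided by their mean):

```python
import math


def map_and_cvap(ap_per_class):
    """Return (mAP, cvAP) for a list of per-category average precisions."""
    c = len(ap_per_class)
    mean_ap = sum(ap_per_class) / c
    # Population variance of the per-category AP values.
    var = sum((ap - mean_ap) ** 2 for ap in ap_per_class) / c
    return mean_ap, math.sqrt(var) / mean_ap
```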
In summary, in the 3D object detection method based on the graph convolution network system provided by the embodiment of the present invention, a fast search clustering algorithm is used to obtain a better clustering effect, and attention aggregation is introduced so that the graph convolution neural network has better input features; in the graph convolution neural network incorporating a multilayer perceptron, multilevel geometric features with a higher degree of abstraction are obtained by adaptive aggregation; and the multilevel semantics are fully utilized, global semantic information is introduced, and the 3D bounding box and semantic categories are predicted. Together, these operations effectively improve the detection performance of the entire 3D object detection method. In addition, the invention realizes end-to-end 3D object detection based on a multi-scale attention mechanism; compared with existing 3D object detection methods that pay no attention to shape semantics, the method of the invention fully utilizes the geometric correspondence between shape semantics and 3D point cloud features, which not only improves the precision of 3D object detection but also makes the deep network more interpretable.
The 3D object detection method of the embodiment of the present invention has the same technical principle and beneficial effects as the graph convolution network system of the first embodiment. For technical details not described in this embodiment, please refer to the graph convolution network system of the first embodiment.
Embodiment Three
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in Fig. 4, the device includes a processor 410 and a memory 420. The number of processors 410 may be one or more; one processor 410 is taken as an example in Fig. 4.
The memory 420 serves as a computer-readable storage medium, and may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules of the 3D object detection method based on the graph convolution network system in the embodiment of the present invention. The processor 410 implements the above-described 3D object detection method based on the graph convolution network system by executing the software programs, instructions, and modules stored in the memory 420.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiment Four
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the 3D object detection method based on the graph convolution network system provided in any embodiment of the present invention.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.