CN116665185A - Three-dimensional target detection method, system and storage medium for automatic driving - Google Patents

Three-dimensional target detection method, system and storage medium for automatic driving

Info

Publication number
CN116665185A
CN116665185A (application CN202310694348.4A)
Authority
CN
China
Prior art keywords
voxel
features
feature
attention
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310694348.4A
Other languages
Chinese (zh)
Inventor
刘仪婷
李兴通
薛俊
杨易堃
钱星铭
肖昊
陶重犇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202310694348.4A priority Critical patent/CN116665185A/en
Publication of CN116665185A publication Critical patent/CN116665185A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides a three-dimensional target detection method, system and storage medium for automatic driving, comprising the following steps: step one: in the voxel feature extraction branch, extracting multi-scale voxel features with local neighborhood and context information by using a graph convolution network with an attention mechanism; step two: in the image feature extraction branch, adopting a densely connected 2D convolution network with multi-level stacked aggregation to aggregate shallower and deeper layers, and introducing a pyramid stacking structure to aggregate multi-scale image features; step three: based on the features extracted by the voxel feature extraction branch and the image feature extraction branch, fusing the multi-scale image features and the voxel features through multi-modal iterative mutual attention fusion, and finally carrying out region proposal and classification regression based on the multi-modal features to realize 3D target detection. The invention has the advantage that it can accurately identify and locate distant vehicles, pedestrians, cyclists and other targets, and can be applied to real scenes.

Description

Three-dimensional target detection method, system and storage medium for automatic driving
Technical Field
The present invention relates to the field of automatic driving technology, and in particular to a three-dimensional object detection method, system, and storage medium for automatic driving.
Background
In recent years, object detection technology has developed rapidly and has been widely used in the fields of automatic driving and robotics. 3D object detection is the task of identifying and locating objects in a three-dimensional scene, but it still faces significant challenges. Recent approaches mainly utilize image and lidar data. Point cloud data carry rich geometric information, and images carry rich semantic information. Because of the complementarity of point clouds and images, some methods project the point cloud to various views of a compact representation, then acquire the features of the views with a mature 2D convolutional neural network, and finally fuse them with the image features. Based on such fusion, some methods screen the point cloud with a 2D detector and perform 3D object detection only on the points within the frustum of a 2D image object. However, such methods suffer from spatial misalignment during projection, especially for distant objects. Another solution is to fuse images at different resolution levels with point cloud features, but it still suffers from point sparsity, particularly for distant objects.
Disclosure of Invention
In order to solve the problems in the background art, the invention considers that multi-scale image and point cloud feature information should be fused together, designs a deeply fused multi-modal iterative mutual attention fusion, and uses an iterative mutual attention mechanism to align voxel features and image features.
The invention provides a target detection algorithm based on iterative voxel-image attention fusion.
As a further improvement of the present invention, a voxel graph feature filter (VGFF) is provided. It assigns appropriate attention weights to neighboring voxels according to dynamically learned graph voxel features, lets the features selectively attend to the most relevant parts, and extracts multi-scale voxel features with more local neighborhood and context information.
As a further improvement of the invention, a densely connected multi-scale image feature aggregation MHA-ResNet module is provided. Multi-scale aggregated features are obtained so that the encoder receives richer information, the loss of spatial information caused by dimension-reducing transformations of the input is reduced, and the accuracy of detecting distant small targets is improved.
As a further improvement of the invention, a multi-modal iterative mutual attention fusion module is provided. An iteration mechanism is used to capture the correlation between the features of the two modalities and to improve the accuracy of aligning the two modal features.
The beneficial effects of the invention are as follows: by combining the advantages of the lidar and camera sensors, the invention realizes detection of distant small targets in the automatic driving field, can accurately identify and locate distant vehicles, pedestrians, cyclists and other targets, and can be applied in real scenes.
Drawings
FIG. 1 is a framework diagram of iterative voxel-image attention fusion;
FIG. 2 is a structure diagram of voxel graph feature filtering;
FIG. 3 is a diagram of the deep aggregation pyramid ResNet module.
Detailed Description
The invention discloses a three-dimensional target detection method for automatic driving, in particular an automatic-driving multi-modal three-dimensional target detection method based on iterative voxel-image attention fusion.
The invention can reduce the loss of spatial information that occurs when the input is reduced in dimension and converted to feature layers, and enhances the 3D detection accuracy for distant small targets.
As shown in fig. 1 to 3, based on the features extracted by the two branches, the method effectively fuses multi-scale image features and voxel features through multi-modal iterative attention, improves the 3D detection accuracy for distant small targets, and specifically comprises the following steps:
step one: in the voxel feature extraction branch, a graph convolution network with an attention mechanism is used to extract multi-scale voxel features with more local neighborhood and context information.
The first step comprises the following steps:
step 1: voxelization is performed. The points in the range of depth, height and width (D, H, W) are expressed as the depth, height and width V of the voxels d *V h *V w Dividing into voxels, keeping the number of points in each voxel not to exceed T, obtaining the voxel characteristics by adopting a random downsampling method, wherein T is algebra and refers to the number of points in each voxel.
Step 2: the voxel features construct a spatial map. Constructing a graph G (P, E) based on sparse voxel characteristics, wherein G (P, E) is a structure of an F-dimensional voxel set with N voxels, and the vertex P= { P 1 ,p 2 ,...,p N }∈R 3 The set of edges is represented asThe neighbor set of vertex i is N (i) = { j: (i, j) ∈E }. U.S. { i }. Input a group of voxel bitsSign v= { V 1 ,v 2 ,...,v N }, v is i ∈R F Vertex i ε P. The voxel map feature filter learns a function f: r is R F →R K Mapping the input voxel feature V to obtain a new group of image voxel features V '= { V' 1 ,v′ 2 ,...,v′ N }, v' i ∈R K The new map voxel feature V' maintains a spatial structural relationship between the original voxel features.
Step 3: the neighbor features are learned using an attention mechanism. The invention constructs an attention mechanism alpha to filter learning features, and the obtained features are focused on related neighbor features. Attention weighting of neighbor verticesj∈N(i),Δp ij =p j -p i ,Δv ij =Mg(v j )-Mg(v i ) Wherein->Mg uses a multi-layer perceptron as a feature mapping function. Δp ij The method is beneficial to filtering the disordered neighbors to obtain meaningful neighbor relations. Deltav ij The pilot filter assigns more attention to similar neighbor voxels. The attention mechanism of the present invention α (Δp ij ,Δh ij )=M α (Δp ij +Δh ij ) Where +is the join operation, M α Representing the applied multi-layer perceptron. In addition, the attention weights of the vertex i are normalized to process neighbors of different spatial scales, and the kth feature channel is as follows: />The final output of the voxel feature V' is expressed as +.>Wherein->Representing Hadamard product, b i ∈R K Is a learnable bias. Pooling the aggregate features v' on the vertices of the output voxel map by means of a map i =pooling{v′ j : j ε N (V') } the invention obtains new voxel map features V= { V "" 1 ,v″ 2 ,...,v″ N }. Transforming the sparse voxel feature V into a conventional 4D dense tensor of size c×d '×h' ×w ', where D' =d/V d ,H′=H/v h ,W′=W/v W
Step 4: and generating spatial attention by utilizing the spatial relation of the features and connecting the global features to make up global information. The invention uses the voxel V k The two operations of average pooling and maximum pooling are used along the channel to jointly obtain an effective feature space descriptor, and the space response is U avg And U max ∈R 1×D′×H×w . Convolving the two descriptor connections through a standard convolution layer to generate the voxel 3D spatial attention M of the present invention s ∈R 1×D′×H×W ,M s (V)=σ-(f 7×7×7 (AvgPool(V);MaxPool(V)))=σ(f 7×7×7 (U arg ;Y max ) Where σ represents a sigmoid function, f 7×7×7 A convolution operation representing a filter size of 7 x 7 represents the importance of each characteristic channel representing a filter size voxel. V is the output by spatial attention asWherein->Representing element-by-element multiplication. To supplement global information, the entire scene information is aggregated into global attention blocks using a max-pooling layer. Then a C x 1 feature of global information is obtained and expanded to the size C x D ' x H ' x W ' to connect with V to obtain the final voxel map feature. And finally gradually downsampling the voxel characteristics after 3D sparse convolution, wherein the voxel characteristics obtain a 2D aerial view characteristic map of the area proposal network through 4 times downsampling.
Step two: the image feature extraction branches adopt a densely connected 2D convolution network multi-level superposition aggregation to aggregate shallower and deeper layers, and a pyramid superposition structure is introduced to aggregate multi-scale image features so as to learn finer multi-depth features.
The second step comprises:
step A1: the convolved blocks are subjected to multi-depth aggregation. In order to preserve hierarchical information, the invention builds a simple aggregate form on the basic block of ResNet-50, improving its depth of information and efficiency of delivery. The characteristics in each stage of the network are aggregated, the output of one aggregation node is returned to the backbone network, downsampled and then fed back to the backbone network as input. The invention fuses the blocks of each stage with the subsequent blocks, and fuses together the aggregation nodes with the same depth. Depth aggregation Tn for depth n is represented as:wherein N represents a polymerization operation, and A and B are defined as +.>Where C represents a convolution operation. These nodes select important information through training to maintain the same scale output consistent with the input dimension. Although an aggregation node adopts any block or layer structure, the invention adopts a structure of 1×1 convolution and one BN layer and one nonlinear active layer, which avoids the excessive complexity of the aggregation structure.
Step A2: and obtaining the multi-scale characteristics of the image by adopting a multi-level aggregation pyramid. The invention introduces pyramid (FPN) concept on the basis of multi-depth aggregation, and extracts features from bottom to top. C in the three-stage FPN model of the invention 3 、C 4 And C 5 The output profiles of the third, fourth and fifth stages of ResNet-50, respectively. C according to the structure of ResNet-50 3 Is C 4 2 times, C 4 Is C 5 Is 2 times as large as the above. The top-level feature map F5 of the FPN of the invention is represented by C 5 From the 1 x 1 kernel convolution operation, each high-level feature map of the FPN is then up-sampled 2 times and added toA1 x 1 convolution kernel underlying ResNet-50 hierarchical feature map is used to construct a lower level feature map of the FPN. The calculations of F5, F4 and F3 of the final outputs are expressed asWherein conv 1×1 Is a convolution kernel of 1×1, upsamples 2 To increase the size of the feature map by a factor of four.
Step three: based on the features extracted by the first two branches, multi-mode iterative mutual attention fusion is used for fusing multi-scale image features and voxel features, and finally region proposal and classification regression are carried out based on the multi-mode features to realize 3D target detection.
The third step comprises the following steps:
step B1: the invention applies fusion on the features of different levels, so that the point cloud features can have high-level semantic information from the image. First, the present invention extracts features of four sized voxel convolution layers and projects them into a bird's eye view of the corresponding size. For example, given the coordinates of a point of (H/32, W/32) size of the bird's eye view, the present invention projects it onto an image feature map of (H/32, W/32) size. The invention combines the concept of cross-correlation of signal processing with an attention mechanism, and designs a cross-correlation attention feature fusion module to fuse information of different mode features extracted by two branches. The point cloud aerial view and the image features of the corresponding scale are respectively expressed as Vi and Gi, a fusion feature F is obtained through bitwise addition, and the fusion feature F is respectively calculated with the point cloud features and the image features to obtain the attention weight M p (F)、M i (F) A. The invention relates to a method for producing a fibre-reinforced plastic composite Wherein weight M p (F)、M i (F) Is calculated as (1) Wherein->Representing addition, & lt + & gt>Representing element-by-element multiplication. Respectively combining the fusion features F with a matrix W p And W is i Multiplying and converting feature space and adding bias b i And b p Performing feature space translation, the algorithm is optimized by W p 、W i 、b i And b p The fused features are aligned with the feature space of the point cloud aerial view and the image. Then, obtaining a cross-correlation value R of Vi and Gi and a fusion characteristic F through a hyperbolic tangent function, and normalizing the related function value by softmax to obtain an attention weight M p (F) And M i (F) A. The invention relates to a method for producing a fibre-reinforced plastic composite Is converted into [0,1 ] by the normalized cross-correlation value R]Real numbers in between, on the other hand, the extraction of deep semantics is better facilitated. Attention fusion based on multiple modes is expressed asWherein F is E R C×H×W Is a fusion feature->Representing feature integration. The present invention chooses a sum by element as the initial integral for simplicity. As input to the attention module, the initial fusion quality may profoundly influence the final fusion weight and in order for the fusion scheme to acquire context awareness. The present invention adopts a straightforward approach, namely to use a further attention module to fuse input features. The invention refers to the two-stage method as multi-mode iterative attention fusion, initial integration +.>Rewritten as +.>
Step B2: region proposal network and classification regression learning network based on CenterPoint realize 3D target detection.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (13)

1. A three-dimensional object detection method for automatic driving, comprising the steps of:
step one: in the voxel feature extraction branch, extracting multi-scale voxel features with local neighborhood and context information by using a graph convolution network with an attention mechanism;
step two: in the image feature extraction branch, adopting a densely connected 2D convolution network with multi-level stacked aggregation to aggregate shallower and deeper layers, and introducing a pyramid stacking structure to aggregate multi-scale image features;
step three: based on the features extracted by the voxel feature extraction branch and the image feature extraction branch, fusing the multi-scale image features and the voxel features through multi-modal iterative mutual attention fusion, and finally carrying out region proposal and classification regression based on the multi-modal features to realize 3D target detection.
2. The method of claim 1, wherein the first step comprises the steps of:
step 1, voxelization is carried out: the points within the depth, height and width range (D, H, W) are divided into voxels of size $V_d \times V_h \times V_w$, the number of points in each voxel is kept at no more than T, where T is the upper limit on the number of points per voxel, and the voxel features are obtained by random downsampling;
step 2, a spatial graph is constructed from the voxel features: a graph G(P, E) is built from the sparse voxel features, wherein the vertex set is P and the edge set is E; the neighbor set of vertex i is N(i); a set of voxel features V is input, and the voxel graph feature filter learns a function $f: \mathbb{R}^F \rightarrow \mathbb{R}^K$ that maps the input voxel features V to a new set of graph voxel features V', and the new graph voxel features V' preserve the spatial structural relationships among the original voxel features;
step 3: the learned features are filtered by an attention mechanism α, so that the obtained features focus on relevant neighbor features;
step 4: spatial attention is generated from the spatial relationships of the features, and global features are concatenated to supplement global information.
3. The three-dimensional object detection method according to claim 2, wherein in the step 3, the attention weight of a neighbor vertex is $a_{ij} = \alpha(\Delta p_{ij}, \Delta v_{ij})$ for $j \in N(i)$, with $\Delta p_{ij} = p_j - p_i$ and $\Delta v_{ij} = Mg(v_j) - Mg(v_i)$, wherein Mg uses a multi-layer perceptron as the feature mapping function; $\Delta p_{ij}$ helps filter unordered neighbors to obtain meaningful neighbor relationships; $\Delta v_{ij}$ guides the filter to assign more attention to similar neighbor voxels; the attention mechanism is $\alpha(\Delta p_{ij}, \Delta v_{ij}) = M_\alpha(\Delta p_{ij} \oplus \Delta v_{ij})$, where $\oplus$ is the concatenation operation and $M_\alpha$ denotes the applied multi-layer perceptron.
4. The method according to claim 2, wherein in the step 3, the attention weights of vertex i are normalized over its neighborhood to handle neighbors of different spatial scales, with the k-th feature channel normalized as $\hat{a}_{ijk} = \operatorname{softmax}_{j \in N(i)}(a_{ijk})$; the final output voxel feature is expressed as $v'_i = \sum_{j \in N(i)} \hat{a}_{ij} \odot Mg(v_j) + b_i$, wherein $\odot$ denotes the Hadamard product and $b_i \in \mathbb{R}^K$ is a learnable bias; the aggregated features on the vertices of the output voxel graph are pooled as $v''_i = \operatorname{pooling}\{v'_j : j \in N(i)\}$ to obtain new voxel graph features $V'' = \{v''_1, v''_2, \dots, v''_N\}$; the sparse voxel features $V''$ are transformed into a conventional 4D dense tensor of size $C \times D' \times H' \times W'$, where $D' = D / V_d$, $H' = H / V_h$, $W' = W / V_w$.
5. The three-dimensional object detection method according to claim 2, wherein in the step 4, average pooling and max pooling are applied along the channel dimension of the voxel features to jointly obtain an effective spatial feature descriptor, with spatial responses $U_{avg}$ and $U_{max} \in \mathbb{R}^{1 \times D' \times H' \times W'}$; the two descriptors are concatenated and convolved by a standard convolution layer to generate the voxel 3D spatial attention $M_s \in \mathbb{R}^{1 \times D' \times H' \times W'}$, $M_s(V) = \sigma(f^{7 \times 7 \times 7}([\operatorname{AvgPool}(V); \operatorname{MaxPool}(V)])) = \sigma(f^{7 \times 7 \times 7}([U_{avg}; U_{max}]))$, where $\sigma$ denotes the sigmoid function and $f^{7 \times 7 \times 7}$ denotes a convolution operation with a $7 \times 7 \times 7$ filter; the output after spatial attention is $\tilde{V} = M_s(V) \otimes V$, wherein $\otimes$ denotes element-wise multiplication; to supplement global information, the entire scene information is aggregated into a global attention block using a max-pooling layer; a $C \times 1$ feature of global information is then obtained, expanded to size $C \times D' \times H' \times W'$ and concatenated with $\tilde{V}$ to obtain the final voxel graph features; finally, the voxel features are progressively downsampled by 3D sparse convolutions, and after 4× downsampling yield the 2D bird's-eye-view feature map for the region proposal network.
6. The three-dimensional object detection method according to claim 1, wherein the second step comprises the steps of:
step A1: performing multi-depth aggregation on the convolution blocks;
step A2: and obtaining the multi-scale characteristics of the image by adopting a multi-level aggregation pyramid.
7. The method according to claim 6, wherein in the step A1, the features in each stage of the network are aggregated, the output of an aggregation node is returned to the backbone network, downsampled, and fed back into the backbone as input; the blocks of each stage are fused with the subsequent blocks, and aggregation nodes of the same depth are fused together; the depth aggregation $T_n$ for depth n is expressed through an aggregation operation N over terms A and B, which are composed of convolution operations C; the nodes select important information through training so that the output keeps the same scale, consistent with the input dimensions.
8. The method according to claim 6, wherein in the step A2, in the three-stage FPN model, $C_3$, $C_4$ and $C_5$ are the output feature maps of the third, fourth and fifth stages of ResNet-50, respectively; according to the structure of ResNet-50, $C_3$ is 2 times the size of $C_4$, and $C_4$ is 2 times the size of $C_5$; the top-level feature map $F_5$ of the FPN is obtained from $C_5$ by a $1 \times 1$ kernel convolution operation; each higher-level feature map of the FPN is then upsampled by a factor of 2 and added to the $1 \times 1$-convolved ResNet-50 feature map of the next lower level to construct the lower-level feature map of the FPN.
9. The method according to claim 8, wherein in the step A2, the final outputs $F_5$, $F_4$ and $F_3$ are computed as $F_5 = \operatorname{conv}_{1 \times 1}(C_5)$, $F_4 = \operatorname{conv}_{1 \times 1}(C_4) + \operatorname{upsample}_2(F_5)$, $F_3 = \operatorname{conv}_{1 \times 1}(C_3) + \operatorname{upsample}_2(F_4)$, wherein $\operatorname{conv}_{1 \times 1}$ is a $1 \times 1$ convolution kernel and $\operatorname{upsample}_2$ enlarges the feature map by a factor of 2 in each spatial dimension.
10. The three-dimensional object detection method according to claim 1, wherein the step three comprises the steps of:
step B1: extracting the features of the four voxel convolution layers of different sizes and projecting them into a bird's-eye view of the corresponding size; fusing the information of the different-modality features extracted by the two branches through a cross-correlation attention feature fusion module; the point cloud bird's-eye-view features and the image features at the corresponding scale are denoted $V_i$ and $G_i$, respectively, a fused feature F is obtained by element-wise addition, and F is correlated with the point cloud features and the image features to obtain the attention weights $M_p(F)$ and $M_i(F)$; the fused feature F is multiplied by the matrices $W_p$ and $W_i$ to transform the feature space, and the biases $b_p$ and $b_i$ are added to translate the feature space; by optimizing $W_p$, $W_i$, $b_i$ and $b_p$, the fused feature is aligned with the feature spaces of the point cloud bird's-eye view and the image; then the cross-correlation values R between $V_i$, $G_i$ and the fused feature F are obtained through a hyperbolic tangent function, and the correlation values are normalized by softmax to obtain the attention weights $M_p(F)$ and $M_i(F)$; the normalized cross-correlation values R are converted into real numbers in [0, 1]; the multi-modal attention fusion is expressed in terms of the fused feature F and a feature integration operation;
step B2: 3D target detection is realized with a region proposal network and a classification-regression learning network based on CenterPoint.
11. The method according to claim 10, wherein in the step B1, a further attention module is used to fuse the input features, this two-stage method is called multi-modal iterative attention fusion, and the initial integration is rewritten as the output of the first attention fusion stage.
12. A three-dimensional object detection system for autopilot, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the three-dimensional object detection method of any one of claims 1-11 when called by the processor.
13. A computer-readable storage medium, characterized by: the computer readable storage medium stores a computer program configured to implement the steps of the three-dimensional object detection method of any one of claims 1-11 when invoked by a processor.
CN202310694348.4A 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving Pending CN116665185A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310694348.4A CN116665185A (en) 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310694348.4A CN116665185A (en) 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving

Publications (1)

Publication Number Publication Date
CN116665185A true CN116665185A (en) 2023-08-29

Family

ID=87716922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310694348.4A Pending CN116665185A (en) 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving

Country Status (1)

Country Link
CN (1) CN116665185A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118314488A (en) * 2024-06-11 2024-07-09 合肥工业大学 Extra-high voltage transformer station space-earth multi-scale re-decision target detection method



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination