CN116665185A - Three-dimensional target detection method, system and storage medium for automatic driving - Google Patents

Three-dimensional target detection method, system and storage medium for automatic driving

Info

Publication number
CN116665185A
CN116665185A (application CN202310694348.4A)
Authority
CN
China
Prior art keywords
voxel
features
feature
attention
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310694348.4A
Other languages
Chinese (zh)
Inventor
刘仪婷
李兴通
薛俊
杨易堃
钱星铭
肖昊
陶重犇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University of Science and Technology
Original Assignee
Suzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University of Science and Technology filed Critical Suzhou University of Science and Technology
Priority to CN202310694348.4A priority Critical patent/CN116665185A/en
Publication of CN116665185A publication Critical patent/CN116665185A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides a three-dimensional target detection method, system and storage medium for automatic driving, comprising the following steps: step one: in the voxel feature extraction branch, extracting multi-scale voxel features with local neighborhood and context information by using a graph convolution network with an attention mechanism; step two: in the image feature extraction branch, adopting a densely connected 2D convolution network with multi-level stacked aggregation to aggregate shallower and deeper layers, and introducing a pyramid stacking structure to aggregate multi-scale image features; step three: based on the features extracted by the voxel feature extraction branch and the image feature extraction branch, fusing the multi-scale image features and the voxel features through multi-modal iterative mutual attention fusion, and finally carrying out region proposal and classification regression based on the multi-modal features to realize 3D target detection. The invention has the advantage that it can accurately identify and locate distant vehicles, pedestrians, cyclists and other targets, and can be applied to real scenes.

Description

Three-dimensional target detection method, system and storage medium for automatic driving
Technical Field
The present invention relates to the field of automatic driving technology, and in particular to a three-dimensional object detection method, system, and storage medium for automatic driving.
Background
In recent years, object detection technology has developed rapidly and has been widely used in the fields of automatic driving and robotics. 3D object detection is the task of identifying and locating objects in a three-dimensional scene, but it still faces significant challenges. Recent approaches mainly utilize image and lidar data. Point cloud data carry rich geometric information, and images carry rich semantic information. Because of the complementarity of point clouds and images, some methods project the point cloud to various views of a compact representation, then acquire the features of the views with a mature 2D convolutional neural network, and finally fuse them with the image features. Based on such fusion, some methods screen the point cloud with a 2D detector and perform 3D object detection only on the points within the frustum of a 2D image object. However, such methods suffer from spatial misalignment during projection, especially for distant objects. Another solution is to fuse images at different resolution levels with point cloud features, but it still suffers from point sparsity, particularly for distant objects.
Disclosure of Invention
In order to solve the problems in the background art, the invention considers that multi-scale image and point cloud feature information should be fused together, designs a deeply fused multi-modal iterative mutual attention fusion, and uses an iterative mutual attention mechanism to align voxel features and image features.
The invention provides a target detection algorithm based on iterative voxel-image attention fusion.
As a further improvement of the present invention, a voxel graph feature filter (VGFF) is provided. It assigns appropriate attention weights to neighboring voxels according to dynamically learned graph voxel features, lets the features selectively attend to the most relevant parts, and extracts multi-scale voxel features with more local neighborhood and context information.
As a further improvement of the invention, a densely connected multi-scale image feature aggregation MHA-ResNet module is provided. Multi-scale aggregated features are obtained so that the encoder receives richer information, the loss of spatial information caused by dimension-reducing transformations of the input is reduced, and the accuracy of detecting distant small targets is improved.
As a further improvement of the invention, a multi-modal iterative mutual attention fusion module is provided. An iteration mechanism is used to capture the correlation between the features of the two modalities and to improve the accuracy of aligning the two modal features.
The beneficial effects of the invention are as follows: by combining the advantages of the lidar and camera sensors, the invention realizes detection of distant small targets in the automatic driving field, can accurately identify and locate distant vehicles, pedestrians, cyclists and other targets, and can be applied in real scenes.
Drawings
FIG. 1 is a framework diagram of iterative voxel-image attention fusion;
FIG. 2 is a structure diagram of voxel graph feature filtering;
FIG. 3 is a diagram of the deep aggregation pyramid ResNet module.
Detailed Description
The invention discloses a three-dimensional target detection method for automatic driving, in particular an automatic-driving multi-modal three-dimensional target detection method based on iterative voxel-image attention fusion.
The invention can reduce the loss of spatial information that occurs when the input is reduced in dimension and converted to feature layers, and enhances the 3D detection accuracy for distant small targets.
As shown in fig. 1 to 3, based on the features extracted by the two branches, the method effectively fuses multi-scale image features and voxel features through multi-modal iterative attention, improves the 3D detection accuracy for distant small targets, and specifically comprises the following steps:
step one: in the voxel feature extraction branch, a graph convolution network with an attention mechanism is used to extract multi-scale voxel features with more local neighborhood and context information.
The first step comprises the following steps:
step 1: voxelization is performed. The points in the range of depth, height and width (D, H, W) are expressed as the depth, height and width V of the voxels d *V h *V w Dividing into voxels, keeping the number of points in each voxel not to exceed T, obtaining the voxel characteristics by adopting a random downsampling method, wherein T is algebra and refers to the number of points in each voxel.
Step 2: the voxel features construct a spatial map. Constructing a graph G (P, E) based on sparse voxel characteristics, wherein G (P, E) is a structure of an F-dimensional voxel set with N voxels, and the vertex P= { P 1 ,p 2 ,...,p N }∈R 3 The set of edges is represented asThe neighbor set of vertex i is N (i) = { j: (i, j) ∈E }. U.S. { i }. Input a group of voxel bitsSign v= { V 1 ,v 2 ,...,v N }, v is i ∈R F Vertex i ε P. The voxel map feature filter learns a function f: r is R F →R K Mapping the input voxel feature V to obtain a new group of image voxel features V '= { V' 1 ,v′ 2 ,...,v′ N }, v' i ∈R K The new map voxel feature V' maintains a spatial structural relationship between the original voxel features.
Step 3: the neighbor features are learned using an attention mechanism. The invention constructs an attention mechanism alpha to filter learning features, and the obtained features are focused on related neighbor features. Attention weighting of neighbor verticesj∈N(i),Δp ij =p j -p i ,Δv ij =Mg(v j )-Mg(v i ) Wherein->Mg uses a multi-layer perceptron as a feature mapping function. Δp ij The method is beneficial to filtering the disordered neighbors to obtain meaningful neighbor relations. Deltav ij The pilot filter assigns more attention to similar neighbor voxels. The attention mechanism of the present invention α (Δp ij ,Δh ij )=M α (Δp ij +Δh ij ) Where +is the join operation, M α Representing the applied multi-layer perceptron. In addition, the attention weights of the vertex i are normalized to process neighbors of different spatial scales, and the kth feature channel is as follows: />The final output of the voxel feature V' is expressed as +.>Wherein->Representing Hadamard product, b i ∈R K Is a learnable bias. Pooling the aggregate features v' on the vertices of the output voxel map by means of a map i =pooling{v′ j : j ε N (V') } the invention obtains new voxel map features V= { V "" 1 ,v″ 2 ,...,v″ N }. Transforming the sparse voxel feature V into a conventional 4D dense tensor of size c×d '×h' ×w ', where D' =d/V d ,H′=H/v h ,W′=W/v W
Step 4: and generating spatial attention by utilizing the spatial relation of the features and connecting the global features to make up global information. The invention uses the voxel V k The two operations of average pooling and maximum pooling are used along the channel to jointly obtain an effective feature space descriptor, and the space response is U avg And U max ∈R 1×D′×H×w . Convolving the two descriptor connections through a standard convolution layer to generate the voxel 3D spatial attention M of the present invention s ∈R 1×D′×H×W ,M s (V)=σ-(f 7×7×7 (AvgPool(V);MaxPool(V)))=σ(f 7×7×7 (U arg ;Y max ) Where σ represents a sigmoid function, f 7×7×7 A convolution operation representing a filter size of 7 x 7 represents the importance of each characteristic channel representing a filter size voxel. V is the output by spatial attention asWherein->Representing element-by-element multiplication. To supplement global information, the entire scene information is aggregated into global attention blocks using a max-pooling layer. Then a C x 1 feature of global information is obtained and expanded to the size C x D ' x H ' x W ' to connect with V to obtain the final voxel map feature. And finally gradually downsampling the voxel characteristics after 3D sparse convolution, wherein the voxel characteristics obtain a 2D aerial view characteristic map of the area proposal network through 4 times downsampling.
Step two: the image feature extraction branches adopt a densely connected 2D convolution network multi-level superposition aggregation to aggregate shallower and deeper layers, and a pyramid superposition structure is introduced to aggregate multi-scale image features so as to learn finer multi-depth features.
The second step comprises:
step A1: the convolved blocks are subjected to multi-depth aggregation. In order to preserve hierarchical information, the invention builds a simple aggregate form on the basic block of ResNet-50, improving its depth of information and efficiency of delivery. The characteristics in each stage of the network are aggregated, the output of one aggregation node is returned to the backbone network, downsampled and then fed back to the backbone network as input. The invention fuses the blocks of each stage with the subsequent blocks, and fuses together the aggregation nodes with the same depth. Depth aggregation Tn for depth n is represented as:wherein N represents a polymerization operation, and A and B are defined as +.>Where C represents a convolution operation. These nodes select important information through training to maintain the same scale output consistent with the input dimension. Although an aggregation node adopts any block or layer structure, the invention adopts a structure of 1×1 convolution and one BN layer and one nonlinear active layer, which avoids the excessive complexity of the aggregation structure.
Step A2: and obtaining the multi-scale characteristics of the image by adopting a multi-level aggregation pyramid. The invention introduces pyramid (FPN) concept on the basis of multi-depth aggregation, and extracts features from bottom to top. C in the three-stage FPN model of the invention 3 、C 4 And C 5 The output profiles of the third, fourth and fifth stages of ResNet-50, respectively. C according to the structure of ResNet-50 3 Is C 4 2 times, C 4 Is C 5 Is 2 times as large as the above. The top-level feature map F5 of the FPN of the invention is represented by C 5 From the 1 x 1 kernel convolution operation, each high-level feature map of the FPN is then up-sampled 2 times and added toA1 x 1 convolution kernel underlying ResNet-50 hierarchical feature map is used to construct a lower level feature map of the FPN. The calculations of F5, F4 and F3 of the final outputs are expressed asWherein conv 1×1 Is a convolution kernel of 1×1, upsamples 2 To increase the size of the feature map by a factor of four.
Step three: based on the features extracted by the first two branches, multi-mode iterative mutual attention fusion is used for fusing multi-scale image features and voxel features, and finally region proposal and classification regression are carried out based on the multi-mode features to realize 3D target detection.
The third step comprises the following steps:
step B1: the invention applies fusion on the features of different levels, so that the point cloud features can have high-level semantic information from the image. First, the present invention extracts features of four sized voxel convolution layers and projects them into a bird's eye view of the corresponding size. For example, given the coordinates of a point of (H/32, W/32) size of the bird's eye view, the present invention projects it onto an image feature map of (H/32, W/32) size. The invention combines the concept of cross-correlation of signal processing with an attention mechanism, and designs a cross-correlation attention feature fusion module to fuse information of different mode features extracted by two branches. The point cloud aerial view and the image features of the corresponding scale are respectively expressed as Vi and Gi, a fusion feature F is obtained through bitwise addition, and the fusion feature F is respectively calculated with the point cloud features and the image features to obtain the attention weight M p (F)、M i (F) A. The invention relates to a method for producing a fibre-reinforced plastic composite Wherein weight M p (F)、M i (F) Is calculated as (1) Wherein->Representing addition, & lt + & gt>Representing element-by-element multiplication. Respectively combining the fusion features F with a matrix W p And W is i Multiplying and converting feature space and adding bias b i And b p Performing feature space translation, the algorithm is optimized by W p 、W i 、b i And b p The fused features are aligned with the feature space of the point cloud aerial view and the image. Then, obtaining a cross-correlation value R of Vi and Gi and a fusion characteristic F through a hyperbolic tangent function, and normalizing the related function value by softmax to obtain an attention weight M p (F) And M i (F) A. The invention relates to a method for producing a fibre-reinforced plastic composite Is converted into [0,1 ] by the normalized cross-correlation value R]Real numbers in between, on the other hand, the extraction of deep semantics is better facilitated. Attention fusion based on multiple modes is expressed asWherein F is E R C×H×W Is a fusion feature->Representing feature integration. The present invention chooses a sum by element as the initial integral for simplicity. As input to the attention module, the initial fusion quality may profoundly influence the final fusion weight and in order for the fusion scheme to acquire context awareness. The present invention adopts a straightforward approach, namely to use a further attention module to fuse input features. The invention refers to the two-stage method as multi-mode iterative attention fusion, initial integration +.>Rewritten as +.>
Step B2: region proposal network and classification regression learning network based on CenterPoint realize 3D target detection.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (13)

1. A three-dimensional object detection method for automatic driving, comprising the steps of:
step one: in the voxel feature extraction branch, extracting multi-scale voxel features with local neighborhood and context information by using a graph convolution network with an attention mechanism;
step two: in the image feature extraction branch, adopting a densely connected 2D convolution network with multi-level stacked aggregation to aggregate shallower and deeper layers, and introducing a pyramid stacking structure to aggregate multi-scale image features;
step three: based on the features extracted by the voxel feature extraction branch and the image feature extraction branch, fusing the multi-scale image features and the voxel features through multi-modal iterative mutual attention fusion, and finally carrying out region proposal and classification regression based on the multi-modal features to realize 3D target detection.
2. The method of claim 1, wherein the first step comprises the steps of:
step 1, voxelization is carried out: the points within the depth, height and width range (D, H, W) are divided into voxels of size $V_d \times V_h \times V_w$, the number of points in each voxel is kept at no more than T, where T is the upper limit on the number of points per voxel, and the voxel features are obtained by random downsampling;
step 2, a spatial graph is constructed from the voxel features: a graph G(P, E) is built from the sparse voxel features, wherein the vertex set is P and the edge set is E; the neighbor set of vertex i is N(i); a set of voxel features V is input, and the voxel graph feature filter learns a function $f: \mathbb{R}^F \rightarrow \mathbb{R}^K$ that maps the input voxel features V to a new set of graph voxel features V', and the new graph voxel features V' preserve the spatial structural relationships among the original voxel features;
step 3: the learned features are filtered by an attention mechanism α, so that the obtained features focus on relevant neighbor features;
step 4: spatial attention is generated from the spatial relationships of the features, and global features are concatenated to supplement global information.
3. The three-dimensional object detection method according to claim 2, wherein in the step 3, the attention weight of a neighbor vertex is $a_{ij} = \alpha(\Delta p_{ij}, \Delta v_{ij})$ for $j \in N(i)$, with $\Delta p_{ij} = p_j - p_i$ and $\Delta v_{ij} = Mg(v_j) - Mg(v_i)$, wherein Mg uses a multi-layer perceptron as the feature mapping function; $\Delta p_{ij}$ helps filter unordered neighbors to obtain meaningful neighbor relationships; $\Delta v_{ij}$ guides the filter to assign more attention to similar neighbor voxels; the attention mechanism is $\alpha(\Delta p_{ij}, \Delta v_{ij}) = M_\alpha(\Delta p_{ij} \oplus \Delta v_{ij})$, where $\oplus$ is the concatenation operation and $M_\alpha$ denotes the applied multi-layer perceptron.
4. The method according to claim 2, wherein in the step 3, the attention weights of vertex i are normalized over its neighborhood to handle neighbors of different spatial scales, with the k-th feature channel normalized as $\hat{a}_{ijk} = \operatorname{softmax}_{j \in N(i)}(a_{ijk})$; the final output voxel feature is expressed as $v'_i = \sum_{j \in N(i)} \hat{a}_{ij} \odot Mg(v_j) + b_i$, wherein $\odot$ denotes the Hadamard product and $b_i \in \mathbb{R}^K$ is a learnable bias; the aggregated features on the vertices of the output voxel graph are pooled as $v''_i = \operatorname{pooling}\{v'_j : j \in N(i)\}$ to obtain new voxel graph features $V'' = \{v''_1, v''_2, \dots, v''_N\}$; the sparse voxel features $V''$ are transformed into a conventional 4D dense tensor of size $C \times D' \times H' \times W'$, where $D' = D / V_d$, $H' = H / V_h$, $W' = W / V_w$.
5. The three-dimensional object detection method according to claim 2, wherein in the step 4, average pooling and max pooling are applied along the channel dimension of the voxel features to jointly obtain an effective spatial feature descriptor, with spatial responses $U_{avg}$ and $U_{max} \in \mathbb{R}^{1 \times D' \times H' \times W'}$; the two descriptors are concatenated and convolved by a standard convolution layer to generate the voxel 3D spatial attention $M_s \in \mathbb{R}^{1 \times D' \times H' \times W'}$, $M_s(V) = \sigma(f^{7 \times 7 \times 7}([\operatorname{AvgPool}(V); \operatorname{MaxPool}(V)])) = \sigma(f^{7 \times 7 \times 7}([U_{avg}; U_{max}]))$, where $\sigma$ denotes the sigmoid function and $f^{7 \times 7 \times 7}$ denotes a convolution operation with a $7 \times 7 \times 7$ filter; the output after spatial attention is $\tilde{V} = M_s(V) \otimes V$, wherein $\otimes$ denotes element-wise multiplication; to supplement global information, the entire scene information is aggregated into a global attention block using a max-pooling layer; a $C \times 1$ feature of global information is then obtained, expanded to size $C \times D' \times H' \times W'$ and concatenated with $\tilde{V}$ to obtain the final voxel graph features; finally, the voxel features are progressively downsampled by 3D sparse convolutions, and after 4× downsampling yield the 2D bird's-eye-view feature map for the region proposal network.
6. The three-dimensional object detection method according to claim 1, wherein the second step comprises the steps of:
step A1: performing multi-depth aggregation on the convolution blocks;
step A2: and obtaining the multi-scale characteristics of the image by adopting a multi-level aggregation pyramid.
7. The method according to claim 6, wherein in the step A1, the features in each stage of the network are aggregated, the output of an aggregation node is returned to the backbone network, downsampled, and fed back into the backbone as input; the blocks of each stage are fused with the subsequent blocks, and aggregation nodes of the same depth are fused together; the depth aggregation $T_n$ for depth n is expressed through an aggregation operation N over terms A and B, which are composed of convolution operations C; the nodes select important information through training so that the output keeps the same scale, consistent with the input dimensions.
8. The method according to claim 6, wherein in the step A2, in the three-stage FPN model, $C_3$, $C_4$ and $C_5$ are the output feature maps of the third, fourth and fifth stages of ResNet-50, respectively; according to the structure of ResNet-50, $C_3$ is 2 times the size of $C_4$, and $C_4$ is 2 times the size of $C_5$; the top-level feature map $F_5$ of the FPN is obtained from $C_5$ by a $1 \times 1$ kernel convolution operation; each higher-level feature map of the FPN is then upsampled by a factor of 2 and added to the $1 \times 1$-convolved ResNet-50 feature map of the next lower level to construct the lower-level feature map of the FPN.
9. The method according to claim 8, wherein in the step A2, the final outputs $F_5$, $F_4$ and $F_3$ are computed as $F_5 = \operatorname{conv}_{1 \times 1}(C_5)$, $F_4 = \operatorname{conv}_{1 \times 1}(C_4) + \operatorname{upsample}_2(F_5)$, $F_3 = \operatorname{conv}_{1 \times 1}(C_3) + \operatorname{upsample}_2(F_4)$, wherein $\operatorname{conv}_{1 \times 1}$ is a $1 \times 1$ convolution kernel and $\operatorname{upsample}_2$ enlarges the feature map by a factor of 2 in each spatial dimension.
10. The three-dimensional object detection method according to claim 1, wherein the step three comprises the steps of:
step B1: extracting the features of the four voxel convolution layers of different sizes and projecting them into a bird's-eye view of the corresponding size; fusing the information of the different-modality features extracted by the two branches through a cross-correlation attention feature fusion module; the point cloud bird's-eye-view features and the image features at the corresponding scale are denoted $V_i$ and $G_i$, respectively, a fused feature F is obtained by element-wise addition, and F is correlated with the point cloud features and the image features to obtain the attention weights $M_p(F)$ and $M_i(F)$; the fused feature F is multiplied by the matrices $W_p$ and $W_i$ to transform the feature space, and the biases $b_p$ and $b_i$ are added to translate the feature space; by optimizing $W_p$, $W_i$, $b_i$ and $b_p$, the fused feature is aligned with the feature spaces of the point cloud bird's-eye view and the image; then the cross-correlation values R between $V_i$, $G_i$ and the fused feature F are obtained through a hyperbolic tangent function, and the correlation values are normalized by softmax to obtain the attention weights $M_p(F)$ and $M_i(F)$; the normalized cross-correlation values R are converted into real numbers in [0, 1]; the multi-modal attention fusion is expressed in terms of the fused feature F and a feature integration operation;
step B2: 3D target detection is realized with a region proposal network and a classification-regression learning network based on CenterPoint.
11. The method according to claim 10, wherein in the step B1, a further attention module is used to fuse the input features, this two-stage method is called multi-modal iterative attention fusion, and the initial integration is rewritten as the output of the first attention fusion stage.
12. A three-dimensional object detection system for autopilot, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the three-dimensional object detection method of any one of claims 1-11 when called by the processor.
13. A computer-readable storage medium, characterized by: the computer readable storage medium stores a computer program configured to implement the steps of the three-dimensional object detection method of any one of claims 1-11 when invoked by a processor.
CN202310694348.4A 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving Pending CN116665185A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310694348.4A CN116665185A (en) 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310694348.4A CN116665185A (en) 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving

Publications (1)

Publication Number Publication Date
CN116665185A true CN116665185A (en) 2023-08-29

Family

ID=87716922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310694348.4A Pending CN116665185A (en) 2023-06-13 2023-06-13 Three-dimensional target detection method, system and storage medium for automatic driving

Country Status (1)

Country Link
CN (1) CN116665185A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118314488A (en) * 2024-06-11 2024-07-09 合肥工业大学 Extra-high voltage transformer station space-earth multi-scale re-decision target detection method



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination