CN114821505A - Multi-view 3D target detection method, memory and system based on aerial view - Google Patents

Multi-view 3D target detection method, memory and system based on aerial view

Info

Publication number
CN114821505A
Authority
CN
China
Prior art keywords
features
view
module
bird
aerial view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210501805.9A
Other languages
Chinese (zh)
Inventor
陈远鹏
张军良
赵天坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hozon New Energy Automobile Co Ltd
Original Assignee
Hozon New Energy Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hozon New Energy Automobile Co Ltd filed Critical Hozon New Energy Automobile Co Ltd
Priority to CN202210501805.9A priority Critical patent/CN114821505A/en
Publication of CN114821505A publication Critical patent/CN114821505A/en
Priority to PCT/CN2022/114418 priority patent/WO2023216460A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bird's-eye-view-based multi-view 3D target detection method, memory, and system. The method comprises the following steps: encoding the multi-view pictures with a residual network and a feature pyramid to obtain multi-scale features; mapping the multi-scale features to the bird's eye view through a mapping relation to obtain bird's-eye-view features; randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector into the subspaces to obtain initialization features; performing a first residual connection and regularization on the initialization features; combining the features after the first residual connection and regularization with the bird's-eye-view features using a second multi-head attention mechanism module to obtain learning features; and performing a second residual connection and regularization on the learning features, then outputting the target detection category with a first feedforward neural network module and the size of the target detection box with a second feedforward neural network module.

Description

Multi-view 3D target detection method, memory and system based on aerial view
Technical Field
The invention relates to the field of automatic driving, in particular to a target detection algorithm.
Background
In the field of automatic driving, 3D target detection from visual information has long been a challenge for low-cost autonomous driving systems. Two approaches are commonly used in the art. The first builds the detection pipeline on 2D computations: a target detection pipeline designed for 2D tasks is used to predict 3D information such as target pose and velocity, without regard to the 3D scene structure or the sensor configuration. This approach usually requires many post-processing steps to fuse the predictions of the different cameras and to remove redundant bounding boxes, so the post-processing is complex and a compromise between performance and efficiency is often needed. The second approach uses 3D reconstruction to generate a pseudo-lidar point cloud from the camera images, integrating more 3D information into the detection pipeline; these inputs are then treated as data acquired directly from a 3D sensor and fed to a 3D target detection method. This can effectively improve 3D detection accuracy, but it suffers from compounding errors: when the depth values are predicted incorrectly, the accuracy of 3D target detection is often negatively affected.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-view 3D object detection method based on a bird's eye view, which comprises the following steps:
coding the multi-view picture by utilizing a residual error network and a characteristic pyramid to obtain multi-scale characteristics;
mapping the multi-scale features to the aerial view through a mapping relation to obtain aerial view features;
randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector into the plurality of subspaces to obtain an initialization characteristic;
performing first residual error connection and regularization on the initialization features;
combining the features after the first residual error connection and regularization with the aerial view features by using a second multi-head attention mechanism module to obtain learning features; and
and performing second residual error connection and regularization on the learning features, and outputting the target detection category by using a first feedforward neural network module and outputting the size of a target detection frame by using a second feedforward neural network module.
In an embodiment, the step of encoding the multi-view picture by using the residual error network and the feature pyramid to obtain the multi-scale feature includes:
the residual error network extracts features of the multi-view image and performs up-sampling to obtain a plurality of layers of features which are sequentially arranged from a bottom layer to a high layer; and
and accumulating the multilayer characteristics output by the residual error network according to the characteristic mapping map by the characteristic pyramid, and outputting the multi-scale characteristics.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain the aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the step of outputting the target detection class with the first feedforward neural network module and outputting the size of the target detection box with the second feedforward neural network module includes:
performing supervised learning on the first feedforward neural network through a loss module associated with a target detection category to obtain the target detection category;
and performing supervised learning on the second feedforward neural network through a loss module associated with a target detection frame to obtain the size of the target detection frame.
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
The invention also provides a computer readable storage medium on which computer instructions are stored, which when executed perform the bird's eye view-based multi-perspective 3D object detection method of the invention.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of running on the processor, and the processor executes the computer instructions to execute the multi-view 3D target detection method based on the aerial view.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises an encoding module, an aerial view characteristic acquisition module and a conversion decoding module.
And the coding module is used for coding the multi-view picture to obtain the multi-scale characteristics.
And the aerial view characteristic acquisition module is used for mapping the multi-scale characteristics to the aerial view through a mapping relation to obtain the aerial view characteristics.
And the conversion decoding module comprises an initial module and a learning module.
The initial module comprises:
and the first multi-head attention mechanism is used for constructing a plurality of subspaces, projecting the query vectors into the plurality of subspaces and outputting the spliced features of the plurality of initialized subspaces, namely the initialized features.
The first residual connecting module is used for performing identity mapping according to the query vector and the initialized features and outputting features after the first residual connection; and
the first regularization module is used for regularizing the features after the first residual error connection to obtain features after the first regularization;
the learning module includes:
the second multi-head attention mechanism module is used for combining the regularized features with the aerial view features to obtain learning features;
the second residual error connection module is used for performing identity mapping on the learning features and outputting the features after the second residual error connection;
the second regularization module is used for regularizing the features after the second residual error connection to obtain features after the second regularization;
the first feed-forward neural network outputs a target detection category under the supervision and learning of a loss module associated with the target detection category according to the features after the second regularization; and
and the second feedforward neural network outputs the size of the target detection frame under the supervision and learning of a loss module associated with the target detection frame according to the features after the second regularization.
In one embodiment, the encoding module includes a residual network and a feature pyramid.
And the residual error network is used for extracting features of the multi-view picture and performing up-sampling to obtain a plurality of layers of features which are sequentially arranged from the bottom layer to the high layer.
And the characteristic pyramid is used for accumulating the multilayer characteristics according to the characteristic mapping graph and outputting the multi-scale characteristics.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain the aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
The bird's-eye-view-based multi-view 3D target detection method and system of the invention have highly beneficial technical effects. First, objects keep their physical size when projected onto the bird's eye view, so the size variation is small compared with projection onto the RGB image plane. Second, objects in the bird's eye view occupy distinct regions of space, which avoids occlusion problems. Third, in road scenes objects usually lie on the ground and vary little in the vertical direction, so positions in the bird's eye view are more favorable for obtaining accurate three-dimensional bounding boxes. Compared with single-view camera input, the multi-view 3D detection algorithm can effectively exploit the relations between images from multiple viewpoints and improve feature fusion, which in turn improves detection accuracy.
In other words, compared with a monocular detection algorithm, the method of the invention fuses multi-view images, obtains richer features, and handles the truncation problem of a single camera well; compared with working in the image-view space, it transfers the features into a bird's-eye-view (BEV) vector space, which handles the overlap between multiple viewing angles well; moreover, because both the multiple viewing angles and the bird's-eye-view features are fully considered, the target detection algorithm achieves superior detection performance.
Drawings
The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
FIG. 1 shows an overall architecture of a bird's eye view-based 3D object detection algorithm according to an embodiment of the invention;
FIG. 2 is a block diagram of an encoding module according to an embodiment of the invention;
fig. 3 illustrates a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention;
fig. 4 shows a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to still another embodiment of the present invention;
FIG. 5 is a block diagram of a translation decode module according to an embodiment of the invention;
FIG. 6 illustrates a multi-headed attention mechanism module implementation diagram according to an embodiment of the invention;
fig. 7 shows a specific structure of a residual connecting module according to an embodiment of the present invention; and
fig. 8 is a flowchart illustrating a multi-view 3D object detection method based on a bird's eye view according to an embodiment of the invention.
Detailed Description
The detailed features and advantages of the present invention are described in detail in the detailed description which follows, and will be sufficient for anyone skilled in the art to understand the technical content of the present invention and to implement the present invention, and the related objects and advantages of the present invention will be easily understood by those skilled in the art from the description, claims and drawings disclosed in the present specification.
The invention provides a 3D target detection method and system based on a bird's-eye view, which are used for carrying out feature fusion on multi-view pictures and carrying out 3D target detection based on the bird's-eye view.
Fig. 1 shows an overall architecture of a bird's eye view-based 3D object detection algorithm according to an embodiment of the invention. The whole algorithm framework comprises an encoding module (Encoder)101, a Bird-eye-view Feature acquisition module (Bird-eye-view Feature)102 and a transformation decoding module (transform Decoder) 103.
The input of the whole bird's-eye-view-based 3D target detection network architecture is the multi-view pictures. These can come from, for example, six cameras: a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera. The output of the entire network architecture is the category of each target in a 3D box and the size of the 3D box.
The coding module includes a residual Network (Res-Net) and a Feature Pyramid Network. The residual network extracts features from the multi-view pictures to obtain multi-layer features. The feature pyramid fuses the features of the different layers (for example, fusing lower-level and higher-level features) to obtain multi-scale features. The feature pyramid serves to strengthen the semantic information of the higher-level features and the localization details of the lower-level features.
Fig. 2 shows a schematic structural diagram of the encoding module according to an embodiment of the present invention. The encoding module upsamples the more abstract, semantically stronger high-level feature maps and laterally connects them to the features of the preceding level, so that the high-level features are enhanced while the localization details of the bottom layers are well utilized. This network structure also copes with targets of different sizes to be detected, in particular the difficulty of detecting small targets.
As can be seen from fig. 2, the encoding module includes a residual Network (Res-Net)201 and a Feature Pyramid (Feature Pyramid Network) 202.
The residual error network (Res-Net)201 is used for extracting features from the multi-view map and performing upsampling to obtain multiple layers of features which are sequentially arranged from a bottom layer to a high layer.
The Feature Pyramid (FPN) 202 accumulates the multi-layer features output by the residual error Network according to the Feature mapping map, and outputs the multi-scale features.
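The patent text does not give an implementation of the encoder; the following is a minimal PyTorch-style sketch of the residual-network-plus-feature-pyramid idea described above. The class name SimpleFPN, the ResNet-50 backbone, the channel widths, and the nearest-neighbour upsampling are illustrative assumptions, not the patent's actual design.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SimpleFPN(nn.Module):
    """Hypothetical encoder: ResNet backbone plus a top-down feature pyramid."""
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50()        # randomly initialised backbone
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        stage_channels = [256, 512, 1024, 2048]       # ResNet-50 stage widths
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in stage_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in stage_channels])

    def forward(self, images):                        # images: (B, 3, H, W) per camera view
        x = self.stem(images)
        feats = []
        for stage in self.stages:                     # bottom-up: multi-layer features
            x = stage(x)
            feats.append(x)
        # Top-down path: upsample the more semantic higher level and add the
        # laterally connected lower level, as described for the feature pyramid.
        outs = [self.lateral[-1](feats[-1])]
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(outs[0], size=feats[i].shape[-2:], mode="nearest")
            outs.insert(0, self.lateral[i](feats[i]) + up)
        return [conv(o) for conv, o in zip(self.smooth, outs)]   # multi-scale features
```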
The bird's-eye view feature acquisition module is an important module of the invention, and the network structure of the bird's-eye view feature acquisition module completes feature conversion from an image space to a bird's-eye view space.
Fig. 3 shows a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention. The bird's-eye view feature acquisition module inputs the multi-scale features output in the Feature Pyramid (FPN) of the coding module, maps the multi-scale features to the bird's-eye view through a mapping relation, and outputs the bird's-eye view features (BEV features).
The method for mapping the multi-scale features to the aerial view and outputting the aerial view features through the mapping relationship mainly comprises the following steps of: firstly, compressing the multi-scale features along the vertical direction, and simultaneously keeping the dimension of the horizontal direction to obtain compressed aerial view features with different scales (301); then, through resampling, converting into a polar coordinate system, and obtaining aerial view features with the same dimension (namely, predicting a group of features along the depth axis direction in the polar coordinate) (302); and then, downsampling the bird's-eye view features with the same dimension size to reduce the dimension to obtain the bird's-eye view features (303) with the reduced dimension so as to adapt to the input dimension of the conversion decoding module.
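As a rough illustration of steps 301 to 303, the sketch below compresses the vertical image axis, predicts a set of features along the depth axis of a polar grid (one ray per image column), warps the polar grid onto a Cartesian bird's-eye-view grid, and downsamples. All sizes, the assumed 90-degree field of view per camera, and the use of grid sampling for the polar-to-Cartesian resampling are assumptions for illustration only, not the patent's concrete implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVMappingPolar(nn.Module):
    """Hypothetical mapping of Fig. 3: vertical compression -> polar rays -> Cartesian BEV."""
    def __init__(self, channels=256, depth_bins=64, bev_channels=32, bev_size=100):
        super().__init__()
        # One ray of depth_bins cells is predicted per image column (1x1 conv over columns).
        self.expand = nn.Conv1d(channels, bev_channels * depth_bins, kernel_size=1)
        self.reduce = nn.Conv2d(bev_channels, bev_channels, 3, stride=2, padding=1)
        self.depth_bins, self.bev_channels, self.bev_size = depth_bins, bev_channels, bev_size

    def polar_to_cartesian_grid(self, batch, device):
        # For each Cartesian BEV cell, the (depth, azimuth) polar cell to sample from.
        xs = torch.linspace(-1.0, 1.0, self.bev_size, device=device)   # lateral axis
        zs = torch.linspace(0.0, 1.0, self.bev_size, device=device)    # forward axis
        x = xs.view(1, -1).expand(self.bev_size, self.bev_size)
        z = zs.view(-1, 1).expand(self.bev_size, self.bev_size)
        radius = torch.sqrt(x ** 2 + z ** 2).clamp(max=1.0) * 2.0 - 1.0   # depth coordinate
        azimuth = torch.atan2(x, z) / (math.pi / 2)                       # assumes a 90-degree view
        grid = torch.stack((azimuth, radius), dim=-1)                     # (H, W, 2) in [-1, 1]
        return grid.unsqueeze(0).expand(batch, -1, -1, -1)

    def forward(self, multi_scale_feats):             # list of (B, C, H_i, W_i) feature maps
        bev_per_scale = []
        for feat in multi_scale_feats:
            b = feat.shape[0]
            column = feat.mean(dim=2)                 # compress vertically, keep the width
            rays = self.expand(column).view(b, self.bev_channels, self.depth_bins, -1)
            grid = self.polar_to_cartesian_grid(b, feat.device)
            # Resample the polar (depth x azimuth) features onto the Cartesian BEV grid,
            # giving bird's-eye-view features of the same dimension for every scale.
            bev_per_scale.append(F.grid_sample(rays, grid, mode="bilinear", align_corners=False))
        bev = torch.stack(bev_per_scale).sum(0)       # fuse the equal-dimension features
        return self.reduce(bev)                       # downsampled BEV feature
```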
Fig. 4 shows a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to still another embodiment of the present invention. The bird's-eye view feature acquisition module inputs the multi-scale features output in the Feature Pyramid (FPN) of the coding module, maps the multi-scale features to the bird's-eye view through a mapping relation, and outputs the bird's-eye view features (BEV features).
The method for mapping the multi-scale features to the aerial view and outputting the aerial view features through the mapping relationship mainly comprises the following steps of: firstly, compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly carrying out dimension transformation to obtain aerial view features with the same dimension (401); dimensionality is reduced on the aerial view features through resampling (namely, downsampling), and the aerial view features (402) with the dimensionality reduced are obtained so as to adapt to the input dimensionality of the conversion decoding module.
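For the variant of Fig. 4 (steps 401 and 402), a corresponding sketch differs from the previous one only in that the compressed column features are transformed directly to equal-dimension bird's-eye-view features, without the polar resampling step. Again, the class name, sizes, and the use of simple interpolation as the "direct dimension transformation" are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVMappingDirect(nn.Module):
    """Hypothetical mapping of Fig. 4: vertical compression and a direct dimension
    transform to equal-sized BEV features, with no polar resampling."""
    def __init__(self, channels=256, depth_bins=64, bev_channels=32):
        super().__init__()
        self.expand = nn.Conv1d(channels, bev_channels * depth_bins, kernel_size=1)
        self.reduce = nn.Conv2d(bev_channels, bev_channels, 3, stride=2, padding=1)
        self.depth_bins, self.bev_channels = depth_bins, bev_channels

    def forward(self, multi_scale_feats, bev_size=100):    # list of (B, C, H_i, W_i)
        bev_per_scale = []
        for feat in multi_scale_feats:
            b = feat.shape[0]
            column = feat.mean(dim=2)                       # compress vertically, keep the width
            bev = self.expand(column).view(b, self.bev_channels, self.depth_bins, -1)
            # Direct dimension transformation: resize every scale to one common
            # BEV grid size instead of resampling through a polar coordinate system.
            bev = F.interpolate(bev, size=(bev_size, bev_size),
                                mode="bilinear", align_corners=False)
            bev_per_scale.append(bev)
        bev = torch.stack(bev_per_scale).sum(0)             # equal-dimension features, fused
        return self.reduce(bev)                             # downsampled BEV feature
```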
Fig. 5 is a schematic diagram of the architecture of the conversion decoding module according to an embodiment of the present invention. The conversion decoding module is mainly responsible for decoding. It first randomly initializes a target query vector (Query, i.e., the target feature), then constructs a plurality of subspaces through a first multi-head self-attention module and projects the features of the target query vector into these subspaces, so that information from different aspects can be used comprehensively and the model can view the same problem from different angles, yielding a better result. A residual connection and regularization module (Add & Norm) then deepens the network and accelerates its convergence. Next, a second multi-head attention mechanism combines the target features with the bird's-eye-view features output by the encoder. Finally, after another residual connection and regularization module (Add & Norm), two feedforward neural network modules output the final target detection category and the 3D box (3D bounding box, including the center point coordinates).
As shown in fig. 5, the transform decoding module mainly includes an initial module 501 and a learning module 502. The initial block 501 includes a first Multi-Head Attention mechanism block (Multi-Head Self-Attention), a first residual concatenation block (Add), and a first regularization block (Norm). The learning module 502 includes a second Multi-Head Attention mechanism module (Multi-Head Self-Attention), a second residual concatenation module (Add) and a second regularization module (Norm), a first feed-forward neural network (FFN) (i.e., a target detection class feed-forward neural network), and a second feed-forward neural network (FFN) (i.e., a target detection box feed-forward neural network).
FIG. 6 illustrates an implementation of the multi-head attention mechanism module according to an embodiment of the invention, where MatMul denotes matrix multiplication, Scale denotes the scaling operation, and Softmax denotes the softmax function. The first multi-head attention mechanism module constructs a plurality of subspaces, projects the features of the target query vector (Query) into these subspaces, and outputs the concatenated features of the initialized subspaces, i.e., the initialization features. The second multi-head attention mechanism module combines the output of the first regularization module with the BEV features and outputs the concatenated features of the subspaces fused with the BEV features, i.e., the learning features.
The output of the multi-head attention mechanism module is shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where √d_k is the scaling factor; dividing by √d_k prevents the result from becoming too large, and d_k is the dimension of the K (Key) vector; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; the softmax output is then multiplied by the matrix V to obtain the weighted sum; and T denotes the transpose of the matrix K.
For the first multi-head attention mechanism module, since it is used for initialization, the three matrices of Q vector, K vector, and V vector are all from the same input, i.e. the three matrices of Q vector, K vector, and V vector are all equal to the query vector (Q vector).
For the second multi-head attention mechanism module, the Q vector and the K vector are the same, the Q vector and the K vector are both the features after the first residual error connection and regularization, and the V vector is a bird's-eye view feature (BEV feature), so that the learning function is embodied.
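Equation (1) and the subspace projection of the two attention modules can be illustrated with the following sketch. The head count, feature widths, and query count are assumptions; note also that, for the second module, the sketch feeds the bird's-eye-view features as both K and V so that the matrix shapes in equation (1) are compatible, whereas the text above assigns K to the decoder features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Equation (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)   # Q K^T / sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)              # weighted sum of V

class MultiHeadAttention(nn.Module):
    """Projects Q, K, V into several subspaces (heads), applies equation (1) in each
    subspace, and concatenates (splices) the per-head results."""
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split(self, x):                              # (B, L, D) -> (B, heads, L, D/heads)
        b, l, _ = x.shape
        return x.view(b, l, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, q, k, v):
        heads = scaled_dot_product_attention(self._split(self.q_proj(q)),
                                             self._split(self.k_proj(k)),
                                             self._split(self.v_proj(v)))
        b, _, l, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, l, -1)   # splice the subspaces back together
        return self.out_proj(concat)

# First module (initialisation): self-attention where Q = K = V = the random queries.
queries = torch.randn(1, 900, 256)                    # 900 queries of width 256 (assumed sizes)
init_features = MultiHeadAttention()(queries, queries, queries)

# Second module (learning): cross-attention against the BEV features. This sketch uses
# the flattened BEV map for both K and V so the matrix shapes in equation (1) match.
bev_features = torch.randn(1, 50 * 50, 256)           # flattened 50x50 BEV grid (assumed)
learned_features = MultiHeadAttention()(init_features, bev_features, bev_features)
```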
The residual connecting module has the function of transferring information deeper and enhancing the fitting capability of the model.
The regularization module (Norm) is typically Layer Normalization, which converts the inputs of the neurons in each layer to have the same mean and variance. As the number of network layers grows, the values computed through many layers may become too large or too small or have a large variance, which makes learning unstable and model convergence very slow; regularizing the values computed at each layer therefore improves model performance and accelerates network convergence.
According to an embodiment of the present invention, the input of the first residual concatenation module (Add) is a Query vector (Query) and an initialization feature, and the feature after the first residual concatenation is output after identity mapping, and a specific structure of the residual concatenation module is shown in fig. 7. And a first regularization module (Norm) regularizes the features after the first residual error connection to obtain the features after the first regularization.
According to an embodiment of the present invention, the input of the second residual error concatenation module (Add) is the learned feature output by the second multi-head attention mechanism module, the feature after the second residual error concatenation is output after the identity mapping, and a specific structure of the residual error concatenation module is shown in fig. 7. And a second regularization module (Norm) regularizes the features after the second residual error connection to obtain features after the second regularization.
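The Add & Norm blocks described above amount to a residual addition followed by layer normalization; a minimal sketch follows (the class name AddNorm and the feature width are assumptions).

```python
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalisation (an Add & Norm block)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, residual, sublayer_out):
        # Identity-map the block input, add the sub-layer output, then normalise so the
        # activations of every layer keep a comparable mean and variance.
        return self.norm(residual + sublayer_out)
```

In this sketch the first Add & Norm would be called with the query vector and the initialization features, and the second with the features entering the second attention module and the learning features.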
The output of the second regularization module is divided into two paths to be respectively output to a first feed-forward neural network FFN (target detection class feed-forward neural network) and a second feed-forward neural network FFN (target detection frame feed-forward neural network). The first feed-forward neural network outputs a final target detection class. The second feedforward neural network outputs the size of the target detection box (3D bounding box) and the center coordinates of the target detection box.
The expression of the first or second feedforward neural network is shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

Equation (2) is the expression of the feedforward neural network (FFN) structure, which mainly performs a linear transformation on the regularized features. Here x is the output of the second regularization module, W_1 and W_2 are the weights of the activation function, and b_1 and b_2 are the bias weights; the max function takes the larger of 0 and xW_1 + b_1. The first feedforward neural network outputs the target detection category under the supervised learning of a loss module associated with the detection-category feedforward neural network. The second feedforward neural network obtains the size and center coordinates of the 3D box under the supervised learning of a loss module associated with the detection-box feedforward neural network.
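Equation (2) is a two-layer perceptron with a ReLU in between; the following sketch shows the two parallel heads. The hidden width, the class count, and the six box parameters (centre x, y, z plus width, length, height) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FFNHead(nn.Module):
    """Equation (2): FFN(x) = max(0, x W1 + b1) * W2 + b2, i.e. Linear -> ReLU -> Linear."""
    def __init__(self, d_model=256, hidden=512, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden)         # W1, b1
        self.fc2 = nn.Linear(hidden, out_dim)         # W2, b2

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))      # max(0, x W1 + b1) W2 + b2

# Two parallel heads applied to the second-regularised features (sizes are assumptions):
cls_head = FFNHead(out_dim=10)    # first FFN: target detection category logits
box_head = FFNHead(out_dim=6)     # second FFN: box centre (x, y, z) and size (w, l, h)
```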
Fig. 8 is a flowchart illustrating the bird's-eye-view-based multi-view 3D target detection method according to an embodiment of the invention. The method comprises the following steps (an end-to-end sketch combining the modules sketched above follows the list):
801: coding the multi-view picture by utilizing a residual error network and a characteristic pyramid to obtain multi-scale characteristics;
802: mapping the multi-scale features to the aerial view through a mapping relation to obtain aerial view features;
803: randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector into the plurality of subspaces to obtain an initialization characteristic;
804: performing first residual error connection and regularization on the initialization features;
805: combining the features after the first residual error connection and regularization with the aerial view features by using a second multi-head attention mechanism module to obtain learning features; and
806: and performing second residual error connection and regularization on the learning features, and outputting the target detection category by using a first feedforward neural network module and outputting the size of a target detection frame by using a second feedforward neural network module.
In an embodiment, the step of encoding the multi-view picture by using the residual error network and the feature pyramid to obtain the multi-scale feature includes:
the residual error network extracts features of the multi-view image and performs up-sampling to obtain a plurality of layers of features which are sequentially arranged from a bottom layer to a high layer; and
and accumulating the multilayer characteristics output by the residual error network according to the characteristic mapping map by the characteristic pyramid, and outputting the multi-scale characteristics.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the step of outputting the target detection category with the first feedforward neural network module and outputting the size of the target detection box with the second feedforward neural network module includes the following steps (an illustrative loss sketch follows these steps):
performing supervised learning on the first feedforward neural network through a loss module associated with a target detection category to obtain the target detection category;
and performing supervised learning on the second feedforward neural network through a loss module associated with a target detection frame to obtain the size of the target detection frame.
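The patent does not name the concrete loss modules. The sketch below assumes a cross-entropy loss for the category head and an L1 loss for the box head, and assumes the query-to-ground-truth matching has already been performed; these choices are illustrative, not the patent's.

```python
import torch.nn.functional as F

def detection_losses(cls_logits, box_preds, gt_labels, gt_boxes):
    """Illustrative loss modules for the two FFN heads: cross-entropy supervises the
    category head and an L1 loss supervises the 3D-box head (assumed choices)."""
    cls_loss = F.cross_entropy(cls_logits, gt_labels)   # loss module for the detection category
    box_loss = F.l1_loss(box_preds, gt_boxes)           # loss module for the detection box
    return cls_loss + box_loss
```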
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
The invention also provides a computer readable storage medium on which computer instructions are stored, which when executed perform the bird's eye view-based multi-perspective 3D object detection method of the invention.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of running on the processor, and the processor executes the computer instructions to execute the multi-view 3D target detection method based on the aerial view.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises an encoding module, an aerial view characteristic acquisition module and a conversion decoding module.
And the coding module is used for coding the multi-view picture to obtain the multi-scale characteristics.
And the aerial view characteristic acquisition module is used for mapping the multi-scale characteristics to the aerial view through a mapping relation to obtain the aerial view characteristics.
And the conversion decoding module comprises an initial module and a learning module.
The initial module comprises:
and the first multi-head attention mechanism is used for constructing a plurality of subspaces, projecting the query vectors into the plurality of subspaces and outputting the spliced features of the plurality of initialized subspaces, namely the initialized features.
The first residual connecting module is used for performing identity mapping according to the query vector and the initialized features and outputting features after the first residual connection; and
the first regularization module is used for regularizing the features after the first residual error connection to obtain features after the first regularization;
the learning module includes:
the second multi-head attention mechanism module is used for combining the regularized features with the aerial view features to obtain learning features;
the second residual error connection module is used for performing identity mapping on the learning features and outputting the features after the second residual error connection;
the second regularization module is used for regularizing the features after the second residual error connection to obtain features after the second regularization;
the first feed-forward neural network outputs a target detection category under the supervision and learning of a loss module associated with the target detection category according to the features after the second regularization; and
and the second feedforward neural network outputs the size of the target detection frame under the supervision and learning of a loss module associated with the target detection frame according to the features after the second regularization.
In one embodiment, the encoding module includes a residual network and a feature pyramid.
And the residual error network is used for extracting features of the multi-view picture and performing up-sampling to obtain a plurality of layers of features which are sequentially arranged from the bottom layer to the high layer.
And the characteristic pyramid is used for accumulating the multilayer characteristics according to the characteristic mapping graph and outputting the multi-scale characteristics.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain the aerial view features with the same dimension;
and downsampling the aerial view features with the same dimensionality to obtain the aerial view features with reduced dimensionalities.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
Compared with a monocular detection algorithm, the method has the advantages that fusion is performed based on the multi-vision images, more features can be obtained, and the problem of truncation of the monocular can be well solved; compared with an image visual angle space, the method transfers the characteristics into a bird's-eye view (BEV) vector space, and can well solve the problem of multi-visual angle superposition; in addition, the detection effect of the target detection algorithm is superior due to the full consideration of the multi-view angle and the aerial view characteristics.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing disclosure is by way of example only, and is not intended to limit the present application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown; rather, various steps may be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereon. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system". Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
A computer readable signal medium may comprise a propagated data signal with computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable signal medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of the present application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, conventional procedural languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any kind of network, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), in a cloud computing environment, or as a service such as software as a service (SaaS).
Additionally, the order in which elements and sequences of the processes described herein are processed, the use of alphanumeric characters, or the use of other designations, is not intended to limit the order of the processes and methods described herein, unless explicitly claimed. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the foregoing description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
The terms and expressions which have been employed herein are used as terms of description and not of limitation. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.
Also, it should be noted that although the present invention has been described with reference to the current specific embodiments, it should be understood by those skilled in the art that the above embodiments are merely illustrative of the present invention, and various equivalent changes or substitutions may be made without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments be included within the scope of the claims of the present application.

Claims (19)

1. A multi-view 3D object detection method based on an aerial view is characterized by comprising the following steps:
coding the multi-view picture by utilizing a residual error network and a characteristic pyramid to obtain multi-scale characteristics;
mapping the multi-scale features to the aerial view through a mapping relation to obtain aerial view features;
randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector to the plurality of subspaces to obtain an initialization characteristic;
performing first residual error connection and regularization on the initialization features;
combining the features after the first residual error connection and regularization with the aerial view features by using a second multi-head attention mechanism module to obtain learning features; and
and performing second residual error connection and regularization on the learning features, and outputting the target detection category by using a first feedforward neural network module and outputting the size of a target detection frame by using a second feedforward neural network module.
2. The bird's eye view-based multi-view 3D object detection method of claim 1, wherein the step of encoding the multi-view picture using the residual network and the feature pyramid to obtain the multi-scale features comprises:
the residual error network extracts features of the multi-view image and performs up-sampling to obtain a plurality of layers of features which are sequentially arranged from a bottom layer to a high layer; and
and accumulating the multilayer characteristics output by the residual error network according to the characteristic mapping map by the characteristic pyramid, and outputting the multi-scale characteristics.
3. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the step of mapping the multi-scale features to the bird's eye view via a mapping relationship to obtain the bird's eye view features comprises:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features of different scales and converting them into a polar coordinate system, to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
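A minimal sketch of one way the mapping of claim 3 could be realised in PyTorch follows; the vertical mean-pooling, the reinterpretation of the channel axis as depth bins, the polar grid size, and the average-pooling used for downsampling are all illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn.functional as F

def to_polar_bev(feat, depth_bins=128, azimuth_bins=128):
    """feat: (views, C, H, W) image-plane features for one scale."""
    # Compress along the vertical (image-height) axis, keep the horizontal axis.
    compressed = feat.mean(dim=2)                              # (views, C, W)
    # Resample every scale onto one common polar-style (depth x azimuth) grid.
    same_dim = F.interpolate(compressed.unsqueeze(1),          # (views, 1, C, W)
                             size=(depth_bins, azimuth_bins),
                             mode="bilinear", align_corners=False)
    return same_dim.squeeze(1)                                 # (views, depth_bins, azimuth_bins)

scales = [torch.randn(6, 256, 56, 100), torch.randn(6, 256, 28, 50)]
bev_same_dim = torch.stack([to_polar_bev(s) for s in scales]).mean(dim=0)
bev_reduced = F.avg_pool2d(bev_same_dim, kernel_size=2)        # downsampled aerial view features
```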
4. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the step of mapping the multi-scale features to the bird's eye view via a mapping relationship to obtain the bird's eye view features comprises:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, and directly performing a dimension transformation to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
5. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the relationship between the input and output of the first multi-head attention mechanism module is as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
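For illustration, a minimal PyTorch sketch of equation (1) as used by the first multi-head attention mechanism module: the query vector is randomly initialised and K = V = Q are projected into several subspaces (heads). The query count, embedding width, and head count below are assumed values, not taken from the claim.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)             # normalized into a probability distribution
    return weights @ v                                    # Attention(Q, K, V)

num_queries, embed_dim, num_heads = 900, 256, 8
queries = torch.randn(num_queries, embed_dim)             # randomly initialized query vector
# Project into num_heads subspaces; here K and V are the same as Q.
heads = queries.view(num_queries, num_heads, embed_dim // num_heads).transpose(0, 1)
per_head = scaled_dot_product_attention(heads, heads, heads)
init_features = per_head.transpose(0, 1).reshape(num_queries, embed_dim)  # concatenated subspaces
```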
6. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the relationship between the input and the output of the second multi-headed attention mechanism module is as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the aerial view features; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning features.
7. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the first or second feed-forward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feed-forward neural network is shown in formula (2):
FFN(x) = max(0, xW1 + b1) * W2 + b2    (2)
where x is the features after the second residual connection and regularization, W1 and W2 are weights, b1 and b2 are biases, and the max function takes the larger of 0 and xW1 + b1.
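A minimal PyTorch sketch of formula (2) for illustration; the feature dimension and hidden width below are assumed values.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, i.e. Linear -> ReLU -> Linear."""
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.linear1 = nn.Linear(dim, hidden)   # W1, b1
        self.linear2 = nn.Linear(hidden, dim)   # W2, b2

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

ffn = FeedForward()
out = ffn(torch.randn(900, 256))   # x: features after the second residual connection and regularization
```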
8. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein outputting the object detection category with a first feed-forward neural network module and outputting the size of the object detection box with a second feed-forward neural network module comprises:
performing supervised learning on the first feedforward neural network through a loss module associated with a target detection category to obtain the target detection category;
and performing supervised learning on the second feedforward neural network through a loss module associated with a target detection frame to obtain the size of the target detection frame.
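One possible reading of claim 8 in code, for illustration only: the category head is supervised by a classification loss and the box head by a regression loss. The specific losses (cross-entropy, L1), the 7-value box encoding, and the tensor shapes are assumptions, not taken from the claim.

```python
import torch
import torch.nn.functional as F

num_queries, num_classes = 900, 10
cls_logits = torch.randn(num_queries, num_classes, requires_grad=True)  # first feed-forward network output
box_preds = torch.randn(num_queries, 7, requires_grad=True)             # second feed-forward network output

cls_targets = torch.randint(0, num_classes, (num_queries,))             # target detection categories
box_targets = torch.randn(num_queries, 7)                               # target detection box sizes

# Each head is trained under its own associated loss module.
loss = F.cross_entropy(cls_logits, cls_targets) + F.l1_loss(box_preds, box_targets)
loss.backward()
```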
9. The bird's eye view-based multi-view 3D object detection method of claim 1, characterized in that the multi-view pictures come respectively from six cameras of an autonomous vehicle: a front camera, a front-left camera, a front-right camera, a rear camera, a rear-left camera, and a rear-right camera.
10. A computer readable storage medium having stored thereon computer instructions, wherein the computer instructions are executable to perform the bird's eye view based multi-perspective 3D object detection method according to any one of claims 1 to 9.
11. A bird's-eye-view-based multi-perspective 3D object detection system comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the bird's-eye-view-based multi-perspective 3D object detection method of any one of claims 1 to 9.
12. A multi-perspective 3D object detection system based on an aerial view, the system comprising:
an encoding module for encoding the multi-view pictures to obtain multi-scale features;
an aerial view feature acquisition module for mapping the multi-scale features to an aerial view through a mapping relationship to obtain aerial view features; and
a conversion decoding module comprising an initial module and a learning module;
the initial module comprises:
a first multi-head attention mechanism module for constructing a plurality of subspaces, projecting the query vectors into the plurality of subspaces, and outputting the features concatenated from the plurality of initialized subspaces, namely the initialization features;
a first residual connection module for performing an identity mapping based on the query vectors and the initialization features and outputting features after the first residual connection; and
a first regularization module for regularizing the features after the first residual connection to obtain features after the first regularization;
the learning module includes:
a second multi-head attention mechanism module for combining the features after the first regularization with the aerial view features to obtain learning features;
a second residual connection module for performing an identity mapping on the learning features and outputting features after the second residual connection;
a second regularization module for regularizing the features after the second residual connection to obtain features after the second regularization;
a first feed-forward neural network that outputs the target detection category from the features after the second regularization, under the supervised learning of a loss module associated with the target detection category; and
a second feed-forward neural network that outputs the size of the target detection frame from the features after the second regularization, under the supervised learning of a loss module associated with the target detection frame.
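As an illustration of how the initial and learning modules of claim 12 compose, below is a minimal PyTorch sketch built from stock layers; the use of nn.MultiheadAttention and nn.LayerNorm as the attention and regularization modules, the single-linear-layer heads, and all dimensions are assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn

class ConversionDecodingModule(nn.Module):
    def __init__(self, dim=256, heads=8, num_classes=10, box_dim=7):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # first multi-head attention module
        self.norm1 = nn.LayerNorm(dim)                                          # first regularization module
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # second multi-head attention module
        self.norm2 = nn.LayerNorm(dim)                                          # second regularization module
        self.cls_head = nn.Linear(dim, num_classes)                             # first feed-forward network: category
        self.box_head = nn.Linear(dim, box_dim)                                 # second feed-forward network: box size

    def forward(self, queries, bev_features):
        init, _ = self.self_attn(queries, queries, queries)          # initial module: self-attention over queries
        x = self.norm1(queries + init)                                # first residual connection + regularization
        learned, _ = self.cross_attn(x, bev_features, bev_features)  # learning module: combine with aerial view features
        x = self.norm2(x + learned)                                   # second residual connection + regularization
        return self.cls_head(x), self.box_head(x)

decoder = ConversionDecodingModule()
queries = torch.randn(1, 900, 256)        # randomly initialized query vectors
bev = torch.randn(1, 128 * 128, 256)      # flattened aerial view features
cls_out, box_out = decoder(queries, bev)
```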
13. The bird's eye-view based multi-perspective 3D object detection system of claim 12, wherein the encoding module comprises:
a residual network for extracting features from the multi-view pictures and up-sampling them to obtain multi-layer features arranged sequentially from a bottom layer to a top layer; and
a feature pyramid for accumulating the multi-layer features according to the feature maps and outputting the multi-scale features.
14. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the mapping relationship is:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features of different scales and converting them into a polar coordinate system, to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
15. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the mapping relationship is:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, and directly performing a dimension transformation to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
16. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the first multi-head attention mechanism module has a relationship of input and output as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
17. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the relationship of the inputs and outputs of the second multi-headed attention mechanism module is as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the aerial view features; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning features.
18. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the first or second feed-forward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feed-forward neural network is shown in equation (2):
FFN(x) = max(0, xW1 + b1) * W2 + b2    (2)
where x is the features after the second residual connection and regularization, W1 and W2 are weights, b1 and b2 are biases, and the max function takes the larger of 0 and xW1 + b1.
19. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the multi-perspective pictures come respectively from six cameras of an autonomous vehicle: a front camera, a front-left camera, a front-right camera, a rear camera, a rear-left camera, and a rear-right camera.
CN202210501805.9A 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view Pending CN114821505A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210501805.9A CN114821505A (en) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view
PCT/CN2022/114418 WO2023216460A1 (en) 2022-05-09 2022-08-24 Aerial view-based multi-view 3d object detection method, memory and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210501805.9A CN114821505A (en) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view

Publications (1)

Publication Number Publication Date
CN114821505A true CN114821505A (en) 2022-07-29

Family

ID=82514245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210501805.9A Pending CN114821505A (en) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view

Country Status (2)

Country Link
CN (1) CN114821505A (en)
WO (1) WO2023216460A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880555A (en) * 2023-02-07 2023-03-31 北京百度网讯科技有限公司 Target detection method, model training method, device, equipment and medium
CN116561534A (en) * 2023-07-10 2023-08-08 苏州映赛智能科技有限公司 Method and system for improving accuracy of road side sensor based on self-supervision learning
WO2023216460A1 (en) * 2022-05-09 2023-11-16 合众新能源汽车股份有限公司 Aerial view-based multi-view 3d object detection method, memory and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390714A1 (en) * 2020-06-11 2021-12-16 Toyota Research Institute, Inc. Producing a bird's eye view image from a two dimensional image
CN111832655B (en) * 2020-07-16 2022-10-14 四川大学 Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN113011317B (en) * 2021-03-16 2022-06-14 青岛科技大学 Three-dimensional target detection method and detection device
CN113658100A (en) * 2021-07-16 2021-11-16 上海高德威智能交通系统有限公司 Three-dimensional target object detection method and device, electronic equipment and storage medium
CN113610044B (en) * 2021-08-19 2022-02-15 清华大学 4D millimeter wave three-dimensional target detection method and system based on self-attention mechanism
CN114218999A (en) * 2021-11-02 2022-03-22 上海交通大学 Millimeter wave radar target detection method and system based on fusion image characteristics
CN114821505A (en) * 2022-05-09 2022-07-29 合众新能源汽车有限公司 Multi-view 3D target detection method, memory and system based on aerial view

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023216460A1 (en) * 2022-05-09 2023-11-16 合众新能源汽车股份有限公司 Aerial view-based multi-view 3d object detection method, memory and system
CN115880555A (en) * 2023-02-07 2023-03-31 北京百度网讯科技有限公司 Target detection method, model training method, device, equipment and medium
CN116561534A (en) * 2023-07-10 2023-08-08 苏州映赛智能科技有限公司 Method and system for improving accuracy of road side sensor based on self-supervision learning
CN116561534B (en) * 2023-07-10 2023-10-13 苏州映赛智能科技有限公司 Method and system for improving accuracy of road side sensor based on self-supervision learning

Also Published As

Publication number Publication date
WO2023216460A1 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
Alonso et al. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
Yang et al. Pixor: Real-time 3d object detection from point clouds
Yuan et al. RGGNet: Tolerance aware LiDAR-camera online calibration with geometric deep learning and generative model
Yin et al. Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields
CN114821505A (en) Multi-view 3D target detection method, memory and system based on aerial view
CN108230235B (en) Disparity map generation system, method and storage medium
KR20230070253A (en) Efficient 3D object detection from point clouds
US11948310B2 (en) Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator
US11321859B2 (en) Pixel-wise residual pose estimation for monocular depth estimation
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115273002A (en) Image processing method, device, storage medium and computer program product
CN115578516A (en) Three-dimensional imaging method, device, equipment and storage medium
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
CN117037141A (en) 3D target detection method and device and electronic equipment
CN116844129A (en) Road side target detection method, system and device for multi-mode feature alignment fusion
CN114648639B (en) Target vehicle detection method, system and device
Camaioni et al. EpiDepth: a real-time monocular dense-depth estimation pipeline using generic image rectification
CN116486038A (en) Three-dimensional construction network training method, three-dimensional model generation method and device
US20230377180A1 (en) Systems and methods for neural implicit scene representation with dense, uncertainty-aware monocular depth constraints
US11908202B2 (en) Method and system of using a global transformer for efficient modeling of global context in point clouds
Alaba et al. Multi-sensor fusion 3D object detection for autonomous driving
CN115861601A (en) Multi-sensor fusion sensing method and device
CN111062479B (en) Neural network-based rapid model upgrading method and device
Zhang et al. A Vision-Centric Approach for Static Map Element Annotation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 314500 988 Tong Tong Road, Wu Tong Street, Tongxiang, Jiaxing, Zhejiang

Applicant after: United New Energy Automobile Co.,Ltd.

Address before: 314500 988 Tong Tong Road, Wu Tong Street, Tongxiang, Jiaxing, Zhejiang

Applicant before: Hezhong New Energy Vehicle Co.,Ltd.
