WO2023216460A1 - Multi-view 3D target detection method, memory and system based on a bird's-eye view - Google Patents

Multi-view 3D target detection method, memory and system based on a bird's-eye view

Info

Publication number
WO2023216460A1
WO2023216460A1 (PCT/CN2022/114418, CN2022114418W)
Authority
WO
WIPO (PCT)
Prior art keywords
features
bird's-eye view
target detection
Prior art date
Application number
PCT/CN2022/114418
Other languages
English (en)
French (fr)
Inventor
陈远鹏
张军良
赵天坤
Original Assignee
合众新能源汽车股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 合众新能源汽车股份有限公司 filed Critical 合众新能源汽车股份有限公司
Publication of WO2023216460A1 publication Critical patent/WO2023216460A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the invention relates to the field of automatic driving, and in particular to a target detection algorithm.
  • the present invention provides a multi-view 3D target detection method based on a bird's-eye view.
  • the method includes the following steps:
  • Randomly initialize the query vector, construct multiple subspaces through the first multi-head attention mechanism module, and project the query vector into the multiple subspaces to obtain initialization features
  • a second residual connection and regularization are performed on the learning features, and the first feedforward neural network module is used to output the target detection category and the second feedforward neural network module is used to output the size of the target detection frame.
  • the step of using a residual network and a feature pyramid to encode multi-view images to obtain multi-scale features includes:
  • the residual network extracts features from the multi-view image and performs upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the high layer;
  • the feature pyramid accumulates the multi-layer features output by the residual network according to the feature map and outputs multi-scale features.
  • the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain the bird's-eye view features includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resampling the compressed features into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • in another embodiment, the step of mapping includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly performing a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q
  • √dk is the scaling factor and dk is the dimension of K
  • softmax is the activation function, which normalizes QK^T / √dk into a probability distribution
  • T denotes the transpose of K
  • Attention (Q, K, V) is the output of the first multi-head attention mechanism module, that is, the initialization feature.
  • the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q and K are both the features after the first residual connection and regularization, and V is the bird's-eye view feature
  • dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K
  • Attention (Q, K, V) is the output of the second multi-head attention mechanism module, that is, the learning feature.
  • the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization; its expression is as shown in formula (2): FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
  • x is the feature after the second residual connection and regularization
  • W1 and W2 are the weights of the activation function, and b1 and b2 are the bias weights
  • the max function takes the larger of 0 and xW1 + b1.
  • the steps of using the first feedforward neural network module to output the target detection category and using the second feedforward neural network module to output the size of the target detection frame include:
  • the target detection category is obtained by performing supervised learning on the first feedforward neural network through a loss module associated with the target detection category; and
  • the size of the target detection frame is obtained by performing supervised learning on the second feedforward neural network through a loss module associated with the target detection frame.
  • the multi-view images come from six cameras including the front camera, the left front camera, the right front camera, the rear camera, the left rear camera and the right rear camera of the autonomous vehicle.
  • the present invention also provides a computer-readable storage medium on which computer instructions are stored; when the computer instructions are run, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
  • the present invention also provides a multi-view 3D target detection system based on a bird's-eye view, including a memory and a processor; the memory stores computer instructions that can be run on the processor, and when the processor runs the computer instructions, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
  • the invention also provides a multi-view 3D target detection system based on a bird's-eye view.
  • the system includes a coding module, a bird's-eye view feature acquisition module and a conversion decoding module.
  • the encoding module is used to encode multi-view images to obtain multi-scale features.
  • a bird's-eye view feature acquisition module is used to map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features.
  • Transformation decoding module including initial module and learning module.
  • the initial modules include:
  • the first multi-head attention mechanism module is used to construct multiple subspaces, project query vectors into the multiple subspaces, and output the concatenated features of the multiple initialized subspaces, that is, the initialization features.
  • the first residual connection module performs identity mapping based on the query vector and initialization features, and outputs the features after the first residual connection;
  • the first regularization module regularizes the features after the first residual connection to obtain the features after the first regularization
  • the learning modules include:
  • the second multi-head attention mechanism module is used to combine the regularized features with the bird's-eye view features to obtain learning features
  • the second residual connection module is used to perform identity mapping on the learning features and output the features after the second residual connection;
  • the second regularization module is used to regularize the features after the second residual connection to obtain the features after the second regularization
  • a first feedforward neural network that outputs a target detection category under supervised learning of a loss module associated with the target detection category based on the second regularized features
  • the second feedforward neural network outputs the size of the target detection frame under the supervised learning of the loss module associated with the target detection frame based on the second regularized features.
  • the encoding module includes a residual network and a feature pyramid.
  • the residual network is used to extract features from the multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the high layer.
  • the feature pyramid is used to accumulate the multi-layer features according to the feature map and output multi-scale features.
  • in one embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resample the compressed features into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • in another embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly perform a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q
  • √dk is the scaling factor and dk is the dimension of K
  • softmax is the activation function, which normalizes QK^T / √dk into a probability distribution
  • T denotes the transpose of K
  • Attention (Q, K, V) is the output of the first multi-head attention mechanism module, that is, the initialization feature.
  • the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q and K are both the features after the first residual connection and regularization, and V is the bird's-eye view feature
  • dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K
  • Attention (Q, K, V) is the output of the second multi-head attention mechanism module, that is, the learning feature.
  • the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization; its expression is as shown in formula (2): FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
  • x is the feature after the second residual connection and regularization
  • W1 and W2 are the weights of the activation function, and b1 and b2 are the bias weights
  • the max function takes the larger of 0 and xW1 + b1.
  • the multi-view images come from six cameras including the front camera, the left front camera, the right front camera, the rear camera, the left rear camera and the right rear camera of the autonomous vehicle.
  • the multi-view 3D target detection method and system based on a bird's eye view proposed by the present invention have highly beneficial technical effects.
  • compared with the RGB image plane and the like, objects retain their physical size when projected onto a bird's-eye view and therefore exhibit smaller size differences.
  • objects in the bird's-eye view occupy different spaces, thus avoiding occlusion problems.
  • in road scenes, since objects usually lie on the ground with little variation in vertical position, the bird's-eye view position is more advantageous for obtaining accurate three-dimensional bounding boxes.
  • compared with single-view camera input, the multi-view 3D detection algorithm of the present invention can effectively exploit the relationships between images from multiple viewpoints to improve feature fusion, and thus markedly improve detection accuracy.
  • the present invention fuses multiple visual images and can therefore obtain more features and well solve the truncation problem that occurs with a monocular camera; compared with the image perspective space, the present invention transfers the features into the bird's-eye view (BEV) vector space and can thus handle the problem of multi-view overlap well; in addition, since multi-view and bird's-eye view characteristics are fully considered, the detection performance of the target detection algorithm of the present invention is outstanding.
  • Figure 1 shows the overall architecture of a 3D target detection algorithm based on a bird's eye view according to an embodiment of the present invention
  • Figure 2 shows a schematic structural diagram of a coding module according to an embodiment of the present invention
  • Figure 3 shows the network structure of a bird-eye-view feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention
  • Figure 4 shows the network structure of a bird-eye-view feature acquisition module (Bird-eye-view Feature) according to another embodiment of the present invention
  • Figure 5 shows a schematic architectural diagram of a conversion decoding module according to an embodiment of the present invention
  • Figure 6 shows an implementation diagram of a multi-head attention mechanism module according to an embodiment of the present invention
  • Figure 7 shows the specific structure of the residual connection module according to an embodiment of the present invention.
  • Figure 8 shows a flow chart of a multi-view 3D target detection method based on a bird's eye view according to an embodiment of the present invention.
  • the present invention fuses features of multi-view pictures and performs 3D target detection based on a bird's-eye view, and proposes a 3D target detection method and system based on a bird's-eye view.
  • Figure 1 shows the overall architecture of a 3D target detection algorithm based on a bird's eye view according to an embodiment of the present invention.
  • the entire algorithm architecture includes an encoding module (Encoder) 101, a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) 102, and a Transformer Decoder module (Transformer Decoder) 103.
  • the input of the entire bird's-eye view-based 3D target detection algorithm network architecture is multi-view images.
  • Multi-view pictures can come from six cameras, such as the front camera, the left front camera, the right front camera, the rear camera, the left rear camera, and the right rear camera.
  • the output of the entire network architecture is the category of the object in the 3D frame and the size of the 3D frame.
  • the encoding module includes Res-Net and Feature Pyramid Network.
  • the residual network extracts features from multi-view images and obtains multi-layer features.
  • the feature pyramid fuses features of each layer (for example, fuses low-level and high-level features) to obtain multi-scale features.
  • the function of the feature pyramid is to strengthen the high-level features in the multi-layer features and to enhance the positioning details of the low-level features in the multi-layer features.
  • Figure 2 shows a schematic structural diagram of an encoding module according to an embodiment of the present invention.
  • the function of this encoding module is to upsample the more abstract, semantically stronger high-level feature maps and then laterally connect them to the features of the preceding layer; the high-level features are thus strengthened, and a further benefit is that the positioning details of the bottom layers can be well exploited.
  • moreover, such a network structure can cope with the problems caused by the different sizes of the targets to be detected, especially the difficulty of detecting small targets.
  • the encoding module includes Res-Net 201 and Feature Pyramid Network 202.
  • Residual network (Res-Net) 201 is used to extract features from multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the high layer.
  • Feature Pyramid (FPN, Feature Pyramid Network) 202 accumulates the multi-layer features output by the residual network according to the feature map and outputs multi-scale features.
  • the bird's-eye view feature acquisition module is an important module of the present invention, and its network structure completes the feature conversion from image space to bird's-eye view space.
  • Figure 3 shows the network structure of a bird-eye-view feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention.
  • the input of the bird's-eye view feature acquisition module comes from the multi-scale features output in the feature pyramid (FPN) of the encoding module.
  • the bird's-eye view feature acquisition module maps the multi-scale features to the bird's-eye view through the mapping relationship and outputs the bird's-eye view features (BEV features).
  • this mapping mainly includes the following steps: first, compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales (301); then resample them into a polar coordinate system, to obtain bird's-eye view features of the same dimensions (i.e., a set of features predicted along the depth axis in polar coordinates) (302); finally, down-sample these same-dimension features to reduce their dimensionality (303), so as to match the input dimension of the transformation decoding module.
  • Figure 4 shows the network structure of a bird-eye-view feature acquisition module (Bird-eye-view Feature) according to another embodiment of the present invention.
  • the input of the bird's-eye view feature acquisition module comes from the multi-scale features output in the feature pyramid (FPN) of the encoding module.
  • the bird's-eye view feature acquisition module maps the multi-scale features to the bird's-eye view through the mapping relationship and outputs the bird's-eye view features (BEV features).
  • in this embodiment, the mapping mainly includes the following steps: first, compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly perform a dimension transformation to obtain bird's-eye view features of the same dimensions (401); then, through resampling (i.e., down-sampling), reduce the dimensionality of the bird's-eye view features to obtain reduced-dimension bird's-eye view features (402), so as to match the input dimension of the transformation decoding module.
  • Figure 5 shows a schematic architectural diagram of a conversion decoding module according to an embodiment of the present invention.
  • the main function of the transformation decoding module is decoding.
  • the transformation decoding module first randomly initializes the target query vector (Query) (the target features), then constructs multiple subspaces through the first multi-head attention mechanism (Multi-head self-attention) and projects the features of the target query vector into the multiple subspaces.
  • the purpose of this is to comprehensively utilize information from all aspects, which allows the model to view the same problem from different angles and achieve better results; the residual connection and regularization module (Add&Norm) then deepens the network and accelerates its convergence.
  • subsequently, together with the bird's-eye view features, the features output by the encoder and the target features are well combined through the second multi-head attention mechanism.
  • finally, the target detection category and the 3D box (3D bounding box, including the center-point coordinates) are output through a further residual connection and regularization module (Add&Norm) and two feedforward neural network modules.
  • the conversion decoding module mainly includes an initial module 501 and a learning module 502.
  • the initial module 501 includes the first multi-head attention mechanism module (Multi-Head Self-Attention), the first residual connection module (Add) and the first regularization module (Norm).
  • the learning module 502 includes a second multi-head attention mechanism module (Multi-Head Self-Attention), a second residual connection module (Add), a second regularization module (Norm), a first feedforward neural network (FFN) (that is, the target detection category feedforward neural network) and a second feedforward neural network (FFN) (that is, the target detection frame feedforward neural network).
  • Figure 6 shows an implementation diagram of a multi-head attention mechanism module according to an embodiment of the present invention.
  • MatMul denotes matrix multiplication, Scale denotes the scaling operation, and Softmax denotes the Softmax function.
  • the first multi-head attention mechanism module constructs multiple subspaces, projects the features of the target query vector (Query) into multiple subspaces, and outputs the spliced features of multiple initialized subspaces, that is, initialized features.
  • the second multi-head attention mechanism module combines the output of the first regularization module with the BEV features, and outputs features that are spliced into multiple subspaces after integrating the BEV features, that is, learning features.
  • the output of the multi-head attention mechanism module is as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • the division by the scaling factor √dk prevents the result from becoming too large; dk is the dimension of the K (Key) vector
  • Softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; multiplying the Softmax result by the matrix V then yields the weighted-sum representation
  • T denotes the transpose of matrix K.
  • for the first multi-head attention mechanism module, which is used for initialization, the three matrices Q, K and V all come from the same input, that is, Q, K and V are all equal to the query vector (Q vector).
  • for the second multi-head attention mechanism module, the Q vector and the K vector are the same: both are the features after the first residual connection and regularization, while the V vector is the bird's-eye view feature (BEV feature), which reflects the learning function.
  • the role of the residual connection module is to transfer information deeper and enhance the fitting ability of the model.
  • the regularization module (Norm) network structure usually denotes layer normalization (Layer Normalization), which converts the inputs of the neurons in each layer into features with the same mean and variance.
  • the role of the regularization module is as follows: as the number of network layers increases, the parameters may become too large or too small after multi-layer calculations, or their variance may grow, which leads to anomalies in the learning process and makes the convergence of the model very slow; regularizing the values calculated at each layer therefore improves the performance of the model and accelerates the convergence of the network.
  • the input of the first residual connection module (Add) is the query vector (Query) and the initialization features; after identity mapping, it outputs the features after the first residual connection. The specific structure of the residual connection module is shown in Figure 7.
  • the first regularization module (Norm) regularizes the features after the first residual connection to obtain the features after the first regularization.
  • the input of the second residual connection module is the learning features output by the second multi-head attention mechanism module; after identity mapping, it outputs the features after the second residual connection. The specific structure of the residual connection module is shown in Figure 7.
  • the second regularization module regularizes the features after the second residual connection to obtain the features after the second regularization.
  • the output of the second regularization module is divided into two channels and is output to the first feedforward neural network FFN (target detection category feedforward neural network) and the second feedforward neural network FFN (target detection frame feedforward neural network).
  • the first feedforward neural network outputs the final object detection category.
  • the second feedforward neural network outputs the size of the target detection box (3D bounding box) and the center coordinates of the target detection box.
  • the expression of the first or second feedforward neural network is as shown in formula (2): FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
  • formula (2) expresses the feedforward neural network (FFN) structure, which mainly performs a linear transformation on the regularized features
  • x is the output of the second regularization module
  • W1 and W2 are the weights of the activation function, and b1 and b2 are the bias weights
  • the max function takes the larger of 0 and xW1 + b1.
  • the first feedforward neural network outputs an object detection category under supervised learning of a loss module associated with the object detection category feedforward neural network.
  • the second feedforward neural network obtains the size and center coordinates of the 3D box under the supervised learning of the loss module associated with the target detection frame feedforward neural network.
  • Figure 8 shows a flow chart of a multi-view 3D target detection method based on a bird's eye view according to an embodiment of the present invention. The method includes the following steps:
  • 801: use a residual network and a feature pyramid to encode multi-view images to obtain multi-scale features;
  • 802: map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features;
  • 803: randomly initialize a query vector, construct multiple subspaces through the first multi-head attention mechanism module, and project the query vector into the multiple subspaces to obtain initialization features;
  • 804: perform a first residual connection and regularization on the initialization features;
  • 805: use the second multi-head attention mechanism module to combine the features after the first residual connection and regularization with the bird's-eye view features to obtain learning features; and
  • 806: perform a second residual connection and regularization on the learning features, use the first feedforward neural network module to output the target detection category, and use the second feedforward neural network module to output the size of the target detection frame.
  • the step of using a residual network and a feature pyramid to encode multi-view images to obtain multi-scale features includes:
  • the residual network extracts features from the multi-view image and performs upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the high layer;
  • the feature pyramid accumulates the multi-layer features output by the residual network according to the feature map and outputs multi-scale features.
  • the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain the bird's-eye view features includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resampling the compressed features into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • in another embodiment, the step of mapping includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly performing a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q
  • √dk is the scaling factor and dk is the dimension of K
  • softmax is the activation function, which normalizes QK^T / √dk into a probability distribution
  • T denotes the transpose of K
  • Attention (Q, K, V) is the output of the first multi-head attention mechanism module, that is, the initialization feature.
  • the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q and K are both the features after the first residual connection and regularization, and V is the bird's-eye view feature
  • dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K
  • Attention (Q, K, V) is the output of the second multi-head attention mechanism module, that is, the learning feature.
  • the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization; its expression is as shown in formula (2): FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
  • x is the feature after the second residual connection and regularization
  • W1 and W2 are the weights of the activation function, and b1 and b2 are the bias weights
  • the max function takes the larger of 0 and xW1 + b1.
  • the steps of using the first feedforward neural network module to output the target detection category and using the second feedforward neural network module to output the size of the target detection frame include:
  • the target detection category is obtained by performing supervised learning on the first feedforward neural network through a loss module associated with the target detection category; and
  • the size of the target detection frame is obtained by performing supervised learning on the second feedforward neural network through a loss module associated with the target detection frame.
  • the multi-view images come from six cameras including the front camera, the left front camera, the right front camera, the rear camera, the left rear camera and the right rear camera of the autonomous vehicle.
  • the present invention also provides a computer-readable storage medium on which computer instructions are stored; when the computer instructions are run, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
  • the present invention also provides a multi-view 3D target detection system based on a bird's-eye view, including a memory and a processor; the memory stores computer instructions that can be run on the processor, and when the processor runs the computer instructions, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
  • the invention also provides a multi-view 3D target detection system based on a bird's-eye view.
  • the system includes a coding module, a bird's-eye view feature acquisition module and a conversion decoding module.
  • the encoding module is used to encode multi-view images to obtain multi-scale features.
  • a bird's-eye view feature acquisition module is used to map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features.
  • Transformation decoding module including initial module and learning module.
  • the initial modules include:
  • the first multi-head attention mechanism module is used to construct multiple subspaces, project query vectors into the multiple subspaces, and output the concatenated features of the multiple initialized subspaces, that is, the initialization features.
  • the first residual connection module performs identity mapping based on the query vector and initialization features, and outputs the features after the first residual connection;
  • the first regularization module regularizes the features after the first residual connection to obtain the features after the first regularization
  • the learning modules include:
  • the second multi-head attention mechanism module is used to combine the regularized features with the bird's-eye view features to obtain learning features
  • the second residual connection module is used to perform identity mapping on the learning features and output the features after the second residual connection;
  • the second regularization module is used to regularize the features after the second residual connection to obtain the features after the second regularization
  • a first feedforward neural network that outputs a target detection category under supervised learning of a loss module associated with the target detection category based on the second regularized features
  • the second feedforward neural network outputs the size of the target detection frame under the supervised learning of the loss module associated with the target detection frame based on the second regularized features.
  • the encoding module includes a residual network and a feature pyramid.
  • the residual network is used to extract features from the multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the high layer.
  • the feature pyramid is used to accumulate the multi-layer features according to the feature map and output multi-scale features.
  • in one embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resample the compressed features into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • in another embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly perform a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features with the same dimensions to obtain bird's-eye view features with reduced dimensions.
  • the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q
  • √dk is the scaling factor and dk is the dimension of K
  • softmax is the activation function, which normalizes QK^T / √dk into a probability distribution
  • T denotes the transpose of K
  • Attention (Q, K, V) is the output of the first multi-head attention mechanism module, that is, the initialization feature.
  • the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1): Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
  • Q, K, V are the inputs: Q and K are both the features after the first residual connection and regularization, and V is the bird's-eye view feature
  • dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K
  • Attention (Q, K, V) is the output of the second multi-head attention mechanism module, that is, the learning feature.
  • the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization; its expression is as shown in formula (2): FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
  • x is the feature after the second residual connection and regularization
  • W1 and W2 are the weights of the activation function, and b1 and b2 are the bias weights
  • the max function takes the larger of 0 and xW1 + b1.
  • the multi-view images come from six cameras including the front camera, the left front camera, the right front camera, the rear camera, the left rear camera and the right rear camera of the autonomous vehicle.
  • the present invention fuses multiple visual images and can therefore obtain more features and well solve the truncation problem that occurs with a monocular camera; compared with the image perspective space, the present invention transfers the features into the bird's-eye view (BEV) vector space and can thus handle the problem of multi-view overlap well; in addition, since multi-view and bird's-eye view characteristics are fully considered, the detection performance of the target detection algorithm of the present invention is outstanding.
  • this application uses specific words to describe the embodiments of the application.
  • "one embodiment", "an embodiment", and/or "some embodiments" mean a certain feature, structure or characteristic related to at least one embodiment of the present application. Therefore, it should be emphasized and noted that "one embodiment" or "an embodiment" or "an alternative embodiment" mentioned twice or more at different places in this specification does not necessarily refer to the same embodiment.
  • certain features, structures or characteristics in one or more embodiments of the present application may be appropriately combined.
  • aspects of the present application may be illustrated and described in several patentable categories or circumstances, including any new and useful process, machine, product or composition of matter, or any new and useful improvement thereof. Accordingly, various aspects of the present application may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software.
  • the above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component" or "system".
  • aspects of the present application may be embodied as a computer product including computer-readable program code located on one or more computer-readable media.
  • a computer-readable signal medium may contain a propagated data signal embodying a computer program encoding, such as on baseband or as part of a carrier wave.
  • the propagation signal may have multiple manifestations, including electromagnetic form, optical form, etc., or a suitable combination.
  • Computer-readable signal media can be any computer-readable medium other than computer-readable storage media that can communicate, propagate, or transport a program for use in connection with an instruction execution system, apparatus, or device.
  • Program code located on a computer-readable signal medium may be transmitted via any suitable medium, including radio, electrical cable, fiber optic cable, RF, or similar media, or a combination of any of the foregoing.
  • the computer program coding required for the operation of each part of this application can be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python etc., conventional procedural programming languages such as C language, Visual Basic, Fortran2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may run entirely on the user's computer, as a stand-alone software package, or partially on the user's computer and partially on a remote computer, or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer via any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (for example via the Internet), or used in a cloud computing environment, or as a service such as software as a service (SaaS).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A multi-view 3D target detection method, memory and system based on a bird's-eye view. The method includes: encoding multi-view images using a residual network and a feature pyramid to obtain multi-scale features; mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features; randomly initializing a query vector, constructing multiple subspaces through a first multi-head attention mechanism module and projecting the query vector into the multiple subspaces to obtain initialization features; performing a first residual connection and regularization on the initialization features; combining, by a second multi-head attention mechanism module, the features after the first residual connection and regularization with the bird's-eye view features to obtain learning features; and performing a second residual connection and regularization on the learning features, outputting the target detection category using a first feedforward neural network module and outputting the size of the target detection frame using a second feedforward neural network module.

Description

Multi-view 3D target detection method, memory and system based on a bird's-eye view — Technical Field
The present invention relates to the field of automatic driving, and in particular to target detection algorithms.
Background
At present, in the field of automatic driving, 3D target detection using visual information is a long-standing challenge for low-cost autonomous driving systems. Two approaches are commonly used in this field. The first builds the detection pipeline on 2D computation: a target detection pipeline designed for 2D tasks is used to predict 3D information such as target pose and velocity, without considering the 3D scene structure or the sensor configuration. This approach usually requires many post-processing steps to fuse the predictions of different cameras and remove redundant bounding boxes; its drawbacks are that the post-processing algorithms are complex and a compromise between performance and efficiency is usually required. The second common approach uses 3D reconstruction to generate pseudo-LiDAR from camera images, integrating more 3D computational information into the detection pipeline; these inputs are then treated as data collected directly from 3D sensors and processed with 3D target detection methods. This approach can effectively improve the accuracy of 3D target detection, but it often suffers from compounding errors: when the predicted depth values are inaccurate, the accuracy of 3D target detection is adversely affected.
Summary of the Invention
To overcome the defects of the prior art, the present invention provides a multi-view 3D target detection method based on a bird's-eye view, the method comprising the following steps:
encoding multi-view images using a residual network and a feature pyramid to obtain multi-scale features;
mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features;
randomly initializing a query vector, constructing multiple subspaces through a first multi-head attention mechanism module and projecting the query vector into the multiple subspaces to obtain initialization features;
performing a first residual connection and regularization on the initialization features;
combining, by a second multi-head attention mechanism module, the features after the first residual connection and regularization with the bird's-eye view features to obtain learning features; and
performing a second residual connection and regularization on the learning features, outputting the target detection category using a first feedforward neural network module and outputting the size of the target detection frame using a second feedforward neural network module.
In one embodiment, the step of encoding multi-view images using a residual network and a feature pyramid to obtain multi-scale features includes: the residual network extracts features from the multi-view images and performs upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer; and the feature pyramid accumulates the multi-layer features output by the residual network according to the feature maps and outputs multi-scale features.
In one embodiment, the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resampling the compressed bird's-eye view features of different scales into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly performing a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1):
Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
where Q, K, V are the inputs, Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
In one embodiment, the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1), where Q and K are both the features after the first residual connection and regularization and V is the bird's-eye view feature; Attention(Q, K, V) is then the output of the second multi-head attention mechanism module, i.e., the learning features.
In one embodiment, the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization, and its expression is as shown in formula (2):
FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
where x is the feature after the second residual connection and regularization, W1 and W2 are the weights of the activation function, b1 and b2 are the bias weights, and the max function takes the larger of 0 and xW1 + b1.
In one embodiment, the steps of outputting the target detection category using the first feedforward neural network module and outputting the size of the target detection frame using the second feedforward neural network module include: performing supervised learning on the first feedforward neural network through a loss module associated with the target detection category, to obtain the target detection category; and performing supervised learning on the second feedforward neural network through a loss module associated with the target detection frame, to obtain the size of the target detection frame.
In one embodiment, the multi-view images respectively come from six cameras of an autonomous vehicle: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera.
The present invention further provides a computer-readable storage medium on which computer instructions are stored; when the computer instructions are run, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
The present invention further provides a multi-view 3D target detection system based on a bird's-eye view, including a memory and a processor; the memory stores computer instructions that can be run on the processor, and when the processor runs the computer instructions, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
The present invention further provides a multi-view 3D target detection system based on a bird's-eye view, the system including an encoding module, a bird's-eye view feature acquisition module and a transformation decoding module.
The encoding module is used to encode multi-view images to obtain multi-scale features.
The bird's-eye view feature acquisition module is used to map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features.
The transformation decoding module includes an initial module and a learning module.
The initial module includes:
a first multi-head attention mechanism module, used to construct multiple subspaces, project the query vector into the multiple subspaces and output the concatenated features of the multiple initialized subspaces, i.e., the initialization features;
a first residual connection module, which performs identity mapping based on the query vector and the initialization features and outputs the features after the first residual connection; and
a first regularization module, which regularizes the features after the first residual connection to obtain the features after the first regularization.
The learning module includes:
a second multi-head attention mechanism module, used to combine the regularized features with the bird's-eye view features to obtain learning features;
a second residual connection module, used to perform identity mapping on the learning features and output the features after the second residual connection;
a second regularization module, used to regularize the features after the second residual connection to obtain the features after the second regularization;
a first feedforward neural network, which, based on the features after the second regularization, outputs the target detection category under the supervised learning of a loss module associated with the target detection category; and
a second feedforward neural network, which, based on the features after the second regularization, outputs the size of the target detection frame under the supervised learning of a loss module associated with the target detection frame.
In one embodiment, the encoding module includes a residual network and a feature pyramid: the residual network is used to extract features from the multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer, and the feature pyramid is used to accumulate the multi-layer features according to the feature maps and output multi-scale features.
In one embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resample the compressed bird's-eye view features of different scales into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly perform a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1):
Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
where Q, K, V are the inputs, Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
In one embodiment, the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1), where Q and K are both the features after the first residual connection and regularization and V is the bird's-eye view feature; Attention(Q, K, V) is then the output of the second multi-head attention mechanism module, i.e., the learning features.
In one embodiment, the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization, and its expression is as shown in formula (2):
FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
where x is the feature after the second residual connection and regularization, W1 and W2 are the weights of the activation function, b1 and b2 are the bias weights, and the max function takes the larger of 0 and xW1 + b1.
In one embodiment, the multi-view images respectively come from six cameras of an autonomous vehicle: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera.
The multi-view 3D target detection method and system based on a bird's-eye view proposed by the present invention have highly beneficial technical effects. First, compared with the RGB image plane and the like, objects retain their physical size when projected onto the bird's-eye view and therefore exhibit smaller size differences. Second, objects in the bird's-eye view occupy different spaces, which avoids occlusion problems. Third, in road scenes, since objects usually lie on the ground with little variation in vertical position, the bird's-eye view position is more advantageous for obtaining accurate three-dimensional bounding boxes. Compared with single-view camera input, the multi-view 3D detection algorithm of the present invention can effectively exploit the relationships between images from multiple viewpoints to improve feature fusion, and thus markedly improve detection accuracy.
In other words, compared with monocular detection algorithms, the present invention fuses multiple visual images, obtains more features and well solves the truncation problem that occurs with a monocular camera; compared with the image perspective space, the present invention transfers the features into the bird's-eye view (BEV) vector space and thus handles the problem of multi-view overlap well; moreover, since multi-view and bird's-eye view characteristics are fully considered, the detection performance of the target detection algorithm of the present invention is outstanding.
Brief Description of the Drawings
The above summary of the invention and the following detailed description will be better understood when read in conjunction with the accompanying drawings. It should be noted that the drawings serve merely as examples of the claimed invention. In the drawings, the same reference numerals denote the same or similar elements.
Figure 1 shows the overall architecture of a bird's-eye-view-based 3D target detection algorithm according to an embodiment of the present invention;
Figure 2 shows a schematic structural diagram of an encoding module according to an embodiment of the present invention;
Figure 3 shows the network structure of a bird's-eye-view feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention;
Figure 4 shows the network structure of a bird's-eye-view feature acquisition module (Bird-eye-view Feature) according to another embodiment of the present invention;
Figure 5 shows a schematic architectural diagram of a transformation decoding module according to an embodiment of the present invention;
Figure 6 shows an implementation diagram of a multi-head attention mechanism module according to an embodiment of the present invention;
Figure 7 shows the specific structure of a residual connection module according to an embodiment of the present invention; and
Figure 8 shows a flow chart of a multi-view 3D target detection method based on a bird's-eye view according to an embodiment of the present invention.
Detailed Description
The detailed features and advantages of the present invention are described in the detailed embodiments below; the content is sufficient to enable any person skilled in the art to understand and implement the technical content of the present invention, and those skilled in the art can easily understand the related objects and advantages of the present invention from the specification, claims and drawings disclosed herein.
The present invention fuses the features of multi-view images and performs 3D target detection based on a bird's-eye view, and proposes a bird's-eye-view-based 3D target detection method and system.
Figure 1 shows the overall architecture of a bird's-eye-view-based 3D target detection algorithm according to an embodiment of the present invention. The entire algorithm architecture includes an encoding module (Encoder) 101, a bird's-eye view feature acquisition module (Bird-eye-view Feature) 102 and a transformation decoding module (Transformer Decoder) 103.
The input of the entire network architecture is multi-view images, which can come, for example, from six cameras: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera. The output of the entire network architecture is the category of the object in the 3D box and the size of the 3D box.
The encoding module includes a residual network (Res-Net) and a feature pyramid (Feature Pyramid Network). The residual network extracts features from the multi-view images to obtain multi-layer features. The feature pyramid fuses the features of the individual layers (for example, fusing bottom-layer and high-layer features) to obtain multi-scale features. The function of the feature pyramid is to strengthen the high-layer features among the multi-layer features and to enhance the positioning details of the bottom-layer features.
Figure 2 shows a schematic structural diagram of an encoding module according to an embodiment of the present invention. The function of this encoding module is to upsample the more abstract, semantically stronger high-level feature maps and then laterally connect them to the features of the preceding layer, so that the high-level features are strengthened; a further benefit is that the positioning details of the bottom layers can be well exploited. Moreover, such a network structure can cope with the problems caused by the different sizes of the targets to be detected, and in particular with the difficulty of detecting small targets.
As can be seen from Figure 2, the encoding module includes a residual network (Res-Net) 201 and a feature pyramid (FPN, Feature Pyramid Network) 202. The residual network 201 is used to extract features from the multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer. The feature pyramid 202 accumulates the multi-layer features output by the residual network according to the feature maps and outputs multi-scale features.
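The patent does not disclose concrete layer configurations for the encoder, so the following is only a minimal PyTorch sketch of the structure just described: a ResNet backbone whose stage outputs are fused top-down with lateral connections in FPN fashion. The use of torchvision's resnet50, the channel counts and the output width of 256 are illustrative assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50  # backbone choice is an assumption


class FPNEncoder(nn.Module):
    """Res-Net + FPN encoder sketch: multi-view images -> multi-scale features."""

    def __init__(self, out_channels=256):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        # 1x1 lateral convs align the channel counts of C2..C5 (256/512/1024/2048)
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in (256, 512, 1024, 2048)])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in range(4)])

    def forward(self, images):            # images: (B * num_views, 3, H, W)
        x, feats = self.stem(images), []
        for stage in self.stages:         # bottom-up pathway
            x = stage(x)
            feats.append(x)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: upsample the higher level and add the lateral feature
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # multi-scale features
```

Each camera view is encoded independently, so the batch and view dimensions can simply be flattened together before calling the encoder.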
The bird's-eye view feature acquisition module is an important module of the present invention; its network structure completes the feature transformation from image space to bird's-eye-view space.
Figure 3 shows the network structure of a bird's-eye view feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention. The input of this module is the multi-scale features output by the feature pyramid (FPN) of the encoding module; the module maps the multi-scale features to the bird's-eye view through the mapping relationship and outputs bird's-eye view features (BEV features). This mapping mainly includes the following steps: first, compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales (301); then resample them into a polar coordinate system, to obtain bird's-eye view features of the same dimensions (i.e., a set of features predicted along the depth axis in polar coordinates) (302); finally, down-sample these same-dimension features to reduce their dimensionality (303), so as to match the input dimension of the transformation decoding module.
Figure 4 shows the network structure of a bird's-eye view feature acquisition module (Bird-eye-view Feature) according to another embodiment of the present invention. The input again comes from the multi-scale features output by the feature pyramid (FPN) of the encoding module. In this variant, the mapping mainly includes the following steps: first, compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly perform a dimension transformation to obtain bird's-eye view features of the same dimensions (401); then, through resampling (i.e., down-sampling), reduce the dimensionality of the bird's-eye view features to obtain reduced-dimension bird's-eye view features (402), so as to match the input dimension of the transformation decoding module.
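The specification defines the mapping only at the level of "compress the vertical direction, retain the horizontal dimensions, transform to a common size, then down-sample". The sketch below therefore makes several assumptions: mean-pooling over the image height as the vertical compression, a 1×1 convolution as the "direct dimension transformation" of Figure 4, bilinear resampling to a common BEV grid, and average pooling as the final down-sampling (402).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVFeature(nn.Module):
    """Sketch of the Figure-4 variant: collapse height, transform, downsample."""

    def __init__(self, in_channels=256, bev_channels=256, bev_size=(128, 128)):
        super().__init__()
        self.bev_size = bev_size
        # "direct dimension transformation" modelled as a 1x1 conv (assumption)
        self.transform = nn.Conv2d(in_channels, bev_channels, 1)

    def forward(self, multi_scale_feats):       # list of (B, C, H_i, W_i)
        bev_levels = []
        for f in multi_scale_feats:
            flat = f.mean(dim=2, keepdim=True)  # compress the vertical direction
            # keep the horizontal dimension, stretch depth to a common BEV grid
            bev = F.interpolate(flat, size=self.bev_size, mode="bilinear",
                                align_corners=False)
            bev_levels.append(self.transform(bev))
        bev = torch.stack(bev_levels).mean(0)   # same-dimension features, fused
        # final down-sampling (402) to match the decoder's input dimension
        return F.avg_pool2d(bev, kernel_size=2)
```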
Figure 5 shows a schematic architectural diagram of a transformation decoding module according to an embodiment of the present invention. The main function of the transformation decoding module is decoding. The module first randomly initializes the target query vector (Query) (the target features), then constructs multiple subspaces through the first multi-head attention mechanism (Multi-head self-attention) and projects the features of the target query vector into the multiple subspaces; the purpose of this is to comprehensively utilize information from all aspects, which allows the model to view the same problem from different angles and achieve better results. A residual connection and regularization module (Add&Norm) then deepens the network and accelerates its convergence. Subsequently, together with the bird's-eye view features, the features output by the encoder and the target features are well combined through the second multi-head attention mechanism. Finally, the final target detection category and 3D box (3D bounding box, including the center-point coordinates) are output through a further residual connection and regularization module (Add&Norm) and two feedforward neural network modules.
As shown in Figure 5, the transformation decoding module mainly includes an initial module 501 and a learning module 502. The initial module 501 includes the first multi-head attention mechanism module (Multi-Head Self-Attention), the first residual connection module (Add) and the first regularization module (Norm). The learning module 502 includes the second multi-head attention mechanism module (Multi-Head Self-Attention), the second residual connection module (Add), the second regularization module (Norm), the first feedforward neural network (FFN) (i.e., the target detection category feedforward neural network) and the second feedforward neural network (FFN) (i.e., the target detection frame feedforward neural network).
Figure 6 shows an implementation diagram of a multi-head attention mechanism module according to an embodiment of the present invention, where MatMul denotes matrix multiplication, Scale denotes the scaling operation and Softmax denotes the Softmax function. The first multi-head attention mechanism module constructs multiple subspaces, projects the features of the target query vector (Query) into the multiple subspaces, and outputs the concatenated features of the multiple initialized subspaces, i.e., the initialization features. The second multi-head attention mechanism module combines the output of the first regularization module with the BEV features and outputs the concatenated features of the multiple subspaces after integrating the BEV features, i.e., the learning features.
The output of the multi-head attention mechanism module is as shown in formula (1):
Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
where √dk is the scaling factor — dividing by √dk prevents the result from becoming too large — and dk is the dimension of the K (Key) vector; Softmax is the activation function, which normalizes QK^T / √dk into a probability distribution, and multiplying the Softmax result by the matrix V yields the weighted-sum representation; T denotes the transpose of the matrix K.
For the first multi-head attention mechanism module, which is used for initialization, the three matrices Q, K and V all come from the same input, i.e., Q, K and V are all equal to the query vector (Q vector).
For the second multi-head attention mechanism module, the Q vector and the K vector are the same: both are the features after the first residual connection and regularization, while the V vector is the bird's-eye view feature (BEV feature); this embodies the learning function.
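A minimal sketch of formula (1) follows. Note that the second module is unusual in that Q and K are the same tensor (the first Add&Norm output) while V is the BEV feature, so the code implements exactly softmax(QK^T/√dk)·V; the subspace split and concatenation of the multi-head mechanism are omitted for brevity, and all shapes are illustrative assumptions.

```python
import math
import torch


def attention(q, k, v):
    """Formula (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # scaling prevents large values
    return torch.softmax(scores, dim=-1) @ v           # weighted sum over V


# first module (initialization): Q, K and V are all the query vector
queries = torch.randn(2, 100, 256)                     # (batch, num_queries, dim)
init_feats = attention(queries, queries, queries)

# second module (learning): Q = K = first Add&Norm output, V = BEV features.
# V must have the same sequence length as K in this formulation.
bev = torch.randn(2, 100, 256)                         # flattened BEV features
learn_feats = attention(init_feats, init_feats, bev)
```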
The role of the residual connection module is to propagate information deeper into the network and to enhance the fitting ability of the model.
The regularization module (Norm) usually denotes layer normalization (Layer Normalization), which converts the inputs of the neurons in each layer into features with the same mean and variance. Its role is as follows: as the number of network layers increases, the parameters may become too large or too small after multi-layer calculations, or their variance may grow, which leads to anomalies in the learning process and makes the convergence of the model very slow; regularizing the values calculated at each layer therefore improves the performance of the model and accelerates the convergence of the network.
According to an embodiment of the present invention, the input of the first residual connection module (Add) is the query vector (Query) and the initialization features; after identity mapping, it outputs the features after the first residual connection. The specific structure of the residual connection module is shown in Figure 7. The first regularization module (Norm) regularizes the features after the first residual connection to obtain the features after the first regularization.
According to an embodiment of the present invention, the input of the second residual connection module (Add) is the learning features output by the second multi-head attention mechanism module; after identity mapping, it outputs the features after the second residual connection, the specific structure of the residual connection module again being shown in Figure 7. The second regularization module (Norm) regularizes the features after the second residual connection to obtain the features after the second regularization.
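A minimal sketch of this Add&Norm step, assuming a feature dimension of 256: the residual addition is the identity mapping of Figure 7, and the LayerNorm is the regularization module.

```python
import torch.nn as nn


class AddNorm(nn.Module):
    """Residual connection (Add) followed by layer normalization (Norm)."""

    def __init__(self, dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)  # same mean/variance per layer input

    def forward(self, x, sublayer_out):
        # identity mapping of the input plus the sublayer output, then Norm
        return self.norm(x + sublayer_out)


# e.g. first Add&Norm: x is the query vector, sublayer_out the initialization features
```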
The output of the second regularization module is fed along two paths to the first feedforward neural network FFN (the target detection category feedforward neural network) and the second feedforward neural network FFN (the target detection frame feedforward neural network). The first feedforward neural network outputs the final target detection category; the second feedforward neural network outputs the size of the target detection box (3D bounding box) and the center coordinates of the box.
The expression of the first or second feedforward neural network is as shown in formula (2):
FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
Formula (2) expresses the feedforward neural network (FFN) structure, which mainly performs a linear transformation on the regularized features. Here x is the output of the second regularization module, W1 and W2 are the weights of the activation function, and b1 and b2 are the bias weights; the max function takes the larger of 0 and xW1 + b1. The first feedforward neural network outputs the target detection category under the supervised learning of a loss module associated with the target detection category feedforward neural network; the second feedforward neural network obtains the size and center coordinates of the 3D box under the supervised learning of a loss module associated with the target detection frame feedforward neural network.
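Formula (2) is exactly a linear layer, max(0, ·) (i.e., ReLU) and a second linear layer. The sketch below instantiates it twice for the two heads; the hidden width and the output widths (a class count of 10, and 7 box parameters for size plus center plus yaw) are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class FFNHead(nn.Module):
    """Formula (2): FFN(x) = max(0, x W1 + b1) W2 + b2."""

    def __init__(self, dim=256, hidden=512, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)      # W1, b1
        self.fc2 = nn.Linear(hidden, out_dim)  # W2, b2

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))  # relu == max(0, .)


cls_head = FFNHead(out_dim=10)  # target detection category (10 classes assumed)
box_head = FFNHead(out_dim=7)   # box size + center (e.g. x, y, z, w, l, h, yaw)
```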
Figure 8 shows a flow chart of a multi-view 3D target detection method based on a bird's-eye view according to an embodiment of the present invention. The method includes the following steps (a sketch tying the steps together follows the list):
801: encode the multi-view images using a residual network and a feature pyramid to obtain multi-scale features;
802: map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features;
803: randomly initialize a query vector, construct multiple subspaces through the first multi-head attention mechanism module and project the query vector into the multiple subspaces to obtain initialization features;
804: perform a first residual connection and regularization on the initialization features;
805: combine, by the second multi-head attention mechanism module, the features after the first residual connection and regularization with the bird's-eye view features to obtain learning features; and
806: perform a second residual connection and regularization on the learning features, output the target detection category using the first feedforward neural network module and output the size of the target detection frame using the second feedforward neural network module.
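Tying steps 801–806 together, a pseudo-pipeline under the assumptions of the previous sketches; FPNEncoder, BEVFeature, attention, AddNorm and FFNHead are the illustrative names introduced there, not names from the patent.

```python
import torch


def detect(images, encoder, bev_module, add_norm1, add_norm2, cls_head, box_head,
           num_queries=100, dim=256):
    feats = encoder(images)                         # 801: multi-scale features
    bev = bev_module(feats)                         # 802: BEV features (B, C, H, W)
    bev = bev.flatten(2).transpose(1, 2)            # -> (B, H*W, C) tokens
    q = torch.randn(bev.size(0), num_queries, dim)  # 803: random query init
    h = add_norm1(q, attention(q, q, q))            # 803-804: projection + Add&Norm
    # 805: Q = K = h, V = BEV features (V truncated so its length matches K's
    # in this simplified single-head sketch)
    learn = attention(h, h, bev[:, :num_queries])
    h = add_norm2(h, learn)                         # 806: second Add&Norm
    return cls_head(h), box_head(h)                 # 806: category and 3D box size
```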
In one embodiment, the step of encoding multi-view images using a residual network and a feature pyramid to obtain multi-scale features includes: the residual network extracts features from the multi-view images and performs upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer; and the feature pyramid accumulates the multi-layer features output by the residual network according to the feature maps and outputs multi-scale features.
In one embodiment, the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resampling the compressed bird's-eye view features of different scales into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features includes: compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly performing a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1):
Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
where Q, K, V are the inputs, Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
In one embodiment, the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1), where Q and K are both the features after the first residual connection and regularization and V is the bird's-eye view feature; Attention(Q, K, V) is then the output of the second multi-head attention mechanism module, i.e., the learning features.
In one embodiment, the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization, and its expression is as shown in formula (2):
FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
where x is the feature after the second residual connection and regularization, W1 and W2 are the weights of the activation function, b1 and b2 are the bias weights, and the max function takes the larger of 0 and xW1 + b1.
In one embodiment, the steps of outputting the target detection category using the first feedforward neural network module and outputting the size of the target detection frame using the second feedforward neural network module include: performing supervised learning on the first feedforward neural network through a loss module associated with the target detection category, to obtain the target detection category; and performing supervised learning on the second feedforward neural network through a loss module associated with the target detection frame, to obtain the size of the target detection frame.
In one embodiment, the multi-view images respectively come from six cameras of an autonomous vehicle: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera.
The present invention further provides a computer-readable storage medium on which computer instructions are stored; when the computer instructions are run, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
The present invention further provides a multi-view 3D target detection system based on a bird's-eye view, including a memory and a processor; the memory stores computer instructions that can be run on the processor, and when the processor runs the computer instructions, the multi-view 3D target detection method based on a bird's-eye view of the present invention is executed.
The present invention further provides a multi-view 3D target detection system based on a bird's-eye view, the system including an encoding module, a bird's-eye view feature acquisition module and a transformation decoding module.
The encoding module is used to encode multi-view images to obtain multi-scale features.
The bird's-eye view feature acquisition module is used to map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features.
The transformation decoding module includes an initial module and a learning module.
The initial module includes:
a first multi-head attention mechanism module, used to construct multiple subspaces, project the query vector into the multiple subspaces and output the concatenated features of the multiple initialized subspaces, i.e., the initialization features;
a first residual connection module, which performs identity mapping based on the query vector and the initialization features and outputs the features after the first residual connection; and
a first regularization module, which regularizes the features after the first residual connection to obtain the features after the first regularization.
The learning module includes:
a second multi-head attention mechanism module, used to combine the regularized features with the bird's-eye view features to obtain learning features;
a second residual connection module, used to perform identity mapping on the learning features and output the features after the second residual connection;
a second regularization module, used to regularize the features after the second residual connection to obtain the features after the second regularization;
a first feedforward neural network, which, based on the features after the second regularization, outputs the target detection category under the supervised learning of a loss module associated with the target detection category; and
a second feedforward neural network, which, based on the features after the second regularization, outputs the size of the target detection frame under the supervised learning of a loss module associated with the target detection frame.
In one embodiment, the encoding module includes a residual network and a feature pyramid: the residual network is used to extract features from the multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer, and the feature pyramid is used to accumulate the multi-layer features according to the feature maps and output multi-scale features.
In one embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales; resample the compressed bird's-eye view features of different scales into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the mapping relationship is: compress the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly perform a dimension transformation to obtain bird's-eye view features of the same dimensions; and down-sample the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
In one embodiment, the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1):
Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
where Q, K, V are the inputs, Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
In one embodiment, the relationship between the input and output of the second multi-head attention mechanism module is likewise as shown in formula (1), where Q and K are both the features after the first residual connection and regularization and V is the bird's-eye view feature; Attention(Q, K, V) is then the output of the second multi-head attention mechanism module, i.e., the learning features.
In one embodiment, the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization, and its expression is as shown in formula (2):
FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
where x is the feature after the second residual connection and regularization, W1 and W2 are the weights of the activation function, b1 and b2 are the bias weights, and the max function takes the larger of 0 and xW1 + b1.
In one embodiment, the multi-view images respectively come from six cameras of an autonomous vehicle: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera.
In summary, compared with monocular detection algorithms, the present invention fuses multiple visual images, obtains more features and well solves the truncation problem that occurs with a monocular camera; compared with the image perspective space, the present invention transfers the features into the bird's-eye view (BEV) vector space and thus handles the problem of multi-view overlap well; moreover, since multi-view and bird's-eye view characteristics are fully considered, the detection performance of the target detection algorithm of the present invention is outstanding.
The basic concepts have been described above. It is apparent to those skilled in the art that the above disclosure of the invention is merely an example and does not constitute a limitation of the present application. Although not explicitly stated here, those skilled in the art may make various modifications, improvements and corrections to the present application; such modifications, improvements and corrections are suggested in this application and therefore remain within the spirit and scope of the exemplary embodiments of this application.
Flow charts are used in this application to illustrate the operations performed by the system according to the embodiments of this application. It should be understood that the preceding or following operations are not necessarily executed precisely in order; instead, the various steps may be processed in reverse order or simultaneously, other operations may be added to these processes, or one or more steps may be removed from them.
Meanwhile, this application uses specific words to describe its embodiments. "One embodiment", "an embodiment" and/or "some embodiments" mean a certain feature, structure or characteristic related to at least one embodiment of this application. Therefore, it should be emphasized and noted that "one embodiment" or "an embodiment" or "an alternative embodiment" mentioned twice or more at different places in this specification does not necessarily refer to the same embodiment. In addition, certain features, structures or characteristics in one or more embodiments of this application may be appropriately combined.
Furthermore, those skilled in the art will appreciate that aspects of this application may be illustrated and described in several patentable categories or circumstances, including any new and useful process, machine, product or composition of matter, or any new and useful improvement thereof. Accordingly, various aspects of this application may be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component" or "system". In addition, aspects of this application may be embodied as a computer product located on one or more computer-readable media, the product including computer-readable program code.
A computer-readable signal medium may contain a propagated data signal embodying computer program code, for example on baseband or as part of a carrier wave. Such a propagated signal may take a variety of forms, including an electromagnetic form, an optical form, etc., or a suitable combination. A computer-readable signal medium may be any computer-readable medium other than a computer-readable storage medium that can communicate, propagate or transport a program for use in connection with an instruction execution system, apparatus or device. Program code located on a computer-readable signal medium may be transmitted via any suitable medium, including radio, electrical cable, fiber-optic cable, RF or similar media, or any combination of the foregoing.
The computer program code required for the operation of each part of this application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional procedural programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may run entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer via any form of network, such as a local area network (LAN) or a wide area network (WAN), or connected to an external computer (e.g., via the Internet), or used in a cloud computing environment, or as a service such as software as a service (SaaS).
In addition, unless explicitly stated in the claims, the order of the processing elements and sequences described in this application, the use of alphanumeric labels, or the use of other names is not intended to limit the order of the processes and methods of this application. Although the above disclosure discusses, by way of various examples, some embodiments of the invention currently considered useful, it should be understood that such details serve illustrative purposes only and that the appended claims are not limited to the disclosed embodiments; on the contrary, the claims are intended to cover all modifications and equivalent combinations that conform to the substance and scope of the embodiments of this application. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by a software-only solution, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that, in order to simplify the presentation of this disclosure and thereby aid the understanding of one or more embodiments of the invention, the foregoing description of the embodiments sometimes groups multiple features into a single embodiment, drawing or description thereof. This method of disclosure does not, however, imply that the subject matter of this application requires more features than are recited in the claims; indeed, the features of an embodiment are fewer than all the features of a single embodiment disclosed above.
The terms and expressions used here are for description only, and the present invention should not be limited to them. The use of these terms and expressions does not exclude any equivalent features of what is shown and described (or parts thereof), and it should be recognized that various possible modifications should also be included within the scope of the claims. Other modifications, variations and replacements are also possible, and the claims should accordingly be regarded as covering all such equivalents.
Likewise, it should be pointed out that although the present invention has been described with reference to the current specific embodiments, those of ordinary skill in the art should recognize that the above embodiments are merely illustrative of the present invention; various equivalent changes or substitutions may be made without departing from the spirit of the invention, and therefore changes and variations of the above embodiments made within the essential spirit of the invention will all fall within the scope of the claims of this application.

Claims (19)

  1. A multi-view 3D target detection method based on a bird's-eye view, characterized in that the method comprises:
    encoding multi-view images using a residual network and a feature pyramid to obtain multi-scale features;
    mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features;
    randomly initializing a query vector, constructing multiple subspaces through a first multi-head attention mechanism module and projecting the query vector into the multiple subspaces to obtain initialization features;
    performing a first residual connection and regularization on the initialization features;
    combining, by a second multi-head attention mechanism module, the features after the first residual connection and regularization with the bird's-eye view features to obtain learning features; and
    performing a second residual connection and regularization on the learning features, outputting a target detection category using a first feedforward neural network module and outputting a size of a target detection frame using a second feedforward neural network module.
  2. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the step of encoding multi-view images using a residual network and a feature pyramid to obtain multi-scale features comprises:
    extracting, by the residual network, features from the multi-view images and performing upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer; and
    accumulating, by the feature pyramid, the multi-layer features output by the residual network according to the feature maps, and outputting multi-scale features.
  3. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features comprises:
    compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales;
    resampling the compressed bird's-eye view features of different scales into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and
    down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
  4. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the step of mapping the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features comprises:
    compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly performing a dimension transformation to obtain bird's-eye view features of the same dimensions; and
    down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
  5. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1):
    Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
    where Q, K, V are the inputs, Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
  6. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the relationship between the input and output of the second multi-head attention mechanism module is as shown in formula (1):
    Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
    where Q, K, V are the inputs, Q and K are both the features after the first residual connection and regularization, and V is the bird's-eye view feature; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning features.
  7. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization, the expression of the first or second feedforward neural network being as shown in formula (2):
    FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
    where x is the feature after the second residual connection and regularization, W1 and W2 are the weights of the activation function, b1 and b2 are the bias weights, and the max function takes the larger of 0 and xW1 + b1.
  8. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the steps of outputting a target detection category using the first feedforward neural network module and outputting a size of a target detection frame using the second feedforward neural network module comprise:
    performing supervised learning on the first feedforward neural network through a loss module associated with the target detection category, to obtain the target detection category; and
    performing supervised learning on the second feedforward neural network through a loss module associated with the target detection frame, to obtain the size of the target detection frame.
  9. The multi-view 3D target detection method based on a bird's-eye view according to claim 1, characterized in that the multi-view images respectively come from six cameras of an autonomous vehicle: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera.
  10. A computer-readable storage medium on which computer instructions are stored, characterized in that, when the computer instructions are run, the multi-view 3D target detection method based on a bird's-eye view according to any one of claims 1 to 9 is executed.
  11. A multi-view 3D target detection system based on a bird's-eye view, comprising a memory and a processor, the memory storing computer instructions that can be run on the processor, characterized in that, when the processor runs the computer instructions, the multi-view 3D target detection method based on a bird's-eye view according to any one of claims 1 to 9 is executed.
  12. A multi-view 3D target detection system based on a bird's-eye view, characterized in that the system comprises:
    an encoding module, used to encode multi-view images to obtain multi-scale features;
    a bird's-eye view feature acquisition module, used to map the multi-scale features to a bird's-eye view through a mapping relationship to obtain bird's-eye view features; and
    a transformation decoding module, including an initial module and a learning module;
    the initial module comprising:
    a first multi-head attention mechanism module, used to construct multiple subspaces, project a query vector into the multiple subspaces and output the concatenated features of the multiple initialized subspaces, i.e., initialization features;
    a first residual connection module, which performs identity mapping based on the query vector and the initialization features and outputs the features after the first residual connection; and
    a first regularization module, which regularizes the features after the first residual connection to obtain the features after the first regularization;
    the learning module comprising:
    a second multi-head attention mechanism module, used to combine the regularized features with the bird's-eye view features to obtain learning features;
    a second residual connection module, used to perform identity mapping on the learning features and output the features after the second residual connection;
    a second regularization module, used to regularize the features after the second residual connection to obtain the features after the second regularization;
    a first feedforward neural network, which, based on the features after the second regularization, outputs a target detection category under the supervised learning of a loss module associated with the target detection category; and
    a second feedforward neural network, which, based on the features after the second regularization, outputs a size of a target detection frame under the supervised learning of a loss module associated with the target detection frame.
  13. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the encoding module comprises:
    a residual network, used to extract features from the multi-view images and perform upsampling to obtain multi-layer features arranged sequentially from the bottom layer to the top layer; and
    a feature pyramid, used to accumulate the multi-layer features according to the feature maps and output multi-scale features.
  14. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the mapping relationship is:
    compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, to obtain compressed bird's-eye view features of different scales;
    resampling the compressed bird's-eye view features of different scales into a polar coordinate system, to obtain bird's-eye view features of the same dimensions; and
    down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
  15. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the mapping relationship is:
    compressing the multi-scale features along the vertical direction while retaining the horizontal dimensions, and directly performing a dimension transformation to obtain bird's-eye view features of the same dimensions; and
    down-sampling the bird's-eye view features of the same dimensions, to obtain bird's-eye view features with reduced dimensions.
  16. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the relationship between the input and output of the first multi-head attention mechanism module is as shown in formula (1):
    Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
    where Q, K, V are the inputs, Q is the query vector, K is the queried vector, V is the content vector, and K and V are the same as Q; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
  17. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the relationship between the input and output of the second multi-head attention mechanism module is as shown in formula (1):
    Attention(Q, K, V) = softmax(QK^T / √dk)·V  (1)
    where Q, K, V are the inputs, Q and K are both the features after the first residual connection and regularization, and V is the bird's-eye view feature; √dk is the scaling factor and dk is the dimension of K; softmax is the activation function, which normalizes QK^T / √dk into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning features.
  18. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the first or second feedforward neural network performs a linear transformation on the features after the second residual connection and regularization, the expression of the first or second feedforward neural network being as shown in formula (2):
    FFN(x) = max(0, xW1 + b1)·W2 + b2  (2)
    where x is the feature after the second residual connection and regularization, W1 and W2 are the weights of the activation function, b1 and b2 are the bias weights, and the max function takes the larger of 0 and xW1 + b1.
  19. The multi-view 3D target detection system based on a bird's-eye view according to claim 12, characterized in that the multi-view images respectively come from six cameras of an autonomous vehicle: the front camera, the left-front camera, the right-front camera, the rear camera, the left-rear camera and the right-rear camera.
PCT/CN2022/114418 2022-05-09 2022-08-24 Multi-view 3D target detection method, memory and system based on a bird's-eye view WO2023216460A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210501805.9A CN114821505A (zh) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on a bird's-eye view
CN202210501805.9 2022-05-09

Publications (1)

Publication Number Publication Date
WO2023216460A1 true WO2023216460A1 (zh) 2023-11-16

Family

ID=82514245

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/114418 WO2023216460A1 (zh) Multi-view 3D target detection method, memory and system based on a bird's-eye view

Country Status (2)

Country Link
CN (1) CN114821505A (zh)
WO (1) WO2023216460A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821505A (zh) 2022-05-09 2022-07-29 合众新能源汽车有限公司 Multi-view 3D target detection method, memory and system based on a bird's-eye view
CN115880555B (zh) * 2023-02-07 2023-05-30 北京百度网讯科技有限公司 Target detection method, model training method, apparatus, device and medium
CN116561534B (zh) * 2023-07-10 2023-10-13 苏州映赛智能科技有限公司 Method and system for improving roadside sensor accuracy based on self-supervised learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832655A (zh) * 2020-07-16 2020-10-27 四川大学 一种基于特征金字塔网络的多尺度三维目标检测方法
CN113011317A (zh) * 2021-03-16 2021-06-22 青岛科技大学 三维目标检测方法及检测装置
CN113610044A (zh) * 2021-08-19 2021-11-05 清华大学 基于自注意力机制的4d毫米波三维目标检测方法及系统
CN113658100A (zh) * 2021-07-16 2021-11-16 上海高德威智能交通系统有限公司 三维目标物体检测方法、装置、电子设备及存储介质
US20210390714A1 (en) * 2020-06-11 2021-12-16 Toyota Research Institute, Inc. Producing a bird's eye view image from a two dimensional image
CN114218999A (zh) * 2021-11-02 2022-03-22 上海交通大学 Millimeter-wave radar target detection method and system based on fused image features
CN114821505A (zh) * 2022-05-09 2022-07-29 合众新能源汽车有限公司 Multi-view 3D target detection method, memory and system based on a bird's-eye view

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390714A1 (en) * 2020-06-11 2021-12-16 Toyota Research Institute, Inc. Producing a bird's eye view image from a two dimensional image
CN111832655A (zh) * 2020-07-16 2020-10-27 四川大学 Multi-scale three-dimensional target detection method based on a feature pyramid network
CN113011317A (zh) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN113658100A (zh) * 2021-07-16 2021-11-16 上海高德威智能交通系统有限公司 Three-dimensional target object detection method and apparatus, electronic device and storage medium
CN113610044A (zh) * 2021-08-19 2021-11-05 清华大学 4D millimeter-wave three-dimensional target detection method and system based on a self-attention mechanism
CN114218999A (zh) * 2021-11-02 2022-03-22 上海交通大学 Millimeter-wave radar target detection method and system based on fused image features
CN114821505A (zh) * 2022-05-09 2022-07-29 合众新能源汽车有限公司 Multi-view 3D target detection method, memory and system based on a bird's-eye view

Also Published As

Publication number Publication date
CN114821505A (zh) 2022-07-29

Similar Documents

Publication Publication Date Title
WO2023216460A1 (zh) Multi-view 3D target detection method, memory and system based on a bird's-eye view
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
WO2019223382A1 (zh) Monocular depth estimation method and apparatus therefor, device and storage medium
Yuan et al. RGGNet: Tolerance aware LiDAR-camera online calibration with geometric deep learning and generative model
WO2022242416A1 (zh) Point cloud data generation method and apparatus
WO2024021194A1 (zh) LiDAR point cloud segmentation method, apparatus, device and storage medium
US20230154170A1 (en) Method and apparatus with multi-modal feature fusion
CN113159151A (zh) Multi-sensor deep-fusion 3D target detection method for autonomous driving
Zhao et al. A surface geometry model for lidar depth completion
CN113052109A (zh) 3D target detection system and 3D target detection method thereof
EP4307219A1 (en) Three-dimensional target detection method and apparatus
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
WO2022000469A1 (en) Method and apparatus for 3d object detection and segmentation based on stereo vision
WO2023216654A1 (zh) Multi-view semantic segmentation method, apparatus, electronic device and storage medium
WO2024083006A1 (zh) Three-dimensional imaging method, apparatus, device and storage medium
CN116452573A (zh) Substation equipment defect detection method, model training method, apparatus and device
CN115115917A (zh) 3D point cloud target detection method based on attention mechanism and image feature fusion
Zheng et al. Real-time GAN-based image enhancement for robust underwater monocular SLAM
CN115866229B (zh) View-angle conversion method, apparatus, device and medium for multi-view images
Li et al. 6DoF-3D: Efficient and accurate 3D object detection using six degrees-of-freedom for autonomous driving
CN117037141A (zh) 3D target detection method, apparatus and electronic device
US20230377180A1 (en) Systems and methods for neural implicit scene representation with dense, uncertainty-aware monocular depth constraints
CN114648639B (zh) Target vehicle detection method, system and apparatus
CN116246119A (zh) 3D target detection method, electronic device and storage medium
Alaba et al. Multi-sensor fusion 3D object detection for autonomous driving

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22941401

Country of ref document: EP

Kind code of ref document: A1