CN114821505A - Multi-view 3D target detection method, memory and system based on aerial view - Google Patents

Multi-view 3D target detection method, memory and system based on aerial view

Info

Publication number
CN114821505A
Authority
CN
China
Prior art keywords
features
view
module
bird
aerial view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210501805.9A
Other languages
Chinese (zh)
Inventor
陈远鹏
张军良
赵天坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hozon New Energy Automobile Co Ltd
Original Assignee
Hozon New Energy Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hozon New Energy Automobile Co Ltd filed Critical Hozon New Energy Automobile Co Ltd
Priority to CN202210501805.9A priority Critical patent/CN114821505A/en
Publication of CN114821505A publication Critical patent/CN114821505A/en
Priority to PCT/CN2022/114418 priority patent/WO2023216460A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bird's-eye-view-based multi-view 3D target detection method, memory, and system. The method comprises the following steps: encoding the multi-view pictures with a residual network and a feature pyramid to obtain multi-scale features; mapping the multi-scale features to the bird's eye view through a mapping relation to obtain bird's-eye-view features; randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector into the subspaces to obtain initialization features; performing a first residual connection and regularization on the initialization features; combining the features after the first residual connection and regularization with the bird's-eye-view features using a second multi-head attention mechanism module to obtain learning features; and performing a second residual connection and regularization on the learning features, then outputting the target detection category with a first feedforward neural network module and the size of the target detection box with a second feedforward neural network module.

Description

Multi-view 3D target detection method, memory and system based on aerial view
Technical Field
The invention relates to the field of automatic driving, in particular to a target detection algorithm.
Background
In the field of automatic driving, 3D target detection from visual information has long been a challenge for low-cost autonomous driving systems. Two approaches are commonly used in the art. The first builds the detection pipeline on 2D computations: a target detection pipeline designed for 2D tasks is used to predict 3D information such as target pose and velocity, without regard to the 3D scene structure or the sensor configuration. This approach usually requires many post-processing steps to fuse the predictions of the different cameras and to remove redundant bounding boxes, so the post-processing is complex and a compromise between performance and efficiency is often needed. The second approach uses 3D reconstruction to generate a pseudo-lidar point cloud from the camera images, integrating more 3D information into the detection pipeline; these inputs are then treated as data acquired directly from a 3D sensor and fed to a 3D target detection method. This can effectively improve 3D detection accuracy, but it suffers from compounding errors: when the depth values are predicted incorrectly, the accuracy of 3D target detection is often negatively affected.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-view 3D object detection method based on a bird's eye view, which comprises the following steps:
coding the multi-view picture by utilizing a residual error network and a characteristic pyramid to obtain multi-scale characteristics;
mapping the multi-scale features to the aerial view through a mapping relation to obtain aerial view features;
randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector into the plurality of subspaces to obtain an initialization characteristic;
performing first residual error connection and regularization on the initialization features;
combining the features after the first residual error connection and regularization with the aerial view features by using a second multi-head attention mechanism module to obtain learning features; and
and performing second residual error connection and regularization on the learning features, and outputting the target detection category by using a first feedforward neural network module and outputting the size of a target detection frame by using a second feedforward neural network module.
In an embodiment, the step of encoding the multi-view picture by using the residual error network and the feature pyramid to obtain the multi-scale feature includes:
the residual error network extracts features of the multi-view image and performs up-sampling to obtain a plurality of layers of features which are sequentially arranged from a bottom layer to a high layer; and
and accumulating the multilayer characteristics output by the residual error network according to the characteristic mapping map by the characteristic pyramid, and outputting the multi-scale characteristics.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain the aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the step of outputting the target detection class with the first feedforward neural network module and outputting the size of the target detection box with the second feedforward neural network module includes:
performing supervised learning on the first feedforward neural network through a loss module associated with a target detection category to obtain the target detection category;
and performing supervised learning on the second feedforward neural network through a loss module associated with a target detection frame to obtain the size of the target detection frame.
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
The invention also provides a computer readable storage medium on which computer instructions are stored, which when executed perform the bird's eye view-based multi-perspective 3D object detection method of the invention.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of running on the processor, and the processor executes the computer instructions to execute the multi-view 3D target detection method based on the aerial view.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises an encoding module, an aerial view characteristic acquisition module and a conversion decoding module.
And the coding module is used for coding the multi-view picture to obtain the multi-scale characteristics.
And the aerial view characteristic acquisition module is used for mapping the multi-scale characteristics to the aerial view through a mapping relation to obtain the aerial view characteristics.
And the conversion decoding module comprises an initial module and a learning module.
The initial module comprises:
and the first multi-head attention mechanism is used for constructing a plurality of subspaces, projecting the query vectors into the plurality of subspaces and outputting the spliced features of the plurality of initialized subspaces, namely the initialized features.
The first residual connecting module is used for performing identity mapping according to the query vector and the initialized features and outputting features after the first residual connection; and
the first regularization module is used for regularizing the features after the first residual error connection to obtain features after the first regularization;
the learning module includes:
the second multi-head attention mechanism module is used for combining the regularized features with the aerial view features to obtain learning features;
the second residual error connection module is used for performing identity mapping on the learning features and outputting the features after the second residual error connection;
the second regularization module is used for regularizing the features after the second residual error connection to obtain features after the second regularization;
the first feed-forward neural network outputs a target detection category under the supervision and learning of a loss module associated with the target detection category according to the features after the second regularization; and
and the second feedforward neural network outputs the size of the target detection frame under the supervision and learning of a loss module associated with the target detection frame according to the features after the second regularization.
In one embodiment, the encoding module includes a residual network and a feature pyramid.
And the residual error network is used for extracting features of the multi-view picture and performing up-sampling to obtain a plurality of layers of features which are sequentially arranged from the bottom layer to the high layer.
And the characteristic pyramid is used for accumulating the multilayer characteristics according to the characteristic mapping graph and outputting the multi-scale characteristics.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain the aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
The bird's-eye-view-based multi-view 3D target detection method and system of the invention have highly beneficial technical effects. First, objects keep their physical size when projected onto the bird's eye view, so the size variation is small compared with projection onto the RGB image plane. Second, objects in the bird's eye view occupy distinct regions of space, which avoids occlusion problems. Third, in road scenes objects usually lie on the ground and vary little in the vertical direction, so positions in the bird's eye view are more favorable for obtaining accurate three-dimensional bounding boxes. Compared with single-view camera input, the multi-view 3D detection algorithm can effectively exploit the relations between images from multiple viewpoints and improve feature fusion, which in turn improves detection accuracy.
In other words, compared with a monocular detection algorithm, the method of the invention fuses multi-view images, obtains richer features, and handles the truncation problem of a single camera well; compared with working in the image-view space, it transfers the features into a bird's-eye-view (BEV) vector space, which handles the overlap between multiple viewing angles well; moreover, because both the multiple viewing angles and the bird's-eye-view features are fully considered, the target detection algorithm achieves superior detection performance.
Drawings
The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
FIG. 1 shows an overall architecture of a bird's eye view-based 3D object detection algorithm according to an embodiment of the invention;
FIG. 2 is a block diagram of an encoding module according to an embodiment of the invention;
fig. 3 illustrates a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention;
fig. 4 shows a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to still another embodiment of the present invention;
FIG. 5 is a block diagram of a translation decode module according to an embodiment of the invention;
FIG. 6 illustrates a multi-headed attention mechanism module implementation diagram according to an embodiment of the invention;
fig. 7 shows a specific structure of a residual connecting module according to an embodiment of the present invention; and
fig. 8 is a flowchart illustrating a multi-view 3D object detection method based on a bird's eye view according to an embodiment of the invention.
Detailed Description
The detailed features and advantages of the present invention are described in detail in the detailed description which follows, and will be sufficient for anyone skilled in the art to understand the technical content of the present invention and to implement the present invention, and the related objects and advantages of the present invention will be easily understood by those skilled in the art from the description, claims and drawings disclosed in the present specification.
The invention provides a 3D target detection method and system based on a bird's-eye view, which are used for carrying out feature fusion on multi-view pictures and carrying out 3D target detection based on the bird's-eye view.
Fig. 1 shows an overall architecture of a bird's eye view-based 3D object detection algorithm according to an embodiment of the invention. The whole algorithm framework comprises an encoding module (Encoder)101, a Bird-eye-view Feature acquisition module (Bird-eye-view Feature)102 and a transformation decoding module (transform Decoder) 103.
The input of the whole bird's-eye-view-based 3D target detection network architecture is the multi-view pictures. These can come from, for example, six cameras: a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera. The output of the entire network architecture is the category of each target in a 3D box and the size of the 3D box.
The coding module includes a residual Network (Res-Net) and a Feature Pyramid Network. The residual network extracts features from the multi-view pictures to obtain multi-layer features. The feature pyramid fuses the features of the different layers (for example, fusing lower-level and higher-level features) to obtain multi-scale features. The feature pyramid serves to strengthen the semantic information of the higher-level features and the localization details of the lower-level features.
Fig. 2 shows a schematic structural diagram of the encoding module according to an embodiment of the present invention. The encoding module upsamples the more abstract, semantically stronger high-level feature maps and laterally connects them to the features of the preceding level, so that the high-level features are enhanced while the localization details of the bottom layers are well utilized. This network structure also copes with targets of different sizes to be detected, in particular the difficulty of detecting small targets.
As can be seen from fig. 2, the encoding module includes a residual Network (Res-Net)201 and a Feature Pyramid (Feature Pyramid Network) 202.
The residual error network (Res-Net)201 is used for extracting features from the multi-view map and performing upsampling to obtain multiple layers of features which are sequentially arranged from a bottom layer to a high layer.
The Feature Pyramid (FPN) 202 accumulates the multi-layer features output by the residual error Network according to the Feature mapping map, and outputs the multi-scale features.
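The patent text does not give an implementation of the encoder; the following is a minimal PyTorch-style sketch of the residual-network-plus-feature-pyramid idea described above. The class name SimpleFPN, the ResNet-50 backbone, the channel widths, and the nearest-neighbour upsampling are illustrative assumptions, not the patent's actual design.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SimpleFPN(nn.Module):
    """Hypothetical encoder: ResNet backbone plus a top-down feature pyramid."""
    def __init__(self, out_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50()        # randomly initialised backbone
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        stage_channels = [256, 512, 1024, 2048]       # ResNet-50 stage widths
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in stage_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in stage_channels])

    def forward(self, images):                        # images: (B, 3, H, W) per camera view
        x = self.stem(images)
        feats = []
        for stage in self.stages:                     # bottom-up: multi-layer features
            x = stage(x)
            feats.append(x)
        # Top-down path: upsample the more semantic higher level and add the
        # laterally connected lower level, as described for the feature pyramid.
        outs = [self.lateral[-1](feats[-1])]
        for i in range(len(feats) - 2, -1, -1):
            up = F.interpolate(outs[0], size=feats[i].shape[-2:], mode="nearest")
            outs.insert(0, self.lateral[i](feats[i]) + up)
        return [conv(o) for conv, o in zip(self.smooth, outs)]   # multi-scale features
```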
The bird's-eye view feature acquisition module is an important module of the invention, and the network structure of the bird's-eye view feature acquisition module completes feature conversion from an image space to a bird's-eye view space.
Fig. 3 shows a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to an embodiment of the present invention. The bird's-eye view feature acquisition module inputs the multi-scale features output in the Feature Pyramid (FPN) of the coding module, maps the multi-scale features to the bird's-eye view through a mapping relation, and outputs the bird's-eye view features (BEV features).
The method for mapping the multi-scale features to the aerial view and outputting the aerial view features through the mapping relationship mainly comprises the following steps of: firstly, compressing the multi-scale features along the vertical direction, and simultaneously keeping the dimension of the horizontal direction to obtain compressed aerial view features with different scales (301); then, through resampling, converting into a polar coordinate system, and obtaining aerial view features with the same dimension (namely, predicting a group of features along the depth axis direction in the polar coordinate) (302); and then, downsampling the bird's-eye view features with the same dimension size to reduce the dimension to obtain the bird's-eye view features (303) with the reduced dimension so as to adapt to the input dimension of the conversion decoding module.
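As a rough illustration of steps 301 to 303, the sketch below compresses the vertical image axis, predicts a set of features along the depth axis of a polar grid (one ray per image column), warps the polar grid onto a Cartesian bird's-eye-view grid, and downsamples. All sizes, the assumed 90-degree field of view per camera, and the use of grid sampling for the polar-to-Cartesian resampling are assumptions for illustration only, not the patent's concrete implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVMappingPolar(nn.Module):
    """Hypothetical mapping of Fig. 3: vertical compression -> polar rays -> Cartesian BEV."""
    def __init__(self, channels=256, depth_bins=64, bev_channels=32, bev_size=100):
        super().__init__()
        # One ray of depth_bins cells is predicted per image column (1x1 conv over columns).
        self.expand = nn.Conv1d(channels, bev_channels * depth_bins, kernel_size=1)
        self.reduce = nn.Conv2d(bev_channels, bev_channels, 3, stride=2, padding=1)
        self.depth_bins, self.bev_channels, self.bev_size = depth_bins, bev_channels, bev_size

    def polar_to_cartesian_grid(self, batch, device):
        # For each Cartesian BEV cell, the (depth, azimuth) polar cell to sample from.
        xs = torch.linspace(-1.0, 1.0, self.bev_size, device=device)   # lateral axis
        zs = torch.linspace(0.0, 1.0, self.bev_size, device=device)    # forward axis
        x = xs.view(1, -1).expand(self.bev_size, self.bev_size)
        z = zs.view(-1, 1).expand(self.bev_size, self.bev_size)
        radius = torch.sqrt(x ** 2 + z ** 2).clamp(max=1.0) * 2.0 - 1.0   # depth coordinate
        azimuth = torch.atan2(x, z) / (math.pi / 2)                       # assumes a 90-degree view
        grid = torch.stack((azimuth, radius), dim=-1)                     # (H, W, 2) in [-1, 1]
        return grid.unsqueeze(0).expand(batch, -1, -1, -1)

    def forward(self, multi_scale_feats):             # list of (B, C, H_i, W_i) feature maps
        bev_per_scale = []
        for feat in multi_scale_feats:
            b = feat.shape[0]
            column = feat.mean(dim=2)                 # compress vertically, keep the width
            rays = self.expand(column).view(b, self.bev_channels, self.depth_bins, -1)
            grid = self.polar_to_cartesian_grid(b, feat.device)
            # Resample the polar (depth x azimuth) features onto the Cartesian BEV grid,
            # giving bird's-eye-view features of the same dimension for every scale.
            bev_per_scale.append(F.grid_sample(rays, grid, mode="bilinear", align_corners=False))
        bev = torch.stack(bev_per_scale).sum(0)       # fuse the equal-dimension features
        return self.reduce(bev)                       # downsampled BEV feature
```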
Fig. 4 shows a network structure of a Bird-eye-view Feature acquisition module (Bird-eye-view Feature) according to still another embodiment of the present invention. The bird's-eye view feature acquisition module inputs the multi-scale features output in the Feature Pyramid (FPN) of the coding module, maps the multi-scale features to the bird's-eye view through a mapping relation, and outputs the bird's-eye view features (BEV features).
The method for mapping the multi-scale features to the aerial view and outputting the aerial view features through the mapping relationship mainly comprises the following steps of: firstly, compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly carrying out dimension transformation to obtain aerial view features with the same dimension (401); dimensionality is reduced on the aerial view features through resampling (namely, downsampling), and the aerial view features (402) with the dimensionality reduced are obtained so as to adapt to the input dimensionality of the conversion decoding module.
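For the variant of Fig. 4 (steps 401 and 402), a corresponding sketch differs from the previous one only in that the compressed column features are transformed directly to equal-dimension bird's-eye-view features, without the polar resampling step. Again, the class name, sizes, and the use of simple interpolation as the "direct dimension transformation" are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVMappingDirect(nn.Module):
    """Hypothetical mapping of Fig. 4: vertical compression and a direct dimension
    transform to equal-sized BEV features, with no polar resampling."""
    def __init__(self, channels=256, depth_bins=64, bev_channels=32):
        super().__init__()
        self.expand = nn.Conv1d(channels, bev_channels * depth_bins, kernel_size=1)
        self.reduce = nn.Conv2d(bev_channels, bev_channels, 3, stride=2, padding=1)
        self.depth_bins, self.bev_channels = depth_bins, bev_channels

    def forward(self, multi_scale_feats, bev_size=100):    # list of (B, C, H_i, W_i)
        bev_per_scale = []
        for feat in multi_scale_feats:
            b = feat.shape[0]
            column = feat.mean(dim=2)                       # compress vertically, keep the width
            bev = self.expand(column).view(b, self.bev_channels, self.depth_bins, -1)
            # Direct dimension transformation: resize every scale to one common
            # BEV grid size instead of resampling through a polar coordinate system.
            bev = F.interpolate(bev, size=(bev_size, bev_size),
                                mode="bilinear", align_corners=False)
            bev_per_scale.append(bev)
        bev = torch.stack(bev_per_scale).sum(0)             # equal-dimension features, fused
        return self.reduce(bev)                             # downsampled BEV feature
```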
Fig. 5 is a schematic diagram of the architecture of the conversion decoding module according to an embodiment of the present invention. The conversion decoding module is mainly responsible for decoding. It first randomly initializes a target query vector (Query, i.e., the target feature), then constructs a plurality of subspaces through a first multi-head self-attention module and projects the features of the target query vector into these subspaces, so that information from different aspects can be used comprehensively and the model can view the same problem from different angles, yielding a better result. A residual connection and regularization module (Add & Norm) then deepens the network and accelerates its convergence. Next, a second multi-head attention mechanism combines the target features with the bird's-eye-view features output by the encoder. Finally, after another residual connection and regularization module (Add & Norm), two feedforward neural network modules output the final target detection category and the 3D box (3D bounding box, including the center point coordinates).
As shown in fig. 5, the transform decoding module mainly includes an initial module 501 and a learning module 502. The initial block 501 includes a first Multi-Head Attention mechanism block (Multi-Head Self-Attention), a first residual concatenation block (Add), and a first regularization block (Norm). The learning module 502 includes a second Multi-Head Attention mechanism module (Multi-Head Self-Attention), a second residual concatenation module (Add) and a second regularization module (Norm), a first feed-forward neural network (FFN) (i.e., a target detection class feed-forward neural network), and a second feed-forward neural network (FFN) (i.e., a target detection box feed-forward neural network).
FIG. 6 illustrates an implementation of the multi-head attention mechanism module according to an embodiment of the invention, where MatMul denotes matrix multiplication, Scale denotes the scaling operation, and Softmax denotes the softmax function. The first multi-head attention mechanism module constructs a plurality of subspaces, projects the features of the target query vector (Query) into these subspaces, and outputs the concatenated features of the initialized subspaces, i.e., the initialization features. The second multi-head attention mechanism module combines the output of the first regularization module with the BEV features and outputs the concatenated features of the subspaces fused with the BEV features, i.e., the learning features.
The output of the multi-head attention mechanism module is shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where √d_k is the scaling factor; dividing by √d_k prevents the result from becoming too large, and d_k is the dimension of the K (Key) vector; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; the softmax output is then multiplied by the matrix V to obtain the weighted sum; and T denotes the transpose of the matrix K.
For the first multi-head attention mechanism module, since it is used for initialization, the three matrices of Q vector, K vector, and V vector are all from the same input, i.e. the three matrices of Q vector, K vector, and V vector are all equal to the query vector (Q vector).
For the second multi-head attention mechanism module, the Q vector and the K vector are the same, the Q vector and the K vector are both the features after the first residual error connection and regularization, and the V vector is a bird's-eye view feature (BEV feature), so that the learning function is embodied.
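Equation (1) and the subspace projection of the two attention modules can be illustrated with the following sketch. The head count, feature widths, and query count are assumptions; note also that, for the second module, the sketch feeds the bird's-eye-view features as both K and V so that the matrix shapes in equation (1) are compatible, whereas the text above assigns K to the decoder features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Equation (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = k.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)   # Q K^T / sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)              # weighted sum of V

class MultiHeadAttention(nn.Module):
    """Projects Q, K, V into several subspaces (heads), applies equation (1) in each
    subspace, and concatenates (splices) the per-head results."""
    def __init__(self, d_model=256, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split(self, x):                              # (B, L, D) -> (B, heads, L, D/heads)
        b, l, _ = x.shape
        return x.view(b, l, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, q, k, v):
        heads = scaled_dot_product_attention(self._split(self.q_proj(q)),
                                             self._split(self.k_proj(k)),
                                             self._split(self.v_proj(v)))
        b, _, l, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, l, -1)   # splice the subspaces back together
        return self.out_proj(concat)

# First module (initialisation): self-attention where Q = K = V = the random queries.
queries = torch.randn(1, 900, 256)                    # 900 queries of width 256 (assumed sizes)
init_features = MultiHeadAttention()(queries, queries, queries)

# Second module (learning): cross-attention against the BEV features. This sketch uses
# the flattened BEV map for both K and V so the matrix shapes in equation (1) match.
bev_features = torch.randn(1, 50 * 50, 256)           # flattened 50x50 BEV grid (assumed)
learned_features = MultiHeadAttention()(init_features, bev_features, bev_features)
```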
The residual connecting module has the function of transferring information deeper and enhancing the fitting capability of the model.
The regularization module (Norm) is typically Layer Normalization, which converts the inputs of the neurons in each layer to have the same mean and variance. As the number of network layers grows, the values computed through many layers may become too large or too small or have a large variance, which makes learning unstable and model convergence very slow; regularizing the values computed at each layer therefore improves model performance and accelerates network convergence.
According to an embodiment of the present invention, the input of the first residual concatenation module (Add) is a Query vector (Query) and an initialization feature, and the feature after the first residual concatenation is output after identity mapping, and a specific structure of the residual concatenation module is shown in fig. 7. And a first regularization module (Norm) regularizes the features after the first residual error connection to obtain the features after the first regularization.
According to an embodiment of the present invention, the input of the second residual error concatenation module (Add) is the learned feature output by the second multi-head attention mechanism module, the feature after the second residual error concatenation is output after the identity mapping, and a specific structure of the residual error concatenation module is shown in fig. 7. And a second regularization module (Norm) regularizes the features after the second residual error connection to obtain features after the second regularization.
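The Add & Norm blocks described above amount to a residual addition followed by layer normalization; a minimal sketch follows (the class name AddNorm and the feature width are assumptions).

```python
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalisation (an Add & Norm block)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, residual, sublayer_out):
        # Identity-map the block input, add the sub-layer output, then normalise so the
        # activations of every layer keep a comparable mean and variance.
        return self.norm(residual + sublayer_out)
```

In this sketch the first Add & Norm would be called with the query vector and the initialization features, and the second with the features entering the second attention module and the learning features.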
The output of the second regularization module is divided into two paths to be respectively output to a first feed-forward neural network FFN (target detection class feed-forward neural network) and a second feed-forward neural network FFN (target detection frame feed-forward neural network). The first feed-forward neural network outputs a final target detection class. The second feedforward neural network outputs the size of the target detection box (3D bounding box) and the center coordinates of the target detection box.
The expression of the first or second feedforward neural network is shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

Equation (2) is the expression of the feedforward neural network (FFN) structure, which mainly performs a linear transformation on the regularized features. Here x is the output of the second regularization module, W_1 and W_2 are the weights of the activation function, and b_1 and b_2 are the bias weights; the max function takes the larger of 0 and xW_1 + b_1. The first feedforward neural network outputs the target detection category under the supervised learning of a loss module associated with the detection-category feedforward neural network. The second feedforward neural network obtains the size and center coordinates of the 3D box under the supervised learning of a loss module associated with the detection-box feedforward neural network.
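Equation (2) is a two-layer perceptron with a ReLU in between; the following sketch shows the two parallel heads. The hidden width, the class count, and the six box parameters (centre x, y, z plus width, length, height) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FFNHead(nn.Module):
    """Equation (2): FFN(x) = max(0, x W1 + b1) * W2 + b2, i.e. Linear -> ReLU -> Linear."""
    def __init__(self, d_model=256, hidden=512, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(d_model, hidden)         # W1, b1
        self.fc2 = nn.Linear(hidden, out_dim)         # W2, b2

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))      # max(0, x W1 + b1) W2 + b2

# Two parallel heads applied to the second-regularised features (sizes are assumptions):
cls_head = FFNHead(out_dim=10)    # first FFN: target detection category logits
box_head = FFNHead(out_dim=6)     # second FFN: box centre (x, y, z) and size (w, l, h)
```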
Fig. 8 is a flowchart illustrating the bird's-eye-view-based multi-view 3D target detection method according to an embodiment of the invention. The method comprises the following steps (an end-to-end sketch combining the modules sketched above follows the list):
801: coding the multi-view picture by utilizing a residual error network and a characteristic pyramid to obtain multi-scale characteristics;
802: mapping the multi-scale features to the aerial view through a mapping relation to obtain aerial view features;
803: randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector into the plurality of subspaces to obtain an initialization characteristic;
804: performing first residual error connection and regularization on the initialization features;
805: combining the features after the first residual error connection and regularization with the aerial view features by using a second multi-head attention mechanism module to obtain learning features; and
806: and performing second residual error connection and regularization on the learning features, and outputting the target detection category by using a first feedforward neural network module and outputting the size of a target detection frame by using a second feedforward neural network module.
In an embodiment, the step of encoding the multi-view picture by using the residual error network and the feature pyramid to obtain the multi-scale feature includes:
the residual error network extracts features of the multi-view image and performs up-sampling to obtain a plurality of layers of features which are sequentially arranged from a bottom layer to a high layer; and
and accumulating the multilayer characteristics output by the residual error network according to the characteristic mapping map by the characteristic pyramid, and outputting the multi-scale characteristics.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the step of mapping the multi-scale feature to the bird's eye view by the mapping relation to obtain the bird's eye view feature comprises:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the step of outputting the target detection category with the first feedforward neural network module and outputting the size of the target detection box with the second feedforward neural network module includes the following steps (an illustrative loss sketch follows these steps):
performing supervised learning on the first feedforward neural network through a loss module associated with a target detection category to obtain the target detection category;
and performing supervised learning on the second feedforward neural network through a loss module associated with a target detection frame to obtain the size of the target detection frame.
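The patent does not name the concrete loss modules. The sketch below assumes a cross-entropy loss for the category head and an L1 loss for the box head, and assumes the query-to-ground-truth matching has already been performed; these choices are illustrative, not the patent's.

```python
import torch.nn.functional as F

def detection_losses(cls_logits, box_preds, gt_labels, gt_boxes):
    """Illustrative loss modules for the two FFN heads: cross-entropy supervises the
    category head and an L1 loss supervises the 3D-box head (assumed choices)."""
    cls_loss = F.cross_entropy(cls_logits, gt_labels)   # loss module for the detection category
    box_loss = F.l1_loss(box_preds, gt_boxes)           # loss module for the detection box
    return cls_loss + box_loss
```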
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
The invention also provides a computer readable storage medium on which computer instructions are stored, which when executed perform the bird's eye view-based multi-perspective 3D object detection method of the invention.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises a memory and a processor, wherein the memory is stored with computer instructions capable of running on the processor, and the processor executes the computer instructions to execute the multi-view 3D target detection method based on the aerial view.
The invention also provides a multi-view 3D target detection system based on the aerial view, which comprises an encoding module, an aerial view characteristic acquisition module and a conversion decoding module.
And the coding module is used for coding the multi-view picture to obtain the multi-scale characteristics.
And the aerial view characteristic acquisition module is used for mapping the multi-scale characteristics to the aerial view through a mapping relation to obtain the aerial view characteristics.
And the conversion decoding module comprises an initial module and a learning module.
The initial module comprises:
and the first multi-head attention mechanism is used for constructing a plurality of subspaces, projecting the query vectors into the plurality of subspaces and outputting the spliced features of the plurality of initialized subspaces, namely the initialized features.
The first residual connecting module is used for performing identity mapping according to the query vector and the initialized features and outputting features after the first residual connection; and
the first regularization module is used for regularizing the features after the first residual error connection to obtain features after the first regularization;
the learning module includes:
the second multi-head attention mechanism module is used for combining the regularized features with the aerial view features to obtain learning features;
the second residual error connection module is used for performing identity mapping on the learning features and outputting the features after the second residual error connection;
the second regularization module is used for regularizing the features after the second residual error connection to obtain features after the second regularization;
the first feed-forward neural network outputs a target detection category under the supervision and learning of a loss module associated with the target detection category according to the features after the second regularization; and
and the second feedforward neural network outputs the size of the target detection frame under the supervision and learning of a loss module associated with the target detection frame according to the features after the second regularization.
In one embodiment, the encoding module includes a residual network and a feature pyramid.
And the residual error network is used for extracting features of the multi-view picture and performing up-sampling to obtain a plurality of layers of features which are sequentially arranged from the bottom layer to the high layer.
And the characteristic pyramid is used for accumulating the multilayer characteristics according to the characteristic mapping graph and outputting the multi-scale characteristics.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, and meanwhile, reserving the dimension in the horizontal direction to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features with different dimensions, and converting the aerial view features into a polar coordinate system to obtain aerial view features with the same dimension;
and downsampling the aerial view features with the same dimension to obtain the aerial view features with the dimensions reduced.
In one embodiment, the mapping relationship is:
compressing the multi-scale features along the vertical direction, simultaneously reserving the dimension in the horizontal direction, and directly performing dimension transformation to obtain the aerial view features with the same dimension;
and downsampling the aerial view features with the same dimensionality to obtain the aerial view features with reduced dimensionalities.
In one embodiment, the relationship between the input and the output of the first multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization feature.
In one embodiment, the relationship between the input and the output of the second multi-head attention mechanism module is as shown in equation (1):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

where Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the bird's-eye-view feature; √d_k is the scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T represents the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning feature.
In one embodiment, the first or second feedforward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feedforward neural network is as shown in equation (2):

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2    (2)

where x is the feature after the second residual connection and regularization, W_1 and W_2 are the weights of the activation function, b_1 and b_2 are the bias weights, and the max function takes the larger of 0 and xW_1 + b_1.
In one embodiment, the multi-view pictures are from six cameras of a front camera, a left front camera, a right front camera, a rear camera, a left rear camera, and a right rear camera of the autonomous vehicle, respectively.
Compared with a monocular detection algorithm, the method has the advantages that fusion is performed based on the multi-vision images, more features can be obtained, and the problem of truncation of the monocular can be well solved; compared with an image visual angle space, the method transfers the characteristics into a bird's-eye view (BEV) vector space, and can well solve the problem of multi-visual angle superposition; in addition, the detection effect of the target detection algorithm is superior due to the full consideration of the multi-view angle and the aerial view characteristics.
Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing disclosure is by way of example only, and is not intended to limit the present application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present application and thus fall within the spirit and scope of the exemplary embodiments of the present application.
Flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown; rather, various steps may be processed in reverse order or simultaneously, and other operations may be added to or removed from these processes.
Also, this application uses specific language to describe embodiments of the application. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the present application is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the present application may be combined as appropriate.
Moreover, those skilled in the art will appreciate that aspects of the present application may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereon. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of hardware and software. The above hardware or software may be referred to as a "data block", "module", "engine", "unit", "component", or "system". Furthermore, aspects of the present application may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
A computer readable signal medium may comprise a propagated data signal with computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, and the like, or any suitable combination. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code on a computer readable signal medium may be propagated over any suitable medium, including radio, electrical cable, fiber optic cable, RF, or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of the present application may be written in any one or more programming languages, including object-oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, conventional procedural languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP, dynamic programming languages such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any kind of network, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), in a cloud computing environment, or as a service such as software as a service (SaaS).
Additionally, the order in which elements and sequences of the processes described herein are processed, the use of alphanumeric characters, or the use of other designations, is not intended to limit the order of the processes and methods described herein, unless explicitly claimed. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the foregoing description of embodiments of the application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not intended to require more features than are expressly recited in the claims. Indeed, the embodiments may be characterized as having less than all of the features of a single embodiment disclosed above.
The terms and expressions which have been employed herein are used as terms of description and not of limitation. The use of such terms and expressions is not intended to exclude any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications may be made within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims should be looked to in order to cover all such equivalents.
Also, it should be noted that although the present invention has been described with reference to the current specific embodiments, it should be understood by those skilled in the art that the above embodiments are merely illustrative of the present invention, and various equivalent changes or substitutions may be made without departing from the spirit of the present invention, and therefore, it is intended that all changes and modifications to the above embodiments be included within the scope of the claims of the present application.

Claims (19)

1. A multi-view 3D object detection method based on an aerial view is characterized by comprising the following steps:
coding the multi-view picture by utilizing a residual error network and a characteristic pyramid to obtain multi-scale characteristics;
mapping the multi-scale features to the aerial view through a mapping relation to obtain aerial view features;
randomly initializing a query vector, constructing a plurality of subspaces through a first multi-head attention mechanism module, and projecting the query vector to the plurality of subspaces to obtain an initialization characteristic;
performing first residual error connection and regularization on the initialization features;
combining the features after the first residual error connection and regularization with the aerial view features by using a second multi-head attention mechanism module to obtain learning features; and
and performing second residual error connection and regularization on the learning features, and outputting the target detection category by using a first feedforward neural network module and outputting the size of a target detection frame by using a second feedforward neural network module.
2. The bird's eye view-based multi-view 3D object detection method of claim 1, wherein the step of encoding the multi-view picture using the residual network and the feature pyramid to obtain the multi-scale features comprises:
the residual error network extracts features of the multi-view image and performs up-sampling to obtain a plurality of layers of features which are sequentially arranged from a bottom layer to a high layer; and
and accumulating the multilayer characteristics output by the residual error network according to the characteristic mapping map by the characteristic pyramid, and outputting the multi-scale characteristics.
3. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the step of mapping the multi-scale features to the bird's eye view via a mapping relationship to obtain the bird's eye view features comprises:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features of different scales and converting them into a polar coordinate system, to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
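A minimal sketch of one way the mapping of claim 3 could be realised in PyTorch follows; the vertical mean-pooling, the reinterpretation of the channel axis as depth bins, the polar grid size, and the average-pooling used for downsampling are all illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn.functional as F

def to_polar_bev(feat, depth_bins=128, azimuth_bins=128):
    """feat: (views, C, H, W) image-plane features for one scale."""
    # Compress along the vertical (image-height) axis, keep the horizontal axis.
    compressed = feat.mean(dim=2)                              # (views, C, W)
    # Resample every scale onto one common polar-style (depth x azimuth) grid.
    same_dim = F.interpolate(compressed.unsqueeze(1),          # (views, 1, C, W)
                             size=(depth_bins, azimuth_bins),
                             mode="bilinear", align_corners=False)
    return same_dim.squeeze(1)                                 # (views, depth_bins, azimuth_bins)

scales = [torch.randn(6, 256, 56, 100), torch.randn(6, 256, 28, 50)]
bev_same_dim = torch.stack([to_polar_bev(s) for s in scales]).mean(dim=0)
bev_reduced = F.avg_pool2d(bev_same_dim, kernel_size=2)        # downsampled aerial view features
```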
4. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the step of mapping the multi-scale features to the bird's eye view via a mapping relationship to obtain the bird's eye view features comprises:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, and directly performing a dimension transformation to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
5. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the relationship between the input and output of the first multi-head attention mechanism module is as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
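For illustration, a minimal PyTorch sketch of equation (1) as used by the first multi-head attention mechanism module: the query vector is randomly initialised and K = V = Q are projected into several subspaces (heads). The query count, embedding width, and head count below are assumed values, not taken from the claim.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = k.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)             # normalized into a probability distribution
    return weights @ v                                    # Attention(Q, K, V)

num_queries, embed_dim, num_heads = 900, 256, 8
queries = torch.randn(num_queries, embed_dim)             # randomly initialized query vector
# Project into num_heads subspaces; here K and V are the same as Q.
heads = queries.view(num_queries, num_heads, embed_dim // num_heads).transpose(0, 1)
per_head = scaled_dot_product_attention(heads, heads, heads)
init_features = per_head.transpose(0, 1).reshape(num_queries, embed_dim)  # concatenated subspaces
```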
6. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the relationship between the input and the output of the second multi-headed attention mechanism module is as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the aerial view features; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning features.
7. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein the first or second feed-forward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feed-forward neural network is shown in formula (2):
FFN(x) = max(0, xW1 + b1) * W2 + b2    (2)
where x is the features after the second residual connection and regularization, W1 and W2 are weights, b1 and b2 are biases, and the max function takes the larger of 0 and xW1 + b1.
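A minimal PyTorch sketch of formula (2) for illustration; the feature dimension and hidden width below are assumed values.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, i.e. Linear -> ReLU -> Linear."""
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.linear1 = nn.Linear(dim, hidden)   # W1, b1
        self.linear2 = nn.Linear(hidden, dim)   # W2, b2

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

ffn = FeedForward()
out = ffn(torch.randn(900, 256))   # x: features after the second residual connection and regularization
```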
8. The bird's eye view-based multi-perspective 3D object detection method of claim 1, wherein outputting the object detection category with a first feed-forward neural network module and outputting the size of the object detection box with a second feed-forward neural network module comprises:
performing supervised learning on the first feedforward neural network through a loss module associated with a target detection category to obtain the target detection category;
and performing supervised learning on the second feedforward neural network through a loss module associated with a target detection frame to obtain the size of the target detection frame.
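One possible reading of claim 8 in code, for illustration only: the category head is supervised by a classification loss and the box head by a regression loss. The specific losses (cross-entropy, L1), the 7-value box encoding, and the tensor shapes are assumptions, not taken from the claim.

```python
import torch
import torch.nn.functional as F

num_queries, num_classes = 900, 10
cls_logits = torch.randn(num_queries, num_classes, requires_grad=True)  # first feed-forward network output
box_preds = torch.randn(num_queries, 7, requires_grad=True)             # second feed-forward network output

cls_targets = torch.randint(0, num_classes, (num_queries,))             # target detection categories
box_targets = torch.randn(num_queries, 7)                               # target detection box sizes

# Each head is trained under its own associated loss module.
loss = F.cross_entropy(cls_logits, cls_targets) + F.l1_loss(box_preds, box_targets)
loss.backward()
```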
9. The bird's eye view-based multi-view 3D object detection method of claim 1, characterized in that the multi-view pictures come respectively from six cameras of an autonomous vehicle: a front camera, a front-left camera, a front-right camera, a rear camera, a rear-left camera, and a rear-right camera.
10. A computer readable storage medium having stored thereon computer instructions, wherein the computer instructions are executable to perform the bird's eye view based multi-perspective 3D object detection method according to any one of claims 1 to 9.
11. A bird's-eye-view-based multi-perspective 3D object detection system comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the bird's-eye-view-based multi-perspective 3D object detection method of any one of claims 1 to 9.
12. A multi-perspective 3D object detection system based on an aerial view, the system comprising:
an encoding module for encoding the multi-view pictures to obtain multi-scale features;
an aerial view feature acquisition module for mapping the multi-scale features to an aerial view through a mapping relationship to obtain aerial view features; and
a conversion decoding module comprising an initial module and a learning module;
the initial module comprises:
a first multi-head attention mechanism module for constructing a plurality of subspaces, projecting the query vectors into the plurality of subspaces, and outputting the features concatenated from the plurality of initialized subspaces, namely the initialization features;
a first residual connection module for performing an identity mapping based on the query vectors and the initialization features and outputting features after the first residual connection; and
a first regularization module for regularizing the features after the first residual connection to obtain features after the first regularization;
the learning module includes:
a second multi-head attention mechanism module for combining the features after the first regularization with the aerial view features to obtain learning features;
a second residual connection module for performing an identity mapping on the learning features and outputting features after the second residual connection;
a second regularization module for regularizing the features after the second residual connection to obtain features after the second regularization;
a first feed-forward neural network that outputs the target detection category from the features after the second regularization, under the supervised learning of a loss module associated with the target detection category; and
a second feed-forward neural network that outputs the size of the target detection frame from the features after the second regularization, under the supervised learning of a loss module associated with the target detection frame.
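As an illustration of how the initial and learning modules of claim 12 compose, below is a minimal PyTorch sketch built from stock layers; the use of nn.MultiheadAttention and nn.LayerNorm as the attention and regularization modules, the single-linear-layer heads, and all dimensions are assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn

class ConversionDecodingModule(nn.Module):
    def __init__(self, dim=256, heads=8, num_classes=10, box_dim=7):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # first multi-head attention module
        self.norm1 = nn.LayerNorm(dim)                                          # first regularization module
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # second multi-head attention module
        self.norm2 = nn.LayerNorm(dim)                                          # second regularization module
        self.cls_head = nn.Linear(dim, num_classes)                             # first feed-forward network: category
        self.box_head = nn.Linear(dim, box_dim)                                 # second feed-forward network: box size

    def forward(self, queries, bev_features):
        init, _ = self.self_attn(queries, queries, queries)          # initial module: self-attention over queries
        x = self.norm1(queries + init)                                # first residual connection + regularization
        learned, _ = self.cross_attn(x, bev_features, bev_features)  # learning module: combine with aerial view features
        x = self.norm2(x + learned)                                   # second residual connection + regularization
        return self.cls_head(x), self.box_head(x)

decoder = ConversionDecodingModule()
queries = torch.randn(1, 900, 256)        # randomly initialized query vectors
bev = torch.randn(1, 128 * 128, 256)      # flattened aerial view features
cls_out, box_out = decoder(queries, bev)
```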
13. The bird's eye-view based multi-perspective 3D object detection system of claim 12, wherein the encoding module comprises:
a residual network for extracting features from the multi-view pictures and up-sampling them to obtain multi-layer features arranged sequentially from a bottom layer to a top layer; and
a feature pyramid for accumulating the multi-layer features according to the feature maps and outputting the multi-scale features.
14. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the mapping relationship is:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, to obtain compressed aerial view features of different scales;
resampling the compressed aerial view features of different scales and converting them into a polar coordinate system, to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
15. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the mapping relationship is:
compressing the multi-scale features along the vertical direction while retaining the horizontal dimension, and directly performing a dimension transformation to obtain aerial view features of the same dimension; and
downsampling the aerial view features of the same dimension to obtain aerial view features with reduced dimensions.
16. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the first multi-head attention mechanism module has a relationship of input and output as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q is the query vector, K is the vector to be searched, V is the content vector, and K and V are the same as Q; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the first multi-head attention mechanism module, i.e., the initialization features.
17. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the relationship of the inputs and outputs of the second multi-headed attention mechanism module is as shown in equation (1):
Attention(Q, K, V) = softmax(QK^T / √d_k) * V    (1)
wherein Q, K, and V are the inputs, Q and K are the features after the first residual connection and regularization, and V is the aerial view features; √d_k is a scaling factor, where d_k is the dimension of K; softmax is an activation function that normalizes QK^T / √d_k into a probability distribution; T denotes the transpose of K; and Attention(Q, K, V) is the output of the second multi-head attention mechanism module, i.e., the learning features.
18. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the first or second feed-forward neural network linearly transforms the features after the second residual connection and regularization, and the expression of the first or second feed-forward neural network is shown in equation (2):
FFN(x) = max(0, xW1 + b1) * W2 + b2    (2)
where x is the features after the second residual connection and regularization, W1 and W2 are weights, b1 and b2 are biases, and the max function takes the larger of 0 and xW1 + b1.
19. The bird's eye view-based multi-perspective 3D object detection system of claim 12, wherein the multi-perspective pictures come respectively from six cameras of an autonomous vehicle: a front camera, a front-left camera, a front-right camera, a rear camera, a rear-left camera, and a rear-right camera.
CN202210501805.9A 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view Pending CN114821505A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210501805.9A CN114821505A (en) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view
PCT/CN2022/114418 WO2023216460A1 (en) 2022-05-09 2022-08-24 Aerial view-based multi-view 3d object detection method, memory and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210501805.9A CN114821505A (en) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view

Publications (1)

Publication Number Publication Date
CN114821505A true CN114821505A (en) 2022-07-29

Family

ID=82514245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210501805.9A Pending CN114821505A (en) 2022-05-09 2022-05-09 Multi-view 3D target detection method, memory and system based on aerial view

Country Status (2)

Country Link
CN (1) CN114821505A (en)
WO (1) WO2023216460A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880555A (en) * 2023-02-07 2023-03-31 北京百度网讯科技有限公司 Target detection method, model training method, device, equipment and medium
CN116561534A (en) * 2023-07-10 2023-08-08 苏州映赛智能科技有限公司 Method and system for improving accuracy of road side sensor based on self-supervision learning
WO2023216460A1 (en) * 2022-05-09 2023-11-16 合众新能源汽车股份有限公司 Aerial view-based multi-view 3d object detection method, memory and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210390714A1 (en) * 2020-06-11 2021-12-16 Toyota Research Institute, Inc. Producing a bird's eye view image from a two dimensional image
CN111832655B (en) * 2020-07-16 2022-10-14 四川大学 Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN113011317B (en) * 2021-03-16 2022-06-14 青岛科技大学 Three-dimensional target detection method and detection device
CN113658100A (en) * 2021-07-16 2021-11-16 上海高德威智能交通系统有限公司 Three-dimensional target object detection method and device, electronic equipment and storage medium
CN113610044B (en) * 2021-08-19 2022-02-15 清华大学 4D millimeter wave three-dimensional target detection method and system based on self-attention mechanism
CN114218999A (en) * 2021-11-02 2022-03-22 上海交通大学 Millimeter wave radar target detection method and system based on fusion image characteristics
CN114821505A (en) * 2022-05-09 2022-07-29 合众新能源汽车有限公司 Multi-view 3D target detection method, memory and system based on aerial view

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023216460A1 (en) * 2022-05-09 2023-11-16 合众新能源汽车股份有限公司 Aerial view-based multi-view 3d object detection method, memory and system
CN115880555A (en) * 2023-02-07 2023-03-31 北京百度网讯科技有限公司 Target detection method, model training method, device, equipment and medium
CN116561534A (en) * 2023-07-10 2023-08-08 苏州映赛智能科技有限公司 Method and system for improving accuracy of road side sensor based on self-supervision learning
CN116561534B (en) * 2023-07-10 2023-10-13 苏州映赛智能科技有限公司 Method and system for improving accuracy of road side sensor based on self-supervision learning

Also Published As

Publication number Publication date
WO2023216460A1 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
Alonso et al. 3d-mininet: Learning a 2d representation from point clouds for fast and efficient 3d lidar semantic segmentation
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
Yang et al. Pixor: Real-time 3d object detection from point clouds
Yuan et al. RGGNet: Tolerance aware LiDAR-camera online calibration with geometric deep learning and generative model
Yin et al. Scale recovery for monocular visual odometry using depth estimated with deep convolutional neural fields
CN114821505A (en) Multi-view 3D target detection method, memory and system based on aerial view
CN108230235B (en) Disparity map generation system, method and storage medium
KR20230070253A (en) Efficient 3D object detection from point clouds
US11948310B2 (en) Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator
US11321859B2 (en) Pixel-wise residual pose estimation for monocular depth estimation
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115273002A (en) Image processing method, device, storage medium and computer program product
CN115578516A (en) Three-dimensional imaging method, device, equipment and storage medium
CN114494395A (en) Depth map generation method, device and equipment based on plane prior and storage medium
CN117037141A (en) 3D target detection method and device and electronic equipment
CN116844129A (en) Road side target detection method, system and device for multi-mode feature alignment fusion
CN114648639B (en) Target vehicle detection method, system and device
Camaioni et al. EpiDepth: a real-time monocular dense-depth estimation pipeline using generic image rectification
CN116486038A (en) Three-dimensional construction network training method, three-dimensional model generation method and device
US20230377180A1 (en) Systems and methods for neural implicit scene representation with dense, uncertainty-aware monocular depth constraints
US11908202B2 (en) Method and system of using a global transformer for efficient modeling of global context in point clouds
Alaba et al. Multi-sensor fusion 3D object detection for autonomous driving
CN115861601A (en) Multi-sensor fusion sensing method and device
CN111062479B (en) Neural network-based rapid model upgrading method and device
Zhang et al. A Vision-Centric Approach for Static Map Element Annotation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 314500 988 Tong Tong Road, Wu Tong Street, Tongxiang, Jiaxing, Zhejiang

Applicant after: United New Energy Automobile Co.,Ltd.

Address before: 314500 988 Tong Tong Road, Wu Tong Street, Tongxiang, Jiaxing, Zhejiang

Applicant before: Hezhong New Energy Vehicle Co.,Ltd.
