CN117132952A - Multi-camera-based bird's-eye-view vehicle perception system - Google Patents

Multi-camera-based bird's-eye-view vehicle perception system

Info

Publication number
CN117132952A
CN117132952A (application number CN202310880060.6A)
Authority
CN
China
Prior art keywords
task
encoder
module
head
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310880060.6A
Other languages
Chinese (zh)
Inventor
张云翔
姬永超
张秋磊
赵梓良
李博伦
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Machinery Equipment Research Institute
Original Assignee
Beijing Machinery Equipment Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Machinery Equipment Research Institute filed Critical Beijing Machinery Equipment Research Institute
Priority to CN202310880060.6A
Publication of CN117132952A
Legal status: Pending

Classifications

    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/292 Multi-camera tracking
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/40 Extraction of image or video features
    • G06V 10/803 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of input or preprocessed data
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/64 Three-dimensional objects
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30241 Trajectory
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06V 2201/07 Target detection
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a multi-camera-based bird's-eye-view (BEV) vehicle perception system. The bird's-eye-view vehicle perception system comprises a feature extraction module, a task encoder and a task head, wherein: the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for performing feature extraction on images under the BEV view angle generated from the images acquired by the multiple cameras; the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trajectory prediction encoder, and is used for encoding the image features based on a preset convolutional neural network; the task head comprises a 3D detection task head, a motion trajectory prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition on the image features to complete bird's-eye-view vehicle perception. The system effectively alleviates problems such as target occlusion and scene scale variation, and helps improve the accuracy of downstream perception tasks.

Description

Multi-camera-based bird's-eye-view vehicle perception system
Technical Field
The disclosure relates to the field of unmanned driving, and in particular to a multi-camera-based bird's-eye-view vehicle perception system.
Background
The 3D visual perception task is a key technology in the unmanned driving field and is widely applied in areas such as autonomous driving, emergency rescue, reconnaissance, and counter-terrorism and explosive disposal. The task enables unmanned operation by predicting 3D information such as the spatial position, size and pose of the objects contained in a scene. Benefiting from the rich color and texture features of images, convolutional neural networks for extracting image features have developed over decades and achieved significant results in many advanced visual tasks. Therefore, in 3D visual perception tasks, most existing methods convert 3D spatial points into 2D feature representations based on the front view or the bird's-eye view (BEV), and construct the corresponding visual perception networks on these 2D views. However, front-view-based 3D visual perception has the following problems. On the one hand, the front view has limited capability for representing 3D scenes, so image-based 3D visual perception performs much worse than 2D visual perception, which also indirectly indicates that front-view images are not a suitable data representation for 3D visual perception. On the other hand, fusing other types of data in the front view requires a large amount of computation and causes large precision loss, and the generalization capability and extensibility of network models trained on such image features are poor. In recent years, with the wide application of multi-modal fusion technology in the unmanned driving field, this disadvantage has become particularly obvious.
In the prior art, one view-angle conversion method for multi-view images is implemented as follows: image features of the multi-view images are extracted to obtain a feature map for each view, and the feature maps are used as values; local keys of each view image are constructed in the local 3D coordinate system of the corresponding camera view according to the feature maps; local queries of each view image are constructed in the local 3D coordinate system according to the transformation from the global coordinate system to each camera coordinate system; and the values, local keys and local queries, together with the global keys and global queries in the global coordinate system, are fed into the decoder of a Transformer network, which outputs the image features of the multi-view images in the global coordinate system. This method reduces the learning difficulty of the Transformer network and thereby improves view-conversion accuracy, but because of its Transformer-based architecture it requires a large amount of training data and has poor interpretability. Another prior-art scheme, a universal spatio-temporal fusion surround-view bird's-eye-view perception method, comprises the following steps: acquiring an image data set for training a neural network and defining the algorithm objective; establishing a virtual view model; extracting surround-view image features with a basic backbone network; establishing a temporal feature queue; fusing features with unified spatio-temporal fusion modeling; and outputting the prediction result with a head network. Compared with other perception models in the prior art, this method can effectively fuse the spatial relationships of the surround-view images while also fusing the temporal relationships of surround-view images at different moments, obtaining a better perception effect and faster perception speed by better fusing different time steps. However, this method can only detect vehicles around the ego vehicle and cannot predict the motion trajectories of surrounding targets.
Accordingly, there is a need for one or more approaches to address the above-described problems.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
It is an object of the present disclosure to provide a multiple camera based bird's eye view vehicle perception system that overcomes, at least in part, one or more of the problems due to the limitations and disadvantages of the related art.
According to one aspect of the present disclosure, there is provided a bird's eye view vehicle perception system based on multiple cameras, including a feature extraction module, a task encoder, a task head, wherein:
the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for carrying out feature extraction processing on images under BEV view angles generated based on images acquired by the multi-cameras;
the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trail prediction encoder, and is used for carrying out encoding processing on the image features extracted by the feature extraction module based on a preset convolutional neural network;
the task head comprises a 3D detection task head, a motion track prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition based on the encoded image features to complete bird's-eye-view vehicle perception.
In one exemplary embodiment of the disclosure, in the feature extraction module of the system, the skeleton network module is based on a RegNet deep neural network architecture, and the skeleton network module is used for constructing the deep neural network architecture for images under BEV viewing angles generated based on images acquired by multiple cameras.
In an exemplary embodiment of the disclosure, in the feature extraction module of the system, the multi-scale feature fusion module is based on a BiFPN architecture of a modified feature pyramid network, and the multi-scale feature fusion module is configured to perform multi-scale feature fusion processing on an image generated based on an image acquired by multiple cameras and under a BEV view angle.
In an exemplary embodiment of the present disclosure, in the feature extraction module of the system, the multi-camera fusion module is based on a Lift-Splat method, and the multi-camera fusion module is configured to perform multi-camera fusion processing on an image generated under a BEV view angle based on an image acquired by multiple cameras.
In an exemplary embodiment of the present disclosure, in the feature extraction module of the system, the time sequence fusion module is configured to perform coordinate system transformation processing on the image feature.
In an exemplary embodiment of the present disclosure, a task encoder of the system is configured to perform encoding processing on the image features extracted by the feature extraction module based on a preset convolutional neural network, and generate image features with preset resolution respectively.
In an exemplary embodiment of the present disclosure, the system further comprises:
among the image features with preset resolution generated by the task encoder, the resolution of the image features detected based on the target generated by the target detection encoder and the resolution of the image features predicted based on the motion trail generated by the motion trail prediction encoder are smaller than the resolution of the image features segmented based on the semantic map generated by the semantic map segmentation encoder.
In an exemplary embodiment of the present disclosure, in the task head of the system, the 3D detection task head is based on a CenterPoint detection head, and the 3D detection task head is used for predicting the width and height of a target and a Gaussian heat map of the probability of occurrence of the target.
In an exemplary embodiment of the present disclosure, in the task head of the system, the motion track prediction task head is based on a Shoot mode, and the motion track prediction task head is used for predicting template tracks of different targets.
In an exemplary embodiment of the disclosure, in the task header of the system, the semantic map segmentation task header is based on an HDMap manner, and the semantic map segmentation task header is used for performing semantic environment segmentation processing based on a semantic segmentation algorithm.
A multi-camera-based bird's-eye-view vehicle perception system in an exemplary embodiment of the present disclosure comprises a feature extraction module, a task encoder and a task head, wherein: the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for performing feature extraction on images under the BEV view angle generated from the images acquired by the multiple cameras; the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trajectory prediction encoder, and is used for encoding the image features based on a preset convolutional neural network; the task head comprises a 3D detection task head, a motion trajectory prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition on the image features to complete bird's-eye-view vehicle perception. The system effectively alleviates problems such as target occlusion and scene scale variation, and helps improve the accuracy of downstream perception tasks.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a schematic block diagram of a multi-camera based overhead view vehicle perception system in accordance with an exemplary embodiment of the present disclosure;
fig. 2 illustrates a general block diagram of a solution for a multi-camera based overhead view vehicle perception system according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, etc. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more modules combining software and hardware, or in different networks and/or processor devices and/or microcontroller devices.
In this exemplary embodiment, a bird's eye view vehicle sensing system based on multiple cameras is provided first; referring to fig. 1, the multi-camera-based bird's eye view vehicle sensing system includes a feature extraction module 110, a task encoder 120, and a task head 130, wherein:
the feature extraction module 110 includes a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module, and a time sequence fusion module, where the feature extraction module 110 is configured to perform feature extraction processing on an image under a BEV view angle generated based on an image acquired by the multi-camera;
the task encoder 120 includes a semantic map segmentation encoder, a target detection encoder, and a motion trail prediction encoder, where the task encoder 120 is configured to perform encoding processing on the image features extracted by the feature extraction module 110 based on a preset convolutional neural network;
the task head 130 includes a 3D detection task head, a motion track prediction task head, and a semantic map segmentation task head, where the task head 130 is configured to perform preset task recognition based on the encoded image features, so as to complete bird's-eye-view vehicle perception.
A multi-camera-based bird's-eye-view vehicle perception system in an exemplary embodiment of the present disclosure comprises a feature extraction module 110, a task encoder 120 and a task head 130, wherein: the feature extraction module 110 comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is configured to perform feature extraction on images under the BEV view angle generated from the images acquired by the multiple cameras; the task encoder 120 comprises a semantic map segmentation encoder, a target detection encoder and a motion trajectory prediction encoder, and is configured to encode the image features based on a preset convolutional neural network; the task head 130 comprises a 3D detection task head, a motion trajectory prediction task head and a semantic map segmentation task head, and is configured to perform preset task recognition on the image features to complete bird's-eye-view vehicle perception. The system effectively alleviates problems such as target occlusion and scene scale variation, and helps improve the accuracy of downstream perception tasks.
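For concreteness, the sketch below illustrates one way the three parts could be composed in PyTorch. It is a minimal illustration only: the class name, the task names and the assumption that each part is an nn.Module are not taken from the patent.

import torch.nn as nn

# Minimal sketch of the three-part pipeline: feature extraction -> per-task
# encoders -> per-task heads. Names and shapes are illustrative assumptions.
class BEVPerceptionSystem(nn.Module):
    def __init__(self, feature_extractor: nn.Module,
                 task_encoders: dict, task_heads: dict):
        super().__init__()
        self.feature_extractor = feature_extractor          # backbone + multi-scale + multi-camera + temporal fusion
        self.task_encoders = nn.ModuleDict(task_encoders)   # e.g. {"det": ..., "traj": ..., "map": ...}
        self.task_heads = nn.ModuleDict(task_heads)

    def forward(self, multi_cam_images, ego_motion):
        # (B, N_cams, 3, H, W) camera images -> (B, C, H_bev, W_bev) BEV features
        bev_feat = self.feature_extractor(multi_cam_images, ego_motion)
        outputs = {}
        for name, encoder in self.task_encoders.items():
            task_feat = encoder(bev_feat)                    # task-specific resolution
            outputs[name] = self.task_heads[name](task_feat)
        return outputs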
Next, a bird's eye view vehicle sensing system based on multiple cameras in the present exemplary embodiment will be further described.
Embodiment one:
a multi-camera based bird's eye view vehicle perception system includes a feature extraction module 110, a task encoder 120, a task head 130, wherein:
the feature extraction module 110 includes a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module, and a time sequence fusion module, where the feature extraction module 110 is configured to perform feature extraction processing on an image under a BEV view angle generated based on an image acquired by the multi-camera.
In an embodiment of the present example, in the feature extraction module 110 of the system, the skeleton network module is based on a RegNet deep neural network architecture, and the skeleton network module is configured to construct a deep neural network architecture for an image under BEV viewing angles generated based on images acquired by multiple cameras.
In the embodiment of the present example, in the feature extraction module 110 of the system, the multi-scale feature fusion module is based on a BiFPN architecture of an improved feature pyramid network, and the multi-scale feature fusion module is used for performing multi-scale feature fusion processing on an image under a BEV view angle generated based on an image acquired by multiple cameras.
In the embodiment of the present example, in the feature extraction module 110 of the system, the multi-camera fusion module is based on a Lift-Splat method, and the multi-camera fusion module is configured to perform multi-camera fusion processing on an image generated based on an image acquired by multiple cameras and under a BEV viewing angle.
In the embodiment of the present example, in the feature extraction module 110 of the system, the time sequence fusion module is used to perform coordinate system transformation processing on the image features.
The task encoder 120 includes a semantic map segmentation encoder, a target detection encoder, and a motion trail prediction encoder, and the task encoder 120 is configured to encode the image features extracted by the feature extraction module 110 based on a preset convolutional neural network.
In this exemplary embodiment, the task encoder 120 of the system is configured to perform encoding processing on the image features extracted by the feature extraction module 110 based on a preset convolutional neural network, and generate image features with preset resolutions, respectively.
In an embodiment of the present example, the system further comprises:
among the image features with preset resolution generated by the task encoder 120, the resolution of the image features detected based on the target generated by the target detection encoder and the resolution of the image features predicted based on the motion track generated by the motion track prediction encoder are smaller than the resolution of the image features segmented based on the semantic map generated by the semantic map segmentation encoder.
The task head 130 includes a 3D detection task head, a motion track prediction task head, and a semantic map segmentation task head, where the task head 130 is configured to perform preset task recognition based on the encoded image features, so as to complete bird's-eye-view vehicle perception.
In the embodiment of the present example, in the task head 130 of the system, the 3D detection task head is based on a CenterPoint detection head, and the 3D detection task head is used for predicting the width and height of the target and a Gaussian heat map of the probability of occurrence of the target.
In the embodiment of the present example, in the task head 130 of the system, the motion track prediction task head is based on a Shoot mode, and the motion track prediction task head is used for predicting template tracks of different targets.
In the embodiment of the present example, in the task header 130 of the system, the semantic map segmentation task header is based on the HDMap manner, and the semantic map segmentation task header is used for performing semantic environment segmentation processing based on a semantic segmentation algorithm.
Embodiment two:
In an embodiment of the present example, the present disclosure adopts a purely vision-based BEV perception scheme to perceive BEV features. The overall block diagram of the technical solution is shown in figure 2.
The first part is the feature extraction module 110, which includes four sub-modules in total: a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module.
The skeleton (backbone) network module employs RegNet. RegNet is a deep neural network architecture proposed by Facebook AI Research. The design goal of RegNet is to improve computational efficiency and model performance while maintaining scalability and flexibility, and its design concept is to improve model performance by increasing the depth and width of the network. Unlike other neural network architectures, RegNet uses a method called network design space search to determine the depth and width of the network, which reduces the computational cost as much as possible while maintaining model performance. The network structure of RegNet is composed of multiple modules, each containing several convolution and pooling layers; these modules can be stacked as needed to build deeper and wider networks. RegNet also uses a channel attention technique that adaptively adjusts the number of channels of each convolutional layer to further improve model performance. RegNet achieves good performance under hardware conditions with different computational power.
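As an illustration of how such a backbone can be used, the following sketch pulls multi-scale feature maps from a publicly available RegNet implementation (torchvision). The specific variant (regnet_y_1_6gf), the input size and the return-node names are assumptions for illustration and are not specified by the patent.

import torch
from torchvision.models import regnet_y_1_6gf
from torchvision.models.feature_extraction import create_feature_extractor

backbone = regnet_y_1_6gf(weights=None)
# torchvision's RegNet organizes its stages under trunk_output.block1..block4
# (assumed node names); later stages give coarser, higher-level features.
extractor = create_feature_extractor(
    backbone,
    return_nodes={
        "trunk_output.block2": "c3",   # stride 8
        "trunk_output.block3": "c4",   # stride 16
        "trunk_output.block4": "c5",   # stride 32
    },
)

images = torch.randn(6, 3, 256, 704)   # e.g. a batch of 6 camera images
feats = extractor(images)
print({name: feat.shape for name, feat in feats.items()})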
The multi-scale fusion module adopts BiFPN, an improved version of the FPN (feature pyramid network) proposed in the EfficientDet paper. The main idea of BiFPN is to introduce bidirectional connections on top of FPN so as to better integrate feature information from different levels. Specifically, BiFPN introduces two branches at each feature level: one passes features down from the level above and the other passes features up from the level below, so that information can be better integrated across levels, improving the accuracy and efficiency of target detection. BiFPN also adjusts the feature weights between levels in an adaptive manner to better suit different target detection tasks: it automatically adjusts each weight according to the contribution of the corresponding feature level, thereby better balancing the feature information across levels.
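The adaptive weighting described above can be illustrated with BiFPN's fast normalized fusion. The sketch below shows a single fusion node only; the channel count and the conv/BN/SiLU block are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable per-input weights
        self.eps = eps
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, inputs):
        # inputs: list of feature maps with identical shape (B, C, H, W)
        w = F.relu(self.weights)              # keep weights non-negative
        w = w / (w.sum() + self.eps)          # fast normalization (no softmax)
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.conv(fused)

# Example: fuse a same-level lateral map with an upsampled map from the level above.
p4 = torch.randn(1, 64, 32, 32)
p5_up = F.interpolate(torch.randn(1, 64, 16, 16), scale_factor=2, mode="nearest")
out = FastNormalizedFusion(num_inputs=2, channels=64)([p4, p5_up])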
The multi-camera fusion module adopts Lift-Splat. In the Lift stage, the pixels of each camera image need to be mapped into three-dimensional space. This process requires the camera's intrinsic and extrinsic parameters and the depth information of each pixel; a commonly used method is to back-project a pixel into the camera coordinate system through the intrinsic and extrinsic parameters, and then map it into three-dimensional space through the transformation from the camera coordinate system to the world coordinate system. In the Splat stage, the point cloud needs to be projected into a three-dimensional grid. In this process, the density and distribution of the point cloud need to be considered; a commonly used method is to project the point cloud into the grid according to certain rules and assign the attributes of the points (such as color and normal vectors) to the corresponding grid positions. In the Shoot stage, target detection is performed in the three-dimensional grid. The method typically used here is to divide the grid into several small cubes and then perform object detection on each cube; the detection can use a traditional two-dimensional target detection algorithm or a method based on a three-dimensional convolutional neural network.
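A minimal sketch of the Lift and Splat steps follows. The depth-bin count, the BEV grid size, and the precomputed cell_index tensor (assumed to be derived offline from the camera intrinsics/extrinsics) are assumptions for illustration, and sum-pooling stands in for the more elaborate pooling used in practice.

import torch.nn as nn

class LiftSplat(nn.Module):
    def __init__(self, in_ch=64, ctx_ch=64, depth_bins=48, bev_h=200, bev_w=200):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        self.depth_head = nn.Conv2d(in_ch, depth_bins, 1)  # per-pixel depth logits
        self.ctx_head = nn.Conv2d(in_ch, ctx_ch, 1)        # per-pixel context feature

    def forward(self, img_feat, cell_index):
        # img_feat:   (B, in_ch, H, W) image features from the backbone/FPN
        # cell_index: (B, D, H, W) long tensor giving, for every (pixel, depth bin),
        #             the BEV cell it falls into (precomputed from camera geometry)
        B = img_feat.shape[0]
        depth = self.depth_head(img_feat).softmax(dim=1)        # (B, D, H, W)
        ctx = self.ctx_head(img_feat)                           # (B, C, H, W)
        lifted = depth.unsqueeze(1) * ctx.unsqueeze(2)          # Lift: (B, C, D, H, W)
        C = lifted.shape[1]
        bev = lifted.new_zeros(B, C, self.bev_h * self.bev_w)
        idx = cell_index.reshape(B, 1, -1).expand(B, C, -1)     # (B, C, D*H*W)
        bev.scatter_add_(2, idx, lifted.reshape(B, C, -1))      # Splat by sum-pooling
        return bev.reshape(B, C, self.bev_h, self.bev_w)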
The time sequence fusion module uses the features of previous frames: it transforms those features into the coordinate system of the current vehicle using the IMU and the vehicle's motion information, and then concatenates them with the current features.
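One way to realize this warp-and-concatenate step is a 2D rigid transform of the previous BEV feature map followed by channel concatenation, as sketched below. The BEV extent and the sign conventions of the ego-motion parameters are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, dx, dy, dyaw, bev_range_m=100.0):
    # prev_bev: (B, C, H, W) BEV features from the previous frame
    # dx, dy:   ego translation (meters) between frames; dyaw: heading change (rad)
    B = prev_bev.shape[0]
    cos, sin = math.cos(dyaw), math.sin(dyaw)
    # 2D rigid transform in normalized grid coordinates ([-1, 1] spans bev_range_m)
    tx, ty = 2.0 * dx / bev_range_m, 2.0 * dy / bev_range_m
    theta = prev_bev.new_tensor([[cos, -sin, tx],
                                 [sin,  cos, ty]]).unsqueeze(0).expand(B, -1, -1)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False, padding_mode="zeros")

# Usage: fuse by channel-wise concatenation, as described above.
cur_bev = torch.randn(1, 64, 200, 200)
prev_bev = torch.randn(1, 64, 200, 200)
fused = torch.cat([cur_bev, warp_prev_bev(prev_bev, dx=1.2, dy=0.0, dyaw=0.02)], dim=1)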
The second part is the task encoder 120. The task encoder 120 encodes the features extracted by the feature extraction module 110 using different convolutional neural networks. Since different tasks require different resolutions, the resolutions of the encoded outputs of the different encoders also differ: the resolution used for semantic map segmentation is higher than that used for object detection and motion trajectory prediction.
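A minimal sketch of such task-specific encoders is given below, where the segmentation branch keeps the full BEV resolution while the detection and trajectory branches downsample. Channel counts and strides are assumptions for illustration.

import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

task_encoders = nn.ModuleDict({
    # detection / trajectory prediction: a coarser BEV grid is sufficient
    "det":  nn.Sequential(conv_block(128, 128, 2), conv_block(128, 128, 1)),
    "traj": nn.Sequential(conv_block(128, 128, 2), conv_block(128, 128, 1)),
    # semantic map segmentation: keep the full BEV resolution
    "map":  nn.Sequential(conv_block(128, 128, 1), conv_block(128, 128, 1)),
})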
The third part is the task head 130. The 3D detection head uses the detection head of CenterPoint, which predicts the width and height of each target together with a Gaussian heat map of the probability of target occurrence; combining the two yields the final target position. The motion trajectory prediction task head uses the Shoot approach, which predicts template trajectories for different targets with a method similar to semantic segmentation. The semantic map segmentation task head uses the HDMap approach and applies a semantic segmentation algorithm to segment the semantic environment around the ego vehicle.
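The detection-head idea can be sketched as follows: a heat-map branch predicts per-class center probabilities and a second branch regresses box size, and targets are decoded as local maxima of the heat map. The class count, regression channels and the top-k decoding are illustrative assumptions rather than the exact CenterPoint configuration.

import torch.nn as nn
import torch.nn.functional as F

class CenterStyleHead(nn.Module):
    def __init__(self, in_ch=128, num_classes=10, reg_ch=3):  # reg_ch: e.g. w, l, h
        super().__init__()
        self.heatmap = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(in_ch, num_classes, 1))
        self.size = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(in_ch, reg_ch, 1))

    def forward(self, feat):
        hm = self.heatmap(feat).sigmoid()   # per-class center probability map
        size = self.size(feat)              # box-size regression per BEV cell
        return hm, size

def decode(hm, size, k=50):
    # Keep local maxima of the heat map and return the top-k candidate centers;
    # the caller maps the flat indices back to (class, BEV cell) and reads sizes there.
    keep = (hm == F.max_pool2d(hm, 3, stride=1, padding=1)).float()
    scores, idx = (hm * keep).flatten(1).topk(k, dim=1)
    return scores, idx, size.flatten(2)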
It should be noted that although in the above detailed description several modules or units of a multi-camera based bird's eye view vehicle perception system are mentioned, this division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An aerial view vehicle perception system based on multiple cameras, which is characterized by comprising a feature extraction module, a task encoder and a task head, wherein:
the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for carrying out feature extraction processing on images under BEV view angles generated based on images acquired by the multi-cameras;
the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trail prediction encoder, and is used for carrying out encoding processing on the image features extracted by the feature extraction module based on a preset convolutional neural network;
the task head comprises a 3D detection task head, a motion track prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition based on the encoded image features to complete bird's-eye-view vehicle perception.
2. The system of claim 1, wherein in the feature extraction module of the system, the skeleton network module is based on RegNet deep neural network architecture, the skeleton network module is to construct a deep neural network architecture for images at BEV perspectives generated based on images acquired by multiple cameras.
3. The system of claim 2, wherein in the feature extraction module of the system, the multi-scale feature fusion module is based on a BiFPN architecture of a modified feature pyramid network, and the multi-scale feature fusion module is used for performing multi-scale feature fusion processing on images generated based on images acquired by multiple cameras and under BEV viewing angles.
4. The system of claim 3, wherein in the feature extraction module of the system, the multi-camera fusion module is based on a Lift-Splat method, and the multi-camera fusion module is configured to perform multi-camera fusion processing on an image generated at a BEV view angle based on an image acquired by the multiple cameras.
5. The system of claim 4, wherein the timing fusion module is configured to perform a coordinate system transformation on the image features in the feature extraction module of the system.
6. The system of claim 1, wherein a task encoder of the system is configured to encode the image features extracted by the feature extraction module based on a preset convolutional neural network, to generate image features of a preset resolution, respectively.
7. The system of claim 6, wherein the system further comprises:
among the image features with preset resolution generated by the task encoder, the resolution of the image features detected based on the target generated by the target detection encoder and the resolution of the image features predicted based on the motion trail generated by the motion trail prediction encoder are smaller than the resolution of the image features segmented based on the semantic map generated by the semantic map segmentation encoder.
8. The system of claim 1, wherein the 3D detection task head is based on a CenterPoint detection head, and wherein the 3D detection task head is configured to predict the width and height of a target and a Gaussian heat map of the probability of occurrence of the target.
9. The system of claim 1, wherein the motion trail prediction task head in the task head of the system is based on a Shoot mode, and the motion trail prediction task head is used for predicting template trails of different targets.
10. The system according to claim 1, wherein in task heads of the system, the semantic map segmentation task heads are based on an HDMap mode, and the semantic map segmentation task heads are used for performing semantic environment segmentation processing based on a semantic segmentation algorithm.
CN202310880060.6A 2023-07-18 2023-07-18 Multi-camera-based bird's-eye-view vehicle perception system Pending CN117132952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310880060.6A CN117132952A (en) Multi-camera-based bird's-eye-view vehicle perception system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310880060.6A CN117132952A (en) Multi-camera-based bird's-eye-view vehicle perception system

Publications (1)

Publication Number Publication Date
CN117132952A (en) 2023-11-28

Family

ID=88861828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310880060.6A Pending CN117132952A (en) Multi-camera-based bird's-eye-view vehicle perception system

Country Status (1)

Country Link
CN (1) CN117132952A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765226A (en) * 2024-02-22 2024-03-26 之江实验室 Track prediction method, track prediction device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination