CN117132952A - Multi-camera-based bird's-eye-view vehicle perception system - Google Patents

Multi-camera-based bird's-eye-view vehicle perception system

Info

Publication number
CN117132952A
CN117132952A (application number CN202310880060.6A)
Authority
CN
China
Prior art keywords
task
encoder
module
head
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310880060.6A
Other languages
Chinese (zh)
Inventor
张云翔
姬永超
张秋磊
赵梓良
李博伦
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Machinery Equipment Research Institute
Original Assignee
Beijing Machinery Equipment Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Machinery Equipment Research Institute filed Critical Beijing Machinery Equipment Research Institute
Priority to CN202310880060.6A
Publication of CN117132952A
Legal status: Pending

Classifications

    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/292 Multi-camera tracking
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/40 Extraction of image or video features
    • G06V 10/803 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of input or preprocessed data
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/64 Three-dimensional objects
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30241 Trajectory
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06V 2201/07 Target detection
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a multi-camera-based bird's-eye-view (BEV) vehicle perception system. The bird's-eye-view vehicle perception system comprises a feature extraction module, a task encoder and a task head, wherein: the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for performing feature extraction on images under the BEV view angle generated from the images acquired by the multiple cameras; the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trajectory prediction encoder, and is used for encoding the image features based on a preset convolutional neural network; the task head comprises a 3D detection task head, a motion trajectory prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition on the image features to complete bird's-eye-view vehicle perception. The system effectively alleviates problems such as target occlusion and scene scale variation, and helps improve the accuracy of downstream perception tasks.

Description

Multi-camera-based bird's-eye-view vehicle perception system
Technical Field
The disclosure relates to the field of unmanned driving, and in particular to a multi-camera-based bird's-eye-view vehicle perception system.
Background
The 3D visual perception task is a key technology in the unmanned driving field and is widely applied in areas such as autonomous driving, emergency rescue, reconnaissance, and counter-terrorism and explosive disposal. The task enables unmanned operation by predicting 3D information such as the spatial position, size and pose of the objects contained in a scene. Benefiting from the rich color and texture features of images, convolutional neural networks for extracting image features have developed over decades and achieved significant results in many advanced visual tasks. Therefore, in 3D visual perception tasks, most existing methods convert 3D spatial points into 2D feature representations based on the front view or the bird's-eye view (BEV), and construct the corresponding visual perception networks on these 2D views. However, front-view-based 3D visual perception has the following problems. On the one hand, the front view has limited capability for representing 3D scenes, so image-based 3D visual perception performs much worse than 2D visual perception, which also indirectly indicates that front-view images are not a suitable data representation for 3D visual perception. On the other hand, fusing other types of data in the front view requires a large amount of computation and causes large precision loss, and the generalization capability and extensibility of network models trained on such image features are poor. In recent years, with the wide application of multi-modal fusion technology in the unmanned driving field, this disadvantage has become particularly obvious.
In the prior art, one view-angle conversion method for multi-view images is implemented as follows: image features of the multi-view images are extracted to obtain a feature map for each view, and the feature maps are used as values; local keys of each view image are constructed in the local 3D coordinate system of the corresponding camera view according to the feature maps; local queries of each view image are constructed in the local 3D coordinate system according to the transformation from the global coordinate system to each camera coordinate system; and the values, local keys and local queries, together with the global keys and global queries in the global coordinate system, are fed into the decoder of a Transformer network, which outputs the image features of the multi-view images in the global coordinate system. This method reduces the learning difficulty of the Transformer network and thereby improves view-conversion accuracy, but because of its Transformer-based architecture it requires a large amount of training data and has poor interpretability. Another prior-art scheme, a universal spatio-temporal fusion surround-view bird's-eye-view perception method, comprises the following steps: acquiring an image data set for training a neural network and defining the algorithm objective; establishing a virtual view model; extracting surround-view image features with a basic backbone network; establishing a temporal feature queue; fusing features with unified spatio-temporal fusion modeling; and outputting the prediction result with a head network. Compared with other perception models in the prior art, this method can effectively fuse the spatial relationships of the surround-view images while also fusing the temporal relationships of surround-view images at different moments, obtaining a better perception effect and faster perception speed by better fusing different time steps. However, this method can only detect vehicles around the ego vehicle and cannot predict the motion trajectories of surrounding targets.
Accordingly, there is a need for one or more approaches to address the above-described problems.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
It is an object of the present disclosure to provide a multiple camera based bird's eye view vehicle perception system that overcomes, at least in part, one or more of the problems due to the limitations and disadvantages of the related art.
According to one aspect of the present disclosure, there is provided a bird's eye view vehicle perception system based on multiple cameras, including a feature extraction module, a task encoder, a task head, wherein:
the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for carrying out feature extraction processing on images under BEV view angles generated based on images acquired by the multi-cameras;
the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trail prediction encoder, and is used for carrying out encoding processing on the image features extracted by the feature extraction module based on a preset convolutional neural network;
the task head comprises a 3D detection task head, a motion track prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition based on the encoded image features to complete bird's-eye-view vehicle perception.
In one exemplary embodiment of the disclosure, in the feature extraction module of the system, the skeleton network module is based on a RegNet deep neural network architecture, and the skeleton network module is used for constructing the deep neural network architecture for images under BEV viewing angles generated based on images acquired by multiple cameras.
In an exemplary embodiment of the disclosure, in the feature extraction module of the system, the multi-scale feature fusion module is based on a BiFPN architecture of a modified feature pyramid network, and the multi-scale feature fusion module is configured to perform multi-scale feature fusion processing on an image generated based on an image acquired by multiple cameras and under a BEV view angle.
In an exemplary embodiment of the present disclosure, in the feature extraction module of the system, the multi-camera fusion module is based on a Lift-Splat method, and the multi-camera fusion module is configured to perform multi-camera fusion processing on an image generated under a BEV view angle based on an image acquired by multiple cameras.
In an exemplary embodiment of the present disclosure, in the feature extraction module of the system, the time sequence fusion module is configured to perform coordinate system transformation processing on the image feature.
In an exemplary embodiment of the present disclosure, a task encoder of the system is configured to perform encoding processing on the image features extracted by the feature extraction module based on a preset convolutional neural network, and generate image features with preset resolution respectively.
In an exemplary embodiment of the present disclosure, the system further comprises:
among the image features with preset resolution generated by the task encoder, the resolution of the image features detected based on the target generated by the target detection encoder and the resolution of the image features predicted based on the motion trail generated by the motion trail prediction encoder are smaller than the resolution of the image features segmented based on the semantic map generated by the semantic map segmentation encoder.
In an exemplary embodiment of the present disclosure, in the task head of the system, the 3D detection task head is based on a CenterPoint detection head, and the 3D detection task head is used for predicting the width and height of a target and a Gaussian heat map of the probability of occurrence of the target.
In an exemplary embodiment of the present disclosure, in the task head of the system, the motion track prediction task head is based on a Shoot mode, and the motion track prediction task head is used for predicting template tracks of different targets.
In an exemplary embodiment of the disclosure, in the task header of the system, the semantic map segmentation task header is based on an HDMap manner, and the semantic map segmentation task header is used for performing semantic environment segmentation processing based on a semantic segmentation algorithm.
A multi-camera-based bird's-eye-view vehicle perception system in an exemplary embodiment of the present disclosure comprises a feature extraction module, a task encoder and a task head, wherein: the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for performing feature extraction on images under the BEV view angle generated from the images acquired by the multiple cameras; the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trajectory prediction encoder, and is used for encoding the image features based on a preset convolutional neural network; the task head comprises a 3D detection task head, a motion trajectory prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition on the image features to complete bird's-eye-view vehicle perception. The system effectively alleviates problems such as target occlusion and scene scale variation, and helps improve the accuracy of downstream perception tasks.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 illustrates a schematic block diagram of a multi-camera based overhead view vehicle perception system in accordance with an exemplary embodiment of the present disclosure;
fig. 2 illustrates a general block diagram of a solution for a multi-camera based overhead view vehicle perception system according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, etc. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more modules combining software and hardware, or in different networks and/or processor devices and/or microcontroller devices.
In this exemplary embodiment, a bird's eye view vehicle sensing system based on multiple cameras is provided first; referring to fig. 1, the multi-camera-based bird's eye view vehicle sensing system includes a feature extraction module 110, a task encoder 120, and a task head 130, wherein:
the feature extraction module 110 includes a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module, and a time sequence fusion module, where the feature extraction module 110 is configured to perform feature extraction processing on an image under a BEV view angle generated based on an image acquired by the multi-camera;
the task encoder 120 includes a semantic map segmentation encoder, a target detection encoder, and a motion trail prediction encoder, where the task encoder 120 is configured to perform encoding processing on the image features extracted by the feature extraction module 110 based on a preset convolutional neural network;
the task head 130 includes a 3D detection task head, a motion track prediction task head, and a semantic map segmentation task head, where the task head 130 is configured to perform preset task recognition based on the encoded image features, so as to complete bird's-eye-view vehicle perception.
A multi-camera-based bird's-eye-view vehicle perception system in an exemplary embodiment of the present disclosure comprises a feature extraction module 110, a task encoder 120 and a task head 130, wherein: the feature extraction module 110 comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is configured to perform feature extraction on images under the BEV view angle generated from the images acquired by the multiple cameras; the task encoder 120 comprises a semantic map segmentation encoder, a target detection encoder and a motion trajectory prediction encoder, and is configured to encode the image features based on a preset convolutional neural network; the task head 130 comprises a 3D detection task head, a motion trajectory prediction task head and a semantic map segmentation task head, and is configured to perform preset task recognition on the image features to complete bird's-eye-view vehicle perception. The system effectively alleviates problems such as target occlusion and scene scale variation, and helps improve the accuracy of downstream perception tasks.
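For concreteness, the sketch below illustrates one way the three parts could be composed in PyTorch. It is a minimal illustration only: the class name, the task names and the assumption that each part is an nn.Module are not taken from the patent.

import torch.nn as nn

# Minimal sketch of the three-part pipeline: feature extraction -> per-task
# encoders -> per-task heads. Names and shapes are illustrative assumptions.
class BEVPerceptionSystem(nn.Module):
    def __init__(self, feature_extractor: nn.Module,
                 task_encoders: dict, task_heads: dict):
        super().__init__()
        self.feature_extractor = feature_extractor          # backbone + multi-scale + multi-camera + temporal fusion
        self.task_encoders = nn.ModuleDict(task_encoders)   # e.g. {"det": ..., "traj": ..., "map": ...}
        self.task_heads = nn.ModuleDict(task_heads)

    def forward(self, multi_cam_images, ego_motion):
        # (B, N_cams, 3, H, W) camera images -> (B, C, H_bev, W_bev) BEV features
        bev_feat = self.feature_extractor(multi_cam_images, ego_motion)
        outputs = {}
        for name, encoder in self.task_encoders.items():
            task_feat = encoder(bev_feat)                    # task-specific resolution
            outputs[name] = self.task_heads[name](task_feat)
        return outputs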
Next, a bird's eye view vehicle sensing system based on multiple cameras in the present exemplary embodiment will be further described.
Embodiment one:
a multi-camera based bird's eye view vehicle perception system includes a feature extraction module 110, a task encoder 120, a task head 130, wherein:
the feature extraction module 110 includes a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module, and a time sequence fusion module, where the feature extraction module 110 is configured to perform feature extraction processing on an image under a BEV view angle generated based on an image acquired by the multi-camera.
In an embodiment of the present example, in the feature extraction module 110 of the system, the skeleton network module is based on a RegNet deep neural network architecture, and the skeleton network module is configured to construct a deep neural network architecture for an image under BEV viewing angles generated based on images acquired by multiple cameras.
In the embodiment of the present example, in the feature extraction module 110 of the system, the multi-scale feature fusion module is based on a BiFPN architecture of an improved feature pyramid network, and the multi-scale feature fusion module is used for performing multi-scale feature fusion processing on an image under a BEV view angle generated based on an image acquired by multiple cameras.
In the embodiment of the present example, in the feature extraction module 110 of the system, the multi-camera fusion module is based on a Lift-Splat method, and the multi-camera fusion module is configured to perform multi-camera fusion processing on an image generated based on an image acquired by multiple cameras and under a BEV viewing angle.
In the embodiment of the present example, in the feature extraction module 110 of the system, the time sequence fusion module is used to perform coordinate system transformation processing on the image features.
The task encoder 120 includes a semantic map segmentation encoder, a target detection encoder, and a motion trail prediction encoder, and the task encoder 120 is configured to encode the image features extracted by the feature extraction module 110 based on a preset convolutional neural network.
In this exemplary embodiment, the task encoder 120 of the system is configured to perform encoding processing on the image features extracted by the feature extraction module 110 based on a preset convolutional neural network, and generate image features with preset resolutions, respectively.
In an embodiment of the present example, the system further comprises:
among the image features with preset resolution generated by the task encoder 120, the resolution of the image features detected based on the target generated by the target detection encoder and the resolution of the image features predicted based on the motion track generated by the motion track prediction encoder are smaller than the resolution of the image features segmented based on the semantic map generated by the semantic map segmentation encoder.
The task head 130 includes a 3D detection task head, a motion track prediction task head, and a semantic map segmentation task head, where the task head 130 is configured to perform preset task recognition based on the encoded image features, so as to complete bird's-eye-view vehicle perception.
In the embodiment of the present example, in the task head 130 of the system, the 3D detection task head is based on a CenterPoint detection head, and the 3D detection task head is used for predicting the width and height of the target and a Gaussian heat map of the probability of occurrence of the target.
In the embodiment of the present example, in the task head 130 of the system, the motion track prediction task head is based on a Shoot mode, and the motion track prediction task head is used for predicting template tracks of different targets.
In the embodiment of the present example, in the task header 130 of the system, the semantic map segmentation task header is based on the HDMap manner, and the semantic map segmentation task header is used for performing semantic environment segmentation processing based on a semantic segmentation algorithm.
Embodiment two:
In an embodiment of the present example, the present disclosure adopts a purely vision-based BEV perception scheme to perceive BEV features. The overall block diagram of the technical solution is shown in figure 2.
The first part is the feature extraction module 110, which includes four sub-modules in total: a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module.
The skeleton (backbone) network module employs RegNet. RegNet is a deep neural network architecture proposed by Facebook AI Research. The design goal of RegNet is to improve computational efficiency and model performance while maintaining scalability and flexibility, and its design concept is to improve model performance by increasing the depth and width of the network. Unlike other neural network architectures, RegNet uses a method called network design space search to determine the depth and width of the network, which reduces the computational cost as much as possible while maintaining model performance. The network structure of RegNet is composed of multiple modules, each containing several convolution and pooling layers; these modules can be stacked as needed to build deeper and wider networks. RegNet also uses a channel attention technique that adaptively adjusts the number of channels of each convolutional layer to further improve model performance. RegNet achieves good performance under hardware conditions with different computational power.
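As an illustration of how such a backbone can be used, the following sketch pulls multi-scale feature maps from a publicly available RegNet implementation (torchvision). The specific variant (regnet_y_1_6gf), the input size and the return-node names are assumptions for illustration and are not specified by the patent.

import torch
from torchvision.models import regnet_y_1_6gf
from torchvision.models.feature_extraction import create_feature_extractor

backbone = regnet_y_1_6gf(weights=None)
# torchvision's RegNet organizes its stages under trunk_output.block1..block4
# (assumed node names); later stages give coarser, higher-level features.
extractor = create_feature_extractor(
    backbone,
    return_nodes={
        "trunk_output.block2": "c3",   # stride 8
        "trunk_output.block3": "c4",   # stride 16
        "trunk_output.block4": "c5",   # stride 32
    },
)

images = torch.randn(6, 3, 256, 704)   # e.g. a batch of 6 camera images
feats = extractor(images)
print({name: feat.shape for name, feat in feats.items()})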
The multi-scale fusion module adopts BiFPN, an improved version of the FPN (feature pyramid network) proposed in the EfficientDet paper. The main idea of BiFPN is to introduce bidirectional connections on top of FPN so as to better integrate feature information from different levels. Specifically, BiFPN introduces two branches at each feature level: one passes features down from the level above and the other passes features up from the level below, so that information can be better integrated across levels, improving the accuracy and efficiency of target detection. BiFPN also adjusts the feature weights between levels in an adaptive manner to better suit different target detection tasks: it automatically adjusts each weight according to the contribution of the corresponding feature level, thereby better balancing the feature information across levels.
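The adaptive weighting described above can be illustrated with BiFPN's fast normalized fusion. The sketch below shows a single fusion node only; the channel count and the conv/BN/SiLU block are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # learnable per-input weights
        self.eps = eps
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, inputs):
        # inputs: list of feature maps with identical shape (B, C, H, W)
        w = F.relu(self.weights)              # keep weights non-negative
        w = w / (w.sum() + self.eps)          # fast normalization (no softmax)
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return self.conv(fused)

# Example: fuse a same-level lateral map with an upsampled map from the level above.
p4 = torch.randn(1, 64, 32, 32)
p5_up = F.interpolate(torch.randn(1, 64, 16, 16), scale_factor=2, mode="nearest")
out = FastNormalizedFusion(num_inputs=2, channels=64)([p4, p5_up])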
The multi-camera fusion module adopts Lift-Splat. In the Lift stage, the pixels of each camera image need to be mapped into three-dimensional space. This process requires the camera's intrinsic and extrinsic parameters and the depth information of each pixel; a commonly used method is to back-project a pixel into the camera coordinate system through the intrinsic and extrinsic parameters, and then map it into three-dimensional space through the transformation from the camera coordinate system to the world coordinate system. In the Splat stage, the point cloud needs to be projected into a three-dimensional grid. In this process, the density and distribution of the point cloud need to be considered; a commonly used method is to project the point cloud into the grid according to certain rules and assign the attributes of the points (such as color and normal vectors) to the corresponding grid positions. In the Shoot stage, target detection is performed in the three-dimensional grid. The method typically used here is to divide the grid into several small cubes and then perform object detection on each cube; the detection can use a traditional two-dimensional target detection algorithm or a method based on a three-dimensional convolutional neural network.
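A minimal sketch of the Lift and Splat steps follows. The depth-bin count, the BEV grid size, and the precomputed cell_index tensor (assumed to be derived offline from the camera intrinsics/extrinsics) are assumptions for illustration, and sum-pooling stands in for the more elaborate pooling used in practice.

import torch.nn as nn

class LiftSplat(nn.Module):
    def __init__(self, in_ch=64, ctx_ch=64, depth_bins=48, bev_h=200, bev_w=200):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        self.depth_head = nn.Conv2d(in_ch, depth_bins, 1)  # per-pixel depth logits
        self.ctx_head = nn.Conv2d(in_ch, ctx_ch, 1)        # per-pixel context feature

    def forward(self, img_feat, cell_index):
        # img_feat:   (B, in_ch, H, W) image features from the backbone/FPN
        # cell_index: (B, D, H, W) long tensor giving, for every (pixel, depth bin),
        #             the BEV cell it falls into (precomputed from camera geometry)
        B = img_feat.shape[0]
        depth = self.depth_head(img_feat).softmax(dim=1)        # (B, D, H, W)
        ctx = self.ctx_head(img_feat)                           # (B, C, H, W)
        lifted = depth.unsqueeze(1) * ctx.unsqueeze(2)          # Lift: (B, C, D, H, W)
        C = lifted.shape[1]
        bev = lifted.new_zeros(B, C, self.bev_h * self.bev_w)
        idx = cell_index.reshape(B, 1, -1).expand(B, C, -1)     # (B, C, D*H*W)
        bev.scatter_add_(2, idx, lifted.reshape(B, C, -1))      # Splat by sum-pooling
        return bev.reshape(B, C, self.bev_h, self.bev_w)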
The time sequence fusion module uses the features of previous frames: it transforms those features into the coordinate system of the current vehicle using the IMU and the vehicle's motion information, and then concatenates them with the current features.
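One way to realize this warp-and-concatenate step is a 2D rigid transform of the previous BEV feature map followed by channel concatenation, as sketched below. The BEV extent and the sign conventions of the ego-motion parameters are illustrative assumptions.

import math
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, dx, dy, dyaw, bev_range_m=100.0):
    # prev_bev: (B, C, H, W) BEV features from the previous frame
    # dx, dy:   ego translation (meters) between frames; dyaw: heading change (rad)
    B = prev_bev.shape[0]
    cos, sin = math.cos(dyaw), math.sin(dyaw)
    # 2D rigid transform in normalized grid coordinates ([-1, 1] spans bev_range_m)
    tx, ty = 2.0 * dx / bev_range_m, 2.0 * dy / bev_range_m
    theta = prev_bev.new_tensor([[cos, -sin, tx],
                                 [sin,  cos, ty]]).unsqueeze(0).expand(B, -1, -1)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False, padding_mode="zeros")

# Usage: fuse by channel-wise concatenation, as described above.
cur_bev = torch.randn(1, 64, 200, 200)
prev_bev = torch.randn(1, 64, 200, 200)
fused = torch.cat([cur_bev, warp_prev_bev(prev_bev, dx=1.2, dy=0.0, dyaw=0.02)], dim=1)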
The second part is the task encoder 120. The task encoder 120 encodes the features extracted by the feature extraction module 110 using different convolutional neural networks. Since different tasks require different resolutions, the resolutions of the encoded outputs of the different encoders also differ: the resolution used for semantic map segmentation is higher than that used for object detection and motion trajectory prediction.
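A minimal sketch of such task-specific encoders is given below, where the segmentation branch keeps the full BEV resolution while the detection and trajectory branches downsample. Channel counts and strides are assumptions for illustration.

import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

task_encoders = nn.ModuleDict({
    # detection / trajectory prediction: a coarser BEV grid is sufficient
    "det":  nn.Sequential(conv_block(128, 128, 2), conv_block(128, 128, 1)),
    "traj": nn.Sequential(conv_block(128, 128, 2), conv_block(128, 128, 1)),
    # semantic map segmentation: keep the full BEV resolution
    "map":  nn.Sequential(conv_block(128, 128, 1), conv_block(128, 128, 1)),
})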
The third part is the task head 130. The 3D detection head uses the detection head of CenterPoint, which predicts the width and height of each target together with a Gaussian heat map of the probability of target occurrence; combining the two yields the final target position. The motion trajectory prediction task head uses the Shoot approach, which predicts template trajectories for different targets with a method similar to semantic segmentation. The semantic map segmentation task head uses the HDMap approach and applies a semantic segmentation algorithm to segment the semantic environment around the ego vehicle.
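The detection-head idea can be sketched as follows: a heat-map branch predicts per-class center probabilities and a second branch regresses box size, and targets are decoded as local maxima of the heat map. The class count, regression channels and the top-k decoding are illustrative assumptions rather than the exact CenterPoint configuration.

import torch.nn as nn
import torch.nn.functional as F

class CenterStyleHead(nn.Module):
    def __init__(self, in_ch=128, num_classes=10, reg_ch=3):  # reg_ch: e.g. w, l, h
        super().__init__()
        self.heatmap = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(in_ch, num_classes, 1))
        self.size = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(in_ch, reg_ch, 1))

    def forward(self, feat):
        hm = self.heatmap(feat).sigmoid()   # per-class center probability map
        size = self.size(feat)              # box-size regression per BEV cell
        return hm, size

def decode(hm, size, k=50):
    # Keep local maxima of the heat map and return the top-k candidate centers;
    # the caller maps the flat indices back to (class, BEV cell) and reads sizes there.
    keep = (hm == F.max_pool2d(hm, 3, stride=1, padding=1)).float()
    scores, idx = (hm * keep).flatten(1).topk(k, dim=1)
    return scores, idx, size.flatten(2)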
It should be noted that although in the above detailed description several modules or units of a multi-camera based bird's eye view vehicle perception system are mentioned, this division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, the above-described drawings are only schematic illustrations of processes included in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An aerial view vehicle perception system based on multiple cameras, which is characterized by comprising a feature extraction module, a task encoder and a task head, wherein:
the feature extraction module comprises a skeleton network module, a multi-scale feature fusion module, a multi-camera fusion module and a time sequence fusion module, and is used for carrying out feature extraction processing on images under BEV view angles generated based on images acquired by the multi-cameras;
the task encoder comprises a semantic map segmentation encoder, a target detection encoder and a motion trail prediction encoder, and is used for carrying out encoding processing on the image features extracted by the feature extraction module based on a preset convolutional neural network;
the task head comprises a 3D detection task head, a motion track prediction task head and a semantic map segmentation task head, and is used for performing preset task recognition based on the encoded image features to complete bird's-eye-view vehicle perception.
2. The system of claim 1, wherein in the feature extraction module of the system, the skeleton network module is based on RegNet deep neural network architecture, the skeleton network module is to construct a deep neural network architecture for images at BEV perspectives generated based on images acquired by multiple cameras.
3. The system of claim 2, wherein in the feature extraction module of the system, the multi-scale feature fusion module is based on a BiFPN architecture of a modified feature pyramid network, and the multi-scale feature fusion module is used for performing multi-scale feature fusion processing on images generated based on images acquired by multiple cameras and under BEV viewing angles.
4. The system of claim 3, wherein in the feature extraction module of the system, the multi-camera fusion module is based on a Lift-Splat method, and the multi-camera fusion module is configured to perform multi-camera fusion processing on an image generated at a BEV view angle based on an image acquired by the multiple cameras.
5. The system of claim 4, wherein the timing fusion module is configured to perform a coordinate system transformation on the image features in the feature extraction module of the system.
6. The system of claim 1, wherein a task encoder of the system is configured to encode the image features extracted by the feature extraction module based on a preset convolutional neural network, to generate image features of a preset resolution, respectively.
7. The system of claim 6, wherein the system further comprises:
among the image features with preset resolution generated by the task encoder, the resolution of the image features detected based on the target generated by the target detection encoder and the resolution of the image features predicted based on the motion trail generated by the motion trail prediction encoder are smaller than the resolution of the image features segmented based on the semantic map generated by the semantic map segmentation encoder.
8. The system of claim 1, wherein the 3D detection task head is based on a CenterPoint detection head, and wherein the 3D detection task head is configured to predict the width and height of a target and a Gaussian heat map of the probability of occurrence of the target.
9. The system of claim 1, wherein the motion trail prediction task head in the task head of the system is based on a Shoot mode, and the motion trail prediction task head is used for predicting template trails of different targets.
10. The system according to claim 1, wherein in task heads of the system, the semantic map segmentation task heads are based on an HDMap mode, and the semantic map segmentation task heads are used for performing semantic environment segmentation processing based on a semantic segmentation algorithm.
CN202310880060.6A 2023-07-18 2023-07-18 Multi-camera-based bird's-eye-view vehicle perception system Pending CN117132952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310880060.6A CN117132952A (en) Multi-camera-based bird's-eye-view vehicle perception system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310880060.6A CN117132952A (en) Multi-camera-based bird's-eye-view vehicle perception system

Publications (1)

Publication Number Publication Date
CN117132952A (en) 2023-11-28

Family

ID=88861828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310880060.6A Pending CN117132952A (en) Multi-camera-based bird's-eye-view vehicle perception system

Country Status (1)

Country Link
CN (1) CN117132952A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765226A (en) * 2024-02-22 2024-03-26 之江实验室 Track prediction method, track prediction device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination