CN117173657A - Pre-training method for automatic driving perception model - Google Patents

Pre-training method for automatic driving perception model

Info

Publication number
CN117173657A
CN117173657A (application CN202311138196.6A)
Authority
CN
China
Prior art keywords
point cloud
data set
training
representing
automatic driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311138196.6A
Other languages
Chinese (zh)
Inventor
张铂
石博天
袁家康
窦民
闫翔超
李怡康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai AI Innovation Center
Original Assignee
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai AI Innovation Center filed Critical Shanghai AI Innovation Center
Priority to CN202311138196.6A priority Critical patent/CN117173657A/en
Publication of CN117173657A publication Critical patent/CN117173657A/en
Pending legal-status Critical Current


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a pre-training method for an automatic driving perception model. The method comprises the following steps: acquiring an existing point cloud data set, wherein the point cloud data set contains unlabeled data; for the unlabeled data, labeling different categories by utilizing a pseudo tag generator in combination with semi-supervised learning, and increasing the diversity of the point cloud data set through harness resampling and object rescaling to obtain a unified data set, wherein the object rescaling is used to enhance instance-level diversity and the harness resampling is used to enhance scene-level diversity; pre-training a perception model by using the unified data set and taking a set overall loss function as the optimization target; and, for the automatic driving task, performing classification perception on a target point cloud data set by utilizing the trained perception model. The invention improves the generalization capability of the pre-trained model on downstream data sets and enhances the accuracy of automatic driving.

Description

Pre-training method for automatic driving perception model
Technical Field
The invention relates to the technical field of unmanned driving (autonomous driving), in particular to a pre-training method for an automatic driving perception model.
Background
The 3D perception technology plays a very critical role in the field of autonomous driving and can help the vehicle perceive the surrounding environment. In recent years, some works have begun to study Self-supervised Pre-training (SS-PT) for automatic driving scenes. These works perform pre-training and fine-tuning on the same reference data set, so a model pre-trained in this way can only improve the performance on a single data set, cannot enable the network to learn a general characterization, and is difficult to apply across data sets. In contrast, an ideal Autonomous Driving Pre-training (AD-PT) would learn a general characterization from a diverse data set, thereby enhancing performance across a variety of different data sets.
Currently, most advanced LiDAR-based 3D object detection methods are typically trained and evaluated on a single dataset, but a model trained on one dataset does little to improve performance in new domains (e.g., different sensor settings or unseen cities) and thus has poor generalization capability. One long-term goal of the autonomous driving community is to develop a generalizable scene pre-training model that can be widely applied to different downstream tasks, which makes general pre-training in autonomous driving scenes an important challenge.
Pre-training (Pre-training) is a method that improves downstream multi-task, multi-dataset performance by training on large-scale data, allowing models to learn generic characterizations. In two-dimensional image scenes, pre-training has been fully explored, but general pre-training for three-dimensional point cloud data remains largely unexplored.
In the prior art, some researchers, inspired by 2D pre-training, have attempted to address pre-training in 3D scenes by means of Self-supervised Pre-training (SS-PT). Existing approaches focus mainly on using contrastive learning (Contrastive Learning) or masked autoencoders (Masked Autoencoder, MAE) to enhance the feature extraction capability of the model. Contrastive-learning-based methods use point clouds of different perspectives or temporally related frames as input and further construct positive and negative samples. For example, a consistency graph is constructed to find the correspondence of point clouds among different views, and a local-region-level contrastive learning strategy is further applied. As another example, MAE-based methods utilize different masking strategies and reconstruct the covered point cloud by designing a decoder. However, these methods are typically pre-trained and fine-tuned on the same data set, which only improves performance on that same data set.
In summary, existing 3D pre-training follows a self-supervised pre-training paradigm in which pre-training and fine-tuning are generally performed on the same data set, so a unified characterization cannot be learned and only the performance of the model on a single data set can be improved, resulting in poor generalization capability.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a pre-training method for an automatic driving perception model. The method comprises the following steps:
acquiring an existing point cloud data set, wherein the point cloud data set contains unlabeled data;
for the unlabeled data, labeling different categories by utilizing a pseudo tag generator in combination with semi-supervised learning, and increasing the diversity of the point cloud data set through harness resampling and object rescaling to obtain a unified data set, wherein the object rescaling is used for enhancing the diversity of an instance level, and the harness resampling is used for enhancing the diversity of a scene level;
pre-training a perception model by using the unified data set and taking a set overall loss function as an optimization target;
and for the automatic driving task, on a target point cloud data set, performing classification perception by utilizing the trained perception model.
Compared with the prior art, the method, for the first time, constructs a large-scale diversified automatic driving data set and pre-training task that enable the model to learn a general representation, which facilitates improving the performance of the model on a plurality of different downstream data sets. In addition, a category-aware pseudo tag strategy is designed to improve the accuracy of pseudo labeling, and LiDAR harness resampling and object rescaling strategies are utilized to improve the distribution diversity at the scene and object levels. Moreover, the invention provides an unknown instance learning strategy and a consistency loss to ensure that potential foreground areas are activated during pre-training, enhancing the feature extraction capability of the network.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a pre-training method for an autopilot awareness model in accordance with one embodiment of the present invention;
FIG. 2 is an overall process diagram of a pre-training method for an autopilot awareness model in accordance with one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
In order to realize general pre-training for automatic driving scenes, the invention constructs a large-scale diversified automatic driving pre-training data set and, based on this data set, proposes a set of unknown-instance learning strategies, thereby improving the precision on a plurality of downstream data sets. In general, in order for the model (or perception model) to learn a general representation of the automatic driving scene, a large-scale pre-training point cloud dataset with a diverse data distribution is first constructed, and the general representation is learned from this diverse pre-training dataset. Further, the automatic driving pre-training task is formulated as a semi-supervised problem, in which a small amount of labeled and a large amount of unlabeled point cloud data are used to generate a unified feature representation that can be directly applied to many baseline models and benchmark datasets, decoupling the automatic-driving-related pre-training process from the downstream fine-tuning tasks. With this design, the pre-trained model weights can be loaded into different baseline models and help the models improve performance across multiple downstream datasets.
Specifically, as shown in fig. 1 and 2, the provided pre-training method for the automatic driving perception model includes the following steps:
step S110, based on the existing point cloud data set, constructing a large-scale diversified automatic driving data set with the aim of improving the diversity of scene level and instance level as a unified data set.
To scale up the pre-training dataset, a large-scale diverse automatic driving dataset is built, for example, from the existing point cloud dataset ONCE. ONCE contains a small amount of labeled data and a large amount of unlabeled data, and the different classes of the ONCE unlabeled data are labeled with different models by combining the proposed class-aware pseudo-tag generator with a semi-supervised learning (Semi-supervised Learning) technique. Specifically, since a center-based detection head has better detection performance for small objects (such as pedestrians), while an anchor-based detection head has a better detection effect for other categories (such as automobiles and bicycles), pedestrians can be marked by using a center-based detector (such as CenterPoint), and automobiles and bicycles can be marked by using an anchor-based detector (such as PV-RCNN++). In order to further improve the accuracy of the labeling, a semi-supervised learning method (such as Mean Teacher) can be utilized.
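As an illustration of this class-aware pseudo-labeling, the following is a minimal Python sketch that routes detections of each class to the branch assigned to it and filters them by confidence. The detector output format, the class-to-branch assignment and the threshold values are assumptions used for illustration only, not values disclosed by the invention.

```python
# Minimal sketch of class-aware pseudo-label routing (illustrative assumptions).
from typing import Dict, List

CLASS_TO_BRANCH = {
    "Pedestrian": "center_based",   # e.g. a CenterPoint-style head
    "Car": "anchor_based",          # e.g. a PV-RCNN++-style head
    "Cyclist": "anchor_based",
}
SCORE_THRESHOLDS = {"Pedestrian": 0.5, "Car": 0.7, "Cyclist": 0.6}  # assumed values

def generate_pseudo_labels(center_preds: List[Dict], anchor_preds: List[Dict]) -> List[Dict]:
    """Keep a prediction only if it comes from the branch assigned to its class
    and its confidence exceeds the per-class threshold."""
    pseudo_labels = []
    for branch_name, preds in (("center_based", center_preds), ("anchor_based", anchor_preds)):
        for det in preds:  # det: {"label": str, "score": float, "box": [x, y, z, l, w, h, yaw]}
            if CLASS_TO_BRANCH.get(det["label"]) != branch_name:
                continue
            if det["score"] < SCORE_THRESHOLDS[det["label"]]:
                continue
            pseudo_labels.append(det)
    return pseudo_labels
```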
The diversity of the data is critical to pre-training, as highly diverse data can greatly enhance the generalization ability of the model. However, existing datasets are typically acquired by the same LiDAR sensor over a limited geographic area, which is detrimental to data diversity. In an automatic driving scenario, the differences between different data sets can be categorized into the scene level (e.g., the LiDAR harness, i.e., the number of beams) and the instance level (e.g., object size). Thus, in one embodiment, diversity is increased from the perspectives of the LiDAR harness and the object size, and a harness resampling strategy and an object rescaling strategy are presented.
For the harness resampling strategy, in order to obtain data of different harnesses (LiDAR beams), a distance image (or range image) is used as an intermediate representation for up-sampling and down-sampling of the point data. Specifically, given a LiDAR point cloud with n beams and m points per ring, the range image R of size n×m can be obtained by the following formula:
φ = arcsin(z/r), θ = arctan(x/y)  (1)
wherein φ and θ are the inclination angle and azimuth angle of the point cloud respectively, and r represents the range of the point. Each column and each row of the range image correspond to the same azimuth and the same inclination of the point cloud, respectively. The rows of the range image may then be interpolated or sampled, which can also be regarded as a resampling of the LiDAR beams. Finally, the range image is converted back into a point cloud, expressed as:
x = r cos(φ) sin(θ), y = r cos(φ) cos(θ), z = r sin(φ)  (2)
where x, y, z denote the Cartesian coordinates of the point.
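For clarity, the following is a minimal numpy sketch of the harness (beam) resampling procedure described above. The beam and column counts, the assumed vertical field of view and the linear row interpolation are illustrative assumptions rather than the exact implementation of the invention.

```python
# Minimal sketch of beam (harness) resampling via a range image (assumptions noted above).
import numpy as np

def points_to_range_image(points: np.ndarray, n_beams: int, n_cols: int) -> np.ndarray:
    """points: (N, 3) Cartesian coordinates -> (n_beams, n_cols) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    phi = np.arcsin(z / np.clip(r, 1e-6, None))      # inclination angle
    theta = np.arctan2(x, y)                         # azimuth angle
    rows = ((phi - phi.min()) / (phi.max() - phi.min() + 1e-6) * (n_beams - 1)).astype(int)
    cols = ((theta + np.pi) / (2 * np.pi) * (n_cols - 1)).astype(int)
    range_image = np.zeros((n_beams, n_cols), dtype=np.float32)
    range_image[rows, cols] = r                      # keep one range value per cell
    return range_image

def resample_beams(range_image: np.ndarray, target_beams: int) -> np.ndarray:
    """Interpolate or subsample the rows, i.e. resample the LiDAR beams."""
    src = np.linspace(0, range_image.shape[0] - 1, target_beams)
    low, high = np.floor(src).astype(int), np.ceil(src).astype(int)
    frac = (src - low)[:, None]
    return (1 - frac) * range_image[low] + frac * range_image[high]

def range_image_to_points(range_image: np.ndarray, v_fov=(-np.pi / 8, np.pi / 24)) -> np.ndarray:
    """Back-project with x = r cos(phi) sin(theta), y = r cos(phi) cos(theta), z = r sin(phi)."""
    n_beams, n_cols = range_image.shape
    phi = np.linspace(v_fov[0], v_fov[1], n_beams)[:, None]   # assumed vertical field of view
    theta = np.linspace(-np.pi, np.pi, n_cols)[None, :]
    r = range_image
    x, y, z = r * np.cos(phi) * np.sin(theta), r * np.cos(phi) * np.cos(theta), r * np.sin(phi)
    valid = r > 0
    return np.stack([x[valid], y[valid], z[valid]], axis=1)
```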
For the object rescaling strategy, since different three-dimensional data sets are collected at different locations, the distribution of object sizes is not uniform. To overcome this problem, an object rescaling mechanism is proposed in which the length, width and height of each object can be rescaled randomly. Specifically, given a bounding box and the points therein, the points are first converted to local coordinates, and then the coordinates of the points and the size of the bounding box are multiplied by a set scaling factor. Finally, the scaled points are converted into vehicle coordinates along with the bounding box. The quantities involved are as follows:
wherein the relevant quantities comprise the length, width and height of the bounding box, the three-dimensional coordinates of its center point, the coordinates of the target point c_x, c_y and c_z, the transformation matrix R, the predicted (rescaled) result, the scaling factor α, and the rotation angle θ_h of the bounding box.
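The object rescaling step can be illustrated with the following minimal numpy sketch. The box parameterization [cx, cy, cz, l, w, h, yaw] and the sampling range of the scaling factor α are assumptions used for illustration only.

```python
# Minimal sketch of object re-scaling: points inside a box are moved to local (box)
# coordinates, points and box size are scaled by a random factor, and everything is
# mapped back to vehicle coordinates. Box format and alpha range are assumptions.
import numpy as np

def rescale_object(points: np.ndarray, box: np.ndarray, alpha_range=(0.9, 1.1)):
    """points: (N, 3) points belonging to the object.
    box: (7,) = [cx, cy, cz, l, w, h, yaw] in vehicle coordinates."""
    cx, cy, cz, l, w, h, yaw = box
    alpha = np.random.uniform(*alpha_range)            # scaling factor
    cos_t, sin_t = np.cos(yaw), np.sin(yaw)
    rot = np.array([[cos_t, -sin_t, 0.0],
                    [sin_t,  cos_t, 0.0],
                    [0.0,    0.0,   1.0]])             # rotation about z by the box heading
    local = (points - np.array([cx, cy, cz])) @ rot    # vehicle -> local coordinates
    local *= alpha                                     # rescale the points
    new_box = np.array([cx, cy, cz, l * alpha, w * alpha, h * alpha, yaw])
    rescaled = local @ rot.T + np.array([cx, cy, cz])  # local -> vehicle coordinates
    return rescaled, new_box
```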
In summary, in the process of constructing the unified data set, a category-aware pseudo tag strategy is designed to improve the accuracy of pseudo labeling, and the LiDAR harness resampling and object rescaling strategies are utilized to improve the data distribution diversity at the scene and object levels.
Step S120, pre-training the perception model to learn the universal characterization by using the unified data set and using the set overall loss function as an optimization target.
With step S110, a unified pre-training data set is obtained, which improves the diversity at the scene level and the instance level. However, unlike two-dimensional or vision-language pre-training datasets (which cover a large number of categories), the pseudo tag dataset of the present invention has only a limited number of category tags (e.g., vehicles, pedestrians, and bicycles). Also, in order to obtain accurate pseudo annotations, a high confidence threshold is set when generating the pseudo annotations, which inevitably ignores some difficult instances. Thus, these ignored difficult instances, as well as categories not included in the pre-training dataset (e.g., barriers in the nuScenes dataset), would be suppressed during the pre-training process.
To alleviate this problem, it is necessary to consider both the pre-training-related instances and some lower-scoring unknown instances during pre-training. From a new perspective, pre-training is treated as an open-set learning problem. Unlike conventional open-set detection, which aims at detecting unknown instances, the object of the present invention is to activate as many foreground regions as possible during the pre-training phase. Thus, a dual-branch unknown instance learning head is proposed to avoid treating potential foreground instances as background. In addition, a consistency loss is utilized to ensure the consistency of the computed corresponding foreground regions.
Specifically, the overall structure of the model is shown in fig. 2 and includes a voxel feature extractor, a three-dimensional backbone (or 3D backbone network) with sparse convolution, a two-dimensional backbone (or 2D backbone network), a dense detection head (Dense head), and the like. A given point cloud P ∈ R^{N×(3+d)} is first converted into different views by different data augmentation methods Γ1 and Γ2, where N represents the number of points and d represents other information acquired by the sensor. Voxel features are then extracted through the three-dimensional backbone and mapped to the Bird's-Eye-View (BEV) space. Thereafter, dense features generated by the 2D backbone network can be obtained, and finally the dense features are input to the dense detection head with the unknown instance learning head.
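The two-view forward pass described above can be summarized by the following high-level sketch. The callables standing in for the voxel feature extractor, the 3D/2D backbones and the dense detection head are placeholders, not the invention's concrete modules.

```python
# High-level sketch of the two-view pre-training forward pass (placeholder modules).
def pretraining_forward(points, voxelizer, backbone_3d, backbone_2d, dense_head,
                        augment_1, augment_2):
    """points: (N, 3 + d) raw LiDAR point cloud. Returns the dense-head outputs of the two views."""
    outputs = []
    for augment in (augment_1, augment_2):      # two augmented views of the same scene
        view = augment(points)
        voxel_feats = voxelizer(view)           # voxel feature extraction
        bev_feats = backbone_3d(voxel_feats)    # sparse 3D backbone mapped to bird's-eye view
        dense_feats = backbone_2d(bev_feats)    # 2D backbone -> dense BEV features
        outputs.append(dense_head(dense_feats)) # dense detection head with the unknown-instance branch
    return outputs
```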
Regarding the unknown instance learning head, consider the unknown instances: background proposal regions with relatively high target scores that are ignored during the pre-training phase but may be critical to downstream tasks, where the target scores are obtained by the Region Proposal Network (RPN). However, since these unknown instances contain a large number of background areas, directly treating them as foreground instances during pre-training would cause the backbone network to activate a large number of background areas. To overcome this problem, a dual-branch head is utilized as a committee to discover which regions can effectively be used as foreground instances.
Specifically, given the region of interest (Region of Interest, RoI) features of the two branches and their corresponding bounding boxes, where N is the number of RoI features and C represents the dimension of each feature, the M features with the highest scores and their corresponding bounding boxes are first selected from each branch. Then, in order to obtain the positional correspondence between the activation areas of the two branches, the distance between the center of the i-th selected box of one branch and the center of the j-th selected box of the other branch is computed; when this distance is smaller than a threshold τ, the two boxes are regarded as the same potential foreground instance. Once the corresponding features of the different input views are obtained in this way, these unknown instances are updated to foreground instances.
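The committee-style matching of unknown instances described above can be sketched as follows. The value of M (top-scoring RoIs kept per branch) and the distance threshold τ are illustrative assumptions.

```python
# Minimal sketch of cross-branch RoI matching by box-center distance (assumed M and tau).
import numpy as np

def match_unknown_instances(boxes_a, scores_a, boxes_b, scores_b, top_m=50, tau=1.0):
    """boxes_*: (N, 7) RoI boxes [cx, cy, cz, l, w, h, yaw]; scores_*: (N,) objectness scores.
    Returns index pairs (i, j) of RoIs matched across the two branches."""
    keep_a = np.argsort(-scores_a)[:top_m]               # top-M RoIs of branch A
    keep_b = np.argsort(-scores_b)[:top_m]               # top-M RoIs of branch B
    centers_a = boxes_a[keep_a, :3]
    centers_b = boxes_b[keep_b, :3]
    dists = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=-1)  # (M, M)
    pairs = np.argwhere(dists < tau)                      # pairs closer than the threshold
    return [(int(keep_a[i]), int(keep_b[j])) for i, j in pairs]
```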
Further, after obtaining the corresponding activation features of the different branches, a consistency loss is utilized to ensure the consistency of the corresponding features: the discrepancy between the matched activation features of the two branches is penalized and averaged over the batch, where B is the batch size and K is the number of corresponding activation features.
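A minimal sketch of such a consistency loss is given below, assuming the matched activation features of the two branches have already been gathered per sample; the use of a mean squared distance is an assumption about the exact form of the loss.

```python
# Minimal sketch of the consistency loss over matched features (form assumed).
import torch

def consistency_loss(feats_view1: torch.Tensor, feats_view2: torch.Tensor) -> torch.Tensor:
    """feats_view*: (B, K, C) matched activation features of the two branches."""
    # squared distance per pair, averaged over the B samples and K matched pairs
    return ((feats_view1 - feats_view2) ** 2).sum(dim=-1).mean()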
In one embodiment, the overall loss function set for the pre-training process combines the classification loss and the regression loss of the dense detection head (Dense head) with the consistency loss. The classification loss and regression loss can be existing loss types, such as mean square error loss, mean absolute error loss, etc., and the invention is not limited thereto.
In summary, in the model pre-training stage, an unknown instance learning strategy and a consistency loss are designed to ensure that potential foreground areas are activated during pre-training, thereby enhancing the feature extraction capability of the network.
Step S130, for the target downstream task, classification perception is carried out by utilizing the trained perception model.
After model pre-training is completed, fine-tuning may also be performed on other automatic-driving-related datasets for downstream tasks, e.g., Waymo, nuScenes and KITTI, etc., to further enhance generalization ability. Different baseline detectors may also be used in the fine-tuning process, such as SECOND, PV-RCNN, PV-RCNN++ and CenterPoint, etc.
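Loading the pre-trained weights into a baseline detector before fine-tuning can be sketched as follows. The checkpoint path, the checkpoint key layout and the detector object are assumptions, not a specific framework's API.

```python
# Minimal sketch of initializing a downstream detector from pre-trained weights (assumed layout).
import torch

def load_pretrained_backbone(detector: torch.nn.Module, ckpt_path: str = "ad_pt_pretrained.pth"):
    """Copy matching backbone parameters from a pre-trained checkpoint and keep the
    detector's task-specific heads randomly initialized for fine-tuning."""
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    state_dict = checkpoint.get("model_state", checkpoint)      # assumed checkpoint layout
    missing, unexpected = detector.load_state_dict(state_dict, strict=False)
    print(f"loaded pre-trained weights: {len(missing)} missing, {len(unexpected)} unexpected keys")
    return detector
```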
By utilizing the model after pre-training or fine-tuning, downstream tasks such as automatic driving under various scenes can be realized, and the model can be adapted to different point cloud data sets. It should be noted that the model training or fine-tuning process of the invention can be performed offline on a server or in the cloud, and the obtained pre-trained model is then embedded into an electronic device to realize tasks such as real-time automatic driving. The electronic device may be a terminal device or a server; the terminal device includes any terminal device such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, or an intelligent wearable device. The server includes, but is not limited to, an application server or a Web server, and may be a stand-alone server, a cluster server, a cloud server, etc.
In order to further verify the effect of the invention, relevant experimental verification was performed. Experimental results show that, compared with existing 3D self-supervised pre-training methods, the invention provides a general-characterization pre-training method and achieves higher target-domain detection precision in fine-tuning experiments on different reference data sets (including Waymo, nuScenes, KITTI) and different baseline detectors (including SECOND, PV-RCNN, PV-RCNN++, CenterPoint). By loading the pre-trained model, a greater improvement in accuracy is achieved on different benchmarks, e.g., 3.41%, 8.45% and 4.25% on Waymo, nuScenes and KITTI, respectively.
In summary, the present invention proposes an ideal autonomous driving pre-training (AD-PT), i.e., the pre-trained model can be adapted to different downstream data sets. The advantages of the present invention over the prior art are mainly reflected in the following aspects: the pre-training method based on a large-scale diversified data set enables pre-training to effectively help improve performance on a plurality of downstream reference data sets; by constructing a large-scale diversified automatic driving data set, the diversity of the data set is improved; and by designing the unknown instance learning head and the consistency loss, the generalization capability of the pre-training on downstream data sets is improved.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disk Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, and mechanically encoded devices such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++, Python, and the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, which electronic circuitry can execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A pre-training method for an automatic driving perception model comprises the following steps:
acquiring an existing point cloud data set, wherein the point cloud data set contains unlabeled data;
for the unlabeled data, labeling different categories by utilizing a pseudo tag generator in combination with semi-supervised learning, and increasing the diversity of the point cloud data set through harness resampling and object rescaling to obtain a unified data set, wherein the object rescaling is used for enhancing the diversity of an instance level, and the harness resampling is used for enhancing the diversity of a scene level;
pre-training a perception model by using the unified data set and taking a set overall loss function as an optimization target;
for the automatic driving task, on a target point cloud data set, classification perception is performed by using the trained perception model.
2. The method of claim 1, wherein the harness resampling comprises:
for a point cloud with n harnesses (beams) and m points per ring, a range image is used as an intermediate variable for up-sampling and down-sampling of the point data, the range image being expressed as:
φ=arcsin(z/r)
θ=arctan(x/y)
wherein phi is the inclination angle of the point cloud, theta is the azimuth angle of the point cloud, r represents the range of the point cloud, and each column and each row of the range image respectively correspond to the same azimuth angle and inclination angle of the point cloud;
interpolating or sampling a row of the range image;
reconvert the range image into a point cloud, expressed as:
x=rcos(φ)sin(θ)
y=rcos(φ)cos(θ)
z=rsin(φ)
where x, y, z denote the Cartesian coordinates of the point.
3. The method of claim 1, wherein the object rescaling comprises:
given a bounding box and points therein, firstly converting the points into local coordinates, and then multiplying the coordinates of the points and the size of the bounding box by a set scaling factor; finally, the scaled points are converted into vehicle coordinates together with the bounding box, and the related formulas are expressed as follows:
wherein,representing the length of the bounding box,/-, and>representing the width of the bounding box +.>Representing the high,/-of the bounding box>And->Representing the three-dimensional coordinates of the center point, c x ,c y And c z Representing coordinates of the target point, R representing the transformation matrix, < >>Represents the predicted result, alpha represents the scaling factor, theta h Representing the rotation angle.
4. The method of claim 1, wherein the perception model comprises a voxel feature extractor, a three-dimensional backbone network, a two-dimensional backbone network, and a dense detection head; for a given point cloud, the point cloud is first converted into different views; then, voxel features are extracted through the three-dimensional backbone network and mapped to the bird's-eye-view space; dense features generated by the two-dimensional backbone network are then obtained, and finally the dense features are input into the unknown instance learning head.
5. The method of claim 4, wherein, in pre-training the perception model, determining foreground instances by using the dual-branch head as a committee comprises the following steps:
given the region of interest (RoI) features of the two branches and their corresponding bounding boxes, where N is the number of RoI features and C represents the dimension of each feature, selecting the M features with the highest scores and their corresponding bounding boxes from each branch;
calculating the box-center distance between the selected boxes of the two branches, and obtaining the feature correspondence between the branches accordingly;
wherein the distance is computed between the i-th box center of one branch and the j-th box center of the other branch, τ is a distance threshold, and these unknown instances are updated to foreground instances when the corresponding features of the different input views are obtained.
6. The method of claim 1, wherein, for the pseudo tag generator, pedestrians are marked with a center-based detector and cars and bicycles are marked with an anchor-based detector.
7. The method of claim 5, wherein the overall loss function is set to combine a total loss composed of the classification loss of the dense detection head, the regression loss of the dense detection head, and the consistency loss, wherein the consistency loss is computed over B samples and K corresponding activation features, B being the batch size and K being the number of corresponding activation features.
8. The method of claim 1, further comprising performing fine-tuning on a plurality of reference data sets and a plurality of baseline detectors after pre-training the perception model.
9. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor realizes the steps of the method according to any of claims 1 to 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 8 when the computer program is executed.
CN202311138196.6A 2023-09-05 2023-09-05 Pre-training method for automatic driving perception model Pending CN117173657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311138196.6A CN117173657A (en) 2023-09-05 2023-09-05 Pre-training method for automatic driving perception model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311138196.6A CN117173657A (en) 2023-09-05 2023-09-05 Pre-training method for automatic driving perception model

Publications (1)

Publication Number Publication Date
CN117173657A true CN117173657A (en) 2023-12-05

Family

ID=88933136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311138196.6A Pending CN117173657A (en) 2023-09-05 2023-09-05 Pre-training method for automatic driving perception model

Country Status (1)

Country Link
CN (1) CN117173657A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392633A (en) * 2023-12-11 2024-01-12 安徽蔚来智驾科技有限公司 Target detection method, computer-readable storage medium and intelligent device
CN117392633B (en) * 2023-12-11 2024-03-26 安徽蔚来智驾科技有限公司 Target detection method, computer-readable storage medium and intelligent device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination