WO2024055551A1

WO2024055551A1 - Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle

Info

Publication number: WO2024055551A1
Application number: PCT/CN2023/082809
Authority: WO
Inventors: 刘浩
Original assignee: 北京京东乾石科技有限公司
Priority date: 2022-09-14
Filing date: 2023-03-21
Publication date: 2024-03-21
Also published as: CN115482391A

Abstract

The present disclosure relates to the technical field of driverless vehicles, and provides a point cloud feature extraction network model training method, a point cloud feature extraction method, an apparatus, and a driverless vehicle. The point cloud feature extraction network model training method comprises: performing first encoding on a sample point cloud frame sequence by using a first feature extraction network model to obtain an encoded feature map of each sample point cloud frame in the sample point cloud frame sequence; according to the encoded feature maps of a plurality of adjacent sample point cloud frames, determining a predicted feature map of the next sample point cloud frame following the plurality of adjacent sample point cloud frames; determining a loss function value according to the predicted feature map and the encoded feature map of the next sample point cloud frame following the plurality of adjacent sample point cloud frames; and training the first feature extraction network model according to the loss function value. By means of the steps above, self-supervised learning of a point cloud feature extraction network model is realized, so that the cost of data annotation is reduced, and the performance of a trained feature extraction model is improved.

Description

Point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle

Cross-references to related applications

This application is based on the application with CN application number 202211115103.3 and the filing date is September 14, 2022, and claims its priority. The disclosure content of the CN application is hereby incorporated into this application as a whole.

Technical field

The present disclosure relates to the field of computer vision technology, especially to the field of unmanned driving, and in particular to a point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle.

Background technique

Currently, unmanned driving equipment is used to automatically transport people or objects from one location to another. Unmanned driving equipment collects environmental information through sensors on the equipment and completes automatic transportation. Logistics and transportation using unmanned delivery vehicles controlled by unmanned driving technology has greatly improved the convenience of production and life and saved labor costs.

In autonomous driving tasks, in order to ensure operational safety, obstacles that may hinder driving must be detected and identified, so that reasonable avoidance actions can be made based on different obstacle types and states. Currently, the most mature detection solution in autonomous driving is point cloud detection. The training of detection models usually uses supervised learning. In this process, the performance of the model is limited by the quantity of data collected and the quality of annotation. In order to obtain a high-performance detection model, it is often necessary to use a large amount of annotated data to train the network. However, the labor cost and long cycle of data collection and annotation are high, which is not conducive to model iteration. In contrast, using self-supervised learning does not require data labeling.

Contents of the invention

A technical problem to be solved by this disclosure is to provide a point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle.

According to a first aspect of the present disclosure, a point cloud feature extraction network model training method is proposed, including: using a first feature extraction network model to perform a first encoding on a sample point cloud frame sequence to obtain the sample point cloud frame The coding feature map of the sample point cloud of each frame in the sequence; according to the coding feature map of the adjacent multi-frame sample point cloud, determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud; Determine the loss function value according to the prediction feature map and its encoding feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud; The first feature extraction network model is trained according to the loss function value.

In some embodiments, according to the encoding feature map of the adjacent multi-frame sample point cloud, determining the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud includes: using a second feature extraction network model, perform second encoding on the coding feature maps of the adjacent multi-frame sample point clouds respectively to obtain the intermediate feature map of the adjacent multi-frame sample point clouds; The feature maps are fused to obtain a fused feature map; the fused feature map is decoded to obtain a predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.

In some embodiments, fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain the fused feature map includes: determining the said The feature point matching relationship between adjacent multi-frame sample point clouds; according to the feature point matching relationship, the intermediate feature maps of the adjacent multi-frame sample point cloud are fused to obtain a fused feature map.

In some embodiments, according to the encoding feature map of the adjacent multi-frame sample point cloud, determining the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud includes: based on the adjacent multi-frame sample point cloud The coding feature map of the point cloud determines the feature point matching relationship between the adjacent multi-frame sample point clouds; according to the feature point matching relationship, the coding feature map of the adjacent multi-frame sample point cloud is fused, To obtain a fusion feature map; according to the fusion feature map, determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.

In some embodiments, determining the feature point matching relationship between the adjacent multi-frame sample point clouds based on the intermediate feature maps of the adjacent multi-frame sample point clouds includes: based on the adjacent multi-frame sample points The intermediate feature map of the point cloud is used to calculate the correlation of the feature points between the adjacent multi-frame sample point clouds; according to the correlation of the feature points between the adjacent multi-frame sample point clouds, the adjacent multi-frame sample point clouds are determined. Feature point matching relationship between two frame sample point clouds.

In some embodiments, the adjacent multi-frame sample point clouds are two adjacent frame sample point clouds, and the intermediate feature map of the adjacent multi-frame sample point cloud includes: The first intermediate feature map corresponding to the first frame, and the second intermediate feature map corresponding to the second frame in the two adjacent frame sample point clouds; and, the sample point cloud based on the adjacent multiple frames The intermediate feature map, calculating the correlation of the feature points between the adjacent multi-frame sample point clouds includes: calculating each feature point on the first intermediate feature map, and the specified value on the second intermediate feature map. The correlation degree of the feature points within the range, the specified range is the neighborhood range of the feature points of the first intermediate feature map; according to the correlation degree, the matching relationship of the feature points between the sample point clouds of the two adjacent frames is determined .

In some embodiments, the adjacent multi-frame sample point clouds are two adjacent frame sample point clouds; the intermediate feature maps of the two adjacent frame sample point clouds are fused according to the feature point matching relationship. , to obtain the fused feature map includes: according to the matching relationship of the feature points, matching the intermediate feature maps of the sample point clouds of the two adjacent frames The feature points are spliced together, and the spliced feature map is used as the fused feature map.

In some embodiments, determining the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map includes: in the next frame sample Between the predicted feature map and the encoded feature map of the point cloud, the Euclidean distance between feature points with the same position index is calculated; the loss function value is calculated based on the Euclidean distance between the feature points with all position indexes.

In some embodiments, the first feature extraction network model is a shared weight encoder. The shared weight encoder includes multiple encoding modules, each encoding module is used to encode the sample point cloud frame sequence. One frame is encoded.

In some embodiments, the encoding module includes: a convolutional neural network and a self-attention network.

In some embodiments, the method further includes: converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multi-frame sample point cloud.

In some embodiments, converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data includes: converting the original feature data of the multi-frame sample point cloud into bird's-eye view BEV feature data.

In some embodiments, the sample point cloud frame sequence consists of multiple frames of sample point clouds that are continuous in time series; and/or the number of frames of the sample point cloud included in the sample point cloud frame sequence is greater than or equal to 3, and Less than or equal to 5.

In some embodiments, the second feature extraction network model includes: an attention encoding module, used to perform a second encoding on the encoding feature maps of the adjacent multi-frame sample point clouds; an attention decoding module, used to perform a second encoding on the adjacent multi-frame sample point clouds; The fused feature map is decoded to obtain the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.

According to a second aspect of the present disclosure, a point cloud feature extraction method is provided, including: obtaining a sequence of point cloud frames to be processed; based on the first feature extraction network model trained using the above feature extraction network model training method, The point cloud frame sequence to be processed is encoded to obtain the feature map of the point cloud frame sequence to be processed.

In some embodiments, obtaining the point cloud frame sequence to be processed includes: obtaining the original data of the point cloud to be processed in multiple frames; converting the original feature data of the point cloud to be processed in the multiple frames into bird's-eye view BEV feature data to obtain the A sequence of point cloud frames to be processed consisting of bird's-eye view feature data of multiple frames of point clouds to be processed.

According to a third aspect of the present disclosure, a target detection method is proposed. The feature map of the point cloud frame sequence to be processed is extracted according to the aforementioned point cloud feature extraction method. The target detection is performed based on the feature map of the point cloud frame sequence to be processed.

According to the fourth aspect of the present disclosure, a point cloud semantic segmentation method is proposed, including: extracting a feature map of the point cloud frame sequence to be processed according to the aforementioned point cloud feature extraction method; and based on the feature map of the point cloud frame sequence to be processed, Perform point cloud semantic segmentation.

According to a fifth aspect of the present disclosure, a device is proposed, including: a module for performing the point cloud feature extraction network model training method as described above, or a module for performing the point cloud feature extraction method as described above, Or, a module for performing the above-mentioned target detection method, or a module for performing the above-mentioned point cloud semantic segmentation method.

According to a sixth aspect of the present disclosure, an electronic device is proposed, including: a memory; and, a processor coupled to the memory, the processor being configured to perform the above points based on instructions stored in the memory The cloud feature extraction network model training method is either the above-mentioned point cloud feature extraction method, or the above-mentioned target detection method, or the above-mentioned point cloud semantic segmentation method.

According to a seventh aspect of the present disclosure, a computer-readable storage medium is proposed, on which computer program instructions are stored. When the instructions are executed by a processor, the above-mentioned point cloud feature extraction network model training method is implemented, or the above-mentioned point cloud feature extraction network model training method is implemented. Cloud feature extraction method, or the above-mentioned target detection method, or the above-mentioned point cloud semantic segmentation method.

According to a sixth aspect of the present disclosure, an unmanned vehicle is also proposed, including the above device or electronic equipment.

Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and, together with the description, serve to explain principles of the disclosure.

The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:

Figure 1 is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure;

Figure 2 is a schematic flowchart of a point cloud feature extraction network model training method according to other embodiments of the present disclosure;

Figure 3 is a schematic structural diagram of a first feature extraction network model according to some embodiments of the present disclosure;

Figure 4a is a schematic flowchart of the steps of determining a prediction feature map according to some embodiments of the present disclosure;

Figure 4b is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure;

Figure 5 is a schematic flowchart of a point cloud feature extraction method according to some embodiments of the present disclosure;

Figure 6 is a schematic structural diagram of a point cloud feature extraction network model training device according to some embodiments of the present disclosure;

Figure 7 is a schematic structural diagram of a point cloud feature extraction device according to some embodiments of the present disclosure;

Figure 8 is a schematic structural diagram of a point cloud feature extraction network model training device or a point cloud feature extraction device or a target detection device or a point cloud semantic segmentation device according to some embodiments of the present disclosure;

Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure;

Figure 10 is a schematic structural diagram of an autonomous vehicle according to some embodiments of the present disclosure;

Figure 11 is a schematic three-dimensional structural diagram of an autonomous vehicle according to some embodiments of the present disclosure.

Detailed ways

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these examples do not limit the scope of the disclosure unless otherwise specifically stated.

At the same time, it should be understood that, for convenience of description, the dimensions of various parts shown in the drawings are not drawn according to actual proportional relationships.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application or uses.

Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be considered part of the authorized specification.

In all examples shown and discussed herein, any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters refer to similar items in the following figures, so that once an item is defined in one figure, it does not need further discussion in subsequent figures.

In order to make the purpose, technical solutions and advantages of the present disclosure more clear, the present disclosure will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.

Recently, the academic community has made several attempts at self-supervision in the field of image classification, and has achieved good results, even surpassing traditional supervised learning. However, in the field of autonomous driving, especially the self-supervised learning method in the field of lidar point cloud detection, there is currently little research, and the entire academic community and industry are still in a very early stage.

In related technology, a method for learning point cloud features based on self-supervision is proposed. In this method, multiple frames of continuous point clouds are projected onto the corresponding RGB images respectively, and then the optical flow method is used to find moving objects on the RGB images, and the point clouds corresponding to the moving objects are obtained, and then the point cloud features are learned. This method has the following shortcomings: 1. The calibration requirements for lidar and cameras are very high; 2. At the edge of the object, there is a high probability that it cannot be correctly projected. In addition, since the projection of the point cloud to the RGB image is a cone, It is possible that some point clouds will overlap after being projected onto the RGB image, thus affecting model performance. Third, since continuous frame point clouds and RGB images need to be used, the annotation time and economic cost are high.

In view of this, the present disclosure proposes a point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle, which only need to use data of one modality and do not require calibration between radar and cameras. This can realize self-supervised learning of the point cloud feature extraction network model, which not only reduces the cost of data annotation, but also improves the performance of the trained point cloud feature extraction network model.

Figure 1 is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure. As shown in Figure 1, the point cloud feature extraction network model training method of some embodiments of the present disclosure includes:

Step S110: Use the first feature extraction network model to perform first encoding on the sample point cloud frame sequence to obtain the encoded feature map of each frame of the sample point cloud in the sample point cloud frame sequence.

In some embodiments, the point cloud feature extraction network model training method is executed by a point cloud feature extraction network model training device.

In some embodiments, the sample point cloud frame sequence consists of multiple frames of sample point clouds that are sequential in time series. For example, the sample point cloud frame sequence consists of 3, 4, 5, 6, or other frame number sample point clouds that are consecutive in time series.

Since the point cloud collection method in actual autonomous driving scenarios is mostly continuous frame collection, in the embodiment of the present disclosure, by making the sample point cloud frame sequence consist of continuous point cloud frames, this solution is consistent with the actual scene, and the features can be further improved. Extract the feature extraction capabilities of the network model.

In some embodiments, the number of frames of the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5. By making the number of frames in the sample point cloud frame sequence between 3 and 5 frames, the problem of difficulty in matching objects at the edge of the "Region of Interests" (RoI) caused by too long frame sequences can be alleviated. Among them, the object here refers to a broad concept, which can be any target in the scene, such as trees, buildings, etc., regardless of the type that needs to be specifically identified in autonomous driving (such as cars, pedestrians, bicycles, etc.). Through the above processing, the network can learn the low-level information of various objects in the autonomous driving scene, such as shape, size, etc., so that the learned network has broader feature extraction capabilities.

In some embodiments, each sample point cloud in the sample point cloud frame sequence is original point cloud feature data collected by lidar. For example, the original point cloud feature data includes the three-dimensional position coordinates of each point cloud point and the reflection intensity.

In some embodiments, each sample point cloud in the sample point cloud frame sequence is two-dimensional image feature data obtained by processing the original point cloud feature data. In these embodiments, the point cloud feature extraction network model training method also includes: converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data to obtain the two-dimensional image feature data composed of the multi-frame sample point cloud. A sequence of sample point cloud frames. For example, convert the original features of multi-frame sample point clouds into The feature data is converted into Bird's Eye View (BEV) feature data.

In some embodiments, the first feature extraction network model is a shared weight encoder. The shared weight encoder includes a plurality of encoding modules, each encoding module is used to encode one frame in the sample point cloud frame sequence.

For example, when the sample point cloud frame sequence includes BEV data of 4 frames of sample point clouds from t ₀ , t ₁ , t ₂ to t ₃ , the BEV data of these 4 frames of sample point clouds are simultaneously input into the first feature extraction network model. The four coding modules (specifically coding modules 1 to 4), for example, input the BEV data of the sample point cloud at time t ₀ into coding module 1, input the BEV data of the sample point cloud at time t ₁ into coding module 2, and input the BEV data of the sample point cloud at time t 1 into coding module 2. The BEV data of the sample point cloud at time _{t 2} is input into the encoding module 3, and the BEV data of the sample point cloud at time t ₃ is input into the encoding module 4.

Step S120: Determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud based on the encoding feature map of the adjacent multi-frame sample point cloud.

In some embodiments, based on the encoding feature maps of the sample point clouds of the two adjacent frames, the predicted feature map of the sample point cloud of the next frame located after the sample point clouds of the two adjacent frames is determined.

For example, the sample point cloud frame sequence consists of three consecutive frames of sample point clouds, which are the sample point cloud frames at time t ₀ , t ₁ , and t ₂ respectively. According to the encoding feature map of the sample point cloud frame at time t ₀ and t ₁ The encoding feature map of the sample point cloud frame at time t2 determines the predicted feature map of the sample point cloud frame at time _t2 .

For example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at time t ₀ , t ₁ , t ₂ , and t ₃ respectively. According to the encoding feature map of the sample point cloud frame at time t ₀ and the coding feature map of the sample point cloud frame at time t ₁ , determine the prediction feature map of the sample point cloud frame at time t ₂ ; based on the coding feature map of the sample point cloud frame at time t ₁ and the sample point cloud frame at time t ₂ The encoding feature map is used to determine the predicted feature map of the sample point cloud frame at time t ₃ .

In some embodiments, the predicted feature map of the sample point cloud of the next frame located after the sample point cloud of three adjacent frames or more is determined based on the encoding feature map of the sample point cloud of three adjacent frames or more.

For example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at t ₀ , t ₁ , t ₂ , and t ₃ respectively. According to the three times t ₀ , t ₁ , and t ₂ The encoding feature map of the sample point cloud frame is determined to determine the predicted feature map of the sample point cloud frame at time t ₃ .

In some embodiments, step S120 includes: using a second feature extraction network model to perform second encoding on the coded feature maps of adjacent multi-frame sample point clouds, respectively, to obtain intermediate feature maps of adjacent multi-frame sample point clouds; Fusion of intermediate feature maps of adjacent multi-frame sample point clouds to obtain a fused feature map; decoding the fused feature map to obtain predictions of the next frame of sample point clouds located after the adjacent multi-frame sample point clouds Feature map.

In some embodiments, the second feature extraction network model includes an attention encoding module and an attention decoding module. Among them, the attention coding module is used to perform second coding on the coding feature maps of adjacent multi-frame sample point clouds; The attention decoding module is used to decode the fused feature map to obtain the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.

In some embodiments, the intermediate feature maps of adjacent multi-frame sample point clouds are fused according to the following method: based on the intermediate feature maps of adjacent multi-frame sample point clouds, the feature points between adjacent multi-frame sample point clouds are determined. Matching relationship; according to the matching relationship of feature points, the intermediate feature maps of adjacent multi-frame sample point clouds are fused to obtain a fused feature map.

In other embodiments, step S120 includes: determining the feature point matching relationship between adjacent multiple frame sample point clouds based on the coding feature maps of adjacent multiple frame sample point clouds; based on the feature point matching relationship, determining the matching relationship between adjacent multiple frame sample point clouds. The coding feature maps of the frame sample point clouds are fused to obtain a fusion feature map; based on the fusion feature map, the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point clouds is determined.

Step S130: Determine the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map.

In some embodiments, the predicted feature map of a frame of sample point cloud is determined through step S120. In these embodiments, the loss function value is determined based on the predicted feature map of the sample point cloud of this frame and its encoding feature map.

For example, the sample point cloud frame sequence consists of three consecutive frames of sample point clouds, which are the sample point cloud frames at time t ₀ , t ₁ , and t ₂ respectively. According to the encoding characteristics of the sample point cloud frames at time t ₀ and t ₁ After determining the prediction feature map of the sample point cloud frame at time t ₂ , the loss function value is determined based on the encoding feature map of the sample point cloud frame _at time t 2 and the prediction feature map of the sample point cloud frame at time t ₂ .

In some embodiments, the predicted feature map of the multi-frame sample point cloud is determined through step S120. In these embodiments, the loss function value is determined based on the predicted feature map of the multi-frame sample point cloud and its encoding feature map.

For example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at t ₀ , t ₁ , t ₂ , and t ₃ respectively. According to the sample point cloud frames at t ₀ and t ₁ After determining the prediction feature map of the sample point cloud frame at time t ₂ based on the coding feature map, and determining the prediction feature map of the sample point cloud frame at time t ₃ based on the coding feature maps of the sample point cloud frames at time t ₁ and t ₂ , The value of the loss function is determined based on the coding feature map and prediction feature map of the sample point cloud frame at time t ₂ and the coding feature map and prediction feature map of the sample point cloud frame at time t ₃ .

In some embodiments, the value of the loss function is calculated as follows: calculating the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and the feature point corresponding to the same position index in its encoded feature map. The loss function value is calculated based on the Euclidean distance between the feature points of all position indexes.

In some embodiments, a mean squared error (MSE) loss function is used to measure the consistency between the predicted feature map and the encoded feature map. The goal of model training is to minimize the mean square error loss function value.

For example, when determining the loss function value based on the predicted feature map K ₂ and its encoding feature map F ₂ of a frame sample point cloud, the loss function value is calculated according to the following formula:

Among them, MSE represents the loss function value, x(i,j)∈K ₂ , y(i,j)∈F ₂ , i, j are the position index on the feature map respectively, m×n represents the predicted feature map K ₂ and The size of the encoding feature map F ₂ , ‖x(i,j)-y(i,j)‖ ² , represents the square of the Euclidean distance between the predicted feature map K ₂ and its encoding feature map F ₂ .

In other embodiments of the present disclosure, when determining the loss function value based on the prediction feature map of the multi-frame sample point cloud and its coding feature map, the prediction feature map of each frame sample point cloud and its coding feature map can first be calculated based on the above calculation formula. The loss function value of the feature map is then determined based on the loss function value of the multi-frame sample point cloud to determine the total loss function value.

For example, when determining the loss function value based on the coding feature map and prediction feature map of the sample point cloud frame at time t ₂ and t ₃ , the first step is to determine the first step based on the coding feature map and prediction feature map of the sample point cloud frame at time t ₂ . The loss function value determines the second loss function value based on the encoding feature map and the prediction feature map of the sample point cloud frame at time t ₃ , and then determines the total loss function value based on the first loss function value and the second loss function value.

Step S140: Train the first feature extraction network model according to the loss function value.

In step S140, the feature extraction network model is updated according to the loss function value. Repeat steps S110 to S140 until the training end condition is reached, for example, the training step length reaches a preset value (for example, 2 million steps).

In the embodiment of the present disclosure, the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps. Different from the method of using image optical flow for self-supervision in related technologies, the embodiments of the present disclosure only need to use point cloud data in one modality, and use local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without the need for The calibration between radar and cameras not only reduces the cost of data annotation, but also improves the performance of the trained feature extraction model. Furthermore, after the feature extraction network model is trained through the above steps, the feature extraction network model can be used as a backbone network for specific visual tasks, such as three-dimensional target detection, point cloud semantic segmentation, etc., thereby improving the target detection results or point detection results. Accuracy of cloud semantic segmentation results.

Figure 2 is a schematic flowchart of a point cloud feature extraction network model training method according to other embodiments of the present disclosure. As shown in Figure 2, point cloud feature extraction network model training methods in other embodiments of the present disclosure include:

Step S210: Convert the original data of three consecutive frames of sample point clouds into BEV data.

In the embodiment of the present disclosure, the sample point cloud frame sequence is composed of three consecutive time-series point clouds as an example for explanation. In practical applications, the sample point cloud frame sequence can be expanded into a time series sequence of more frames.

In some embodiments, step S210 includes: subjecting the original data of three consecutive frames of sample point clouds to voxelization to obtain corresponding voxelized feature data, and then converting the voxelized feature data into a bird's-eye view. A two-dimensional image from the (BEV) perspective, that is, BEV data. In specific implementation, there are many methods to convert the original point cloud data into BEV data, such as using the method of generating pseudo images in the PointPillars method, or downsampling along the Z-axis direction. The PointPillars method is a voxel-based three-dimensional target detection algorithm. Its main idea is to convert the three-dimensional point cloud into a two-dimensional pseudo image so that target detection can be performed using two-dimensional target detection.

For example, first filter the original point cloud data according to the target area of interest. For example, in the radar coordinate system, the original point cloud data collected from the radar will be located at x∈[-30,30], y∈[-15,15], Take out the point cloud data within the range of z∈[-1.8,0.8]. Then, a voxel cell (Voxel cell) is established every 0.05 meters in the x-axis and y-axis directions and every 0.10 meters in the z-axis direction to obtain the voxel grid (Voxel Grid). After obtaining the voxel grid, the PointPillar method is used to determine the BEV feature map of the point cloud.

In the embodiment of the present disclosure, by converting the original point cloud data into BEV data, subsequent feature extraction can be performed on the two-dimensional BEV data, which improves the training efficiency of the point cloud feature extraction network model. Moreover, since the BEV data of point clouds can better maintain the spatial relationship of obstacles, it is easier to use the features of continuous frame point clouds for self-supervised learning.

Step S220: Based on the first feature extraction network model, perform a first encoding on the BEV data of three consecutive frames of sample point clouds to obtain a coded feature map of each frame of sample point clouds.

In some embodiments, the first feature extraction network model is a shared weight encoder. The shared weight encoder includes three encoding modules, each encoding module is used to encode the BEV data (also known as BEV feature map) is encoded. By using a shared weight encoder, the weight of the network can be shared when extracting features from 3-frame point clouds, so that the feature extraction network model can distinguish similarities and differences between different learned point clouds.

In some embodiments, the encoding module of the first feature extraction network model includes: a convolutional neural network and a self-attention network. For example, the convolutional neural network is a two-dimensional convolutional neural network such as ResNet and EfficientNet. By setting up a convolutional neural network, preliminary information of the BEV feature map of the point cloud can be extracted, such as low-level local information on the BEV feature map of the point cloud. Illustratively, the self-attention network is a network such as Transformer. Transformer is a neural network model that utilizes self-attention mechanism. By setting up a self-attention network, it is possible to encode the relationship between each position and other positions in the feature map of the 3-frame sample point cloud output by the convolutional neural network, and extract spatial information with large intervals within the same frame sample point cloud. .

Step S230: Determine the predicted feature map of the third frame sample point cloud based on the encoding feature maps of the sample point clouds of the first two frames.

For example, when the sample point cloud frame sequence includes three sample point cloud frames at t ₀ , t ₁ , and t ₂ , the data of these three frame sample point clouds are encoded through the feature extraction network model to obtain the feature maps F ₀ and F ₁ , _F2 . In step S230, the predicted feature map K ₂ of the sample point cloud at time t ₂ is determined based on the feature maps F ₀ and F ₁ .

In some embodiments, the predicted feature map of the sample point cloud of the third frame is determined according to the process shown in Figure 4.

Step S240: Determine the loss function value based on the predicted feature map of the sample point cloud in the third frame and its encoding feature map.

In some embodiments, the mean square error loss function is used to measure the consistency between the predicted feature map K ₂ and its encoded feature map F ₂ of the sample point cloud in frame 3. In these embodiments, the training objective is to minimize the value of the mean square error loss function.

For example, calculate the value of the mean square error loss function according to the following formula:

Among them, MSE represents the loss function value, x(i,j)∈K ₂ , y(i,j)∈F ₂ , i, j are the position index on the feature map respectively, m×n represents the sample point cloud of the third frame The size of the predicted feature map K ₂ and the encoded feature map F ₂ , ‖x(i,j)-y(i,j)‖ ² represents the Euclidean distance between the predicted feature map K ₂ and its encoded feature map F ₂ squared.

Step S250: Train the first feature extraction network model according to the loss function value.

In step S250, the first feature extraction network model is updated according to the loss function value. In addition, when other network models that need to be updated are involved in the training process, other network models that need to be updated are also updated based on the loss function value.

According to the above processing steps, the first feature extraction network model is updated iteratively multiple times until the training end condition is reached, for example, the training step length reaches a preset value (for example, 2 million steps).

In the embodiment of the present disclosure, the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps. Different from the method of using image optical flow for self-supervision in related technologies, the embodiments of the present disclosure only need to use point cloud data in one modality, and use local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without the need for The calibration between radar and cameras not only reduces the cost of data annotation, but also improves the performance of the trained point cloud feature extraction network model.

Figure 3 is a schematic structural diagram of a first feature extraction network model according to some embodiments of the present disclosure.

As shown in Figure 3, the first feature extraction network model 300 of some embodiments of the present disclosure includes three encoding modules, namely encoding module 310, encoding module 320 and encoding module 330.

In some embodiments, three encoding modules share network weights. In these embodiments, three frames of sample point clouds included in the sample point cloud frame sequence are simultaneously input into three encoding modules. For example, the sample point cloud at time t ₀ is input into the encoding module. The coding module 310 inputs the sample point cloud at time t ₁ into the coding module 320, and inputs the sample point cloud at time t ₂ into the coding module 330.

In some embodiments, each encoding module includes a convolutional neural network and a self-attention network. For example, the convolutional neural network uses the ResNet model, and the self-attention network uses the Transformer model. During specific implementation, the Transformer model can adopt standard structures in related technologies.

In the embodiments of the present disclosure, by adopting the first feature extraction network model with the above structure, more point cloud feature information can be extracted and the point cloud feature extraction capability can be improved.

Figure 4a is a schematic flowchart of the steps of determining a prediction feature map according to some embodiments of the present disclosure. Figure 4a is an exemplary illustration of step S230. As shown in Figure 4a, the steps of determining the prediction feature map in this embodiment of the present disclosure include:

Step S410: Perform a second encoding on the coded feature maps of the sample point clouds of the first two frames to obtain the intermediate feature maps of the sample point clouds of the first two frames.

In some embodiments, in step S410, the intermediate feature map is determined as follows: using the attention encoding module in the second feature extraction network model to encode the encoding feature maps of the sample point clouds of the first two frames to obtain the previous The intermediate feature map of the two frame sample point clouds. For example, the attention encoding module adopts the Transformer model.

By setting up the attention encoding module, the spatial position relationship on the encoding feature maps of different frames can be learned based on the attention mechanism, which in turn helps to more accurately determine the matching relationship between feature points and improve the predicted feature map and real features determined thereby. The consistency of the graph improves the training efficiency of the feature extraction network model.

In other embodiments of the present disclosure, when the sample point cloud frame sequence includes four or more sample point cloud frames, and the next frame of sample point cloud is determined based on the encoding feature maps of the sample point clouds of two adjacent frames. When encoding the feature map, the encoding feature maps of the sample point clouds of the two adjacent frames are encoded as follows: Based on the attention coding module, the encoding feature maps of the sample point clouds of the two adjacent frames are encoded to obtain the two adjacent frames. Intermediate feature map of sample point cloud.

For example, when the sample point cloud frame sequence includes sample point cloud frames at four times t ₀ , t ₁ , t ₂ , and t ₃ , and the next frame of sample point cloud is determined based on the encoding feature maps of the sample point clouds of the two adjacent frames. When encoding the feature map, the sample point clouds of the two adjacent frames include the sample point cloud frames at t ₀ and t ₁ , and the sample point cloud frames at t ₁ and t _2. Based on the attention coding module, t ₀ and t The coding feature maps of the sample point cloud frames at time ₁ and t ₂ are encoded to obtain the intermediate feature maps of the sample point cloud frames at time t ₀ , t ₁ and t ₂ .

Step S420: Determine the feature point matching relationship between the sample point clouds of the first two frames based on the intermediate feature map.

In some embodiments, in step S420, the feature point matching relationship between the sample point clouds of the first two frames is determined as follows: calculating each feature point on the intermediate feature map of the first frame and the intermediate feature map of the second frame. The correlation degree of the feature points within the specified range on the feature map; based on the correlation degree, the feature point matching between the sample point clouds of the first two frames is determined. relation.

Among them, the specified range is the neighborhood range of the feature points of the intermediate feature map of the first frame. For example, for any feature point P ₀ on the intermediate feature map of the first frame, a circular area with the position coordinate of the point as the center and a preset length as the radius will be the neighborhood range of the feature point. For another example, let the neighborhood range be a Gaussian neighborhood. In the embodiment of the present disclosure, by searching for matching feature points within a specified range, a global search for the intermediate feature map of the second frame can be avoided, thereby reducing the amount of calculation.

Among them, correlation can be measured in various ways. In some embodiments, the cosine distance between feature points is used as the correlation between the two.

The feature point matching relationship includes: the corresponding relationship between the feature points on the intermediate feature map of the first frame and their matching feature points on the intermediate feature map of the second frame. In some embodiments, for any feature point P ₀ on the intermediate feature map of the first frame, the feature point with the greatest correlation with P ₀ within the specified range in the intermediate feature map of the second frame is used as P ₀ matching feature points.

Step S430: Fusion of the intermediate feature maps of the sample point clouds of the first two frames according to the feature point matching relationship to obtain a fused feature map.

In some embodiments, according to the feature point matching relationship, the matching feature points between the intermediate feature maps of the sample point clouds of the first two frames are feature spliced, and the spliced feature map is used as the fusion feature map.

For example, for any feature point P ₀ on the intermediate feature map of the sample point cloud in the first frame, splice it with the features of the matching feature point P ₁ on the intermediate feature map of the sample point cloud in the second frame, And the position index of the feature point P ₀ is used as the position index of the spliced feature point, thereby obtaining a fused feature point. By analogy, a fused feature map composed of fused feature points can be obtained.

Step S440: Determine the predicted feature map of the sample point cloud in the third frame based on the fused feature map.

In some embodiments, the attention decoding module in the second feature extraction network model is used to decode the fused feature map, thereby obtaining the predicted feature map of the sample point cloud in the third frame.

In the embodiment of the present disclosure, through the above steps, the predicted feature map of the sample point cloud in the third frame can be determined efficiently and accurately based on the encoded feature maps of the sample point clouds in the first two frames, thereby helping to optimize the point cloud feature extraction network model. The training process improves the performance of the first feature extraction network model obtained by training.

Figure 4b is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure. In the embodiment of the present disclosure, taking the sample point cloud frame sequence as containing three frames of continuous time series point cloud data (specifically, the point cloud data c ₀ , c ₁ , and c ₂ at t ₀ , t ₁ , and t ₂ ), for example, The point cloud feature extraction network model training method is explained.

As shown in Figure 4b, the point cloud feature extraction network model training method includes: Step 1 to Step 7.

Step 1: Convert the point cloud data c ₀ , c ₁ , and c 2 _{at t 0 , t 1} _, _and t ₂ into bird's-eye views at the corresponding moments respectively.

Step 2: Input the bird's-eye view at time t ₀ , t ₁ , and t ₂ into the shared weight encoder to obtain the feature maps F ₀ , F ₁ , and F ₂ at the corresponding time.

Among them, shared weight encoders include: 2D CNN (2-dimensional convolutional neural network), and Transformer encoder (or Transformer network). For example, 2D CNN can use network models such as ResNet and EfficientNet to extract preliminary information from a bird's-eye view. The Transformer network is a self-attention network that is used to extract the encoding of the feature point relationship between each position and other positions in the bird's-eye view, that is, the spatial position relationship of the point cloud in the same frame. For example, for the pixel point corresponding to the vehicle position on the bird's-eye view, the point is encoded by the shared weight encoder to obtain the feature point X.

When using the shared weight encoder to extract the features of the bird's-eye view at t ₀ , t ₁ , and t ₂ , the weights of the network model used are shared. This helps the network learn between different point clouds. similarities and differences.

Step 3: Based on the temporal attention conversion module, encode the feature maps F ₀ and F 1 at t ₀ and t ₁ and calculate the feature point correlation to obtain the intermediate feature map and feature point matching relationship at _{t 0} _and t ₁ .

Among them, the temporal attention conversion module includes: Transformer encoder and correlation calculation module. In step 3, first encode the feature maps F ₀ and F ₁ at time t ₀ and t ₁ based on the Transformer encoder to obtain the intermediate feature map at the corresponding time; then calculate the intermediate feature map at time t ₀ and t ₁ The feature point correlation between V ₀ and V ₁ ; according to the feature point correlation between the intermediate feature maps at t ₀ and t ₁ , determine the feature point matching relationship between the intermediate feature maps at t ₀ and t ₁ .

In some embodiments, for each feature point on the intermediate feature map V ₀ , the correlation between the point and the feature points within the neighborhood of its corresponding position on the intermediate feature map V ₁ is calculated, and the point with the largest correlation is calculated The feature point is used as the matching feature point of this point.

By calculating each feature point on the intermediate feature map V ₀ and its matching feature point on the intermediate feature map V ₁ , the matching relationship between the feature points between the intermediate feature maps at time t ₀ and t ₁ can be obtained.

Step 4: Fusion of the intermediate feature maps V ₀ and V ₁ based on the position transformation coding module, and decoding the fused feature map to obtain the predicted feature map at time t ₂ .

Among them, the position transformation coding module includes: fusion module and Transformer decoder. The fusion module fuses the intermediate feature maps at t ₀ and t ₁ based on the feature point matching relationship between the intermediate feature maps at t ₀ and t ₁ to obtain a fused feature map. The Transformer decoder decodes the fused feature map to obtain the predicted feature map at time t ₂ .

Step 5: Calculate the MSE (root mean square loss) based on the encoding feature map at time t ₂ and the predicted feature map at time t ₂ loss function).

Among them, MSE is used to measure the consistency of the predicted feature map at time t ₂ and the encoding feature map at time t _2. The goal of the entire training is to minimize MSE.

Step 6: Repeat steps 1 to 5 until the model training cutoff condition is reached, for example, the training step length reaches 2 million steps.

Step 7: Output the shared weight encoder.

In the embodiment of the present disclosure, the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps. Different from the method of using image optical flow for self-supervision in the related art, the embodiment of the present disclosure only needs to use point cloud data of one modality, and uses local partial feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without the need for The calibration between radar and cameras not only reduces the cost of data annotation, but also improves the performance of the trained feature extraction model. Furthermore, after the point cloud feature extraction network model is trained through the above steps, the point cloud feature extraction network model can be used as a backbone network for specific visual tasks, such as three-dimensional target detection, point cloud semantic segmentation, etc., thereby improving the target The accuracy of detection results or point cloud semantic segmentation results.

Figure 5 is a schematic flowchart of a point cloud feature extraction method according to some embodiments of the present disclosure. As shown in Figure 5, the point cloud feature extraction method of some embodiments of the present disclosure includes:

Step S510: Obtain the point cloud frame sequence to be processed.

In some embodiments, the point cloud feature extraction method is executed by a point cloud feature extraction device.

In some embodiments, step S510 includes: obtaining the original feature data of the multi-frame point cloud to be processed; converting the original feature data of the multi-frame point cloud to be processed into BEV feature data to obtain a bird's-eye view of the multi-frame point cloud to be processed. A sequence of point cloud frames to be processed composed of graph feature data.

Step S520: Based on the first feature extraction network model obtained by training, encode the point cloud frame sequence to be processed to obtain the encoded feature map of the point cloud frame sequence to be processed.

In the embodiments of the present disclosure, richer point cloud features can be extracted through the above steps. Furthermore, when performing target detection or point cloud semantic segmentation tasks, the accuracy of target detection results or the accuracy of point cloud semantic segmentation results can be improved. sex.

Figure 6 is a schematic structural diagram of a point cloud feature extraction network model training device according to some embodiments of the present disclosure. As shown in Figure 6, the point cloud feature extraction network model training device 600 in some embodiments of the present disclosure includes: a feature extraction module 610, a prediction module 620, a determination module 630, and a training module 640.

The feature extraction module 610 is configured to use the first feature extraction network model to perform first encoding on the sample point cloud frame sequence to obtain a coded feature map of each frame of the sample point cloud in the sample point cloud frame sequence.

The prediction module 620 is configured to determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud based on the encoding feature map of the adjacent multi-frame sample point cloud.

The determination module 630 is configured to determine the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map.

The training module 640 is configured to train the first feature extraction network model according to the loss function value.

In embodiments of the present disclosure, the above device can improve the point cloud feature extraction effect, thereby helping to improve the accuracy of target detection or point cloud semantic segmentation results.

Figure 7 is a schematic structural diagram of a point cloud feature extraction device according to some embodiments of the present disclosure. As shown in Figure 7, the point cloud feature extraction device 700 in some embodiments of the present disclosure includes: an acquisition module 710 and a feature extraction module 720.

The acquisition module 710 is configured to acquire a sequence of point cloud frames to be processed.

In some embodiments, the acquisition module 710 is configured to: acquire the original feature data of the multi-frame point cloud to be processed; convert the original feature data of the multi-frame point cloud to be processed into BEV feature data to obtain the multi-frame point cloud to be processed. A sequence of point cloud frames to be processed composed of bird's-eye view feature data of the cloud.

The feature extraction module 720 is configured to encode the point cloud frame sequence to be processed based on the first feature extraction network model obtained by training, so as to obtain the encoded feature map of the point cloud frame sequence to be processed.

According to some embodiments of the present disclosure, a target detection device is also proposed, configured to extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method of any embodiment of the present disclosure; according to the point cloud frame sequence to be processed feature map for target detection.

According to some embodiments of the present disclosure, a point cloud semantic segmentation device is also proposed, configured to extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method of any embodiment of the present disclosure; according to the point cloud to be processed Feature maps of frame sequences for point cloud semantic segmentation.

Figure 8 is a schematic structural diagram of a point cloud feature extraction network model training device, a point cloud feature extraction device, a target detection device, or a point cloud semantic segmentation device according to some embodiments of the present disclosure.

As shown in FIG. 8 , the point cloud feature extraction network model training device or point cloud feature extraction device or target detection device or point cloud semantic segmentation device 800 includes a memory 810; and a processor 820 coupled to the memory 810. The memory 810 is used to store instructions for executing corresponding embodiments of the point cloud feature extraction network model training method, the point cloud feature extraction method, the target detection method, or the point cloud semantic segmentation method. Processor 820 is configured to store the The instructions in the memory 810 execute the point cloud feature extraction network model training method, point cloud feature extraction method, target detection method, or point cloud semantic segmentation method in any embodiment of the present disclosure.

Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure.

As shown in Figure 9, computer system 900 may be embodied in the form of a general purpose computing device. Computer system 900 includes memory 910, a processor 920, and a bus 930 that connects various system components.

Memory 910 may include, for example, system memory, non-volatile storage media, and the like. System memory stores, for example, operating systems, applications, boot loaders, and other programs. System memory may include volatile storage media such as random access memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one point cloud feature extraction network model training method, point cloud feature extraction method, target detection method, or point cloud semantic segmentation method. Non-volatile storage media include but are not limited to disk storage, optical storage, flash memory, etc.

The processor 920 may be implemented as a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete hardware components such as discrete gates or transistors. accomplish. Correspondingly, each module, such as the feature extraction module, prediction module, etc., can be implemented by a central processing unit (CPU) running instructions in the memory to perform the corresponding steps, or by a dedicated circuit that performs the corresponding steps.

Bus 930 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.

The interfaces 940, 950, 960, the memory 910 and the processor 920 of the computer system 900 may be connected through a bus 930. The input and output interface 940 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard. The network interface 950 provides a connection interface for various networked devices. The storage interface 960 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.

Figure 10 is a schematic structural diagram of an unmanned vehicle according to some embodiments of the present disclosure; Figure 11 is a perspective view of an unmanned vehicle according to some embodiments of the present disclosure. The unmanned vehicle provided by the embodiment of the present disclosure will be described below with reference to FIG. 10 and FIG. 11 .

As shown in Figure 10, the unmanned vehicle includes four parts: a chassis module 1010, an autonomous driving module 1020, a cargo box module 1030, and a remote monitoring flow module 1040.

In some embodiments, the chassis module 1010 mainly includes a battery, a power management device, a chassis controller, a motor driver, and a power motor. The battery provides power for the entire unmanned vehicle system, and the power management device converts the battery output into Change to different level voltages available for each functional module, and control power on and off. The chassis controller receives motion instructions from the autonomous driving module and controls the steering, forward, backward, braking, etc. of the unmanned vehicle.

In some embodiments, the autonomous driving module 1020 includes a core processing unit (Orin or Xavier module), traffic light recognition camera, front, rear, left and right surround cameras, multi-line lidar, positioning module (such as Beidou, GPS, etc.), and inertial navigation unit. The camera and the autonomous driving module can communicate. In order to increase the transmission speed and reduce the wiring harness, GMSL link communication can be used.

In some embodiments, the autonomous driving module 1020 includes the point cloud feature extraction network model training device or point cloud feature extraction device or target detection device or point cloud semantic segmentation device in the above embodiments.

In some embodiments, the remote monitoring streaming module 1030 is composed of a front surveillance camera, a rear surveillance camera, a left surveillance camera, a right surveillance camera, and a streaming module. This module transmits the video data collected by the surveillance cameras to the backend server for use by the backend. Operator checks. The wireless communication module communicates with the backend server through the antenna, allowing the backend operator to remotely control the unmanned vehicle.

The cargo box module 1040 is the cargo carrying device of the unmanned vehicle. In some embodiments, the cargo box module 1040 is also provided with a display interaction module. The display interaction module is used for the unmanned vehicle to interact with the user. The user can perform operations such as picking up, depositing, and purchasing goods through the display interaction module. The type of cargo box can be changed according to actual needs. For example, in a logistics scenario, a cargo box can include multiple sub-boxes of different sizes, and the sub-boxes can be used to load goods for distribution. In a retail scenario, the cargo box can be set up as a transparent box so that users can intuitively see the products for sale.

The unmanned vehicle in the embodiment of the present disclosure can improve the point cloud feature extraction capability, thereby helping to improve the accuracy of the point cloud semantic segmentation results or the accuracy of the target detection results, thereby improving the safety of unmanned driving.

Various aspects of the disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable device to produce a machine, such that execution of the instructions by the processor produces implementations in one or more blocks of the flowcharts and/or block diagrams. A device with specified functions.

Computer-readable program instructions, which may also be stored in computer-readable memory, cause the computer to operate in a specific manner to produce an article of manufacture, including implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams. instructions.

The disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.

Through the point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle in the above embodiments, only one modality of data is used, and no calibration between radar and camera is required. The self-supervised learning of the point cloud feature extraction network model not only reduces the cost of data annotation, but also improves the performance of the trained feature extraction model.

So far, the point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle according to the present disclosure have been described in detail. To avoid obscuring the concepts of the present disclosure, some details that are well known in the art have not been described. Based on the above description, those skilled in the art can completely understand how to implement the technical solution disclosed here.

Claims

A training method for point cloud feature extraction network model, including:

Using a first feature extraction network model, perform a first encoding on the sample point cloud frame sequence to obtain a coded feature map of each frame of sample point cloud in the sample point cloud frame sequence;

According to the coding feature map of the adjacent multi-frame sample point cloud, determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud;

Determine the loss function value according to the predicted feature map and its encoding feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud;

The first feature extraction network model is trained according to the loss function value.
The point cloud feature extraction network model training method according to claim 1, wherein the next frame sample point cloud located after the adjacent multi-frame sample point cloud is determined according to the encoding feature map of the adjacent multi-frame sample point cloud. The predicted feature maps include:

Using a second feature extraction network model, perform second encoding on the coded feature maps of the adjacent multi-frame sample point clouds, respectively, to obtain intermediate feature maps of the adjacent multi-frame sample point clouds;

Fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain a fused feature map;

The fused feature map is decoded to obtain a predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.
The point cloud feature extraction network model training method according to claim 2, wherein said fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain the fused feature map includes:

Determine the feature point matching relationship between the adjacent multi-frame sample point clouds according to the intermediate feature map of the adjacent multi-frame sample point cloud;

According to the feature point matching relationship, the intermediate feature maps of the adjacent multi-frame sample point clouds are fused to obtain a fused feature map.
The point cloud feature extraction network model training method according to any one of claims 1 to 3, wherein the next point cloud located after the adjacent multi-frame sample point cloud is determined according to the coding feature map of the adjacent multi-frame sample point cloud. The predicted feature maps of the frame sample point cloud include:

Determine the feature point matching relationship between the adjacent multi-frame sample point clouds according to the encoded feature maps of the adjacent multi-frame sample point clouds;

According to the matching relationship of the feature points, fuse the coded feature maps of the adjacent multi-frame sample point clouds to obtain a fused feature map;

According to the fused feature map, a predicted feature map of a next frame of sample point cloud following the adjacent multiple frames of sample point cloud is determined.
The point cloud feature extraction network model training method according to claim 3 or 4, wherein the distance between the adjacent multi-frame sample point clouds is determined based on the intermediate feature map of the adjacent multi-frame sample point clouds. Feature point matching relationships include:

According to the intermediate feature map of the adjacent multi-frame sample point cloud, calculate the correlation degree of the feature points between the adjacent multi-frame sample point cloud;

According to the correlation of the feature points between the adjacent multi-frame sample point clouds, the feature point matching relationship between the two adjacent frame sample point clouds is determined.
The point cloud feature extraction network model training method according to claim 5, wherein the adjacent multi-frame sample point clouds are two adjacent frame sample point clouds, and the intermediate feature map of the adjacent multi-frame sample point clouds includes : the first intermediate feature map corresponding to the first frame in the two adjacent frames of sample point clouds, and the second intermediate feature map corresponding to the second frame in the two adjacent frames of sample point clouds; and,

Calculating the correlation of feature points between adjacent multi-frame sample point clouds based on the intermediate feature maps of the adjacent multi-frame sample point clouds includes:

Calculate the correlation between each feature point on the first intermediate feature map and the feature points within a specified range on the second intermediate feature map. The specified range is the neighborhood of the feature points of the first intermediate feature map. scope;

According to the correlation degree, the feature point matching relationship between the sample point clouds of the two adjacent frames is determined.
The point cloud feature extraction network model training method according to any one of claims 3 to 6, wherein the adjacent multi-frame sample point clouds are two adjacent frame sample point clouds;

The step of fusing the intermediate feature maps of the sample point clouds of the two adjacent frames according to the feature point matching relationship to obtain the fused feature map includes:

According to the feature point matching relationship, the matching features between the intermediate feature maps of the sample point clouds of the two adjacent frames are Feature points are spliced together, and the spliced feature map is used as a fused feature map.
The point cloud feature extraction network model training method according to any one of claims 1 to 7, wherein the predicted feature map and its encoding based on the next frame sample point cloud located after the adjacent multi-frame sample point cloud Feature map, determining the loss function value includes:

Between the predicted feature map and the encoded feature map of the next frame sample point cloud, calculate the Euclidean distance between feature points with the same position index;

The loss function value is calculated based on the Euclidean distance between the feature points of all position indexes.
The point cloud feature extraction network model training method according to any one of claims 1 to 8, wherein the first feature extraction network model is a shared weight encoder, and the shared weight encoder includes a plurality of encoding modules, Each encoding module is used to encode one frame in the sample point cloud frame sequence.
The point cloud feature extraction network model training method according to claim 9, wherein the encoding module includes: a convolutional neural network and a self-attention network.
The point cloud feature extraction network model training method according to any one of claims 1 to 10, further comprising:

Convert the original feature data of the multi-frame sample point cloud into two-dimensional image feature data to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multi-frame sample point cloud.
The point cloud feature extraction network model training method according to claim 11, wherein converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data includes:

Convert the original feature data of multi-frame sample point clouds into bird's-eye view BEV feature data.
The point cloud feature extraction network model training method according to any one of claims 1 to 12, wherein:

The sample point cloud frame sequence consists of multiple frames of sample point clouds that are continuous in time series; and/or,

The number of sample point cloud frames included in the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5.
The point cloud feature extraction network model training method according to any one of claims 2 to 13, wherein the second feature extraction network model includes:

An attention coding module, configured to perform second coding on the coding feature maps of the adjacent multi-frame sample point clouds respectively;

An attention decoding module is used to decode the fused feature map to obtain a predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.
A point cloud feature extraction method, including:

Get the point cloud frame sequence to be processed;

Based on the first feature extraction network model trained by the point cloud feature extraction network model training method according to any one of claims 1 to 14, the point cloud frame sequence to be processed is encoded to obtain a feature map of the point cloud frame sequence to be processed.
The point cloud feature extraction method according to claim 15, wherein obtaining the point cloud frame sequence to be processed includes:

Obtain the original feature data of multi-frame point clouds to be processed;

The original feature data of the multi-frame point cloud to be processed is converted into bird's-eye view BEV feature data to obtain a sequence of point cloud frames to be processed consisting of the bird's-eye view feature data of the multi-frame point cloud to be processed.
A target detection method including:

Extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method according to claim 15 or 16;

Target detection is performed based on the feature map of the point cloud frame sequence to be processed.
A point cloud semantic segmentation method, including:

Extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method according to claim 15 or 16;

According to the feature map of the point cloud frame sequence to be processed, point cloud semantic segmentation is performed.
A device including:

A module for executing the point cloud feature extraction network model training method described in any one of claims 1 to 14, or a module for executing the point cloud feature extraction method described in claims 15 or 16, or a module for executing the right A module for the target detection method according to claim 17, or a module for executing the point cloud semantic segmentation method according to claim 18.
An electronic device including:

memory; and

A processor coupled to the memory, the processor being configured to execute the point cloud feature extraction network model training method of any one of claims 1 to 14 based on instructions stored in the memory, or claim 15 Or the point cloud feature extraction method described in claim 16, or the target detection method described in claim 17, or the point cloud semantic segmentation method described in claim 18.
A computer-readable storage medium with computer program instructions stored thereon, which when executed by a processor implements the point cloud feature extraction network model training method described in any one of claims 1 to 14, or claim 15 or 16 The point cloud feature extraction method, or the target detection method according to claim 17, or the point cloud semantic segmentation method according to claim 18.
An unmanned vehicle, including:

A device as claimed in claim 19, or an electronic device as claimed in claim 20.
A computer program consisting of:

Instructions, which when executed by the processor cause the processor to execute the point cloud feature extraction network model training method according to any one of claims 1 to 14, or the point cloud feature extraction according to claims 15 or 16 method, or the target detection method according to claim 17, or the point cloud semantic segmentation method according to claim 18.