WO2024055551A1 - Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle - Google Patents

Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle

Info

Publication number
WO2024055551A1
Authority
WO
WIPO (PCT)
Prior art keywords
point cloud
feature
sample point
frame
feature map
Prior art date
Application number
PCT/CN2023/082809
Other languages
French (fr)
Chinese (zh)
Inventor
Liu Hao (刘浩)
Original Assignee
Beijing Jingdong Qianshi Technology Co., Ltd. (北京京东乾石科技有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Qianshi Technology Co., Ltd. (北京京东乾石科技有限公司)
Publication of WO2024055551A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computing arrangements based on biological models
    • G06N3/02 — Neural networks
    • G06N3/08 — Learning methods
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 — Arrangements for image or video recognition or understanding
    • G06V10/40 — Extraction of image or video features
    • G06V10/70 — Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 — Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 — Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 — Matching configurations of points or features
    • G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 — Fusion of extracted features
    • G06V10/82 — Recognition or understanding using neural networks
    • G06V2201/00 — Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 — Target detection

Definitions

  • the present disclosure relates to the field of computer vision technology, especially to the field of unmanned driving, and in particular to a point cloud feature extraction network model training method, a point cloud feature extraction method, a device, and an unmanned vehicle.
  • unmanned driving equipment is used to automatically transport people or objects from one location to another.
  • Unmanned driving equipment collects environmental information through sensors on the equipment and completes automatic transportation.
  • Logistics and transportation using unmanned delivery vehicles controlled by unmanned driving technology have greatly improved the convenience of production and daily life and reduced labor costs.
  • a technical problem to be solved by this disclosure is to provide a point cloud feature extraction network model training method, a point cloud feature extraction method, a device, and an unmanned vehicle.
  • a point cloud feature extraction network model training method is provided, including: using a first feature extraction network model to perform first encoding on a sample point cloud frame sequence, to obtain a coding feature map of each frame of sample point cloud in the sample point cloud frame sequence; determining, according to the coding feature maps of adjacent multi-frame sample point clouds, a predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds; determining a loss function value according to the predicted feature map and the coding feature map of that next-frame sample point cloud; and training the first feature extraction network model according to the loss function value.
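  • For orientation only, the following PyTorch-style sketch shows one way such a training step could be wired up; the names encoder (the first feature extraction network), predictor (a module mapping two adjacent coding feature maps to a predicted next-frame feature map), and the tensor shapes are assumptions for illustration, not the patent's actual implementation.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, predictor, optimizer, bev_frames):
    """One self-supervised step on a batch of BEV frames, shape [B, T, C, H, W]."""
    num_frames = bev_frames.size(1)
    # First encoding: the shared-weight encoder is applied to every frame.
    coded = [encoder(bev_frames[:, t]) for t in range(num_frames)]

    loss = 0.0
    # Predict each frame's feature map from the two preceding coded feature maps
    # and compare it with that frame's own coded feature map.
    for t in range(2, num_frames):
        predicted = predictor(coded[t - 2], coded[t - 1])
        loss = loss + F.mse_loss(predicted, coded[t])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```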
  • determining the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds includes: using a second feature extraction network model to perform second encoding on the coding feature maps of the adjacent multi-frame sample point clouds respectively, to obtain intermediate feature maps of the adjacent multi-frame sample point clouds; fusing the intermediate feature maps to obtain a fused feature map; and decoding the fused feature map to obtain the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds.
  • fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain the fused feature map includes: determining the feature point matching relationship between the adjacent multi-frame sample point clouds based on their intermediate feature maps; and fusing the intermediate feature maps of the adjacent multi-frame sample point clouds according to the feature point matching relationship, to obtain the fused feature map.
  • determining the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds includes: determining the feature point matching relationship between the adjacent multi-frame sample point clouds based on their coding feature maps; fusing the coding feature maps of the adjacent multi-frame sample point clouds according to the feature point matching relationship, to obtain a fusion feature map; and determining, according to the fusion feature map, the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds.
  • determining the feature point matching relationship between the adjacent multi-frame sample point clouds based on their intermediate feature maps includes: calculating, based on the intermediate feature maps of the adjacent multi-frame sample point clouds, the correlation of the feature points between the adjacent multi-frame sample point clouds; and determining the feature point matching relationship between the adjacent multi-frame sample point clouds according to the correlation of the feature points.
  • the adjacent multi-frame sample point clouds are, for example, two adjacent frames of sample point clouds, and the intermediate feature maps of the adjacent multi-frame sample point clouds include: a first intermediate feature map corresponding to the first frame and a second intermediate feature map corresponding to the second frame of the two adjacent frames of sample point clouds.
  • in this case, calculating the correlation of the feature points between the adjacent multi-frame sample point clouds based on their intermediate feature maps includes: calculating the correlation between each feature point on the first intermediate feature map and the feature points within a specified range on the second intermediate feature map, the specified range being the neighborhood range of the feature point on the first intermediate feature map; and determining the feature point matching relationship between the sample point clouds of the two adjacent frames according to the correlation.
  • when the adjacent multi-frame sample point clouds are two adjacent frames of sample point clouds, fusing the intermediate feature maps of the two adjacent frames of sample point clouds according to the feature point matching relationship to obtain the fused feature map includes: splicing the matching feature points of the intermediate feature maps of the two adjacent frames of sample point clouds according to the feature point matching relationship, and using the spliced feature map as the fused feature map.
  • determining the loss function value based on the predicted feature map and the coding feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds includes: calculating, between the predicted feature map and the coding feature map of the next-frame sample point cloud, the Euclidean distance between feature points with the same position index; and calculating the loss function value based on the Euclidean distances of all position indexes.
  • the first feature extraction network model is a shared weight encoder.
  • the shared weight encoder includes multiple encoding modules, and each encoding module is used to encode one frame in the sample point cloud frame sequence.
  • the encoding module includes: a convolutional neural network and a self-attention network.
  • the method further includes: converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multi-frame sample point cloud.
  • converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data includes: converting the original feature data of the multi-frame sample point cloud into bird's-eye view BEV feature data.
  • the sample point cloud frame sequence consists of multiple frames of sample point clouds that are continuous in time series; and/or the number of frames of sample point clouds included in the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5.
  • the second feature extraction network model includes: an attention encoding module, used to perform second encoding on the coding feature maps of the adjacent multi-frame sample point clouds; and an attention decoding module, used to decode the fused feature map to obtain the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds.
  • a point cloud feature extraction method is provided, including: obtaining a point cloud frame sequence to be processed; and encoding the point cloud frame sequence to be processed based on the first feature extraction network model trained with the above feature extraction network model training method, to obtain a feature map of the point cloud frame sequence to be processed.
  • obtaining the point cloud frame sequence to be processed includes: obtaining original data of multiple frames of point clouds to be processed; and converting the original feature data of the multiple frames of point clouds to be processed into bird's-eye view (BEV) feature data, to obtain a point cloud frame sequence to be processed composed of the bird's-eye view feature data of the multiple frames of point clouds to be processed.
  • a target detection method is proposed, in which the feature map of the point cloud frame sequence to be processed is extracted according to the aforementioned point cloud feature extraction method, and target detection is performed based on the feature map of the point cloud frame sequence to be processed.
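  • Purely as an illustration of consuming the extracted feature map downstream (the patent does not specify the structure of a detection head), a minimal head might look like the sketch below; the class count and box parameterization are hypothetical.

```python
import torch
import torch.nn as nn

class SimpleDetectionHead(nn.Module):
    """Toy detection head: per-location class scores and box parameters
    predicted from a BEV feature map (structure is illustrative only)."""
    def __init__(self, in_channels=128, num_classes=3, box_params=7):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_classes, 1)   # class logits per cell
        self.box = nn.Conv2d(in_channels, box_params, 1)    # e.g. x, y, z, w, l, h, yaw

    def forward(self, feature_map):                         # [B, C, H, W]
        return self.cls(feature_map), self.box(feature_map)
```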
  • a point cloud semantic segmentation method is proposed, including: extracting a feature map of the point cloud frame sequence to be processed according to the aforementioned point cloud feature extraction method; and performing point cloud semantic segmentation based on the feature map of the point cloud frame sequence to be processed.
  • a device including: a module for performing the point cloud feature extraction network model training method as described above, or a module for performing the point cloud feature extraction method as described above, Or, a module for performing the above-mentioned target detection method, or a module for performing the above-mentioned point cloud semantic segmentation method.
  • an electronic device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the above point cloud feature extraction network model training method, the above point cloud feature extraction method, the above target detection method, or the above point cloud semantic segmentation method.
  • a computer-readable storage medium is provided, on which computer program instructions are stored; when the instructions are executed by a processor, the above point cloud feature extraction network model training method, the above point cloud feature extraction method, the above target detection method, or the above point cloud semantic segmentation method is implemented.
  • an unmanned vehicle including the above device or electronic equipment.
  • Figure 1 is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure
  • Figure 2 is a schematic flowchart of a point cloud feature extraction network model training method according to other embodiments of the present disclosure
  • Figure 3 is a schematic structural diagram of a first feature extraction network model according to some embodiments of the present disclosure.
  • Figure 4a is a schematic flowchart of the steps of determining a prediction feature map according to some embodiments of the present disclosure
  • Figure 4b is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure
  • Figure 5 is a schematic flowchart of a point cloud feature extraction method according to some embodiments of the present disclosure
  • Figure 6 is a schematic structural diagram of a point cloud feature extraction network model training device according to some embodiments of the present disclosure.
  • Figure 7 is a schematic structural diagram of a point cloud feature extraction device according to some embodiments of the present disclosure.
  • Figure 8 is a schematic structural diagram of a point cloud feature extraction network model training device or a point cloud feature extraction device or a target detection device or a point cloud semantic segmentation device according to some embodiments of the present disclosure
  • Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure.
  • Figure 10 is a schematic structural diagram of an autonomous vehicle according to some embodiments of the present disclosure.
  • Figure 11 is a schematic three-dimensional structural diagram of an autonomous vehicle according to some embodiments of the present disclosure.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • a method for learning point cloud features based on self-supervision is proposed.
  • in this related method, multiple frames of consecutive point clouds are projected onto the corresponding RGB images respectively, the optical flow method is then used to find moving objects on the RGB images and to obtain the point clouds corresponding to the moving objects, and the point cloud features are learned from them.
  • This method has the following shortcomings: 1. the calibration requirements for the lidar and the cameras are very high; 2. at the edges of objects there is a high probability that points cannot be correctly projected; in addition, because the projection of the point cloud onto the RGB image is a viewing cone, some point clouds may overlap after being projected onto the RGB image, thus affecting model performance.
  • the present disclosure proposes a point cloud feature extraction network model training method, a point cloud feature extraction method, a device, and an unmanned vehicle, which only need to use data of one modality and do not require calibration between the radar and cameras.
  • This can realize self-supervised learning of the point cloud feature extraction network model, which not only reduces the cost of data annotation, but also improves the performance of the trained point cloud feature extraction network model.
  • Figure 1 is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure. As shown in Figure 1, the point cloud feature extraction network model training method of some embodiments of the present disclosure includes:
  • Step S110 Use the first feature extraction network model to perform first encoding on the sample point cloud frame sequence to obtain the encoded feature map of each frame of the sample point cloud in the sample point cloud frame sequence.
  • the point cloud feature extraction network model training method is executed by a point cloud feature extraction network model training device.
  • the sample point cloud frame sequence consists of multiple frames of sample point clouds that are sequential in time series.
  • the sample point cloud frame sequence consists of 3, 4, 5, 6, or other frame number sample point clouds that are consecutive in time series.
  • the number of frames of the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5.
  • in this way, the problem that objects at the edge of the Region of Interest (RoI) are difficult to match, which is caused by overly long frame sequences, can be alleviated.
  • the object here is a broad concept: it can be any target in the scene, such as trees or buildings, and is not limited to the types that specifically need to be identified in autonomous driving (such as cars, pedestrians, and bicycles).
  • the network can learn the low-level information of various objects in the autonomous driving scene, such as shape, size, etc., so that the learned network has broader feature extraction capabilities.
  • each sample point cloud in the sample point cloud frame sequence is original point cloud feature data collected by lidar.
  • the original point cloud feature data includes the three-dimensional position coordinates of each point cloud point and the reflection intensity.
  • each sample point cloud in the sample point cloud frame sequence is two-dimensional image feature data obtained by processing the original point cloud feature data.
  • the point cloud feature extraction network model training method also includes: converting the original feature data of multiple frames of sample point clouds into two-dimensional image feature data, to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multiple frames of sample point clouds. For example, the original feature data of the multi-frame sample point clouds is converted into Bird's Eye View (BEV) feature data.
  • the first feature extraction network model is a shared weight encoder.
  • the shared weight encoder includes a plurality of encoding modules, each encoding module is used to encode one frame in the sample point cloud frame sequence.
  • for example, the sample point cloud frame sequence includes BEV data of 4 frames of sample point clouds at times t0, t1, t2, and t3, and the BEV data of these 4 frames of sample point clouds is simultaneously input into the first feature extraction network model.
  • the first feature extraction network model includes four encoding modules (encoding modules 1 to 4): the BEV data of the sample point cloud at time t0 is input into encoding module 1, the BEV data at time t1 into encoding module 2, the BEV data at time t2 into encoding module 3, and the BEV data at time t3 into encoding module 4.
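  • To make the weight sharing concrete, here is a minimal sketch in which a single module plays the role of all four encoding modules, so the same parameters process the BEV data of every frame; the layer sizes and channel counts are placeholders rather than the claimed architecture.

```python
import torch
import torch.nn as nn

class SharedWeightEncoder(nn.Module):
    """A single encoder reused for every frame: 'encoding modules 1 to 4'
    are the same set of weights applied four times."""
    def __init__(self, in_channels=64, out_channels=128):
        super().__init__()
        self.backbone = nn.Sequential(            # placeholder 2D CNN
            nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, bev_frame):                 # [B, C, H, W] -> [B, C', H, W]
        return self.backbone(bev_frame)

encoder = SharedWeightEncoder()
bev_t0, bev_t1, bev_t2, bev_t3 = (torch.randn(1, 64, 128, 128) for _ in range(4))
coded_maps = [encoder(f) for f in (bev_t0, bev_t1, bev_t2, bev_t3)]  # same weights each time
```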
  • Step S120 Determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud based on the encoding feature map of the adjacent multi-frame sample point cloud.
  • for example, the predicted feature map of the next-frame sample point cloud located after two adjacent frames of sample point clouds is determined based on the coding feature maps of those two adjacent frames.
  • for example, the sample point cloud frame sequence consists of three consecutive frames of sample point clouds, namely the sample point cloud frames at times t0, t1, and t2.
  • in this case, the predicted feature map of the sample point cloud frame at time t2 is determined based on the coding feature maps of the sample point cloud frames at times t0 and t1.
  • as another example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, namely the sample point cloud frames at times t0, t1, t2, and t3.
  • in this case, the predicted feature map of the sample point cloud frame at time t2 is determined based on the coding feature maps of the sample point cloud frames at times t0 and t1, and the predicted feature map of the sample point cloud frame at time t3 is determined based on the coding feature maps of the sample point cloud frames at times t1 and t2.
  • alternatively, the predicted feature map of the next-frame sample point cloud located after three or more adjacent frames of sample point clouds is determined based on the coding feature maps of those three or more adjacent frames.
  • for example, when the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames at times t0, t1, t2, and t3, the predicted feature map of the sample point cloud frame at time t3 is determined based on the coding feature maps of the sample point cloud frames at times t0, t1, and t2.
  • step S120 includes: using a second feature extraction network model to perform second encoding on the coding feature maps of adjacent multi-frame sample point clouds respectively, to obtain intermediate feature maps of the adjacent multi-frame sample point clouds; fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain a fused feature map; and decoding the fused feature map to obtain the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds.
  • the second feature extraction network model includes an attention encoding module and an attention decoding module.
  • the attention coding module is used to perform second coding on the coding feature maps of adjacent multi-frame sample point clouds;
  • the attention decoding module is used to decode the fused feature map to obtain the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.
  • the intermediate feature maps of adjacent multi-frame sample point clouds are fused according to the following method: based on the intermediate feature maps of the adjacent multi-frame sample point clouds, the feature point matching relationship between the adjacent multi-frame sample point clouds is determined; and according to the feature point matching relationship, the intermediate feature maps of the adjacent multi-frame sample point clouds are fused to obtain a fused feature map.
  • alternatively, step S120 includes: determining the feature point matching relationship between adjacent multi-frame sample point clouds based on their coding feature maps; fusing the coding feature maps of the adjacent multi-frame sample point clouds according to the feature point matching relationship, to obtain a fusion feature map; and determining, based on the fusion feature map, the predicted feature map of the next-frame sample point cloud located after the adjacent multi-frame sample point clouds.
  • Step S130 Determine the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map.
  • the predicted feature map of a frame of sample point cloud is determined through step S120.
  • the loss function value is determined based on the predicted feature map of the sample point cloud of this frame and its encoding feature map.
  • the sample point cloud frame sequence consists of three consecutive frames of sample point clouds, which are the sample point cloud frames at time t 0 , t 1 , and t 2 respectively.
  • the loss function value is determined based on the encoding feature map of the sample point cloud frame at time t 2 and the prediction feature map of the sample point cloud frame at time t 2 .
  • the predicted feature map of the multi-frame sample point cloud is determined through step S120.
  • the loss function value is determined based on the predicted feature map of the multi-frame sample point cloud and its encoding feature map.
  • the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at t 0 , t 1 , t 2 , and t 3 respectively.
  • after the predicted feature map of the sample point cloud frame at time t2 is determined based on the coding feature maps of the sample point cloud frames at times t0 and t1, and the predicted feature map of the sample point cloud frame at time t3 is determined based on the coding feature maps of the sample point cloud frames at times t1 and t2, the loss function value is determined based on the coding feature map and predicted feature map of the sample point cloud frame at time t2 together with the coding feature map and predicted feature map of the sample point cloud frame at time t3.
  • the value of the loss function is calculated as follows: for the next-frame sample point cloud located after the adjacent multi-frame sample point clouds, the Euclidean distance between feature points with the same position index in its predicted feature map and its coding feature map is calculated, and the loss function value is then calculated based on the Euclidean distances of all position indexes.
  • a mean squared error (MSE) loss function is used to measure the consistency between the predicted feature map and the encoded feature map.
  • the goal of model training is to minimize the mean square error loss function value.
  • the loss function value is calculated according to the following formula:

    $$\mathrm{MSE} = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} \lVert x(i,j) - y(i,j) \rVert^{2}$$

  • where MSE represents the loss function value; x(i,j) ∈ K2 and y(i,j) ∈ F2, with i and j being the position indexes on the feature maps; m × n represents the size of the predicted feature map K2 and of the encoding feature map F2; and ‖x(i,j) − y(i,j)‖² represents the square of the Euclidean distance between the predicted feature map K2 and its encoding feature map F2 at position (i, j).
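  • Read literally, the formula is the per-position mean of squared Euclidean distances, which a direct NumPy implementation might compute as follows; K2 and F2 are assumed here to be arrays of shape (m, n, d) with the feature vector in the last axis.

```python
import numpy as np

def mse_loss_value(K2, F2):
    """MSE = 1/(m*n) * sum over (i, j) of ||x(i, j) - y(i, j)||^2, with x(i, j)
    taken from the predicted map K2 and y(i, j) from the coding map F2."""
    assert K2.shape == F2.shape                   # both (m, n, d)
    sq_dist = np.sum((K2 - F2) ** 2, axis=-1)     # ||x(i, j) - y(i, j)||^2, shape (m, n)
    return sq_dist.mean()                         # average over all m * n positions
```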
  • when predicted feature maps are determined for multiple frames of sample point clouds in step S120, the loss function value of each frame can first be calculated from its predicted feature map and its coding feature map using the above formula, and the total loss function value is then determined based on the loss function values of the multiple frames of sample point clouds.
  • for example, a first loss function value is determined based on the coding feature map and the predicted feature map of the sample point cloud frame at time t2, a second loss function value is determined based on the coding feature map and the predicted feature map of the sample point cloud frame at time t3, and the total loss function value is then determined based on the first loss function value and the second loss function value.
  • Step S140 Train the first feature extraction network model according to the loss function value.
  • in step S140, the first feature extraction network model is updated according to the loss function value. Steps S110 to S140 are repeated until a training end condition is reached, for example, until the number of training steps reaches a preset value (for example, 2 million steps).
  • the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps.
  • the embodiments of the present disclosure only need to use point cloud data of one modality, and use local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without requiring calibration between the radar and cameras; this not only reduces the cost of data annotation, but also improves the performance of the trained feature extraction model.
  • the feature extraction network model can be used as a backbone network for specific visual tasks, such as three-dimensional target detection and point cloud semantic segmentation, thereby improving the accuracy of target detection results or point cloud semantic segmentation results.
  • FIG. 2 is a schematic flowchart of a point cloud feature extraction network model training method according to other embodiments of the present disclosure.
  • point cloud feature extraction network model training methods in other embodiments of the present disclosure include:
  • Step S210 Convert the original data of three consecutive frames of sample point clouds into BEV data.
  • the sample point cloud frame sequence is composed of three consecutive time-series point clouds as an example for explanation.
  • the sample point cloud frame sequence can be expanded into a time series sequence of more frames.
  • step S210 includes: voxelizing the original data of three consecutive frames of sample point clouds to obtain the corresponding voxelized feature data, and then converting the voxelized feature data into a two-dimensional image from the bird's-eye view (BEV) perspective, that is, BEV data.
  • there are many methods to convert the original point cloud data into BEV data such as using the method of generating pseudo images in the PointPillars method, or downsampling along the Z-axis direction.
  • the PointPillars method is a voxel-based three-dimensional target detection algorithm. Its main idea is to convert the three-dimensional point cloud into a two-dimensional pseudo image so that target detection can be performed using two-dimensional target detection.
  • for example, from the original point cloud data collected by the radar, the point cloud data within the ranges x ∈ [-30, 30], y ∈ [-15, 15], and z ∈ [-1.8, 0.8] is taken out.
  • a voxel cell (Voxel cell) is established every 0.05 meters in the x-axis and y-axis directions and every 0.10 meters in the z-axis direction to obtain the voxel grid (Voxel Grid).
  • the PointPillar method is used to determine the BEV feature map of the point cloud.
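  • Plugging in the ranges and cell sizes quoted above, the voxel grid dimensions work out as in the sketch below; only the ranges and cell sizes come from the example, the helper itself is illustrative.

```python
def voxel_grid_shape(x_range=(-30.0, 30.0), y_range=(-15.0, 15.0),
                     z_range=(-1.8, 0.8), xy_cell=0.05, z_cell=0.10):
    """Number of voxel cells along each axis for the example ranges above."""
    nx = round((x_range[1] - x_range[0]) / xy_cell)   # (30 - (-30)) / 0.05 = 1200
    ny = round((y_range[1] - y_range[0]) / xy_cell)   # (15 - (-15)) / 0.05 = 600
    nz = round((z_range[1] - z_range[0]) / z_cell)    # (0.8 - (-1.8)) / 0.10 = 26
    return nx, ny, nz

print(voxel_grid_shape())   # (1200, 600, 26): a 1200 x 600 BEV grid once z is collapsed
```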
  • Step S220 Based on the first feature extraction network model, perform a first encoding on the BEV data of three consecutive frames of sample point clouds to obtain a coded feature map of each frame of sample point clouds.
  • the first feature extraction network model is a shared weight encoder.
  • the shared weight encoder includes three encoding modules, and each encoding module is used to encode the BEV data (also called the BEV feature map) of one frame of sample point cloud.
  • in this way, the network weights are shared when extracting features from the 3 frames of point clouds, so that the feature extraction network model can learn to distinguish the similarities and differences between different point clouds.
  • the encoding module of the first feature extraction network model includes: a convolutional neural network and a self-attention network.
  • the convolutional neural network is a two-dimensional convolutional neural network such as ResNet or EfficientNet.
  • the self-attention network is, for example, a Transformer; the Transformer is a neural network model that utilizes the self-attention mechanism.
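  • As an illustration of such an encoding module (a 2D CNN followed by self-attention), the sketch below stacks a small placeholder CNN and a standard Transformer encoder over the flattened spatial positions; the exact backbone (ResNet, EfficientNet, ...) and layer sizes are not prescribed by the text and are chosen arbitrarily here.

```python
import torch
import torch.nn as nn

class EncodingModule(nn.Module):
    """2D CNN followed by self-attention over spatial positions (illustrative sizes)."""
    def __init__(self, in_channels=64, dim=128, num_heads=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, bev):                        # [B, C, H, W]
        x = self.cnn(bev)                          # [B, D, H', W']
        b, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # [B, H'*W', D] sequence of positions
        tokens = self.attn(tokens)                 # self-attention across positions
        return tokens.transpose(1, 2).reshape(b, d, h, w)
```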
  • Step S230 Determine the predicted feature map of the third frame sample point cloud based on the encoding feature maps of the sample point clouds of the first two frames.
  • the sample point cloud frame sequence includes three sample point cloud frames at times t0, t1, and t2.
  • the data of these three frames of sample point clouds are encoded through the feature extraction network model to obtain the feature maps F0, F1, and F2.
  • the predicted feature map K2 of the sample point cloud at time t2 is determined based on the feature maps F0 and F1.
  • the predicted feature map of the sample point cloud of the third frame is determined according to the process shown in Figure 4a.
  • Step S240 Determine the loss function value based on the predicted feature map of the sample point cloud in the third frame and its encoding feature map.
  • the mean square error loss function is used to measure the consistency between the predicted feature map K 2 and its encoded feature map F 2 of the sample point cloud in frame 3.
  • the training objective is to minimize the value of the mean square error loss function.
  • the loss function value is calculated according to the same formula:

    $$\mathrm{MSE} = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} \lVert x(i,j) - y(i,j) \rVert^{2}$$

  • where MSE represents the loss function value; x(i,j) ∈ K2 and y(i,j) ∈ F2, with i and j being the position indexes on the feature maps; m × n represents the size of the feature maps of the third-frame sample point cloud; and ‖x(i,j) − y(i,j)‖² represents the square of the Euclidean distance between the predicted feature map K2 and its encoded feature map F2 at position (i, j).
  • Step S250 Train the first feature extraction network model according to the loss function value.
  • step S250 the first feature extraction network model is updated according to the loss function value.
  • other network models that need to be updated are also updated based on the loss function value.
  • the first feature extraction network model is updated iteratively multiple times until the training end condition is reached, for example, until the number of training steps reaches a preset value (for example, 2 million steps).
  • the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps.
  • the embodiments of the present disclosure only need to use point cloud data of one modality, and use local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without requiring calibration between the radar and cameras; this not only reduces the cost of data annotation, but also improves the performance of the trained point cloud feature extraction network model.
  • Figure 3 is a schematic structural diagram of a first feature extraction network model according to some embodiments of the present disclosure.
  • the first feature extraction network model 300 of some embodiments of the present disclosure includes three encoding modules, namely encoding module 310, encoding module 320 and encoding module 330.
  • three encoding modules share network weights.
  • three frames of sample point clouds included in the sample point cloud frame sequence are simultaneously input into three encoding modules.
  • the sample point cloud at time t0 is input into the encoding module 310, the sample point cloud at time t1 into the encoding module 320, and the sample point cloud at time t2 into the encoding module 330.
  • each encoding module includes a convolutional neural network and a self-attention network.
  • the convolutional neural network uses the ResNet model
  • the self-attention network uses the Transformer model.
  • the Transformer model can adopt standard structures in related technologies.
  • by adopting the first feature extraction network model with the above structure, more point cloud feature information can be extracted and the point cloud feature extraction capability can be improved.
  • Figure 4a is a schematic flowchart of the steps of determining a prediction feature map according to some embodiments of the present disclosure.
  • Figure 4a is an exemplary illustration of step S230.
  • the steps of determining the prediction feature map in this embodiment of the present disclosure include:
  • Step S410 Perform a second encoding on the coded feature maps of the sample point clouds of the first two frames to obtain the intermediate feature maps of the sample point clouds of the first two frames.
  • the intermediate feature map is determined as follows: using the attention encoding module in the second feature extraction network model to encode the encoding feature maps of the sample point clouds of the first two frames to obtain the previous The intermediate feature map of the two frame sample point clouds.
  • the attention encoding module adopts the Transformer model.
  • in this way, the spatial position relationships across the encoding feature maps of different frames can be learned based on the attention mechanism, which helps to more accurately determine the matching relationship between feature points, improves the consistency between the predicted feature map determined thereby and the real feature map, and thus improves the training efficiency of the feature extraction network model.
  • when the sample point cloud frame sequence includes four or more sample point cloud frames, the predicted feature map of the next frame of sample point cloud is determined based on the encoding feature maps of every two adjacent frames of sample point clouds.
  • in this case, the encoding feature maps of the two adjacent frames of sample point clouds are encoded as follows: based on the attention encoding module, the encoding feature maps of the two adjacent frames of sample point clouds are encoded to obtain the intermediate feature maps of the two adjacent frames of sample point clouds.
  • for example, the sample point cloud frame sequence includes sample point cloud frames at the four times t0, t1, t2, and t3, and the predicted feature map of the next frame of sample point cloud is determined based on the encoding feature maps of two adjacent frames; the two adjacent frames include the sample point cloud frames at times t0 and t1, and the sample point cloud frames at times t1 and t2. The coding feature maps of the sample point cloud frames at times t0, t1, and t2 are encoded to obtain the intermediate feature maps of the sample point cloud frames at times t0, t1, and t2.
  • Step S420 Determine the feature point matching relationship between the sample point clouds of the first two frames based on the intermediate feature map.
  • the feature point matching relationship between the sample point clouds of the first two frames is determined as follows: for each feature point on the intermediate feature map of the first frame, the correlation with the feature points within a specified range on the intermediate feature map of the second frame is calculated; based on the correlation, the feature point matching relationship between the sample point clouds of the first two frames is determined.
  • the specified range is the neighborhood range of the feature points of the intermediate feature map of the first frame.
  • for example, a circular area centered on the position coordinate of the feature point, with a preset length as the radius, is used as the neighborhood range of the feature point.
  • the neighborhood range may also be a Gaussian neighborhood.
  • correlation can be measured in various ways.
  • the cosine distance between feature points is used as the correlation between the two.
  • the feature point matching relationship includes: the corresponding relationship between the feature points on the intermediate feature map of the first frame and their matching feature points on the intermediate feature map of the second frame.
  • for a feature point P0 on the intermediate feature map of the first frame, the feature point with the greatest correlation with P0 within the specified range on the intermediate feature map of the second frame is used as the matching feature point of P0.
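  • One possible realization of this matching step is sketched below: for every feature point of the first intermediate feature map, cosine similarity is computed against the feature points inside a small window around the same position on the second map, and the most similar position is taken as the match. The square window stands in for the circular or Gaussian neighborhood described above, the plain Python loops favor clarity over speed, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def match_features(V0, V1, radius=2):
    """For each position (i, j) of V0, find the most correlated position of V1 inside
    a (2*radius+1)^2 neighborhood. V0, V1: [D, H, W]; returns long tensor [H, W, 2]."""
    d, h, w = V0.shape
    V0n = F.normalize(V0, dim=0)                   # unit-norm feature vectors
    V1n = F.normalize(V1, dim=0)
    matches = torch.zeros(h, w, 2, dtype=torch.long)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            window = V1n[:, i0:i1, j0:j1]                       # [D, hh, ww]
            corr = (window * V0n[:, i, j, None, None]).sum(0)   # cosine similarities
            k = torch.argmax(corr)                              # flat index of best match
            matches[i, j, 0] = i0 + k // corr.shape[1]
            matches[i, j, 1] = j0 + k % corr.shape[1]
    return matches
```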
  • Step S430 Fusion of the intermediate feature maps of the sample point clouds of the first two frames according to the feature point matching relationship to obtain a fused feature map.
  • the matching feature points between the intermediate feature maps of the sample point clouds of the first two frames are feature spliced, and the spliced feature map is used as the fusion feature map.
  • for any feature point P0 on the intermediate feature map of the first-frame sample point cloud, its features are spliced with the features of its matching feature point P1 on the intermediate feature map of the second-frame sample point cloud, and the position index of the feature point P0 is used as the position index of the spliced feature point, thereby obtaining a fused feature point.
  • a fused feature map composed of fused feature points can be obtained.
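  • Given such a matching relationship, the splicing described above could be implemented roughly as follows, reusing the hypothetical matches layout from the previous sketch (for each position of the first map, the row and column of its matched feature point on the second map).

```python
import torch

def fuse_by_matching(V0, V1, matches):
    """Concatenate each feature of V0 with its matched feature from V1.
    V0, V1: [D, H, W]; matches: long tensor [H, W, 2] -> fused map [2*D, H, W]."""
    d, h, w = V0.shape
    rows = matches[..., 0].reshape(-1)             # matched row index per position of V0
    cols = matches[..., 1].reshape(-1)             # matched column index per position of V0
    gathered = V1[:, rows, cols].reshape(d, h, w)  # matched features taken from V1
    return torch.cat([V0, gathered], dim=0)        # splice along the channel axis
```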
  • Step S440 Determine the predicted feature map of the sample point cloud in the third frame based on the fused feature map.
  • the attention decoding module in the second feature extraction network model is used to decode the fused feature map, thereby obtaining the predicted feature map of the sample point cloud in the third frame.
  • in this way, the predicted feature map of the sample point cloud in the third frame can be determined efficiently and accurately based on the encoded feature maps of the sample point clouds in the first two frames, which helps to optimize the training process of the point cloud feature extraction network model and improves the performance of the first feature extraction network model obtained by training.
  • Figure 4b is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure.
  • taking a sample point cloud frame sequence containing three frames of continuous time-series point cloud data (specifically, the point cloud data c0, c1, and c2 at times t0, t1, and t2) as an example, the point cloud feature extraction network model training method is explained below.
  • the point cloud feature extraction network model training method includes: Step 1 to Step 7.
  • Step 1 Convert the point cloud data c 0 , c 1 , and c 2 at t 0 , t 1 , and t 2 into bird's-eye views at the corresponding moments respectively.
  • Step 2 Input the bird's-eye view at time t 0 , t 1 , and t 2 into the shared weight encoder to obtain the feature maps F 0 , F 1 , and F 2 at the corresponding time.
  • shared weight encoders include: 2D CNN (2-dimensional convolutional neural network), and Transformer encoder (or Transformer network).
  • 2D CNN can use network models such as ResNet and EfficientNet to extract preliminary information from a bird's-eye view.
  • the Transformer network is a self-attention network that is used to extract the encoding of the feature point relationship between each position and other positions in the bird's-eye view, that is, the spatial position relationship of the point cloud in the same frame. For example, for the pixel point corresponding to the vehicle position on the bird's-eye view, the point is encoded by the shared weight encoder to obtain the feature point X.
  • when the three frames are encoded, the weights of the network model used are shared, which helps the network learn the similarities and differences between different point clouds.
  • Step 3 Based on the temporal attention conversion module, encode the feature maps F 0 and F 1 at t 0 and t 1 and calculate the feature point correlation to obtain the intermediate feature map and feature point matching relationship at t 0 and t 1 .
  • the temporal attention conversion module includes: Transformer encoder and correlation calculation module.
  • in step 3, the feature maps F0 and F1 at times t0 and t1 are first encoded based on the Transformer encoder to obtain the intermediate feature maps V0 and V1 at the corresponding times; the feature point correlation between the intermediate feature maps V0 and V1 at times t0 and t1 is then calculated; and the feature point matching relationship between the intermediate feature maps at times t0 and t1 is determined according to that feature point correlation.
  • for example, for each feature point on the intermediate feature map V0, the correlation between that point and the feature points within the neighborhood of its corresponding position on the intermediate feature map V1 is calculated, and the feature point with the largest correlation is used as the matching feature point of that point.
  • Step 4 Fusion of the intermediate feature maps V 0 and V 1 based on the position transformation coding module, and decoding the fused feature map to obtain the predicted feature map at time t 2 .
  • the position transformation coding module includes: fusion module and Transformer decoder.
  • the fusion module fuses the intermediate feature maps at t 0 and t 1 based on the feature point matching relationship between the intermediate feature maps at t 0 and t 1 to obtain a fused feature map.
  • the Transformer decoder decodes the fused feature map to obtain the predicted feature map at time t 2 .
  • Step 5 Calculate the MSE (mean square error) loss function value based on the encoding feature map at time t2 and the predicted feature map at time t2.
  • MSE is used to measure the consistency of the predicted feature map at time t 2 and the encoding feature map at time t 2.
  • the goal of the entire training is to minimize MSE.
  • Step 6 Repeat steps 1 to 5 until the model training cutoff condition is reached, for example, the number of training steps reaches 2 million.
  • Step 7 Output the shared weight encoder.
  • the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps.
  • the embodiment of the present disclosure only needs to use point cloud data of one modality, and uses local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without requiring calibration between the radar and cameras; this not only reduces the cost of data annotation, but also improves the performance of the trained feature extraction model.
  • the point cloud feature extraction network model can be used as a backbone network for specific visual tasks, such as three-dimensional target detection and point cloud semantic segmentation, thereby improving the accuracy of target detection results or point cloud semantic segmentation results.
  • Figure 5 is a schematic flowchart of a point cloud feature extraction method according to some embodiments of the present disclosure. As shown in Figure 5, the point cloud feature extraction method of some embodiments of the present disclosure includes:
  • Step S510 Obtain the point cloud frame sequence to be processed.
  • the point cloud feature extraction method is executed by a point cloud feature extraction device.
  • step S510 includes: obtaining the original feature data of multiple frames of point clouds to be processed; and converting the original feature data of the multiple frames of point clouds to be processed into BEV feature data, to obtain a point cloud frame sequence to be processed composed of the bird's-eye view feature data of the multiple frames of point clouds to be processed.
  • Step S520 Based on the first feature extraction network model obtained by training, encode the point cloud frame sequence to be processed to obtain the encoded feature map of the point cloud frame sequence to be processed.
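  • At inference time, the trained first feature extraction network can then be applied to new BEV frames as sketched below; the checkpoint path is a placeholder, the random tensor stands in for real pre-processed LiDAR data, and SharedWeightEncoder refers to the hypothetical class sketched earlier.

```python
import torch

encoder = SharedWeightEncoder()                                    # hypothetical trained model
encoder.load_state_dict(torch.load("first_feature_extractor.pt"))  # placeholder checkpoint
encoder.eval()

bev_sequence = torch.randn(4, 64, 128, 128)        # [T, C, H, W] stand-in BEV frames
with torch.no_grad():
    feature_maps = [encoder(frame.unsqueeze(0)) for frame in bev_sequence]
```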
  • Figure 6 is a schematic structural diagram of a point cloud feature extraction network model training device according to some embodiments of the present disclosure.
  • the point cloud feature extraction network model training device 600 in some embodiments of the present disclosure includes: a feature extraction module 610, a prediction module 620, a determination module 630, and a training module 640.
  • the feature extraction module 610 is configured to use the first feature extraction network model to perform first encoding on the sample point cloud frame sequence to obtain a coded feature map of each frame of the sample point cloud in the sample point cloud frame sequence.
  • the prediction module 620 is configured to determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud based on the encoding feature map of the adjacent multi-frame sample point cloud.
  • the determination module 630 is configured to determine the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map.
  • the training module 640 is configured to train the first feature extraction network model according to the loss function value.
  • the above device can improve the point cloud feature extraction effect, thereby helping to improve the accuracy of target detection or point cloud semantic segmentation results.
  • Figure 7 is a schematic structural diagram of a point cloud feature extraction device according to some embodiments of the present disclosure.
  • the point cloud feature extraction device 700 in some embodiments of the present disclosure includes: an acquisition module 710 and a feature extraction module 720.
  • the acquisition module 710 is configured to acquire a sequence of point cloud frames to be processed.
  • the acquisition module 710 is configured to: acquire the original feature data of multiple frames of point clouds to be processed; and convert the original feature data of the multiple frames of point clouds to be processed into BEV feature data, to obtain a point cloud frame sequence to be processed composed of the bird's-eye view feature data of the multiple frames of point clouds to be processed.
  • the feature extraction module 720 is configured to encode the point cloud frame sequence to be processed based on the first feature extraction network model obtained by training, so as to obtain the encoded feature map of the point cloud frame sequence to be processed.
  • a target detection device is configured to: extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method of any embodiment of the present disclosure; and perform target detection according to the feature map of the point cloud frame sequence to be processed.
  • a point cloud semantic segmentation device is configured to: extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method of any embodiment of the present disclosure; and perform point cloud semantic segmentation according to the feature map of the point cloud frame sequence to be processed.
  • Figure 8 is a schematic structural diagram of a point cloud feature extraction network model training device, a point cloud feature extraction device, a target detection device, or a point cloud semantic segmentation device according to some embodiments of the present disclosure.
  • the point cloud feature extraction network model training device or point cloud feature extraction device or target detection device or point cloud semantic segmentation device 800 includes a memory 810; and a processor 820 coupled to the memory 810.
  • the memory 810 is used to store instructions for executing corresponding embodiments of the point cloud feature extraction network model training method, the point cloud feature extraction method, the target detection method, or the point cloud semantic segmentation method.
  • the processor 820 is configured to execute, based on the instructions stored in the memory 810, the point cloud feature extraction network model training method, the point cloud feature extraction method, the target detection method, or the point cloud semantic segmentation method in any embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure.
  • Computer system 900 may be embodied in the form of a general purpose computing device.
  • Computer system 900 includes memory 910, a processor 920, and a bus 930 that connects various system components.
  • Memory 910 may include, for example, system memory, non-volatile storage media, and the like.
  • System memory stores, for example, operating systems, applications, boot loaders, and other programs.
  • System memory may include volatile storage media such as random access memory (RAM) and/or cache memory.
  • the non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one point cloud feature extraction network model training method, point cloud feature extraction method, target detection method, or point cloud semantic segmentation method.
  • Non-volatile storage media include but are not limited to disk storage, optical storage, flash memory, etc.
  • the processor 920 may be implemented as a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete hardware components such as discrete gates or transistors.
  • each module such as the feature extraction module, prediction module, etc., can be implemented by a central processing unit (CPU) running instructions in the memory to perform the corresponding steps, or by a dedicated circuit that performs the corresponding steps.
  • Bus 930 may use any of a variety of bus structures.
  • bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
  • the interfaces 940, 950, 960, the memory 910 and the processor 920 of the computer system 900 may be connected through a bus 930.
  • the input and output interface 940 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard.
  • the network interface 950 provides a connection interface for various networked devices.
  • the storage interface 960 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
  • Figure 10 is a schematic structural diagram of an unmanned vehicle according to some embodiments of the present disclosure
  • Figure 11 is a perspective view of an unmanned vehicle according to some embodiments of the present disclosure.
  • the unmanned vehicle provided by the embodiment of the present disclosure will be described below with reference to FIG. 10 and FIG. 11 .
  • the unmanned vehicle includes four parts: a chassis module 1010, an autonomous driving module 1020, a cargo box module 1030, and a remote monitoring streaming module 1040.
  • the chassis module 1010 mainly includes a battery, a power management device, a chassis controller, a motor driver, and a power motor.
  • the battery provides power for the entire unmanned vehicle system
  • the power management device converts the battery output into the different voltage levels required by each functional module, and controls power on and off.
  • the chassis controller receives motion instructions from the autonomous driving module and controls the steering, forward, backward, braking, etc. of the unmanned vehicle.
  • The autonomous driving module 1020 includes a core processing unit (an Orin or Xavier module), a traffic light recognition camera, front, rear, left, and right surround-view cameras, a multi-line lidar, a positioning module (such as Beidou or GPS), and an inertial navigation unit.
  • The cameras and the autonomous driving module can communicate with each other.
  • For example, GMSL link communication can be used.
  • The autonomous driving module 1020 includes the point cloud feature extraction network model training device, point cloud feature extraction device, target detection device, or point cloud semantic segmentation device of the above embodiments.
  • The remote monitoring streaming module 1030 is composed of a front surveillance camera, a rear surveillance camera, a left surveillance camera, a right surveillance camera, and a streaming module. This module transmits the video data collected by the surveillance cameras to the backend server for review by the backend operator.
  • The wireless communication module communicates with the backend server through the antenna, allowing the backend operator to remotely control the unmanned vehicle.
  • The cargo box module 1040 is the cargo-carrying device of the unmanned vehicle.
  • The cargo box module 1040 is also provided with a display interaction module.
  • The display interaction module is used for the unmanned vehicle to interact with users.
  • Users can perform operations such as picking up, depositing, and purchasing goods through the display interaction module.
  • The type of cargo box can be changed according to actual needs.
  • For example, a cargo box can include multiple sub-boxes of different sizes, and the sub-boxes can be used to load goods for delivery.
  • As another example, the cargo box can be a transparent box so that users can intuitively see the products for sale.
  • The unmanned vehicle in the embodiments of the present disclosure can improve point cloud feature extraction capability, which helps improve the accuracy of point cloud semantic segmentation results or target detection results and thereby improves the safety of unmanned driving.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable device to produce a machine, such that the instructions, when executed by the processor, create a device that implements the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable memory. The instructions cause the computer to operate in a specific manner, so that an article of manufacture is produced that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • The disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • In the point cloud feature extraction network model training method, point cloud feature extraction method, device, and unmanned vehicle of the above embodiments, only one modality of data is used, and no calibration between the lidar and the camera is required.
  • The self-supervised learning of the point cloud feature extraction network model not only reduces the cost of data annotation but also improves the performance of the trained feature extraction model.

Abstract

The present disclosure relates to the technical field of driverless vehicles, and provides a point cloud feature extraction network model training method, a point cloud feature extraction method, an apparatus, and a driverless vehicle. The point cloud feature extraction network model training method comprises: performing first encoding on a sample point cloud frame sequence by using a first feature extraction network model to obtain an encoded feature map of each sample point cloud frame in the sample point cloud frame sequence; according to the encoded feature maps of a plurality of adjacent sample point cloud frames, determining a predicted feature map of the next sample point cloud frame following the plurality of adjacent sample point cloud frames; determining a loss function value according to the predicted feature map and the encoded feature map of the next sample point cloud frame following the plurality of adjacent sample point cloud frames; and training the first feature extraction network model according to the loss function value. By means of the steps above, self-supervised learning of a point cloud feature extraction network model is realized, so that the cost of data annotation is reduced, and the performance of a trained feature extraction model is improved.

Description

Point cloud feature extraction network model training method, point cloud feature extraction method, device, and unmanned vehicle
Cross-Reference to Related Applications
This application is based on, and claims priority to, the CN application No. 202211115103.3 filed on September 14, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical Field
The present disclosure relates to the field of computer vision technology, in particular to the field of unmanned driving, and more particularly to a point cloud feature extraction network model training method, a point cloud feature extraction method, a device, and an unmanned vehicle.
Background
Currently, unmanned driving devices are used to automatically transport people or objects from one location to another; such devices collect environmental information through on-board sensors and complete the transportation automatically. Logistics and transportation by unmanned delivery vehicles controlled by unmanned driving technology have greatly improved the convenience of production and daily life and saved labor costs.
In autonomous driving tasks, obstacles that may hinder driving must be detected and identified to ensure operational safety, so that reasonable avoidance actions can be taken according to different obstacle types and states. Currently, the most mature detection solution in autonomous driving is point cloud detection. Detection models are usually trained in a supervised manner, and in this process the performance of the model is limited by the amount of collected data and the quality of annotation. To obtain a high-performance detection model, a large amount of annotated data is often needed to train the network; however, data collection and annotation have high labor costs and long cycles, which is not conducive to model iteration. In contrast, self-supervised learning does not require data annotation.
Summary
A technical problem to be solved by the present disclosure is to provide a point cloud feature extraction network model training method, a point cloud feature extraction method, a device, and an unmanned vehicle.
According to a first aspect of the present disclosure, a point cloud feature extraction network model training method is provided, including: performing first encoding on a sample point cloud frame sequence by using a first feature extraction network model to obtain an encoded feature map of each frame of sample point cloud in the sample point cloud frame sequence; determining, according to the encoded feature maps of multiple adjacent frames of sample point clouds, a predicted feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds; determining a loss function value according to the predicted feature map and the encoded feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds; and training the first feature extraction network model according to the loss function value.
In some embodiments, determining, according to the encoded feature maps of the multiple adjacent frames of sample point clouds, the predicted feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds includes: performing second encoding on the encoded feature maps of the multiple adjacent frames of sample point clouds respectively by using a second feature extraction network model to obtain intermediate feature maps of the multiple adjacent frames of sample point clouds; fusing the intermediate feature maps of the multiple adjacent frames of sample point clouds to obtain a fused feature map; and decoding the fused feature map to obtain the predicted feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds.
In some embodiments, fusing the intermediate feature maps of the multiple adjacent frames of sample point clouds to obtain the fused feature map includes: determining a feature point matching relationship between the multiple adjacent frames of sample point clouds according to the intermediate feature maps of the multiple adjacent frames of sample point clouds; and fusing the intermediate feature maps of the multiple adjacent frames of sample point clouds according to the feature point matching relationship to obtain the fused feature map.
In some embodiments, determining, according to the encoded feature maps of the multiple adjacent frames of sample point clouds, the predicted feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds includes: determining a feature point matching relationship between the multiple adjacent frames of sample point clouds according to the encoded feature maps of the multiple adjacent frames of sample point clouds; fusing the encoded feature maps of the multiple adjacent frames of sample point clouds according to the feature point matching relationship to obtain a fused feature map; and determining, according to the fused feature map, the predicted feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds.
In some embodiments, determining the feature point matching relationship between the multiple adjacent frames of sample point clouds according to the intermediate feature maps of the multiple adjacent frames of sample point clouds includes: calculating correlations between feature points of the multiple adjacent frames of sample point clouds according to the intermediate feature maps of the multiple adjacent frames of sample point clouds; and determining the feature point matching relationship between two adjacent frames of sample point clouds according to the correlations between the feature points of the multiple adjacent frames of sample point clouds.
In some embodiments, the multiple adjacent frames of sample point clouds are two adjacent frames of sample point clouds, and the intermediate feature maps of the multiple adjacent frames of sample point clouds include a first intermediate feature map corresponding to the first frame of the two adjacent frames of sample point clouds and a second intermediate feature map corresponding to the second frame of the two adjacent frames of sample point clouds; and calculating the correlations between the feature points of the multiple adjacent frames of sample point clouds according to the intermediate feature maps of the multiple adjacent frames of sample point clouds includes: calculating the correlation between each feature point on the first intermediate feature map and the feature points within a specified range on the second intermediate feature map, the specified range being a neighborhood range of the feature point of the first intermediate feature map; and determining the feature point matching relationship between the two adjacent frames of sample point clouds according to the correlations.
In some embodiments, the multiple adjacent frames of sample point clouds are two adjacent frames of sample point clouds; and fusing the intermediate feature maps of the two adjacent frames of sample point clouds according to the feature point matching relationship to obtain the fused feature map includes: concatenating, according to the feature point matching relationship, the features of the matched feature points between the intermediate feature maps of the two adjacent frames of sample point clouds, and using the concatenated feature map as the fused feature map.
In some embodiments, determining the loss function value according to the predicted feature map and the encoded feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds includes: calculating, between the predicted feature map and the encoded feature map of the next frame of sample point cloud, the Euclidean distance between feature points having the same position index; and calculating the loss function value according to the Euclidean distances between the feature points of all position indexes.
In some embodiments, the first feature extraction network model is a shared-weight encoder, and the shared-weight encoder includes multiple encoding modules, each of which is used to encode one frame in the sample point cloud frame sequence.
In some embodiments, the encoding module includes a convolutional neural network and a self-attention network.
In some embodiments, the method further includes: converting original feature data of multiple frames of sample point clouds into two-dimensional image feature data to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multiple frames of sample point clouds.
In some embodiments, converting the original feature data of the multiple frames of sample point clouds into two-dimensional image feature data includes: converting the original feature data of the multiple frames of sample point clouds into bird's-eye view (BEV) feature data.
In some embodiments, the sample point cloud frame sequence consists of multiple frames of sample point clouds that are continuous in time series; and/or the number of frames of sample point clouds included in the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5.
In some embodiments, the second feature extraction network model includes: an attention encoding module, configured to perform second encoding on the encoded feature maps of the multiple adjacent frames of sample point clouds respectively; and an attention decoding module, configured to decode the fused feature map to obtain the predicted feature map of the next frame of sample point cloud following the multiple adjacent frames of sample point clouds.
According to a second aspect of the present disclosure, a point cloud feature extraction method is provided, including: obtaining a point cloud frame sequence to be processed; and encoding the point cloud frame sequence to be processed based on a first feature extraction network model trained by the above feature extraction network model training method, to obtain a feature map of the point cloud frame sequence to be processed.
In some embodiments, obtaining the point cloud frame sequence to be processed includes: obtaining original data of multiple frames of point clouds to be processed; and converting the original feature data of the multiple frames of point clouds to be processed into bird's-eye view (BEV) feature data to obtain a point cloud frame sequence to be processed that is composed of the bird's-eye view feature data of the multiple frames of point clouds to be processed.
According to a third aspect of the present disclosure, a target detection method is provided, including: extracting a feature map of a point cloud frame sequence to be processed according to the aforementioned point cloud feature extraction method; and performing target detection according to the feature map of the point cloud frame sequence to be processed.
According to a fourth aspect of the present disclosure, a point cloud semantic segmentation method is provided, including: extracting a feature map of a point cloud frame sequence to be processed according to the aforementioned point cloud feature extraction method; and performing point cloud semantic segmentation according to the feature map of the point cloud frame sequence to be processed.
According to a fifth aspect of the present disclosure, a device is provided, including: a module for performing the point cloud feature extraction network model training method described above, or a module for performing the point cloud feature extraction method described above, or a module for performing the target detection method described above, or a module for performing the point cloud semantic segmentation method described above.
According to a sixth aspect of the present disclosure, an electronic device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the point cloud feature extraction network model training method described above, or the point cloud feature extraction method described above, or the target detection method described above, or the point cloud semantic segmentation method described above.
According to a seventh aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, and the instructions, when executed by a processor, implement the point cloud feature extraction network model training method described above, or the point cloud feature extraction method described above, or the target detection method described above, or the point cloud semantic segmentation method described above.
According to a sixth aspect of the present disclosure, an unmanned vehicle is further provided, including the device or electronic device described above.
通过以下参照附图对本公开的示例性实施例的详细描述,本公开的其它特征及其优点将会变得清楚。Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Brief Description of the Drawings
构成说明书的一部分的附图描述了本公开的实施例,并且连同说明书一起用于解释本公开的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and, together with the description, serve to explain principles of the disclosure.
参照附图,根据下面的详细描述,可以更加清楚地理解本公开,其中:The present disclosure may be more clearly understood from the following detailed description with reference to the accompanying drawings, in which:
图1为根据本公开一些实施例的点云特征提取网络模型训练方法的流程示意图;Figure 1 is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure;
图2为根据本公开另一些实施例的点云特征提取网络模型训练方法的流程示意图;Figure 2 is a schematic flowchart of a point cloud feature extraction network model training method according to other embodiments of the present disclosure;
图3为根据本公开一些实施例的第一特征提取网络模型的结构示意图;Figure 3 is a schematic structural diagram of a first feature extraction network model according to some embodiments of the present disclosure;
图4a为根据本公开一些实施例的确定预测特征图步骤的流程示意图;Figure 4a is a schematic flowchart of the steps of determining a prediction feature map according to some embodiments of the present disclosure;
图4b为根据本公开一些实施例的点云特征提取网络模型训练方法的流程示意图;Figure 4b is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure;
图5为根据本公开一些实施例的点云特征提取方法的流程示意图;Figure 5 is a schematic flowchart of a point cloud feature extraction method according to some embodiments of the present disclosure;
图6为根据本公开一些实施例的点云特征提取网络模型训练装置的结构示意图;Figure 6 is a schematic structural diagram of a point cloud feature extraction network model training device according to some embodiments of the present disclosure;
图7为根据本公开一些实施例的点云特征提取装置的结构示意图;Figure 7 is a schematic structural diagram of a point cloud feature extraction device according to some embodiments of the present disclosure;
图8为根据本公开一些实施例的点云特征提取网络模型训练装置或点云特征提取装置或目标检测装置或点云语义分割装置的结构示意图; Figure 8 is a schematic structural diagram of a point cloud feature extraction network model training device or a point cloud feature extraction device or a target detection device or a point cloud semantic segmentation device according to some embodiments of the present disclosure;
图9为根据本公开一些实施例的计算机系统的结构示意图;Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure;
图10为根据本公开一些实施例的无人车的结构示意图;Figure 10 is a schematic structural diagram of an autonomous vehicle according to some embodiments of the present disclosure;
图11为根据本公开一些实施例的无人车的立体结构示意图。Figure 11 is a schematic three-dimensional structural diagram of an autonomous vehicle according to some embodiments of the present disclosure.
Detailed Description
现在将参照附图来详细描述本公开的各种示例性实施例。应注意到:除非另外具体说明,否则在这些实施例中阐述的部件和步骤的相对布置、数字表达式和数值不限制本公开的范围。Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that the relative arrangement of components and steps, numerical expressions, and numerical values set forth in these examples do not limit the scope of the disclosure unless otherwise specifically stated.
同时,应当明白,为了便于描述,附图中所示出的各个部分的尺寸并不是按照实际的比例关系绘制的。At the same time, it should be understood that, for convenience of description, the dimensions of various parts shown in the drawings are not drawn according to actual proportional relationships.
以下对至少一个示例性实施例的描述实际上仅仅是说明性的,决不作为对本公开及其应用或使用的任何限制。The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application or uses.
对于相关领域普通技术人员已知的技术、方法和设备可能不作详细讨论,但在适当情况下,所述技术、方法和设备应当被视为授权说明书的一部分。Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and devices should be considered part of the authorized specification.
在这里示出和讨论的所有示例中,任何具体值应被解释为仅仅是示例性的,而不是作为限制。因此,示例性实施例的其它示例可以具有不同的值。In all examples shown and discussed herein, any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步讨论。It should be noted that similar reference numerals and letters refer to similar items in the following figures, so that once an item is defined in one figure, it does not need further discussion in subsequent figures.
为使本公开的目的、技术方案和优点更加清楚明白,以下结合具体实施例,并参照附图,对本公开进一步详细说明。In order to make the purpose, technical solutions and advantages of the present disclosure more clear, the present disclosure will be further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
近期,学界对自监督在图像分类领域有若干尝试,取得了较好的效果,甚至超过了传统的监督学习。但是在自动驾驶领域,特别是针对激光雷达点云检测领域的自监督学习方法,目前鲜有研究,整个学界和业界还处于非常初期的阶段。Recently, the academic community has made several attempts at self-supervision in the field of image classification, and has achieved good results, even surpassing traditional supervised learning. However, in the field of autonomous driving, especially the self-supervised learning method in the field of lidar point cloud detection, there is currently little research, and the entire academic community and industry are still in a very early stage.
In the related art, a method for learning point cloud features based on self-supervision has been proposed. In this method, multiple frames of continuous point clouds are projected onto the corresponding RGB images, the optical flow method is then used to find moving objects in the RGB images, the point clouds corresponding to the moving objects are obtained, and the point cloud features are then learned. This method has the following shortcomings: first, the calibration requirements for the lidar and the cameras are very high; second, at the edges of objects there is a high probability that points cannot be correctly projected, and in addition, since the projection of the point cloud onto the RGB image is a cone, some point clouds may overlap after being projected onto the RGB image, which affects model performance; third, since continuous frames of point clouds and RGB images need to be used, the annotation time and economic cost are high.
鉴于此,本公开提出了一种点云特征提取网络模型训练、点云特征提取方法、装置和无人车,只需使用一种模态的数据,且无需进行雷达、相机之间的标定,即可实现点云特征提取网络模型的自监督学习,不仅减少了数据标注的成本,而且提升了训练得到的点云特征提取网络模型的性能。In view of this, the present disclosure proposes a point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle, which only need to use data of one modality and do not require calibration between radar and cameras. This can realize self-supervised learning of the point cloud feature extraction network model, which not only reduces the cost of data annotation, but also improves the performance of the trained point cloud feature extraction network model.
图1为根据本公开一些实施例的点云特征提取网络模型训练方法的流程示意图。如图1所示,本公开一些实施例的点云特征提取网络模型训练方法包括:Figure 1 is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure. As shown in Figure 1, the point cloud feature extraction network model training method of some embodiments of the present disclosure includes:
步骤S110:利用第一特征提取网络模型,对样本点云帧序列进行第一编码,以得到样本点云帧序列中每一帧样本点云的编码特征图。Step S110: Use the first feature extraction network model to perform first encoding on the sample point cloud frame sequence to obtain the encoded feature map of each frame of the sample point cloud in the sample point cloud frame sequence.
在一些实施例中,点云特征提取网络模型训练方法由点云特征提取网络模型训练装置执行。In some embodiments, the point cloud feature extraction network model training method is executed by a point cloud feature extraction network model training device.
在一些实施例中,样本点云帧序列由在时序上连续的多帧样本点云组成。例如,样本点云帧序列由在时序上连续的3帧、4帧、5帧、6帧、或其他帧数的样本点云组成。In some embodiments, the sample point cloud frame sequence consists of multiple frames of sample point clouds that are sequential in time series. For example, the sample point cloud frame sequence consists of 3, 4, 5, 6, or other frame number sample point clouds that are consecutive in time series.
Since point clouds in actual autonomous driving scenarios are mostly collected as continuous frames, in the embodiments of the present disclosure, making the sample point cloud frame sequence consist of continuous point cloud frames keeps this solution consistent with actual scenarios and can further improve the feature extraction capability of the feature extraction network model.
在一些实施例中,样本点云帧序列的帧数量大于等于3、且小于等于5。通过令样本点云帧序列的帧数量在3~5帧之间,能够缓解过长的帧序列导致对处在“感兴趣区”(Region of Interests,RoI)边缘的物体匹配困难的问题。其中,这里的物体指的是广义的概念,其可以是场景中任何目标,如树木,建筑等,与自动驾驶中需要具体识别的类型无关(比如车、行人,自行车等)。通过上述处理,能够让网络学习到自动驾驶场景中各种物体的低层的信息,如形状,大小等,使得学习后的网络具有更广泛的特征提取能力。In some embodiments, the number of frames of the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5. By making the number of frames in the sample point cloud frame sequence between 3 and 5 frames, the problem of difficulty in matching objects at the edge of the "Region of Interests" (RoI) caused by too long frame sequences can be alleviated. Among them, the object here refers to a broad concept, which can be any target in the scene, such as trees, buildings, etc., regardless of the type that needs to be specifically identified in autonomous driving (such as cars, pedestrians, bicycles, etc.). Through the above processing, the network can learn the low-level information of various objects in the autonomous driving scene, such as shape, size, etc., so that the learned network has broader feature extraction capabilities.
在一些实施例中,样本点云帧序列中的每一帧样本点云为激光雷达采集的原始点云特征数据。例如,原始点云特征数据包括每个点云点的三维位置坐标、以及反射强度。In some embodiments, each sample point cloud in the sample point cloud frame sequence is original point cloud feature data collected by lidar. For example, the original point cloud feature data includes the three-dimensional position coordinates of each point cloud point and the reflection intensity.
In some embodiments, each frame of sample point cloud in the sample point cloud frame sequence is two-dimensional image feature data obtained by processing the original point cloud feature data. In these embodiments, the point cloud feature extraction network model training method further includes: converting the original feature data of multiple frames of sample point clouds into two-dimensional image feature data to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multiple frames of sample point clouds. For example, the original feature data of the multiple frames of sample point clouds is converted into bird's-eye view (BEV) feature data.
在一些实施例中,第一特征提取网络模型为共享权值编码器,该共享权值编码器包括多个编码模块,每个编码模块用于对样本点云帧序列中的一帧进行编码。In some embodiments, the first feature extraction network model is a shared weight encoder. The shared weight encoder includes a plurality of encoding modules, each encoding module is used to encode one frame in the sample point cloud frame sequence.
For example, when the sample point cloud frame sequence includes the BEV data of 4 frames of sample point clouds at times t0, t1, t2, and t3, the BEV data of these 4 frames of sample point clouds is simultaneously input into the four encoding modules of the first feature extraction network model (specifically, encoding modules 1 to 4); for instance, the BEV data of the sample point cloud at time t0 is input into encoding module 1, the BEV data of the sample point cloud at time t1 is input into encoding module 2, the BEV data of the sample point cloud at time t2 is input into encoding module 3, and the BEV data of the sample point cloud at time t3 is input into encoding module 4.
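The following is a minimal PyTorch-style sketch of this shared-weight arrangement, in which a single encoding module is reused for every frame of the BEV sequence; the layer structure, channel sizes, and names are illustrative assumptions rather than the disclosed implementation.

import torch
import torch.nn as nn

class SharedWeightEncoder(nn.Module):
    """Applies one encoding module with shared weights to every frame's BEV feature map."""
    def __init__(self, in_channels: int = 64, out_channels: int = 128):
        super().__init__()
        # A single encoding module; reusing it for each frame is what "shared weights" means here.
        self.encode = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, bev_frames: torch.Tensor) -> torch.Tensor:
        # bev_frames: (batch, num_frames, channels, height, width), e.g. frames at t0..t3
        b, t, c, h, w = bev_frames.shape
        encoded = self.encode(bev_frames.reshape(b * t, c, h, w))
        return encoded.reshape(b, t, -1, h, w)  # one encoded feature map per frame

# Usage: four BEV pseudo-images (t0..t3) encoded with the same weights.
frames = torch.randn(2, 4, 64, 128, 128)
feature_maps = SharedWeightEncoder()(frames)  # shape (2, 4, 128, 128, 128)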
步骤S120:根据相邻多帧样本点云的编码特征图,确定位于相邻多帧样本点云之后的下一帧样本点云的预测特征图。Step S120: Determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud based on the encoding feature map of the adjacent multi-frame sample point cloud.
在一些实施例中,根据相邻两帧样本点云的编码特征图,确定位于相邻两帧样本点云之后的下一帧样本点云的预测特征图。In some embodiments, based on the encoding feature maps of the sample point clouds of the two adjacent frames, the predicted feature map of the sample point cloud of the next frame located after the sample point clouds of the two adjacent frames is determined.
例如,样本点云帧序列由连续的3帧样本点云组成,分别为t0、t1、t2时刻的样本点云帧,根据t0时刻的样本点云帧的编码特征图和t1时刻的样本点云帧的编码特征图,确定t2时刻的样本点云帧的预测特征图。For example, the sample point cloud frame sequence consists of three consecutive frames of sample point clouds, which are the sample point cloud frames at time t 0 , t 1 , and t 2 respectively. According to the encoding feature map of the sample point cloud frame at time t 0 and t 1 The encoding feature map of the sample point cloud frame at time t2 determines the predicted feature map of the sample point cloud frame at time t2 .
例如,样本点云帧序列由连续的4帧样本点云组成,分别为t0、t1、t2、t3时刻的样本点云帧,根据t0时刻的样本点云帧的编码特征图和t1时刻的样本点云帧的编码特征图,确定t2时刻的样本点云帧的预测特征图;根据t1时刻的样本点云帧的编码特征图和t2时刻的样本点云帧的编码特征图,确定t3时刻的样本点云帧的预测特征图。For example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at time t 0 , t 1 , t 2 , and t 3 respectively. According to the encoding feature map of the sample point cloud frame at time t 0 and the coding feature map of the sample point cloud frame at time t 1 , determine the prediction feature map of the sample point cloud frame at time t 2 ; based on the coding feature map of the sample point cloud frame at time t 1 and the sample point cloud frame at time t 2 The encoding feature map is used to determine the predicted feature map of the sample point cloud frame at time t 3 .
在一些实施例中,根据相邻3帧或3帧以上样本点云的编码特征图,确定位于相邻3帧或3帧以上样本点云之后的下一帧样本点云的预测特征图。In some embodiments, the predicted feature map of the sample point cloud of the next frame located after the sample point cloud of three adjacent frames or more is determined based on the encoding feature map of the sample point cloud of three adjacent frames or more.
例如,样本点云帧序列由连续的4帧样本点云组成,分别为t0、t1、t2、t3时刻的样本点云帧,根据t0、t1、t2这三个时刻的样本点云帧的编码特征图,确定t3时刻的样本点云帧的预测特征图。For example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at t 0 , t 1 , t 2 , and t 3 respectively. According to the three times t 0 , t 1 , and t 2 The encoding feature map of the sample point cloud frame is determined to determine the predicted feature map of the sample point cloud frame at time t 3 .
在一些实施例中,步骤S120包括:利用第二特征提取网络模型,对相邻多帧样本点云的编码特征图分别进行第二编码,以得到相邻多帧样本点云的中间特征图;对相邻多帧样本点云的中间特征图进行融合,以得到融合特征图;对融合特征图进行解码,以得到位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图。In some embodiments, step S120 includes: using a second feature extraction network model to perform second encoding on the coded feature maps of adjacent multi-frame sample point clouds, respectively, to obtain intermediate feature maps of adjacent multi-frame sample point clouds; Fusion of intermediate feature maps of adjacent multi-frame sample point clouds to obtain a fused feature map; decoding the fused feature map to obtain predictions of the next frame of sample point clouds located after the adjacent multi-frame sample point clouds Feature map.
在一些实施例中,第二特征提取网络模型包括注意力编码模块和注意力解码模块。其中,注意力编码模块,用于对相邻多帧样本点云的编码特征图分别进行第二编码; 注意力解码模块,用于对融合特征图进行解码,以得到位于相邻多帧样本点云之后的下一帧样本点云的预测特征图。In some embodiments, the second feature extraction network model includes an attention encoding module and an attention decoding module. Among them, the attention coding module is used to perform second coding on the coding feature maps of adjacent multi-frame sample point clouds; The attention decoding module is used to decode the fused feature map to obtain the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.
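A minimal sketch of this prediction path, assuming a PyTorch implementation: the encoded feature maps of two adjacent frames are second-encoded with a Transformer-based attention module, fused, and decoded into a predicted feature map. Channel-wise concatenation and a convolutional decoder are used here as simplified stand-ins for the matching-based fusion and the attention decoding module; all names and sizes are assumptions.

import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    """Second-encode two adjacent encoded feature maps, fuse them, and decode a predicted feature map."""
    def __init__(self, channels: int = 128, nhead: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=nhead, batch_first=True)
        self.attn_encoder = nn.TransformerEncoder(layer, num_layers=2)  # attention encoding module
        self.decoder = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)  # simplified decoder

    def second_encode(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)           # (b, h*w, c): one token per BEV position
        tokens = self.attn_encoder(tokens)                 # intermediate feature map as tokens
        return tokens.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, feat_t0: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        mid_t0 = self.second_encode(feat_t0)
        mid_t1 = self.second_encode(feat_t1)
        fused = torch.cat([mid_t0, mid_t1], dim=1)         # simple channel-wise fusion
        return self.decoder(fused)                         # predicted feature map of the next frame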
在一些实施例中,根据如下方式对相邻多帧样本点云的中间特征图进行融合:根据相邻多帧样本点云的中间特征图,确定相邻多帧样本点云之间的特征点匹配关系;根据特征点匹配关系,对相邻多帧样本点云的中间特征图进行融合,以得到融合特征图。In some embodiments, the intermediate feature maps of adjacent multi-frame sample point clouds are fused according to the following method: based on the intermediate feature maps of adjacent multi-frame sample point clouds, the feature points between adjacent multi-frame sample point clouds are determined. Matching relationship; according to the matching relationship of feature points, the intermediate feature maps of adjacent multi-frame sample point clouds are fused to obtain a fused feature map.
在另一些实施例中,步骤S120包括:根据相邻多帧样本点云的编码特征图,确定相邻多帧样本点云之间的特征点匹配关系;根据特征点匹配关系,对相邻多帧样本点云的编码特征图进行融合,以得到融合特征图;根据融合特征图,确定位于相邻多帧样本点云之后的下一帧样本点云的预测特征图。In other embodiments, step S120 includes: determining the feature point matching relationship between adjacent multiple frame sample point clouds based on the coding feature maps of adjacent multiple frame sample point clouds; based on the feature point matching relationship, determining the matching relationship between adjacent multiple frame sample point clouds. The coding feature maps of the frame sample point clouds are fused to obtain a fusion feature map; based on the fusion feature map, the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point clouds is determined.
步骤S130:根据位于相邻多帧样本点云之后的下一帧样本点云的预测特征图和其编码特征图,确定损失函数值。Step S130: Determine the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map.
在一些实施例中,通过步骤S120确定了一帧样本点云的预测特征图。在这些实施例中,根据这一帧样本点云的预测特征图和其编码特征图,确定损失函数值。In some embodiments, the predicted feature map of a frame of sample point cloud is determined through step S120. In these embodiments, the loss function value is determined based on the predicted feature map of the sample point cloud of this frame and its encoding feature map.
例如,样本点云帧序列由连续的3帧样本点云组成,分别为t0、t1、t2时刻的样本点云帧,在根据t0和t1时刻的样本点云帧的编码特征图确定t2时刻的样本点云帧的预测特征图之后,根据t2时刻的样本点云帧的编码特征图和t2时刻的样本点云帧的预测特征图,确定损失函数值。For example, the sample point cloud frame sequence consists of three consecutive frames of sample point clouds, which are the sample point cloud frames at time t 0 , t 1 , and t 2 respectively. According to the encoding characteristics of the sample point cloud frames at time t 0 and t 1 After determining the prediction feature map of the sample point cloud frame at time t 2 , the loss function value is determined based on the encoding feature map of the sample point cloud frame at time t 2 and the prediction feature map of the sample point cloud frame at time t 2 .
在一些实施例中,通过步骤S120确定了多帧样本点云的预测特征图。在这些实施例中,根据这多帧样本点云的预测特征图和其编码特征图,确定损失函数值。In some embodiments, the predicted feature map of the multi-frame sample point cloud is determined through step S120. In these embodiments, the loss function value is determined based on the predicted feature map of the multi-frame sample point cloud and its encoding feature map.
例如,样本点云帧序列由连续的4帧样本点云组成,分别为t0、t1、t2、t3时刻的样本点云帧,在根据t0和t1时刻的样本点云帧的编码特征图确定t2时刻的样本点云帧的预测特征图,以及根据t1和t2时刻的样本点云帧的编码特征图确定t3时刻的样本点云帧的预测特征图之后,根据t2时刻的样本点云帧的编码特征图和预测特征图、以及t3时刻的样本点云帧的编码特征图和预测特征图,确定损失函数的值。For example, the sample point cloud frame sequence consists of 4 consecutive sample point cloud frames, which are the sample point cloud frames at t 0 , t 1 , t 2 , and t 3 respectively. According to the sample point cloud frames at t 0 and t 1 After determining the prediction feature map of the sample point cloud frame at time t 2 based on the coding feature map, and determining the prediction feature map of the sample point cloud frame at time t 3 based on the coding feature maps of the sample point cloud frames at time t 1 and t 2 , The value of the loss function is determined based on the coding feature map and prediction feature map of the sample point cloud frame at time t 2 and the coding feature map and prediction feature map of the sample point cloud frame at time t 3 .
在一些实施例中,根据如下方式计算损失函数的值:计算位于相邻多帧样本点云之后的下一帧样本点云的预测特征图和其编码特征图中对应同一位置索引的特征点之间的欧式距离;根据所有位置索引的特征点之间的欧式距离,计算损失函数值。In some embodiments, the value of the loss function is calculated as follows: calculating the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and the feature point corresponding to the same position index in its encoded feature map. The loss function value is calculated based on the Euclidean distance between the feature points of all position indexes.
在一些实施例中,采用均方误差损失函数(Mean Squared Error,MSE)度量预测特征图和编码特征图之间的一致性。模型训练的目标为最小化均方误差损失函数的 值。In some embodiments, a mean squared error (MSE) loss function is used to measure the consistency between the predicted feature map and the encoded feature map. The goal of model training is to minimize the mean square error loss function value.
For example, when the loss function value is determined from the predicted feature map K2 of a frame of sample point cloud and its encoded feature map F2, the loss function value is calculated according to the following formula:

MSE = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} \left\| x(i,j) - y(i,j) \right\|^{2}
For example, when determining the loss function value based on the predicted feature map K 2 and its encoding feature map F 2 of a frame sample point cloud, the loss function value is calculated according to the following formula:
where MSE denotes the loss function value, x(i,j) ∈ K2, y(i,j) ∈ F2, i and j are position indexes on the feature maps, m × n is the size of the predicted feature map K2 and of the encoded feature map F2, and ‖x(i,j) − y(i,j)‖² is the squared Euclidean distance between the predicted feature map K2 and its encoded feature map F2 at position (i, j).
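A minimal sketch of this loss computation, assuming PyTorch tensors for the predicted feature map K2 and the encoded feature map F2; the channel layout and sizes below are assumptions.

import torch

def feature_map_mse(predicted: torch.Tensor, encoded: torch.Tensor) -> torch.Tensor:
    """Mean squared error between a predicted feature map K2 and an encoded feature map F2.

    Both tensors are (channels, m, n); the squared Euclidean distance is taken over the
    channel dimension at each position index (i, j), then averaged over the m * n positions.
    """
    assert predicted.shape == encoded.shape
    sq_dist = ((predicted - encoded) ** 2).sum(dim=0)   # ||x(i,j) - y(i,j)||^2 for every (i, j)
    return sq_dist.mean()                               # divide by m * n

# Usage with hypothetical 128-channel feature maps of size 64 x 64:
k2 = torch.randn(128, 64, 64)
f2 = torch.randn(128, 64, 64)
loss = feature_map_mse(k2, f2)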
在本公开另一些实施例中,当根据多帧样本点云的预测特征图和其编码特征图确定损失函数值时,可先基于上述计算公式计算每帧样本点云的预测特征图和其编码特征图的损失函数值,再根据多帧样本点云的损失函数值确定总的损失函数值。In other embodiments of the present disclosure, when determining the loss function value based on the prediction feature map of the multi-frame sample point cloud and its coding feature map, the prediction feature map of each frame sample point cloud and its coding feature map can first be calculated based on the above calculation formula. The loss function value of the feature map is then determined based on the loss function value of the multi-frame sample point cloud to determine the total loss function value.
例如,当根据t2、t3时刻的样本点云帧的编码特征图和预测特征图确定损失函数值时,先根据t2时刻的样本点云帧的编码特征图和预测特征图确定第一损失函数值,根据t3时刻的样本点云帧的编码特征图和预测特征图确定第二损失函数值,再根据第一损失函数值和第二损失函数值确定总的损失函数值。For example, when determining the loss function value based on the coding feature map and prediction feature map of the sample point cloud frame at time t 2 and t 3 , the first step is to determine the first step based on the coding feature map and prediction feature map of the sample point cloud frame at time t 2 . The loss function value determines the second loss function value based on the encoding feature map and the prediction feature map of the sample point cloud frame at time t 3 , and then determines the total loss function value based on the first loss function value and the second loss function value.
步骤S140:根据损失函数值,对第一特征提取网络模型进行训练。Step S140: Train the first feature extraction network model according to the loss function value.
在步骤S140中,根据损失函数值对特征提取网络模型进行更新。重复步骤S110至步骤S140,直至达到训练结束条件,比如训练步长达到预设值(例如200万步)。In step S140, the feature extraction network model is updated according to the loss function value. Repeat steps S110 to S140 until the training end condition is reached, for example, the training step length reaches a preset value (for example, 2 million steps).
在本公开实施例中,通过以上步骤实现了点云特征提取网络模型的自监督学习。与相关技术中利用图像光流进行自监督的方法不同,本公开实施例仅需使用点云这一个模态的数据,利用样本点云帧序列中局部的特征关系进行自监督匹配训练,无需进行雷达、相机之间的标定,不仅减少了数据标注的成本,而且提升了训练得到的特征提取模型的性能。进一步,在通过上述步骤训练得到特征提取网络模型之后,可将特征提取网络模型作为骨干网络应用在具体的视觉任务上,如三维目标检测、点云语义分割等,从而能够提高目标检测结果或点云语义分割结果的准确性。In the embodiment of the present disclosure, the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps. Different from the method of using image optical flow for self-supervision in related technologies, the embodiments of the present disclosure only need to use point cloud data in one modality, and use local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without the need for The calibration between radar and cameras not only reduces the cost of data annotation, but also improves the performance of the trained feature extraction model. Furthermore, after the feature extraction network model is trained through the above steps, the feature extraction network model can be used as a backbone network for specific visual tasks, such as three-dimensional target detection, point cloud semantic segmentation, etc., thereby improving the target detection results or point detection results. Accuracy of cloud semantic segmentation results.
图2为根据本公开另一些实施例的点云特征提取网络模型训练方法的流程示意图。如图2所示,本公开另一些实施例的点云特征提取网络模型训练方法包括:Figure 2 is a schematic flowchart of a point cloud feature extraction network model training method according to other embodiments of the present disclosure. As shown in Figure 2, point cloud feature extraction network model training methods in other embodiments of the present disclosure include:
步骤S210:将连续3帧样本点云的原始数据转换为BEV数据。Step S210: Convert the original data of three consecutive frames of sample point clouds into BEV data.
在本公开实施例中,以样本点云帧序列由3帧连续时序点云组成为例,进行说明。在实际应用时,样本点云帧序列可扩展为更多帧的时序序列。 In the embodiment of the present disclosure, the sample point cloud frame sequence is composed of three consecutive time-series point clouds as an example for explanation. In practical applications, the sample point cloud frame sequence can be expanded into a time series sequence of more frames.
在一些实施例中,步骤S210包括:将连续3帧样本点云的原始数据经过体素化(Voxelization)处理,以得到相应的体素化特征数据,再将体素化特征数据转换为鸟瞰图(BEV)视角下的二维图像,即BEV数据。具体实施时,将点云的原始数据转换为BEV数据的方法有很多,比如采用PointPillars方法中生成伪图像的方法,或沿Z轴方向下采样等方法。PointPillars方法是一种基于体素的三维目标检测算法,它的主要思想是把三维点云转换成二维伪图像以便用二维目标检测的方式进行目标检测。In some embodiments, step S210 includes: subjecting the original data of three consecutive frames of sample point clouds to voxelization to obtain corresponding voxelized feature data, and then converting the voxelized feature data into a bird's-eye view. A two-dimensional image from the (BEV) perspective, that is, BEV data. In specific implementation, there are many methods to convert the original point cloud data into BEV data, such as using the method of generating pseudo images in the PointPillars method, or downsampling along the Z-axis direction. The PointPillars method is a voxel-based three-dimensional target detection algorithm. Its main idea is to convert the three-dimensional point cloud into a two-dimensional pseudo image so that target detection can be performed using two-dimensional target detection.
For example, the original point cloud data is first filtered according to the target region of interest; for instance, in the radar coordinate system, the point cloud data located within x ∈ [-30, 30], y ∈ [-15, 15], and z ∈ [-1.8, 0.8] is taken out of the raw point cloud collected by the radar. Then, a voxel cell is established every 0.05 meters in the x-axis and y-axis directions and every 0.10 meters in the z-axis direction to obtain a voxel grid. After the voxel grid is obtained, the PointPillars approach is used to determine the BEV feature map of the point cloud.
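A minimal NumPy sketch of this preprocessing, under the ranges and cell sizes quoted above; for simplicity it flattens the z dimension directly into per-pillar statistics instead of building the full voxel grid and a learned PointPillars encoder, and all names are assumptions.

import numpy as np

# Region of interest and voxel sizes quoted above (meters).
X_RANGE, Y_RANGE, Z_RANGE = (-30.0, 30.0), (-15.0, 15.0), (-1.8, 0.8)
VOXEL_XY = 0.05

def points_to_bev(points: np.ndarray) -> np.ndarray:
    """points: (N, 4) array of x, y, z, intensity. Returns a simple BEV pseudo-image.

    Each BEV cell stores the point count and maximum intensity of its pillar; a learned
    pillar encoder would replace this hand-crafted step in a PointPillars-style pipeline.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = ((x >= X_RANGE[0]) & (x < X_RANGE[1]) &
            (y >= Y_RANGE[0]) & (y < Y_RANGE[1]) &
            (z >= Z_RANGE[0]) & (z < Z_RANGE[1]))
    pts = points[mask]

    cols = ((pts[:, 0] - X_RANGE[0]) / VOXEL_XY).astype(int)   # 1200 columns along x
    rows = ((pts[:, 1] - Y_RANGE[0]) / VOXEL_XY).astype(int)   # 600 rows along y
    h = int((Y_RANGE[1] - Y_RANGE[0]) / VOXEL_XY)
    w = int((X_RANGE[1] - X_RANGE[0]) / VOXEL_XY)

    bev = np.zeros((2, h, w), dtype=np.float32)
    np.add.at(bev[0], (rows, cols), 1.0)                        # point count per pillar
    np.maximum.at(bev[1], (rows, cols), pts[:, 3])              # max reflection intensity per pillar
    return bev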
在本公开实施例中,通过将点云的原始数据转换为BEV数据,使得后续针对二维的BEV数据进行特征提取即可,提高了点云特征提取网络模型的训练效率。而且,由于点云的BEV数据可以较好的保持障碍物的空间关系,更便于后续利用连续帧点云的特征进行自监督学习。In the embodiment of the present disclosure, by converting the original point cloud data into BEV data, subsequent feature extraction can be performed on the two-dimensional BEV data, which improves the training efficiency of the point cloud feature extraction network model. Moreover, since the BEV data of point clouds can better maintain the spatial relationship of obstacles, it is easier to use the features of continuous frame point clouds for self-supervised learning.
步骤S220:基于第一特征提取网络模型,对连续3帧样本点云的BEV数据进行第一编码,得到每一帧样本点云的编码特征图。Step S220: Based on the first feature extraction network model, perform a first encoding on the BEV data of three consecutive frames of sample point clouds to obtain a coded feature map of each frame of sample point clouds.
在一些实施例中,第一特征提取网络模型为共享权值编码器,该共享权值编码器包括三个编码模块,每个编码模块用于对每帧样本点云的BEV数据(或者称为BEV特征图)进行编码。通过采用共享权值编码器,能够在对3帧点云进行特征提取时共享网络的权值,使特征提取网络模型可以区分学习到的不同点云之间的异同点。In some embodiments, the first feature extraction network model is a shared weight encoder. The shared weight encoder includes three encoding modules, each encoding module is used to encode the BEV data (also known as BEV feature map) is encoded. By using a shared weight encoder, the weight of the network can be shared when extracting features from 3-frame point clouds, so that the feature extraction network model can distinguish similarities and differences between different learned point clouds.
在一些实施例中,第一特征提取网络模型的编码模块包括:卷积神经网络和自注意力网络。示例性地,卷积神经网络为ResNet、EfficientNet等二维卷积神经网络。通过设置卷积神经网络,能够提取点云的BEV特征图的初步信息,比如点云的BEV特征图上低层的局部的信息。示例性地,自注意力网络为诸如Transformer等网络。Transformer是一种利用自注意力机制的神经网络模型。通过设置自注意力网络,能够对卷积神经网络输出的3帧样本点云的特征图内每个位置与其他位置的关系编码,提取同一帧样本点云内的间隔较大的空间上的信息。In some embodiments, the encoding module of the first feature extraction network model includes: a convolutional neural network and a self-attention network. For example, the convolutional neural network is a two-dimensional convolutional neural network such as ResNet and EfficientNet. By setting up a convolutional neural network, preliminary information of the BEV feature map of the point cloud can be extracted, such as low-level local information on the BEV feature map of the point cloud. Illustratively, the self-attention network is a network such as Transformer. Transformer is a neural network model that utilizes self-attention mechanism. By setting up a self-attention network, it is possible to encode the relationship between each position and other positions in the feature map of the 3-frame sample point cloud output by the convolutional neural network, and extract spatial information with large intervals within the same frame sample point cloud. .
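A minimal PyTorch-style sketch of such an encoding module, using a small convolutional stack as a stand-in for a ResNet/EfficientNet backbone followed by Transformer self-attention over the BEV positions; the structure and sizes are assumptions.

import torch
import torch.nn as nn

class EncodingModule(nn.Module):
    """One encoding module: a 2D CNN for local BEV features followed by self-attention over positions."""
    def __init__(self, in_channels: int = 64, channels: int = 128, nhead: int = 4):
        super().__init__()
        self.cnn = nn.Sequential(                       # stand-in for a ResNet/EfficientNet backbone
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=nhead, batch_first=True)
        self.self_attention = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        feat = self.cnn(bev)                            # local, low-level information
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)        # each spatial position becomes a token
        tokens = self.self_attention(tokens)            # long-range relations within the same frame
        return tokens.transpose(1, 2).reshape(b, c, h, w)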
步骤S230:根据前两帧样本点云的编码特征图,确定第3帧样本点云的预测特征图。 Step S230: Determine the predicted feature map of the third frame sample point cloud based on the encoding feature maps of the sample point clouds of the first two frames.
例如,当样本点云帧序列包括t0、t1、t2时刻的3帧样本点云时,通过特征提取网络模型对这3帧样本点云的数据进行编码,获得特征图F0、F1、F2。在步骤S230中,根据特征图F0和F1,确定t2时刻的样本点云的预测特征图K2For example, when the sample point cloud frame sequence includes three sample point cloud frames at t 0 , t 1 , and t 2 , the data of these three frame sample point clouds are encoded through the feature extraction network model to obtain the feature maps F 0 and F 1 , F2 . In step S230, the predicted feature map K 2 of the sample point cloud at time t 2 is determined based on the feature maps F 0 and F 1 .
在一些实施例中,根据图4所示流程确定第3帧样本点云的预测特征图。In some embodiments, the predicted feature map of the sample point cloud of the third frame is determined according to the process shown in Figure 4.
步骤S240:根据第3帧样本点云的预测特征图和其编码特征图,确定损失函数值。Step S240: Determine the loss function value based on the predicted feature map of the sample point cloud in the third frame and its encoding feature map.
在一些实施例中,使用均方差损失函数来衡量第3帧样本点云的预测特征图K2和其编码特征图F2之间的一致性。在这些实施例中,训练的目标为最小化均方差损失函数的值。In some embodiments, the mean square error loss function is used to measure the consistency between the predicted feature map K 2 and its encoded feature map F 2 of the sample point cloud in frame 3. In these embodiments, the training objective is to minimize the value of the mean square error loss function.
For example, the value of the mean squared error loss function is calculated according to the following formula:

MSE = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} \left\| x(i,j) - y(i,j) \right\|^{2}
For example, calculate the value of the mean square error loss function according to the following formula:
where MSE denotes the loss function value, x(i,j) ∈ K2, y(i,j) ∈ F2, i and j are position indexes on the feature maps, m × n is the size of the predicted feature map K2 and of the encoded feature map F2 of the sample point cloud of the third frame, and ‖x(i,j) − y(i,j)‖² is the squared Euclidean distance between the predicted feature map K2 and its encoded feature map F2 at position (i, j).
步骤S250:根据损失函数值,对第一特征提取网络模型进行训练。Step S250: Train the first feature extraction network model according to the loss function value.
在步骤S250中,根据损失函数值对第一特征提取网络模型进行更新。此外,在训练过程中还涉及其他需要更新的网络模型时,还根据损失函数值对其他需要更新的网络模型进行更新。In step S250, the first feature extraction network model is updated according to the loss function value. In addition, when other network models that need to be updated are involved in the training process, other network models that need to be updated are also updated based on the loss function value.
按照上述处理步骤,对第一特征提取网络模型进行多次迭代更新,直至达到训练结束条件,比如训练步长达到预设值(例如200万步)。According to the above processing steps, the first feature extraction network model is updated iteratively multiple times until the training end condition is reached, for example, the training step length reaches a preset value (for example, 2 million steps).
在本公开实施例中,通过以上步骤实现了点云特征提取网络模型的自监督学习。与相关技术中利用图像光流进行自监督的方法不同,本公开实施例仅需使用点云这一个模态的数据,利用样本点云帧序列中局部的特征关系进行自监督匹配训练,无需进行雷达、相机之间的标定,不仅减少了数据标注的成本,而且提升了训练得到的点云特征提取网络模型的性能。In the embodiment of the present disclosure, the self-supervised learning of the point cloud feature extraction network model is achieved through the above steps. Different from the method of using image optical flow for self-supervision in related technologies, the embodiments of the present disclosure only need to use point cloud data in one modality, and use local feature relationships in the sample point cloud frame sequence to perform self-supervised matching training, without the need for The calibration between radar and cameras not only reduces the cost of data annotation, but also improves the performance of the trained point cloud feature extraction network model.
图3为根据本公开一些实施例的第一特征提取网络模型的结构示意图。Figure 3 is a schematic structural diagram of a first feature extraction network model according to some embodiments of the present disclosure.
如图3所示,本公开一些实施例的第一特征提取网络模型300包括三个编码模块,分别是编码模块310、编码模块320和编码模块330。As shown in Figure 3, the first feature extraction network model 300 of some embodiments of the present disclosure includes three encoding modules, namely encoding module 310, encoding module 320 and encoding module 330.
在一些实施例中，三个编码模块共享网络权值。在这些实施例中，将样本点云帧序列所包括的3帧样本点云同时输入三个编码模块，比如将t0时刻的样本点云输入编码模块310，将t1时刻的样本点云输入编码模块320，将t2时刻的样本点云输入编码模块330。In some embodiments, the three encoding modules share network weights. In these embodiments, the three frames of sample point clouds included in the sample point cloud frame sequence are input into the three encoding modules simultaneously, for example, the sample point cloud at time t0 is input into encoding module 310, the sample point cloud at time t1 into encoding module 320, and the sample point cloud at time t2 into encoding module 330.
在一些实施例中,每个编码模块包括卷积神经网络和自注意力网络。例如,卷积神经网络采用ResNet模型,自注意力网络采用Transformer模型。具体实施时,Transformer模型可采用相关技术中的标准结构。In some embodiments, each encoding module includes a convolutional neural network and a self-attention network. For example, the convolutional neural network uses the ResNet model, and the self-attention network uses the Transformer model. During specific implementation, the Transformer model can adopt standard structures in related technologies.
在本公开实施例中,通过采用如上结构的第一特征提取网络模型,能够提取更多的点云特征信息,提高点云特征提取能力。In the embodiments of the present disclosure, by adopting the first feature extraction network model with the above structure, more point cloud feature information can be extracted and the point cloud feature extraction capability can be improved.
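Purely as an illustration of this kind of structure (and not the disclosed implementation), a shared-weight encoding module could be sketched in PyTorch as follows; the lightweight CNN used in place of a ResNet/EfficientNet backbone, the channel sizes, layer counts, and input shapes are all assumptions.

```python
import torch
import torch.nn as nn

class EncodingModule(nn.Module):
    """One encoding module: CNN backbone followed by self-attention
    over the spatial positions of the resulting feature map."""
    def __init__(self, in_ch: int = 3, feat_ch: int = 128, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # Lightweight CNN stand-in for a ResNet/EfficientNet backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=feat_ch, nhead=n_heads, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, bev: torch.Tensor) -> torch.Tensor:
        # bev: (B, in_ch, H, W) bird's-eye-view pseudo image.
        f = self.cnn(bev)                      # (B, C, H', W')
        b, c, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)     # (B, H'*W', C) tokens
        seq = self.attn(seq)                   # self-attention over spatial positions
        return seq.transpose(1, 2).reshape(b, c, h, w)

# Weight sharing across frames: the same module instance is reused for every frame.
encoder = EncodingModule()
frames = [torch.randn(1, 3, 128, 128) for _ in range(3)]  # assumed BEV inputs at t0, t1, t2
feature_maps = [encoder(x) for x in frames]                # F0, F1, F2
```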
图4a为根据本公开一些实施例的确定预测特征图步骤的流程示意图。图4a是对步骤S230的示例性说明。如图4a所示,本公开实施例的确定预测特征图步骤包括:Figure 4a is a schematic flowchart of the steps of determining a prediction feature map according to some embodiments of the present disclosure. Figure 4a is an exemplary illustration of step S230. As shown in Figure 4a, the steps of determining the prediction feature map in this embodiment of the present disclosure include:
步骤S410:对前两帧样本点云的编码特征图进行第二编码,以得到前两帧样本点云的中间特征图。Step S410: Perform a second encoding on the coded feature maps of the sample point clouds of the first two frames to obtain the intermediate feature maps of the sample point clouds of the first two frames.
在一些实施例中，在步骤S410中，根据如下方式确定中间特征图：利用第二特征提取网络模型中的注意力编码模块，对前两帧样本点云的编码特征图进行编码，以得到前两帧样本点云的中间特征图。例如，注意力编码模块采用Transformer模型。In some embodiments, in step S410, the intermediate feature maps are determined as follows: the attention encoding module in the second feature extraction network model is used to encode the encoded feature maps of the first two frames of sample point clouds, so as to obtain the intermediate feature maps of the first two frames of sample point clouds. For example, the attention encoding module adopts a Transformer model.
通过设置注意力编码模块，能够基于注意力机制学习到不同帧的编码特征图上的空间位置关系，进而有助于更准确地确定特征点匹配关系，提高由此确定的预测特征图与真实特征图的一致性，进而提高特征提取网络模型的训练效率。By providing the attention encoding module, the spatial position relationships across the encoded feature maps of different frames can be learned based on the attention mechanism, which helps determine the feature point matching relationship more accurately, improves the consistency between the predicted feature map determined in this way and the true feature map, and thereby improves the training efficiency of the feature extraction network model.
在本公开的另一些实施例中，在样本点云帧序列包括四个或四个以上样本点云帧时，且根据相邻两帧样本点云的编码特征图确定下一帧样本点云的编码特征图时，根据如下方式对相邻两帧样本点云的编码特征图进行编码：基于注意力编码模块，对相邻两帧样本点云的编码特征图进行编码，以得到相邻两帧样本点云的中间特征图。In other embodiments of the present disclosure, when the sample point cloud frame sequence includes four or more sample point cloud frames and the encoded feature map of the next frame of sample point cloud is determined according to the encoded feature maps of two adjacent frames of sample point clouds, the encoded feature maps of the two adjacent frames of sample point clouds are encoded as follows: based on the attention encoding module, the encoded feature maps of the two adjacent frames of sample point clouds are encoded to obtain the intermediate feature maps of the two adjacent frames of sample point clouds.
例如，当样本点云帧序列包括t0、t1、t2、t3这四个时刻的样本点云帧、且根据相邻两帧样本点云的编码特征图确定下一帧样本点云的编码特征图时，相邻两帧样本点云包括t0和t1时刻的样本点云帧、以及t1和t2时刻的样本点云帧，则基于注意力编码模块对t0、t1、t2时刻的样本点云帧的编码特征图进行编码，以得到t0、t1、t2时刻的样本点云帧的中间特征图。For example, when the sample point cloud frame sequence includes sample point cloud frames at the four times t0, t1, t2, and t3, and the encoded feature map of the next frame of sample point cloud is determined according to the encoded feature maps of two adjacent frames of sample point clouds, the adjacent two-frame pairs include the sample point cloud frames at times t0 and t1 as well as the sample point cloud frames at times t1 and t2; the encoded feature maps of the sample point cloud frames at times t0, t1, and t2 are then encoded based on the attention encoding module to obtain the intermediate feature maps of the sample point cloud frames at times t0, t1, and t2.
步骤S420:根据中间特征图,确定前两帧样本点云之间的特征点匹配关系。Step S420: Determine the feature point matching relationship between the sample point clouds of the first two frames based on the intermediate feature map.
在一些实施例中，在步骤S420中，根据如下方式确定前两帧样本点云之间的特征点匹配关系：计算第一帧的中间特征图上的每个特征点、与第二帧的中间特征图上指定范围内的特征点的相关度；根据相关度，确定前两帧样本点云之间的特征点匹配关系。In some embodiments, in step S420, the feature point matching relationship between the first two frames of sample point clouds is determined as follows: the correlation between each feature point on the intermediate feature map of the first frame and the feature points within a specified range on the intermediate feature map of the second frame is calculated; and the feature point matching relationship between the first two frames of sample point clouds is determined according to the correlation.
其中,指定范围为第一帧的中间特征图的特征点的邻域范围。例如,对于第一帧的中间特征图上的任一特征点P0来说,将以该点的位置坐标为中心,以预设长度为半径的圆形区域作为该特征点的邻域范围。又例如,令邻域范围为高斯邻域。在本公开实施例中,通过在指定范围内搜索匹配特征点,能够避免对第二帧的中间特征图进行全局搜索,从而降低计算量。Among them, the specified range is the neighborhood range of the feature points of the intermediate feature map of the first frame. For example, for any feature point P 0 on the intermediate feature map of the first frame, a circular area with the position coordinate of the point as the center and a preset length as the radius will be the neighborhood range of the feature point. For another example, let the neighborhood range be a Gaussian neighborhood. In the embodiment of the present disclosure, by searching for matching feature points within a specified range, a global search for the intermediate feature map of the second frame can be avoided, thereby reducing the amount of calculation.
其中,相关度可采取多种度量方式。在一些实施例中,将特征点之间的余弦距离作为两者的相关度。Among them, correlation can be measured in various ways. In some embodiments, the cosine distance between feature points is used as the correlation between the two.
其中，特征点匹配关系包括：第一帧的中间特征图上的特征点、与其在第二帧的中间特征图上的匹配特征点的对应关系。在一些实施例中，对于第一帧的中间特征图上的任一特征点P0来说，将第二帧的中间特征图中指定范围内、与P0相关度最大的特征点，作为P0的匹配特征点。The feature point matching relationship includes the correspondence between a feature point on the intermediate feature map of the first frame and its matching feature point on the intermediate feature map of the second frame. In some embodiments, for any feature point P0 on the intermediate feature map of the first frame, the feature point within the specified range of the intermediate feature map of the second frame that has the greatest correlation with P0 is taken as the matching feature point of P0.
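A rough sketch of this neighborhood-restricted matching, assuming cosine similarity as the correlation measure and a square window as a simple stand-in for the circular or Gaussian neighborhood described above; the window radius, tensor shapes, and the naive loops are illustrative only.

```python
import torch
import torch.nn.functional as F

def match_in_neighborhood(v0: torch.Tensor, v1: torch.Tensor, radius: int = 3) -> torch.Tensor:
    """For every feature point of v0, find the most correlated feature point of v1
    inside a window of up to (2*radius+1) x (2*radius+1) positions around the
    same location (clipped at the borders).

    v0, v1: (C, H, W) intermediate feature maps of two adjacent frames.
    Returns an (H, W, 2) tensor of matched (row, col) indices into v1.
    """
    c, h, w = v0.shape
    matches = torch.zeros(h, w, 2, dtype=torch.long)
    for i in range(h):              # naive loops for clarity; a real implementation would vectorize
        for j in range(w):
            r0, r1 = max(0, i - radius), min(h, i + radius + 1)
            c0, c1 = max(0, j - radius), min(w, j + radius + 1)
            window = v1[:, r0:r1, c0:c1].reshape(c, -1)       # (C, K) candidate features
            query = v0[:, i, j].unsqueeze(1)                  # (C, 1) feature of P0
            corr = F.cosine_similarity(query, window, dim=0)  # (K,) correlations
            k = int(corr.argmax())                            # most correlated candidate
            matches[i, j, 0] = r0 + k // (c1 - c0)
            matches[i, j, 1] = c0 + k % (c1 - c0)
    return matches
```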
步骤S430:根据特征点匹配关系,对前两帧样本点云的中间特征图进行融合,以得到融合特征图。Step S430: Fusion of the intermediate feature maps of the sample point clouds of the first two frames according to the feature point matching relationship to obtain a fused feature map.
在一些实施例中，根据特征点匹配关系，将前两帧样本点云的中间特征图之间的匹配特征点进行特征拼接，并将拼接得到的特征图作为融合特征图。In some embodiments, according to the feature point matching relationship, the features of matching feature points between the intermediate feature maps of the first two frames of sample point clouds are concatenated, and the resulting feature map is used as the fused feature map.
例如，对于第一帧样本点云的中间特征图上的任一特征点P0来说，将其与在第二帧样本点云的中间特征图上的匹配特征点P1的特征进行拼接，并将特征点P0的位置索引作为拼接后的特征点的位置索引，从而得到了一个融合特征点。依此类推，可得到由融合特征点构成的融合特征图。For example, for any feature point P0 on the intermediate feature map of the first frame of sample point cloud, its features are concatenated with those of its matching feature point P1 on the intermediate feature map of the second frame of sample point cloud, and the position index of feature point P0 is used as the position index of the concatenated feature point, thereby obtaining one fused feature point. Proceeding in this way, a fused feature map composed of fused feature points is obtained.
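Continuing the previous sketch, the fusion by feature concatenation described here could look as follows; gathering the matched V1 features onto V0's grid before concatenating along the channel dimension corresponds to keeping the position index of P0 (again an illustrative reading, not the disclosed code).

```python
import torch

def fuse_matched_features(v0: torch.Tensor, v1: torch.Tensor, matches: torch.Tensor) -> torch.Tensor:
    """Concatenate each feature point of v0 with its matched feature point of v1.

    v0, v1: (C, H, W) intermediate feature maps.
    matches: (H, W, 2) indices into v1, e.g. from match_in_neighborhood above.
    Returns a fused feature map of shape (2*C, H, W) indexed by v0's positions.
    """
    c, h, w = v0.shape
    rows = matches[..., 0].reshape(-1)              # (H*W,) matched row indices
    cols = matches[..., 1].reshape(-1)              # (H*W,) matched column indices
    gathered = v1[:, rows, cols].reshape(c, h, w)   # matched V1 features laid out on V0's grid
    return torch.cat([v0, gathered], dim=0)         # (2C, H, W) fused feature map
```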
步骤S440:根据融合特征图,确定第3帧样本点云的预测特征图。Step S440: Determine the predicted feature map of the sample point cloud in the third frame based on the fused feature map.
在一些实施例中,利用第二特征提取网络模型中的注意力解码模块,对融合特征图进行解码,从而得到第3帧样本点云的预测特征图。In some embodiments, the attention decoding module in the second feature extraction network model is used to decode the fused feature map, thereby obtaining the predicted feature map of the sample point cloud in the third frame.
在本公开实施例中，通过以上步骤能够基于前两帧样本点云的编码特征图，高效、精准地确定第3帧样本点云的预测特征图，进而有助于优化点云特征提取网络模型的训练流程，提高训练得到的第一特征提取网络模型的性能。In the embodiments of the present disclosure, the above steps make it possible to determine the predicted feature map of the sample point cloud of the third frame efficiently and accurately based on the encoded feature maps of the first two frames of sample point clouds, which in turn helps optimize the training process of the point cloud feature extraction network model and improve the performance of the trained first feature extraction network model.
图4b为根据本公开一些实施例的点云特征提取网络模型训练方法的流程示意图。在本公开实施例中，以样本点云帧序列包含三帧连续时序点云数据（具体为t0、t1、t2时刻的点云数据c0、c1、c2）为例，对点云特征提取网络模型训练方法进行说明。Figure 4b is a schematic flowchart of a point cloud feature extraction network model training method according to some embodiments of the present disclosure. In this embodiment of the present disclosure, the point cloud feature extraction network model training method is described by taking as an example a sample point cloud frame sequence containing three frames of temporally consecutive point cloud data (specifically, the point cloud data c0, c1, and c2 at times t0, t1, and t2).
如图4b所示,点云特征提取网络模型训练方法包括:步骤1至步骤7。 As shown in Figure 4b, the point cloud feature extraction network model training method includes: Step 1 to Step 7.
步骤1:将t0、t1、t2时刻的点云数据c0、c1、c2分别转换为相应时刻的鸟瞰图。Step 1: Convert the point cloud data c 0 , c 1 , and c 2 at t 0 , t 1 , and t 2 into bird's-eye views at the corresponding moments respectively.
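As a rough illustration of such a conversion (not the disclosed implementation), a point cloud can be rasterized into a BEV pseudo image by binning points over the x-y plane; the grid extent, resolution, and channel choices below are assumptions.

```python
import numpy as np

def point_cloud_to_bev(points: np.ndarray,
                       x_range=(-40.0, 40.0), y_range=(-40.0, 40.0),
                       resolution=0.25) -> np.ndarray:
    """Rasterize an (N, 4) point cloud [x, y, z, intensity] into a BEV image
    with 3 channels: occupancy, max height, max intensity."""
    w = int((x_range[1] - x_range[0]) / resolution)
    h = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((3, h, w), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / resolution).astype(np.int64)
    yi = ((points[:, 1] - y_range[0]) / resolution).astype(np.int64)
    keep = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    xi, yi, pts = xi[keep], yi[keep], points[keep]
    for u, v, p in zip(xi, yi, pts):
        bev[0, v, u] = 1.0                          # occupancy
        bev[1, v, u] = max(bev[1, v, u], p[2])      # max height in the cell
        bev[2, v, u] = max(bev[2, v, u], p[3])      # max intensity in the cell
    return bev
```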
步骤2:将t0、t1、t2时刻的鸟瞰图输入共享权值编码器,以得到相应时刻的特征图F0、F1、F2Step 2: Input the bird's-eye view at time t 0 , t 1 , and t 2 into the shared weight encoder to obtain the feature maps F 0 , F 1 , and F 2 at the corresponding time.
其中,共享权值编码器包括:2D CNN(2维卷积神经网络)、以及Transformer编码器(或称为Transformer网络)。示例性地,2D CNN可采用ResNet、EfficientNet等网络模型,用于提取鸟瞰图中的初步信息。Transformer网络为一种自注意力网络,用于提取鸟瞰图中每个位置与其他位置的特征点关系编码,即同一帧点云的空间位置关系。例如,对于鸟瞰图上车辆位置对应的像素点,通过共享权值编码器对该点进行编码得到特征点X。Among them, shared weight encoders include: 2D CNN (2-dimensional convolutional neural network), and Transformer encoder (or Transformer network). For example, 2D CNN can use network models such as ResNet and EfficientNet to extract preliminary information from a bird's-eye view. The Transformer network is a self-attention network that is used to extract the encoding of the feature point relationship between each position and other positions in the bird's-eye view, that is, the spatial position relationship of the point cloud in the same frame. For example, for the pixel point corresponding to the vehicle position on the bird's-eye view, the point is encoded by the shared weight encoder to obtain the feature point X.
在利用共享权值编码器提取t0、t1、t2时刻的鸟瞰图的特征时，所使用的网络模型的权值是共享的，这样做有助于让网络学习到不同点云之间的异同点。When the shared-weight encoder is used to extract features from the bird's-eye views at times t0, t1, and t2, the weights of the network model used are shared, which helps the network learn the similarities and differences between different point clouds.
步骤3:基于时序注意力转换模块对t0、t1时刻的特征图F0、F1进行编码、特征点相关度计算,以得到t0、t1时刻的中间特征图和特征点匹配关系。Step 3: Based on the temporal attention conversion module, encode the feature maps F 0 and F 1 at t 0 and t 1 and calculate the feature point correlation to obtain the intermediate feature map and feature point matching relationship at t 0 and t 1 .
其中，时序注意力转换模块包括：Transformer编码器和相关性计算模块。在步骤3中，先基于Transformer编码器对t0、t1时刻的特征图F0、F1分别进行编码，以得到相应时刻的中间特征图；然后计算t0、t1时刻的中间特征图V0、V1之间的特征点相关度；根据t0、t1时刻的中间特征图之间的特征点相关度，确定t0、t1时刻的中间特征图之间的特征点匹配关系。The temporal attention conversion module includes a Transformer encoder and a correlation calculation module. In step 3, the feature maps F0 and F1 at times t0 and t1 are first encoded separately based on the Transformer encoder to obtain the intermediate feature maps at the corresponding times; then the feature point correlation between the intermediate feature maps V0 and V1 at times t0 and t1 is calculated; and the feature point matching relationship between the intermediate feature maps at times t0 and t1 is determined according to this correlation.
在一些实施例中，对于中间特征图V0上的每一个特征点，计算该点与其在中间特征图V1上对应位置的邻域范围内的特征点的相关度，并将相关度最大的特征点作为该点的匹配特征点。In some embodiments, for each feature point on the intermediate feature map V0, the correlation between this point and the feature points within the neighborhood range of its corresponding position on the intermediate feature map V1 is calculated, and the feature point with the largest correlation is taken as the matching feature point of this point.
通过计算中间特征图V0上的每一个特征点,与其在中间特征图V1上的匹配特征点,从而可得到t0、t1时刻的中间特征图之间的特征点匹配关系。By calculating each feature point on the intermediate feature map V 0 and its matching feature point on the intermediate feature map V 1 , the matching relationship between the feature points between the intermediate feature maps at time t 0 and t 1 can be obtained.
步骤4:基于位置转换编码模块对中间特征图V0、V1进行融合,并对融合特征图进行解码,以得到t2时刻的预测特征图。Step 4: Fusion of the intermediate feature maps V 0 and V 1 based on the position transformation coding module, and decoding the fused feature map to obtain the predicted feature map at time t 2 .
其中,位置转换编码模块包括:融合模块和Transformer解码器。融合模块,根据t0、t1时刻的中间特征图之间的特征点匹配关系,对t0、t1时刻的中间特征图进行融合,以得到融合特征图。Transformer解码器对融合特征图进行解码,以得到t2时刻的预测特征图。Among them, the position transformation coding module includes: fusion module and Transformer decoder. The fusion module fuses the intermediate feature maps at t 0 and t 1 based on the feature point matching relationship between the intermediate feature maps at t 0 and t 1 to obtain a fused feature map. The Transformer decoder decodes the fused feature map to obtain the predicted feature map at time t 2 .
步骤5：根据t2时刻的编码特征图和t2时刻的预测特征图，计算MSE（均方根损失函数）。Step 5: Calculate the MSE loss based on the encoded feature map at time t2 and the predicted feature map at time t2.
其中,MSE用来度量t2时刻的预测特征图和t2时刻的编码特征图的一致性,整个训练的目标就是最小化MSE。Among them, MSE is used to measure the consistency of the predicted feature map at time t 2 and the encoding feature map at time t 2. The goal of the entire training is to minimize MSE.
步骤6：重复步骤1至步骤5，直至达到模型训练截止条件，比如训练步长达到200万步。Step 6: Repeat steps 1 to 5 until the model training cutoff condition is met, for example, the number of training steps reaches 2 million.
步骤7:输出共享权值编码器。Step 7: Output the shared weight encoder.
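Putting steps 1 through 7 together, a heavily simplified training loop might look like the sketch below; it reuses the illustrative helpers from the earlier sketches (EncodingModule, point_cloud_to_bev, match_in_neighborhood, fuse_matched_features, mse_feature_loss), reduces the position transformation coding module to a 1×1 convolution plus a Transformer decoder over flattened features, and every name, shape, and hyperparameter is an assumption rather than the patented implementation.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Reduces a fused (2C, H, W) feature map and decodes a predicted feature map for t2."""
    def __init__(self, feat_ch: int = 128, n_heads: int = 8):
        super().__init__()
        self.reduce = nn.Conv2d(2 * feat_ch, feat_ch, 1)    # channel reduction after concat fusion
        layer = nn.TransformerDecoderLayer(d_model=feat_ch, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        x = self.reduce(fused.unsqueeze(0))                 # (1, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)                  # (1, H*W, C) tokens
        out = self.decoder(seq, seq)                        # decode conditioned on the fused tokens
        return out.transpose(1, 2).reshape(c, h, w)         # predicted feature map K2

encoder, head = EncodingModule(), PredictionHead()
optim = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

# training_frames: assumed iterable yielding (bev_t0, bev_t1, bev_t2) tensors of shape (1, 3, H, W).
for step, (bev0, bev1, bev2) in enumerate(training_frames):
    f0, f1, f2 = encoder(bev0)[0], encoder(bev1)[0], encoder(bev2)[0]   # shared-weight encoding
    matches = match_in_neighborhood(f0, f1)                             # temporal matching
    fused = fuse_matched_features(f0, f1, matches)                      # position-wise fusion
    k2 = head(fused)                                                    # predicted t2 feature map
    loss = mse_feature_loss(k2, f2)                                     # consistency with encoded F2
    optim.zero_grad(); loss.backward(); optim.step()
    if step >= 2_000_000:                                               # training cutoff, e.g. 2M steps
        break
# The shared-weight encoder is the output of training.
```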
在本公开实施例中，通过以上步骤实现了点云特征提取网络模型的自监督学习。与相关技术中利用图像光流进行自监督的方法不同，本公开实施例仅需使用点云这一个模态的数据，利用样本点云帧序列中局部特征关系进行自监督匹配训练，无需进行雷达、相机之间的标定，不仅减少了数据标注的成本，而且提升了训练得到的特征提取模型的性能。进一步，在通过上述步骤训练得到点云特征提取网络模型之后，可将点云特征提取网络模型作为骨干网络应用在具体的视觉任务上，如三维目标检测，点云语义分割等，从而能够提高目标检测结果或点云语义分割结果的准确性。In the embodiments of the present disclosure, self-supervised learning of the point cloud feature extraction network model is achieved through the above steps. Unlike methods in the related art that use image optical flow for self-supervision, the embodiments of the present disclosure only need data of a single modality, namely the point cloud, and perform self-supervised matching training using local feature relationships within the sample point cloud frame sequence, without requiring calibration between the lidar and the camera. This not only reduces the cost of data annotation but also improves the performance of the trained feature extraction model. Furthermore, after the point cloud feature extraction network model is obtained through the above training steps, it can be used as a backbone network for specific vision tasks, such as three-dimensional target detection and point cloud semantic segmentation, thereby improving the accuracy of the target detection results or the point cloud semantic segmentation results.
图5为根据本公开一些实施例的点云特征提取方法的流程示意图。如图5所示,本公开一些实施例的点云特征提取方法包括:Figure 5 is a schematic flowchart of a point cloud feature extraction method according to some embodiments of the present disclosure. As shown in Figure 5, the point cloud feature extraction method of some embodiments of the present disclosure includes:
步骤S510:获取待处理点云帧序列。Step S510: Obtain the point cloud frame sequence to be processed.
在一些实施例中,点云特征提取方法由点云特征提取装置执行。In some embodiments, the point cloud feature extraction method is executed by a point cloud feature extraction device.
在一些实施例中，步骤S510包括：获取多帧待处理点云的原始特征数据；将多帧待处理点云的原始特征数据转换为BEV特征数据，以得到由多帧待处理点云的鸟瞰图特征数据组成的待处理点云帧序列。In some embodiments, step S510 includes: acquiring original feature data of multiple frames of point clouds to be processed; and converting the original feature data of the multiple frames of point clouds to be processed into BEV feature data, so as to obtain a to-be-processed point cloud frame sequence composed of the bird's-eye-view feature data of the multiple frames of point clouds to be processed.
步骤S520:基于训练得到的第一特征提取网络模型,对待处理点云帧序列进行编码,以得到待处理点云帧序列的编码特征图。Step S520: Based on the first feature extraction network model obtained by training, encode the point cloud frame sequence to be processed to obtain the encoded feature map of the point cloud frame sequence to be processed.
在本公开实施例中，通过以上步骤能够提取更为丰富的点云特征，进而，在进行目标检测或点云语义分割任务时，能够提高目标检测结果的准确性或者点云语义分割结果的准确性。In the embodiments of the present disclosure, richer point cloud features can be extracted through the above steps, so that when performing target detection or point cloud semantic segmentation tasks, the accuracy of the target detection results or of the point cloud semantic segmentation results can be improved.
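At inference time, feature extraction with the trained encoder reduces to one forward pass per frame, as in the brief sketch below, which reuses the illustrative EncodingModule and point_cloud_to_bev helpers from the earlier sketches; the input list and shapes are assumptions.

```python
import torch

# raw_point_clouds: assumed list of (N_i, 4) numpy arrays for the frames to process.
encoder = EncodingModule()          # in practice, load the trained shared-weight encoder here
encoder.eval()
with torch.no_grad():
    bev_frames = [torch.from_numpy(point_cloud_to_bev(pc)).unsqueeze(0)   # (1, 3, H, W)
                  for pc in raw_point_clouds]
    feature_maps = [encoder(bev) for bev in bev_frames]                   # encoded feature maps
# feature_maps can now feed a downstream 3D detection or semantic segmentation head.
```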
图6为根据本公开一些实施例的点云特征提取网络模型训练装置的结构示意图。如图6所示,本公开一些实施例中的点云特征提取网络模型训练装置600包括:特征提取模块610、预测模块620、确定模块630、训练模块640。Figure 6 is a schematic structural diagram of a point cloud feature extraction network model training device according to some embodiments of the present disclosure. As shown in Figure 6, the point cloud feature extraction network model training device 600 in some embodiments of the present disclosure includes: a feature extraction module 610, a prediction module 620, a determination module 630, and a training module 640.
特征提取模块610,被配置为利用第一特征提取网络模型,对样本点云帧序列进行第一编码,以得到样本点云帧序列中每一帧样本点云的编码特征图。 The feature extraction module 610 is configured to use the first feature extraction network model to perform first encoding on the sample point cloud frame sequence to obtain a coded feature map of each frame of the sample point cloud in the sample point cloud frame sequence.
预测模块620,被配置为根据相邻多帧样本点云的编码特征图,确定位于相邻多帧样本点云之后的下一帧样本点云的预测特征图。The prediction module 620 is configured to determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud based on the encoding feature map of the adjacent multi-frame sample point cloud.
确定模块630,被配置为根据位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图和其编码特征图,确定损失函数值。The determination module 630 is configured to determine the loss function value based on the prediction feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud and its encoding feature map.
训练模块640,被配置为根据损失函数值,对第一特征提取网络模型进行训练。The training module 640 is configured to train the first feature extraction network model according to the loss function value.
在本公开实施例中,通过以上装置能够改善点云特征提取效果,进而有助于提高目标检测或点云语义分割结果的准确性。In embodiments of the present disclosure, the above device can improve the point cloud feature extraction effect, thereby helping to improve the accuracy of target detection or point cloud semantic segmentation results.
图7为根据本公开一些实施例的点云特征提取装置的结构示意图。如图7所示,本公开一些实施例中的点云特征提取装置700包括:获取模块710、特征提取模块720。Figure 7 is a schematic structural diagram of a point cloud feature extraction device according to some embodiments of the present disclosure. As shown in Figure 7, the point cloud feature extraction device 700 in some embodiments of the present disclosure includes: an acquisition module 710 and a feature extraction module 720.
获取模块710,被配置为获取待处理点云帧序列。The acquisition module 710 is configured to acquire a sequence of point cloud frames to be processed.
在一些实施例中，获取模块710被配置为：获取多帧待处理点云的原始特征数据；将多帧待处理点云的原始特征数据转换为BEV特征数据，以得到由多帧待处理点云的鸟瞰图特征数据组成的待处理点云帧序列。In some embodiments, the acquisition module 710 is configured to: acquire original feature data of multiple frames of point clouds to be processed; and convert the original feature data of the multiple frames of point clouds to be processed into BEV feature data, so as to obtain a to-be-processed point cloud frame sequence composed of the bird's-eye-view feature data of the multiple frames of point clouds to be processed.
特征提取模块720,被配置为基于训练得到的第一特征提取网络模型,对待处理点云帧序列进行编码,以得到待处理点云帧序列的编码特征图。The feature extraction module 720 is configured to encode the point cloud frame sequence to be processed based on the first feature extraction network model obtained by training, so as to obtain the encoded feature map of the point cloud frame sequence to be processed.
在本公开实施例中，通过以上步骤能够提取更为丰富的点云特征，进而，在进行目标检测或点云语义分割任务时，能够提高目标检测结果的准确性或者点云语义分割结果的准确性。In the embodiments of the present disclosure, richer point cloud features can be extracted through the above modules, so that when performing target detection or point cloud semantic segmentation tasks, the accuracy of the target detection results or of the point cloud semantic segmentation results can be improved.
根据本公开的一些实施例，还提出一种目标检测装置，被配置为根据本公开任一实施例的点云特征提取方法提取待处理点云帧序列的特征图；根据待处理点云帧序列的特征图，进行目标检测。According to some embodiments of the present disclosure, a target detection apparatus is also provided, configured to extract a feature map of a point cloud frame sequence to be processed according to the point cloud feature extraction method of any embodiment of the present disclosure, and to perform target detection based on the feature map of the point cloud frame sequence to be processed.
根据本公开的一些实施例，还提出一种点云语义分割装置，被配置为根据本公开任一实施例的点云特征提取方法提取待处理点云帧序列的特征图；根据待处理点云帧序列的特征图，进行点云语义分割。According to some embodiments of the present disclosure, a point cloud semantic segmentation apparatus is also provided, configured to extract a feature map of a point cloud frame sequence to be processed according to the point cloud feature extraction method of any embodiment of the present disclosure, and to perform point cloud semantic segmentation based on the feature map of the point cloud frame sequence to be processed.
图8为根据本公开一些实施例的点云特征提取网络模型训练装置或点云特征提取装置或目标检测装置或点云语义分割装置的结构示意图。Figure 8 is a schematic structural diagram of a point cloud feature extraction network model training device, a point cloud feature extraction device, a target detection device, or a point cloud semantic segmentation device according to some embodiments of the present disclosure.
如图8所示，点云特征提取网络模型训练装置或点云特征提取装置或目标检测装置或点云语义分割装置800包括存储器810；以及耦接至该存储器810的处理器820。存储器810用于存储执行点云特征提取网络模型训练方法或点云特征提取方法或目标检测方法或点云语义分割方法对应实施例的指令。处理器820被配置为基于存储在存储器810中的指令，执行本公开中任意一些实施例中的点云特征提取网络模型训练方法或点云特征提取方法或目标检测方法或点云语义分割方法。As shown in Figure 8, the point cloud feature extraction network model training device, point cloud feature extraction device, target detection device, or point cloud semantic segmentation device 800 includes a memory 810 and a processor 820 coupled to the memory 810. The memory 810 is configured to store instructions for executing corresponding embodiments of the point cloud feature extraction network model training method, point cloud feature extraction method, target detection method, or point cloud semantic segmentation method. The processor 820 is configured to, based on the instructions stored in the memory 810, execute the point cloud feature extraction network model training method, point cloud feature extraction method, target detection method, or point cloud semantic segmentation method in any of the embodiments of the present disclosure.
图9为根据本公开一些实施例的计算机系统的结构示意图。Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure.
如图9所示,计算机系统900可以通用计算设备的形式表现。计算机系统900包括存储器910、处理器920和连接不同系统组件的总线930。As shown in Figure 9, computer system 900 may be embodied in the form of a general purpose computing device. Computer system 900 includes memory 910, a processor 920, and a bus 930 that connects various system components.
存储器910例如可以包括系统存储器、非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序(Boot Loader)以及其他程序等。系统存储器可以包括易失性存储介质,例如随机存取存储器(RAM)和/或高速缓存存储器。非易失性存储介质例如存储有执行中的至少一种点云特征提取网络模型训练方法或点云特征提取方法或目标检测方法或点云语义分割方法的对应实施例的指令。非易失性存储介质包括但不限于磁盘存储器、光学存储器、闪存等。Memory 910 may include, for example, system memory, non-volatile storage media, and the like. System memory stores, for example, operating systems, applications, boot loaders, and other programs. System memory may include volatile storage media such as random access memory (RAM) and/or cache memory. The non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one point cloud feature extraction network model training method, point cloud feature extraction method, target detection method, or point cloud semantic segmentation method. Non-volatile storage media include but are not limited to disk storage, optical storage, flash memory, etc.
处理器920可以用通用处理器、数字信号处理器（DSP）、应用专用集成电路（ASIC）、现场可编程门阵列（FPGA）或其它可编程逻辑设备、分立门或晶体管等分立硬件组件方式来实现。相应地，诸如特征提取模块、预测模块等的每个模块，可以通过中央处理器（CPU）运行存储器中执行相应步骤的指令来实现，也可以通过执行相应步骤的专用电路来实现。The processor 920 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Accordingly, each module, such as the feature extraction module and the prediction module, may be implemented by a central processing unit (CPU) running instructions in the memory that perform the corresponding steps, or by a dedicated circuit that performs the corresponding steps.
总线930可以使用多种总线结构中的任意总线结构。例如,总线结构包括但不限于工业标准体系结构(ISA)总线、微通道体系结构(MCA)总线、外围组件互连(PCI)总线。Bus 930 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
计算机系统900这些接口940、950、960以及存储器910和处理器920之间可以通过总线930连接。输入输出接口940可以为显示器、鼠标、键盘等输入输出设备提供连接接口。网络接口950为各种联网设备提供连接接口。存储接口960为软盘、U盘、SD卡等外部存储设备提供连接接口。The interfaces 940, 950, 960, the memory 910 and the processor 920 of the computer system 900 may be connected through a bus 930. The input and output interface 940 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard. The network interface 950 provides a connection interface for various networked devices. The storage interface 960 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
图10为根据本公开一些实施例的无人车的结构示意图;图11为根据本公开一些实施例的无人车的立体图。以下结合图10和图11对本公开实施例提供的无人车进行说明。Figure 10 is a schematic structural diagram of an unmanned vehicle according to some embodiments of the present disclosure; Figure 11 is a perspective view of an unmanned vehicle according to some embodiments of the present disclosure. The unmanned vehicle provided by the embodiment of the present disclosure will be described below with reference to FIG. 10 and FIG. 11 .
如附图10所示,无人车包括底盘模块1010、自动驾驶模块1020、货箱模块1030和远程监控推流模块1040四部分。As shown in Figure 10, the unmanned vehicle includes four parts: a chassis module 1010, an autonomous driving module 1020, a cargo box module 1030, and a remote monitoring flow module 1040.
在一些实施例中，底盘模块1010主要包括电池、电源管理装置、底盘控制器、电机驱动器、动力电机。电池为整个无人车系统提供电源，电源管理装置将电池输出转换为可供各功能模块使用的不同电平电压，并控制上下电。底盘控制器接受自动驾驶模块下发的运动指令，控制无人车转向、前进、后退、刹车等。In some embodiments, the chassis module 1010 mainly includes a battery, a power management device, a chassis controller, a motor driver, and a power motor. The battery supplies power to the entire unmanned vehicle system; the power management device converts the battery output into voltages of different levels available to the functional modules and controls power-on and power-off. The chassis controller receives motion commands issued by the autonomous driving module and controls the steering, forward motion, reversing, braking, and so on of the unmanned vehicle.
在一些实施例中,自动驾驶模块1020包括核心处理单元(Orin或Xavier模组)、红绿灯识别相机、前后左右环视相机、多线激光雷达、定位模块(如北斗、GPS等)、惯性导航单元。相机与自动驾驶模块之间可进行通信,为了提高传输速度、减少线束,可采用GMSL链路通信。In some embodiments, the autonomous driving module 1020 includes a core processing unit (Orin or Xavier module), traffic light recognition camera, front, rear, left and right surround cameras, multi-line lidar, positioning module (such as Beidou, GPS, etc.), and inertial navigation unit. The camera and the autonomous driving module can communicate. In order to increase the transmission speed and reduce the wiring harness, GMSL link communication can be used.
在一些实施例中,自动驾驶模块1020包括上述实施例中的点云特征提取网络模型训练装置或点云特征提取装置或目标检测装置或点云语义分割装置。In some embodiments, the autonomous driving module 1020 includes the point cloud feature extraction network model training device or point cloud feature extraction device or target detection device or point cloud semantic segmentation device in the above embodiments.
在一些实施例中，远程监控推流模块1030由前监控相机、后监控相机、左监控相机、右监控相机和推流模块构成，该模块将监控相机采集的视频数据传输到后台服务器，供后台操作人员查看。无线通讯模块通过天线与后台服务器进行通信，可实现后台操作人员对无人车的远程控制。In some embodiments, the remote monitoring and streaming module 1030 is composed of a front surveillance camera, a rear surveillance camera, a left surveillance camera, a right surveillance camera, and a streaming module; this module transmits the video data collected by the surveillance cameras to the back-end server for viewing by back-end operators. The wireless communication module communicates with the back-end server through an antenna, enabling back-end operators to remotely control the unmanned vehicle.
货箱模块1040为无人车的货物承载装置。在一些实施例中,货箱模块1040上还设置有显示交互模块,显示交互模块用于无人车与用户交互,用户可通过显示交互模块进行如取件、寄存、购买货物等操作。货箱的类型可根据实际需求进行更换,如在物流场景中,货箱可以包括多个不同大小的子箱体,子箱体可用于装载货物进行配送。在零售场景中,货箱可以设置成透明箱体,以便于用户直观看到待售产品。The cargo box module 1040 is the cargo carrying device of the unmanned vehicle. In some embodiments, the cargo box module 1040 is also provided with a display interaction module. The display interaction module is used for the unmanned vehicle to interact with the user. The user can perform operations such as picking up, depositing, and purchasing goods through the display interaction module. The type of cargo box can be changed according to actual needs. For example, in a logistics scenario, a cargo box can include multiple sub-boxes of different sizes, and the sub-boxes can be used to load goods for distribution. In a retail scenario, the cargo box can be set up as a transparent box so that users can intuitively see the products for sale.
本公开实施例的无人车,能够提高点云特征提取能力,进而有助于提高点云语义分割结果的准确性或目标检测结果的准确性,进而提高无人驾驶的安全性。The unmanned vehicle in the embodiment of the present disclosure can improve the point cloud feature extraction capability, thereby helping to improve the accuracy of the point cloud semantic segmentation results or the accuracy of the target detection results, thereby improving the safety of unmanned driving.
这里,参照根据本公开实施例的方法、装置和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个框以及各框的组合,都可以由计算机可读程序指令实现。Various aspects of the disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks, can be implemented by computer-readable program instructions.
这些计算机可读程序指令可提供到通用计算机、专用计算机或其他可编程装置的处理器，以产生一个机器，使得通过处理器执行指令产生实现在流程图和/或框图中一个或多个框中指定的功能的装置。These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable apparatus to produce a machine, such that execution of the instructions by the processor produces an apparatus that implements the functions specified in one or more blocks of the flowcharts and/or block diagrams.
这些计算机可读程序指令也可存储在计算机可读存储器中，这些指令使得计算机以特定方式工作，从而产生一个制造品，包括实现在流程图和/或框图中一个或多个框中指定的功能的指令。These computer-readable program instructions may also be stored in a computer-readable memory; these instructions cause the computer to operate in a specific manner, thereby producing an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。 The disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.
通过上述实施例中的点云特征提取网络模型训练、点云特征提取方法、装置和无人车，只需使用一种模态的数据，且无需进行雷达、相机之间的标定，即可实现点云特征提取网络模型的自监督学习，不仅减少了数据标注的成本，而且提升了训练得到的特征提取模型的性能。With the point cloud feature extraction network model training, point cloud feature extraction method, apparatus, and unmanned vehicle of the above embodiments, self-supervised learning of the point cloud feature extraction network model can be achieved using data of only one modality and without calibration between the lidar and the camera, which not only reduces the cost of data annotation but also improves the performance of the trained feature extraction model.
至此,已经详细描述了根据本公开的点云特征提取网络模型训练、点云特征提取方法、装置和无人车。为了避免遮蔽本公开的构思,没有描述本领域所公知的一些细节。本领域技术人员根据上面的描述,完全可以明白如何实施这里公开的技术方案。 So far, the point cloud feature extraction network model training, point cloud feature extraction method, device and unmanned vehicle according to the present disclosure have been described in detail. To avoid obscuring the concepts of the present disclosure, some details that are well known in the art have not been described. Based on the above description, those skilled in the art can completely understand how to implement the technical solution disclosed here.

Claims (23)

  1. 一种点云特征提取网络模型的训练方法,包括:A training method for point cloud feature extraction network model, including:
    利用第一特征提取网络模型,对样本点云帧序列进行第一编码,以得到所述样本点云帧序列中每一帧样本点云的编码特征图;Using a first feature extraction network model, perform a first encoding on the sample point cloud frame sequence to obtain a coded feature map of each frame of sample point cloud in the sample point cloud frame sequence;
    根据相邻多帧样本点云的编码特征图,确定位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图;According to the coding feature map of the adjacent multi-frame sample point cloud, determine the predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud;
    根据位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图和其编码特征图,确定损失函数值;Determine the loss function value according to the predicted feature map and its encoding feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud;
    根据所述损失函数值,对所述第一特征提取网络模型进行训练。The first feature extraction network model is trained according to the loss function value.
  2. 根据权利要求1所述的点云特征提取网络模型训练方法，其中，根据相邻多帧样本点云的编码特征图，确定位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图包括：The point cloud feature extraction network model training method according to claim 1, wherein determining, according to the encoded feature maps of the adjacent multiple frames of sample point clouds, the predicted feature map of the next frame of sample point cloud located after the adjacent multiple frames of sample point clouds comprises:
    利用第二特征提取网络模型,对所述相邻多帧样本点云的编码特征图分别进行第二编码,以得到所述相邻多帧样本点云的中间特征图;Using a second feature extraction network model, perform second encoding on the coded feature maps of the adjacent multi-frame sample point clouds, respectively, to obtain intermediate feature maps of the adjacent multi-frame sample point clouds;
    对所述相邻多帧样本点云的中间特征图进行融合,以得到融合特征图;Fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain a fused feature map;
    对所述融合特征图进行解码,以得到位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图。The fused feature map is decoded to obtain a predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.
  3. 根据权利要求2所述的点云特征提取网络模型训练方法,其中,所述对所述相邻多帧样本点云的中间特征图进行融合,以得到融合特征图包括:The point cloud feature extraction network model training method according to claim 2, wherein said fusing the intermediate feature maps of the adjacent multi-frame sample point clouds to obtain the fused feature map includes:
    根据所述相邻多帧样本点云的中间特征图,确定所述相邻多帧样本点云之间的特征点匹配关系;Determine the feature point matching relationship between the adjacent multi-frame sample point clouds according to the intermediate feature map of the adjacent multi-frame sample point cloud;
    根据所述特征点匹配关系,对所述相邻多帧样本点云的中间特征图进行融合,以得到融合特征图。According to the feature point matching relationship, the intermediate feature maps of the adjacent multi-frame sample point clouds are fused to obtain a fused feature map.
  4. 根据权利要求1至3任一所述的点云特征提取网络模型训练方法，其中，根据相邻多帧样本点云的编码特征图，确定位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图包括：The point cloud feature extraction network model training method according to any one of claims 1 to 3, wherein determining, according to the encoded feature maps of the adjacent multiple frames of sample point clouds, the predicted feature map of the next frame of sample point cloud located after the adjacent multiple frames of sample point clouds comprises:
    根据相邻多帧样本点云的编码特征图,确定所述相邻多帧样本点云之间的特征点匹配关系;Determine the feature point matching relationship between the adjacent multi-frame sample point clouds according to the encoded feature maps of the adjacent multi-frame sample point clouds;
    根据所述特征点匹配关系,对所述相邻多帧样本点云的编码特征图进行融合,以得到融合特征图;According to the matching relationship of the feature points, fuse the coded feature maps of the adjacent multi-frame sample point clouds to obtain a fused feature map;
    根据所述融合特征图,确定位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图。According to the fused feature map, a predicted feature map of a next frame of sample point cloud following the adjacent multiple frames of sample point cloud is determined.
  5. 根据权利要求3或4所述的点云特征提取网络模型训练方法，其中，所述根据所述相邻多帧样本点云的中间特征图，确定所述相邻多帧样本点云之间的特征点匹配关系包括：The point cloud feature extraction network model training method according to claim 3 or 4, wherein said determining the feature point matching relationship between the adjacent multiple frames of sample point clouds according to the intermediate feature maps of the adjacent multiple frames of sample point clouds comprises:
    根据所述相邻多帧样本点云的中间特征图,计算所述相邻多帧样本点云之间的特征点的相关度;According to the intermediate feature map of the adjacent multi-frame sample point cloud, calculate the correlation degree of the feature points between the adjacent multi-frame sample point cloud;
    根据所述相邻多帧样本点云之间的特征点的相关度,确定所述相邻两帧样本点云之间的特征点匹配关系。According to the correlation of the feature points between the adjacent multi-frame sample point clouds, the feature point matching relationship between the two adjacent frame sample point clouds is determined.
  6. 根据权利要求5所述的点云特征提取网络模型训练方法，其中，所述相邻多帧样本点云为相邻两帧样本点云，所述相邻多帧样本点云的中间特征图包括：与所述相邻两帧样本点云中的第一帧对应的第一中间特征图、以及与所述相邻两帧样本点云中的第二帧对应的第二中间特征图；以及，The point cloud feature extraction network model training method according to claim 5, wherein the adjacent multiple frames of sample point clouds are two adjacent frames of sample point clouds, and the intermediate feature maps of the adjacent multiple frames of sample point clouds include: a first intermediate feature map corresponding to the first frame of the two adjacent frames of sample point clouds, and a second intermediate feature map corresponding to the second frame of the two adjacent frames of sample point clouds; and,
    所述根据所述相邻多帧样本点云的中间特征图,计算所述相邻多帧样本点云之间的特征点的相关度包括:Calculating the correlation of feature points between adjacent multi-frame sample point clouds based on the intermediate feature maps of the adjacent multi-frame sample point clouds includes:
    计算所述第一中间特征图上的每个特征点，与所述第二中间特征图上指定范围内的特征点的相关度，所述指定范围为第一中间特征图的特征点的邻域范围；calculating a correlation between each feature point on the first intermediate feature map and feature points within a specified range on the second intermediate feature map, the specified range being the neighborhood range of the feature point of the first intermediate feature map;
    根据所述相关度,确定所述相邻两帧样本点云之间的特征点匹配关系。According to the correlation degree, the feature point matching relationship between the sample point clouds of the two adjacent frames is determined.
  7. 根据权利要求3至6任一所述的点云特征提取网络模型训练方法,其中,所述相邻多帧样本点云为相邻两帧样本点云;The point cloud feature extraction network model training method according to any one of claims 3 to 6, wherein the adjacent multi-frame sample point clouds are two adjacent frame sample point clouds;
    所述根据所述特征点匹配关系,对所述相邻两帧样本点云的中间特征图进行融合,以得到融合特征图包括:The step of fusing the intermediate feature maps of the sample point clouds of the two adjacent frames according to the feature point matching relationship to obtain the fused feature map includes:
    根据所述特征点匹配关系，将所述相邻两帧样本点云的中间特征图之间的匹配特征点进行特征拼接，并将拼接得到的特征图作为融合特征图。according to the feature point matching relationship, performing feature concatenation on the matching feature points between the intermediate feature maps of the two adjacent frames of sample point clouds, and using the concatenated feature map as the fused feature map.
  8. 根据权利要求1至7任一所述的点云特征提取网络模型训练方法，其中，所述根据位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图和其编码特征图，确定损失函数值包括：The point cloud feature extraction network model training method according to any one of claims 1 to 7, wherein said determining the loss function value according to the predicted feature map and the encoded feature map of the next frame of sample point cloud located after the adjacent multiple frames of sample point clouds comprises:
    在所述下一帧样本点云的预测特征图和编码特征图之间,计算具有同一位置索引的特征点之间的欧式距离;Between the predicted feature map and the encoded feature map of the next frame sample point cloud, calculate the Euclidean distance between feature points with the same position index;
    根据所有位置索引的特征点之间的欧式距离,计算损失函数值。The loss function value is calculated based on the Euclidean distance between the feature points of all position indexes.
  9. 根据权利要求1至8任一所述的点云特征提取网络模型训练方法，其中，所述第一特征提取网络模型为共享权值编码器，所述共享权值编码器包括多个编码模块，每个编码模块用于对所述样本点云帧序列中的一帧进行编码。The point cloud feature extraction network model training method according to any one of claims 1 to 8, wherein the first feature extraction network model is a shared-weight encoder, and the shared-weight encoder includes a plurality of encoding modules, each of which is used to encode one frame in the sample point cloud frame sequence.
  10. 根据权利要求9所述的点云特征提取网络模型训练方法,其中,所述编码模块包括:卷积神经网络以及自注意力网络。The point cloud feature extraction network model training method according to claim 9, wherein the encoding module includes: a convolutional neural network and a self-attention network.
  11. 根据权利要求1至10任一所述的点云特征提取网络模型训练方法,还包括:The point cloud feature extraction network model training method according to any one of claims 1 to 10, further comprising:
    将多帧样本点云的原始特征数据转换成二维图像特征数据,以得到由多帧样本点云的二维图像特征数据构成的样本点云帧序列。Convert the original feature data of the multi-frame sample point cloud into two-dimensional image feature data to obtain a sample point cloud frame sequence composed of the two-dimensional image feature data of the multi-frame sample point cloud.
  12. 根据权利要求11所述的点云特征提取网络模型训练方法,其中,将多帧样本点云的原始特征数据转换成二维图像特征数据包括:The point cloud feature extraction network model training method according to claim 11, wherein converting the original feature data of the multi-frame sample point cloud into two-dimensional image feature data includes:
    将多帧样本点云的原始特征数据转换成鸟瞰图BEV特征数据。Convert the original feature data of multi-frame sample point clouds into bird's-eye view BEV feature data.
  13. 根据权利要求1至12任一所述的点云特征提取网络模型训练方法,其中:The point cloud feature extraction network model training method according to any one of claims 1 to 12, wherein:
    所述样本点云帧序列由在时序上连续的多帧样本点云组成;和/或,The sample point cloud frame sequence consists of multiple frames of sample point clouds that are continuous in time series; and/or,
    所述样本点云帧序列包含的样本点云的帧数量大于等于3、且小于等于5。The number of sample point cloud frames included in the sample point cloud frame sequence is greater than or equal to 3 and less than or equal to 5.
  14. 根据权利要求2至13任一所述的点云特征提取网络模型训练方法,其中,第二特征提取网络模型包括: The point cloud feature extraction network model training method according to any one of claims 2 to 13, wherein the second feature extraction network model includes:
    注意力编码模块,用于对所述相邻多帧样本点云的编码特征图分别进行第二编码;An attention coding module, configured to perform second coding on the coding feature maps of the adjacent multi-frame sample point clouds respectively;
    注意力解码模块,用于对所述融合特征图进行解码,以得到位于所述相邻多帧样本点云之后的下一帧样本点云的预测特征图。An attention decoding module is used to decode the fused feature map to obtain a predicted feature map of the next frame sample point cloud located after the adjacent multi-frame sample point cloud.
  15. 一种点云特征提取方法,包括:A point cloud feature extraction method, including:
    获取待处理点云帧序列;Get the point cloud frame sequence to be processed;
    基于利用权利要求1-14任一项所述的点云特征提取网络模型训练方法训练得到的第一特征提取网络模型,对所述待处理点云帧序列进行编码,以得到所述待处理点云帧序列的特征图。Based on the first feature extraction network model trained by the point cloud feature extraction network model training method according to any one of claims 1 to 14, the point cloud frame sequence to be processed is encoded to obtain a feature map of the point cloud frame sequence to be processed.
  16. 根据权利要求15所述的点云特征提取方法,其中,获取待处理点云帧序列包括:The point cloud feature extraction method according to claim 15, wherein obtaining the point cloud frame sequence to be processed includes:
    获取多帧待处理点云的原始特征数据;Obtain the original feature data of multi-frame point clouds to be processed;
    将所述多帧待处理点云的原始特征数据转换为鸟瞰图BEV特征数据,以得到由多帧待处理点云的鸟瞰图特征数据组成的待处理点云帧序列。The original feature data of the multi-frame point cloud to be processed is converted into bird's-eye view BEV feature data to obtain a sequence of point cloud frames to be processed consisting of the bird's-eye view feature data of the multi-frame point cloud to be processed.
  17. 一种目标检测方法,包括:A target detection method including:
    根据权利要求15或16所述的点云特征提取方法提取待处理点云帧序列的特征图;Extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method according to claim 15 or 16;
    根据所述待处理点云帧序列的特征图,进行目标检测。Target detection is performed based on the feature map of the point cloud frame sequence to be processed.
  18. 一种点云语义分割方法,包括:A point cloud semantic segmentation method, including:
    根据权利要求15或16所述的点云特征提取方法提取待处理点云帧序列的特征图;Extract the feature map of the point cloud frame sequence to be processed according to the point cloud feature extraction method according to claim 15 or 16;
    根据所述待处理点云帧序列的特征图,进行点云语义分割。According to the feature map of the point cloud frame sequence to be processed, point cloud semantic segmentation is performed.
  19. 一种装置,包括:A device including:
    用于执行权利要求1-14任一所述的点云特征提取网络模型训练方法的模块，或者，用于执行权利要求15或16所述的点云特征提取方法的模块，或者用于执行权利要求17所述的目标检测方法的模块，或者，用于执行权利要求18所述的点云语义分割方法的模块。a module for executing the point cloud feature extraction network model training method according to any one of claims 1-14, or a module for executing the point cloud feature extraction method according to claim 15 or 16, or a module for executing the target detection method according to claim 17, or a module for executing the point cloud semantic segmentation method according to claim 18.
  20. 一种电子设备,包括:An electronic device including:
    存储器;以及memory; and
    耦接至所述存储器的处理器，所述处理器被配置为基于存储在所述存储器的指令执行权利要求1至14任一项所述的点云特征提取网络模型训练方法，或权利要求15或16所述的点云特征提取方法，或权利要求17所述的目标检测方法，或权利要求18所述的点云语义分割方法。a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the point cloud feature extraction network model training method according to any one of claims 1 to 14, or the point cloud feature extraction method according to claim 15 or 16, or the target detection method according to claim 17, or the point cloud semantic segmentation method according to claim 18.
  21. 一种计算机可读存储介质，其上存储有计算机程序指令，该指令被处理器执行时实现权利要求1至14任一项所述的点云特征提取网络模型训练方法，或权利要求15或16所述的点云特征提取方法，或权利要求17所述的目标检测方法，或权利要求18所述的点云语义分割方法。A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the point cloud feature extraction network model training method according to any one of claims 1 to 14, or the point cloud feature extraction method according to claim 15 or 16, or the target detection method according to claim 17, or the point cloud semantic segmentation method according to claim 18.
  22. 一种无人车,包括:An unmanned vehicle, including:
    如权利要求19所述的装置,或者权利要求20所述的电子设备。A device as claimed in claim 19, or an electronic device as claimed in claim 20.
  23. 一种计算机程序,包括:A computer program consisting of:
    指令，所述指令当由处理器执行时使所述处理器执行根据权利要求1至14任一所述的点云特征提取网络模型训练方法，或权利要求15或16所述的点云特征提取方法，或权利要求17所述的目标检测方法，或权利要求18所述的点云语义分割方法。instructions which, when executed by a processor, cause the processor to execute the point cloud feature extraction network model training method according to any one of claims 1 to 14, or the point cloud feature extraction method according to claim 15 or 16, or the target detection method according to claim 17, or the point cloud semantic segmentation method according to claim 18.
PCT/CN2023/082809 2022-09-14 2023-03-21 Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle WO2024055551A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211115103.3 2022-09-14
CN202211115103.3A CN115482391A (en) 2022-09-14 2022-09-14 Point cloud feature extraction network model training method, point cloud feature extraction device and unmanned vehicle

Publications (1)

Publication Number Publication Date
WO2024055551A1 true WO2024055551A1 (en) 2024-03-21

Family

ID=84423833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/082809 WO2024055551A1 (en) 2022-09-14 2023-03-21 Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle

Country Status (2)

Country Link
CN (1) CN115482391A (en)
WO (1) WO2024055551A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115482391A (en) * 2022-09-14 2022-12-16 北京京东乾石科技有限公司 Point cloud feature extraction network model training method, point cloud feature extraction device and unmanned vehicle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022105502A1 (en) * 2020-11-23 2022-05-27 歌尔股份有限公司 Point cloud data processing method and apparatus
CN114898355A (en) * 2021-04-15 2022-08-12 北京轻舟智航智能技术有限公司 Method and system for self-supervised learning of body-to-body movements for autonomous driving
CN115482391A (en) * 2022-09-14 2022-12-16 北京京东乾石科技有限公司 Point cloud feature extraction network model training method, point cloud feature extraction device and unmanned vehicle

Also Published As

Publication number Publication date
CN115482391A (en) 2022-12-16

Similar Documents

Publication Publication Date Title
Vu et al. Hybridnets: End-to-end perception network
Sirohi et al. Efficientlps: Efficient lidar panoptic segmentation
CN113128348B (en) Laser radar target detection method and system integrating semantic information
Zhang et al. Instance segmentation of lidar point clouds
CN110765922A (en) AGV is with two mesh vision object detection barrier systems
JP2021089724A (en) 3d auto-labeling with structural and physical constraints
Ghasemieh et al. 3D object detection for autonomous driving: Methods, models, sensors, data, and challenges
Ruf et al. Real-time on-board obstacle avoidance for UAVs based on embedded stereo vision
WO2024055551A1 (en) Point cloud feature extraction network model training method, point cloud feature extraction method, apparatus, and driverless vehicle
Deng et al. Simultaneous vehicle and lane detection via MobileNetV3 in car following scene
CN115879060B (en) Multi-mode-based automatic driving perception method, device, equipment and medium
US20230252796A1 (en) Self-supervised compositional feature representation for video understanding
WO2024001093A1 (en) Semantic segmentation method, environment perception method, apparatus, and unmanned vehicle
Shi et al. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review
CN115860102B (en) Pre-training method, device, equipment and medium for automatic driving perception model
CN114882457A (en) Model training method, lane line detection method and equipment
Petrovai et al. Semantic cameras for 360-degree environment perception in automated urban driving
Carranza-García et al. Object detection using depth completion and camera-LiDAR fusion for autonomous driving
Xu et al. Exploiting high-fidelity kinematic information from port surveillance videos via a YOLO-based framework
Luo et al. Dynamic multitarget detection algorithm of voxel point cloud fusion based on pointrcnn
Zhou et al. MotionBEV: Attention-Aware Online LiDAR Moving Object Segmentation With Bird's Eye View Based Appearance and Motion Features
US20220343096A1 (en) Learning monocular 3d object detection from 2d semantic keypoint detection
Zhang et al. Sst: Real-time end-to-end monocular 3d reconstruction via sparse spatial-temporal guidance
CN114332845A (en) 3D target detection method and device
CN117037141A (en) 3D target detection method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864294

Country of ref document: EP

Kind code of ref document: A1