CN115690708A - Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation - Google Patents
- Publication number: CN115690708A
- Application number: CN202211296868.1A
- Authority: CN (China)
- Prior art keywords: model, student, teacher, training, loss function
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
An embodiment of the invention relates to a method and a device for training a three-dimensional target detection model based on cross-modal knowledge distillation. The method comprises the following steps: obtaining a teacher model, a student model, and a student-model loss function; acquiring training data from a training data set; determining the feature loss, attention loss, and similarity loss functions for knowledge distillation from the teacher model to the student model, the student-model loss function, feature loss function, attention loss function, and similarity loss function together forming an overall loss function; performing model self-training on the student model; performing teacher-student feature imitation training; performing teacher-student attention imitation training; performing teacher-student similarity training; performing overall training; and, after training completes, selecting a new group of training data from the training data set for the next round of student-model training until a specified number of rounds is reached. The method and device can improve the detection accuracy of an image-based 3D target detection model.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for training a three-dimensional target detection model based on cross-modal knowledge distillation.
Background
The perception module of an automatic driving system detects targets from sensor data. Mainstream sensors currently include cameras, lidar, and radar, and common 3D target detection models fall into two main types: neural network models that take a point cloud as input, and neural network models that take an image as input. Each has its own advantages and disadvantages: 1) a point cloud carries accurate distance (depth) information, so a point-cloud-based 3D target detection model can achieve high detection precision on nearby targets; however, point clouds are sparse, so point-cloud-based models detect distant targets poorly; 2) visual information on an image is uniformly distributed with high information density, so an image-based 3D target detection model can achieve high recognition precision on both near and far targets; however, an image carries no depth information, so the model must estimate depth, and the resulting depth-estimation error is large, which has long limited the localization accuracy of the target identification boxes (bbox) output by image-based 3D target detection models.
Disclosure of Invention
To remedy the defects of the prior art, the invention aims to provide a method, a device, electronic equipment, and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation. A well-trained point-cloud-based 3D target detection model serves as the Teacher Model, and an image-based 3D target detection model serves as the Student Model; a Knowledge Distillation mechanism distills the teacher model's point-cloud BEV features into the student model, training the student model to learn depth features from image data that resemble those of point-cloud data. This helps the image-based 3D target detection model reduce depth-estimation error and thereby improves its detection accuracy.
In order to achieve the above object, a first aspect of the embodiments of the present invention provides a method for training a three-dimensional target detection model based on cross-modal knowledge distillation, the method including:
acquiring a well-trained point-cloud-based 3D target detection model as the corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as the corresponding student model; and obtaining the model loss function of the student model as the corresponding student-model loss function L_det;
Acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
determining the feature loss function L_fea, attention loss function L_att, and similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student-model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
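The composition of the overall loss is a plain sum of the four terms. A minimal sketch, assuming each term has already been evaluated to a scalar for the current batch:

```python
def overall_loss(l_det, l_fea, l_att, l_aff):
    """Combine the four scalar loss terms into the overall loss,
    L_all = L_det + L_fea + L_att + L_aff, as stated above."""
    return l_det + l_fea + l_att + l_aff
```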
performing model self-training on the student model according to the first image, the first identification frame set, and the student-model loss function L_det; after the model self-training completes, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the feature loss function L_fea; after the feature imitation training completes, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the attention imitation training completes, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the similarity training completes, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the overall loss function L_all;
and after the overall training completes, selecting a new group of original point clouds, original images, identification frame marking information, and false positive area marking information from the training data set for the next round of student-model training, until the total number of training rounds reaches the specified number.
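The round-level schedule described above can be sketched as follows. Here `dataset` (yielding point cloud, image, identification frames, and false-positive regions per round) and the `run_stage` callback are hypothetical stand-ins for illustration; each stage is assumed to run its own convergence loop internally:

```python
def train_student(dataset, num_rounds, run_stage):
    """Run the five training stages in fixed order on a fresh data
    group each round, as described in the method steps."""
    stages = ["self_training", "feature_imitation",
              "attention_imitation", "similarity", "overall"]
    log = []
    samples = iter(dataset)
    for _ in range(num_rounds):
        sample = next(samples)     # a new group of training data each round
        for stage in stages:       # the five stages run strictly in order
            run_stage(stage, sample)
            log.append(stage)
    return log
```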
Preferably, the teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder, and a first target detection head network; the output of the point cloud pillar feature network is connected to the input of the BEV pooling network; the output of the BEV pooling network is connected to the input of the first BEV encoder; and the output of the first BEV encoder is connected to the input of the first target detection head network;
the student model comprises an image encoder, an LSS view converter, a second BEV encoder, and a second target detection head network; the output of the image encoder is connected to the input of the LSS view converter; the output of the LSS view converter is connected to the input of the second BEV encoder; and the output of the second BEV encoder is connected to the input of the second target detection head network;
the first identification frame set comprises a plurality of three-dimensional first identification frames bbox_1; each first identification frame bbox_1 has shape H_bbox1 × W_bbox1 × Z_bbox1, where H_bbox1, W_bbox1, and Z_bbox1 are the depth, width, and height of the first identification frame bbox_1;
the first false positive area set comprises a plurality of first false positive areas FP.
Preferably, the determination of the feature loss function L_fea, attention loss function L_att, and similarity loss function L_aff for knowledge distillation from the teacher model to the student model specifically comprises the following steps:
Step 31, determining the feature loss function L_fea as the weighted combination of a foreground feature loss branch, a false-positive feature loss branch, and a background feature loss branch, where:
α, β, and γ are the preset loss coefficients of the three branches;
F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, and C are the height, width, and channel dimensions of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H×W one-dimensional teacher feature channel vectors of shape 1×C, into C two-dimensional teacher sub-feature maps of shape H×W, or into C·H·W individual teacher feature data, 1 ≤ k ≤ C, 1 ≤ i ≤ H, 1 ≤ j ≤ W;
F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is the projection function from the student-model BEV space to the teacher-model BEV space; applying f_proj() to the student feature map F_S yields the corresponding projection feature map, whose height, width, and channel dimensions match those of the teacher feature map F_T; the projection feature map can likewise be decomposed into H×W one-dimensional projection feature channel vectors of shape 1×C, into C two-dimensional projection sub-feature maps of shape H×W, or into C·H·W individual projection feature data;
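The three equivalent decompositions of an H×W×C BEV feature map described above can be sketched in pure Python (nested lists stand in for a tensor; an illustrative helper, not the patent's implementation):

```python
def decompose(feature_map):
    """Decompose an H×W×C feature map into the three views described
    above: H·W channel vectors of length C, C sub-feature maps of
    shape H×W, and the C·H·W individual scalar feature data."""
    H, W, C = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    channel_vectors = [feature_map[i][j] for i in range(H) for j in range(W)]
    sub_maps = [[[feature_map[i][j][k] for j in range(W)] for i in range(H)]
                for k in range(C)]
    scalars = [feature_map[i][j][k]
               for k in range(C) for i in range(H) for j in range(W)]
    return channel_vectors, sub_maps, scalars
```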
M() is a foreground-background binary mask function;
N() is a false positive-background binary mask function;
S() is a size mask function;
A_s() is a channel-vector attention function;
A_c() is a sub-feature-map attention function;
the foreground feature loss branch, the false-positive feature loss branch, and the background feature loss branch of L_fea are each computed between the teacher feature map F_T and the student feature map F_S;
Step 32, determining the attention loss function L_att, where:
η is a preset attention-loss hyper-parameter;
L1 is the L1_Loss function;
Step 33, selecting a preset quantity Q² of pixel points from the teacher feature map F_T and the projection feature map to form corresponding matching point pairs, 1 ≤ i′ ≤ Q, 1 ≤ j′ ≤ Q; where:
each pixel point corresponds in the teacher feature map F_T to a row feature tensor of shape 1×W×C and a column feature tensor of shape H×1×C, and likewise corresponds in the projection feature map to a row feature tensor of shape 1×W×C and a column feature tensor of shape H×1×C;
Step 34, determining the similarity loss function L_aff from the preset quantity Q² of matching point pairs, where:
ζ is a preset similarity-loss hyper-parameter;
‖·‖_smoothl1 is the Smooth_L1_loss function;
A_ff() is the similarity function.
Further, the foreground-background binary mask function M(), the false positive-background binary mask function N(), and the size mask function S() are defined over the BEV grid, where:
H_bbox1 and W_bbox1 are the depth and width of the corresponding first identification frame bbox_1.
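The binary mask functions can be sketched as BEV-grid rasterizers. This is an illustrative assumption about their form: `boxes` and `fp_regions` are given here as (i0, j0, i1, j1) index ranges, which is not the patent's exact parameterization:

```python
def foreground_mask(H, W, boxes):
    """Sketch of M(): 1 inside any identification-frame footprint on
    the H×W BEV grid, 0 elsewhere."""
    mask = [[0] * W for _ in range(H)]
    for (i0, j0, i1, j1) in boxes:
        for i in range(max(0, i0), min(H, i1)):
            for j in range(max(0, j0), min(W, j1)):
                mask[i][j] = 1
    return mask

def background_mask(H, W, boxes, fp_regions):
    """Background = neither foreground nor false-positive (sketch);
    the false positive-background mask N() reuses the same rasterizer."""
    fg = foreground_mask(H, W, boxes)
    fp = foreground_mask(H, W, fp_regions)
    return [[1 - max(fg[i][j], fp[i][j]) for j in range(W)] for i in range(H)]
```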
The channel-vector attention function A_s() is defined as follows, where:
T is a preset distillation hyper-parameter;
softmax_s[] is the activation function of the channel-vector attention function;
F_1 is the input feature, comprising C feature components; when the input feature F_1 is a teacher feature channel vector, that teacher feature channel vector supplies the corresponding feature components; when the input feature F_1 is a projection feature channel vector, that projection feature channel vector supplies the corresponding feature components.
The sub-feature-map attention function A_c() is defined as follows, where:
softmax_c[] is the activation function of the sub-feature-map attention function;
F_2 is the input feature, comprising H×W feature components; when the input feature F_2 is a teacher sub-feature map, that teacher sub-feature map supplies the corresponding feature components; when the input feature F_2 is a projection sub-feature map, that projection sub-feature map supplies the corresponding feature components.
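Both attention functions are temperature-scaled softmaxes over feature statistics. A sketch follows; the per-component statistic (mean absolute value) is an assumption, since the patent's formula images are not reproduced in this text:

```python
import math

def softmax(xs, temperature):
    """Temperature-scaled softmax; T is the preset distillation
    hyper-parameter."""
    exps = [math.exp(x / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def channel_vector_attention(channel_vectors, T):
    """Sketch of A_s(): one weight per spatial position, from a softmax
    over the mean magnitude of each 1×C channel vector (assumed statistic)."""
    stats = [sum(abs(v) for v in vec) / len(vec) for vec in channel_vectors]
    return softmax(stats, T)

def sub_feature_map_attention(sub_maps, T):
    """Sketch of A_c(): one weight per channel, from a softmax over the
    mean magnitude of each H×W sub-feature map (same caveat)."""
    stats = [sum(abs(x) for row in m for x in row) / (len(m) * len(m[0]))
             for m in sub_maps]
    return softmax(stats, T)
```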
Further, the similarity function A_ff() takes input feature vectors D_i′ and D_j′; when the input feature vectors D_i′ and D_j′ are the row feature tensor and column feature tensor of a pixel point in the teacher feature map F_T, A_ff() yields the teacher-side similarity; when the input feature vectors D_i′ and D_j′ are the row feature tensor and column feature tensor of a pixel point in the projection feature map, A_ff() yields the projection-side similarity.
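A minimal sketch of a similarity function over two flattened feature tensors. Cosine similarity is used here as a stand-in, since the patent's exact form of A_ff() is not reproduced in this text:

```python
import math

def affinity(d_i, d_j):
    """Sketch of A_ff(): cosine similarity between two feature vectors
    (e.g. a pixel point's flattened row and column feature tensors)."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    ni = math.sqrt(sum(a * a for a in d_i))
    nj = math.sqrt(sum(b * b for b in d_j))
    return dot / (ni * nj)
```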
preferably, the method further comprises the step of calculating a loss function L according to the first image, the first recognition frame set and the student model det Carrying out model self-training on the student model, and specifically comprising:
step 61, inputting the first image into the student model for gradual operation, and taking a target identification frame set output by the second target detection head network of the student model as a corresponding second identification frame set in the operation process; the second recognition frame set comprises a plurality of three-dimensional second recognition frames bbox 2 (ii) a The second identification frame bbox 2 Is H in shape bbox2 *W bbox2 *Z bbox2 ,H bbox2 、W bbox2 、Z bbox2 Is the second identification frame bbox 2 Depth, width and height of;
step 62, substituting the first and second recognition box sets into the student model loss function L det Estimating a loss value to generate a corresponding first loss value;
step 63, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 65; if not, go to step 64;
step 64, substituting the model parameters of the student model into the student model loss function L det Forming a corresponding first objective function; and for minimizing said first objective functionSolving the model parameters; updating the model parameters of the student model according to the solving result; and returns to step 61 after updating;
and step 65, confirming that the model self-training is completed.
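Steps 61–65 (and the analogous steps in each later stage) share one loop shape: evaluate the loss, stop once it falls within the convergence range, otherwise update the parameters and retry. A sketch, where `loss_fn`, `update`, and `tol` are illustrative stand-ins for the loss estimation, parameter solve, and preset convergence range:

```python
def train_to_convergence(loss_fn, params, update, tol, max_iters=1000):
    """Evaluate the loss; if it meets the convergence range, stop;
    otherwise update the model parameters and repeat."""
    loss = loss_fn(params)
    for _ in range(max_iters):
        if loss <= tol:                  # loss meets the convergence range
            break
        params = update(params, loss)    # minimize the objective, then retry
        loss = loss_fn(params)
    return params, loss
```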
Preferably, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the feature loss function L_fea specifically comprises:
Step 71, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map;
Step 72, substituting the teacher feature map F_T, the projection feature map, the first identification frame set, and the first false positive area set into the feature loss function L_fea to estimate a loss value, generating a corresponding second loss value;
Step 73, checking whether the second loss value meets a preset second loss convergence range; if yes, going to step 75; if not, going to step 74;
Step 74, substituting the model parameters of the student model into the feature loss function L_fea to form a corresponding second objective function; solving for the model parameters that minimize the second objective function; updating the model parameters of the student model according to the solution; and returning to step 71 after the update;
Step 75, confirming that the teacher-student feature imitation training is complete.
Preferably, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att specifically comprises:
Step 81, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map;
Step 82, substituting the teacher feature map F_T and the projection feature map into the attention loss function L_att to estimate a loss value, generating a corresponding third loss value;
Step 83, checking whether the third loss value meets a preset third loss convergence range; if yes, going to step 85; if not, going to step 84;
Step 84, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters that minimize the third objective function; updating the model parameters of the student model according to the solution; and returning to step 81 after the update;
Step 85, confirming that the teacher-student attention imitation training is complete.
Preferably, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff specifically comprises:
Step 91, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map;
Step 92, substituting the teacher feature map F_T and the projection feature map into the similarity loss function L_aff to estimate a loss value, generating a corresponding fourth loss value;
Step 93, checking whether the fourth loss value meets a preset fourth loss convergence range; if yes, going to step 95; if not, going to step 94;
Step 94, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters that minimize the fourth objective function; updating the model parameters of the student model according to the solution; and returning to step 91 after the update;
Step 95, confirming that the teacher-student similarity training is complete.
Preferably, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the overall loss function L_all specifically comprises:
Step 101, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S, and taking the target identification frame set output by the second target detection head network of the student model as the corresponding third identification frame set; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map; the third identification frame set comprises a plurality of three-dimensional third identification frames bbox_3; each third identification frame bbox_3 has shape H_bbox3 × W_bbox3 × Z_bbox3, where H_bbox3, W_bbox3, and Z_bbox3 are the depth, width, and height of the third identification frame bbox_3;
Step 102, substituting the first and third identification frame sets, the teacher feature map F_T, the projection feature map, and the first false positive area set into the overall loss function L_all to estimate a loss value, generating a corresponding overall loss value;
Step 103, checking whether the overall loss value meets a preset overall loss convergence range; if yes, going to step 105; if not, going to step 104;
Step 104, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters that minimize the overall objective function; updating the model parameters of the student model according to the solution; and returning to step 101 after the update;
Step 105, confirming that the overall training is complete.
In a second aspect, an embodiment of the present invention provides an apparatus for implementing the method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to the first aspect, the apparatus comprising: an acquisition module, a training data processing module, a loss function processing module, and a training processing module;
the acquisition module is used for acquiring a well-trained point-cloud-based 3D target detection model as the corresponding teacher model and acquiring an image-based 3D target detection model to be trained as the corresponding student model, and for obtaining the model loss function of the student model as the corresponding student-model loss function L_det;
The training data processing module is used for acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
the loss function processing module is used for determining the feature loss function L_fea, attention loss function L_att, and similarity loss function L_aff for knowledge distillation from the teacher model to the student model, and for adding the student-model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
the training processing module is used for: performing model self-training on the student model according to the first image, the first identification frame set, and the student-model loss function L_det; after the model self-training completes, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the feature loss function L_fea; after the feature imitation training completes, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the attention imitation training completes, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the similarity training completes, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the overall loss function L_all; and, after the overall training completes, selecting a new group of original point clouds, original images, identification frame marking information, and false positive area marking information from the training data set for the next round of student-model training until the total number of training rounds reaches the specified number.
A third aspect of an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a transceiver;
the processor is configured to be coupled to the memory, read and execute instructions in the memory, so as to implement the method steps of the first aspect;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing computer instructions that, when executed by a computer, cause the computer to perform the method of the first aspect.
The embodiments of the invention provide a method, a device, electronic equipment, and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation: a well-trained point-cloud-based 3D target detection model serves as the teacher model, an image-based 3D target detection model serves as the student model, and a knowledge distillation mechanism distills the teacher model's point-cloud BEV (Bird's-Eye View) features into the student model, training the student model to learn depth features from image data that resemble those of point-cloud data. This helps the image-based 3D target detection model reduce depth-estimation error and improves its detection accuracy.
Drawings
Fig. 1 is a schematic diagram of a method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for training a three-dimensional object detection model based on cross-modal knowledge distillation according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for training a three-dimensional target detection model based on cross-modal knowledge distillation, as shown in fig. 1, which is a schematic diagram of the method for training the three-dimensional target detection model based on cross-modal knowledge distillation provided by the embodiment of the present invention, the method mainly includes the following steps:
step 1, acquiring a mature point cloud-based 3D target detection model to be trained as a corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as a corresponding student model; and obtaining a model loss function of the student model as a corresponding student model loss function L det ;
The teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder, and a first target detection head network; the output of the point cloud pillar feature network is connected to the input of the BEV pooling network; the output of the BEV pooling network is connected to the input of the first BEV encoder; and the output of the first BEV encoder is connected to the input of the first target detection head network.
The student model comprises an image encoder, an LSS view converter, a second BEV encoder, and a second target detection head network; the output of the image encoder is connected to the input of the LSS view converter; the output of the LSS view converter is connected to the input of the second BEV encoder; and the output of the second BEV encoder is connected to the input of the second target detection head network.
Here, the teacher model of the embodiment of the invention is a well-trained point cloud-based 3D target detection model; the point cloud pillar feature network of the teacher model is similar to the Pillar Feature Net of the PointPillars model; the BEV pooling network of the teacher model performs height pooling on the output of the point cloud pillar feature network to obtain a BEV feature map under the bird's-eye view (BEV); the first BEV encoder of the teacher model further encodes the BEV feature map to output a corresponding BEV heatmap feature map; the first target detection head network of the teacher model performs BEV target detection according to the BEV heatmap feature map to obtain a plurality of two-dimensional target recognition frames, and performs 3D shape regression calculation on each two-dimensional target recognition frame through an internal fully-connected network so as to output a plurality of three-dimensional target recognition frames;
the student model of the embodiment of the invention is an image-based 3D target detection model to be trained; the image encoder of the student model extracts the features of an input image; the LSS view transformer of the student model is similar to the view transformer in the technical paper "Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D", and can extract BEV features from image data and output a corresponding BEV feature map; the second BEV encoder of the student model performs further information encoding on the BEV feature map and outputs a corresponding BEV heatmap feature map; the second target detection head network of the student model performs BEV target detection according to the BEV heatmap feature map to obtain a plurality of two-dimensional target recognition frames, and performs 3D shape regression calculation on each two-dimensional target recognition frame through an internal fully-connected network so as to output a plurality of three-dimensional target recognition frames. It should be noted that other neural network structures capable of outputting the BEV feature map and the BEV heatmap feature map may also be used before the second target detection head network of the student model in the embodiment of the present invention; the model loss function of the student model of the embodiment of the invention is a known loss function and is recorded as the student model loss function L_det.
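The student pipeline described above (image encoder, LSS view transformer, second BEV encoder, second detection head) can be summarised as a shape-flow sketch. All modules below are stand-in functions with illustrative shapes and box fields, not the patent's concrete networks; only the data flow is shown:

```python
import numpy as np

# Shape-flow sketch of the student model: image -> image features ->
# BEV feature map -> BEV heatmap feature map -> 3D boxes.
# All shapes and field names are illustrative assumptions.

def image_encoder(img):                       # (H_img, W_img, 3) -> (h, w, C)
    return np.zeros((img.shape[0] // 8, img.shape[1] // 8, 64))

def lss_view_transformer(feat):               # image features -> BEV feature map
    return np.zeros((128, 128, 64))           # (H_bev, W_bev, C)

def bev_encoder(bev):                         # BEV features -> BEV heatmap feature map
    return np.zeros_like(bev)

def detection_head(heatmap):                  # heatmap -> list of 3D recognition frames
    return [dict(x=0.0, y=0.0, z=0.0, h=1.5, w=1.8, l=4.2, yaw=0.0)]

img = np.zeros((256, 704, 3))
boxes = detection_head(bev_encoder(lss_view_transformer(image_encoder(img))))
```

The teacher pipeline has the same BEV-encoder-to-detection-head tail; only the front end (pillar features plus BEV pooling instead of image encoder plus LSS transform) differs.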
Step 2, acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a corresponding first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
wherein the first identification frame set comprises a plurality of three-dimensional first identification frames bbox_1; each first identification frame bbox_1 has the shape H_bbox1 * W_bbox1 * Z_bbox1, where H_bbox1, W_bbox1 and Z_bbox1 are the depth, width and height of the first identification frame bbox_1;
the first set of false positive regions comprises a plurality of first false positive regions FP.
Here, the training data set of the present invention is used to hold a plurality of training data records; each training data record corresponds to a group consisting of an original point cloud, an original image, and the identification frame marking information and false positive area marking information of the original point cloud under the same scene; the original point cloud is basically consistent with the field of view of the original image and is generated at the same time; the identification frame marking information and the false positive area marking information of the original point cloud are manual marking information, and together with the original point cloud they also form part of the training data of the teacher model. The false positive (FP) area mentioned here is an area of the original point cloud space that is not occupied by any target recognition frame but nevertheless contains solid-object point cloud; correspondingly, the area occupied by the recognition frames in the original point cloud space is the foreground area, and an area that is neither a foreground area nor a false positive area is called a background area.
Step 3, determining the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all;
wherein L_all = L_det + L_fea + L_att + L_aff;
The method specifically comprises the following steps: step 31, determining the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model;
the method specifically comprises the following steps: step 311, determining the characteristic loss function L_fea as follows:
wherein:
1) α, β and γ are respectively preset loss coefficients;
2) F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, C are the height, width and channel dimensions of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H × W one-dimensional teacher feature channel vectors of shape 1 × C, into C two-dimensional teacher sub-feature maps of shape H × W, or into C × H × W individual teacher feature data, with 1 ≤ k ≤ C, 1 ≤ i ≤ H, 1 ≤ j ≤ W;
3) F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is the projection function from the student model BEV space to the teacher model BEV space; the projection feature map F'_S = f_proj(F_S) corresponds to the student feature map F_S, and its height, width and channel dimensions are consistent with those of the teacher feature map F_T; the projection feature map can likewise be decomposed into H × W one-dimensional projection feature channel vectors of shape 1 × C, into C two-dimensional projection sub-feature maps of shape H × W, or into C × H × W individual projection feature data;
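The projection function f_proj() only has to bring the student BEV map onto the teacher BEV grid so that the two maps share height, width and channel dimensions. A minimal sketch, assuming a simple scale difference between the two grids and nearest-neighbour resampling (a real implementation would apply the calibrated spatial transform between the two BEV spaces):

```python
import numpy as np

def project_to_teacher_bev(student_bev, scale):
    """Sketch of f_proj(): resample a (Hs, Ws, C) student BEV map onto the
    teacher BEV grid of (Hs*scale, Ws*scale, C). Nearest-neighbour
    rescaling is an illustrative assumption, not the patent's transform."""
    Hs, Ws, C = student_bev.shape
    Ht, Wt = int(Hs * scale), int(Ws * scale)
    ii = np.clip((np.arange(Ht) / scale).astype(int), 0, Hs - 1)
    jj = np.clip((np.arange(Wt) / scale).astype(int), 0, Ws - 1)
    return student_bev[np.ix_(ii, jj)]      # channels pass through untouched

proj = project_to_teacher_bev(np.zeros((64, 64, 32)), 2.0)
```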
4) M() is the foreground-background binary mask function: M(i,j) = 1 if the coordinate (i,j) falls within the foreground region occupied by any first identification frame bbox_1, and M(i,j) = 0 otherwise;
here, (i,j) are the coordinates of a pixel point on the teacher sub-feature map, and the foreground-background binary mask function specifies that as long as the coordinate falls within the foreground region occupied by any first identification frame bbox_1, the corresponding mask output is 1; otherwise, if the coordinate is not in any foreground region, the corresponding mask output is 0;
5) N() is the false positive-background binary mask function: N(i,j) = 1 if the coordinate (i,j) falls within any first false positive region FP, and N(i,j) = 0 otherwise;
here, (i,j) are the coordinates of a pixel point on the teacher sub-feature map; the false positive-background binary mask function specifies that the corresponding mask output is 1 as long as the coordinate is in any first false positive region FP, and otherwise, if the coordinate is not in any false positive region FP, the corresponding mask output is 0;
6) S() is the size mask function: S(i,j) = 1 / (H_bbox1 * W_bbox1) if the coordinate (i,j) falls within the foreground region occupied by a first identification frame bbox_1, and S(i,j) = 1 / N_gb otherwise;
wherein H_bbox1 and W_bbox1 are the depth and width of the corresponding first identification frame bbox_1, and N_gb is the number of background points;
here, (i,j) are the coordinates of a pixel point on the teacher sub-feature map, and the size mask function specifies that, as long as the coordinate is in the foreground region occupied by any first identification frame bbox_1, the corresponding size mask output is the reciprocal of the product of the depth H_bbox1 and the width W_bbox1 of that first identification frame bbox_1; whereas if the coordinate is not in any foreground region, the corresponding size mask output is the reciprocal of the number of background points N_gb;
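The three masks of items 4) to 6) can be sketched on an H × W BEV grid as follows. Box footprints and false positive regions are given here as (i0, i1, j0, j1) grid extents, an illustrative assumption; the patent defines them via the annotated first recognition frames bbox_1 and first false positive regions FP:

```python
import numpy as np

def binary_mask(shape_hw, regions):
    """M() over recognition-box footprints, or N() over FP regions:
    1 inside any listed region, 0 elsewhere."""
    out = np.zeros(shape_hw, dtype=np.float32)
    for i0, i1, j0, j1 in regions:
        out[i0:i1, j0:j1] = 1.0
    return out

def size_mask(shape_hw, boxes):
    """S(): 1/(H_bbox1 * W_bbox1) inside a box footprint, and 1/N_gb
    (reciprocal of the background point count) outside all boxes."""
    out = np.zeros(shape_hw, dtype=np.float32)
    fg = np.zeros(shape_hw, dtype=bool)
    for i0, i1, j0, j1 in boxes:
        out[i0:i1, j0:j1] = 1.0 / ((i1 - i0) * (j1 - j0))
        fg[i0:i1, j0:j1] = True
    out[~fg] = 1.0 / (~fg).sum()
    return out

M = binary_mask((8, 8), [(1, 3, 1, 4)])   # foreground-background mask
N = binary_mask((8, 8), [(5, 7, 5, 7)])   # false positive-background mask
S = size_mask((8, 8), [(1, 3, 1, 4)])     # size mask
```

Note that, following the definition above, the 1/N_gb branch of S() covers every non-foreground point, including false positive regions.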
7) A_s() is the channel vector attention function;
wherein:
t is a preset distillation hyper-parameter;
softmax_s[] is the activation function of the channel vector attention function;
when the input feature F_1 is a teacher feature channel vector, the teacher feature channel vector serves as the corresponding feature component, yielding the channel vector attention of the teacher feature map;
when the input feature F_1 is a projection feature channel vector, the projection feature channel vector serves as the corresponding feature component, yielding the channel vector attention of the projection feature map;
8) A_c() is the sub-feature map attention function;
wherein:
softmax_c[] is the activation function of the sub-feature map attention function;
when the input feature F_2 is a teacher sub-feature map, the teacher sub-feature map serves as the corresponding feature component, yielding the sub-feature map attention of the teacher feature map;
when the input feature F_2 is a projection sub-feature map, the projection sub-feature map serves as the corresponding feature component, yielding the sub-feature map attention of the projection feature map;
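Items 7) and 8) fix only the softmax activations and the distillation hyper-parameter t; one plausible instantiation is a temperature-scaled softmax over the C channel values (for A_s) and over the H × W spatial positions of one sub-feature map (for A_c). The exact normalisation below is an assumption of this sketch, not the patent's formula:

```python
import numpy as np

def channel_vector_attention(vec, t=0.5):
    """Sketch of A_s over a 1 x C feature channel vector:
    temperature-scaled softmax_s across the C channels."""
    z = vec / t
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

def sub_feature_map_attention(fmap, t=0.5):
    """Sketch of A_c over one H x W sub-feature map:
    temperature-scaled softmax_c across the H*W spatial positions."""
    z = fmap.ravel() / t
    e = np.exp(z - z.max())
    return (e / e.sum()).reshape(fmap.shape)

a_s = channel_vector_attention(np.zeros(4))
a_c = sub_feature_map_attention(np.zeros((2, 3)))
```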
here, as is apparent from the above description, the term of the characteristic loss function L_fea weighted by M() is the foreground feature loss branch between the teacher feature map F_T and the student feature map F_S, the term weighted by N() is the false positive feature loss branch, and the remaining term is the background feature loss branch; the characteristic loss function L_fea can improve the positioning accuracy of the student model;
step 312, determining the attention loss function L_att as follows:
wherein η is a preset attention loss hyper-parameter, and L_1 is the L1_Loss function;
here, the attention loss function L_att uses the channel vector attention function and the sub-feature map attention function to compare the attention characteristics of the teacher feature map F_T and the student feature map F_S; the attention loss function L_att can help the student model improve its target recognition precision;
step 313, selecting a preset number Q² of pixel points from each of the teacher feature map F_T and the projection feature map F'_S to form corresponding matching point pairs, with 1 ≤ i′ ≤ Q, 1 ≤ j′ ≤ Q;
wherein each pixel point corresponds, in the teacher feature map F_T, to a row feature tensor of shape 1 × W × C and to a column feature tensor of shape H × 1 × C; each pixel point likewise corresponds, in the projection feature map F'_S, to a row feature tensor of shape 1 × W × C and to a column feature tensor of shape H × 1 × C.
Here, a preset number Q² of pixel points are selected from each of the teacher feature map F_T and the projection feature map F'_S to form the matching point pairs, and the two pixel points of each matching point pair have the same pixel coordinates in the teacher feature map F_T and in the projection feature map F'_S;
for example, if Q = 2, four points with coordinates (1,2), (1,3), (1,4) and (1,5) are selected from the teacher feature map F_T, the four points with the same coordinates are selected from the projection feature map F'_S, and 4 matching point pairs are obtained according to the correspondence of (i′, j′);
step 314, determining the similarity loss function L_aff according to the preset number Q² of matching point pairs as follows:
wherein:
1) ζ is a preset similarity loss hyper-parameter;
2) ‖·‖_smoothl1 is the Smooth_L1_loss function;
3) A_ff() is the similarity function: A_ff(D_i′, D_j′) = (D_i′ · D_j′) / (‖D_i′‖ · ‖D_j′‖);
wherein:
D_i′ and D_j′ are input feature vectors;
when the input feature vectors D_i′ and D_j′ are the row feature tensor and the column feature tensor of a pixel point on the teacher feature map F_T, the similarity function is evaluated on those teacher tensors;
when the input feature vectors D_i′ and D_j′ are the row feature tensor and the column feature tensor of a pixel point on the projection feature map F'_S, the similarity function is evaluated on those projection tensors;
here, the similarity function used in the embodiment of the present invention is a cosine similarity function; the similarity loss function L_aff can help the student model improve its overall model performance;
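Since the text names a cosine similarity, the similarity function A_ff() can be written down directly; the row or column feature tensors are flattened before the dot product:

```python
import numpy as np

def cosine_affinity(d_i, d_j):
    """A_ff(): cosine similarity between two flattened feature tensors."""
    d_i, d_j = np.ravel(d_i), np.ravel(d_j)
    return float(np.dot(d_i, d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j)))

same = cosine_affinity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))  # parallel
orth = cosine_affinity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal
```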
step 32, adding the student model loss function L_det, the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all;
wherein L_all = L_det + L_fea + L_att + L_aff;
Step 4, performing model self-training on the student model according to the first image, the first identification frame set and the student model loss function L_det; after the model self-training is finished, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L_fea; after the teacher-student feature imitation training is finished, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is finished, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is finished, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the overall loss function L_all;
the method comprises the steps of distilling point cloud BEV characteristics of a teacher model to a student model by using a knowledge distillation mechanism, and training the student model; in the training process, the student model is subjected to self-training; and then based on three loss functions (characteristic loss function L) for the knowledge distillation fea Attention loss function L att And similarity loss function L aff ) Gradually training the student model; finally based on the overall loss function L all Carrying out integral training on the student model;
the method specifically comprises the following steps: step 41, according to the first image, the first identification frame set and the student model loss function L det Performing model self-training on the student model;
the method specifically comprises the following steps: step 411, inputting the first image into the student model for gradual operation, and taking a target recognition frame set output by a second target detection head network of the student model as a corresponding second recognition frame set in the operation process;
wherein the second identification frame set comprises a plurality of three-dimensional second identification frames bbox_2; each second identification frame bbox_2 has the shape H_bbox2 * W_bbox2 * Z_bbox2, where H_bbox2, W_bbox2 and Z_bbox2 are the depth, width and height of the second identification frame bbox_2;
step 412, substituting the first and second recognition frame sets into the student model loss function L_det for loss value estimation to generate a corresponding first loss value;
step 413, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 415; if not, go to step 414;
step 414, substituting the model parameters of the student model into the student model loss function L_det to form a corresponding first objective function; solving for the model parameters which minimize the first objective function; updating the model parameters of the student model according to the solving result; and returning to step 411 after updating;
step 415, confirming that the model self-training is completed;
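Steps 411 to 415 form a generic evaluate-check-update loop that the later stages (steps 42 to 45) repeat with their own loss functions; it can be sketched as follows, where loss_fn and update_fn stand in for the student model's forward pass and the gradient-based parameter solve:

```python
def train_until_converged(loss_fn, params, update_fn, converge_at, max_iters=10_000):
    """Generic form of steps 411-415: estimate the loss, stop once it falls
    inside the convergence range, otherwise solve for updated parameters
    and re-run from the loss estimation step."""
    loss = loss_fn(params)
    for _ in range(max_iters):
        loss = loss_fn(params)           # steps 411-412: run model, estimate loss
        if loss <= converge_at:          # step 413/415: convergence reached
            break
        params = update_fn(params)       # step 414: update and return to 411
    return params, loss

# Toy usage: drive a scalar "parameter" toward zero loss.
p, final = train_until_converged(lambda p: p * p, 4.0, lambda p: 0.5 * p, 0.01)
```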
step 42, after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L_fea;
the method specifically comprises the following steps: step 421, inputting the first point cloud into the teacher model for gradual operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as the corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
step 422, substituting the teacher feature map F_T, the projection feature map F'_S, the first recognition frame set and the first false positive region set into the characteristic loss function L_fea for loss value estimation to generate a corresponding second loss value;
step 423, identifying whether the second loss value meets a preset second loss convergence range; if yes, go to step 425; if not, go to step 424;
step 424, substituting the model parameters of the student model into the characteristic loss function L_fea to form a corresponding second objective function; solving for the model parameters which minimize the second objective function; updating the model parameters of the student model according to the solving result; and returning to step 421 after updating;
step 425, confirming that the teacher and student characteristic imitation training is finished;
step 43, after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att;
the method specifically comprises the following steps: step 431, inputting the first point cloud into the teacher model for gradual operation, and extracting an output characteristic diagram of a first BEV encoder of the teacher model as a corresponding teacher characteristic diagram FT in the operation process; inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
step 432, substituting the teacher feature map F_T and the projection feature map F'_S into the attention loss function L_att for loss value estimation to generate a corresponding third loss value;
step 433, identifying whether the third loss value meets a preset third loss convergence range; if yes, go to step 435; if not, go to step 434;
step 434, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters which minimize the third objective function; updating the model parameters of the student model according to the solving result; and returning to step 431 after updating;
step 435, confirming completion of the teacher and student attention imitation training;
step 44, after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff;
the method specifically comprises the following steps: step 441, inputting the first point cloud into the teacher model for gradual operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as a corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
step 442, substituting the teacher feature map F_T and the projection feature map F'_S into the similarity loss function L_aff for loss value estimation to generate a corresponding fourth loss value;
step 443, identifying whether the fourth loss value meets a preset fourth loss convergence range; if yes, go to step 445; if not, go to step 444;
step 444, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters which minimize the fourth objective function; updating the model parameters of the student model according to the solving result; and returning to step 441 after updating;
step 445, confirming that the teacher-student similarity training is completed;
step 45, after the teacher-student similarity training is finished, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the overall loss function L_all;
the method specifically comprises the following steps: step 451, inputting the first point cloud into the teacher model to perform step-by-step operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as the corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S And a target recognition frame set output by a second target detection head network of the student model is used as a corresponding third recognition frame set; and based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagramWherein the third identification frame set comprises a plurality of three-dimensional third identification frames bbox 3 (ii) a Third identification frame bbox 3 Is H in shape bbox3 *W bbox3 *Z bbox3 ,H bbox3 、W bbox3 、Z bbox3 As a third identification frame bbox 3 Depth, width and height of;
step 452, substituting the first recognition frame set, the third recognition frame set, the teacher feature map F_T, the projection feature map F'_S and the first false positive region set into the overall loss function L_all for loss value estimation to generate a corresponding overall loss value;
step 453, identifying whether the overall loss value satisfies a preset overall loss convergence range; if yes, go to step 455; if not, go to step 454;
step 454, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters which minimize the overall objective function; updating the model parameters of the student model according to the solving result; and returning to step 451 after updating;
at step 455, the overall training is confirmed to be complete.
And step 5, after the overall training is finished, selecting a new group of original point cloud, original image, identification frame marking information of the original point cloud and false positive area marking information from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches the specified number.
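The overall schedule of steps 4 and 5 can be sketched as an outer loop over training records, with the five per-record training stages (self-training, feature imitation, attention imitation, similarity training, overall training) elided to callables; record drawing by round index is an illustrative assumption:

```python
def train_rounds(dataset, stages, num_rounds):
    """Sketch of steps 4-5: each round draws one training record (point
    cloud, image, recognition-frame and false-positive annotations) and
    runs the five training stages on it in order."""
    for r in range(num_rounds):
        record = dataset[r % len(dataset)]
        for stage in stages:             # e.g. [self_train, feat, att, aff, overall]
            stage(record)

calls = []
train_rounds(["rec0", "rec1"],
             [lambda rec: calls.append(("self", rec)),
              lambda rec: calls.append(("overall", rec))],
             3)
```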
Fig. 2 is a block diagram of an apparatus for performing distillation training on a three-dimensional object detection model based on cross-modal knowledge according to a second embodiment of the present invention, where the apparatus is a terminal device or a server for implementing the foregoing method embodiment, and may also be an apparatus capable of enabling the foregoing terminal device or server to implement the foregoing method embodiment, and for example, the apparatus may be an apparatus or a chip system of the foregoing terminal device or server. As shown in fig. 2, the apparatus includes: an acquisition module 201, a training data processing module 202, a loss function processing module 203, and a training processing module 204.
The acquisition module 201 is configured to acquire a mature, already-trained point cloud-based 3D target detection model as the corresponding teacher model, and to acquire an image-based 3D target detection model to be trained as the corresponding student model; and to obtain the model loss function of the student model as the corresponding student model loss function L_det.
The training data processing module 202 is configured to obtain an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a corresponding first image; and acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set.
The loss function processing module 203 is configured to determine the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and to add the student model loss function L_det, the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff.
The training processing module 204 is configured to perform model self-training on the student model according to the first image, the first recognition frame set and the student model loss function L_det; after the model self-training is finished, perform teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L_fea; after the teacher-student feature imitation training is finished, perform teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is finished, perform teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is finished, perform overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the overall loss function L_all; and, after the overall training is finished, select a new group of original point cloud, original image, identification frame marking information of the original point cloud and false positive area marking information from the training data set to perform the next round of training on the student model until the total number of training rounds reaches the specified number.
The device for training the three-dimensional target detection model based on cross-modal knowledge distillation provided by the embodiment of the invention can execute the method steps in the method embodiment, and the implementation principle and the technical effect are similar, so that the detailed description is omitted.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or can be implemented in the form of hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the obtaining module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a processing element of the apparatus calls and executes the functions of the determining module. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), or one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), etc. For another example, when some of the above modules are implemented in the form of a Processing element scheduler code, the Processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor that can invoke the program code. As another example, these modules may be integrated together and implemented in the form of a System-on-a-chip (SOC).
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the foregoing method embodiments are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, bluetooth, microwave, etc.).
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. The electronic device may be the aforementioned terminal device or server, or a terminal device or server that is connected to them and implements the method of the embodiment of the present invention. As shown in fig. 3, the electronic device may include: a processor 301 (e.g., a CPU), a memory 302, and a transceiver 303. The transceiver 303 is coupled to the processor 301, and the processor 301 controls the transceiving operation of the transceiver 303. The memory 302 may store various instructions for performing various processing functions and implementing the processing steps described in the foregoing method embodiments. Preferably, the electronic device according to an embodiment of the present invention further includes: a power supply 304, a system bus 305, and a communication port 306. The system bus 305 is used to implement communication connections between the elements. The communication port 306 is used for connection and communication between the electronic device and other peripherals.
The system bus 305 mentioned in fig. 3 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used to implement communication between the database access device and other equipment (such as a client, a read-write library, and a read-only library). The memory may include Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Graphics Processing Unit (GPU), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be noted that the embodiment of the present invention also provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the method and the processing procedure provided in the above-mentioned embodiment.
The embodiment of the present invention further provides a chip for executing the instructions, where the chip is configured to execute the processing steps described in the foregoing method embodiment.
The embodiments of the present invention provide a method, an apparatus, an electronic device, and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation. A well-trained point cloud-based 3D target detection model is taken as the teacher model and an image-based 3D target detection model as the student model; a knowledge distillation mechanism is used to distill the point cloud BEV (bird's-eye view) features of the teacher model to the student model, training the student model to learn, from image data, depth features similar to those of the point cloud data. The method and apparatus help the image 3D target detection model reduce depth estimation errors and improve its detection accuracy.
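The overall training objective described above combines the student's own detection loss with three distillation losses. The following is a minimal sketch of that combination; the placeholder loss values and the simple unweighted sum follow the patent's formula L_all = L_det + L_fea + L_att + L_aff, while everything else here is illustrative.

```python
# Minimal sketch of the overall distillation objective L_all = L_det + L_fea + L_att + L_aff.
# The individual loss values below are placeholders; in the patent each term is computed
# from the teacher/student BEV feature maps and the student's detection outputs.

def overall_loss(l_det, l_fea, l_att, l_aff):
    """Sum the student detection loss and the three distillation losses."""
    return l_det + l_fea + l_att + l_aff

# Example with illustrative loss values:
l_all = overall_loss(l_det=0.8, l_fea=0.3, l_att=0.1, l_aff=0.05)
print(round(l_all, 2))  # 1.25
```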
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the components and steps of the various examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above specific embodiments further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, or the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.
Claims (13)
1. A method for training a three-dimensional target detection model based on cross-modal knowledge distillation, the method comprising:
obtaining a well-trained point cloud-based 3D target detection model as a corresponding teacher model, and obtaining an image-based 3D target detection model to be trained as a corresponding student model; and obtaining the model loss function of the student model as a corresponding student model loss function L_det;
Acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
determining a feature loss function L_fea, an attention loss function L_att, and a similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form a corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
performing model self-training on the student model according to the first image, the first identification frame set, and the student model loss function L_det; after the model self-training is completed, performing teacher-student feature simulation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the feature loss function L_fea; after the teacher-student feature simulation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the overall loss function L_all;
and after the overall training is completed, selecting a new group of original point clouds, original images, identification frame marking information, and false positive region marking information of the original point clouds from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches a specified number.
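The staged training schedule of claim 1 (self-training, then feature simulation, then attention imitation, then similarity, then overall training, round after round) can be sketched as follows. The stage names and the `run_stage` callback are illustrative assumptions; each real stage is itself a convergence-driven inner loop as detailed in claims 6-10.

```python
# Hypothetical sketch of the per-round training schedule from claim 1.
# run_stage stands in for one convergence-driven inner training loop.

def train_round(stages, run_stage):
    """Run the five stages of one training round in order and record completion."""
    completed = []
    for stage in stages:
        run_stage(stage)          # inner loop until this stage's loss converges
        completed.append(stage)
    return completed

STAGES = ["self", "feature", "attention", "similarity", "overall"]

log = []
result = train_round(STAGES, log.append)
print(result)  # ['self', 'feature', 'attention', 'similarity', 'overall']
```

A full training run would call `train_round` once per round until the specified total number of rounds is reached.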
2. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 1, wherein:
the teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder, and a first target detection head network; the output of the point cloud pillar feature network is connected with the input of the BEV pooling network; the output of the BEV pooling network is connected with the input of the first BEV encoder; the output of the first BEV encoder is connected with the input of the first target detection head network;
the student model comprises an image encoder, an LSS view converter, a second BEV encoder, and a second target detection head network; the output of the image encoder is connected with the input of the LSS view converter; the output of the LSS view converter is connected with the input of the second BEV encoder; the output of the second BEV encoder is connected with the input of the second target detection head network;
the first identification frame set comprises a plurality of three-dimensional first identification frames bbox_1; the shape of the first identification frame bbox_1 is H_bbox1 × W_bbox1 × Z_bbox1, where H_bbox1, W_bbox1, and Z_bbox1 are the depth, width, and height of the first identification frame bbox_1;
the first set of false positive regions comprises a plurality of first false positive regions FP.
3. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 2, wherein the determining of the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model specifically comprises:
step 31, determining the feature loss function L_fea as:
wherein:
α, β, and γ are preset loss coefficients, respectively;
F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, and C are the height, width, and channel dimension of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H × W one-dimensional teacher feature channel vectors of shape 1 × C, can also be decomposed into C teacher sub-feature maps of two-dimensional shape H × W, and can also be decomposed into C × H × W individual teacher feature data;
F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is a projection function from the student model BEV space to the teacher model BEV space; the projection feature map corresponding to the student feature map F_S is consistent with the teacher feature map F_T in channel dimension; the projection feature map can be decomposed into H × W one-dimensional projection feature channel vectors of shape 1 × C, can also be decomposed into C projection sub-feature maps of two-dimensional shape H × W, and can also be decomposed into C × H × W individual projection feature data;
M() is a foreground-background binary mask function;
N() is a false positive-background binary mask function;
S() is a size mask function;
A_s() is a channel vector attention function;
A_c() is a sub-feature map attention function;
the feature loss function comprises a foreground feature loss branch, a false positive feature loss branch, and a background feature loss branch, each computed between the teacher feature map F_T and the student feature map F_S;
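A minimal sketch of such a mask-partitioned feature loss is shown below: the BEV grid is split into foreground (inside ground-truth boxes), false-positive, and background regions, and each region's teacher-student discrepancy is weighted by α, β, or γ. The squared-error form and the shapes are illustrative assumptions; the patent's exact formula (including its attention weighting terms) is not reproduced here.

```python
# Hedged sketch of a three-branch (foreground / false-positive / background)
# masked feature loss between teacher and student BEV feature maps.

def masked_feature_loss(ft, fs, fg_mask, fp_mask, alpha=1.0, beta=1.0, gamma=1.0):
    """ft, fs: H x W lists of teacher/student feature values; masks: H x W of 0/1."""
    loss_fg = loss_fp = loss_bg = 0.0
    for i in range(len(ft)):
        for j in range(len(ft[0])):
            err = (ft[i][j] - fs[i][j]) ** 2
            if fg_mask[i][j]:          # cell inside a ground-truth box
                loss_fg += err
            elif fp_mask[i][j]:        # cell inside an annotated false-positive region
                loss_fp += err
            else:                      # remaining background cells
                loss_bg += err
    return alpha * loss_fg + beta * loss_fp + gamma * loss_bg

ft = [[1.0, 2.0], [3.0, 4.0]]
fs = [[1.0, 1.0], [3.0, 2.0]]
fg = [[1, 0], [0, 0]]   # top-left cell is foreground
fp = [[0, 1], [0, 0]]   # top-right cell is a false-positive region
print(masked_feature_loss(ft, fs, fg, fp, alpha=2.0, beta=1.0, gamma=0.5))  # 3.0
```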
step 32, determining the attention loss function L_att as:
wherein:
η is a preset attention loss hyper-parameter;
L1 is the L1_Loss function;
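The general shape of an L1 attention-imitation loss can be sketched as follows: the student's attention map is pulled toward the teacher's under an L1 distance scaled by the hyper-parameter η. Representing each attention map as a flat list of weights is an assumption for illustration.

```python
# Sketch of an L1 attention-imitation loss scaled by a preset hyper-parameter eta.

def l1_loss(a, b):
    """Mean absolute difference between two equal-length attention maps."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def attention_loss(att_teacher, att_student, eta=0.5):
    """L_att-style loss: eta * L1(teacher attention, student attention)."""
    return eta * l1_loss(att_teacher, att_student)

print(attention_loss([0.2, 0.3, 0.5], [0.1, 0.4, 0.5], eta=2.0))
```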
step 33, arbitrarily selecting a preset number Q_2 of pixel points from the teacher feature map F_T and the projection feature map to form corresponding matching point pairs;
wherein:
each pixel point corresponds, in the teacher feature map F_T, to a row feature tensor of shape 1 × W × C and also to a column feature tensor of shape H × 1 × C; each pixel point likewise corresponds, in the projection feature map, to a row feature tensor of shape 1 × W × C and also to a column feature tensor of shape H × 1 × C;
step 34, determining, according to the preset number Q_2 of matching point pairs, the similarity loss function L_aff as:
wherein:
ζ is a preset similarity loss hyper-parameter;
‖·‖_smoothl1 is the Smooth_L1_loss function;
A_ff() is a similarity function.
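The similarity distillation step above can be sketched as follows: for each matched pixel pair, the affinity A_ff between the row and column feature tensors is computed in both the teacher map and the projected student map, and the gap between the two affinities is penalized with a Smooth L1 loss scaled by ζ. Using cosine similarity as A_ff is an assumption consistent with the cosine-style function described in claim 5.

```python
# Hedged sketch of a row/column affinity distillation loss with Smooth L1.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def smooth_l1(x):
    """Standard Smooth L1 (Huber with beta = 1)."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5

def similarity_loss(pairs_teacher, pairs_student, zeta=1.0):
    """Each element of pairs_* is a (row_tensor, col_tensor) tuple for one pixel."""
    total = 0.0
    for (rt, ct), (rs, cs) in zip(pairs_teacher, pairs_student):
        total += smooth_l1(cosine(rt, ct) - cosine(rs, cs))
    return zeta * total / len(pairs_teacher)

t_pairs = [([1.0, 0.0], [1.0, 0.0])]   # teacher affinity = 1.0
s_pairs = [([1.0, 0.0], [0.0, 1.0])]   # student affinity = 0.0
print(similarity_loss(t_pairs, s_pairs))  # 0.5
```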
4. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein:
the foreground-background binary mask function M() is:
the false positive-background binary mask function N() is:
the size mask function S() is:
wherein:
H_bbox1 and W_bbox1 are the depth and width of the corresponding first identification frame bbox_1;
the channel vector attention function A_s() is:
wherein:
t is a preset distillation hyper-parameter;
softmax_s[] is the activation function of the channel vector attention function;
F_1 is an input feature comprising C feature components; when the input feature F_1 is the teacher feature channel vector, the teacher feature channel vector serves as the corresponding feature component; when the input feature F_1 is the projection feature channel vector, the projection feature channel vector serves as the corresponding feature component;
the sub-feature map attention function A_c() is:
wherein:
softmax_c[] is the activation function of the sub-feature map attention function;
F_2 is an input feature comprising H × W feature components; when the input feature F_2 is the teacher sub-feature map, the teacher sub-feature map serves as the corresponding feature component; when the input feature F_2 is the projection sub-feature map, the projection sub-feature map serves as the corresponding feature component.
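Both attention functions above are temperature-scaled softmaxes over feature components. The sketch below shows that general shape; reducing each component to its mean absolute value before the softmax is an assumption, since the patent's exact formulas are given only as figures in the original.

```python
# Sketch of a temperature-scaled softmax attention over per-channel feature components,
# the general shape of A_s (channel vector attention) and A_c (sub-feature map attention).
import math

def softmax(xs, t=1.0):
    """Softmax over xs with distillation temperature t."""
    exps = [math.exp(x / t) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def channel_attention(feature_components, t=1.0):
    """Weight each component by the softmax of its mean absolute value (assumption)."""
    energies = [sum(abs(v) for v in comp) / len(comp) for comp in feature_components]
    return softmax(energies, t)

w = channel_attention([[1.0, 1.0], [3.0, 3.0]], t=1.0)
print(w)
assert abs(sum(w) - 1.0) < 1e-9   # attention weights form a distribution
```

A larger temperature t flattens the distribution, which is the usual role of the distillation hyper-parameter.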
5. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein:
the similarity function A_ff() is:
wherein D_i′ and D_j′ are input feature vectors;
when the input feature vectors D_i′ and D_j′ are the row feature tensor and the column feature tensor of a pixel point in the teacher feature map F_T, the similarity function is specifically:
6. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of model self-training on the student model according to the first image, the first identification frame set, and the student model loss function L_det specifically comprises:
step 61, inputting the first image into the student model for stepwise operation, and during the operation taking the target identification frame set output by the second target detection head network of the student model as a corresponding second identification frame set; the second identification frame set comprises a plurality of three-dimensional second identification frames bbox_2; the shape of the second identification frame bbox_2 is H_bbox2 × W_bbox2 × Z_bbox2, where H_bbox2, W_bbox2, and Z_bbox2 are the depth, width, and height of the second identification frame bbox_2;
step 62, substituting the first and second identification frame sets into the student model loss function L_det to estimate a loss value and generate a corresponding first loss value;
step 63, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 65; if not, go to step 64;
step 64, substituting the model parameters of the student model into the student model loss function L_det to form a corresponding first objective function; solving for the model parameters that minimize the first objective function; updating the model parameters of the student model according to the solving result; and returning to step 61 after the update;
and step 65, confirming that the model self-training is completed.
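Steps 61-65 describe a convergence-driven loop: forward pass, loss evaluation, convergence check, parameter update, repeat. The same pattern recurs in claims 7-10 with different losses. A minimal sketch is shown below; the quadratic toy objective and the learning rate are assumptions standing in for the real minimization of L_det over the student model's parameters.

```python
# Sketch of the convergence-driven inner loop of steps 61-65.

def self_train(param, loss_fn, grad_fn, lr=0.1, tol=1e-3, max_iter=1000):
    """Gradient-descent loop mirroring steps 61-64; stops when loss <= tol (step 63)."""
    for _ in range(max_iter):
        loss = loss_fn(param)                # steps 61-62: forward pass + loss value
        if loss <= tol:                      # step 63: convergence check
            return param, loss               # step 65: self-training complete
        param = param - lr * grad_fn(param)  # step 64: update model parameters
    return param, loss_fn(param)

# Toy objective (p - 2)^2 standing in for L_det:
p, final_loss = self_train(0.0, lambda p: (p - 2.0) ** 2, lambda p: 2.0 * (p - 2.0))
print(round(p, 2), final_loss <= 1e-3)
```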
7. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of teacher-student feature simulation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the feature loss function L_fea specifically comprises:
step 71, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, and during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map;
step 72, substituting the teacher feature map F_T, the projection feature map, the first identification frame set, and the first false positive region set into the feature loss function L_fea to estimate a loss value and generate a corresponding second loss value;
step 73, identifying whether the second loss value meets a preset second loss convergence range; if yes, go to step 75; if not, go to step 74;
step 74, substituting the model parameters of the student model into the feature loss function L_fea to form a corresponding second objective function; solving for the model parameters that minimize the second objective function; updating the model parameters of the student model according to the solving result; and returning to step 71 after the update;
and step 75, confirming that the teacher-student feature simulation training is completed.
8. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att specifically comprises:
step 81, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, and during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map;
step 82, substituting the teacher feature map F_T and the projection feature map into the attention loss function L_att to estimate a loss value and generate a corresponding third loss value;
step 83, identifying whether the third loss value meets a preset third loss convergence range; if yes, go to step 85; if not, go to step 84;
step 84, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters that minimize the third objective function; updating the model parameters of the student model according to the solving result; and returning to step 81 after the update;
and step 85, confirming that the teacher-student attention imitation training is finished.
9. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff specifically comprises:
step 91, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, and during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map;
step 92, substituting the teacher feature map F_T and the projection feature map into the similarity loss function L_aff to estimate a loss value and generate a corresponding fourth loss value;
step 93, identifying whether the fourth loss value meets a preset fourth loss convergence range; if yes, go to step 95; if not, go to step 94;
step 94, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters that minimize the fourth objective function; updating the model parameters of the student model according to the solving result; and returning to step 91 after the update;
and step 95, confirming that the teacher-student similarity training is completed.
10. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the overall loss function L_all specifically comprises:
step 101, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S, and taking the target identification frame set output by the second target detection head network of the student model as a corresponding third identification frame set; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map; the third identification frame set comprises a plurality of three-dimensional third identification frames bbox_3; the shape of the third identification frame bbox_3 is H_bbox3 × W_bbox3 × Z_bbox3, where H_bbox3, W_bbox3, and Z_bbox3 are the depth, width, and height of the third identification frame bbox_3;
step 102, substituting the first and third identification frame sets, the teacher feature map F_T, the projection feature map, and the first false positive region set into the overall loss function L_all to estimate a loss value and generate a corresponding overall loss value;
step 103, identifying whether the overall loss value meets a preset overall loss convergence range; if yes, go to step 105; if not, go to step 104;
step 104, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters that minimize the overall objective function; updating the model parameters of the student model according to the solving result; and returning to step 101 after the update;
and step 105, confirming that the overall training is completed.
11. An apparatus for performing the method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to any one of claims 1-10, the apparatus comprising: an acquisition module, a training data processing module, a loss function processing module, and a training processing module;
the acquisition module is used for acquiring a well-trained point cloud-based 3D target detection model as a corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as a corresponding student model; and obtaining the model loss function of the student model as a corresponding student model loss function L_det;
The training data processing module is used for acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a corresponding first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
the loss function processing module is used for determining a feature loss function L_fea, an attention loss function L_att, and a similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form a corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
the training processing module is used for performing model self-training on the student model according to the first image, the first identification frame set, and the student model loss function L_det; after the model self-training is completed, performing teacher-student feature simulation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the feature loss function L_fea; after the teacher-student feature simulation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the overall loss function L_all; and after the overall training is completed, selecting a new group of original point clouds, original images, identification frame marking information, and false positive region marking information of the original point clouds from the training data set to perform the next round of training on the student model until the total number of training rounds reaches the specified number.
12. An electronic device, comprising: a memory, a processor, and a transceiver;
the processor is configured to be coupled with the memory, and to read and execute the instructions in the memory to implement the method steps of any one of claims 1-10;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
13. A computer-readable storage medium having stored thereon computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211296868.1A CN115690708A (en) | 2022-10-21 | 2022-10-21 | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211296868.1A CN115690708A (en) | 2022-10-21 | 2022-10-21 | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115690708A true CN115690708A (en) | 2023-02-03 |
Family
ID=85066272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211296868.1A Pending CN115690708A (en) | 2022-10-21 | 2022-10-21 | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115690708A (en) |
-
2022
- 2022-10-21 CN CN202211296868.1A patent/CN115690708A/en active Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116028891A (en) * | 2023-02-16 | 2023-04-28 | 之江实验室 | Industrial anomaly detection model training method and device based on multi-model fusion |
CN116229210A (en) * | 2023-02-23 | 2023-06-06 | 南通探维光电科技有限公司 | Target detection model training method, device, equipment and medium |
CN116229210B (en) * | 2023-02-23 | 2023-10-24 | 南通探维光电科技有限公司 | Target detection model training method, device, equipment and medium |
CN116341650A (en) * | 2023-03-23 | 2023-06-27 | 哈尔滨市科佳通用机电股份有限公司 | Noise self-training-based railway wagon bolt loss detection method |
CN116341650B (en) * | 2023-03-23 | 2023-12-26 | 哈尔滨市科佳通用机电股份有限公司 | Noise self-training-based railway wagon bolt loss detection method |
CN117097797A (en) * | 2023-10-19 | 2023-11-21 | 浪潮电子信息产业股份有限公司 | Cloud edge end cooperation method, device and system, electronic equipment and readable storage medium |
CN117097797B (en) * | 2023-10-19 | 2024-02-09 | 浪潮电子信息产业股份有限公司 | Cloud edge end cooperation method, device and system, electronic equipment and readable storage medium |
CN117351450A (en) * | 2023-12-06 | 2024-01-05 | 吉咖智能机器人有限公司 | Monocular 3D detection method and device, electronic equipment and storage medium |
CN117351450B (en) * | 2023-12-06 | 2024-02-27 | 吉咖智能机器人有限公司 | Monocular 3D detection method and device, electronic equipment and storage medium |
CN117523549A (en) * | 2024-01-04 | 2024-02-06 | 南京邮电大学 | Three-dimensional point cloud object identification method based on deep and wide knowledge distillation |
CN117523549B (en) * | 2024-01-04 | 2024-03-29 | 南京邮电大学 | Three-dimensional point cloud object identification method based on deep and wide knowledge distillation |
Similar Documents
Publication | Title |
---|---|
CN115690708A (en) | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
CN109816725B (en) | Monocular camera object pose estimation method and device based on deep learning |
CN109960742B (en) | Local information searching method and device |
CN110866871A (en) | Text image correction method and device, computer equipment and storage medium |
CN111723691B (en) | Three-dimensional face recognition method and device, electronic equipment and storage medium |
CN113888689A (en) | Image rendering model training method, image rendering method and image rendering device |
CN109948441B (en) | Model training method, image processing method, device, electronic equipment and computer readable storage medium |
CN107767358B (en) | Method and device for determining ambiguity of object in image |
CN107766864B (en) | Method and device for extracting features and method and device for object recognition |
CN112200057A (en) | Face living body detection method and device, electronic equipment and storage medium |
CN112102424A (en) | License plate image generation model construction method, generation method and device |
CN114219855A (en) | Point cloud normal vector estimation method and device, computer equipment and storage medium |
CN113095333A (en) | Unsupervised feature point detection method and unsupervised feature point detection device |
CN111260794B (en) | Outdoor augmented reality application method based on cross-source image matching |
CN113129425A (en) | Face image three-dimensional reconstruction method, storage medium and terminal device |
CN112929626A (en) | Three-dimensional information extraction method based on smartphone image |
CN112990183A (en) | Method, system and device for extracting homonymous strokes of offline handwritten Chinese characters |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium |
CN113706472A (en) | Method, device and equipment for detecting road surface diseases and storage medium |
CN110533663B (en) | Image parallax determining method, device, equipment and system |
CN111723688A (en) | Human body action recognition result evaluation method and device and electronic equipment |
CN116597246A (en) | Model training method, target detection method, electronic device and storage medium |
CN116206302A (en) | Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium |
CN115035193A (en) | Bulk grain random sampling method based on binocular vision and image segmentation technology |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |