CN115690708A - Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation - Google Patents
- Publication number: CN115690708A
- Application number: CN202211296868.1A
- Authority: CN (China)
- Prior art keywords: model, student, teacher, training, loss function
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
An embodiment of the invention relates to a method and a device for training a three-dimensional target detection model based on cross-modal knowledge distillation. The method comprises the following steps: obtaining a teacher model, a student model, and a student-model loss function; acquiring training data from a training data set; determining the feature loss, attention loss, and similarity loss functions for knowledge distillation from the teacher model to the student model, the student-model loss function, feature loss function, attention loss function, and similarity loss function together forming an overall loss function; performing model self-training on the student model; performing teacher-student feature imitation training; performing teacher-student attention imitation training; performing teacher-student similarity training; performing overall training; and, after training completes, selecting a new group of training data from the training data set for the next round of student-model training until a specified number of rounds is reached. The method and device can improve the detection accuracy of an image-based 3D target detection model.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for training a three-dimensional target detection model based on cross-modal knowledge distillation.
Background
The perception module of an automatic driving system detects targets from sensor data. Mainstream sensors currently include cameras, lidar, and radar, and common 3D target detection models fall into two main types: neural network models that take a point cloud as input, and neural network models that take an image as input. Each has its own advantages and disadvantages: 1) a point cloud carries accurate distance (depth) information, so a point-cloud-based 3D target detection model can achieve high detection precision on nearby targets; however, point clouds are sparse, so point-cloud-based models detect distant targets poorly; 2) visual information on an image is uniformly distributed with high information density, so an image-based 3D target detection model can achieve high recognition precision on both near and far targets; however, an image carries no depth information, so the model must estimate depth, and the resulting depth-estimation error is large, which has long limited the localization accuracy of the target identification boxes (bbox) output by image-based 3D target detection models.
Disclosure of Invention
To remedy the defects of the prior art, the invention aims to provide a method, a device, electronic equipment, and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation. A well-trained point-cloud-based 3D target detection model serves as the Teacher Model, and an image-based 3D target detection model serves as the Student Model; a Knowledge Distillation mechanism distills the teacher model's point-cloud BEV features into the student model, training the student model to learn depth features from image data that resemble those of point-cloud data. This helps the image-based 3D target detection model reduce depth-estimation error and thereby improves its detection accuracy.
In order to achieve the above object, a first aspect of the embodiments of the present invention provides a method for training a three-dimensional target detection model based on cross-modal knowledge distillation, the method including:
acquiring a well-trained point-cloud-based 3D target detection model as the corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as the corresponding student model; and obtaining the model loss function of the student model as the corresponding student-model loss function L_det;
Acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
determining the feature loss function L_fea, attention loss function L_att, and similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student-model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
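The composition of the overall loss is a plain sum of the four terms. A minimal sketch, assuming each term has already been evaluated to a scalar for the current batch:

```python
def overall_loss(l_det, l_fea, l_att, l_aff):
    """Combine the four scalar loss terms into the overall loss,
    L_all = L_det + L_fea + L_att + L_aff, as stated above."""
    return l_det + l_fea + l_att + l_aff
```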
performing model self-training on the student model according to the first image, the first identification frame set, and the student-model loss function L_det; after the model self-training completes, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the feature loss function L_fea; after the feature imitation training completes, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the attention imitation training completes, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the similarity training completes, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the overall loss function L_all;
and after the overall training completes, selecting a new group of original point clouds, original images, identification frame marking information, and false positive area marking information from the training data set for the next round of student-model training, until the total number of training rounds reaches the specified number.
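The round-level schedule described above can be sketched as follows. Here `dataset` (yielding point cloud, image, identification frames, and false-positive regions per round) and the `run_stage` callback are hypothetical stand-ins for illustration; each stage is assumed to run its own convergence loop internally:

```python
def train_student(dataset, num_rounds, run_stage):
    """Run the five training stages in fixed order on a fresh data
    group each round, as described in the method steps."""
    stages = ["self_training", "feature_imitation",
              "attention_imitation", "similarity", "overall"]
    log = []
    samples = iter(dataset)
    for _ in range(num_rounds):
        sample = next(samples)     # a new group of training data each round
        for stage in stages:       # the five stages run strictly in order
            run_stage(stage, sample)
            log.append(stage)
    return log
```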
Preferably, the teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder, and a first target detection head network; the output of the point cloud pillar feature network is connected to the input of the BEV pooling network; the output of the BEV pooling network is connected to the input of the first BEV encoder; and the output of the first BEV encoder is connected to the input of the first target detection head network;
the student model comprises an image encoder, an LSS view converter, a second BEV encoder, and a second target detection head network; the output of the image encoder is connected to the input of the LSS view converter; the output of the LSS view converter is connected to the input of the second BEV encoder; and the output of the second BEV encoder is connected to the input of the second target detection head network;
the first identification frame set comprises a plurality of three-dimensional first identification frames bbox_1; each first identification frame bbox_1 has shape H_bbox1 × W_bbox1 × Z_bbox1, where H_bbox1, W_bbox1, and Z_bbox1 are the depth, width, and height of the first identification frame bbox_1;
the first false positive area set comprises a plurality of first false positive areas FP.
Preferably, the determination of the feature loss function L_fea, attention loss function L_att, and similarity loss function L_aff for knowledge distillation from the teacher model to the student model specifically comprises the following steps:
Step 31, determining the feature loss function L_fea as the weighted combination of a foreground feature loss branch, a false-positive feature loss branch, and a background feature loss branch, where:
α, β, and γ are the preset loss coefficients of the three branches;
F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, and C are the height, width, and channel dimensions of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H×W one-dimensional teacher feature channel vectors of shape 1×C, into C two-dimensional teacher sub-feature maps of shape H×W, or into C·H·W individual teacher feature data, 1 ≤ k ≤ C, 1 ≤ i ≤ H, 1 ≤ j ≤ W;
F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is the projection function from the student-model BEV space to the teacher-model BEV space; applying f_proj() to the student feature map F_S yields the corresponding projection feature map, whose height, width, and channel dimensions match those of the teacher feature map F_T; the projection feature map can likewise be decomposed into H×W one-dimensional projection feature channel vectors of shape 1×C, into C two-dimensional projection sub-feature maps of shape H×W, or into C·H·W individual projection feature data;
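The three equivalent decompositions of an H×W×C BEV feature map described above can be sketched in pure Python (nested lists stand in for a tensor; an illustrative helper, not the patent's implementation):

```python
def decompose(feature_map):
    """Decompose an H×W×C feature map into the three views described
    above: H·W channel vectors of length C, C sub-feature maps of
    shape H×W, and the C·H·W individual scalar feature data."""
    H, W, C = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    channel_vectors = [feature_map[i][j] for i in range(H) for j in range(W)]
    sub_maps = [[[feature_map[i][j][k] for j in range(W)] for i in range(H)]
                for k in range(C)]
    scalars = [feature_map[i][j][k]
               for k in range(C) for i in range(H) for j in range(W)]
    return channel_vectors, sub_maps, scalars
```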
M() is a foreground-background binary mask function;
N() is a false positive-background binary mask function;
S() is a size mask function;
A_s() is a channel-vector attention function;
A_c() is a sub-feature-map attention function;
the foreground feature loss branch, the false-positive feature loss branch, and the background feature loss branch of L_fea are each computed between the teacher feature map F_T and the student feature map F_S;
Step 32, determining the attention loss function L_att, where:
η is a preset attention-loss hyper-parameter;
L1 is the L1_Loss function;
Step 33, selecting a preset quantity Q² of pixel points from the teacher feature map F_T and the projection feature map to form corresponding matching point pairs, 1 ≤ i′ ≤ Q, 1 ≤ j′ ≤ Q; where:
each pixel point corresponds in the teacher feature map F_T to a row feature tensor of shape 1×W×C and a column feature tensor of shape H×1×C, and likewise corresponds in the projection feature map to a row feature tensor of shape 1×W×C and a column feature tensor of shape H×1×C;
Step 34, determining the similarity loss function L_aff from the preset quantity Q² of matching point pairs, where:
ζ is a preset similarity-loss hyper-parameter;
‖·‖_smoothl1 is the Smooth_L1_loss function;
A_ff() is the similarity function.
Further, the foreground-background binary mask function M(), the false positive-background binary mask function N(), and the size mask function S() are defined over the BEV grid, where:
H_bbox1 and W_bbox1 are the depth and width of the corresponding first identification frame bbox_1.
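The binary mask functions can be sketched as BEV-grid rasterizers. This is an illustrative assumption about their form: `boxes` and `fp_regions` are given here as (i0, j0, i1, j1) index ranges, which is not the patent's exact parameterization:

```python
def foreground_mask(H, W, boxes):
    """Sketch of M(): 1 inside any identification-frame footprint on
    the H×W BEV grid, 0 elsewhere."""
    mask = [[0] * W for _ in range(H)]
    for (i0, j0, i1, j1) in boxes:
        for i in range(max(0, i0), min(H, i1)):
            for j in range(max(0, j0), min(W, j1)):
                mask[i][j] = 1
    return mask

def background_mask(H, W, boxes, fp_regions):
    """Background = neither foreground nor false-positive (sketch);
    the false positive-background mask N() reuses the same rasterizer."""
    fg = foreground_mask(H, W, boxes)
    fp = foreground_mask(H, W, fp_regions)
    return [[1 - max(fg[i][j], fp[i][j]) for j in range(W)] for i in range(H)]
```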
The channel-vector attention function A_s() is defined as follows, where:
T is a preset distillation hyper-parameter;
softmax_s[] is the activation function of the channel-vector attention function;
F_1 is the input feature, comprising C feature components; when the input feature F_1 is a teacher feature channel vector, that teacher feature channel vector supplies the corresponding feature components; when the input feature F_1 is a projection feature channel vector, that projection feature channel vector supplies the corresponding feature components.
The sub-feature-map attention function A_c() is defined as follows, where:
softmax_c[] is the activation function of the sub-feature-map attention function;
F_2 is the input feature, comprising H×W feature components; when the input feature F_2 is a teacher sub-feature map, that teacher sub-feature map supplies the corresponding feature components; when the input feature F_2 is a projection sub-feature map, that projection sub-feature map supplies the corresponding feature components.
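Both attention functions are temperature-scaled softmaxes over feature statistics. A sketch follows; the per-component statistic (mean absolute value) is an assumption, since the patent's formula images are not reproduced in this text:

```python
import math

def softmax(xs, temperature):
    """Temperature-scaled softmax; T is the preset distillation
    hyper-parameter."""
    exps = [math.exp(x / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def channel_vector_attention(channel_vectors, T):
    """Sketch of A_s(): one weight per spatial position, from a softmax
    over the mean magnitude of each 1×C channel vector (assumed statistic)."""
    stats = [sum(abs(v) for v in vec) / len(vec) for vec in channel_vectors]
    return softmax(stats, T)

def sub_feature_map_attention(sub_maps, T):
    """Sketch of A_c(): one weight per channel, from a softmax over the
    mean magnitude of each H×W sub-feature map (same caveat)."""
    stats = [sum(abs(x) for row in m for x in row) / (len(m) * len(m[0]))
             for m in sub_maps]
    return softmax(stats, T)
```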
Further, the similarity function A_ff() takes input feature vectors D_i′ and D_j′; when the input feature vectors D_i′ and D_j′ are the row feature tensor and column feature tensor of a pixel point in the teacher feature map F_T, A_ff() yields the teacher-side similarity; when the input feature vectors D_i′ and D_j′ are the row feature tensor and column feature tensor of a pixel point in the projection feature map, A_ff() yields the projection-side similarity.
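A minimal sketch of a similarity function over two flattened feature tensors. Cosine similarity is used here as a stand-in, since the patent's exact form of A_ff() is not reproduced in this text:

```python
import math

def affinity(d_i, d_j):
    """Sketch of A_ff(): cosine similarity between two feature vectors
    (e.g. a pixel point's flattened row and column feature tensors)."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    ni = math.sqrt(sum(a * a for a in d_i))
    nj = math.sqrt(sum(b * b for b in d_j))
    return dot / (ni * nj)
```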
preferably, the method further comprises the step of calculating a loss function L according to the first image, the first recognition frame set and the student model det Carrying out model self-training on the student model, and specifically comprising:
step 61, inputting the first image into the student model for gradual operation, and taking a target identification frame set output by the second target detection head network of the student model as a corresponding second identification frame set in the operation process; the second recognition frame set comprises a plurality of three-dimensional second recognition frames bbox 2 (ii) a The second identification frame bbox 2 Is H in shape bbox2 *W bbox2 *Z bbox2 ,H bbox2 、W bbox2 、Z bbox2 Is the second identification frame bbox 2 Depth, width and height of;
step 62, substituting the first and second recognition box sets into the student model loss function L det Estimating a loss value to generate a corresponding first loss value;
step 63, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 65; if not, go to step 64;
step 64, substituting the model parameters of the student model into the student model loss function L det Forming a corresponding first objective function; and for minimizing said first objective functionSolving the model parameters; updating the model parameters of the student model according to the solving result; and returns to step 61 after updating;
and step 65, confirming that the model self-training is completed.
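Steps 61–65 (and the analogous steps in each later stage) share one loop shape: evaluate the loss, stop once it falls within the convergence range, otherwise update the parameters and retry. A sketch, where `loss_fn`, `update`, and `tol` are illustrative stand-ins for the loss estimation, parameter solve, and preset convergence range:

```python
def train_to_convergence(loss_fn, params, update, tol, max_iters=1000):
    """Evaluate the loss; if it meets the convergence range, stop;
    otherwise update the model parameters and repeat."""
    loss = loss_fn(params)
    for _ in range(max_iters):
        if loss <= tol:                  # loss meets the convergence range
            break
        params = update(params, loss)    # minimize the objective, then retry
        loss = loss_fn(params)
    return params, loss
```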
Preferably, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the feature loss function L_fea specifically comprises:
Step 71, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map;
Step 72, substituting the teacher feature map F_T, the projection feature map, the first identification frame set, and the first false positive area set into the feature loss function L_fea to estimate a loss value, generating a corresponding second loss value;
Step 73, checking whether the second loss value meets a preset second loss convergence range; if yes, going to step 75; if not, going to step 74;
Step 74, substituting the model parameters of the student model into the feature loss function L_fea to form a corresponding second objective function; solving for the model parameters that minimize the second objective function; updating the model parameters of the student model according to the solution; and returning to step 71 after the update;
Step 75, confirming that the teacher-student feature imitation training is complete.
Preferably, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att specifically comprises:
Step 81, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map;
Step 82, substituting the teacher feature map F_T and the projection feature map into the attention loss function L_att to estimate a loss value, generating a corresponding third loss value;
Step 83, checking whether the third loss value meets a preset third loss convergence range; if yes, going to step 85; if not, going to step 84;
Step 84, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters that minimize the third objective function; updating the model parameters of the student model according to the solution; and returning to step 81 after the update;
Step 85, confirming that the teacher-student attention imitation training is complete.
Preferably, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff specifically comprises:
Step 91, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map;
Step 92, substituting the teacher feature map F_T and the projection feature map into the similarity loss function L_aff to estimate a loss value, generating a corresponding fourth loss value;
Step 93, checking whether the fourth loss value meets a preset fourth loss convergence range; if yes, going to step 95; if not, going to step 94;
Step 94, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters that minimize the fourth objective function; updating the model parameters of the student model according to the solution; and returning to step 91 after the update;
Step 95, confirming that the teacher-student similarity training is complete.
Preferably, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the overall loss function L_all specifically comprises:
Step 101, inputting the first point cloud into the teacher model for step-by-step operation, and during operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, during operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S, and taking the target identification frame set output by the second target detection head network of the student model as the corresponding third identification frame set; and, based on the projection function f_proj(), projecting the student feature map F_S from the student-model BEV space to the teacher-model BEV space to generate the corresponding projection feature map; the third identification frame set comprises a plurality of three-dimensional third identification frames bbox_3; each third identification frame bbox_3 has shape H_bbox3 × W_bbox3 × Z_bbox3, where H_bbox3, W_bbox3, and Z_bbox3 are the depth, width, and height of the third identification frame bbox_3;
Step 102, substituting the first and third identification frame sets, the teacher feature map F_T, the projection feature map, and the first false positive area set into the overall loss function L_all to estimate a loss value, generating a corresponding overall loss value;
Step 103, checking whether the overall loss value meets a preset overall loss convergence range; if yes, going to step 105; if not, going to step 104;
Step 104, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters that minimize the overall objective function; updating the model parameters of the student model according to the solution; and returning to step 101 after the update;
Step 105, confirming that the overall training is complete.
In a second aspect, an embodiment of the present invention provides an apparatus for implementing the method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to the first aspect, the apparatus comprising: an acquisition module, a training data processing module, a loss function processing module, and a training processing module;
the acquisition module is used for acquiring a well-trained point-cloud-based 3D target detection model as the corresponding teacher model and acquiring an image-based 3D target detection model to be trained as the corresponding student model, and for obtaining the model loss function of the student model as the corresponding student-model loss function L_det;
The training data processing module is used for acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
the loss function processing module is used for determining the feature loss function L_fea, attention loss function L_att, and similarity loss function L_aff for knowledge distillation from the teacher model to the student model, and for adding the student-model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
the training processing module is used for: performing model self-training on the student model according to the first image, the first identification frame set, and the student-model loss function L_det; after the model self-training completes, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the feature loss function L_fea; after the feature imitation training completes, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the attention imitation training completes, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the similarity training completes, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive area set, and the overall loss function L_all; and, after the overall training completes, selecting a new group of original point clouds, original images, identification frame marking information, and false positive area marking information from the training data set for the next round of student-model training until the total number of training rounds reaches the specified number.
A third aspect of an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a transceiver;
the processor is configured to be coupled to the memory, read and execute instructions in the memory, so as to implement the method steps of the first aspect;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing computer instructions that, when executed by a computer, cause the computer to perform the method of the first aspect.
The embodiments of the invention provide a method, a device, electronic equipment, and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation: a well-trained point-cloud-based 3D target detection model serves as the teacher model, an image-based 3D target detection model serves as the student model, and a knowledge distillation mechanism distills the teacher model's point-cloud BEV (Bird's-Eye View) features into the student model, training the student model to learn depth features from image data that resemble those of point-cloud data. This helps the image-based 3D target detection model reduce depth-estimation error and improves its detection accuracy.
Drawings
Fig. 1 is a schematic diagram of a method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for training a three-dimensional object detection model based on cross-modal knowledge distillation according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for training a three-dimensional target detection model based on cross-modal knowledge distillation, as shown in fig. 1, which is a schematic diagram of the method for training the three-dimensional target detection model based on cross-modal knowledge distillation provided by the embodiment of the present invention, the method mainly includes the following steps:
step 1, acquiring a mature point cloud-based 3D target detection model to be trained as a corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as a corresponding student model; and obtaining a model loss function of the student model as a corresponding student model loss function L det ;
The teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder, and a first target detection head network; the output of the point cloud pillar feature network is connected to the input of the BEV pooling network; the output of the BEV pooling network is connected to the input of the first BEV encoder; and the output of the first BEV encoder is connected to the input of the first target detection head network.
The student model comprises an image encoder, an LSS view converter, a second BEV encoder, and a second target detection head network; the output of the image encoder is connected to the input of the LSS view converter; the output of the LSS view converter is connected to the input of the second BEV encoder; and the output of the second BEV encoder is connected to the input of the second target detection head network.
Here, the teacher model of the embodiment of the invention is a well-trained point cloud-based 3D target detection model; the point cloud pillar feature network of the teacher model is similar to the Pillar Feature Net of the PointPillars model; the BEV pooling network of the teacher model performs height pooling on the output of the point cloud pillar feature network to obtain a BEV feature map under the bird's-eye view (BEV); the first BEV encoder of the teacher model further encodes the BEV feature map to output a corresponding BEV heatmap feature map; the first target detection head network of the teacher model performs BEV target detection according to the BEV heatmap feature map to obtain a plurality of two-dimensional target recognition frames, and performs 3D shape regression calculation on each two-dimensional target recognition frame through an internal fully-connected network so as to output a plurality of three-dimensional target recognition frames;
the student model of the embodiment of the invention is an image-based 3D target detection model to be trained; the image encoder of the student model extracts the features of an input image; the LSS view transformer of the student model is similar to the view transformer in the technical paper "Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D", and can extract BEV features from image data and output a corresponding BEV feature map; the second BEV encoder of the student model performs further information encoding on the BEV feature map and outputs a corresponding BEV heatmap feature map; the second target detection head network of the student model performs BEV target detection according to the BEV heatmap feature map to obtain a plurality of two-dimensional target recognition frames, and performs 3D shape regression calculation on each two-dimensional target recognition frame through an internal fully-connected network so as to output a plurality of three-dimensional target recognition frames. It should be noted that other neural network structures capable of outputting the BEV feature map and the BEV heatmap feature map may also be used before the second target detection head network of the student model in the embodiment of the present invention; the model loss function of the student model of the embodiment of the invention is a known loss function and is recorded as the student model loss function L_det.
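The student pipeline described above (image encoder, LSS view transformer, second BEV encoder, second detection head) can be summarised as a shape-flow sketch. All modules below are stand-in functions with illustrative shapes and box fields, not the patent's concrete networks; only the data flow is shown:

```python
import numpy as np

# Shape-flow sketch of the student model: image -> image features ->
# BEV feature map -> BEV heatmap feature map -> 3D boxes.
# All shapes and field names are illustrative assumptions.

def image_encoder(img):                       # (H_img, W_img, 3) -> (h, w, C)
    return np.zeros((img.shape[0] // 8, img.shape[1] // 8, 64))

def lss_view_transformer(feat):               # image features -> BEV feature map
    return np.zeros((128, 128, 64))           # (H_bev, W_bev, C)

def bev_encoder(bev):                         # BEV features -> BEV heatmap feature map
    return np.zeros_like(bev)

def detection_head(heatmap):                  # heatmap -> list of 3D recognition frames
    return [dict(x=0.0, y=0.0, z=0.0, h=1.5, w=1.8, l=4.2, yaw=0.0)]

img = np.zeros((256, 704, 3))
boxes = detection_head(bev_encoder(lss_view_transformer(image_encoder(img))))
```

The teacher pipeline has the same BEV-encoder-to-detection-head tail; only the front end (pillar features plus BEV pooling instead of image encoder plus LSS transform) differs.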
Step 2, acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a corresponding first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
wherein the first identification frame set comprises a plurality of three-dimensional first identification frames bbox_1; each first identification frame bbox_1 has the shape H_bbox1 * W_bbox1 * Z_bbox1, where H_bbox1, W_bbox1 and Z_bbox1 are the depth, width and height of the first identification frame bbox_1;
the first set of false positive regions comprises a plurality of first false positive regions FP.
Here, the training data set of the present invention is used to hold a plurality of training data records; each training data record corresponds to a group consisting of an original point cloud, an original image, and the identification frame marking information and false positive area marking information of the original point cloud under the same scene; the original point cloud is basically consistent with the field of view of the original image and is generated at the same time; the identification frame marking information and the false positive area marking information of the original point cloud are manual marking information, and together with the original point cloud they also form part of the training data of the teacher model. The false positive (FP) area mentioned here is an area of the original point cloud space that is not occupied by any target recognition frame but nevertheless contains solid-object point cloud; correspondingly, the area occupied by the recognition frames in the original point cloud space is the foreground area, and an area that is neither a foreground area nor a false positive area is called a background area.
Step 3, determining the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all;
wherein L_all = L_det + L_fea + L_att + L_aff;
The method specifically comprises the following steps: step 31, determining the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model;
the method specifically comprises the following steps: step 311, determining the characteristic loss function L_fea as follows:
wherein:
1) α, β and γ are respectively preset loss coefficients;
2) F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, C are the height, width and channel dimensions of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H × W one-dimensional teacher feature channel vectors of shape 1 × C, into C two-dimensional teacher sub-feature maps of shape H × W, or into C × H × W individual teacher feature data, with 1 ≤ k ≤ C, 1 ≤ i ≤ H, 1 ≤ j ≤ W;
3) F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is the projection function from the student model BEV space to the teacher model BEV space; the projection feature map F'_S = f_proj(F_S) corresponds to the student feature map F_S, and its height, width and channel dimensions are consistent with those of the teacher feature map F_T; the projection feature map can likewise be decomposed into H × W one-dimensional projection feature channel vectors of shape 1 × C, into C two-dimensional projection sub-feature maps of shape H × W, or into C × H × W individual projection feature data;
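The projection function f_proj() only has to bring the student BEV map onto the teacher BEV grid so that the two maps share height, width and channel dimensions. A minimal sketch, assuming a simple scale difference between the two grids and nearest-neighbour resampling (a real implementation would apply the calibrated spatial transform between the two BEV spaces):

```python
import numpy as np

def project_to_teacher_bev(student_bev, scale):
    """Sketch of f_proj(): resample a (Hs, Ws, C) student BEV map onto the
    teacher BEV grid of (Hs*scale, Ws*scale, C). Nearest-neighbour
    rescaling is an illustrative assumption, not the patent's transform."""
    Hs, Ws, C = student_bev.shape
    Ht, Wt = int(Hs * scale), int(Ws * scale)
    ii = np.clip((np.arange(Ht) / scale).astype(int), 0, Hs - 1)
    jj = np.clip((np.arange(Wt) / scale).astype(int), 0, Ws - 1)
    return student_bev[np.ix_(ii, jj)]      # channels pass through untouched

proj = project_to_teacher_bev(np.zeros((64, 64, 32)), 2.0)
```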
4) M() is the foreground-background binary mask function: M(i,j) = 1 if the coordinate (i,j) falls within the foreground region occupied by any first identification frame bbox_1, and M(i,j) = 0 otherwise;
here, (i,j) are the coordinates of a pixel point on the teacher sub-feature map, and the foreground-background binary mask function specifies that as long as the coordinate falls within the foreground region occupied by any first identification frame bbox_1, the corresponding mask output is 1; otherwise, if the coordinate is not in any foreground region, the corresponding mask output is 0;
5) N() is the false positive-background binary mask function: N(i,j) = 1 if the coordinate (i,j) falls within any first false positive region FP, and N(i,j) = 0 otherwise;
here, (i,j) are the coordinates of a pixel point on the teacher sub-feature map; the false positive-background binary mask function specifies that the corresponding mask output is 1 as long as the coordinate is in any first false positive region FP, and otherwise, if the coordinate is not in any false positive region FP, the corresponding mask output is 0;
6) S() is the size mask function: S(i,j) = 1 / (H_bbox1 * W_bbox1) if the coordinate (i,j) falls within the foreground region occupied by a first identification frame bbox_1, and S(i,j) = 1 / N_gb otherwise;
wherein H_bbox1 and W_bbox1 are the depth and width of the corresponding first identification frame bbox_1, and N_gb is the number of background points;
here, (i,j) are the coordinates of a pixel point on the teacher sub-feature map, and the size mask function specifies that, as long as the coordinate is in the foreground region occupied by any first identification frame bbox_1, the corresponding size mask output is the reciprocal of the product of the depth H_bbox1 and the width W_bbox1 of that first identification frame bbox_1; whereas if the coordinate is not in any foreground region, the corresponding size mask output is the reciprocal of the number of background points N_gb;
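The three masks of items 4) to 6) can be sketched on an H × W BEV grid as follows. Box footprints and false positive regions are given here as (i0, i1, j0, j1) grid extents, an illustrative assumption; the patent defines them via the annotated first recognition frames bbox_1 and first false positive regions FP:

```python
import numpy as np

def binary_mask(shape_hw, regions):
    """M() over recognition-box footprints, or N() over FP regions:
    1 inside any listed region, 0 elsewhere."""
    out = np.zeros(shape_hw, dtype=np.float32)
    for i0, i1, j0, j1 in regions:
        out[i0:i1, j0:j1] = 1.0
    return out

def size_mask(shape_hw, boxes):
    """S(): 1/(H_bbox1 * W_bbox1) inside a box footprint, and 1/N_gb
    (reciprocal of the background point count) outside all boxes."""
    out = np.zeros(shape_hw, dtype=np.float32)
    fg = np.zeros(shape_hw, dtype=bool)
    for i0, i1, j0, j1 in boxes:
        out[i0:i1, j0:j1] = 1.0 / ((i1 - i0) * (j1 - j0))
        fg[i0:i1, j0:j1] = True
    out[~fg] = 1.0 / (~fg).sum()
    return out

M = binary_mask((8, 8), [(1, 3, 1, 4)])   # foreground-background mask
N = binary_mask((8, 8), [(5, 7, 5, 7)])   # false positive-background mask
S = size_mask((8, 8), [(1, 3, 1, 4)])     # size mask
```

Note that, following the definition above, the 1/N_gb branch of S() covers every non-foreground point, including false positive regions.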
7) A_s() is the channel vector attention function;
wherein:
t is a preset distillation hyper-parameter;
softmax_s[] is the activation function of the channel vector attention function;
when the input feature F_1 is a teacher feature channel vector, the teacher feature channel vector serves as the corresponding feature component, yielding the channel vector attention of the teacher feature map;
when the input feature F_1 is a projection feature channel vector, the projection feature channel vector serves as the corresponding feature component, yielding the channel vector attention of the projection feature map;
8) A_c() is the sub-feature map attention function;
wherein:
softmax_c[] is the activation function of the sub-feature map attention function;
when the input feature F_2 is a teacher sub-feature map, the teacher sub-feature map serves as the corresponding feature component, yielding the sub-feature map attention of the teacher feature map;
when the input feature F_2 is a projection sub-feature map, the projection sub-feature map serves as the corresponding feature component, yielding the sub-feature map attention of the projection feature map;
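Items 7) and 8) fix only the softmax activations and the distillation hyper-parameter t; one plausible instantiation is a temperature-scaled softmax over the C channel values (for A_s) and over the H × W spatial positions of one sub-feature map (for A_c). The exact normalisation below is an assumption of this sketch, not the patent's formula:

```python
import numpy as np

def channel_vector_attention(vec, t=0.5):
    """Sketch of A_s over a 1 x C feature channel vector:
    temperature-scaled softmax_s across the C channels."""
    z = vec / t
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

def sub_feature_map_attention(fmap, t=0.5):
    """Sketch of A_c over one H x W sub-feature map:
    temperature-scaled softmax_c across the H*W spatial positions."""
    z = fmap.ravel() / t
    e = np.exp(z - z.max())
    return (e / e.sum()).reshape(fmap.shape)

a_s = channel_vector_attention(np.zeros(4))
a_c = sub_feature_map_attention(np.zeros((2, 3)))
```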
here, as is apparent from the above description, the term of the characteristic loss function L_fea weighted by M() is the foreground feature loss branch between the teacher feature map F_T and the student feature map F_S, the term weighted by N() is the false positive feature loss branch, and the remaining term is the background feature loss branch; the characteristic loss function L_fea can improve the positioning accuracy of the student model;
step 312, determining the attention loss function L_att as follows:
wherein η is a preset attention loss hyper-parameter, and L_1 is the L1_Loss function;
here, the attention loss function L_att uses the channel vector attention function and the sub-feature map attention function to compare the attention characteristics of the teacher feature map F_T and the student feature map F_S; the attention loss function L_att can help the student model improve its target recognition precision;
step 313, selecting a preset number Q² of pixel points from each of the teacher feature map F_T and the projection feature map F'_S to form corresponding matching point pairs, with 1 ≤ i′ ≤ Q, 1 ≤ j′ ≤ Q;
wherein each pixel point corresponds, in the teacher feature map F_T, to a row feature tensor of shape 1 × W × C and to a column feature tensor of shape H × 1 × C; each pixel point likewise corresponds, in the projection feature map F'_S, to a row feature tensor of shape 1 × W × C and to a column feature tensor of shape H × 1 × C.
Here, a preset number Q² of pixel points are selected from each of the teacher feature map F_T and the projection feature map F'_S to form the matching point pairs, and the two pixel points of each matching point pair have the same pixel coordinates in the teacher feature map F_T and in the projection feature map F'_S;
for example, if Q = 2, four points with coordinates (1,2), (1,3), (1,4) and (1,5) are selected from the teacher feature map F_T, the four points with the same coordinates are selected from the projection feature map F'_S, and 4 matching point pairs are obtained according to the correspondence of (i′, j′);
step 314, determining the similarity loss function L_aff according to the preset number Q² of matching point pairs as follows:
wherein:
1) ζ is a preset similarity loss hyper-parameter;
2) ‖·‖_smoothl1 is the Smooth_L1_loss function;
3) A_ff() is the similarity function: A_ff(D_i′, D_j′) = (D_i′ · D_j′) / (‖D_i′‖ · ‖D_j′‖);
wherein:
D_i′ and D_j′ are input feature vectors;
when the input feature vectors D_i′ and D_j′ are the row feature tensor and the column feature tensor of a pixel point on the teacher feature map F_T, the similarity function is evaluated on those teacher tensors;
when the input feature vectors D_i′ and D_j′ are the row feature tensor and the column feature tensor of a pixel point on the projection feature map F'_S, the similarity function is evaluated on those projection tensors;
here, the similarity function used in the embodiment of the present invention is a cosine similarity function; the similarity loss function L_aff can help the student model improve its overall model performance;
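Since the text names a cosine similarity, the similarity function A_ff() can be written down directly; the row or column feature tensors are flattened before the dot product:

```python
import numpy as np

def cosine_affinity(d_i, d_j):
    """A_ff(): cosine similarity between two flattened feature tensors."""
    d_i, d_j = np.ravel(d_i), np.ravel(d_j)
    return float(np.dot(d_i, d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j)))

same = cosine_affinity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))  # parallel
orth = cosine_affinity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal
```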
step 32, adding the student model loss function L_det, the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all;
wherein L_all = L_det + L_fea + L_att + L_aff;
Step 4, performing model self-training on the student model according to the first image, the first identification frame set and the student model loss function L_det; after the model self-training is finished, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L_fea; after the teacher-student feature imitation training is finished, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is finished, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is finished, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the overall loss function L_all;
the method comprises the steps of distilling point cloud BEV characteristics of a teacher model to a student model by using a knowledge distillation mechanism, and training the student model; in the training process, the student model is subjected to self-training; and then based on three loss functions (characteristic loss function L) for the knowledge distillation fea Attention loss function L att And similarity loss function L aff ) Gradually training the student model; finally based on the overall loss function L all Carrying out integral training on the student model;
the method specifically comprises the following steps: step 41, according to the first image, the first identification frame set and the student model loss function L det Performing model self-training on the student model;
the method specifically comprises the following steps: step 411, inputting the first image into the student model for gradual operation, and taking a target recognition frame set output by a second target detection head network of the student model as a corresponding second recognition frame set in the operation process;
wherein the second identification frame set comprises a plurality of three-dimensional second identification frames bbox_2; each second identification frame bbox_2 has the shape H_bbox2 * W_bbox2 * Z_bbox2, where H_bbox2, W_bbox2 and Z_bbox2 are the depth, width and height of the second identification frame bbox_2;
step 412, substituting the first and second recognition frame sets into the student model loss function L_det for loss value estimation to generate a corresponding first loss value;
step 413, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 415; if not, go to step 414;
step 414, substituting the model parameters of the student model into the student model loss function L_det to form a corresponding first objective function; solving for the model parameters which minimize the first objective function; updating the model parameters of the student model according to the solving result; and returning to step 411 after updating;
step 415, confirming that the model self-training is completed;
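Steps 411 to 415 form a generic evaluate-check-update loop that the later stages (steps 42 to 45) repeat with their own loss functions; it can be sketched as follows, where loss_fn and update_fn stand in for the student model's forward pass and the gradient-based parameter solve:

```python
def train_until_converged(loss_fn, params, update_fn, converge_at, max_iters=10_000):
    """Generic form of steps 411-415: estimate the loss, stop once it falls
    inside the convergence range, otherwise solve for updated parameters
    and re-run from the loss estimation step."""
    loss = loss_fn(params)
    for _ in range(max_iters):
        loss = loss_fn(params)           # steps 411-412: run model, estimate loss
        if loss <= converge_at:          # step 413/415: convergence reached
            break
        params = update_fn(params)       # step 414: update and return to 411
    return params, loss

# Toy usage: drive a scalar "parameter" toward zero loss.
p, final = train_until_converged(lambda p: p * p, 4.0, lambda p: 0.5 * p, 0.01)
```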
step 42, after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L_fea;
the method specifically comprises the following steps: step 421, inputting the first point cloud into the teacher model for gradual operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as the corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
step 422, substituting the teacher feature map F_T, the projection feature map F'_S, the first recognition frame set and the first false positive region set into the characteristic loss function L_fea for loss value estimation to generate a corresponding second loss value;
step 423, identifying whether the second loss value meets a preset second loss convergence range; if yes, go to step 425; if not, go to step 424;
step 424, substituting the model parameters of the student model into the characteristic loss function L_fea to form a corresponding second objective function; solving for the model parameters which minimize the second objective function; updating the model parameters of the student model according to the solving result; and returning to step 421 after updating;
step 425, confirming that the teacher and student characteristic imitation training is finished;
step 43, after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att;
the method specifically comprises the following steps: step 431, inputting the first point cloud into the teacher model for gradual operation, and extracting an output characteristic diagram of a first BEV encoder of the teacher model as a corresponding teacher characteristic diagram FT in the operation process; inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
step 432, substituting the teacher feature map F_T and the projection feature map F'_S into the attention loss function L_att for loss value estimation to generate a corresponding third loss value;
step 433, identifying whether the third loss value meets a preset third loss convergence range; if yes, go to step 435; if not, go to step 434;
step 434, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters which minimize the third objective function; updating the model parameters of the student model according to the solving result; and returning to step 431 after updating;
step 435, confirming completion of the teacher and student attention imitation training;
step 44, after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff;
the method specifically comprises the following steps: step 441, inputting the first point cloud into the teacher model for gradual operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as a corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
step 442, substituting the teacher feature map F_T and the projection feature map F'_S into the similarity loss function L_aff for loss value estimation to generate a corresponding fourth loss value;
step 443, identifying whether the fourth loss value meets a preset fourth loss convergence range; if yes, go to step 445; if not, go to step 444;
step 444, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters which minimize the fourth objective function; updating the model parameters of the student model according to the solving result; and returning to step 441 after updating;
step 445, confirming that the teacher-student similarity training is completed;
step 45, after the teacher-student similarity training is finished, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the overall loss function L_all;
the method specifically comprises the following steps: step 451, inputting the first point cloud into the teacher model to perform step-by-step operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as the corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S And a target recognition frame set output by a second target detection head network of the student model is used as a corresponding third recognition frame set; and based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagramWherein the third identification frame set comprises a plurality of three-dimensional third identification frames bbox 3 (ii) a Third identification frame bbox 3 Is H in shape bbox3 *W bbox3 *Z bbox3 ,H bbox3 、W bbox3 、Z bbox3 As a third identification frame bbox 3 Depth, width and height of;
step 452, substituting the first recognition frame set, the third recognition frame set, the teacher feature map F_T, the projection feature map F'_S and the first false positive region set into the overall loss function L_all for loss value estimation to generate a corresponding overall loss value;
step 453, identifying whether the overall loss value satisfies a preset overall loss convergence range; if yes, go to step 455; if not, go to step 454;
step 454, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters which minimize the overall objective function; updating the model parameters of the student model according to the solving result; and returning to step 451 after updating;
at step 455, the overall training is confirmed to be complete.
And step 5, after the overall training is finished, selecting a new group of original point cloud, original image, identification frame marking information of the original point cloud and false positive area marking information from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches the specified number.
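The overall schedule of steps 4 and 5 can be sketched as an outer loop over training records, with the five per-record training stages (self-training, feature imitation, attention imitation, similarity training, overall training) elided to callables; record drawing by round index is an illustrative assumption:

```python
def train_rounds(dataset, stages, num_rounds):
    """Sketch of steps 4-5: each round draws one training record (point
    cloud, image, recognition-frame and false-positive annotations) and
    runs the five training stages on it in order."""
    for r in range(num_rounds):
        record = dataset[r % len(dataset)]
        for stage in stages:             # e.g. [self_train, feat, att, aff, overall]
            stage(record)

calls = []
train_rounds(["rec0", "rec1"],
             [lambda rec: calls.append(("self", rec)),
              lambda rec: calls.append(("overall", rec))],
             3)
```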
Fig. 2 is a block diagram of an apparatus for performing distillation training on a three-dimensional object detection model based on cross-modal knowledge according to a second embodiment of the present invention, where the apparatus is a terminal device or a server for implementing the foregoing method embodiment, and may also be an apparatus capable of enabling the foregoing terminal device or server to implement the foregoing method embodiment, and for example, the apparatus may be an apparatus or a chip system of the foregoing terminal device or server. As shown in fig. 2, the apparatus includes: an acquisition module 201, a training data processing module 202, a loss function processing module 203, and a training processing module 204.
The acquisition module 201 is configured to acquire a mature, already-trained point cloud-based 3D target detection model as the corresponding teacher model, and to acquire an image-based 3D target detection model to be trained as the corresponding student model; and to obtain the model loss function of the student model as the corresponding student model loss function L_det.
The training data processing module 202 is configured to obtain an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a corresponding first image; and acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set.
The loss function processing module 203 is configured to determine the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and to add the student model loss function L_det, the characteristic loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff.
The training processing module 204 is configured to perform model self-training on the student model according to the first image, the first recognition frame set and the student model loss function L_det; after the model self-training is finished, perform teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L_fea; after the teacher-student feature imitation training is finished, perform teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is finished, perform teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is finished, perform overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the overall loss function L_all; and, after the overall training is finished, select a new group of original point cloud, original image, identification frame marking information of the original point cloud and false positive area marking information from the training data set to perform the next round of training on the student model until the total number of training rounds reaches the specified number.
The device for training the three-dimensional target detection model based on cross-modal knowledge distillation provided by the embodiment of the invention can execute the method steps in the method embodiment, and the implementation principle and the technical effect are similar, so that the detailed description is omitted.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or can be implemented in the form of hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the obtaining module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a processing element of the apparatus calls and executes the functions of the determining module. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), or one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), etc. For another example, when some of the above modules are implemented in the form of a Processing element scheduler code, the Processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or other processor that can invoke the program code. As another example, these modules may be integrated together and implemented in the form of a System-on-a-chip (SOC).
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the foregoing method embodiments are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, bluetooth, microwave, etc.).
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. The electronic device may be the aforementioned terminal device or server, or a terminal device or server that is connected to them and implements the method of the embodiment of the present invention. As shown in fig. 3, the electronic device may include: a processor 301 (e.g., a CPU), a memory 302, and a transceiver 303. The transceiver 303 is coupled to the processor 301, and the processor 301 controls the transceiving operation of the transceiver 303. The memory 302 may store various instructions for performing various processing functions and implementing the processing steps described in the foregoing method embodiments. Preferably, the electronic device according to an embodiment of the present invention further includes: a power supply 304, a system bus 305, and a communication port 306. The system bus 305 is used to implement communication connections between the elements. The communication port 306 is used for connection and communication between the electronic device and other peripherals.
The system bus 305 mentioned in fig. 3 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used to implement communication between the database access device and other equipment (such as a client, a read-write library, and a read-only library). The memory may include Random Access Memory (RAM) and may also include non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Graphics Processing Unit (GPU), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be noted that the embodiment of the present invention also provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the method and the processing procedure provided in the above-mentioned embodiment.
The embodiment of the present invention further provides a chip for executing the instructions, where the chip is configured to execute the processing steps described in the foregoing method embodiment.
The embodiments of the present invention provide a method, an apparatus, an electronic device, and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation. A well-trained point cloud-based 3D target detection model is taken as the teacher model and an image-based 3D target detection model as the student model; a knowledge distillation mechanism is used to distill the point cloud BEV (bird's-eye view) features of the teacher model to the student model, training the student model to learn, from image data, depth features similar to those of the point cloud data. The method and apparatus help the image 3D target detection model reduce depth estimation errors and improve its detection accuracy.
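The overall training objective described above combines the student's own detection loss with three distillation losses. The following is a minimal sketch of that combination; the placeholder loss values and the simple unweighted sum follow the patent's formula L_all = L_det + L_fea + L_att + L_aff, while everything else here is illustrative.

```python
# Minimal sketch of the overall distillation objective L_all = L_det + L_fea + L_att + L_aff.
# The individual loss values below are placeholders; in the patent each term is computed
# from the teacher/student BEV feature maps and the student's detection outputs.

def overall_loss(l_det, l_fea, l_att, l_aff):
    """Sum the student detection loss and the three distillation losses."""
    return l_det + l_fea + l_att + l_aff

# Example with illustrative loss values:
l_all = overall_loss(l_det=0.8, l_fea=0.3, l_att=0.1, l_aff=0.05)
print(round(l_all, 2))  # 1.25
```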
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the components and steps of the various examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above specific embodiments further describe the objects, technical solutions, and advantages of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, or the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.
Claims (13)
1. A method for training a three-dimensional target detection model based on cross-modal knowledge distillation, the method comprising:
obtaining a well-trained point cloud-based 3D target detection model as a corresponding teacher model, and obtaining an image-based 3D target detection model to be trained as a corresponding student model; and obtaining the model loss function of the student model as a corresponding student model loss function L_det;
Acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
determining a feature loss function L_fea, an attention loss function L_att, and a similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form a corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
performing model self-training on the student model according to the first image, the first identification frame set, and the student model loss function L_det; after the model self-training is completed, performing teacher-student feature simulation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the feature loss function L_fea; after the teacher-student feature simulation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the overall loss function L_all;
and after the overall training is completed, selecting a new group of original point clouds, original images, identification frame marking information, and false positive region marking information of the original point clouds from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches a specified number.
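The staged training schedule of claim 1 (self-training, then feature simulation, then attention imitation, then similarity, then overall training, round after round) can be sketched as follows. The stage names and the `run_stage` callback are illustrative assumptions; each real stage is itself a convergence-driven inner loop as detailed in claims 6-10.

```python
# Hypothetical sketch of the per-round training schedule from claim 1.
# run_stage stands in for one convergence-driven inner training loop.

def train_round(stages, run_stage):
    """Run the five stages of one training round in order and record completion."""
    completed = []
    for stage in stages:
        run_stage(stage)          # inner loop until this stage's loss converges
        completed.append(stage)
    return completed

STAGES = ["self", "feature", "attention", "similarity", "overall"]

log = []
result = train_round(STAGES, log.append)
print(result)  # ['self', 'feature', 'attention', 'similarity', 'overall']
```

A full training run would call `train_round` once per round until the specified total number of rounds is reached.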
2. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 1, wherein:
the teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder, and a first target detection head network; the output of the point cloud pillar feature network is connected with the input of the BEV pooling network; the output of the BEV pooling network is connected with the input of the first BEV encoder; the output of the first BEV encoder is connected with the input of the first target detection head network;
the student model comprises an image encoder, an LSS view converter, a second BEV encoder, and a second target detection head network; the output of the image encoder is connected with the input of the LSS view converter; the output of the LSS view converter is connected with the input of the second BEV encoder; the output of the second BEV encoder is connected with the input of the second target detection head network;
the first identification frame set comprises a plurality of three-dimensional first identification frames bbox_1; the shape of the first identification frame bbox_1 is H_bbox1 × W_bbox1 × Z_bbox1, where H_bbox1, W_bbox1, and Z_bbox1 are the depth, width, and height of the first identification frame bbox_1;
the first set of false positive regions comprises a plurality of first false positive regions FP.
3. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 2, wherein the determining of the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model specifically comprises:
step 31, determining the feature loss function L_fea as:
wherein:
α, β, and γ are preset loss coefficients, respectively;
F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, and C are the height, width, and channel dimension of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H × W one-dimensional teacher feature channel vectors of shape 1 × C, can also be decomposed into C teacher sub-feature maps of two-dimensional shape H × W, and can also be decomposed into C × H × W individual teacher feature data;
F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is a projection function from the student model BEV space to the teacher model BEV space; the projection feature map corresponding to the student feature map F_S is consistent with the teacher feature map F_T in channel dimension; the projection feature map can be decomposed into H × W one-dimensional projection feature channel vectors of shape 1 × C, can also be decomposed into C projection sub-feature maps of two-dimensional shape H × W, and can also be decomposed into C × H × W individual projection feature data;
M() is a foreground-background binary mask function;
N() is a false positive-background binary mask function;
S() is a size mask function;
A_s() is a channel vector attention function;
A_c() is a sub-feature map attention function;
the feature loss function comprises a foreground feature loss branch, a false positive feature loss branch, and a background feature loss branch, each computed between the teacher feature map F_T and the student feature map F_S;
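A minimal sketch of such a mask-partitioned feature loss is shown below: the BEV grid is split into foreground (inside ground-truth boxes), false-positive, and background regions, and each region's teacher-student discrepancy is weighted by α, β, or γ. The squared-error form and the shapes are illustrative assumptions; the patent's exact formula (including its attention weighting terms) is not reproduced here.

```python
# Hedged sketch of a three-branch (foreground / false-positive / background)
# masked feature loss between teacher and student BEV feature maps.

def masked_feature_loss(ft, fs, fg_mask, fp_mask, alpha=1.0, beta=1.0, gamma=1.0):
    """ft, fs: H x W lists of teacher/student feature values; masks: H x W of 0/1."""
    loss_fg = loss_fp = loss_bg = 0.0
    for i in range(len(ft)):
        for j in range(len(ft[0])):
            err = (ft[i][j] - fs[i][j]) ** 2
            if fg_mask[i][j]:          # cell inside a ground-truth box
                loss_fg += err
            elif fp_mask[i][j]:        # cell inside an annotated false-positive region
                loss_fp += err
            else:                      # remaining background cells
                loss_bg += err
    return alpha * loss_fg + beta * loss_fp + gamma * loss_bg

ft = [[1.0, 2.0], [3.0, 4.0]]
fs = [[1.0, 1.0], [3.0, 2.0]]
fg = [[1, 0], [0, 0]]   # top-left cell is foreground
fp = [[0, 1], [0, 0]]   # top-right cell is a false-positive region
print(masked_feature_loss(ft, fs, fg, fp, alpha=2.0, beta=1.0, gamma=0.5))  # 3.0
```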
step 32, determining the attention loss function L_att as:
wherein:
η is a preset attention loss hyper-parameter;
L1 is the L1_Loss function;
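The general shape of an L1 attention-imitation loss can be sketched as follows: the student's attention map is pulled toward the teacher's under an L1 distance scaled by the hyper-parameter η. Representing each attention map as a flat list of weights is an assumption for illustration.

```python
# Sketch of an L1 attention-imitation loss scaled by a preset hyper-parameter eta.

def l1_loss(a, b):
    """Mean absolute difference between two equal-length attention maps."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def attention_loss(att_teacher, att_student, eta=0.5):
    """L_att-style loss: eta * L1(teacher attention, student attention)."""
    return eta * l1_loss(att_teacher, att_student)

print(attention_loss([0.2, 0.3, 0.5], [0.1, 0.4, 0.5], eta=2.0))
```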
step 33, arbitrarily selecting a preset number Q_2 of pixel points from the teacher feature map F_T and the projection feature map to form corresponding matching point pairs;
wherein:
each pixel point corresponds, in the teacher feature map F_T, to a row feature tensor of shape 1 × W × C and also to a column feature tensor of shape H × 1 × C; each pixel point likewise corresponds, in the projection feature map, to a row feature tensor of shape 1 × W × C and also to a column feature tensor of shape H × 1 × C;
step 34, determining, according to the preset number Q_2 of matching point pairs, the similarity loss function L_aff as:
wherein:
ζ is a preset similarity loss hyper-parameter;
‖·‖_smoothl1 is the Smooth_L1_loss function;
A_ff() is a similarity function.
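The similarity distillation step above can be sketched as follows: for each matched pixel pair, the affinity A_ff between the row and column feature tensors is computed in both the teacher map and the projected student map, and the gap between the two affinities is penalized with a Smooth L1 loss scaled by ζ. Using cosine similarity as A_ff is an assumption consistent with the cosine-style function described in claim 5.

```python
# Hedged sketch of a row/column affinity distillation loss with Smooth L1.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def smooth_l1(x):
    """Standard Smooth L1 (Huber with beta = 1)."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5

def similarity_loss(pairs_teacher, pairs_student, zeta=1.0):
    """Each element of pairs_* is a (row_tensor, col_tensor) tuple for one pixel."""
    total = 0.0
    for (rt, ct), (rs, cs) in zip(pairs_teacher, pairs_student):
        total += smooth_l1(cosine(rt, ct) - cosine(rs, cs))
    return zeta * total / len(pairs_teacher)

t_pairs = [([1.0, 0.0], [1.0, 0.0])]   # teacher affinity = 1.0
s_pairs = [([1.0, 0.0], [0.0, 1.0])]   # student affinity = 0.0
print(similarity_loss(t_pairs, s_pairs))  # 0.5
```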
4. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein:
the foreground-background binary mask function M() is:
the false positive-background binary mask function N() is:
the size mask function S() is:
wherein:
H_bbox1 and W_bbox1 are the depth and width of the corresponding first identification frame bbox_1;
the channel vector attention function A_s() is:
wherein:
t is a preset distillation hyper-parameter;
softmax_s[] is the activation function of the channel vector attention function;
F_1 is an input feature comprising C feature components; when the input feature F_1 is the teacher feature channel vector, the teacher feature channel vector serves as the corresponding feature component; when the input feature F_1 is the projection feature channel vector, the projection feature channel vector serves as the corresponding feature component;
the sub-feature map attention function A_c() is:
wherein:
softmax_c[] is the activation function of the sub-feature map attention function;
F_2 is an input feature comprising H × W feature components; when the input feature F_2 is the teacher sub-feature map, the teacher sub-feature map serves as the corresponding feature component; when the input feature F_2 is the projection sub-feature map, the projection sub-feature map serves as the corresponding feature component.
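Both attention functions above are temperature-scaled softmaxes over feature components. The sketch below shows that general shape; reducing each component to its mean absolute value before the softmax is an assumption, since the patent's exact formulas are given only as figures in the original.

```python
# Sketch of a temperature-scaled softmax attention over per-channel feature components,
# the general shape of A_s (channel vector attention) and A_c (sub-feature map attention).
import math

def softmax(xs, t=1.0):
    """Softmax over xs with distillation temperature t."""
    exps = [math.exp(x / t) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def channel_attention(feature_components, t=1.0):
    """Weight each component by the softmax of its mean absolute value (assumption)."""
    energies = [sum(abs(v) for v in comp) / len(comp) for comp in feature_components]
    return softmax(energies, t)

w = channel_attention([[1.0, 1.0], [3.0, 3.0]], t=1.0)
print(w)
assert abs(sum(w) - 1.0) < 1e-9   # attention weights form a distribution
```

A larger temperature t flattens the distribution, which is the usual role of the distillation hyper-parameter.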
5. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein:
the similarity function A_ff() is:
wherein D_i′ and D_j′ are input feature vectors;
when the input feature vectors D_i′ and D_j′ are the row feature tensor and the column feature tensor of a pixel point in the teacher feature map F_T, the similarity function is specifically:
6. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of model self-training on the student model according to the first image, the first identification frame set, and the student model loss function L_det specifically comprises:
step 61, inputting the first image into the student model for stepwise operation, and during the operation taking the target identification frame set output by the second target detection head network of the student model as a corresponding second identification frame set; the second identification frame set comprises a plurality of three-dimensional second identification frames bbox_2; the shape of the second identification frame bbox_2 is H_bbox2 × W_bbox2 × Z_bbox2, where H_bbox2, W_bbox2, and Z_bbox2 are the depth, width, and height of the second identification frame bbox_2;
step 62, substituting the first and second identification frame sets into the student model loss function L_det to estimate a loss value and generate a corresponding first loss value;
step 63, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 65; if not, go to step 64;
step 64, substituting the model parameters of the student model into the student model loss function L_det to form a corresponding first objective function; solving for the model parameters that minimize the first objective function; updating the model parameters of the student model according to the solving result; and returning to step 61 after the update;
and step 65, confirming that the model self-training is completed.
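Steps 61-65 describe a convergence-driven loop: forward pass, loss evaluation, convergence check, parameter update, repeat. The same pattern recurs in claims 7-10 with different losses. A minimal sketch is shown below; the quadratic toy objective and the learning rate are assumptions standing in for the real minimization of L_det over the student model's parameters.

```python
# Sketch of the convergence-driven inner loop of steps 61-65.

def self_train(param, loss_fn, grad_fn, lr=0.1, tol=1e-3, max_iter=1000):
    """Gradient-descent loop mirroring steps 61-64; stops when loss <= tol (step 63)."""
    for _ in range(max_iter):
        loss = loss_fn(param)                # steps 61-62: forward pass + loss value
        if loss <= tol:                      # step 63: convergence check
            return param, loss               # step 65: self-training complete
        param = param - lr * grad_fn(param)  # step 64: update model parameters
    return param, loss_fn(param)

# Toy objective (p - 2)^2 standing in for L_det:
p, final_loss = self_train(0.0, lambda p: (p - 2.0) ** 2, lambda p: 2.0 * (p - 2.0))
print(round(p, 2), final_loss <= 1e-3)
```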
7. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of teacher-student feature simulation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the feature loss function L_fea specifically comprises:
step 71, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, and during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map;
step 72, substituting the teacher feature map F_T, the projection feature map, the first identification frame set, and the first false positive region set into the feature loss function L_fea to estimate a loss value and generate a corresponding second loss value;
step 73, identifying whether the second loss value meets a preset second loss convergence range; if yes, go to step 75; if not, go to step 74;
step 74, substituting the model parameters of the student model into the feature loss function L_fea to form a corresponding second objective function; solving for the model parameters that minimize the second objective function; updating the model parameters of the student model according to the solving result; and returning to step 71 after the update;
and step 75, confirming that the teacher-student feature simulation training is completed.
8. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att specifically comprises:
step 81, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, and during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map;
step 82, substituting the teacher feature map F_T and the projection feature map into the attention loss function L_att to estimate a loss value and generate a corresponding third loss value;
step 83, identifying whether the third loss value meets a preset third loss convergence range; if yes, go to step 85; if not, go to step 84;
step 84, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters that minimize the third objective function; updating the model parameters of the student model according to the solving result; and returning to step 81 after the update;
and step 85, confirming that the teacher-student attention imitation training is finished.
9. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff specifically comprises:
step 91, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, and during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map;
step 92, substituting the teacher feature map F_T and the projection feature map into the similarity loss function L_aff to estimate a loss value and generate a corresponding fourth loss value;
step 93, identifying whether the fourth loss value meets a preset fourth loss convergence range; if yes, go to step 95; if not, go to step 94;
step 94, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters that minimize the fourth objective function; updating the model parameters of the student model according to the solving result; and returning to step 91 after the update;
and step 95, confirming that the teacher-student similarity training is completed.
10. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the performing of overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the overall loss function L_all specifically comprises:
step 101, inputting the first point cloud into the teacher model for stepwise operation, and during the operation extracting the output feature map of the first BEV encoder of the teacher model as the corresponding teacher feature map F_T; inputting the first image into the student model for stepwise operation, during the operation extracting the output feature map of the second BEV encoder of the student model as the corresponding student feature map F_S, and taking the target identification frame set output by the second target detection head network of the student model as a corresponding third identification frame set; and, based on the projection function f_proj(), projecting the student feature map F_S from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map; the third identification frame set comprises a plurality of three-dimensional third identification frames bbox_3; the shape of the third identification frame bbox_3 is H_bbox3 × W_bbox3 × Z_bbox3, where H_bbox3, W_bbox3, and Z_bbox3 are the depth, width, and height of the third identification frame bbox_3;
step 102, substituting the first and third identification frame sets, the teacher feature map F_T, the projection feature map, and the first false positive region set into the overall loss function L_all to estimate a loss value and generate a corresponding overall loss value;
step 103, identifying whether the overall loss value meets a preset overall loss convergence range; if yes, go to step 105; if not, go to step 104;
step 104, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters that minimize the overall objective function; updating the model parameters of the student model according to the solving result; and returning to step 101 after the update;
and step 105, confirming that the overall training is completed.
11. An apparatus for performing the method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to any one of claims 1-10, the apparatus comprising: an acquisition module, a training data processing module, a loss function processing module, and a training processing module;
the acquisition module is used for acquiring a well-trained point cloud-based 3D target detection model as a corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as a corresponding student model; and obtaining the model loss function of the student model as a corresponding student model loss function L_det;
The training data processing module is used for acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and a corresponding first image; acquiring identification frame marking information and false positive area marking information of the original point cloud as a corresponding first identification frame set and a first false positive area set;
the loss function processing module is used for determining a feature loss function L_fea, an attention loss function L_att, and a similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att, and the similarity loss function L_aff to form a corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;
the training processing module is used for performing model self-training on the student model according to the first image, the first identification frame set, and the student model loss function L_det; after the model self-training is completed, performing teacher-student feature simulation training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the feature loss function L_fea; after the teacher-student feature simulation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image, and the attention loss function L_att; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image, and the similarity loss function L_aff; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first identification frame set, the first false positive region set, and the overall loss function L_all; and after the overall training is completed, selecting a new group of original point clouds, original images, identification frame marking information, and false positive region marking information of the original point clouds from the training data set to perform the next round of training on the student model until the total number of training rounds reaches the specified number.
12. An electronic device, comprising: a memory, a processor, and a transceiver;
the processor is configured to be coupled with the memory, and to read and execute the instructions in the memory to implement the method steps of any one of claims 1-10;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
13. A computer-readable storage medium having stored thereon computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211296868.1A CN115690708A (en) | 2022-10-21 | 2022-10-21 | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211296868.1A CN115690708A (en) | 2022-10-21 | 2022-10-21 | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115690708A true CN115690708A (en) | 2023-02-03 |
Family
ID=85066272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211296868.1A Pending CN115690708A (en) | 2022-10-21 | 2022-10-21 | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115690708A (en) |
-
2022
- 2022-10-21 CN CN202211296868.1A patent/CN115690708A/en active Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116028891A (en) * | 2023-02-16 | 2023-04-28 | 之江实验室 | Industrial anomaly detection model training method and device based on multi-model fusion |
CN116229210A (en) * | 2023-02-23 | 2023-06-06 | 南通探维光电科技有限公司 | Target detection model training method, device, equipment and medium |
CN116229210B (en) * | 2023-02-23 | 2023-10-24 | 南通探维光电科技有限公司 | Target detection model training method, device, equipment and medium |
CN116341650A (en) * | 2023-03-23 | 2023-06-27 | 哈尔滨市科佳通用机电股份有限公司 | Noise self-training-based railway wagon bolt loss detection method |
CN116341650B (en) * | 2023-03-23 | 2023-12-26 | 哈尔滨市科佳通用机电股份有限公司 | Noise self-training-based railway wagon bolt loss detection method |
CN117097797A (en) * | 2023-10-19 | 2023-11-21 | 浪潮电子信息产业股份有限公司 | Cloud edge end cooperation method, device and system, electronic equipment and readable storage medium |
CN117097797B (en) * | 2023-10-19 | 2024-02-09 | 浪潮电子信息产业股份有限公司 | Cloud edge end cooperation method, device and system, electronic equipment and readable storage medium |
CN117351450A (en) * | 2023-12-06 | 2024-01-05 | 吉咖智能机器人有限公司 | Monocular 3D detection method and device, electronic equipment and storage medium |
CN117351450B (en) * | 2023-12-06 | 2024-02-27 | 吉咖智能机器人有限公司 | Monocular 3D detection method and device, electronic equipment and storage medium |
CN117523549A (en) * | 2024-01-04 | 2024-02-06 | 南京邮电大学 | Three-dimensional point cloud object identification method based on deep and wide knowledge distillation |
CN117523549B (en) * | 2024-01-04 | 2024-03-29 | 南京邮电大学 | Three-dimensional point cloud object identification method based on deep and wide knowledge distillation |
Similar Documents
Publication | Title |
---|---|
CN115690708A (en) | Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation |
CN109816725B (en) | Monocular camera object pose estimation method and device based on deep learning |
CN109960742B (en) | Local information searching method and device |
CN110866871A (en) | Text image correction method and device, computer equipment and storage medium |
CN111723691B (en) | Three-dimensional face recognition method and device, electronic equipment and storage medium |
CN113888689A (en) | Image rendering model training method, image rendering method and image rendering device |
CN109948441B (en) | Model training method, image processing method, device, electronic equipment and computer readable storage medium |
CN107767358B (en) | Method and device for determining ambiguity of object in image |
CN107766864B (en) | Method and device for extracting features and method and device for object recognition |
CN112200057A (en) | Face living body detection method and device, electronic equipment and storage medium |
CN112102424A (en) | License plate image generation model construction method, generation method and device |
CN114219855A (en) | Point cloud normal vector estimation method and device, computer equipment and storage medium |
CN113095333A (en) | Unsupervised feature point detection method and unsupervised feature point detection device |
CN111260794B (en) | Outdoor augmented reality application method based on cross-source image matching |
CN113129425A (en) | Face image three-dimensional reconstruction method, storage medium and terminal device |
CN112929626A (en) | Three-dimensional information extraction method based on smartphone image |
CN112990183A (en) | Method, system and device for extracting homonymous strokes of offline handwritten Chinese characters |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium |
CN113706472A (en) | Method, device and equipment for detecting road surface diseases and storage medium |
CN110533663B (en) | Image parallax determining method, device, equipment and system |
CN111723688A (en) | Human body action recognition result evaluation method and device and electronic equipment |
CN116597246A (en) | Model training method, target detection method, electronic device and storage medium |
CN116206302A (en) | Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium |
CN115035193A (en) | Bulk grain random sampling method based on binocular vision and image segmentation technology |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |