CN115690708A - Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation

Publication number: CN115690708A
Application number: CN202211296868.1A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Inventors: 杨晓东, 王泽宇, 罗晨旭
Applicant and current assignee: Suzhou Qingyu Technology Co Ltd

Abstract

The embodiment of the invention relates to a method and a device for training a three-dimensional target detection model based on cross-modal knowledge distillation. The method comprises the following steps: obtaining a teacher model, a student model and a student model loss function; acquiring training data from a training data set; determining the feature loss, attention loss and similarity loss functions for knowledge distillation from the teacher model to the student model, the student model loss function, feature loss function, attention loss function and similarity loss function together forming an overall loss function; performing model self-training on the student model; performing teacher-student feature imitation training; performing teacher-student attention imitation training; performing teacher-student similarity training; performing overall training; and, after the training is finished, selecting a new group of training data from the training data set for the next round of training of the student model, until a specified number of rounds is reached. The method and the device improve the detection accuracy of an image-based 3D target detection model.

Description

Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for training a three-dimensional target detection model based on cross-modal knowledge distillation.
Background
The sensing module of an automatic driving system detects targets from sensor data. Mainstream sensors currently include cameras, lidar and radar, and common 3D target detection models fall into two main types: neural network models that take a point cloud as input, and neural network models that take an image as input. Each type has its own advantages and disadvantages. 1) A point cloud carries accurate distance (depth) information, so a point-cloud-based 3D target detection model achieves high detection accuracy for nearby targets; however, point clouds are sparse, so the detection accuracy of a point-cloud-based 3D target detection model for distant targets is poor. 2) Visual information in an image is uniformly distributed with high information density, so an image-based 3D target detection model can recognize both near and far targets with high accuracy; however, an image carries no depth information, so depth must be estimated by the model, and this estimation introduces large detection errors, meaning that the localization accuracy of the target recognition boxes (bbox) output by an image-based 3D target detection model has always been insufficient.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method, a device, an electronic device and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation. A well-trained point-cloud-based 3D target detection model serves as the Teacher Model, and an image-based 3D target detection model serves as the Student Model; a Knowledge Distillation mechanism distills the point cloud BEV (bird's eye view) features of the teacher model into the student model during training, helping the student model learn depth features similar to those of point cloud data from image data. The method and device help the image-based 3D target detection model reduce its depth estimation error, thereby improving its detection accuracy.
In order to achieve the above object, a first aspect of the embodiments of the present invention provides a method for training a three-dimensional target detection model based on cross-modal knowledge distillation, the method including:
acquiring a well-trained point-cloud-based 3D target detection model as the corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as the corresponding student model; and obtaining the model loss function of the student model as the corresponding student model loss function $L_{det}$;
acquiring an original point cloud and an original image of the same scene from a training data set as the corresponding first point cloud and first image; and acquiring the recognition box annotation information and false positive region annotation information of the original point cloud as the corresponding first recognition box set and first false positive region set;
determining the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ for knowledge distillation from the teacher model to the student model; and summing the student model loss function $L_{det}$, the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ to form the corresponding overall loss function $L_{all} = L_{det} + L_{fea} + L_{att} + L_{aff}$;
performing model self-training on the student model according to the first image, the first recognition box set and the student model loss function $L_{det}$; after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the feature loss function $L_{fea}$; after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function $L_{att}$; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function $L_{aff}$; and after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the overall loss function $L_{all}$;
and, after the overall training is completed, selecting a new group comprising an original point cloud, an original image, and the recognition box annotation information and false positive region annotation information of that point cloud from the training data set for the next round of training of the student model, until the total number of training rounds reaches the specified number.
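For illustration only (not part of the claimed method), the schedule above can be sketched in PyTorch-style Python; `dataset`, `stages` and the stage callables are hypothetical names standing in for the five training stages just described:

```python
# Illustrative sketch of the training schedule: each round runs the five stages
# (self-training, feature imitation, attention imitation, similarity training,
# overall training) on one group of training data, then draws a new group.
def train(student, teacher, dataset, stages, num_rounds):
    for rnd in range(num_rounds):                # until the specified number of rounds
        sample = dataset[rnd % len(dataset)]     # point cloud, image, boxes, FP regions
        for run_stage in stages:                 # ordered list of stage callables
            run_stage(student, teacher, sample)
```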
Preferably, the teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder and a first target detection head network; the output of the point cloud pillar feature network is connected to the input of the BEV pooling network; the output of the BEV pooling network is connected to the input of the first BEV encoder; and the output of the first BEV encoder is connected to the input of the first target detection head network.
The student model comprises an image encoder, an LSS view converter, a second BEV encoder and a second target detection head network; the output of the image encoder is connected to the input of the LSS view converter; the output of the LSS view converter is connected to the input of the second BEV encoder; and the output of the second BEV encoder is connected to the input of the second target detection head network.
The first recognition box set comprises a plurality of three-dimensional first recognition boxes $bbox_1$; each first recognition box $bbox_1$ has shape $H_{bbox1} \times W_{bbox1} \times Z_{bbox1}$, where $H_{bbox1}$, $W_{bbox1}$ and $Z_{bbox1}$ are its depth, width and height.
The first false positive region set comprises a plurality of first false positive regions FP.
Preferably, the determination of the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ for knowledge distillation from the teacher model to the student model specifically comprises the following steps:
step 31, determining the feature loss function $L_{fea}$ as:

$$
\begin{aligned}
L_{fea} ={}& \alpha \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} M(i,j)\,S(i,j)\,A_s(f^T_{i,j})_k\,A_c(F^T_k)_{i,j}\,\big(f^T_{k,i,j}-\hat{f}^S_{k,i,j}\big)^2 \\
&+ \beta \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} N(i,j)\,S(i,j)\,A_s(f^T_{i,j})_k\,A_c(F^T_k)_{i,j}\,\big(f^T_{k,i,j}-\hat{f}^S_{k,i,j}\big)^2 \\
&+ \gamma \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} \big(1-M(i,j)\big)\big(1-N(i,j)\big)\,S(i,j)\,\big(f^T_{k,i,j}-\hat{f}^S_{k,i,j}\big)^2
\end{aligned}
$$

wherein α, β and γ are respectively preset loss coefficients;
$F^T$ is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, C are the height, width and channel dimensions of the teacher feature map $F^T$; the teacher feature map $F^T$ can be decomposed into H×W one-dimensional teacher feature channel vectors $f^T_{i,j}$ of shape 1×C, into C two-dimensional teacher sub-feature maps $F^T_k$ of shape H×W, or into C·H·W teacher feature data $f^T_{k,i,j}$, with 1≤k≤C, 1≤i≤H, 1≤j≤W;
$F^S$ is the student feature map output by the second BEV encoder of the student model; $f_{proj}()$ is the projection function from the student model BEV space to the teacher model BEV space; $\hat{F}^S = f_{proj}(F^S)$ is the projection feature map corresponding to the student feature map $F^S$; the projection feature map $\hat{F}^S$ has the same height, width and channel dimensions as the teacher feature map $F^T$, and can likewise be decomposed into H×W one-dimensional projection feature channel vectors $\hat{f}^S_{i,j}$ of shape 1×C, into C two-dimensional projection sub-feature maps $\hat{F}^S_k$ of shape H×W, or into C·H·W projection feature data $\hat{f}^S_{k,i,j}$;
M() is the foreground-background binary mask function; N() is the false positive-background binary mask function; S() is the size mask function; $A_s()$ is the channel vector attention function; $A_c()$ is the sub-feature map attention function;
the first summation term is the foreground feature loss branch of the teacher feature map $F^T$ and the student feature map $F^S$, the second is their false positive feature loss branch, and the third is their background feature loss branch;
step 32, determining the attention loss function $L_{att}$ as:

$$
L_{att} = \eta \left( \sum_{i=1}^{H}\sum_{j=1}^{W} L_1\big(A_s(f^T_{i,j}),\,A_s(\hat{f}^S_{i,j})\big) + \sum_{k=1}^{C} L_1\big(A_c(F^T_k),\,A_c(\hat{F}^S_k)\big) \right)
$$

wherein η is a preset attention loss hyper-parameter and $L_1$ is the L1_Loss function;
step 33, selecting a preset quantity $Q^2$ of pixel points from the teacher feature map $F^T$ and the projection feature map $\hat{F}^S$ to form corresponding matching point pairs $\big(p^T_{(i',j')},\,\hat{p}^S_{(i',j')}\big)$, 1≤i′≤Q, 1≤j′≤Q;
wherein each pixel point $p^T_{(i',j')}$ corresponds, in the teacher feature map $F^T$, to a row feature tensor $R^T_{(i',j')}$ of shape 1×W×C and a column feature tensor $U^T_{(i',j')}$ of shape H×1×C; and each pixel point $\hat{p}^S_{(i',j')}$ corresponds, in the projection feature map $\hat{F}^S$, to a row feature tensor $\hat{R}^S_{(i',j')}$ of shape 1×W×C and a column feature tensor $\hat{U}^S_{(i',j')}$ of shape H×1×C;
step 34, determining the similarity loss function $L_{aff}$ from the preset quantity $Q^2$ of matching point pairs $\big(p^T_{(i',j')},\,\hat{p}^S_{(i',j')}\big)$ as:

$$
L_{aff} = \zeta \sum_{i'=1}^{Q}\sum_{j'=1}^{Q} \left\| A_{ff}\big(R^T_{(i',j')},\,U^T_{(i',j')}\big) - A_{ff}\big(\hat{R}^S_{(i',j')},\,\hat{U}^S_{(i',j')}\big) \right\|_{smoothl1}
$$

wherein ζ is a preset similarity loss hyper-parameter, $\|\cdot\|_{smoothl1}$ is the Smooth_L1_loss function, and $A_{ff}()$ is the similarity function.
Further, the foreground-background binary mask function M() is:

$$
M(i,j) = \begin{cases} 1, & (i,j) \text{ falls within any first recognition box } bbox_1 \\ 0, & \text{otherwise} \end{cases}
$$

the false positive-background binary mask function N() is:

$$
N(i,j) = \begin{cases} 1, & (i,j) \text{ falls within any first false positive region FP} \\ 0, & \text{otherwise} \end{cases}
$$

and the size mask function S() is:

$$
S(i,j) = \begin{cases} \dfrac{1}{H_{bbox1}\,W_{bbox1}}, & M(i,j)=1 \\[6pt] \dfrac{1}{N_{gb}}, & \text{otherwise} \end{cases}
$$

wherein $H_{bbox1}$ and $W_{bbox1}$ are the depth and width of the corresponding first recognition box $bbox_1$, and $N_{gb}$ is the number of background points on the teacher sub-feature map $F^T_k$:

$$
N_{gb} = \sum_{i=1}^{H}\sum_{j=1}^{W}\big(1 - M(i,j)\big)
$$
the channel vector attention function A s () Comprises the following steps:
Figure BDA0003903078940000062
wherein the content of the first and second substances,
t is a preset distillation hyper-parameter;
softmax s []an activation function that is a channel vector attention function;
F 1 as input features, the input features F 1 Comprising C feature components
Figure BDA00039030789400000621
In the input feature F 1 Is the teacher feature channel vector
Figure BDA0003903078940000063
Then, the teacher feature channel vector is used
Figure BDA0003903078940000064
As the corresponding said bitCharacteristic component
Figure BDA00039030789400000622
In the input feature F 1 For the projected eigen-channel vector
Figure BDA0003903078940000065
Then, the projected feature channel vector is processed
Figure BDA0003903078940000066
As corresponding said feature components
Figure BDA0003903078940000067
The sub-feature map attention function $A_c()$ is:

$$
A_c(F_2) = \mathrm{softmax}_c\!\left[\frac{F_2}{T}\right], \qquad A_c(F_2)_{i,j} = \frac{e^{f_{2,i,j}/T}}{\sum_{i'=1}^{H}\sum_{j'=1}^{W} e^{f_{2,i',j'}/T}}
$$

wherein $\mathrm{softmax}_c[\,]$ is the activation function of the sub-feature map attention function; and $F_2$ is the input feature, comprising H×W feature components $f_{2,i,j}$. When the input feature $F_2$ is the teacher sub-feature map $F^T_k$, the entries of the teacher sub-feature map $F^T_k$ serve as the corresponding feature components $f_{2,i,j}$; when the input feature $F_2$ is the projection sub-feature map $\hat{F}^S_k$, the entries of the projection sub-feature map $\hat{F}^S_k$ serve as the corresponding feature components $f_{2,i,j}$.
Further, the similarity function $A_{ff}()$ is:

$$
A_{ff}(D_{i'}, D_{j'}) = \frac{\langle D_{i'},\,D_{j'} \rangle}{\|D_{i'}\|\;\|D_{j'}\|}
$$

wherein $D_{i'}$ and $D_{j'}$ are the input feature vectors (flattened before the inner product is taken).
When the input feature vectors $D_{i'}$, $D_{j'}$ are the row feature tensor $R^T_{(i',j')}$ and the column feature tensor $U^T_{(i',j')}$ of a pixel point $p^T_{(i',j')}$ on the teacher feature map $F^T$, the similarity function is specifically:

$$
A_{ff}\big(R^T_{(i',j')},\,U^T_{(i',j')}\big) = \frac{\big\langle R^T_{(i',j')},\,U^T_{(i',j')} \big\rangle}{\big\|R^T_{(i',j')}\big\|\;\big\|U^T_{(i',j')}\big\|}
$$

When the input feature vectors $D_{i'}$, $D_{j'}$ are the row feature tensor $\hat{R}^S_{(i',j')}$ and the column feature tensor $\hat{U}^S_{(i',j')}$ of a pixel point $\hat{p}^S_{(i',j')}$ on the projection feature map $\hat{F}^S$, the similarity function is specifically:

$$
A_{ff}\big(\hat{R}^S_{(i',j')},\,\hat{U}^S_{(i',j')}\big) = \frac{\big\langle \hat{R}^S_{(i',j')},\,\hat{U}^S_{(i',j')} \big\rangle}{\big\|\hat{R}^S_{(i',j')}\big\|\;\big\|\hat{U}^S_{(i',j')}\big\|}
$$
preferably, the method further comprises the step of calculating a loss function L according to the first image, the first recognition frame set and the student model det Carrying out model self-training on the student model, and specifically comprising:
step 61, inputting the first image into the student model for gradual operation, and taking a target identification frame set output by the second target detection head network of the student model as a corresponding second identification frame set in the operation process; the second recognition frame set comprises a plurality of three-dimensional second recognition frames bbox 2 (ii) a The second identification frame bbox 2 Is H in shape bbox2 *W bbox2 *Z bbox2 ,H bbox2 、W bbox2 、Z bbox2 Is the second identification frame bbox 2 Depth, width and height of;
step 62, substituting the first and second recognition box sets into the student model loss function L det Estimating a loss value to generate a corresponding first loss value;
step 63, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 65; if not, go to step 64;
step 64, substituting the model parameters of the student model into the student model loss function L det Forming a corresponding first objective function; and for minimizing said first objective functionSolving the model parameters; updating the model parameters of the student model according to the solving result; and returns to step 61 after updating;
and step 65, confirming that the model self-training is completed.
Preferably, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the feature loss function $L_{fea}$ specifically comprises:
step 71, inputting the first point cloud into the teacher model for step-by-step computation, and extracting the output feature map of the first BEV encoder of the teacher model during the computation as the corresponding teacher feature map $F^T$; inputting the first image into the student model for step-by-step computation, and extracting the output feature map of the second BEV encoder of the student model during the computation as the corresponding student feature map $F^S$; and, based on the projection function $f_{proj}()$, projecting the student feature map $F^S$ from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map $\hat{F}^S$;
step 72, substituting the teacher feature map $F^T$, the projection feature map $\hat{F}^S$, the first recognition box set and the first false positive region set into the feature loss function $L_{fea}$ to estimate a loss value, generating the corresponding second loss value;
step 73, identifying whether the second loss value meets a preset second loss convergence range; if so, proceeding to step 75; if not, proceeding to step 74;
step 74, substituting the model parameters of the student model into the feature loss function $L_{fea}$ to form the corresponding second objective function, solving for the model parameters that minimize the second objective function, updating the model parameters of the student model according to the solving result, and returning to step 71 after updating;
step 75, confirming that the teacher-student feature imitation training is completed.
Preferably, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function $L_{att}$ specifically comprises:
step 81, inputting the first point cloud into the teacher model for step-by-step computation, and extracting the output feature map of the first BEV encoder of the teacher model during the computation as the corresponding teacher feature map $F^T$; inputting the first image into the student model for step-by-step computation, and extracting the output feature map of the second BEV encoder of the student model during the computation as the corresponding student feature map $F^S$; and, based on the projection function $f_{proj}()$, projecting the student feature map $F^S$ from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map $\hat{F}^S$;
step 82, substituting the teacher feature map $F^T$ and the projection feature map $\hat{F}^S$ into the attention loss function $L_{att}$ to estimate a loss value, generating the corresponding third loss value;
step 83, identifying whether the third loss value meets a preset third loss convergence range; if so, proceeding to step 85; if not, proceeding to step 84;
step 84, substituting the model parameters of the student model into the attention loss function $L_{att}$ to form the corresponding third objective function, solving for the model parameters that minimize the third objective function, updating the model parameters of the student model according to the solving result, and returning to step 81 after updating;
step 85, confirming that the teacher-student attention imitation training is completed.
Preferably, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function $L_{aff}$ specifically comprises:
step 91, inputting the first point cloud into the teacher model for step-by-step computation, and extracting the output feature map of the first BEV encoder of the teacher model during the computation as the corresponding teacher feature map $F^T$; inputting the first image into the student model for step-by-step computation, and extracting the output feature map of the second BEV encoder of the student model during the computation as the corresponding student feature map $F^S$; and, based on the projection function $f_{proj}()$, projecting the student feature map $F^S$ from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map $\hat{F}^S$;
step 92, substituting the teacher feature map $F^T$ and the projection feature map $\hat{F}^S$ into the similarity loss function $L_{aff}$ to estimate a loss value, generating the corresponding fourth loss value;
step 93, identifying whether the fourth loss value meets a preset fourth loss convergence range; if so, proceeding to step 95; if not, proceeding to step 94;
step 94, substituting the model parameters of the student model into the similarity loss function $L_{aff}$ to form the corresponding fourth objective function, solving for the model parameters that minimize the fourth objective function, updating the model parameters of the student model according to the solving result, and returning to step 91 after updating;
step 95, confirming that the teacher-student similarity training is completed.
Preferably, performing overall training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the overall loss function $L_{all}$ specifically comprises:
step 101, inputting the first point cloud into the teacher model for step-by-step computation, and extracting the output feature map of the first BEV encoder of the teacher model during the computation as the corresponding teacher feature map $F^T$; inputting the first image into the student model for step-by-step computation, extracting the output feature map of the second BEV encoder of the student model during the computation as the corresponding student feature map $F^S$, and taking the target recognition box set output by the second target detection head network of the student model as the corresponding third recognition box set; and, based on the projection function $f_{proj}()$, projecting the student feature map $F^S$ from the student model BEV space to the teacher model BEV space to generate the corresponding projection feature map $\hat{F}^S$; the third recognition box set comprises a plurality of three-dimensional third recognition boxes $bbox_3$, each of shape $H_{bbox3} \times W_{bbox3} \times Z_{bbox3}$, where $H_{bbox3}$, $W_{bbox3}$ and $Z_{bbox3}$ are its depth, width and height;
step 102, substituting the first and third recognition box sets, the teacher feature map $F^T$, the projection feature map $\hat{F}^S$ and the first false positive region set into the overall loss function $L_{all}$ to estimate a loss value, generating the corresponding overall loss value;
step 103, identifying whether the overall loss value meets a preset overall loss convergence range; if so, proceeding to step 105; if not, proceeding to step 104;
step 104, substituting the model parameters of the student model into the overall loss function $L_{all}$ to form the corresponding overall objective function, solving for the model parameters that minimize the overall objective function, updating the model parameters of the student model according to the solving result, and returning to step 101 after updating;
step 105, confirming that the overall training is completed.
In a second aspect, an embodiment of the present invention provides an apparatus for implementing the method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to the first aspect. The apparatus comprises an acquisition module, a training data processing module, a loss function processing module and a training processing module.
The acquisition module is used for acquiring a well-trained point-cloud-based 3D target detection model as the corresponding teacher model and acquiring an image-based 3D target detection model to be trained as the corresponding student model, and for obtaining the model loss function of the student model as the corresponding student model loss function $L_{det}$.
The training data processing module is used for acquiring an original point cloud and an original image of the same scene from a training data set as the corresponding first point cloud and first image, and for acquiring the recognition box annotation information and false positive region annotation information of the original point cloud as the corresponding first recognition box set and first false positive region set.
The loss function processing module is used for determining the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ for knowledge distillation from the teacher model to the student model, and for summing the student model loss function $L_{det}$, the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ to form the corresponding overall loss function $L_{all} = L_{det} + L_{fea} + L_{att} + L_{aff}$.
The training processing module is used for performing model self-training on the student model according to the first image, the first recognition box set and the student model loss function $L_{det}$; after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the feature loss function $L_{fea}$; after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function $L_{att}$; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function $L_{aff}$; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the overall loss function $L_{all}$; and, after the overall training is completed, selecting a new group comprising an original point cloud, an original image, and the recognition box annotation information and false positive region annotation information of that point cloud from the training data set for the next round of training of the student model, until the total number of training rounds reaches the specified number.
A third aspect of an embodiment of the present invention provides an electronic device, including: a memory, a processor, and a transceiver;
the processor is configured to be coupled to the memory, read and execute instructions in the memory, so as to implement the method steps of the first aspect;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing computer instructions that, when executed by a computer, cause the computer to perform the method of the first aspect.
An embodiment of the invention provides a method, a device, an electronic device and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation. A well-trained point-cloud-based 3D target detection model serves as the teacher model and an image-based 3D target detection model serves as the student model; a knowledge distillation mechanism distills the point cloud BEV (bird's eye view) features of the teacher model into the student model during training, helping the student model learn depth features similar to those of point cloud data from image data. The method and device help the image-based 3D target detection model reduce its depth estimation error and improve its detection accuracy.
Drawings
Fig. 1 is a schematic diagram of a method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for training a three-dimensional object detection model based on cross-modal knowledge distillation according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for training a three-dimensional target detection model based on cross-modal knowledge distillation; fig. 1 is a schematic diagram of the method, which mainly comprises the following steps:
step 1, acquiring a mature point cloud-based 3D target detection model to be trained as a corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as a corresponding student model; and obtaining a model loss function of the student model as a corresponding student model loss function L det
The teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder and a first target detection head network; the output of the point cloud pillar feature network is connected to the input of the BEV pooling network; the output of the BEV pooling network is connected to the input of the first BEV encoder; and the output of the first BEV encoder is connected to the input of the first target detection head network.
The student model comprises an image encoder, an LSS view converter, a second BEV encoder and a second target detection head network; the output of the image encoder is connected to the input of the LSS view converter; the output of the LSS view converter is connected to the input of the second BEV encoder; and the output of the second BEV encoder is connected to the input of the second target detection head network.
Here, the teacher model of the embodiment of the invention is a well-trained point-cloud-based 3D target detection model. The point cloud pillar feature network of the teacher model is similar to the Pillar Feature Net of the PointPillars model. The BEV pooling network of the teacher model pools the output of the point cloud pillar feature network along the height dimension to obtain a BEV feature map under the bird's eye view (BEV). The first BEV encoder of the teacher model further encodes the BEV feature map and outputs a corresponding BEV heatmap feature map. The first target detection head network of the teacher model performs BEV target detection on the BEV heatmap feature map to obtain a plurality of two-dimensional target recognition boxes, and performs 3D shape regression on each two-dimensional target recognition box through an internal fully connected network so as to output a plurality of three-dimensional target recognition boxes.
The student model of the embodiment of the invention is an image-based 3D target detection model to be trained. The image encoder of the student model extracts features from the input image. The LSS view converter of the student model is similar to the view transformer in the technical paper "Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D", and extracts BEV features from image data to output a corresponding BEV feature map. The second BEV encoder of the student model further encodes the BEV feature map and outputs a corresponding BEV heatmap feature map. The second target detection head network of the student model performs BEV target detection on the BEV heatmap feature map to obtain a plurality of two-dimensional target recognition boxes, and performs 3D shape regression on each two-dimensional target recognition box through an internal fully connected network so as to output a plurality of three-dimensional target recognition boxes. It should be noted that other neural network structures capable of outputting the BEV feature map and the BEV heatmap feature map may also be used ahead of the second target detection head network in the embodiment of the invention. The model loss function of the student model of the embodiment of the invention is a known loss function, denoted as the student model loss function $L_{det}$.
Step 2, acquiring an original point cloud and an original image of the same scene from a training data set as the corresponding first point cloud and first image; and acquiring the recognition box annotation information and false positive region annotation information of the original point cloud as the corresponding first recognition box set and first false positive region set;
wherein the first recognition box set comprises a plurality of three-dimensional first recognition boxes $bbox_1$; each first recognition box $bbox_1$ has shape $H_{bbox1} \times W_{bbox1} \times Z_{bbox1}$, where $H_{bbox1}$, $W_{bbox1}$ and $Z_{bbox1}$ are its depth, width and height;
and the first false positive region set comprises a plurality of first false positive regions FP.
Here, the training data set of the embodiment of the invention holds a plurality of training data records. Each training data record corresponds to a group comprising an original point cloud, an original image, and the recognition box annotation information and false positive region annotation information of that point cloud, all under the same scene; the original point cloud and the original image cover essentially the same field of view and are generated at the same time. The recognition box annotation information and false positive region annotation information of the original point cloud are manual annotations, and together with the original point cloud they also form part of the training data of the teacher model. A false positive (FP) region, as used here, is a region of the original point cloud space that is not occupied by any target recognition box but does contain the point cloud of a solid object; correspondingly, a region occupied by a recognition box in the original point cloud space is a foreground region, and a region that is neither a foreground region nor a false positive region is called a background region.
Step 3, determining the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ for knowledge distillation from the teacher model to the student model; and summing the student model loss function $L_{det}$, the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ to form the corresponding overall loss function $L_{all}$:

$$
L_{all} = L_{det} + L_{fea} + L_{att} + L_{aff}
$$
This specifically comprises: Step 31, determining the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ for knowledge distillation from the teacher model to the student model;
which specifically comprises: Step 311, determining the feature loss function $L_{fea}$ as:

$$
\begin{aligned}
L_{fea} ={}& \alpha \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} M(i,j)\,S(i,j)\,A_s(f^T_{i,j})_k\,A_c(F^T_k)_{i,j}\,\big(f^T_{k,i,j}-\hat{f}^S_{k,i,j}\big)^2 \\
&+ \beta \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} N(i,j)\,S(i,j)\,A_s(f^T_{i,j})_k\,A_c(F^T_k)_{i,j}\,\big(f^T_{k,i,j}-\hat{f}^S_{k,i,j}\big)^2 \\
&+ \gamma \sum_{k=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} \big(1-M(i,j)\big)\big(1-N(i,j)\big)\,S(i,j)\,\big(f^T_{k,i,j}-\hat{f}^S_{k,i,j}\big)^2
\end{aligned}
$$

wherein:
1) α, β and γ are respectively preset loss coefficients;
2) $F^T$ is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, C are its height, width and channel dimensions; the teacher feature map $F^T$ can be decomposed into H×W one-dimensional teacher feature channel vectors $f^T_{i,j}$ of shape 1×C, into C two-dimensional teacher sub-feature maps $F^T_k$ of shape H×W, or into C·H·W teacher feature data $f^T_{k,i,j}$, with 1≤k≤C, 1≤i≤H, 1≤j≤W;
3) $F^S$ is the student feature map output by the second BEV encoder of the student model; $f_{proj}()$ is the projection function from the student model BEV space to the teacher model BEV space; $\hat{F}^S = f_{proj}(F^S)$ is the projection feature map corresponding to the student feature map $F^S$; the projection feature map $\hat{F}^S$ has the same height, width and channel dimensions as the teacher feature map $F^T$, and can likewise be decomposed into H×W one-dimensional projection feature channel vectors $\hat{f}^S_{i,j}$ of shape 1×C, into C two-dimensional projection sub-feature maps $\hat{F}^S_k$ of shape H×W, or into C·H·W projection feature data $\hat{f}^S_{k,i,j}$;
4) M() is the foreground-background binary mask function:

$$
M(i,j) = \begin{cases} 1, & (i,j) \text{ falls within any first recognition box } bbox_1 \\ 0, & \text{otherwise} \end{cases}
$$

Here, (i,j) is the coordinate of a pixel point on the teacher sub-feature map $F^T_k$; the foreground-background binary mask function specifies that as long as the coordinate falls within any first recognition box $bbox_1$, the corresponding mask output is 1, and otherwise, if the coordinate is not in any foreground region, the corresponding mask output is 0.
5) N() is the false positive-background binary mask function:

$$
N(i,j) = \begin{cases} 1, & (i,j) \text{ falls within any first false positive region FP} \\ 0, & \text{otherwise} \end{cases}
$$

Here, (i,j) is the coordinate of a pixel point on the teacher sub-feature map $F^T_k$; the false positive-background binary mask function specifies that as long as the coordinate falls within any first false positive region FP, the corresponding mask output is 1, and otherwise the corresponding mask output is 0.
6) S() is the size mask function:

$$
S(i,j) = \begin{cases} \dfrac{1}{H_{bbox1}\,W_{bbox1}}, & M(i,j)=1 \\[6pt] \dfrac{1}{N_{gb}}, & \text{otherwise} \end{cases}
$$

wherein $H_{bbox1}$ and $W_{bbox1}$ are the depth and width of the corresponding first recognition box $bbox_1$, and $N_{gb}$ is the number of background points on the teacher sub-feature map $F^T_k$:

$$
N_{gb} = \sum_{i=1}^{H}\sum_{j=1}^{W}\big(1 - M(i,j)\big)
$$

Here, (i,j) is the coordinate of a pixel point on the teacher sub-feature map $F^T_k$; the size mask function specifies that as long as the coordinate falls within the foreground region occupied by any first recognition box $bbox_1$, the corresponding size mask output is the reciprocal of the product of the depth $H_{bbox1}$ and width $W_{bbox1}$ of that first recognition box $bbox_1$, whereas if the coordinate is not in any foreground region, the corresponding size mask output is the reciprocal of the number of background points $N_{gb}$.
7) $A_s()$ is the channel vector attention function:

$$
A_s(F_1) = \mathrm{softmax}_s\!\left[\frac{F_1}{T}\right], \qquad A_s(F_1)_k = \frac{e^{f_{1,k}/T}}{\sum_{k'=1}^{C} e^{f_{1,k'}/T}}
$$

wherein T is a preset distillation hyper-parameter; $\mathrm{softmax}_s[\,]$ is the activation function of the channel vector attention function; and $F_1$ is the input feature, comprising C feature components $f_{1,k}$.
When the input feature $F_1$ is the teacher feature channel vector $f^T_{i,j}$, the teacher feature channel vector $f^T_{i,j}$ supplies the corresponding feature components $f_{1,k}$, and the channel vector attention function at this time is $A_s(f^T_{i,j})$; when the input feature $F_1$ is the projection feature channel vector $\hat{f}^S_{i,j}$, the projection feature channel vector $\hat{f}^S_{i,j}$ supplies the corresponding feature components $f_{1,k}$, and the channel vector attention function at this time is $A_s(\hat{f}^S_{i,j})$.
8) $A_c()$ is the sub-feature map attention function:

$$
A_c(F_2) = \mathrm{softmax}_c\!\left[\frac{F_2}{T}\right], \qquad A_c(F_2)_{i,j} = \frac{e^{f_{2,i,j}/T}}{\sum_{i'=1}^{H}\sum_{j'=1}^{W} e^{f_{2,i',j'}/T}}
$$

wherein $\mathrm{softmax}_c[\,]$ is the activation function of the sub-feature map attention function; and $F_2$ is the input feature, comprising H×W feature components $f_{2,i,j}$.
When the input feature $F_2$ is the teacher sub-feature map $F^T_k$, the teacher sub-feature map $F^T_k$ supplies the corresponding feature components $f_{2,i,j}$, and the sub-feature map attention function at this time is $A_c(F^T_k)$; when the input feature $F_2$ is the projection sub-feature map $\hat{F}^S_k$, the projection sub-feature map $\hat{F}^S_k$ supplies the corresponding feature components $f_{2,i,j}$, and the sub-feature map attention function at this time is $A_c(\hat{F}^S_k)$.
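The two attention functions admit a very small sketch; here `feat` is a (C, H, W) BEV feature map and `T` the distillation temperature, and applying the softmax along the channel dimension versus the flattened spatial dimensions reproduces the per-pixel channel attention and per-channel spatial attention respectively:

```python
import torch

def channel_attention(feat, T):
    # A_s: softmax over the C channels; one 1xC attention vector per pixel (i, j)
    return torch.softmax(feat / T, dim=0)

def spatial_attention(feat, T):
    # A_c: softmax over the H*W positions; one HxW attention map per channel k
    C, H, W = feat.shape
    return torch.softmax(feat.reshape(C, -1) / T, dim=1).reshape(C, H, W)
```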
here, as is apparent from the above description, the characteristic loss function L fea In (1)
Figure BDA00039030789400001816
Is a teacher characteristic diagram F T Student characteristic diagram F S The foreground feature of (a) loses the branch,
Figure BDA00039030789400001817
is a teacher characteristic diagram F T Student characteristic diagram F S The false positive feature of (a) loses branches,
Figure BDA00039030789400001818
is a teacher feature diagram F T Student characteristic diagram F S The background feature loss branch of (1); by a characteristic loss function L fea The positioning accuracy of the student model can be improved;
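Under the reconstruction of $L_{fea}$ given in step 311 (a plausible reading of the patent's formula, with attention weights taken from the teacher map), the three branches can be sketched as follows, reusing the mask and attention helpers sketched above:

```python
def feature_loss(f_t, f_s_proj, M, N, S, T, alpha, beta, gamma):
    # f_t, f_s_proj: (C, H, W) teacher and projected student feature maps
    a_s = channel_attention(f_t, T)            # per-pixel channel weights
    a_c = spatial_attention(f_t, T)            # per-channel spatial weights
    sq = (f_t - f_s_proj) ** 2                 # element-wise squared error
    fg = (M * S * a_s * a_c * sq).sum()        # foreground feature loss branch
    fp = (N * S * a_s * a_c * sq).sum()        # false positive feature loss branch
    bg = ((1 - M) * (1 - N) * S * sq).sum()    # background feature loss branch
    return alpha * fg + beta * fp + gamma * bg
```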
Step 312, determining the attention loss function $L_{att}$ as:

$$
L_{att} = \eta \left( \sum_{i=1}^{H}\sum_{j=1}^{W} L_1\big(A_s(f^T_{i,j}),\,A_s(\hat{f}^S_{i,j})\big) + \sum_{k=1}^{C} L_1\big(A_c(F^T_k),\,A_c(\hat{F}^S_k)\big) \right)
$$

wherein η is a preset attention loss hyper-parameter and $L_1$ is the L1_Loss function.
Here, the attention loss function $L_{att}$ uses the channel vector attention function and the sub-feature map attention function to compare the attention characteristics of the teacher feature map $F^T$ and the student feature map $F^S$; the attention loss function $L_{att}$ helps the student model improve its target recognition accuracy.
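A corresponding sketch of $L_{att}$, again reusing the attention helpers above; `F.l1_loss` plays the role of the L1_Loss function:

```python
import torch.nn.functional as F

def attention_loss(f_t, f_s_proj, T, eta):
    # Compare teacher and projected-student attention with L1 distance.
    l1_s = F.l1_loss(channel_attention(f_t, T), channel_attention(f_s_proj, T))
    l1_c = F.l1_loss(spatial_attention(f_t, T), spatial_attention(f_s_proj, T))
    return eta * (l1_s + l1_c)
```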
Step 313, selecting a preset quantity $Q^2$ of pixel points from the teacher feature map $F^T$ and the projection feature map $\hat{F}^S$ to form corresponding matching point pairs $\big(p^T_{(i',j')},\,\hat{p}^S_{(i',j')}\big)$, 1≤i′≤Q, 1≤j′≤Q;
wherein each pixel point $p^T_{(i',j')}$ corresponds, in the teacher feature map $F^T$, to a row feature tensor $R^T_{(i',j')}$ of shape 1×W×C and a column feature tensor $U^T_{(i',j')}$ of shape H×1×C; and each pixel point $\hat{p}^S_{(i',j')}$ corresponds, in the projection feature map $\hat{F}^S$, to a row feature tensor $\hat{R}^S_{(i',j')}$ of shape 1×W×C and a column feature tensor $\hat{U}^S_{(i',j')}$ of shape H×1×C.
Here, a preset quantity $Q^2$ of pixel points is selected from the teacher feature map $F^T$ and from the projection feature map $\hat{F}^S$ respectively, forming the matching point pairs $\big(p^T_{(i',j')},\,\hat{p}^S_{(i',j')}\big)$, and the two pixel points in each matching point pair have identical pixel coordinates in the teacher feature map $F^T$ and the projection feature map $\hat{F}^S$.
For example, with Q = 2, if the four points with coordinates (1,2), (1,3), (1,4) and (1,5) are selected from the teacher feature map $F^T$ as the corresponding points $p^T_{(i',j')}$, then the four points with the same coordinates (1,2), (1,3), (1,4) and (1,5) should be selected from the projection feature map $\hat{F}^S$ as the corresponding points $\hat{p}^S_{(i',j')}$, and 4 matching point pairs are obtained according to the correspondence of (i′, j′).
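Selecting the matching point pairs and reading out each point's row and column feature tensors can be sketched as follows; `coords` is the shared list of Q² pixel coordinates used for both maps:

```python
import torch

def row_col_tensors(feat, coords):
    # feat: (C, H, W) feature map; coords: Q*Q shared (i, j) pixel coordinates
    rows = torch.stack([feat[:, i, :] for (i, j) in coords])  # (Q*Q, C, W) row tensors
    cols = torch.stack([feat[:, :, j] for (i, j) in coords])  # (Q*Q, C, H) column tensors
    return rows, cols
```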
Step 314, determining the similarity loss function $L_{aff}$ from the preset quantity $Q^2$ of matching point pairs $\big(p^T_{(i',j')},\,\hat{p}^S_{(i',j')}\big)$ as:

$$
L_{aff} = \zeta \sum_{i'=1}^{Q}\sum_{j'=1}^{Q} \left\| A_{ff}\big(R^T_{(i',j')},\,U^T_{(i',j')}\big) - A_{ff}\big(\hat{R}^S_{(i',j')},\,\hat{U}^S_{(i',j')}\big) \right\|_{smoothl1}
$$

wherein:
1) ζ is a preset similarity loss hyper-parameter;
2) $\|\cdot\|_{smoothl1}$ is the Smooth_L1_loss function;
3) $A_{ff}()$ is the similarity function:

$$
A_{ff}(D_{i'}, D_{j'}) = \frac{\langle D_{i'},\,D_{j'} \rangle}{\|D_{i'}\|\;\|D_{j'}\|}
$$

wherein $D_{i'}$ and $D_{j'}$ are the input feature vectors (flattened before the inner product is taken).
When the input feature vectors $D_{i'}$, $D_{j'}$ are the row feature tensor $R^T_{(i',j')}$ and the column feature tensor $U^T_{(i',j')}$ of a pixel point $p^T_{(i',j')}$ on the teacher feature map $F^T$, the similarity function is specifically:

$$
A_{ff}\big(R^T_{(i',j')},\,U^T_{(i',j')}\big) = \frac{\big\langle R^T_{(i',j')},\,U^T_{(i',j')} \big\rangle}{\big\|R^T_{(i',j')}\big\|\;\big\|U^T_{(i',j')}\big\|}
$$

When the input feature vectors $D_{i'}$, $D_{j'}$ are the row feature tensor $\hat{R}^S_{(i',j')}$ and the column feature tensor $\hat{U}^S_{(i',j')}$ of a pixel point $\hat{p}^S_{(i',j')}$ on the projection feature map $\hat{F}^S$, the similarity function is specifically:

$$
A_{ff}\big(\hat{R}^S_{(i',j')},\,\hat{U}^S_{(i',j')}\big) = \frac{\big\langle \hat{R}^S_{(i',j')},\,\hat{U}^S_{(i',j')} \big\rangle}{\big\|\hat{R}^S_{(i',j')}\big\|\;\big\|\hat{U}^S_{(i',j')}\big\|}
$$
here, the similarity function used in the embodiment of the present invention is a cosine similarity function; by a similarity loss function L aff The student model can be helped to improve the model performance;
Step 32, summing the student model loss function $L_{det}$, the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$ to form the corresponding overall loss function $L_{all}$:

$$
L_{all} = L_{det} + L_{fea} + L_{att} + L_{aff}
$$
Step 4, performing model self-training on the student model according to the first image, the first recognition box set and the student model loss function $L_{det}$; after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the feature loss function $L_{fea}$; after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function $L_{att}$; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function $L_{aff}$; and after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first recognition box set, the first false positive region set and the overall loss function $L_{all}$.
Here, the embodiment of the invention distills the point cloud BEV features of the teacher model into the student model using a knowledge distillation mechanism, thereby training the student model. During training, the student model is first self-trained; it is then trained step by step based on the three knowledge distillation loss functions (the feature loss function $L_{fea}$, the attention loss function $L_{att}$ and the similarity loss function $L_{aff}$); and finally it is trained as a whole based on the overall loss function $L_{all}$.
This specifically comprises: Step 41, performing model self-training on the student model according to the first image, the first recognition box set and the student model loss function $L_{det}$;
which specifically comprises: step 411, inputting the first image into the student model for step-by-step computation, and taking the target recognition box set output by the second target detection head network of the student model during the computation as the corresponding second recognition box set; wherein the second recognition box set comprises a plurality of three-dimensional second recognition boxes $bbox_2$, each of shape $H_{bbox2} \times W_{bbox2} \times Z_{bbox2}$, where $H_{bbox2}$, $W_{bbox2}$ and $Z_{bbox2}$ are its depth, width and height;
step 412, substituting the first and second recognition box sets into the student model loss function $L_{det}$ to estimate a loss value, generating the corresponding first loss value;
step 413, identifying whether the first loss value meets a preset first loss convergence range; if so, proceeding to step 415; if not, proceeding to step 414;
step 414, substituting the model parameters of the student model into the student model loss function $L_{det}$ to form the corresponding first objective function, solving for the model parameters that minimize the first objective function, updating the model parameters of the student model according to the solving result, and returning to step 411 after updating;
step 415, confirming that the model self-training is completed.
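Steps 411-415 amount to a convergence-gated optimization loop; a minimal sketch, assuming `l_det` is the student model loss function and `lo, hi` bound the first loss convergence range (the iteration cap is an addition of this sketch):

```python
def self_train(student, image, gt_boxes, l_det, optimizer, lo, hi, max_iters=1000):
    for _ in range(max_iters):
        pred_boxes = student(image)           # second recognition box set (step 411)
        loss = l_det(pred_boxes, gt_boxes)    # first loss value (step 412)
        if lo <= loss.item() <= hi:           # first loss convergence range (step 413)
            return                            # step 415: self-training completed
        optimizer.zero_grad()
        loss.backward()                       # minimize the first objective (step 414)
        optimizer.step()                      # update the student's parameters only
```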
step 42, after the model self-training is completed, according to the first point cloud, the first image, the first recognition frame set, the first false positive area set and the characteristic loss function L fea Carrying out teacher-student characteristic simulation training on the student model;
the method specifically comprises the following steps: step 421, inputting the first point cloud into the teacher model for gradual operation, and extracting the output characteristic diagram of the first BEV encoder of the teacher model as the corresponding teacher characteristic diagram F in the operation process T (ii) a Inputting the first image into the student model for gradual operation, and extracting the output characteristic diagram of the second BEV encoder of the student model as the corresponding student characteristic diagram F in the operation process S (ii) a And based on a projection function f proj () To student characteristic diagram F S Projection from student model BEV space to teacher model BEV space to generate corresponding projection characteristic diagram
Figure BDA0003903078940000221
Step 422, the teacher characteristic graph F T Projection feature map
Figure BDA0003903078940000222
Substituting the first recognition frame set and the first false positive region set into the characteristic loss function L fea Estimating the loss value to generate a corresponding second loss value;
step 423, identifying whether the second loss value meets a preset second loss convergence range; if yes, go to step 425; if not, go to step 424;
step 424, substituting the model parameters of the student model into the characteristic loss function L fea Forming a corresponding second objective function; solving the model parameter which enables the second objective function to reach the minimum value; updating the model parameters of the student model according to the solving result; and returns to step 421 after updating;
step 425, confirming that the teacher-student feature imitation training is completed; an illustrative sketch of the feature loss follows.
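A minimal sketch of the feature loss used in this stage, under the branch decomposition of claim 3 below, might look as follows; the (C, H, W) tensor layout, the elementwise broadcasting of the masks and attentions, and all function names are assumptions of the sketch:

```python
def feature_loss(f_t, f_s_proj, m_fg, n_fp, s_scale, a_s, a_c,
                 alpha=1.0, beta=1.0, gamma=1.0):
    # f_t, f_s_proj: teacher and projected student BEV feature maps (C, H, W)
    # m_fg, n_fp: binary foreground / false-positive masks (H, W)
    # s_scale: size mask S (H, W); a_s, a_c: attention weights (C, H, W)
    # Masked squared error split into foreground, false-positive and
    # background branches weighted by the preset coefficients alpha/beta/gamma.
    sq = (f_t - f_s_proj) ** 2 * a_s * a_c * s_scale   # masks broadcast over C
    bg = (1.0 - m_fg) * (1.0 - n_fp)                   # background = neither
    return (alpha * (sq * m_fg).sum()
            + beta * (sq * n_fp).sum()
            + gamma * (sq * bg).sum())
```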
step 43, after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att;
the method specifically comprises the following steps: step 431, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S;
step 432, substituting the teacher feature map F_T and the projection feature map F'_S into the attention loss function L_att to estimate a loss value and generate a corresponding third loss value;
step 433, identifying whether the third loss value meets a preset third loss convergence range; if yes, go to step 435; if not, go to step 434;
step 434, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters that minimize the third objective function; updating the model parameters of the student model according to the solution; and returning to step 431 after the update;
step 435, confirming that the teacher-student attention imitation training is completed; an illustrative sketch of the attention loss follows.
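Under the softmax reading of the attention functions in claim 4 below, the attention loss of this stage can be sketched as follows; applying the distillation temperature t only to the channel attention, and using a mean-reduced L1, are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F

def attention_maps(feat, t=0.5):
    # A_s: softmax over the C channels at each BEV location (with temperature t)
    # A_c: softmax over the H*W locations of each channel sub-feature map
    # feat: (C, H, W); t: assumed placement of the distillation hyper-parameter.
    c, h, w = feat.shape
    a_s = torch.softmax(feat / t, dim=0)
    a_c = torch.softmax(feat.reshape(c, -1), dim=1).reshape(c, h, w)
    return a_s, a_c

def attention_loss(f_t, f_s_proj, eta=1.0, t=0.5):
    # L_att: L1 distance between teacher and projected-student attentions,
    # scaled by the preset hyper-parameter eta.
    at_s, at_c = attention_maps(f_t, t)
    as_s, as_c = attention_maps(f_s_proj, t)
    return eta * (F.l1_loss(at_s, as_s) + F.l1_loss(at_c, as_c))
```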
step 44, after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff;
the method specifically comprises the following steps: step 441, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S;
step 442, substituting the teacher feature map F_T and the projection feature map F'_S into the similarity loss function L_aff to estimate a loss value and generate a corresponding fourth loss value;
step 443, identifying whether the fourth loss value meets a preset fourth loss convergence range; if yes, go to step 445; if not, go to step 444;
step 444, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters that minimize the fourth objective function; updating the model parameters of the student model according to the solution; and returning to step 441 after the update;
step 445, confirming that the teacher-student similarity training is completed; an illustrative sketch of the similarity loss follows.
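The row/column affinity comparison of this stage might be sketched as follows, assuming the cosine form of the similarity function reconstructed in claim 5 below and a (C, H, W) tensor layout; the point-sampling interface is also an assumption:

```python
import torch.nn.functional as F

def affinity(rows, cols):
    # Normalized similarity between every vector of a row feature tensor
    # (W, C) and every vector of a column feature tensor (H, C) -> (W, H).
    return F.normalize(rows, dim=-1) @ F.normalize(cols, dim=-1).T

def similarity_loss(f_t, f_s_proj, points, zeta=1.0):
    # L_aff: Smooth-L1 distance between teacher and student affinity matrices
    # at the Q^2 sampled BEV pixel points. f_t, f_s_proj: (C, H, W);
    # points: iterable of (i, j) coordinates of the matching point pairs.
    loss = f_t.new_zeros(())
    for i, j in points:
        aff_t = affinity(f_t[:, i, :].T, f_t[:, :, j].T)          # row i, col j
        aff_s = affinity(f_s_proj[:, i, :].T, f_s_proj[:, :, j].T)
        loss = loss + F.smooth_l1_loss(aff_t, aff_s)
    return zeta * loss
```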
step 45, after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the overall loss function L_all;
the method specifically comprises the following steps: step 451, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S, and taking the target recognition frame set output by the second target detection head network of the student model as the corresponding third recognition frame set; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S;
wherein the third recognition frame set comprises a plurality of three-dimensional third recognition frames bbox_3; each third recognition frame bbox_3 has the shape H_bbox3 × W_bbox3 × Z_bbox3, where H_bbox3, W_bbox3 and Z_bbox3 are the depth, width and height of the third recognition frame bbox_3;
step 452, substituting the first recognition frame set, the third recognition frame set, the teacher feature map F_T, the projection feature map F'_S and the first false positive region set into the overall loss function L_all to estimate a loss value and generate a corresponding overall loss value;
step 453, identifying whether the overall loss value satisfies a preset overall loss convergence range; if yes, go to step 455; if not, go to step 454;
step 454, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters that minimize the overall objective function; updating the model parameters of the student model according to the solution; and returning to step 451 after the update;
step 455, confirming that the overall training is completed; an illustrative sketch of the staged training schedule follows.
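The staged schedule of steps 41-45 and the outer loop of step 5 can be summarized in a sketch like the following; `stages`, `next_group` and the round count are illustrative assumptions, with each stage assumed to be a convergence-gated loop analogous to the one after step 415:

```python
def train_one_round(student, teacher, batch, stages):
    # Steps 41-45: run each convergence-gated stage in order -- model
    # self-training, feature imitation, attention imitation, similarity
    # training, then overall training on L_all = L_det + L_fea + L_att + L_aff.
    for train_stage in stages:
        train_stage(student, teacher, batch)
    return student

def train(student, teacher, dataset, stages, rounds):
    # Step 5: repeat the five-stage round on a fresh group of training data
    # until the total number of training rounds reaches the specified number.
    for _ in range(rounds):
        batch = dataset.next_group()   # new point cloud, image and labels
        train_one_round(student, teacher, batch, stages)
    return student
```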
Step 5, after the overall training is completed, selecting a new group of original point cloud, original image, recognition frame labeling information of the original point cloud and false positive region labeling information from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches the specified number.
Fig. 2 is a block diagram of an apparatus for training a three-dimensional target detection model based on cross-modal knowledge distillation according to a second embodiment of the present invention. The apparatus may be the terminal device or server implementing the foregoing method embodiment, or an apparatus enabling that terminal device or server to implement the method embodiment; for example, the apparatus may be an apparatus or a chip system of the foregoing terminal device or server. As shown in fig. 2, the apparatus includes: an acquisition module 201, a training data processing module 202, a loss function processing module 203, and a training processing module 204.
The acquisition module 201 is configured to acquire a well-trained point cloud-based 3D target detection model as the corresponding teacher model, and to acquire an image-based 3D target detection model to be trained as the corresponding student model; and to obtain the model loss function of the student model as the corresponding student model loss function L_det.
The training data processing module 202 is configured to obtain an original point cloud and an original image of the same scene from a training data set as the corresponding first point cloud and first image; and to acquire recognition frame labeling information and false positive region labeling information of the original point cloud as the corresponding first recognition frame set and first false positive region set.
The loss function processing module 203 is configured to determine the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and to add the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form the corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff.
The training processing module 204 is configured to perform model self-training on the student model according to the first image, the first recognition frame set and the student model loss function L_det; after the model self-training is completed, to perform teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the feature loss function L_fea; after the teacher-student feature imitation training is completed, to perform teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is completed, to perform teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is completed, to perform overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the overall loss function L_all; and after the overall training is completed, to select a new group of original point cloud, original image, recognition frame labeling information of the original point cloud and false positive region labeling information from the training data set for the next round of training of the student model, until the total number of training rounds reaches the specified number.
The apparatus for training a three-dimensional target detection model based on cross-modal knowledge distillation provided by this embodiment of the present invention can execute the method steps of the above method embodiment; its implementation principle and technical effect are similar, and the detailed description is therefore omitted.
It should be noted that the division of the modules of the above apparatus is only a logical division; in actual implementation the modules may be wholly or partially integrated into one physical entity, or may be physically separate. These modules may all be implemented as software invoked by a processing element, or all as hardware, or some as software invoked by a processing element and some as hardware. For example, the acquisition module may be a separately established processing element, may be integrated in a chip of the apparatus, or may be stored in the memory of the apparatus in the form of program code that a processing element of the apparatus calls in order to execute the module's function; the other modules are implemented similarly. In addition, all or part of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability; in implementation, each step of the above method or each of the above modules may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), etc. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can invoke the program code. As another example, these modules may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the foregoing method embodiments are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, Bluetooth, microwave).
Fig. 3 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. The electronic device may be the terminal device or the server, or may be a terminal device or a server connected to the terminal device or the server and implementing the method according to the embodiment of the present invention. As shown in fig. 3, the electronic device may include: a processor 301 (e.g., a CPU), a memory 302, a transceiver 303; the transceiver 303 is coupled to the processor 301, and the processor 301 controls the transceiving operation of the transceiver 303. Various instructions may be stored in memory 302 for performing various processing functions and implementing the processing steps described in the foregoing method embodiments. Preferably, the electronic device according to an embodiment of the present invention further includes: a power supply 304, a system bus 305, and a communication port 306. The system bus 305 is used to implement communication connections between the elements. The communication port 306 is used for connection communication between the electronic device and other peripherals.
The system bus 305 mentioned in fig. 3 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The system bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication port is used to realize communication between the database access apparatus and other devices (such as a client, a read-write library, and a read-only library). The memory may include a Random Access Memory (RAM) and may also include a Non-Volatile Memory, such as at least one disk memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), a Graphics Processing Unit (GPU), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
It should be noted that the embodiment of the present invention also provides a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the method and the processing procedure provided in the above-mentioned embodiment.
The embodiment of the present invention further provides a chip for executing the instructions, where the chip is configured to execute the processing steps described in the foregoing method embodiment.
The embodiments of the invention provide a method, a device, electronic equipment and a computer-readable storage medium for training a three-dimensional target detection model based on cross-modal knowledge distillation: a well-trained point cloud-based 3D target detection model is taken as the teacher model, an image-based 3D target detection model is taken as the student model, the point cloud BEV (bird's eye view) features of the teacher model are distilled to the student model by a knowledge distillation mechanism, and the student model is trained so as to help it learn, from image data, depth features similar to those of the point cloud data; the method and the device can thus help the image 3D target detection model reduce depth estimation errors and improve its detection accuracy.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in a Random Access Memory (RAM), a flash memory, a Read-Only Memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only examples of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (13)

1. A method for training a three-dimensional target detection model based on cross-modal knowledge distillation, the method comprising:
obtaining a well-trained point cloud-based 3D target detection model as a corresponding teacher model, and obtaining an image-based 3D target detection model to be trained as a corresponding student model; and obtaining the model loss function of the student model as a corresponding student model loss function L_det;

acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and first image; and acquiring recognition frame labeling information and false positive region labeling information of the original point cloud as a corresponding first recognition frame set and first false positive region set;

determining the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form a corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;

performing model self-training on the student model according to the first image, the first recognition frame set and the student model loss function L_det; after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the feature loss function L_fea; after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the overall loss function L_all;

and after the overall training is completed, selecting a new group of original point cloud, original image, recognition frame labeling information of the original point cloud and false positive region labeling information from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches the specified number.
2. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 1, wherein:

the teacher model comprises a point cloud pillar feature network, a BEV pooling network, a first BEV encoder and a first target detection head network; the output of the point cloud pillar feature network is connected with the input of the BEV pooling network; the output of the BEV pooling network is connected with the input of the first BEV encoder; the output of the first BEV encoder is connected with the input of the first target detection head network;

the student model comprises an image encoder, an LSS view converter, a second BEV encoder and a second target detection head network; the output of the image encoder is connected with the input of the LSS view converter; the output of the LSS view converter is connected with the input of the second BEV encoder; the output of the second BEV encoder is connected with the input of the second target detection head network;

the first recognition frame set comprises a plurality of three-dimensional first recognition frames bbox_1; each first recognition frame bbox_1 has the shape H_bbox1 × W_bbox1 × Z_bbox1, where H_bbox1, W_bbox1 and Z_bbox1 are the depth, width and height of the first recognition frame bbox_1;

the first false positive region set comprises a plurality of first false positive regions FP.
3. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 2, wherein determining the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model specifically comprises the following steps:

step 31, determining the feature loss function L_fea as:

L_fea = α·L_fea^{fg} + β·L_fea^{fp} + γ·L_fea^{bg},

with

L_fea^{fg} = \sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} M(i,j)\,S(i,j)\,[A_s(F_T^{(i,j)})]_c\,[A_c(F_T^{(c)})]_{i,j}\,\big(f_T^{(c,i,j)} - f_S'^{(c,i,j)}\big)^2,

L_fea^{fp} = \sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} N(i,j)\,S(i,j)\,[A_s(F_T^{(i,j)})]_c\,[A_c(F_T^{(c)})]_{i,j}\,\big(f_T^{(c,i,j)} - f_S'^{(c,i,j)}\big)^2,

L_fea^{bg} = \sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W} \big(1-M(i,j)\big)\big(1-N(i,j)\big)\,S(i,j)\,[A_s(F_T^{(i,j)})]_c\,[A_c(F_T^{(c)})]_{i,j}\,\big(f_T^{(c,i,j)} - f_S'^{(c,i,j)}\big)^2;

wherein:

α, β and γ are respectively preset loss coefficients;

F_T is the teacher feature map output by the first BEV encoder of the teacher model, and H, W, C are the height, width and channel dimensions of the teacher feature map F_T; the teacher feature map F_T can be decomposed into H×W one-dimensional teacher feature channel vectors F_T^{(i,j)} of shape 1×C, can also be decomposed into C teacher sub-feature maps F_T^{(c)} of two-dimensional shape H×W, and can also be decomposed into C·H·W teacher feature data f_T^{(c,i,j)};

F_S is the student feature map output by the second BEV encoder of the student model; f_proj() is the projection function from the student model BEV space to the teacher model BEV space; F'_S = f_proj(F_S) is the projection feature map corresponding to the student feature map F_S; the projection feature map F'_S is consistent with the teacher feature map F_T in height, width and channel dimensions; the projection feature map F'_S can be decomposed into H×W one-dimensional projection feature channel vectors F'_S^{(i,j)} of shape 1×C, can also be decomposed into C projection sub-feature maps F'_S^{(c)} of two-dimensional shape H×W, and can also be decomposed into C·H·W projection feature data f_S'^{(c,i,j)};

M() is the foreground-background binary mask function;

N() is the false positive-background binary mask function;

S() is the size mask function;

A_s() is the channel vector attention function;

A_c() is the sub-feature map attention function;

L_fea^{fg} is the foreground feature loss branch of the teacher feature map F_T and the student feature map F_S, L_fea^{fp} is the false positive feature loss branch of the teacher feature map F_T and the student feature map F_S, and L_fea^{bg} is the background feature loss branch of the teacher feature map F_T and the student feature map F_S;

step 32, determining the attention loss function L_att as:

L_att = η·\Big(\sum_{i=1}^{H}\sum_{j=1}^{W} L_1\big(A_s(F_T^{(i,j)}),\,A_s(F'_S^{(i,j)})\big) + \sum_{c=1}^{C} L_1\big(A_c(F_T^{(c)}),\,A_c(F'_S^{(c)})\big)\Big);

wherein:

η is a preset attention loss hyper-parameter;

L_1 is the L1_Loss function;

step 33, selecting a preset quantity Q^2 of pixel points from the teacher feature map F_T and the projection feature map F'_S to form corresponding matching point pairs (p_q^T, p_q^S), q = 1, …, Q^2; wherein each pixel point p_q^T corresponds in the teacher feature map F_T to a row feature tensor R_q^T of shape 1×W×C and also to a column feature tensor C_q^T of shape H×1×C; and each pixel point p_q^S corresponds in the projection feature map F'_S to a row feature tensor R_q^S of shape 1×W×C and also to a column feature tensor C_q^S of shape H×1×C;

step 34, determining the similarity loss function L_aff from the preset quantity Q^2 of matching point pairs (p_q^T, p_q^S) as:

L_aff = ζ·\sum_{q=1}^{Q^2} \big\| A_ff(R_q^T, C_q^T) - A_ff(R_q^S, C_q^S) \big\|_{smoothl1};

wherein:

ζ is a preset similarity loss hyper-parameter;

‖·‖_smoothl1 is the Smooth_L1_loss function;

A_ff() is the similarity function.
4. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein:

the foreground-background binary mask function M() is:

M(i,j) = 1, if the point (i,j) falls within any first recognition frame bbox_1 of the first recognition frame set; M(i,j) = 0, otherwise;

the false positive-background binary mask function N() is:

N(i,j) = 1, if the point (i,j) falls within any first false positive region FP of the first false positive region set; N(i,j) = 0, otherwise;

the size mask function S() is:

S(i,j) = 1/(H_bbox1·W_bbox1), if M(i,j) = 1; S(i,j) = 1/N_gb, otherwise;

wherein:

H_bbox1 and W_bbox1 are the depth and width of the corresponding first recognition frame bbox_1 covering the point (i,j);

N_gb is the number of background points on the teacher sub-feature map F_T^{(c)}, N_gb = \sum_{i=1}^{H}\sum_{j=1}^{W}\big(1 - M(i,j)\big);

the channel vector attention function A_s() is:

A_s(F_1) = softmax_s[F_1/T], with components [A_s(F_1)]_c = e^{f_c/T} / \sum_{c'=1}^{C} e^{f_{c'}/T};

wherein:

T is a preset distillation hyper-parameter;

softmax_s[] is the activation function of the channel vector attention function;

F_1 is the input feature, comprising C feature components f_c; when the input feature F_1 is a teacher feature channel vector F_T^{(i,j)}, the teacher feature channel vector F_T^{(i,j)} supplies the corresponding feature components f_c; when the input feature F_1 is a projection feature channel vector F'_S^{(i,j)}, the projection feature channel vector F'_S^{(i,j)} supplies the corresponding feature components f_c;

the sub-feature map attention function A_c() is:

A_c(F_2) = softmax_c[F_2], with components [A_c(F_2)]_{i,j} = e^{f_{i,j}} / \sum_{i'=1}^{H}\sum_{j'=1}^{W} e^{f_{i',j'}};

wherein:

softmax_c[] is the activation function of the sub-feature map attention function;

F_2 is the input feature, comprising H×W feature components f_{i,j}; when the input feature F_2 is a teacher sub-feature map F_T^{(c)}, the teacher sub-feature map F_T^{(c)} supplies the corresponding feature components f_{i,j}; when the input feature F_2 is a projection sub-feature map F'_S^{(c)}, the projection sub-feature map F'_S^{(c)} supplies the corresponding feature components f_{i,j}.
5. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein the similarity function A_ff() is:

A_ff(D_{i'}, D_{j'}) = (D_{i'} · D_{j'}) / (‖D_{i'}‖·‖D_{j'}‖);

wherein D_{i'} and D_{j'} are input feature vectors;

when the input feature vectors D_{i'} and D_{j'} are the i'-th vector of the row feature tensor R_q^T and the j'-th vector of the column feature tensor C_q^T of a pixel point p_q^T of the teacher feature map F_T, the similarity function specifically yields the (i',j')-th element of the teacher affinity matrix A_ff(R_q^T, C_q^T);

when the input feature vectors D_{i'} and D_{j'} are the i'-th vector of the row feature tensor R_q^S and the j'-th vector of the column feature tensor C_q^S of a pixel point p_q^S of the projection feature map F'_S, the similarity function specifically yields the (i',j')-th element of the student affinity matrix A_ff(R_q^S, C_q^S).
6. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein performing model self-training on the student model according to the first image, the first recognition frame set and the student model loss function L_det specifically comprises:

step 61, inputting the first image into the student model for step-by-step operation, and taking the target recognition frame set output by the second target detection head network of the student model during the operation as the corresponding second recognition frame set; the second recognition frame set comprises a plurality of three-dimensional second recognition frames bbox_2; each second recognition frame bbox_2 has the shape H_bbox2 × W_bbox2 × Z_bbox2, where H_bbox2, W_bbox2 and Z_bbox2 are the depth, width and height of the second recognition frame bbox_2;

step 62, substituting the first and second recognition frame sets into the student model loss function L_det to estimate a loss value and generate a corresponding first loss value;
step 63, identifying whether the first loss value meets a preset first loss convergence range; if yes, go to step 65; if not, go to step 64;
step 64, substituting the model parameters of the student model into the student model loss function L_det to form a corresponding first objective function; solving for the model parameters that minimize the first objective function; updating the model parameters of the student model according to the solution; and returning to step 61 after the update;
and step 65, confirming that the model self-training is completed.
7. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the feature loss function L_fea specifically comprises:

step 71, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S;

step 72, substituting the teacher feature map F_T, the projection feature map F'_S, the first recognition frame set and the first false positive region set into the feature loss function L_fea to estimate a loss value and generate a corresponding second loss value;
step 73, identifying whether the second loss value meets a preset second loss convergence range; if yes, go to step 75; if not, go to step 74;
step 74, substituting the model parameters of the student model into the feature loss function L_fea to form a corresponding second objective function; solving for the model parameters that minimize the second objective function; updating the model parameters of the student model according to the solution; and returning to step 71 after the update;
and step 75, confirming that the teacher-student feature imitation training is completed.
8. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att specifically comprises:

step 81, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S;

step 82, substituting the teacher feature map F_T and the projection feature map F'_S into the attention loss function L_att to estimate a loss value and generate a corresponding third loss value;
step 83, identifying whether the third loss value meets a preset third loss convergence range; if yes, go to step 85; if not, go to step 84;
step 84, substituting the model parameters of the student model into the attention loss function L_att to form a corresponding third objective function; solving for the model parameters that minimize the third objective function; updating the model parameters of the student model according to the solution; and returning to step 81 after the update;
and step 85, confirming that the teacher-student attention imitation training is finished.
9. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff specifically comprises:

step 91, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, and extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S;

step 92, substituting the teacher feature map F_T and the projection feature map F'_S into the similarity loss function L_aff to estimate a loss value and generate a corresponding fourth loss value;
step 93, identifying whether the fourth loss value meets a preset fourth loss convergence range; if yes, go to step 95; if not, go to step 94;
step 94, substituting the model parameters of the student model into the similarity loss function L_aff to form a corresponding fourth objective function; solving for the model parameters that minimize the fourth objective function; updating the model parameters of the student model according to the solution; and returning to step 91 after the update;
and step 95, confirming that the teacher-student similarity training is completed.
10. The method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to claim 3, wherein performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the overall loss function L_all specifically comprises:

step 101, inputting the first point cloud into the teacher model for step-by-step operation, and extracting the output feature map of the first BEV encoder of the teacher model during the operation as the corresponding teacher feature map F_T; inputting the first image into the student model for step-by-step operation, extracting the output feature map of the second BEV encoder of the student model during the operation as the corresponding student feature map F_S, and taking the target recognition frame set output by the second target detection head network of the student model as the corresponding third recognition frame set; and projecting the student feature map F_S from the student model BEV space to the teacher model BEV space based on the projection function f_proj() to generate the corresponding projection feature map F'_S; the third recognition frame set comprises a plurality of three-dimensional third recognition frames bbox_3; each third recognition frame bbox_3 has the shape H_bbox3 × W_bbox3 × Z_bbox3, where H_bbox3, W_bbox3 and Z_bbox3 are the depth, width and height of the third recognition frame bbox_3;

step 102, substituting the first recognition frame set, the third recognition frame set, the teacher feature map F_T, the projection feature map F'_S and the first false positive region set into the overall loss function L_all to estimate a loss value and generate a corresponding overall loss value;
step 103, identifying whether the overall loss value meets a preset overall loss convergence range; if yes, go to step 105; if not, go to step 104;
step 104, substituting the model parameters of the student model into the overall loss function L_all to form a corresponding overall objective function; solving for the model parameters that minimize the overall objective function; updating the model parameters of the student model according to the solution; and returning to step 101 after the update;
and step 105, confirming that the overall training is completed.
11. An apparatus for performing the method for training a three-dimensional target detection model based on cross-modal knowledge distillation according to any one of claims 1-10, the apparatus comprising: an acquisition module, a training data processing module, a loss function processing module and a training processing module;

the acquisition module is used for acquiring a well-trained point cloud-based 3D target detection model as a corresponding teacher model, and acquiring an image-based 3D target detection model to be trained as a corresponding student model; and obtaining the model loss function of the student model as a corresponding student model loss function L_det;

the training data processing module is used for acquiring an original point cloud and an original image of the same scene from a training data set as a corresponding first point cloud and first image; and acquiring recognition frame labeling information and false positive region labeling information of the original point cloud as a corresponding first recognition frame set and first false positive region set;

the loss function processing module is used for determining the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff for knowledge distillation from the teacher model to the student model; and adding the student model loss function L_det, the feature loss function L_fea, the attention loss function L_att and the similarity loss function L_aff to form a corresponding overall loss function L_all, L_all = L_det + L_fea + L_att + L_aff;

the training processing module is used for performing model self-training on the student model according to the first image, the first recognition frame set and the student model loss function L_det; after the model self-training is completed, performing teacher-student feature imitation training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the feature loss function L_fea; after the teacher-student feature imitation training is completed, performing teacher-student attention imitation training on the student model according to the first point cloud, the first image and the attention loss function L_att; after the teacher-student attention imitation training is completed, performing teacher-student similarity training on the student model according to the first point cloud, the first image and the similarity loss function L_aff; after the teacher-student similarity training is completed, performing overall training on the student model according to the first point cloud, the first image, the first recognition frame set, the first false positive region set and the overall loss function L_all; and after the overall training is completed, selecting a new group of original point cloud, original image, recognition frame labeling information of the original point cloud and false positive region labeling information from the training data set to perform the next round of training on the student model, until the total number of training rounds reaches the specified number.
12. An electronic device, comprising: a memory, a processor, and a transceiver;
the processor is configured to be coupled with the memory, and to read and execute the instructions in the memory so as to implement the method steps of any one of claims 1-10;
the transceiver is coupled to the processor, and the processor controls the transceiver to transmit and receive messages.
13. A computer-readable storage medium having stored thereon computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-10.
CN202211296868.1A 2022-10-21 2022-10-21 Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation Pending CN115690708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211296868.1A CN115690708A (en) 2022-10-21 2022-10-21 Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211296868.1A CN115690708A (en) 2022-10-21 2022-10-21 Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation

Publications (1)

Publication Number Publication Date
CN115690708A true CN115690708A (en) 2023-02-03

Family

ID=85066272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211296868.1A Pending CN115690708A (en) 2022-10-21 2022-10-21 Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation

Country Status (1)

Country Link
CN (1) CN115690708A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028891A (en) * 2023-02-16 2023-04-28 之江实验室 Industrial anomaly detection model training method and device based on multi-model fusion
CN116229210A (en) * 2023-02-23 2023-06-06 南通探维光电科技有限公司 Target detection model training method, device, equipment and medium
CN116229210B (en) * 2023-02-23 2023-10-24 南通探维光电科技有限公司 Target detection model training method, device, equipment and medium
CN116341650A (en) * 2023-03-23 2023-06-27 哈尔滨市科佳通用机电股份有限公司 Noise self-training-based railway wagon bolt loss detection method
CN116341650B (en) * 2023-03-23 2023-12-26 哈尔滨市科佳通用机电股份有限公司 Noise self-training-based railway wagon bolt loss detection method
CN117097797A (en) * 2023-10-19 2023-11-21 浪潮电子信息产业股份有限公司 Cloud edge end cooperation method, device and system, electronic equipment and readable storage medium
CN117097797B (en) * 2023-10-19 2024-02-09 浪潮电子信息产业股份有限公司 Cloud edge end cooperation method, device and system, electronic equipment and readable storage medium
CN117351450A (en) * 2023-12-06 2024-01-05 吉咖智能机器人有限公司 Monocular 3D detection method and device, electronic equipment and storage medium
CN117351450B (en) * 2023-12-06 2024-02-27 吉咖智能机器人有限公司 Monocular 3D detection method and device, electronic equipment and storage medium
CN117523549A (en) * 2024-01-04 2024-02-06 南京邮电大学 Three-dimensional point cloud object identification method based on deep and wide knowledge distillation
CN117523549B (en) * 2024-01-04 2024-03-29 南京邮电大学 Three-dimensional point cloud object identification method based on deep and wide knowledge distillation

Similar Documents

Publication Publication Date Title
CN115690708A (en) Method and device for training three-dimensional target detection model based on cross-modal knowledge distillation
CN109816725B (en) Monocular camera object pose estimation method and device based on deep learning
CN109960742B (en) Local information searching method and device
CN110866871A (en) Text image correction method and device, computer equipment and storage medium
CN111723691B (en) Three-dimensional face recognition method and device, electronic equipment and storage medium
CN113888689A (en) Image rendering model training method, image rendering method and image rendering device
CN109948441B (en) Model training method, image processing method, device, electronic equipment and computer readable storage medium
CN107767358B (en) Method and device for determining ambiguity of object in image
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN112102424A (en) License plate image generation model construction method, generation method and device
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN113095333A (en) Unsupervised feature point detection method and unsupervised feature point detection device
CN111260794B (en) Outdoor augmented reality application method based on cross-source image matching
CN113129425A (en) Face image three-dimensional reconstruction method, storage medium and terminal device
CN112929626A (en) Three-dimensional information extraction method based on smartphone image
CN112990183A (en) Method, system and device for extracting homonymous strokes of offline handwritten Chinese characters
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN113706472A (en) Method, device and equipment for detecting road surface diseases and storage medium
CN110533663B (en) Image parallax determining method, device, equipment and system
CN111723688A (en) Human body action recognition result evaluation method and device and electronic equipment
CN116597246A (en) Model training method, target detection method, electronic device and storage medium
CN116206302A (en) Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN115035193A (en) Bulk grain random sampling method based on binocular vision and image segmentation technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination