CN117593713A - BEV time sequence model distillation method, device, equipment and medium
- Publication number: CN117593713A
- Application number: CN202311615040.2A
- Authority: CN (China)
- Prior art keywords: BEV, frame, model, loss value, time sequence
- Legal status: Pending
Classifications
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/096—Transfer learning
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
Abstract
The invention discloses a BEV time sequence model distillation method, device, equipment and medium. The method comprises the following steps: extracting the current-frame BEV features and the previous N-1 frames' BEV features from input time sequence data through a BEV time sequence model, and extracting the current-frame BEV features from input single-frame data through a BEV single-frame model; aligning the current-frame BEV features of the time sequence model and of the single-frame model with respect to the region where each detection target is located, and calculating a first loss value according to the alignment result; aligning the previous N-1 frames' BEV features of the time sequence model with the current-frame BEV features of the single-frame model with respect to the center point position of each detection target, and calculating a second loss value according to the alignment result; and updating the parameters of the BEV single-frame model according to the first loss value, the second loss value and the single-frame model's own loss value. According to the technical scheme provided by the embodiments of the invention, the knowledge in the BEV time sequence model can be migrated to the BEV single-frame model, effectively improving the performance of the BEV single-frame model.
Description
Technical Field
The invention relates to the technical field of automatic driving, in particular to a BEV time sequence model distillation method, a device, equipment and a medium.
Background
In the field of autonomous-driving visual perception, perception under the Bird's Eye View (BEV) is one of the most active research directions. The bird's eye view is a top-down representation that can establish a unified feature representation for downstream tasks such as 3D object detection, lane line detection and drivable region segmentation.
The accuracy of a BEV model increases with model scale, but its inference speed decreases accordingly. Knowledge distillation can effectively transfer knowledge from a large model to a small model, improving the accuracy of the small model while keeping its inference speed unchanged. Conventional knowledge distillation methods for BEV models are mostly applied to BEV single-frame models, i.e. BEV models without timing information.
Distillation of a BEV single-frame model copes poorly with the detection of objects that are easily occluded, whereas a BEV timing model captures the features of an object over a period of time and therefore characterizes it more robustly. A method for distilling the BEV timing model is therefore needed.
Disclosure of Invention
The invention provides a BEV time sequence model distillation method, device, equipment and medium, which can transfer the knowledge in a BEV time sequence model into a BEV single-frame model and effectively improve the performance of the BEV single-frame model.
According to an aspect of the present invention, there is provided a BEV timing model distillation method comprising:
extracting BEV features from the input time sequence data through a trained BEV time sequence model to obtain current frame BEV features and front N-1 frame BEV features, and extracting BEV features from the input single frame data through an untrained BEV single frame model to obtain current frame BEV features;
aligning the current frame BEV characteristics of the BEV time sequence model and the current frame BEV characteristics of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to an alignment result;
aligning the first N-1 frame BEV characteristic of the BEV time sequence model and the current frame BEV characteristic of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to the alignment result;
and updating parameters of the BEV single-frame model according to the first loss value, the second loss value and the loss value of the BEV single-frame model.
Optionally, before extracting BEV features from the input timing data by the trained BEV timing model to obtain the BEV features of the current frame and the BEV features of the previous N-1 frame, the method further includes:
acquiring time sequence data for model training; the time sequence data includes: a plurality of training key frames together with labeling information of the detection targets corresponding to each training key frame, and the N-1 historical frames preceding each training key frame together with labeling information of the detection targets corresponding to each historical frame;
training a preset BEV single-frame model according to the time sequence data to obtain a trained BEV time sequence model.
Optionally, aligning the BEV features of the current frame of the BEV timing model and the BEV features of the current frame of the BEV single frame model with respect to the region where the detection target is located, and calculating a first loss value according to the alignment result, including:
determining a corresponding target area of each detection target of the current frame in the BEV space according to the labeling information of the current frame, and taking each target area as a mask;
acquiring a first BEV characteristic of each target area corresponding to the BEV time sequence model and a second BEV characteristic of each target area corresponding to the BEV single frame model according to the mask;
the first BEV feature and the second BEV feature are aligned and a first loss value corresponding to the alignment result is calculated from the first loss function.
Optionally, aligning the first N-1 frame BEV feature of the BEV timing model and the current frame BEV feature of the BEV single frame model with respect to a center point position of the detection target, and calculating a second loss value according to the alignment result, including:
determining a third center point coordinate of a center point of each detection target of the previous N-1 frame in the BEV space according to the labeling information of the previous N-1 frame;
according to the labeling information of the current frame, determining a fourth center point coordinate of a center point of each detection target of the current frame in the BEV space;
acquiring fifth BEV features of the BEV time sequence model corresponding to the coordinates of each third center point and sixth BEV features of the BEV single-frame model corresponding to the coordinates of each fourth center point;
and aligning the fifth BEV feature and the sixth BEV feature aiming at the center point of the same detection target, and calculating a second loss value corresponding to the aligned result according to a second loss function.
Optionally, determining the third center point coordinate of the center point of each detection target of the previous N-1 frame in the BEV space according to the labeling information of the previous N-1 frame includes:
for each historical frame in the previous N-1 frames, determining a historical frame coordinate corresponding to the center point of each detection target of the historical frame according to the labeling information of the historical frame;
calculating the product of each historical frame coordinate, the global conversion matrix and the current conversion matrix to obtain the current frame coordinate corresponding to the center point of each detection target of the historical frame;
the global conversion matrix is a coordinate conversion matrix from a historical frame to a global frame, and the current conversion matrix is a coordinate conversion matrix from the global frame to the current frame;
and screening the effective detection targets in the history frames, which are positioned in the field of view of the current frames, and taking the coordinates of the current frames corresponding to the central points of the effective detection targets as the coordinates of the third central points in the BEV space.
Optionally, updating parameters of the BEV single frame model according to the first loss value, the second loss value, and the BEV single frame model own loss value includes:
encoding and decoding the BEV characteristics of the current frame of the BEV single-frame model, and calculating the BEV single-frame model self-loss value corresponding to the decoding result according to the third loss function;
and calculating a weighted sum of the first loss value, the second loss value and the loss value of the BEV single frame model according to a preset weighted formula, and updating parameters of the BEV single frame model according to the weighted sum of the loss values.
Optionally, the BEV timing model comprises a BEVDet4D model and the BEV single frame model comprises a BEVDet model.
According to another aspect of the present invention, there is provided a BEV timing model distillation apparatus comprising:
the feature extraction module is used for extracting BEV features from input time sequence data through a trained BEV time sequence model to obtain current frame BEV features and front N-1 frame BEV features, and extracting BEV features from input single frame data through an untrained BEV single frame model to obtain current frame BEV features;
the feature region alignment module is used for aligning the current frame BEV features of the BEV time sequence model and the current frame BEV features of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to an alignment result;
the feature point alignment module is used for aligning the first N-1 frame BEV feature of the BEV time sequence model and the current frame BEV feature of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to an alignment result;
and the parameter updating module is used for updating the parameters of the BEV single-frame model according to the first loss value, the second loss value and the self loss value of the BEV single-frame model.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the BEV timing model distillation method of any one of the embodiments of the present invention.
According to another aspect of the invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to perform the BEV timing model distillation method of any embodiment of the invention.
According to the technical scheme, BEV features are extracted from input time sequence data through a trained BEV time sequence model to obtain current frame BEV features and front N-1 frame BEV features, and BEV features are extracted from input single frame data through an untrained BEV single frame model to obtain current frame BEV features; aligning the current frame BEV characteristics of the BEV time sequence model and the current frame BEV characteristics of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to an alignment result; aligning the first N-1 frame BEV characteristic of the BEV time sequence model and the current frame BEV characteristic of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to the alignment result; according to the first loss value, the second loss value and the self loss value of the BEV single frame model, the parameters of the BEV single frame model are updated, the problem that knowledge distillation can only be carried out on the BEV single frame model in the prior art is solved, knowledge in the BEV time sequence model with time sequence information is transferred to the BEV single frame model, and the performance of the BEV single frame model is effectively improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a BEV timing model distillation method according to one embodiment of the present invention;
FIG. 2 is a block diagram of a BEV timing model provided in accordance with a first embodiment of the present invention;
FIG. 3 is a block diagram of a BEV single frame model provided in accordance with a first embodiment of the present invention;
FIG. 4 is a flow chart of an implementation of BEV timing model distillation, provided in accordance with a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a BEV timing model distillation apparatus according to a second embodiment of the present invention;
FIG. 6 is a schematic diagram of the structure of an electronic device implementing the BEV timing model distillation method of an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a BEV timing model distillation method according to a first embodiment of the present invention, which is applicable to a case of performing knowledge distillation on a BEV timing model with timing information, and the method may be performed by a BEV timing model distillation apparatus, which may be implemented in hardware and/or software, and the BEV timing model distillation apparatus may be configured in an electronic device. As shown in fig. 1, the method includes:
s110, extracting BEV features from the input time sequence data through the trained BEV time sequence model to obtain the BEV features of the current frame and the BEV features of the previous N-1 frames, and extracting BEV features from the input single frame data through the untrained BEV single frame model to obtain the BEV features of the current frame.
The trained BEV timing model generally has a relatively large number of parameters and a large computation cost, and its precision is relatively high; the untrained BEV single-frame model has fewer parameters and a smaller computation cost than the BEV timing model, and its precision is relatively low. The BEV timing model may include a BEVDet4D model, and the BEV single-frame model may include a BEVDet model.
In this embodiment, the model structure of the BEV timing model may be as shown in fig. 2, and the model structure of the BEV single-frame model may be as shown in fig. 3. As can be seen, the two models differ in their input data: the input of the BEV timing model is N frames of data with timing information, whereas the input of the BEV single-frame model is a single frame of data. The BEV timing model extracts BEV features from the input N frames of data to obtain the current-frame BEV features and the previous N-1 frames' BEV features, and the N frames of BEV features are passed through a subsequent BEV encoder and BEV decoder to obtain the output result for the current frame. In contrast, the BEV single-frame model extracts BEV features from the input single frame of data to obtain the current-frame BEV features, which are then passed through a subsequent BEV encoder and BEV decoder to obtain the output result for the current frame.
In an alternative embodiment, before extracting BEV features from the input timing data by the trained BEV timing model to obtain the current-frame BEV features and the previous N-1 frames' BEV features, the method may further include: acquiring time sequence data for model training, the time sequence data including a plurality of training key frames together with labeling information of the detection targets corresponding to each training key frame, and the N-1 historical frames preceding each training key frame together with labeling information of the detection targets corresponding to each historical frame; and training a preset BEV single-frame model according to the time sequence data to obtain a trained BEV timing model.
In this embodiment, the detection target may be any preset object, for example, an autonomous vehicle, other vehicles traveling on a road, a person, and the like. The labeling information of a detection target may include three-dimensional position information, speed information, and the like, and the three-dimensional position information may include the center point position, size, rotation angle, and the like of the detection target. Taking the time sequence data as image data as an example, in the process of training the preset BEV single-frame model on the time sequence data, two-dimensional image features can be extracted from each image by a convolutional neural network in the model, where the convolutional neural network may be ResNet, MobileNet, VoVNet, or the like. The two-dimensional image features are then converted into BEV features in a preset manner, and different BEV features are fused using a feature pyramid network (FPN) for feature enhancement. Finally, the fused BEV features are encoded and decoded to obtain a detection result, a loss function is calculated from the detection result, and the model parameters are updated, finally yielding a trained BEV timing model capable of capturing the features of a detection target over a period of time.
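As an illustration of this training pipeline, the following is a minimal PyTorch sketch of a BEVDet-style skeleton. The view_transform placeholder stands in for the actual camera-to-BEV lifting step, which this description does not spell out; the backbone choice, channel counts and BEV grid size are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torchvision


class TinyBEVModel(nn.Module):
    """Skeleton of a BEVDet-style model (hypothetical, for illustration)."""

    def __init__(self, bev_channels=64, bev_size=128):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # 2D image feature extractor (ResNet without avgpool/fc head).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.reduce = nn.Conv2d(512, bev_channels, kernel_size=1)
        self.bev_size = bev_size

    def view_transform(self, feat2d):
        # Placeholder for the camera-to-BEV lifting step (e.g. lift-splat);
        # here the 2D features are simply resampled onto a BEV-sized grid.
        return nn.functional.adaptive_avg_pool2d(feat2d, self.bev_size)

    def forward(self, imgs):
        # imgs: (B, 3, H, W). A timing model would run this once per frame
        # to obtain N BEV feature maps; the single-frame model runs it once.
        feat2d = self.reduce(self.backbone(imgs))
        return self.view_transform(feat2d)  # (B, bev_channels, bev, bev)
```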
S120, aligning the BEV characteristics of the current frame of the BEV time sequence model and the BEV characteristics of the current frame of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to the alignment result.
In this embodiment, the current frame of the BEV timing model and the current frame of the BEV single-frame model are the same. Therefore, as shown in fig. 4, the target area corresponding to each detection target in the BEV space of the current frame may be determined first, then the current-frame BEV features of the BEV timing model and of the BEV single-frame model are aligned within each target area, and the knowledge of the BEV timing model can then be migrated into the BEV single-frame model through the corresponding loss function. Here, the BEV space may refer to a three-dimensional space viewed from a top-down perspective.
Optionally, aligning the BEV features of the current frame of the BEV timing model and the BEV features of the current frame of the BEV single frame model with respect to the region where the detection target is located, and calculating a first loss value according to the alignment result, including: determining a corresponding target area of each detection target of the current frame in the BEV space according to the labeling information of the current frame, and taking each target area as a mask; acquiring a first BEV characteristic of each target area corresponding to the BEV time sequence model and a second BEV characteristic of each target area corresponding to the BEV single frame model according to the mask; the first BEV feature and the second BEV feature are aligned and a first loss value corresponding to the alignment result is calculated from the first loss function.
In this embodiment, according to the labeling information of the detection targets corresponding to the current frame, for example, three-dimensional position information such as the center point position, size and rotation angle of each detection target, together with its speed information, the target area corresponding to each detection target of the current frame in the BEV space, i.e. the area of BEV space containing that detection target, may be determined, and the current-frame BEV features of the BEV timing model and of the BEV single-frame model may be aligned within each target area. Specifically, each target area may be used as a mask, and the current-frame BEV features may be masked according to each mask to obtain the first BEV feature of the BEV timing model in each target area and the second BEV feature of the BEV single-frame model in each target area. The first and second BEV features corresponding to each target area are compared, and a first loss value corresponding to the comparison result is calculated according to the first loss function; the first loss value is used to close the distance between the current-frame BEV features of the BEV timing model and the BEV single-frame model, migrating the knowledge of the BEV timing model to the BEV single-frame model.
The first loss value may be expressed as Lcur = Σ_i L1(Ftc, Fs) × mask_i, where Lcur is the first loss value, L1 is the first loss function, Ftc is the current-frame BEV feature of the BEV timing model, Fs is the current-frame BEV feature of the BEV single-frame model, and mask_i is the i-th mask.
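A minimal sketch of this region-based alignment loss, assuming axis-aligned target rectangles already projected onto the BEV grid and using an L1 distance as a stand-in for the otherwise unspecified first loss function:

```python
import torch

def region_alignment_loss(teacher_bev, student_bev, target_boxes):
    """teacher_bev, student_bev: (C, H, W) current-frame BEV features;
    target_boxes: list of (x0, y0, x1, y1) BEV-grid rectangles, one per
    detection target, derived from the current frame's labeling info."""
    mask = torch.zeros_like(teacher_bev[:1])  # (1, H, W) binary mask
    for x0, y0, x1, y1 in target_boxes:
        mask[:, y0:y1, x0:x1] = 1.0           # each target area as a mask
    # L1 distance between teacher and student, restricted to target areas;
    # the teacher is detached so only the student receives gradients.
    diff = torch.abs(teacher_bev.detach() - student_bev) * mask
    return diff.sum() / mask.sum().clamp(min=1.0)  # Lcur
```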
S130, aligning the BEV characteristics of the previous N-1 frame of the BEV time sequence model and the BEV characteristics of the current frame of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to the alignment result.
In this embodiment, as shown in fig. 4, considering that at different moments the same target may lose edge information, or be occluded and lose all of its information, due to the movement of the ego vehicle, the previous N-1 frames' BEV features of the BEV timing model need to be aligned with the current-frame BEV features of the BEV single-frame model at the center point of each detection target. Since the position of the ego vehicle changes between frames, the center point of each detection target in the previous N-1 frames of the BEV timing model can be converted from the coordinate system of its historical frame into that of the current frame, so that the coordinates of each such center point coincide with the coordinates of the same target's center point in the current frame of the BEV single-frame model. The previous N-1 frames' BEV features of the BEV timing model and the current-frame BEV features of the BEV single-frame model can thus be aligned at the center point position of each detection target, using the timing information to enhance prediction precision.
Optionally, aligning the first N-1 frame BEV feature of the BEV timing model and the current frame BEV feature of the BEV single frame model with respect to a center point position of the detection target, and calculating a second loss value according to the alignment result, including: determining a third center point coordinate of a center point of each detection target of the previous N-1 frame in the BEV space according to the labeling information of the previous N-1 frame; according to the labeling information of the current frame, determining a fourth center point coordinate of a center point of each detection target of the current frame in the BEV space; acquiring fifth BEV features of the BEV time sequence model corresponding to the coordinates of each third center point and sixth BEV features of the BEV single-frame model corresponding to the coordinates of each fourth center point; and aligning the fifth BEV feature and the sixth BEV feature aiming at the center point of the same detection target, and calculating a second loss value corresponding to the aligned result according to a second loss function.
In this embodiment, a third center point coordinate corresponding to each detection target center point of the previous N-1 frame in the BEV space may be determined according to labeling information of the detection target corresponding to the previous N-1 frame, for example, three-dimensional position information such as a center point position, a size of the detection target, a rotation angle, and speed information, where the third center point coordinate refers to a coordinate of an effective detection target center point in the field of view of the current frame after the conversion to the current frame. Correspondingly, according to the three-dimensional position information such as the position, the size and the rotation angle of the center point of the detection target corresponding to the current frame and the speed information, the fourth center point coordinate corresponding to the center point of each detection target of the current frame in the BEV space is determined. And acquiring a fifth BEV characteristic of the BEV time sequence model corresponding to each third center point coordinate, and acquiring a sixth BEV characteristic of the BEV single-frame model corresponding to each fourth center point coordinate. And comparing at least one fifth BEV feature and a unique sixth BEV feature corresponding to the center point of each detection target of the current frame, and calculating a second loss value corresponding to the comparison result according to a second loss function so as to further migrate knowledge of the BEV time sequence model to the BEV single frame model.
The second loss value may be expressed as Lpre = Σ_i L2(Ftp_i, Fs_i), where Lpre is the second loss value, L2 is the second loss function, i indexes the detection target center points of the current frame, Ftp_i is the fifth BEV feature corresponding to the i-th detection target center point, and Fs_i is the sixth BEV feature corresponding to the i-th detection target center point.
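A corresponding sketch of the center-point alignment loss, assuming the historical-frame center points have already been converted to current-frame BEV grid indices (see the next embodiment) and using a mean-squared error as a stand-in for the unspecified second loss function:

```python
import torch

def center_point_alignment_loss(teacher_bev_hist, student_bev, centers):
    """teacher_bev_hist: (C, H, W) BEV features of one historical frame;
    student_bev: (C, H, W) current-frame BEV features of the student;
    centers: list of (row, col) grid indices of matched target centers."""
    losses = []
    for r, c in centers:
        t = teacher_bev_hist[:, r, c].detach()  # fifth BEV feature (teacher)
        s = student_bev[:, r, c]                # sixth BEV feature (student)
        losses.append(torch.mean((t - s) ** 2))
    # Called once per historical frame and summed over the N-1 frames.
    return torch.stack(losses).sum() if losses else student_bev.new_zeros(())
```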
In an alternative embodiment, determining the third center point coordinates of the center point of each detection target of the previous N-1 frames in the BEV space according to the labeling information of the previous N-1 frames includes: for each historical frame in the previous N-1 frames, determining the historical frame coordinate corresponding to the center point of each detection target of that historical frame according to the labeling information of the historical frame; calculating the product of each historical frame coordinate, the global conversion matrix and the current conversion matrix to obtain the current frame coordinate corresponding to the center point of each detection target of the historical frame, where the global conversion matrix is the coordinate conversion matrix from the historical frame to the global frame and the current conversion matrix is the coordinate conversion matrix from the global frame to the current frame; and screening out the effective detection targets in the historical frames that lie within the field of view of the current frame, and taking the current frame coordinates corresponding to the center points of the effective detection targets as the third center point coordinates in the BEV space.
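The chained conversion and field-of-view screening can be sketched as follows; homogeneous 4x4 transform matrices and a rectangular field of view are assumptions made for illustration:

```python
import numpy as np

def to_current_frame(center_xyz, T_hist_to_global, T_global_to_cur, fov):
    """center_xyz: (3,) target center in historical-frame coordinates;
    T_hist_to_global, T_global_to_cur: (4, 4) homogeneous transforms;
    fov: (x_half_extent, y_half_extent) of the current frame's view."""
    p = np.append(np.asarray(center_xyz, dtype=float), 1.0)
    # historical frame -> global frame -> current frame
    p_cur = T_global_to_cur @ (T_hist_to_global @ p)
    # Screening: keep only targets inside the current field of view.
    in_fov = abs(p_cur[0]) <= fov[0] and abs(p_cur[1]) <= fov[1]
    return p_cur[:3], in_fov
```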
S140, updating parameters of the BEV single-frame model according to the first loss value, the second loss value and the self loss value of the BEV single-frame model.
Optionally, updating parameters of the BEV single frame model according to the first loss value, the second loss value, and the BEV single frame model own loss value includes: encoding and decoding the BEV characteristics of the current frame of the BEV single-frame model, and calculating the BEV single-frame model self-loss value corresponding to the decoding result according to the third loss function; and calculating a weighted sum of the first loss value, the second loss value and the loss value of the BEV single frame model according to a preset weighted formula, and updating parameters of the BEV single frame model according to the weighted sum of the loss values.
In this embodiment, as shown in fig. 4, the current-frame BEV features of the BEV single-frame model may be encoded by a BEV encoder, which may be a ResNet-plus-FPN structure, for subsequent decoding. The features output by the BEV encoder are further decoded by heads corresponding to specific tasks, such as a detection head for 3D object detection. The result output by the BEV decoder is constrained against the labeling information by the third loss function, and the BEV single-frame model's own loss value Lori corresponding to the decoding result is calculated. Finally, according to a preset weighting formula Ltotal = Lori + αLcur + βLpre, the first loss value, the second loss value and the single-frame model's own loss value are weighted and summed into an overall loss function, and the parameters of the BEV single-frame model are updated according to the weighted sum of the loss values.
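A compact sketch of the resulting update step; the weights alpha and beta are hyperparameters whose values the patent does not fix:

```python
import torch

def distillation_step(l_ori, l_cur, l_pre, optimizer, alpha=0.5, beta=0.5):
    """l_ori: the student's own task loss (third loss function);
    l_cur, l_pre: the two alignment losses computed above."""
    l_total = l_ori + alpha * l_cur + beta * l_pre  # Ltotal = Lori + aLcur + bLpre
    optimizer.zero_grad()
    l_total.backward()  # teacher features were detached, so only the
    optimizer.step()    # BEV single-frame (student) parameters are updated
    return l_total.detach()
```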
According to the technical scheme, BEV features are extracted from input time sequence data through a trained BEV time sequence model to obtain current frame BEV features and front N-1 frame BEV features, and BEV features are extracted from input single frame data through an untrained BEV single frame model to obtain current frame BEV features; aligning the current frame BEV characteristics of the BEV time sequence model and the current frame BEV characteristics of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to an alignment result; aligning the first N-1 frame BEV characteristic of the BEV time sequence model and the current frame BEV characteristic of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to the alignment result; according to the first loss value, the second loss value and the self loss value of the BEV single frame model, the parameters of the BEV single frame model are updated, the problem that knowledge distillation can only be carried out on the BEV single frame model in the prior art is solved, knowledge in the BEV time sequence model with time sequence information is transferred to the BEV single frame model, and the performance of the BEV single frame model is effectively improved.
Example two
Fig. 5 is a schematic diagram of a BEV timing model distillation apparatus according to a second embodiment of the present invention. As shown in fig. 5, the apparatus includes:
the feature extraction module 510 is configured to extract BEV features from the input time sequence data through the trained BEV time sequence model to obtain BEV features of the current frame and BEV features of the previous N-1 frame, and extract BEV features from the input single frame data through the untrained BEV single frame model to obtain BEV features of the current frame;
the feature region alignment module 520 is configured to align the BEV feature of the current frame of the BEV timing model and the BEV feature of the current frame of the BEV single frame model with respect to a region where the detection target is located, and calculate a first loss value according to an alignment result;
the feature point alignment module 530 is configured to align the BEV feature of the previous N-1 frame of the BEV timing model and the BEV feature of the current frame of the BEV single frame model with respect to the center point position of the detection target, and calculate a second loss value according to the alignment result;
the parameter updating module 540 is configured to update parameters of the BEV single-frame model according to the first loss value, the second loss value, and the self loss value of the BEV single-frame model.
According to the technical scheme, BEV features are extracted from input time sequence data through a trained BEV time sequence model to obtain current frame BEV features and front N-1 frame BEV features, and BEV features are extracted from input single frame data through an untrained BEV single frame model to obtain current frame BEV features; aligning the current frame BEV characteristics of the BEV time sequence model and the current frame BEV characteristics of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to an alignment result; aligning the first N-1 frame BEV characteristic of the BEV time sequence model and the current frame BEV characteristic of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to the alignment result; according to the first loss value, the second loss value and the self loss value of the BEV single frame model, the parameters of the BEV single frame model are updated, the problem that knowledge distillation can only be carried out on the BEV single frame model in the prior art is solved, knowledge in the BEV time sequence model with time sequence information is transferred to the BEV single frame model, and the performance of the BEV single frame model is effectively improved.
Optionally, the apparatus further comprises a model training module configured, before BEV features are extracted from the input time sequence data by the trained BEV time sequence model to obtain the current frame BEV features and the previous N-1 frames' BEV features, to:
acquire time sequence data for model training; the time sequence data includes: a plurality of training key frames together with labeling information of the detection targets corresponding to each training key frame, and the N-1 historical frames preceding each training key frame together with labeling information of the detection targets corresponding to each historical frame;
and train a preset BEV single-frame model according to the time sequence data to obtain the trained BEV time sequence model.
Optionally, the feature region alignment module 520 is configured to:
determining a corresponding target area of each detection target of the current frame in the BEV space according to the labeling information of the current frame, and taking each target area as a mask;
acquiring a first BEV characteristic of each target area corresponding to the BEV time sequence model and a second BEV characteristic of each target area corresponding to the BEV single frame model according to the mask;
the first BEV feature and the second BEV feature are aligned and a first loss value corresponding to the alignment result is calculated from the first loss function.
Optionally, the feature point alignment module 530 is configured to:
determining a third center point coordinate of a center point of each detection target of the previous N-1 frame in the BEV space according to the labeling information of the previous N-1 frame;
according to the labeling information of the current frame, determining a fourth center point coordinate of a center point of each detection target of the current frame in the BEV space;
acquiring fifth BEV features of the BEV time sequence model corresponding to the coordinates of each third center point and sixth BEV features of the BEV single-frame model corresponding to the coordinates of each fourth center point;
and aligning the fifth BEV feature and the sixth BEV feature aiming at the center point of the same detection target, and calculating a second loss value corresponding to the aligned result according to a second loss function.
Optionally, the feature point alignment module 530 is configured to:
for each historical frame in the previous N-1 frames, determining a historical frame coordinate corresponding to the center point of each detection target of the historical frame according to the labeling information of the historical frame;
calculating the product of each historical frame coordinate, the global conversion matrix and the current conversion matrix to obtain the current frame coordinate corresponding to the center point of each detection target of the historical frame;
the global conversion matrix is a coordinate conversion matrix from a historical frame to a global frame, and the current conversion matrix is a coordinate conversion matrix from the global frame to the current frame;
and screening the effective detection targets in the history frames, which are positioned in the field of view of the current frames, and taking the coordinates of the current frames corresponding to the central points of the effective detection targets as the coordinates of the third central points in the BEV space.
Optionally, the parameter updating module 540 is configured to:
encoding and decoding the BEV characteristics of the current frame of the BEV single-frame model, and calculating the BEV single-frame model self-loss value corresponding to the decoding result according to the third loss function;
and calculating a weighted sum of the first loss value, the second loss value and the loss value of the BEV single frame model according to a preset weighted formula, and updating parameters of the BEV single frame model according to the weighted sum of the loss values.
Optionally, the BEV timing model comprises a BEVDet4D model and the BEV single frame model comprises a BEVDet model.
The BEV time sequence model distillation device provided by the embodiment of the invention can execute the BEV time sequence model distillation method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example III
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the invention described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the BEV timing model distillation method.
In some embodiments, the BEV timing model distillation method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When a computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the BEV timing model distillation method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the BEV timing model distillation method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS services.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.
Claims (10)
1. A BEV timing model distillation method, comprising:
extracting BEV features from the input time sequence data through a trained BEV time sequence model to obtain current frame BEV features and front N-1 frame BEV features, and extracting BEV features from the input single frame data through an untrained BEV single frame model to obtain current frame BEV features;
aligning the current frame BEV characteristics of the BEV time sequence model and the current frame BEV characteristics of the BEV single frame model aiming at the region where the detection target is located, and calculating a first loss value according to an alignment result;
aligning the first N-1 frame BEV characteristic of the BEV time sequence model and the current frame BEV characteristic of the BEV single frame model aiming at the center point position of the detection target, and calculating a second loss value according to an alignment result;
and updating parameters of the BEV single-frame model according to the first loss value, the second loss value and the self loss value of the BEV single-frame model.
2. The method of claim 1, further comprising, prior to extracting BEV features from the input timing data by the trained BEV timing model, obtaining current frame BEV features and previous N-1 frame BEV features:
acquiring time sequence data for model training; the time sequence data comprises: a plurality of training key frames together with labeling information of the detection targets corresponding to each training key frame, and the N-1 historical frames preceding each training key frame together with labeling information of the detection targets corresponding to each historical frame;
and training a preset BEV single-frame model according to the time sequence data to obtain a trained BEV time sequence model.
3. The method of claim 2, wherein aligning the current-frame BEV features of the BEV time sequence model with the current-frame BEV features of the BEV single-frame model over the regions where the detection targets are located, and calculating the first loss value from the alignment result, comprises:
determining, according to the labeling information of the current frame, the target area in the BEV space corresponding to each detection target of the current frame, and taking each target area as a mask;
acquiring, according to the masks, a first BEV feature of the BEV time sequence model corresponding to each target area and a second BEV feature of the BEV single-frame model corresponding to each target area; and
aligning the first BEV feature with the second BEV feature, and calculating the first loss value corresponding to the alignment result according to a first loss function.
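(Illustrative note, not part of the claims.) A sketch of claim 3 under stated assumptions: target areas are rasterized as axis-aligned boxes on the BEV grid, and masked MSE stands in for the unspecified first loss function, so that the distillation signal is confined to BEV cells covered by labeled targets.

```python
import torch


def boxes_to_mask(boxes_bev: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Rasterize target areas (axis-aligned BEV boxes in grid cells, each row
    (x0, y0, x1, y1)) into one boolean mask over the (h, w) BEV grid."""
    mask = torch.zeros(h, w, dtype=torch.bool)
    for x0, y0, x1, y1 in boxes_bev.round().long().tolist():
        mask[y0:y1, x0:x1] = True
    return mask


def region_align_loss(teacher_bev: torch.Tensor,
                      student_bev: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    """First loss value: masked MSE between the teacher's and the student's
    current-frame BEV features, computed only inside the target areas.

    teacher_bev, student_bev: (C, H, W); mask: (H, W) boolean.
    """
    m = mask.float().unsqueeze(0)                           # (1, H, W), broadcast over C
    diff = (teacher_bev - student_bev) ** 2 * m             # zero outside the masks
    denom = (m.sum() * teacher_bev.shape[0]).clamp(min=1.0)
    return diff.sum() / denom                               # mean over masked cells and channels
```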
4. The method of claim 2, wherein aligning the BEV features of the previous N-1 frames of the BEV time sequence model with the current-frame BEV features of the BEV single-frame model at the center point positions of the detection targets, and calculating the second loss value from the alignment result, comprises:
determining, according to the labeling information of the previous N-1 frames, a third center point coordinate in the BEV space for the center point of each detection target of the previous N-1 frames;
determining, according to the labeling information of the current frame, a fourth center point coordinate in the BEV space for the center point of each detection target of the current frame;
acquiring a fifth BEV feature of the BEV time sequence model corresponding to each third center point coordinate and a sixth BEV feature of the BEV single-frame model corresponding to each fourth center point coordinate; and
aligning the fifth BEV feature with the sixth BEV feature for the center point of the same detection target, and calculating the second loss value corresponding to the alignment result according to a second loss function.
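(Illustrative note, not part of the claims.) A sketch of claim 4 under stated assumptions: features are sampled at integer BEV grid coordinates, the two coordinate tensors are row-aligned per detection target, and smooth L1 stands in for the unspecified second loss function.

```python
import torch
import torch.nn.functional as F


def center_align_loss(teacher_hist_bev: torch.Tensor,
                      student_cur_bev: torch.Tensor,
                      third_centers: torch.Tensor,
                      fourth_centers: torch.Tensor) -> torch.Tensor:
    """Second loss value: compare the teacher's previous-frame BEV features
    sampled at the third center point coordinates with the student's
    current-frame BEV features sampled at the fourth center point coordinates.
    Row i of both coordinate tensors must refer to the same detection target.

    teacher_hist_bev, student_cur_bev: (C, H, W)
    third_centers, fourth_centers: (M, 2) integer (x, y) BEV grid coordinates
    """
    fifth = teacher_hist_bev[:, third_centers[:, 1], third_centers[:, 0]]   # (C, M) fifth BEV features
    sixth = student_cur_bev[:, fourth_centers[:, 1], fourth_centers[:, 0]]  # (C, M) sixth BEV features
    return F.smooth_l1_loss(sixth, fifth)
```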
5. The method of claim 4, wherein determining, according to the labeling information of the previous N-1 frames, the third center point coordinates in the BEV space for the center points of the detection targets of the previous N-1 frames comprises:
for each historical frame among the previous N-1 frames, determining, according to the labeling information of the historical frame, the historical-frame coordinates corresponding to the center point of each detection target of the historical frame;
calculating the product of each historical-frame coordinate, the global conversion matrix, and the current conversion matrix to obtain the current-frame coordinates corresponding to the center point of each detection target of the historical frame,
wherein the global conversion matrix is the coordinate conversion matrix from the historical frame to the global frame, and the current conversion matrix is the coordinate conversion matrix from the global frame to the current frame; and
screening out the effective detection targets of the historical frame that lie within the field of view of the current frame, and taking the current-frame coordinates corresponding to the center points of the effective detection targets as the third center point coordinates in the BEV space.
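(Illustrative note, not part of the claims.) A sketch of claim 5's coordinate chain, p_cur = T_global→current · T_hist→global · p_hist, using 4x4 homogeneous matrices. The field-of-view screening uses an assumed ±51.2 m BEV range, a common grid extent, not a value from the patent.

```python
import torch


def history_center_to_current(center_hist: torch.Tensor,
                              hist_to_global: torch.Tensor,
                              global_to_current: torch.Tensor) -> torch.Tensor:
    """Transform one historical center point (x, y, z) into current-frame
    coordinates via the product with the global and current conversion matrices."""
    p = torch.cat([center_hist, center_hist.new_ones(1)])  # homogeneous (x, y, z, 1)
    p_cur = global_to_current @ hist_to_global @ p         # history -> global -> current
    return p_cur[:3]


def in_current_fov(p_cur: torch.Tensor,
                   x_range=(-51.2, 51.2), y_range=(-51.2, 51.2)) -> bool:
    """Screen for effective detection targets: keep a transformed center only
    if it falls inside the current frame's BEV field of view."""
    inside_x = x_range[0] <= float(p_cur[0]) <= x_range[1]
    inside_y = y_range[0] <= float(p_cur[1]) <= y_range[1]
    return inside_x and inside_y
```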
6. The method of claim 1, wherein updating the parameters of the BEV single-frame model according to the first loss value, the second loss value, and the BEV single-frame model's own loss value comprises:
encoding and decoding the current-frame BEV features of the BEV single-frame model, and calculating the BEV single-frame model's own loss value corresponding to the decoding result according to a third loss function; and
calculating the weighted sum of the first loss value, the second loss value, and the BEV single-frame model's own loss value according to a preset weighting formula, and updating the parameters of the BEV single-frame model according to the weighted sum of the loss values.
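(Illustrative note, not part of the claims.) A sketch of claim 6 under stated assumptions: `bev_encoder`, `detection_head`, and `head_loss` are hypothetical attribute names standing in for the encode/decode stages and the third loss function, and the weights are assumed hyperparameters.

```python
import torch


def student_own_loss(student, s_cur_bev: torch.Tensor, labels) -> torch.Tensor:
    """BEV single-frame model's own loss value: encode and decode the student's
    current-frame BEV features, then score the decoded result with a third
    loss function."""
    encoded = student.bev_encoder(s_cur_bev)    # encode the current-frame BEV features
    decoded = student.detection_head(encoded)   # decode into detection predictions
    return student.head_loss(decoded, labels)   # third loss function vs. the labels


def update_student(optimizer, loss_1: torch.Tensor, loss_2: torch.Tensor,
                   loss_self: torch.Tensor, w=(1.0, 1.0, 1.0)) -> None:
    """Preset weighting formula: drive the parameter update from the weighted
    sum of the first, second, and own loss values."""
    total = w[0] * loss_1 + w[1] * loss_2 + w[2] * loss_self
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
```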
7. The method of claim 1, wherein the BEV time sequence model comprises a BEVDet4D model and the BEV single-frame model comprises a BEVDet model.
8. A BEV time sequence model distillation apparatus, comprising:
a feature extraction module, configured to extract BEV features from input time sequence data through a trained BEV time sequence model to obtain current-frame BEV features and the BEV features of the previous N-1 frames, and to extract BEV features from input single-frame data through an untrained BEV single-frame model to obtain current-frame BEV features;
a feature region alignment module, configured to align the current-frame BEV features of the BEV time sequence model with the current-frame BEV features of the BEV single-frame model over the regions where the detection targets are located, and to calculate a first loss value from the alignment result;
a feature point alignment module, configured to align the BEV features of the previous N-1 frames of the BEV time sequence model with the current-frame BEV features of the BEV single-frame model at the center point positions of the detection targets, and to calculate a second loss value from the alignment result; and
a parameter updating module, configured to update the parameters of the BEV single-frame model according to the first loss value, the second loss value, and the BEV single-frame model's own loss value.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the BEV time sequence model distillation method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to perform the BEV time sequence model distillation method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311615040.2A CN117593713A (en) | 2023-11-29 | 2023-11-29 | BEV time sequence model distillation method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117593713A (en) | 2024-02-23
Family
ID=89912998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311615040.2A (Pending) | BEV time sequence model distillation method, device, equipment and medium | 2023-11-29 | 2023-11-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117593713A (en) |
Similar Documents
Publication | Title
---|---
CN112966587B (en) | Training method of target detection model, target detection method and related equipment
EP3926526A2 (en) | Optical character recognition method and apparatus, electronic device and storage medium
CN113674421B (en) | 3D target detection method, model training method, related device and electronic equipment
CN112560684B (en) | Lane line detection method, lane line detection device, electronic equipment, storage medium and vehicle
CN110675635B (en) | Method and device for acquiring external parameters of camera, electronic equipment and storage medium
CN113947188A (en) | Training method of target detection network and vehicle detection method
CN112561879A (en) | Ambiguity evaluation model training method, image ambiguity evaluation method and device
CN113205041A (en) | Structured information extraction method, device, equipment and storage medium
CN116740355A (en) | Automatic driving image segmentation method, device, equipment and storage medium
CN113705380B (en) | Target detection method and device for foggy days, electronic equipment and storage medium
CN117911891A (en) | Equipment identification method and device, electronic equipment and storage medium
CN115239889B (en) | Training method of 3D reconstruction network, 3D reconstruction method, device, equipment and medium
CN115761698A (en) | Target detection method, device, equipment and storage medium
CN117593713A (en) | BEV time sequence model distillation method, device, equipment and medium
CN114529801A (en) | Target detection method, device, equipment and storage medium
CN113936158A (en) | Label matching method and device
CN113920273A (en) | Image processing method, image processing device, electronic equipment and storage medium
CN118552860B (en) | Obstacle detection method and device, electronic equipment and storage medium
CN115908982B (en) | Image processing method, model training method, device, equipment and storage medium
CN118411381B (en) | Boundary coordinate detection method, device, electronic equipment and storage medium
CN114581746B (en) | Object detection method, device, equipment and medium
CN116310403A (en) | Target tracking method, device, electronic equipment and readable storage medium
CN117392022A (en) | Model training method, video rain removing method, device, equipment and storage medium
CN117372477A (en) | Target tracking matching method, device, equipment and medium
CN116977524A (en) | Three-dimensional map construction method and device, electronic equipment and storage medium
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||