CN114758199A - Training method, device, equipment and storage medium for detection model

Info

Publication number
CN114758199A
Authority
CN
China
Prior art keywords
model, student, teacher, loss value, detection
Prior art date
Legal status
Pending
Application number
CN202210665491.6A
Other languages
Chinese (zh)
Inventor
李林超
王威
周凯
张腾飞
Current Assignee
Zhejiang Zhuoyun Intelligent Technology Co ltd
Original Assignee
Zhejiang Zhuoyun Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Zhuoyun Intelligent Technology Co., Ltd.
Priority to CN202210665491.6A
Publication of CN114758199A
Priority to CN202211615723.3A (published as CN115797735A)

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00 Pattern recognition
            • G06F18/20 Analysing
              • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
              • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method, device, equipment and storage medium for detection models. The method comprises: setting each detection model among a plurality of detection models in turn as the teacher model, with the remaining models serving as student models; determining, according to sample data and the prior knowledge corresponding to that sample data, a first target loss value with which the student models learn the teacher models at the bottleneck layer, a second target loss value with which they learn on the relevance of detection targets, and a third target loss value with which they learn on the regression and classification of detection targets; and training each detection model based on the first, second and third target loss values together with the regression loss value and classification loss value that model obtains from learning the prior knowledge, to obtain the corresponding target detection model. The method shortens the training time of the detection models, improves their training efficiency, and improves the detection performance of all the detection models.

Description

Training method, device, equipment and storage medium for detection model
Technical Field
The embodiments of the present application relate to the field of computer technology, and in particular to a training method, apparatus, device and storage medium for detection models.
Background
With the continuous development of deep learning, it has been widely applied in many fields, for example object detection in computer vision. To obtain better detection results, the backbone network and bottleneck layer of a detection model are usually optimized and made more complex; as a result, the inference speed of the detection model in real-world deployment is low, and the requirements on hardware are high.
Based on the above, model-compression methods such as quantization, pruning and distillation have been proposed, among which distillation works best. However, distillation is still time-consuming when training the student models, and the detection performance of a student model is limited by the learning ability and the distilled knowledge of its teacher model.
Disclosure of Invention
The application provides a training method, device, equipment and storage medium for a detection model.
In a first aspect, an embodiment of the present application provides a training method for a detection model, including:
setting one detection model of a plurality of detection models as a teacher model, and setting the remaining detection models as student models;
inputting sample data into the teacher model and each student model to obtain the feature values output by the bottleneck layers of the teacher model and of each student model;
determining, according to prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, a loss value with which each student model learns the teacher model at the bottleneck layer, a loss value with which each student model learns the teacher model on the relevance of detection targets, and a loss value with which each student model learns the teacher model on the regression and classification of detection targets;
setting the next detection model of the plurality of detection models as the teacher model and the remaining models as student models, and returning to the step of inputting sample data into the teacher model and each student model, until the last detection model of the plurality of detection models has been set as the teacher model with the remaining models as student models, so as to obtain, for each student model, a first loss-value set for learning the teacher models at the bottleneck layer, a second loss-value set for learning the teacher models on the relevance of detection targets, and a third loss-value set for learning the teacher models on the regression and classification of detection targets;
processing the first, second and third loss-value sets respectively to obtain a first target loss value learned by each student model at the bottleneck layer, a second target loss value learned by each student model on the relevance of detection targets, and a third target loss value learned by each student model on the regression and classification of detection targets;
and training each detection model based on the first, second and third target loss values together with the regression loss value and classification loss value obtained by that model from learning the prior knowledge, to obtain the corresponding target detection model.
In a second aspect, an embodiment of the present application provides a training apparatus for a detection model, including:
a setting module, configured to set one detection model of the plurality of detection models as a teacher model, and to set the remaining detection models as student models;
a first processing module, configured to input sample data into the teacher model and each student model to obtain the feature values output by the bottleneck layers of the teacher model and of each student model;
a determining module, configured to determine, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, a loss value with which each student model learns the teacher model at the bottleneck layer, a loss value with which each student model learns the teacher model on the relevance of detection targets, and a loss value with which each student model learns the teacher model on the regression and classification of detection targets;
wherein the setting module is further configured to set the next detection model of the plurality of detection models as the teacher model and the remaining models as student models, and to repeat the operation of the first processing module, until the setting module has set the last detection model of the plurality of detection models as the teacher model with the remaining models as student models, so as to obtain, for each student model, a first loss-value set for learning the teacher models at the bottleneck layer, a second loss-value set for learning the teacher models on the relevance of detection targets, and a third loss-value set for learning the teacher models on the regression and classification of detection targets;
a second processing module, configured to process the first, second and third loss-value sets respectively to obtain a first target loss value learned by each student model at the bottleneck layer, a second target loss value learned by each student model on the relevance of detection targets, and a third target loss value learned by each student model on the regression and classification of detection targets;
and a third processing module, configured to train each detection model based on the first, second and third target loss values together with the regression loss value and classification loss value obtained by that model from learning the prior knowledge, to obtain the corresponding target detection model.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of the training method for a detection model provided in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the training method for a detection model provided in the first aspect of the embodiments of the present application.
According to the technical solution provided by the embodiments of the present application, each of the plurality of detection models can serve both as a teacher model and as a student model; the detection models learn from one another and are trained together rather than in stages, which shortens the training time of the detection models, improves training efficiency, and solves the problem in conventional distillation that training is limited by the teacher model's detection capability. Meanwhile, the prior knowledge corresponding to the sample data is introduced during mutual learning to filter the knowledge provided by the teacher models, so that each student model learns only correct black-box knowledge from multiple teacher models at the bottleneck layer, on the relevance of detection targets, and on the regression and classification of detection targets. In other words, each detection model absorbs the other models' correct learning on the sample data, which improves the learning capability of every detection model and therefore the detection performance of them all.
Drawings
Fig. 1 is a schematic flowchart of a training method for a detection model according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of determining the loss values with which the student models learn the teacher models at the bottleneck layer according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the principle of determining the loss values with which the student models learn the teacher models at the bottleneck layer according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of determining the loss values with which the student models learn the teacher models on the relevance of detection targets according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the principle of determining the loss values with which the student models learn the teacher models on the relevance of detection targets according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of determining the loss values with which the student models learn the teacher models on the regression and classification of detection targets according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the principle of determining the loss values with which the student models learn the teacher models on the regression and classification of detection targets according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training apparatus for a detection model according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the application and do not limit it. It should further be noted that, for ease of description, the drawings show only the structures related to the present application rather than all structures.
Fig. 1 is a schematic flowchart of a training method for a detection model according to an embodiment of the present application. As shown in Fig. 1, the method may include:
s101, setting one detection model in the plurality of detection models as a teacher model, and setting the rest models as student models.
In this embodiment, a plurality of detection models are trained synchronously, and during training each of them is set in turn as the teacher model with the other models as student models, so that every student model can learn the black-box knowledge of several teacher models. At any given time, one of the plurality of detection models is the teacher model and the remaining models are student models. Taking n detection models as an example, detection model 1 may be set as the teacher model, and detection models 2 to n as the student models. The detection models differ from one another in model parameters and size.
S102, inputting sample data into the teacher model and each student model to obtain the feature values output by the bottleneck layers of the teacher model and of each student model.
Each detection model may include a corresponding backbone network and bottleneck layer. The sample data is the training data of the detection models; after being obtained, it is input into the teacher model and each student model, and the feature values output by the bottleneck layers of the teacher model and of each student model are obtained through processing by each model's backbone network and bottleneck layer.
S103, determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, a loss value with which each student model learns the teacher model at the bottleneck layer, a loss value with which each student model learns the teacher model on the relevance of detection targets, and a loss value with which each student model learns the teacher model on the regression and classification of detection targets.
The prior knowledge is the label data corresponding to the sample data, so the information it provides is accurate. During learning, the knowledge of the teacher model is screened with the prior knowledge corresponding to the sample data, so that each student model learns only the teacher model's correct knowledge. The student models learn the teacher model in three directions, namely at the bottleneck layer, on the relevance of detection targets, and on the regression and classification of detection targets; mutual learning in these three directions improves the accuracy of the student models. A detection target can be understood as a candidate box output by a detection model.
Therefore, the feature values output by the teacher model are screened with the prior knowledge to sift out correct knowledge, and the loss values of each student model in the three aspects above are determined from this correct knowledge and the feature values output by the bottleneck layer of each student model. Continuing the example in S101, after steps S102-S103 the loss values with which the student models (detection models 2 to n) learn the teacher model (detection model 1) in the three aspects above are obtained.
S104, setting the next detection model of the plurality of detection models as the teacher model and the remaining models as student models, and returning to the step of inputting sample data into the teacher model and each student model to obtain the feature values output by their bottleneck layers, until the last detection model of the plurality of detection models has been set as the teacher model with the remaining models as student models, so as to obtain, for each student model, a first loss-value set for learning the teacher models at the bottleneck layer, a second loss-value set for learning the teacher models on the relevance of detection targets, and a third loss-value set for learning the teacher models on the regression and classification of detection targets.
In other words, another detection model is then set as the teacher model, with the remaining models as student models. Continuing the example in S101, detection model 2 of the n detection models is next set as the teacher model, with detection model 1 and detection models 3 to n as student models, and the process of S102-S103 is executed again. This continues until detection model n is set as the teacher model with detection models 1 to n-1 as student models and S102-S103 is executed one final time. As a result, detection model 1, as a student model, obtains from its n-1 teacher models (detection models 2 to n) a first loss-value set at the bottleneck layer, a second loss-value set on the relevance of detection targets, and a third loss-value set on the regression and classification of detection targets; detection model 2, as a student model, likewise obtains its three loss-value sets from its n-1 teacher models (detection model 1 and detection models 3 to n); and so on, until detection model n, as a student model, obtains its three loss-value sets from its n-1 teacher models (detection models 1 to n-1).
S105, processing the first, second and third loss-value sets respectively to obtain a first target loss value learned by each student model at the bottleneck layer, a second target loss value learned by each student model on the relevance of detection targets, and a third target loss value learned by each student model on the regression and classification of detection targets.
Each of the first, second and third loss-value sets contains the loss values with which a student model learns from multiple teacher models. By processing these sets, every student model is guided by multiple teacher models; choosing a suitable aggregation for each set allows a student model to absorb, to the maximum extent, the distilled knowledge that remedies its own deficiencies, learning both the knowledge of the best teacher model and the knowledge of all teacher models comprehensively.
Optionally, the first target loss value mutually learned by each student model at the bottleneck layer may be determined as follows: the maximum loss value in each first loss-value set is taken as the first target loss value of the corresponding student model at the bottleneck layer. Optimizing the student model with this first target loss value lets it learn the knowledge of the best teacher model among the multiple teacher models. Taking detection model 1 as the student model, its first loss-value set contains first loss value 1, first loss value 2, ..., first loss value n-1, where first loss value 1 is the loss value with which detection model 1 learns detection model 2 at the bottleneck layer, first loss value 2 is the loss value with which detection model 1 learns detection model 3 at the bottleneck layer, and so on, with first loss value n-1 being the loss value with which detection model 1 learns detection model n at the bottleneck layer. The maximum among first loss values 1 to n-1 is then determined; supposing it is first loss value 2, that value is taken as the first target loss value of detection model 1 at the bottleneck layer. When the other detection models act as student models, the calculation follows the same pattern as for detection model 1 and is not repeated here.
Optionally, the second target loss value mutually learned by each student model on the relevance of detection targets may be determined as follows: the loss values in each second loss-value set are averaged, and the result is taken as the second target loss value of the corresponding student model on the relevance of detection targets. Optimizing the student model with this second target loss value lets it learn the knowledge of the multiple teacher models comprehensively. Taking detection model 1 as the student model, its second loss-value set contains second loss value 1, second loss value 2, ..., second loss value n-1, where second loss value 1 is the loss value with which detection model 1 learns detection model 2 on the relevance of detection targets, second loss value 2 the corresponding value for detection model 3, and so on up to second loss value n-1 for detection model n. Second loss values 1 to n-1 are averaged, and the result is the second target loss value of detection model 1 on the relevance of detection targets.
Optionally, the third target loss value mutually learned by each student model on the regression and classification of detection targets may be determined as follows: the loss values in each third loss-value set are averaged, and the result is taken as the third target loss value of the corresponding student model on the regression and classification of detection targets. Optimizing the student model with this third target loss value again lets it learn the knowledge of the multiple teacher models comprehensively. Taking detection model 1 as the student model, its third loss-value set contains third loss value 1, third loss value 2, ..., third loss value n-1, where third loss value 1 is the loss value with which detection model 1 learns detection model 2 on the regression and classification of detection targets, third loss value 2 the corresponding value for detection model 3, and so on up to third loss value n-1 for detection model n. Third loss values 1 to n-1 are averaged, and the result is the third target loss value of detection model 1 on the regression and classification of detection targets.
S106, training each detection model based on the first, second and third target loss values together with the regression loss value and classification loss value obtained by that model from learning the prior knowledge, to obtain the corresponding target detection model.
Here, the regression loss value and classification loss value with which each detection model learns the prior knowledge are that model's self-learning loss values: the regression loss value can be calculated from the prior knowledge and the positions of the candidate boxes the model produces for the sample data, and the classification loss value from the prior knowledge and the categories of those candidate boxes. The first, second and third target loss values obtained during mutual learning are summed with the regression and classification loss values obtained during self-learning to give each detection model's total loss value; each detection model then back-propagates independently on its own total loss value, and its parameters are optimized until it converges, yielding the corresponding target detection model. A plurality of detection models are thus trained simultaneously.
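As a concrete illustration of S101-S106, the following is a minimal PyTorch-style sketch of one mutual-learning training step. It assumes hypothetical helpers bottleneck_loss, correlation_loss, pseudo_label_loss (corresponding to Equations 1-4 below) and self_loss (the ordinary supervised regression and classification losses); none of these names come from the patent, and the aggregation follows the optional max/mean choices described above.

```python
import torch

def mutual_learning_step(models, optimizers, samples, prior_boxes):
    """One training step in which every detection model serves once as the
    teacher (S101, S104) while the remaining models act as students."""
    n = len(models)
    first_set = [[] for _ in range(n)]   # bottleneck-layer losses per student
    second_set = [[] for _ in range(n)]  # detection-target relevance losses
    third_set = [[] for _ in range(n)]   # regression/classification losses

    for t in range(n):                       # model t is the teacher
        feats_t = models[t](samples)         # S102: bottleneck feature values
        for s in range(n):
            if s == t:
                continue                     # the remaining models are students
            feats_s = models[s](samples)
            # S103: prior knowledge screens the teacher's knowledge (Eqs. 1-4)
            first_set[s].append(bottleneck_loss(feats_t.detach(), feats_s, prior_boxes))
            second_set[s].append(correlation_loss(feats_t.detach(), feats_s, prior_boxes))
            third_set[s].append(pseudo_label_loss(feats_t.detach(), feats_s, prior_boxes))

    for s in range(n):
        # S105: max over teachers at the bottleneck layer, mean elsewhere
        l1 = torch.stack(first_set[s]).max()
        l2 = torch.stack(second_set[s]).mean()
        l3 = torch.stack(third_set[s]).mean()
        reg_loss, cls_loss = self_loss(models[s], samples, prior_boxes)
        total = l1 + l2 + l3 + reg_loss + cls_loss   # S106: summed total loss
        optimizers[s].zero_grad()
        total.backward()             # each model back-propagates independently
        optimizers[s].step()
```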
In this way, every detection model is both a teacher model and a student model: all of them participate in knowledge guidance and learn the correct black-box knowledge of the other detection models, so during training each model's detection performance keeps improving and its false-positive rate keeps falling, while it also passes its own correct knowledge on to the others. The detection performance of all the detection models therefore improves, which overcomes the limitation in the prior art that the teacher model's detection capability is fixed.
According to the technical solution provided by this embodiment, each of the plurality of detection models can serve both as a teacher model and as a student model; the models learn from one another and are trained together rather than in stages, which shortens training time, improves training efficiency, and removes the conventional limitation of distillation by the teacher model's detection capability. Meanwhile, the prior knowledge corresponding to the sample data is introduced during mutual learning to filter the knowledge provided by the teacher models, so that each student model learns only correct black-box knowledge from multiple teacher models at the bottleneck layer, on the relevance of detection targets, and on the regression and classification of detection targets; that is, each model absorbs the others' correct learning on the sample data, improving the learning capability of every detection model and the detection performance of them all.
In an embodiment, optionally, as shown in fig. 2 and fig. 3, the process of determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, the loss value with which each student model learns the teacher model at the bottleneck layer may be as follows:
s201, processing the characteristic value output by the bottleneck layer of the teacher model by using the priori knowledge to obtain a processed characteristic value.
The feature value output by the bottleneck layer of the teacher model may be decoded to obtain a candidate box of the teacher model, where the priori knowledge may be a manually marked prior box, and the prior box and the candidate box are subjected to Intersection-over-unity (IoU), and the candidate box with a processing result of IoU greater than a preset threshold is determined as a target candidate box. The preset threshold may be set based on a requirement, and optionally, the preset threshold may be set to 0.4. The target candidate frame may be regarded as a highly accurate candidate frame. And then mapping the target candidate frame to a bottleneck layer of the teacher model to obtain a processed characteristic value.
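A minimal sketch of this screening step, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the helper name filter_by_prior is illustrative, not from the patent:

```python
import torch
from torchvision.ops import box_iou

def filter_by_prior(candidate_boxes, prior_boxes, iou_thresh=0.4):
    """Keep only candidate boxes whose IoU with at least one manually
    marked prior box exceeds the preset threshold (0.4 in this embodiment)."""
    iou = box_iou(candidate_boxes, prior_boxes)  # shape (num_candidates, num_priors)
    keep = iou.max(dim=1).values > iou_thresh
    return candidate_boxes[keep]
```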
Optionally, since the feature values output by the bottleneck layer are produced by conventional convolutions, whose receptive field is limited, a self-attention module may be applied to the processed feature values so that the features extracted by the teacher model become global and more representative. Further, optionally, the feature values output by the bottleneck layer of each student model may also be passed through a self-attention operation, enlarging their receptive field so that the student models can learn the relations among more pixels and their extracted features likewise become global and representative.
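One possible form of such a self-attention operation is a single-head non-local block over the (B, C, H, W) bottleneck feature map; this is a sketch of the general technique, not the patent's exact module:

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Single-head self-attention over a (B, C, H, W) feature map,
    giving every spatial position a global receptive field."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # (B, HW, C)
        k = self.k(x).flatten(2)                        # (B, C, HW)
        v = self.v(x).flatten(2).transpose(1, 2)        # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                  # residual connection
```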
S202, mapping the processed feature values onto each student model to obtain the mapped feature values of each student model.
That is, taking each student model as the reference, the processed feature values are mapped onto that student model, giving each student model its mapped feature values.
S203, determining the numbers of background feature values and of foreground feature values of the teacher model based on the processed feature values.
The processed feature values can be regarded as foreground feature values, and everything else in the sample as background feature values; the numbers of foreground and background feature values can therefore be counted from the processed feature values.
S204, determining the loss value with which each student model learns the teacher model at the bottleneck layer, based on the feature values output by the bottleneck layer of each student model, the mapped feature values, the number of background feature values and the number of foreground feature values.
Once the feature values output by the bottleneck layer of each student model, the mapped feature values, and the numbers of background and foreground feature values are available, the loss value with which each student model learns the teacher model at the bottleneck layer can be determined according to the following formula 1.
Equation 1:

$$
L_{bn}=\frac{\alpha}{num_{fg}}\sum_{c=1}^{C}\sum_{i=1}^{H_{true}}\sum_{j=1}^{W_{true}} M_{i,j}\left(F^{map}_{c,i,j}-F^{S}_{c,i,j}\right)^{2}+\frac{\beta}{num_{bg}}\sum_{c=1}^{C}\sum_{i=1}^{H_{true}}\sum_{j=1}^{W_{true}}\left(1-M_{i,j}\right)\left(F^{map}_{c,i,j}-F^{S}_{c,i,j}\right)^{2}
$$

where \(F^{map}\) is the mapped feature value, \(F^{S}\) is the feature value output by the bottleneck layer of the student model, \(num_{fg}\) is the number of foreground feature values, \(num_{bg}\) is the number of background feature values, C is the number of channels of the teacher model's bottleneck layer, \(W_{true}\) is the width of the teacher model's bottleneck layer, \(H_{true}\) is its height, and α and β are hyper-parameters, which may both be set to 0.1 in this embodiment. \(M_{i,j}\) takes 1 when the foreground feature-value term is calculated and 0 when the background feature-value term is calculated. (The published formula is only available as an image; the expression above is reconstructed from these variable definitions.)
It should be noted that, because foreground feature values are far fewer than background feature values, the loss values corresponding to the foreground and background feature values are calculated separately and then summed; this increases the influence of the foreground feature values on the detection model and reduces that of the background feature values.
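A sketch of Equation 1 as reconstructed above, with the foreground and background terms normalized separately (variable names are illustrative):

```python
import torch

def bottleneck_loss_from_mask(f_map, f_student, fg_mask, alpha=0.1, beta=0.1):
    """Squared feature difference between the mapped teacher features and the
    student's bottleneck features, split into foreground (M=1) and background
    (M=0) terms so the scarce foreground is not drowned out by the background.

    f_map, f_student: (C, H, W) tensors; fg_mask: (H, W) binary mask.
    """
    num_fg = fg_mask.sum().clamp(min=1)
    num_bg = (1 - fg_mask).sum().clamp(min=1)
    diff2 = (f_map - f_student) ** 2
    fg_term = (diff2 * fg_mask).sum() / num_fg        # weighted by alpha
    bg_term = (diff2 * (1 - fg_mask)).sum() / num_bg  # weighted by beta
    return alpha * fg_term + beta * bg_term
```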
In this embodiment, the prior knowledge is used to screen the teacher model's knowledge so that the teacher model provides more accurate knowledge; at the same time, the self-attention operation on the feature values output by the teacher model gives the extracted features a global receptive field, which improves the detection performance of the detection model.
In an embodiment, optionally, as shown in fig. 4 and fig. 5, the process of determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, the loss value with which each student model learns the teacher model on the relevance of detection targets may be as follows:
S401, performing a detection-head operation on the feature values output by the bottleneck layers of the teacher model and of each student model to obtain the teacher candidate boxes of the teacher model and the student candidate boxes of each student model.
After the feature values output by the bottleneck layers of the teacher model and of each student model are obtained, the detection-head operation is performed on each of them, yielding the teacher candidate boxes of the teacher model and the student candidate boxes of each student model. The detection-head operation is a conventional technique in this field; its concrete calculation may follow any prior art and is not repeated here.
S402, processing the teacher candidate boxes with the prior knowledge to obtain intermediate teacher candidate boxes.
The prior knowledge here may be manually marked prior boxes: IoU is computed between the teacher candidate boxes and the prior boxes, and every candidate box whose IoU exceeds a preset threshold is kept as an intermediate teacher candidate box. The preset threshold may be set as required and may optionally be 0.4. Processing the teacher candidate boxes with the prior boxes filters out obviously wrong ones and retains those of higher accuracy, so the intermediate teacher candidate boxes can be regarded as candidate boxes of high accuracy.
S403, for each student model, processing the intermediate teacher candidate boxes based on the confidences of the intermediate teacher candidate boxes and of that model's student candidate boxes, to obtain fused teacher candidate boxes.
The confidence of a fused teacher candidate box is greater than or equal to that of the corresponding intermediate teacher candidate box. To let the teacher model provide still more accurate black-box knowledge, the intermediate teacher candidate boxes can be processed once more in combination with the student candidate boxes of each student model. Optionally, the confidences of an intermediate teacher candidate box and the matching student candidate box may be compared: if the confidence of the intermediate teacher candidate box is greater than or equal to that of the student candidate box, the intermediate teacher candidate box is retained; otherwise it is replaced by the student candidate box. In this way, although the different student models share the same teacher model, after the teacher's intermediate candidate boxes are processed with each student model's candidate boxes, every student model has its own set of fused teacher candidate boxes.
Further, a non-maximum suppression operation may be applied to the fused teacher candidate boxes to remove duplicates among them.
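A sketch of this fusion rule followed by the non-maximum suppression step, assuming the intermediate teacher boxes and student boxes have already been matched one-to-one (the 0.5 NMS threshold is an assumed value, not given in the patent):

```python
import torch
from torchvision.ops import nms

def fuse_candidates(t_boxes, t_scores, s_boxes, s_scores, nms_thresh=0.5):
    """Keep each intermediate teacher box unless the matched student box has
    strictly higher confidence, then suppress duplicate boxes with NMS."""
    keep_teacher = (t_scores >= s_scores).unsqueeze(1)  # (N, 1), broadcast over coords
    boxes = torch.where(keep_teacher, t_boxes, s_boxes)
    scores = torch.maximum(t_scores, s_scores)
    kept = nms(boxes, scores, nms_thresh)               # drop duplicate boxes
    return boxes[kept], scores[kept]
```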
S404, determining a first correlation between the fused teacher candidate boxes and a second correlation between the student candidate boxes of each student model.
Specifically, the following formula 2 may be used to calculate the first correlation between the fused teacher candidate boxes of the teacher model and the second correlation between the student candidate boxes of each student model.
Equation 2:

$$
\Phi(x)_{i,j}=\frac{x_i^{\top}x_j}{\lVert x_i\rVert\,\lVert x_j\rVert},\qquad i,j=1,\dots,Z
$$

where Z denotes the number of candidate boxes, \(x_i\) denotes the feature of the i-th candidate box, and Φ(x) denotes the correlation between the candidate boxes. (The published formula is only available as an image; a normalized pairwise correlation of this form is consistent with the definitions given.)
Here, take detection model 1 as the teacher model and detection models 2 to n as the student models; each of detection models 2 to n then has its own fused teacher candidate boxes. For detection model 2, the first correlation between its fused teacher candidate boxes and the second correlation between its own student candidate boxes are both calculated with formula 2; and so on, up to detection model n, whose first and second correlations are calculated in the same way.
S405, determining, based on the first correlations and the second correlations, the loss value with which each student model learns the teacher model on the relevance of detection targets.
Specifically, a smooth-L1 operation may be applied to the difference between the first correlation of the fused teacher candidate boxes of the teacher model and the second correlation of the student candidate boxes of each student model, yielding the loss value with which each student model learns the teacher model on the relevance of detection targets. This loss value LOSS_cor can be calculated using the following formula 3.
Equation 3:

$$
LOSS_{cor}=\mathrm{smooth}_{L1}\big(\Phi(t)-\Phi(s)\big)
$$

where Φ(t) is the first correlation between the fused teacher candidate boxes of the teacher model, Φ(s) is the second correlation between the student candidate boxes of the student model, t is the feature of the fused teacher candidate boxes, and s is the feature of the student candidate boxes.
In this embodiment, the candidate boxes of the teacher model are screened using the prior knowledge, and the correlation between the screened candidate boxes is used to guide the student models, making the classification of the student models more accurate.
In an embodiment, optionally, as shown in fig. 6 and fig. 7, the process of determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, the loss value with which each student model learns the teacher model on the regression and classification of detection targets may be as follows:
S601, performing a detection-head operation on the feature values output by the bottleneck layers of the teacher model and of each student model to obtain the teacher candidate boxes of the teacher model and the student candidate boxes of each student model.
S602, for each student model, fusing the teacher candidate boxes and that model's student candidate boxes based on their confidences to obtain fused candidate boxes.
To further guide each student model to train in the correct direction, the teacher candidate boxes and student candidate boxes can be fused. Optionally, the confidences of a teacher candidate box and the matching student candidate box may be compared: if the confidence of the teacher candidate box is greater than or equal to that of the student candidate box, the teacher candidate box is retained; otherwise it is replaced by the student candidate box, yielding the fused candidate boxes. Further, a non-maximum suppression operation may be applied to the fused candidate boxes to remove duplicates among them.
S603, processing the fused candidate boxes with the prior knowledge to obtain pseudo-label data.
The prior knowledge here may be manually marked prior boxes: IoU is computed between the fused candidate boxes and the prior boxes, and every candidate box whose IoU exceeds a preset threshold is kept as pseudo-label data. The preset threshold may be set as required and may optionally be 0.4. Processing the fused candidate boxes with the prior knowledge filters out obviously wrong candidate boxes and retains those of high accuracy, so the pseudo-label data can be regarded as label data of high accuracy.
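This screening can reuse the same IoU filter sketched earlier (the illustrative filter_by_prior helper):

```python
# Fused candidate boxes whose IoU with a prior box exceeds 0.4 become pseudo labels.
pseudo_label_boxes = filter_by_prior(fused_boxes, prior_boxes, iou_thresh=0.4)
```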
S604, determining, using the pseudo-label data and the student candidate boxes of each student model, the loss value with which each student model learns the teacher model on the regression and classification of detection targets.
Specifically, the following formula 4 may be used to calculate the loss value Loss_label_box with which each student model learns the teacher model on the regression and classification of detection targets.
Equation 4:

$$
Loss_{label\_box}=\frac{1}{z}\sum_{i=1}^{z}\Big[\alpha\,L_{cls}\big(c_i^{label},c_i^{s}\big)+\beta\,L_{reg}\big(b_i^{label},b_i^{s}\big)\Big]
$$

where α and β denote hyper-parameters that may be set as required (optionally 0.1), \(L_{cls}\) denotes the classification loss function, \(L_{reg}\) denotes the regression loss function, \(c_i^{label}\) denotes the category of the i-th pseudo-label datum, \(c_i^{s}\) denotes the category of the i-th detection target of the student model, \(b_i^{label}\) denotes the position information of the i-th pseudo-label datum, \(b_i^{s}\) denotes the position information of the i-th detection target of the student model, and z is the number of pseudo-label data. (The published formula is only available as an image; the expression above is reconstructed from these variable definitions.)
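A sketch of Equation 4; cross-entropy and smooth-L1 are assumed choices for L_cls and L_reg, since the patent does not fix the concrete loss functions:

```python
import torch.nn.functional as F

def pseudo_label_box_loss(s_cls_logits, s_boxes, pl_classes, pl_boxes,
                          alpha=0.1, beta=0.1):
    """Classification + regression loss of a student model against the z
    pseudo labels (Equation 4), averaged over the pseudo-label data."""
    cls_term = F.cross_entropy(s_cls_logits, pl_classes)  # L_cls, mean over z
    reg_term = F.smooth_l1_loss(s_boxes, pl_boxes)        # L_reg, mean over z
    return alpha * cls_term + beta * reg_term
```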
In this embodiment, the student candidate boxes of each student model are fused with the teacher candidate boxes of the teacher model, the candidate boxes with higher confidence are selected as pseudo-label data, and the pseudo-label data are further screened with the prior knowledge so that they become more accurate. Training each detection model on this more accurate pseudo-label data improves the training effect of the detection models.
Fig. 8 is a schematic structural diagram of a training apparatus for detecting a model according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus may include: a setup module 801, a first processing module 802, a determination module 803, a second processing module 804 and a third processing module 805.
Specifically, the setting module 801 is configured to set one detection model of the plurality of detection models as a teacher model, and to set the remaining detection models as student models;
the first processing module 802 is configured to input sample data into the teacher model and each student model to obtain the feature values output by the bottleneck layers of the teacher model and of each student model;
the determining module 803 is configured to determine, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and of each student model, a loss value with which each student model learns the teacher model at the bottleneck layer, a loss value with which each student model learns the teacher model on the relevance of detection targets, and a loss value with which each student model learns the teacher model on the regression and classification of detection targets;
the setting module 801 is further configured to set the next detection model of the plurality of detection models as the teacher model and the remaining models as student models, and to repeat the operation of the first processing module 802, until the setting module has set the last detection model of the plurality of detection models as the teacher model with the remaining models as student models, so as to obtain, for each student model, a first loss-value set for learning the teacher models at the bottleneck layer, a second loss-value set for learning the teacher models on the relevance of detection targets, and a third loss-value set for learning the teacher models on the regression and classification of detection targets;
the second processing module 804 is configured to process the first, second and third loss-value sets respectively to obtain a first target loss value learned by each student model at the bottleneck layer, a second target loss value learned by each student model on the relevance of detection targets, and a third target loss value learned by each student model on the regression and classification of detection targets;
the third processing module 805 is configured to train each detection model based on the first, second and third target loss values together with the regression loss value and classification loss value obtained by that model from learning the prior knowledge, to obtain the corresponding target detection model.
According to this training apparatus for detection models, each of the plurality of detection models can serve both as a teacher model and as a student model; the models learn from one another and are trained together rather than in stages, which shortens training time, improves training efficiency, and removes the conventional limitation of distillation by the teacher model's detection capability. Meanwhile, the prior knowledge corresponding to the sample data is introduced during mutual learning to filter the knowledge provided by the teacher models, so that each student model learns only correct black-box knowledge from multiple teacher models at the bottleneck layer, on the relevance of detection targets, and on the regression and classification of detection targets; that is, each model absorbs the others' correct learning on the sample data, improving the learning capability of every detection model and the detection performance of them all.
On the basis of the foregoing embodiment, optionally, the determining module 803 may include: a first determination unit.
Specifically, the first determining unit is configured to process the feature values output by the bottleneck layer of the teacher model with the prior knowledge to obtain processed feature values; map the processed feature values onto each student model to obtain the mapped feature values of each student model; determine the numbers of background feature values and of foreground feature values of the teacher model based on the processed feature values; and determine the loss value with which each student model learns the teacher model at the bottleneck layer, based on the feature values output by the bottleneck layer of each student model, the mapped feature values, the number of background feature values and the number of foreground feature values.
On the basis of the foregoing embodiment, optionally, the first determining unit is further configured to, after processing the feature values output by the bottleneck layer of the teacher model with the prior knowledge to obtain the processed feature values, perform a self-attention operation on the processed feature values, and perform a self-attention operation on the feature values output by the bottleneck layer of each student model.
On the basis of the foregoing embodiment, optionally, the determining module 803 may further include: a second determination unit.
Specifically, the second determining unit is configured to perform a detection-head operation on the feature values output by the bottleneck layers of the teacher model and of each student model to obtain the teacher candidate boxes of the teacher model and the student candidate boxes of each student model; process the teacher candidate boxes with the prior knowledge to obtain intermediate teacher candidate boxes; for each student model, process the intermediate teacher candidate boxes based on the confidences of the intermediate teacher candidate boxes and the student candidate boxes to obtain fused teacher candidate boxes, the confidence of a fused teacher candidate box being greater than or equal to that of the corresponding intermediate teacher candidate box; determine a first correlation between the fused teacher candidate boxes and a second correlation between the student candidate boxes of each student model; and determine, based on the first correlations and the second correlations, the loss value with which each student model learns the teacher model on the relevance of detection targets.
On the basis of the foregoing embodiment, optionally, the second determining unit is further configured to compare the confidences of an intermediate teacher candidate box and the matching student candidate box; retain the intermediate teacher candidate box if its confidence is greater than or equal to that of the student candidate box; and replace the intermediate teacher candidate box with the student candidate box if its confidence is lower.
On the basis of the foregoing embodiment, optionally, the determining module 803 may further include: a third determination unit.
Specifically, the third determining unit is configured to perform a detection-head operation on the feature values output by the bottleneck layers of the teacher model and of each student model to obtain the teacher candidate boxes of the teacher model and the student candidate boxes of each student model; for each student model, fuse the teacher candidate boxes and the student candidate boxes based on their confidences to obtain fused candidate boxes; process the fused candidate boxes with the prior knowledge to obtain pseudo-label data; and determine, using the pseudo-label data and the student candidate boxes of each student model, the loss value with which each student model learns the teacher model on the regression and classification of detection targets.
On the basis of the foregoing embodiment, optionally, the second processing module 804 is specifically configured to determine the maximum loss value in each first loss-value set as the first target loss value learned by the corresponding student model at the bottleneck layer; average the loss values in each second loss-value set and determine the result as the second target loss value learned by the corresponding student model on the relevance of detection targets; and average the loss values in each third loss-value set and determine the result as the third target loss value learned by the corresponding student model on the regression and classification of detection targets.
Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, as shown in fig. 9, the electronic device may include a processor 90, a memory 91, an input device 92, and an output device 93; the number of the processors 90 in the electronic device may be one or more, and one processor 90 is taken as an example in fig. 9; the processor 90, the memory 91, the input device 92 and the output device 93 in the electronic apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 9.
The memory 91 serves as a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the training method of the detection model in the embodiment of the present application (for example, the setting module 801, the first processing module 802, the determining module 803, the second processing module 804, and the third processing module 805 in the training apparatus of the detection model). The processor 90 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 91, namely, implements the training method of the detection model described above.
The memory 91 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 91 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 91 may further include memory located remotely from the processor 90, which may be connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 92 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 93 may include a display device such as a display screen.
Embodiments of the present application also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, perform a method of training a detection model, the method comprising:
setting one detection model among a plurality of detection models as a teacher model, and setting the other detection models as student models;
inputting sample data into the teacher model and each student model to obtain feature values output by the bottleneck layers of the teacher model and each student model;
determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and each student model, a loss value of each student model in learning the teacher model at the bottleneck layer, a loss value of each student model in learning the teacher model in terms of the relevance of the detection target, and a loss value of each student model in learning the teacher model in terms of regression and classification of the detection target;
setting the next detection model in the plurality of detection models as the teacher model and the remaining models as student models, and continuing to execute the step of inputting sample data into the teacher model and each student model to obtain the feature values output by their bottleneck layers, until the last detection model in the plurality of detection models has been set as the teacher model with the remaining models as student models, thereby obtaining a first loss value set of each student model for learning each teacher model at the bottleneck layer, a second loss value set of each student model for learning each teacher model in terms of the relevance of the detection target, and a third loss value set of each student model for learning each teacher model in terms of regression and classification of the detection target;
processing the first loss value set, the second loss value set and the third loss value set respectively to obtain a first target loss value mutually learned by the student models at the bottleneck layer, a second target loss value mutually learned by the student models in terms of the relevance of the detection target, and a third target loss value mutually learned by the student models in terms of regression and classification of the detection target;
and training each detection model based on the first target loss value, the second target loss value, the third target loss value, and the regression and classification loss values of each detection model in learning the prior knowledge, to obtain the corresponding target detection models; a minimal sketch of one such training round is given below.
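The following PyTorch-style sketch shows one mutual-learning round of the method recited above. The hooks `bottleneck`, `bottleneck_loss`, `relevance_loss`, `pseudo_label_loss` and `supervised_loss` are assumed placeholders for the operations described in the embodiments; the patent does not prescribe this API. The reduction (max over the first set, mean over the second and third) follows claim 7.

```python
import torch

def train_round(models, optimizers, sample, prior, loss_fns):
    n = len(models)
    first = [[] for _ in range(n)]   # bottleneck-layer losses per student
    second = [[] for _ in range(n)]  # detection-target relevance losses
    third = [[] for _ in range(n)]   # regression/classification losses

    for t in range(n):                       # model t takes a turn as teacher
        with torch.no_grad():
            t_feat = models[t].bottleneck(sample)
        for s in range(n):
            if s == t:
                continue                     # the remaining models are students
            s_feat = models[s].bottleneck(sample)
            first[s].append(loss_fns.bottleneck_loss(t_feat, s_feat, prior))
            second[s].append(loss_fns.relevance_loss(t_feat, s_feat, prior))
            third[s].append(loss_fns.pseudo_label_loss(t_feat, s_feat, prior))

    for s in range(n):
        # Reduce the three loss sets (max / mean / mean) and add the model's
        # own supervised losses against the prior knowledge.
        distill = (torch.stack(first[s]).max()
                   + torch.stack(second[s]).mean()
                   + torch.stack(third[s]).mean())
        reg_loss, cls_loss = loss_fns.supervised_loss(models[s], sample, prior)
        optimizers[s].zero_grad()
        (distill + reg_loss + cls_loss).backward()
        optimizers[s].step()
```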
Of course, the computer-executable instructions contained in the storage medium provided in the embodiments of the present application are not limited to the method operations described above, and may also perform related operations in the training method of the detection model provided in any embodiment of the present application.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application can be implemented by software together with the necessary general-purpose hardware, or by dedicated hardware, although the former is the preferable embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server or a network device) to execute the methods described in the embodiments of the present application.
It should be noted that, in the embodiment of the above training apparatus, the included units and modules are merely divided according to functional logic, but the division is not limited thereto as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only used to distinguish them from one another and are not intended to limit the protection scope of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the application. Therefore, although the present application has been described in some detail with reference to the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit, the scope of the application being determined by the appended claims.

Claims (10)

1. A training method for a detection model, characterized by comprising the following steps:
setting one detection model among a plurality of detection models as a teacher model, and setting the other detection models as student models;
inputting sample data into the teacher model and each student model to obtain feature values output by the bottleneck layers of the teacher model and each student model;
determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and each student model, a loss value of each student model in learning the teacher model at the bottleneck layer, a loss value of each student model in learning the teacher model in terms of the relevance of the detection target, and a loss value of each student model in learning the teacher model in terms of regression and classification of the detection target;
setting the next detection model in the plurality of detection models as the teacher model and the remaining models as student models, and continuing to execute the step of inputting sample data into the teacher model and each student model to obtain the feature values output by their bottleneck layers, until the last detection model in the plurality of detection models has been set as the teacher model with the remaining models as student models, thereby obtaining a first loss value set of each student model for learning each teacher model at the bottleneck layer, a second loss value set of each student model for learning each teacher model in terms of the relevance of the detection target, and a third loss value set of each student model for learning each teacher model in terms of regression and classification of the detection target;
processing the first loss value set, the second loss value set and the third loss value set respectively to obtain a first target loss value mutually learned by the student models at the bottleneck layer, a second target loss value mutually learned by the student models in terms of the relevance of the detection target, and a third target loss value mutually learned by the student models in terms of regression and classification of the detection target;
and training each detection model based on the first target loss value, the second target loss value, the third target loss value, and the regression and classification loss values of each detection model in learning the prior knowledge, to obtain the corresponding target detection models.
2. The method according to claim 1, wherein determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and each student model, the loss value of each student model in learning the teacher model at the bottleneck layer comprises:
processing the feature values output by the bottleneck layer of the teacher model using the prior knowledge to obtain processed feature values;
mapping the processed feature values to each student model to obtain mapped feature values for each student model;
determining the number of background feature values and the number of foreground feature values of the teacher model based on the processed feature values;
and determining the loss value of each student model in learning the teacher model at the bottleneck layer based on the feature values output by the bottleneck layer of each student model, the mapped feature values, the number of background feature values and the number of foreground feature values.
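One plausible reading of claim 2's final step is sketched below: the foreground/background feature counts normalise a feature-imitation loss between the student's bottleneck output and the teacher features mapped into the student's space. The balanced-MSE form and the mask shape are assumptions; the claim only fixes the inputs.

```python
import torch

def bottleneck_loss(student_feat, mapped_teacher_feat, fg_mask):
    # fg_mask marks bottleneck positions covered by annotated objects (the
    # prior knowledge); foreground and background errors are normalised by
    # their respective counts so neither term dominates.
    fg = fg_mask.float()
    bg = 1.0 - fg
    n_fg = fg.sum().clamp(min=1.0)   # number of foreground feature values
    n_bg = bg.sum().clamp(min=1.0)   # number of background feature values
    err = (student_feat - mapped_teacher_feat.detach()).pow(2)
    return (err * fg).sum() / n_fg + (err * bg).sum() / n_bg
```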
3. The method of claim 2, wherein, after processing the feature values output by the bottleneck layer of the teacher model using the prior knowledge to obtain the processed feature values, the method further comprises:
performing a self-attention operation on the processed feature values, and performing a self-attention operation on the feature values output by the bottleneck layers of the student models.
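For concreteness, a plain scaled-dot-product version of the self-attention operation in claim 3 is sketched below; the patent does not fix the attention variant, so this is only one possible instantiation with no learned projections.

```python
import torch

def self_attention(feat):
    # Scaled dot-product self-attention over the spatial positions of a
    # (B, C, H, W) feature map; queries, keys and values are the features
    # themselves in this minimal variant.
    b, c, h, w = feat.shape
    x = feat.flatten(2).transpose(1, 2)                    # (B, HW, C)
    attn = torch.softmax(x @ x.transpose(1, 2) / c ** 0.5, dim=-1)
    out = attn @ x                                         # (B, HW, C)
    return out.transpose(1, 2).reshape(b, c, h, w)
```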
4. The method according to claim 1, wherein determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and each student model, the loss value of each student model in learning the teacher model in terms of the relevance of the detection target comprises:
performing a detection head operation on the feature values output by the bottleneck layers of the teacher model and the student models to obtain a teacher candidate frame of the teacher model and a student candidate frame of each student model;
processing the teacher candidate frame using the prior knowledge to obtain an intermediate teacher candidate frame;
for each student model, processing the intermediate teacher candidate frame based on the confidences of the intermediate teacher candidate frame and the student candidate frame to obtain a fused teacher candidate frame, wherein the confidence of the fused teacher candidate frame is greater than or equal to that of the intermediate teacher candidate frame;
determining a first relevance among the fused teacher candidate frames and a second relevance among the student candidate frames of each student model;
and determining the loss value of each student model in learning the teacher model in terms of the relevance of the detection target based on the first relevance and each second relevance.
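A hedged sketch of the relevance comparison in claim 4: pairwise similarity matrices are built over the fused teacher candidates and over the student candidates, and the student is penalised for deviating from the teacher-side structure. Cosine similarity over per-candidate embeddings is an assumed concrete choice; the claim only requires a "first relevance" and a "second relevance".

```python
import torch
import torch.nn.functional as F

def relevance_loss(fused_teacher_emb, student_emb):
    # Pairwise cosine-similarity matrices over matched candidates: the
    # "first relevance" on the (fused) teacher side and the "second
    # relevance" on the student side.
    t = F.normalize(fused_teacher_emb, dim=-1)   # (N, D)
    s = F.normalize(student_emb, dim=-1)         # (N, D)
    first_relevance = t @ t.T                    # (N, N)
    second_relevance = s @ s.T                   # (N, N)
    # Push the student's inter-candidate structure toward the teacher's.
    return F.mse_loss(second_relevance, first_relevance.detach())
```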
5. The method of claim 4, wherein processing the intermediate teacher candidate frame based on the confidences of the intermediate teacher candidate frame and the student candidate frame to obtain the fused teacher candidate frame comprises:
comparing the confidence of the intermediate teacher candidate frame with that of the student candidate frame;
if the confidence of the intermediate teacher candidate frame is greater than or equal to the confidence of the student candidate frame, retaining the intermediate teacher candidate frame;
and if the confidence of the intermediate teacher candidate frame is less than the confidence of the student candidate frame, replacing the intermediate teacher candidate frame with the student candidate frame.
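The comparison rule of claim 5 reduces to a couple of tensor operations. The sketch below (the same keep-or-replace rule used in the fusion sketch earlier) assumes one-to-one matched candidates, and makes explicit why the fused confidence can never fall below the intermediate teacher's, as claim 4 requires.

```python
import torch

def fuse_intermediate_teacher(t_boxes, t_conf, s_boxes, s_conf):
    # Retain the intermediate teacher candidate when its confidence is at
    # least the student's; otherwise replace it with the student candidate.
    keep = (t_conf >= s_conf).unsqueeze(-1)       # (N, 1)
    fused_boxes = torch.where(keep, t_boxes, s_boxes)
    # The fused confidence is the elementwise maximum, so it is always
    # greater than or equal to the intermediate teacher's confidence.
    fused_conf = torch.maximum(t_conf, s_conf)
    return fused_boxes, fused_conf
```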
6. The method of claim 1, wherein determining, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and each student model, the loss value of each student model in learning the teacher model in terms of regression and classification of the detection target comprises:
performing a detection head operation on the feature values output by the bottleneck layers of the teacher model and the student models to obtain a teacher candidate frame of the teacher model and a student candidate frame of each student model;
for each student model, fusing the teacher candidate frame and the student candidate frame based on their confidences to obtain fused candidate frames;
processing the fused candidate frames using the prior knowledge to obtain pseudo label data;
and determining the loss value of each student model in learning the teacher model in terms of regression and classification of the detection target using the pseudo label data and the student candidate frames of each student model.
7. The method according to any one of claims 1 to 6, wherein processing the first loss value set, the second loss value set and the third loss value set respectively to obtain a first target loss value mutually learned by the student models at the bottleneck layer, a second target loss value mutually learned by the student models in terms of the relevance of the detection target and a third target loss value mutually learned by the student models in terms of regression and classification of the detection target comprises:
determining the maximum loss value in each first loss value set as the first target loss value mutually learned by the student models at the bottleneck layer;
calculating the mean of the loss values in each second loss value set and determining the result as the second target loss value mutually learned by the student models in terms of the relevance of the detection target;
and calculating the mean of the loss values in each third loss value set and determining the result as the third target loss value mutually learned by the student models in terms of regression and classification of the detection target.
8. A training apparatus for a detection model, characterized by comprising:
a setting module, configured to set one detection model among a plurality of detection models as a teacher model and set the other detection models as student models;
a first processing module, configured to input sample data into the teacher model and each student model to obtain feature values output by the bottleneck layers of the teacher model and each student model;
a determining module, configured to determine, according to the prior knowledge corresponding to the sample data and the feature values output by the bottleneck layers of the teacher model and each student model, a loss value of each student model in learning the teacher model at the bottleneck layer, a loss value of each student model in learning the teacher model in terms of the relevance of the detection target, and a loss value of each student model in learning the teacher model in terms of regression and classification of the detection target;
wherein the setting module is further configured to set the next detection model in the plurality of detection models as the teacher model and the remaining models as student models, and to continue executing the operation of the first processing module, until the setting module has set the last detection model in the plurality of detection models as the teacher model with the remaining models as student models, thereby obtaining a first loss value set of each student model for learning each teacher model at the bottleneck layer, a second loss value set of each student model for learning each teacher model in terms of the relevance of the detection target, and a third loss value set of each student model for learning each teacher model in terms of regression and classification of the detection target;
a second processing module, configured to process the first loss value set, the second loss value set and the third loss value set respectively to obtain a first target loss value mutually learned by the student models at the bottleneck layer, a second target loss value mutually learned by the student models in terms of the relevance of the detection target, and a third target loss value mutually learned by the student models in terms of regression and classification of the detection target;
and a third processing module, configured to train each detection model based on the first target loss value, the second target loss value, the third target loss value, and the regression and classification loss values of each detection model in learning the prior knowledge, to obtain the corresponding target detection models.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the processors to perform the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210665491.6A 2022-06-14 2022-06-14 Training method, device, equipment and storage medium for detection model Pending CN114758199A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210665491.6A CN114758199A (en) 2022-06-14 2022-06-14 Training method, device, equipment and storage medium for detection model
CN202211615723.3A CN115797735A (en) 2022-06-14 2022-12-15 Target detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210665491.6A CN114758199A (en) 2022-06-14 2022-06-14 Training method, device, equipment and storage medium for detection model

Publications (1)

Publication Number Publication Date
CN114758199A (en) 2022-07-15

Family

ID=82337178

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210665491.6A Pending CN114758199A (en) 2022-06-14 2022-06-14 Training method, device, equipment and storage medium for detection model
CN202211615723.3A Pending CN115797735A (en) 2022-06-14 2022-12-15 Target detection method, device, equipment and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202211615723.3A Pending CN115797735A (en) 2022-06-14 2022-12-15 Target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (2) CN114758199A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541782A (en) * 2024-01-09 2024-02-09 北京闪马智建科技有限公司 Object identification method and device, storage medium and electronic device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071608A (en) * 2023-03-16 2023-05-05 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium
CN116071608B (en) * 2023-03-16 2023-06-06 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115797735A (en) 2023-03-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220715)