CN115359322A - Target detection model training method, device, equipment and storage medium - Google Patents

Target detection model training method, device, equipment and storage medium Download PDF

Info

Publication number
CN115359322A
CN115359322A (application CN202211067342.6A)
Authority
CN
China
Prior art keywords
model
value
loss value
original
detection
Prior art date
Legal status
Pending
Application number
CN202211067342.6A
Other languages
Chinese (zh)
Inventor
李林超
何林阳
王威
周凯
张腾飞
Current Assignee
Zhejiang Zhuoyun Intelligent Technology Co ltd
Original Assignee
Zhejiang Zhuoyun Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Zhuoyun Intelligent Technology Co ltd filed Critical Zhejiang Zhuoyun Intelligent Technology Co ltd
Priority to CN202211067342.6A priority Critical patent/CN115359322A/en
Publication of CN115359322A publication Critical patent/CN115359322A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection model training method, a device, equipment and a storage medium. The method comprises the following steps: inputting a sample image into a trained first model and an original second model, and acquiring a first characteristic value extracted by a bottleneck layer of the trained first model and a first detection result output by a detection layer, and a second characteristic value extracted by a bottleneck layer of the original second model and a second detection result output by the detection layer; determining a bottleneck layer distillation loss value of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result; determining a distillation loss value of a detection layer of the original second model according to the first detection result and the second detection result; and training the original second model based on the distillation loss value of the bottleneck layer and the distillation loss value of the detection layer, and taking the trained original second model as a target detection model. The technical scheme of the embodiment of the invention can optimize the training effect of the target detection model.

Description

Target detection model training method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for training a target detection model.
Background
With the rapid development of neural networks in the field of target detection, the function of a target detection model is more and more powerful. These target detection models often employ a deep neural network structure, which consumes a large amount of storage and computing resources, and the time consumption for target detection is also increasing.
To overcome the above problems, distillation algorithms have been developed. The distillation method is a model compression method, and the main idea is to utilize a trained teacher network to assist the training of a student network with low resource consumption so as to reduce the resource consumption in the target detection process and achieve the same target detection effect.
At present, in the process of student network training, the parameter adjustment of a bottleneck layer only considers the difference of characteristic values in the forward propagation process of a student model and a teacher model, and the information learned by the student model is limited, so that the trained student model cannot achieve the expected effect in the process of target detection.
Disclosure of Invention
The invention provides a target detection model training method, a device, equipment and a storage medium, which aim to solve the problem that a target detection model obtained based on distillation algorithm training cannot achieve the expected effect.
According to an aspect of the present invention, there is provided a target detection model training method, including:
inputting a sample image into the trained first model and the original second model, and acquiring a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer;
determining a bottleneck layer distillation loss value of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result;
determining a distillation loss value of a detection layer of the original second model according to the first detection result and the second detection result;
and training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
According to another aspect of the present invention, there is provided an object detection model training apparatus, including:
the characteristic value acquisition module is used for inputting the sample image into the trained first model and the original second model, acquiring a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and acquiring a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer;
the bottleneck layer loss value determining module is used for determining a bottleneck layer distillation loss value of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result;
the detection layer loss value determining module is used for determining a detection layer distillation loss value of the original second model according to the first detection result and the second detection result;
and the target detection model training module is used for training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method of object detection model training as described in any of the embodiments of the invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the method for training an object detection model according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme, firstly, a sample image is input into a trained first model and an original second model, a first characteristic value extracted from a bottleneck layer of the trained first model and a first detection result output by a detection layer are obtained, a second characteristic value extracted from a bottleneck layer of the original second model and a second detection result output by the detection layer are obtained, then, a bottleneck layer distillation loss value of the original second model is determined according to the first characteristic value, the second characteristic value, the first detection result and the second detection result, a detection layer distillation loss value of the original second model is determined according to the first detection result and the second detection result, finally, the original second model is trained on the basis of the bottleneck layer distillation loss value and the detection layer distillation loss value, and the trained original second model is used as a target detection model. In the training process, the detection result is reversely mapped to the bottleneck layer to determine the distillation loss value of the bottleneck layer, the characteristic value difference of the bottleneck layer and the prediction result difference of each position can be considered at the same time, and the model training effect is optimized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for training a target detection model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a target detection model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for training a target detection model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for training a target detection model according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device implementing the target detection model training method according to the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flowchart of a method for training an object detection model according to an embodiment of the present invention, which is applicable to the case of training an object detection model based on a knowledge distillation algorithm. The method can be performed by an object detection model training apparatus, which can be implemented in hardware and/or software and can be configured in various general-purpose computing devices. As shown in fig. 1, the method includes:
s110, inputting the sample image into the trained first model and the original second model, obtaining a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and obtaining a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer.
In the knowledge distillation-based model training method, a small network model with a simple structure is used to imitate a large network model with a complex structure, so that the small network model can approach the performance of the large network model. Specifically, the large network model is first trained with training data and preset data labels to obtain a trained large network model. Training data are then simultaneously input into the trained large network model and the small network model; the output of the large network model is used as a soft label, the preset data label is used as a hard label, and a soft-label-based loss function and a hard-label-based loss function of the small network model are constructed respectively. Finally, a total loss function is formed from the soft label loss function and the hard label loss function, the small network model is trained with it, and the trained small network model is taken as the target detection model.
In the embodiment of the invention, the trained first model is a trained large network model, also called a teacher model, and the original second model is a small network model to be trained, also called a student model. Both the trained first model and the original second model comprise a backbone network layer (Backbone Layer), a bottleneck layer (Neck Layer) and a detection layer (Head Layer). The backbone network layer is used for extracting features of the input data (such as images); the bottleneck layer is located between the backbone network layer and the detection layer and is used for feature fusion; the detection layer is used for locating and classifying targets in the input data.
In the embodiment of the invention, the sample image is input into the trained first model, the first characteristic value of the sample image extracted at the bottleneck layer in the forward propagation process of the trained first model is obtained, and the first detection result aiming at the sample image is output at the detection layer. Meanwhile, the sample image is input into the original second model, a second characteristic value of the sample image extracted by the original second model at the bottleneck layer in the forward propagation process is obtained, and a second detection result aiming at the sample image and output at the detection layer is obtained.
The sample image can be an image containing one or more objects, and the first feature value and the second feature value are feature data obtained by feature extraction of the sample image by the bottleneck layers of the trained first model and the original second model, respectively. The first detection result and the second detection result are classification results output by the detection layers of the trained first model and the original second model respectively. For example, the first detection result and the second detection result may include a position of a prediction frame for the sample image, a prediction category of the prediction frame, and a corresponding confidence, and may further include a probability that each prediction frame belongs to each category.
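The sketch below illustrates this step only as a PyTorch-style example, not as the patent's reference implementation: it assumes the teacher (trained first model) and student (original second model) are modules exposing backbone, neck (bottleneck) and head (detection) sub-modules, and that the head returns its detection result directly.

```python
# Illustrative sketch (assumed interfaces, not part of the patented scheme):
# run one batch through both models, keeping neck features and head outputs.
import torch

def forward_both(teacher, student, sample_images):
    """Returns the first/second characteristic values and first/second detection results."""
    teacher.eval()
    with torch.no_grad():                                         # teacher is frozen during distillation
        t_feat = teacher.neck(teacher.backbone(sample_images))    # first characteristic value
        t_det = teacher.head(t_feat)                              # first detection result
    s_feat = student.neck(student.backbone(sample_images))        # second characteristic value
    s_det = student.head(s_feat)                                  # second detection result
    return t_feat, t_det, s_feat, s_det
```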
And S120, determining a distillation loss value of the bottleneck layer of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result.
In the prior art, the bottleneck layer distillation loss value is generally determined by calculating the difference between the first characteristic value and the second characteristic value. This method only considers the difference between the feature values output by the bottleneck layers of the trained first model and the original second model during forward propagation, and does not consider the difference of the detection results corresponding to the feature value at each position; that is, the semantic information of each position is not mined.
In the embodiment of the invention, in order to overcome the above problem, the first characteristic value, the second characteristic value, the first detection result and the second detection result are used together to determine the bottleneck layer distillation loss value of the original second model. In this way, the bottleneck layer distillation loss value covers both the difference between the characteristic values and the difference of the detection results (that is, the category difference) at each position, so that the original second model learns more semantic information.
Specifically, the probability that each first prediction box in the first detection result belongs to each category is mapped to the bottleneck layer of the trained first model, and the probability that each position of the bottleneck layer belongs to each category is obtained and serves as a first probability mapping value. In a similar way, the probability that each second prediction box in the second detection result belongs to each category is mapped to the bottleneck layer of the original second model, and the probability that each position of the bottleneck layer belongs to each category is obtained and serves as a second probability mapping value. Further, according to the first mapping probability value and the second mapping probability value, the category loss value of the original second model at each position of the bottleneck layer is calculated, and according to the first characteristic value and the second characteristic value, the characteristic loss value of the original second model at each position of the bottleneck layer is calculated. And finally, the category loss value and the characteristic loss value jointly form the comprehensive loss of each position of the bottleneck layer, and the distillation loss value of the bottleneck layer is calculated according to the comprehensive loss of each position.
By reversely mapping the first detection result and the second detection result to the bottleneck layers of the respective corresponding models, the distillation loss value of the bottleneck layers can be obtained according to the category difference and the feature difference of each position of the bottleneck layers, so that the bottleneck layers of the original second model can learn the category semantic information, and the training effect of the original second model is optimized.
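As one possible illustration of this reverse mapping (the patent does not fix the exact rule), the sketch below writes each prediction box's per-class probabilities into every bottleneck-layer cell the box covers, assuming boxes are given in image coordinates and the neck feature map is `stride` times smaller.

```python
# Minimal sketch of mapping detection-layer class probabilities back onto the bottleneck layer.
import torch

def map_probs_to_neck(boxes, class_probs, height, width, stride, num_classes):
    """boxes: (N, 4) tensor [x1, y1, x2, y2]; class_probs: (N, C) tensor.
    Returns a (C, height, width) probability map over bottleneck-layer positions."""
    prob_map = torch.zeros(num_classes, height, width)
    for box, probs in zip(boxes, class_probs):
        x1, y1, x2, y2 = (box / stride).long().tolist()           # scale box to neck resolution
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2 + 1, width), min(y2 + 1, height)
        prob_map[:, y1:y2, x1:x2] = probs.view(-1, 1, 1)          # probability mapping value
    return prob_map
```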
And S130, determining a distillation loss value of the detection layer of the original second model according to the first detection result and the second detection result.
In the embodiment of the invention, after the distillation loss value of the bottleneck layer is determined, the distillation loss value of the detection layer of the original second model is further calculated according to the first detection result and the second detection result. Specifically, the position of each first detection frame in the first detection result and the position of each second detection frame in the second detection result are obtained first. Further, for each first detection frame, a second detection frame corresponding to the current first detection frame is determined, an intersection ratio of the current first detection frame and the corresponding second detection frame is calculated based on the position of the current first detection frame and the position of the corresponding second detection frame, and a position loss value of the second detection frame is determined according to the intersection ratio. Furthermore, the confidence of each first detection frame is obtained from the first detection result, the confidence of each second detection frame is obtained from the second detection result, and then the confidence loss value of the current second detection frame is calculated according to the confidence of the current first detection frame and the confidence of the corresponding second detection frame. Finally, a detection layer distillation loss value of the original second model is jointly determined based on the position loss value and the confidence loss value.
And calculating a position loss value according to the position information of each prediction frame output by the detection layer, calculating a confidence coefficient loss value according to the confidence coefficient of each prediction frame, and finally determining the distillation loss value of the detection layer by the position loss value and the confidence coefficient loss value together.
S140, training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
In the embodiment of the invention, the original second model is trained jointly based on the bottleneck layer distillation loss value and the detection layer distillation loss value determined in the above steps, and the trained original second model is used as the target detection model. Specifically, the total loss value of the original second model can be determined by the bottleneck layer distillation loss value, the detection layer distillation loss value, the classification loss of the artificial label and the regression loss of the artificial label. The total loss value $Loss_{sum\_student}$ is calculated as follows:

$$Loss_{sum\_student} = Loss_{class\_tag} + Loss_{bbox\_tag} + Loss_{prediction\_box} + Loss_{FPN}$$

where $Loss_{class\_tag}$ is the artificial label classification loss value, $Loss_{bbox\_tag}$ is the artificial label regression loss value, $Loss_{prediction\_box}$ is the detection layer distillation loss value, and $Loss_{FPN}$ is the bottleneck layer distillation loss value.
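For illustration only, a minimal sketch of one optimisation step that sums the four loss terms above and updates the student; the four loss tensors and the optimizer are assumed to have been produced elsewhere.

```python
# Sketch of one training step of the original second model (student), given the four loss terms.
import torch

def train_step(optimizer, loss_class_tag, loss_bbox_tag, loss_head_distill, loss_fpn_distill):
    """All loss arguments are scalar tensors that require grad w.r.t. the student parameters."""
    loss_sum_student = loss_class_tag + loss_bbox_tag + loss_head_distill + loss_fpn_distill
    optimizer.zero_grad()
    loss_sum_student.backward()     # backpropagate the combined distillation + label losses
    optimizer.step()
    return loss_sum_student.detach()
```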
The finally trained target detection model can be used for carrying out target detection on the input image: and inputting the image to be detected into a target detection model, and outputting a detection result aiming at the image to be detected by using the target detection model. The detection result comprises a prediction frame in the image to be detected, a category of the prediction frame containing the object, and a confidence degree aiming at the category.
According to the technical scheme, a sample image is input into a trained first model and an original second model, a first characteristic value extracted from a bottleneck layer of the trained first model and a first detection result output by a detection layer are obtained, a second characteristic value extracted from the bottleneck layer of the original second model and a second detection result output by the detection layer are obtained, a bottleneck layer distillation loss value of the original second model is determined according to the first characteristic value, the second characteristic value, the first detection result and the second detection result, a detection layer distillation loss value of the original second model is determined according to the first detection result and the second detection result, the original second model is trained on the basis of the bottleneck layer distillation loss value and the detection layer distillation loss value, and the trained original second model is used as a target detection model. In the training process, the detection result is reversely mapped to the bottleneck layer to determine the distillation loss value of the bottleneck layer, the characteristic value difference of the bottleneck layer and the prediction result difference of each position can be considered at the same time, and the model training effect is optimized.
Optionally, after the training of the target detection model is completed, the method further includes:
acquiring an image to be detected;
inputting an image to be detected into a target detection model, and determining at least one target in the image to be detected based on the output of the target detection model; the target detection model is obtained by training based on the target detection model training method in any embodiment of the invention.
In this optional embodiment, the target detection model is obtained by training the target detection model training method provided in any embodiment of the present invention. The image to be detected can be an image containing at least one target, the image to be detected is input into a trained target detection model, and the target detection model can output a detection result aiming at a specified target in the image to be detected. The detection result may include a prediction box where the target is located, the type information and the confidence level of the target, and the like.
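A minimal usage sketch of running the distilled target detection model on an image to be detected follows; the preprocessing, the tuple returned by the model and the confidence threshold are assumptions for illustration.

```python
# Sketch of inference with the trained target detection model.
import torch

def detect(model, image_tensor, conf_threshold=0.5):
    """image_tensor: (3, H, W) preprocessed image; returns boxes, labels, scores."""
    model.eval()
    with torch.no_grad():
        boxes, labels, scores = model(image_tensor.unsqueeze(0))   # add a batch dimension
        boxes, labels, scores = boxes[0], labels[0], scores[0]     # single-image batch
    keep = scores > conf_threshold                                 # keep confident predictions
    return boxes[keep], labels[keep], scores[keep]
```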
Fig. 2 is a flowchart of a method for training a target detection model according to an embodiment of the present invention, which is further detailed based on the above embodiment, and provides specific steps for determining a distillation loss value of a bottleneck layer of an original second model according to a first feature value, a second feature value, a first detection result, and a second detection result. As shown in fig. 2, the method includes:
s210, inputting the sample image into the trained first model and the original second model, obtaining a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and obtaining a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer.
S220, mapping the class probability of the first prediction box in the first detection result to the bottleneck layer of the trained first model to obtain a first probability mapping value, and mapping the class probability of the second prediction box in the second detection result to the bottleneck layer of the original second model to obtain a second probability mapping value.
In the embodiment of the invention, the probability that the first prediction box in the first detection result belongs to each category is mapped to the bottleneck layer of the trained first model, and the probability that each position of the bottleneck layer belongs to each category is obtained and used as a first probability mapping value. In a similar way, the probability that the second prediction box in the second detection result belongs to each category is mapped to the bottleneck layer of the original second model, and the probability that each position of the bottleneck layer belongs to each category is obtained and serves as a second mapping probability value.
In one specific example, the bottleneck layer of the trained first model has a width Width and a height Height. The probability that each first prediction box in the first detection result belongs to each category is mapped to the bottleneck layer of the trained first model, so as to obtain the probability that each position (i, j) of the bottleneck layer belongs to each category; for example, if the total number of categories is 10, each position of the bottleneck layer corresponds to 10 category probabilities.
And S230, determining the category loss value of the original second model at each position of the bottleneck layer according to the first probability mapping value and the second probability mapping value.
In the embodiment of the invention, after the first detection result and the second detection result are mapped to the bottleneck layer to obtain the first probability mapping value and the second probability mapping value, the category loss value of the original second model at each position of the bottleneck layer is determined according to the first probability mapping value and the second probability mapping value. Specifically, based on the first probability mapping value and the second probability mapping value, a KL divergence (Kullback-Leibler divergence) may be used to calculate the category loss value of the original second model at each position of the bottleneck layer. The category loss value $Class\_loss_{i,j}$ at position $(i, j)$ of the bottleneck layer is calculated as follows:

$$Class\_loss_{i,j} = \sum_{z=1}^{C} P^{T}_{i,j,z} \log \frac{P^{T}_{i,j,z}}{P^{S}_{i,j,z}}$$

where $i$ is the horizontal axis coordinate of the bottleneck layer, $j$ is the vertical axis coordinate of the bottleneck layer, $C$ is the total number of categories, $P^{T}_{i,j,z}$ is the probability that position $(i, j)$ of the trained first model bottleneck layer belongs to the z-th category, and $P^{S}_{i,j,z}$ is the probability that position $(i, j)$ of the original second model bottleneck layer belongs to the z-th category.
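A short sketch of this per-position KL category loss over the mapped probability maps (see the mapping sketch above); shapes and the small epsilon guard are illustrative assumptions.

```python
# Sketch of the per-position category loss on the bottleneck layer.
import torch

def class_loss_map(p_teacher, p_student, eps=1e-8):
    """p_teacher, p_student: (C, H, W) mapped probability maps.
    Returns an (H, W) map of Class_loss_{i,j} = sum_z P^T log(P^T / P^S)."""
    kl = p_teacher * (torch.log(p_teacher + eps) - torch.log(p_student + eps))
    return kl.sum(dim=0)   # sum over the C categories at every position (i, j)
```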
And S240, determining the characteristic loss value of the original second model at each position of the bottleneck layer according to the characteristic difference between the first characteristic value and the second characteristic value.
In the embodiment of the invention, the difference between the first characteristic value output by the trained first model bottleneck layer and the second characteristic value output by the original second model bottleneck layer is calculated, and the characteristic loss value of the original second model at each position of the bottleneck layer is determined based on the difference.
Illustratively, the first characteristic value corresponding to position $(i, j)$ of the trained first model bottleneck layer is $F^{T}_{i,j}$, and the second characteristic value corresponding to position $(i, j)$ of the original second model bottleneck layer is $F^{S}_{i,j}$; the characteristic loss value at that position is determined by the difference between the two, for example $\left(F^{T}_{i,j} - F^{S}_{i,j}\right)^{2}$.
And S250, determining the distillation loss value of the bottleneck layer of the original second model based on the class loss value and the characteristic loss value.
In the embodiment of the invention, after the category loss value and the characteristic loss value of the original second model at each position of the bottleneck layer are respectively obtained, the bottleneck layer distillation loss of the original second model is determined based on the category loss value and the characteristic loss value. In one implementation, the category loss value and the characteristic loss value at each position may be added to obtain a comprehensive loss value at each position, and the comprehensive loss values at all positions may then be summed to obtain the bottleneck layer distillation loss. In a further implementation, after the comprehensive loss value at each position is obtained, the bottleneck layer positions are divided into foreground positions and background positions according to the calibration mask matrix, the loss mean value of the foreground positions and the loss mean value of the background positions are calculated respectively, and the two loss mean values are summed to obtain the bottleneck layer distillation loss. The foreground loss mean value can be calculated from the comprehensive loss values of the foreground positions, the number of foreground positions and the difficult-case weight corresponding to each foreground position, where the difficult-case weight is determined by the confidence output by the original second model at that foreground position: the higher the confidence, the lower the difficult-case weight.
Optionally, determining a bottleneck layer distillation loss value of the original second model based on the class loss value and the characteristic loss value, including:
determining a comprehensive loss value of the original second model at each position of the bottleneck layer based on the category loss value and the characteristic loss value;
determining a foreground loss mean value and a background loss mean value of the bottleneck layer according to the comprehensive loss value and the calibration mask matrix; determining a calibration mask matrix according to a pre-calibrated foreground label and a pre-calibrated background label;
and forming a bottleneck layer distillation loss value of the original second model by the foreground loss mean value and the background loss mean value.
In this optional embodiment, a specific manner of determining the bottleneck layer distillation loss value of the original second model based on the category loss value and the characteristic loss value is provided. First, a comprehensive loss value of the original second model at each position of the bottleneck layer is determined based on the category loss value and the characteristic loss value at that position; specifically, the category loss value and the characteristic loss value at each position are summed, and the summation result is taken as the comprehensive loss value at that position. Because the bottleneck layer contains both foreground positions and background positions, in order to reduce the influence of the background, different calculations may be adopted for foreground and background positions when computing the comprehensive loss value: when the current position is a foreground position, the category loss value and the characteristic loss value at that position are summed directly to obtain the comprehensive loss value; when the current position is a background position, the characteristic loss value at that position is first scaled by a hyper-parameter, and the scaled characteristic loss value is summed with the category loss value to obtain the comprehensive loss value. The comprehensive loss value $Feature\_loss_{i,j}$ at bottleneck layer position $(i, j)$ is calculated as follows (the feature term is written here as a squared difference):

$$Feature\_loss_{i,j} = \begin{cases} Class\_loss_{i,j} + \left(F^{T}_{i,j} - F^{S}_{i,j}\right)^{2}, & M\_Teacher_{i,j} = 1 \ \text{(foreground)} \\ Class\_loss_{i,j} + \alpha \left(F^{T}_{i,j} - F^{S}_{i,j}\right)^{2}, & M\_Teacher_{i,j} = 0 \ \text{(background)} \end{cases}$$

where $Class\_loss_{i,j}$ is the category loss value of the original second model at position $(i, j)$, $M\_Teacher_{i,j}$ is a mask matrix determined based on the first detection result of the trained first model, $F^{T}_{i,j}$ is the first characteristic value of the trained first model at position $(i, j)$, $F^{S}_{i,j}$ is the second characteristic value of the original second model at position $(i, j)$, and $\alpha$ is the background weight.
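A brief sketch of this per-position comprehensive loss follows; the squared-difference feature term and the `bg_weight` hyper-parameter mirror the reconstructed formula above and are assumptions consistent with, but not spelled out by, the text.

```python
# Sketch of the per-position comprehensive loss on the bottleneck layer.
import torch

def feature_loss_map(f_teacher, f_student, class_loss, teacher_mask, bg_weight=0.5):
    """f_teacher, f_student: (C, H, W) neck features; class_loss, teacher_mask: (H, W)."""
    feat_diff = (f_teacher - f_student).pow(2).sum(dim=0)           # per-position feature loss
    scale = torch.where(teacher_mask > 0,                           # teacher-mask foreground
                        torch.ones_like(feat_diff),
                        torch.full_like(feat_diff, bg_weight))      # damp background feature loss
    return class_loss + scale * feat_diff                           # Feature_loss_{i,j}
```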
After the comprehensive loss value at each position of the bottleneck layer is obtained, a foreground loss mean value over the foreground positions and a background loss mean value over the background positions are determined from the comprehensive loss values according to the calibration mask matrix. Finally, the foreground loss mean value and the background loss mean value are summed to obtain the bottleneck layer distillation loss value of the original second model. The calibration mask matrix is determined according to manually calibrated foreground and background labels: the value at a foreground position in the calibration mask matrix is 1, and the value at a background position is 0.
When the distillation loss value of the bottleneck layer is calculated, the comprehensive loss value of the foreground position and the comprehensive loss value of the background position are comprehensively considered, on one hand, the influence of the distillation loss of the foreground and the background on the original second model can be balanced, on the other hand, all background information can participate in the distillation loss calculation, so that the original second model learns more background information, and the training effect of the model is optimized.
Optionally, determining the foreground loss mean value of the bottleneck layer according to the comprehensive loss value and the calibration mask matrix, including:
determining a comprehensive loss value of each foreground position in the comprehensive loss value of each position of the bottleneck layer according to the calibration mask matrix;
mapping the confidence coefficient of a second prediction frame in a second detection result to a bottleneck layer of an original second model to obtain a second confidence coefficient mapping value, and determining the difficult-case weight of each foreground position in the bottleneck layer according to the second confidence coefficient mapping value; the higher the second confidence coefficient mapping value is, the smaller the corresponding difficult-case weight is;
and calculating a weighted average value of the comprehensive loss values corresponding to the foreground positions in the bottleneck layer according to the difficult-example weight, and taking the weighted average value as the foreground loss average value of the bottleneck layer.
In this optional embodiment, a specific way of determining the foreground loss mean value of the bottleneck layer according to the comprehensive loss value and the calibration mask matrix is provided. First, according to the calibration mask matrix, the comprehensive loss value of each foreground position is determined among the comprehensive loss values of all positions of the bottleneck layer. Further, the confidence of the second prediction frame in the second detection result is mapped to the bottleneck layer of the original second model to obtain a second confidence mapping value, and the difficult-case weight of each foreground position in the bottleneck layer is determined based on the second confidence mapping value, where a higher second confidence mapping value corresponds to a smaller difficult-case weight. Finally, the comprehensive loss values of the foreground positions in the bottleneck layer are weighted and summed according to the difficult-case weights, and the average of the weighted sum is taken as the foreground loss mean value of the bottleneck layer. Determining the difficult-case weight of each foreground position from the confidence of the second prediction frame in the second detection result allows the original second model to distinguish difficult characteristic values from simple ones during distillation learning of the bottleneck layer and to increase the learning weight of the difficult cases, so that the model's weak points receive targeted learning.
Optionally, the background loss mean value may simply be the average of the comprehensive loss values of all background positions in the bottleneck layer.
Finally, the foreground loss mean value and the background loss mean value are summed to obtain the bottleneck layer distillation loss value $Loss_{FPN}$ of the original second model. The calculation formula is as follows (the difficult-case weight is written here as $1 - P^{S}_{i,j}$, so that a higher student confidence gives a smaller weight):

$$Loss_{FPN} = \frac{1}{Num_{fg}} \sum_{i=1}^{Width} \sum_{j=1}^{Height} M_{i,j}\,\bigl(1 - P^{S}_{i,j}\bigr)\, Feature\_loss_{i,j} \;+\; \frac{1}{Num_{bg}} \sum_{i=1}^{Width} \sum_{j=1}^{Height} \bigl(1 - M_{i,j}\bigr)\, Feature\_loss_{i,j}$$

where $Num_{bg}$ is the number of background positions of the original second model bottleneck layer, $Num_{fg}$ is the number of foreground positions of the original second model bottleneck layer, $Width$ is the width of the original second model bottleneck layer, $Height$ is the height of the original second model bottleneck layer, $M_{i,j}$ is the mask matrix determined based on the artificial labels, $P^{S}_{i,j}$ is the confidence of the original second model at position $(i, j)$, and $Feature\_loss_{i,j}$ is the comprehensive loss value of the original second model at position $(i, j)$.
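A compact sketch of this bottleneck-layer distillation loss follows; the hard-example weight `1 - student_conf` is one way to realise "higher confidence, smaller difficult-case weight" and is an assumption, as in the formula above.

```python
# Sketch of the bottleneck-layer distillation loss: hard-example-weighted foreground mean
# plus plain background mean, split by the calibration mask.
import torch

def fpn_distill_loss(feature_loss, calib_mask, student_conf):
    """feature_loss, calib_mask, student_conf: (H, W) tensors."""
    fg = calib_mask > 0
    bg = ~fg
    hard_weight = 1.0 - student_conf                                   # harder positions weigh more
    loss_fg = (hard_weight[fg] * feature_loss[fg]).sum() / max(int(fg.sum()), 1)
    loss_bg = feature_loss[bg].sum() / max(int(bg.sum()), 1)
    return loss_fg + loss_bg                                           # Loss_FPN
```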
And S260, determining a distillation loss value of the detection layer of the original second model according to the first detection result and the second detection result.
S270, training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
According to the technical scheme, the detection result is mapped to the bottleneck layer, the distillation loss value of the bottleneck layer is determined according to the difference between the characteristic values corresponding to each position in the bottleneck layer and the difference between the detection results corresponding to each position, and finally the original second model is trained based on the distillation loss value of the bottleneck layer and the distillation loss value of the detection layer, so that more semantic information except the characteristic values can be learned by the original second model, and the model training effect is improved.
Fig. 3 is a flowchart of a method for training a target detection model according to an embodiment of the present invention, which is further detailed on the basis of the foregoing embodiment, and provides specific steps for determining a distillation loss value of a detection layer of an original second model according to a first detection result and a second detection result. As shown in fig. 3, the method includes:
s310, inputting the sample image into the trained first model and the original second model, obtaining a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and obtaining a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer.
S320, determining the distillation loss value of the bottleneck layer of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result.
S330, determining the position loss value of each second prediction frame according to the position of the first prediction frame in the first prediction result and the position of the second prediction frame in the second prediction result.
The intersection-over-union (IoU) ratio is a commonly used measurement index of detection effect in target detection. In the embodiment of the invention, the position loss value of each second prediction frame is determined according to the position of the first prediction frame in the first detection result and the position of the second prediction frame in the second detection result. Specifically, among the prediction frames included in the first detection result, the first prediction frame corresponding to each second prediction frame in the second detection result is first determined. The intersection ratio of each second prediction frame and its corresponding first prediction frame is then calculated, and finally the position loss value of each second prediction frame is determined according to the intersection ratio: the larger the intersection ratio, the smaller the corresponding position loss value. The position loss value $loss\_Bbox_{i}$ of the i-th second prediction frame in the second detection result is calculated as follows:

$$loss\_Bbox_{i} = 1 - IoU\bigl(B^{S}_{i},\, B^{T}_{i}\bigr)$$

where $B^{S}_{i}$ is the i-th second prediction frame and $B^{T}_{i}$ is the first prediction frame corresponding to the i-th second prediction frame.
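For illustration, a small sketch of this position loss over matched teacher/student boxes, assuming the box matching has already been done and boxes are in [x1, y1, x2, y2] form.

```python
# Sketch of the position loss between matched prediction boxes: larger IoU, smaller loss.
import torch
from torchvision.ops import box_iou

def bbox_position_loss(student_boxes, teacher_boxes):
    """student_boxes, teacher_boxes: (N, 4) tensors of matched boxes."""
    ious = box_iou(student_boxes, teacher_boxes).diagonal()   # IoU of the i-th matched pair
    return 1.0 - ious                                         # loss_Bbox_i per box
```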
S340, determining a confidence coefficient loss value of each second prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result and the confidence coefficient of the second prediction frame in the second prediction result.
In the embodiment of the present invention, in addition to calculating the position loss value of each second prediction frame output by the original second model, the confidence loss value of each second prediction frame needs to be further calculated according to the confidence of the first prediction frame in the first prediction result and the confidence of the second prediction frame in the second prediction result.
Optionally, determining a confidence loss value of each second prediction box according to the confidence of the first prediction box in the first prediction result and the confidence of the second prediction box in the second prediction result, including:
determining the confidence coefficient loss weight of the first prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result; the higher the confidence of the first prediction box is, the smaller the confidence loss weight is;
under the condition that the second prediction frame belongs to the foreground, determining a confidence coefficient loss value of the second prediction frame according to foreground super parameters, the confidence coefficient loss weight, the confidence coefficient of the second prediction frame and the confidence coefficient corresponding to the first prediction frame;
and under the condition that the second prediction frame belongs to the background, determining a confidence coefficient loss value of the second prediction frame according to the background hyperparameter, the confidence coefficient loss weight, the confidence coefficient of the second prediction frame and the confidence coefficient corresponding to the first prediction frame.
In this optional embodiment, a specific manner for determining the confidence coefficient loss value of each second prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result and the confidence coefficient of the second prediction frame in the second prediction result is provided: firstly, determining the confidence coefficient loss weight of a first prediction frame according to the confidence coefficient of the first prediction frame in a first prediction result, wherein the higher the confidence coefficient of the first prediction frame is, the smaller the corresponding confidence coefficient loss weight is. Further, a confidence loss value for each second prediction box is calculated based on the confidence loss weight: under the condition that the second prediction frame belongs to the foreground, determining a confidence coefficient loss value of the second prediction frame according to preset foreground super parameters, the confidence coefficient loss weight of the first prediction frame corresponding to the current second prediction frame, the confidence coefficient of the second prediction frame and the confidence coefficient of the corresponding first prediction frame; and under the condition that the second prediction frame belongs to the background, determining a confidence coefficient loss value of the second prediction frame according to preset background super parameters, the confidence coefficient loss weight of the first prediction frame corresponding to the current second prediction frame, the confidence coefficient of the second prediction frame and the confidence coefficient of the corresponding first prediction frame.
Specifically, the confidence loss value $loss\_P_{i}$ of the i-th second prediction frame can be written in the following form (the discrepancy term and the confidence loss weight are shown schematically, since their exact expressions are implementation details):

$$loss\_P_{i} = \begin{cases} r \cdot w_{i} \cdot \ell\bigl(P^{S}_{i},\, P^{T}_{i}\bigr), & \text{the i-th second prediction frame belongs to the foreground} \\ (1 - r) \cdot w_{i} \cdot \ell\bigl(P^{S}_{i},\, P^{T}_{i}\bigr), & \text{the i-th second prediction frame belongs to the background} \end{cases}$$

where $P^{S}_{i}$ is the confidence of the i-th second prediction frame, $P^{T}_{i}$ is the confidence of the first prediction frame corresponding to the i-th second prediction frame, $\ell(\cdot,\cdot)$ measures the discrepancy between the two confidences, $w_{i}$ is the confidence loss weight determined by $P^{T}_{i}$ (the higher $P^{T}_{i}$ is, the smaller $w_{i}$ is), and $r$ is a hyper-parameter for adjusting the weights of the foreground and the background. The weight $w_{i}$ handles the difficult samples: it reduces the weight of simple samples and increases the weight of difficult samples, so that the original second model is optimized with respect to its own weaknesses.
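A heavily hedged sketch of this confidence loss follows, implementing the schematic form above; the squared-error discrepancy, the weight `1 - p_teacher` and the default value of `r` are all assumptions, not taken from the patent.

```python
# Sketch of the per-box confidence loss with foreground/background scaling.
import torch

def confidence_loss(p_student, p_teacher, is_foreground, r=0.75):
    """p_student, p_teacher: (N,) confidences; is_foreground: (N,) bool mask."""
    weight = 1.0 - p_teacher                                   # higher teacher confidence, smaller weight
    scale = torch.where(is_foreground,
                        torch.full_like(p_student, r),         # foreground hyper-parameter
                        torch.full_like(p_student, 1.0 - r))   # background hyper-parameter
    return scale * weight * (p_student - p_teacher).pow(2)     # loss_P_i per box
```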
And S350, determining a distillation loss value of the detection layer of the original second model based on the position loss value and the confidence coefficient loss value.
In the embodiment of the invention, after the position loss value and the confidence coefficient loss value of each second prediction frame of the detection layer of the original second model are obtained through calculation, the position loss value and the confidence coefficient loss value can be summed to obtain the comprehensive loss value of each second prediction frame, and finally, the comprehensive loss values of all the second prediction frames are summed to obtain the distillation loss value of the detection layer of the original second model.
The weight of each first prediction frame can also be determined by calculating the intersection ratio of the first prediction frame corresponding to each second prediction frame and the corresponding marking frame, wherein the weight of the first prediction frame is higher when the intersection ratio of the first prediction frame and the marking frame is larger, and meanwhile, the weight of the first prediction frame is higher when the confidence coefficient of the first prediction frame is higher. And finally, the position loss value of the corresponding second prediction frame is adjusted by the weight of the first prediction frame, so that the original second model can be prevented from learning the wrong black box knowledge of the trained first model under the condition that the confidence coefficient of the prediction frame output by the trained first model is low.
Optionally, determining a detection layer distillation loss value of the original second model based on the position loss value and the confidence loss value, including:
determining the position loss weight of the first prediction frame matched with each second prediction frame based on the position of the first prediction frame, the position of the corresponding labeling frame and the confidence coefficient of the first prediction frame in the first detection result;
and processing the position loss value through the position loss weight, and determining the distillation loss value of the detection layer of the original second model according to the confidence coefficient loss value and the processed position loss value.
In this optional embodiment, a specific manner of determining the detection layer distillation loss value of the original second model based on the position loss value and the confidence loss value is provided. First, based on the position of the first prediction frame in the first detection result, the position of the corresponding labeling frame (artificial labeling frame) and the confidence of the first prediction frame, the position loss weight of the first prediction frame matched with each second prediction frame is determined. The position loss weight $Weight_{i}$ of the first prediction frame matched with the i-th second prediction frame is calculated as follows:

$$Weight_{i} = P^{T}_{i} \cdot IoU\bigl(B^{T}_{i},\, G_{i}\bigr)$$

where $P^{T}_{i}$ is the confidence of the first prediction frame corresponding to the i-th second prediction frame, $B^{T}_{i}$ is the first prediction frame corresponding to the i-th second prediction frame, and $G_{i}$ is the labeling frame corresponding to the i-th second prediction frame.
After the position loss weight of the first prediction frame corresponding to each second prediction frame is obtained, the position loss value of the corresponding second prediction frame is processed with this weight, and the detection layer distillation loss value of the original second model is determined according to the confidence loss value and the processed position loss value. Specifically, the confidence loss value of each second prediction frame and its processed position loss value may be summed to obtain a comprehensive loss value for that second prediction frame, and the comprehensive loss values of all second prediction frames are then summed to give the detection layer distillation loss value of the original second model. The detection layer distillation loss value $Loss_{prediction\_box}$ is calculated as follows:

$$Loss_{prediction\_box} = \sum_{i=1}^{s\_sum} \bigl(Weight_{i} \cdot loss\_Bbox_{i} + loss\_P_{i}\bigr)$$

where $s\_sum$ is the number of prediction frames output by the original second model, $Weight_{i}$ is the position loss weight of the first prediction frame corresponding to the i-th second prediction frame, $loss\_Bbox_{i}$ is the position loss value of the i-th second prediction frame, and $loss\_P_{i}$ is the confidence loss value of the i-th second prediction frame.
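Tying the pieces together, the sketch below combines the per-box position losses, confidence losses and teacher-box weights into the detection-layer distillation loss; the product form of the weight mirrors the reconstructed formula above and the matched-box inputs are assumptions.

```python
# Sketch of the detection-layer distillation loss over matched boxes.
import torch
from torchvision.ops import box_iou

def head_distill_loss(loss_bbox, loss_p, teacher_conf, teacher_boxes, gt_boxes):
    """loss_bbox, loss_p, teacher_conf: (N,) tensors; teacher_boxes, gt_boxes: (N, 4) matched boxes."""
    weight = teacher_conf * box_iou(teacher_boxes, gt_boxes).diagonal()  # Weight_i
    return (weight * loss_bbox + loss_p).sum()                           # Loss_prediction_box
```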
And S360, training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
According to the technical scheme of the embodiment of the invention, when the detection layer distillation loss value is determined, on the one hand the position loss value is determined according to the position information of the prediction frames in the detection results, and on the other hand the confidence loss value is determined according to the confidences of the prediction frames; finally, the detection layer distillation loss value is determined jointly by the confidence loss value and the position loss value, which can optimize the model training effect.
Fig. 4 is a schematic structural diagram of a target detection model training apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes:
a feature value obtaining module 410, configured to input a sample image into the trained first model and the original second model, obtain a first feature value extracted by the trained first model bottleneck layer and a first detection result output by the detection layer, and obtain a second feature value extracted by the original second model bottleneck layer and a second detection result output by the detection layer;
a bottleneck layer loss value determining module 420, configured to determine a bottleneck layer distillation loss value of the original second model according to the first characteristic value, the second characteristic value, the first detection result, and the second detection result;
a detection layer loss value determining module 430, configured to determine a detection layer distillation loss value of the original second model according to the first detection result and the second detection result;
and the target detection model training module 440 is configured to train the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and use the trained original second model as a target detection model.
According to the technical scheme, a sample image is input into the trained first model and the original second model; the first characteristic value extracted by the bottleneck layer of the trained first model and the first detection result output by its detection layer are obtained, as are the second characteristic value extracted by the bottleneck layer of the original second model and the second detection result output by its detection layer. A bottleneck layer distillation loss value of the original second model is determined according to the first characteristic value, the second characteristic value, the first detection result and the second detection result, and a detection layer distillation loss value of the original second model is determined according to the first detection result and the second detection result. The original second model is then trained on the basis of the bottleneck layer distillation loss value and the detection layer distillation loss value, and the trained original second model is used as the target detection model. Because the detection result is mapped back to the bottleneck layer when the bottleneck layer distillation loss value is determined, the characteristic value difference of the bottleneck layer and the prediction result difference at each position can be considered at the same time, which optimizes the model training effect.
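A minimal sketch of the feature and detection acquisition performed by this scheme, assuming each model conveniently returns its bottleneck feature and detection result as a pair (a real detector exposes these through its own interfaces), is:

```python
import torch

def acquire_features_and_detections(first_model, second_model, sample_image):
    """Run both models on the sample image and collect bottleneck features and detections.

    Assumes each model returns (bottleneck_feature, detection_result); this interface
    is an assumption for illustration only.
    """
    first_model.eval()
    with torch.no_grad():                                   # the trained first model is frozen
        first_feat, first_det = first_model(sample_image)
    second_feat, second_det = second_model(sample_image)    # the student keeps gradients
    return first_feat, first_det, second_feat, second_det
```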
Optionally, the bottleneck layer loss value determining module 420 includes:
a reverse mapping value determining unit, configured to map a class probability of a first prediction box in the first detection result to a bottleneck layer of a trained first model to obtain a first probability mapping value, and map a class probability of a second prediction box in the second detection result to a bottleneck layer of an original second model to obtain a second probability mapping value;
the category loss value determining unit is used for determining a category loss value of the original second model at each position of the bottleneck layer according to the first probability mapping value and the second probability mapping value;
the characteristic loss value determining unit is used for determining the characteristic loss value of the original second model at each position of the bottleneck layer according to the characteristic difference between the first characteristic value and the second characteristic value;
and the bottleneck layer loss determining unit is used for determining the bottleneck layer distillation loss value of the original second model based on the class loss value and the characteristic loss value.
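A minimal sketch of the per-position losses computed by the units above, assuming the class probabilities have already been mapped back onto the bottleneck-layer grid and assuming squared differences for both losses, is:

```python
import torch

def bottleneck_position_losses(first_feat, second_feat, first_prob_map, second_prob_map):
    """Per-position class loss and feature loss on the bottleneck layer.

    first_feat / second_feat:         (C, H, W) bottleneck feature maps of the trained
                                      first model and the original second model
    first_prob_map / second_prob_map: (K, H, W) class probabilities of the prediction boxes
                                      mapped back onto the bottleneck-layer grid
                                      (the first / second probability mapping values)

    Using squared differences for both losses is an assumption; the text only states
    that the losses are derived from the probability and feature differences.
    """
    class_loss = ((first_prob_map - second_prob_map) ** 2).mean(dim=0)   # (H, W)
    feature_loss = ((first_feat - second_feat) ** 2).mean(dim=0)         # (H, W)
    return class_loss, feature_loss
```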
Optionally, the bottleneck layer loss determining unit includes:
the comprehensive loss value determining subunit is used for determining the comprehensive loss value of the original second model at each position of the bottleneck layer based on the category loss value and the characteristic loss value;
the foreground loss determining subunit is used for determining a foreground loss mean value and a background loss mean value of the bottleneck layer according to the comprehensive loss value and the calibration mask matrix; the calibration mask matrix is determined according to a foreground label and a background label calibrated in advance;
and the distillation layer loss value determining subunit is used for forming the bottleneck layer distillation loss value of the original second model by the foreground loss average value and the background loss average value.
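A minimal sketch of how the calibration mask matrix could split the comprehensive loss into the foreground and background loss means (assuming a binary mask and a plain, unweighted foreground mean; the hard-example weighting is refined in the next subunit) is:

```python
import torch

def bottleneck_distill_loss(class_loss, feature_loss, calib_mask):
    """Foreground / background loss means forming the bottleneck layer distillation loss.

    class_loss, feature_loss: (H, W) per-position losses on the bottleneck layer
    calib_mask:               (H, W) calibration mask matrix, assumed binary with 1.0 at
                              pre-calibrated foreground positions and 0.0 elsewhere

    Summing the two means into a single value is an assumption about how they are combined.
    """
    comp_loss = class_loss + feature_loss                               # comprehensive loss per position
    fg_mean = (comp_loss * calib_mask).sum() / calib_mask.sum().clamp(min=1)
    bg_mask = 1.0 - calib_mask
    bg_mean = (comp_loss * bg_mask).sum() / bg_mask.sum().clamp(min=1)
    return fg_mean + bg_mean
```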
Optionally, the foreground loss determining subunit is specifically configured to:
determining a comprehensive loss value of each foreground position in the comprehensive loss value of each position of the bottleneck layer according to the calibration mask matrix;
mapping the confidence coefficient of a second prediction frame in the second detection result to a bottleneck layer of an original second model to obtain a second confidence coefficient mapping value, and determining the hard case weight of each foreground position in the bottleneck layer according to the second confidence coefficient mapping value; the higher the second confidence coefficient mapping value is, the smaller the corresponding difficult case weight is;
and calculating a weighted average value of the comprehensive loss values corresponding to the foreground positions in the bottleneck layer according to the difficult-case weight, and taking the weighted average value as the foreground loss average value of the bottleneck layer.
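A minimal sketch of this hard-example-weighted foreground mean, assuming the weight is simply 1 minus the second confidence mapping value (which satisfies the stated monotonicity but is otherwise an assumption), is:

```python
import torch

def foreground_loss_mean(comp_loss, calib_mask, second_conf_map):
    """Hard-example-weighted foreground loss mean of the bottleneck layer.

    comp_loss:       (H, W) comprehensive loss at each bottleneck position
    calib_mask:      (H, W) binary foreground mask from the calibration mask matrix
    second_conf_map: (H, W) confidence of the second prediction boxes mapped back to
                     the bottleneck layer (the second confidence mapping value)
    """
    hard_weight = (1.0 - second_conf_map) * calib_mask        # higher mapping value -> smaller weight (assumed form)
    weighted_sum = (comp_loss * hard_weight).sum()
    return weighted_sum / hard_weight.sum().clamp(min=1e-6)   # weighted average over foreground positions
```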
Optionally, the detection layer loss value determining module 430 includes:
a position loss value determining unit, configured to determine a position loss value of each second prediction frame according to a position of the first prediction frame in the first prediction result and a position of the second prediction frame in the second prediction result;
the confidence coefficient loss value determining unit is used for determining the confidence coefficient loss value of each second prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result and the confidence coefficient of the second prediction frame in the second prediction result;
and the detection layer loss value determining unit is used for determining the detection layer distillation loss value of the original second model based on the position loss value and the confidence coefficient loss value.
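A minimal sketch of the two per-box losses, assuming smooth-L1 for the position loss and binary cross-entropy for the (unweighted) confidence loss; the weighted confidence loss with foreground and background hyperparameters is sketched further below:

```python
import torch
import torch.nn.functional as F

def per_box_losses(first_boxes, second_boxes, first_conf, second_conf):
    """Position loss and confidence loss of each second (student) prediction box.

    first_boxes, second_boxes: (N, 4) coordinates of matched first / second prediction boxes
    first_conf, second_conf:   (N,)   confidences of matched first / second prediction boxes

    Smooth-L1 and binary cross-entropy are assumptions; the text only requires the
    losses to compare the two detection results per matched box.
    """
    loss_bbox = F.smooth_l1_loss(second_boxes, first_boxes, reduction="none").sum(dim=1)
    loss_conf = F.binary_cross_entropy(second_conf, first_conf, reduction="none")
    return loss_bbox, loss_conf
```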
Optionally, the detection layer loss value determining unit is specifically configured to:
determining the position loss weight of the first prediction frame matched with each second prediction frame based on the position of the first prediction frame, the position of the corresponding labeling frame and the confidence coefficient of the first prediction frame in the first detection result;
and processing the position loss value through the position loss weight, and determining the distillation loss value of the detection layer of the original second model according to the confidence coefficient loss value and the processed position loss value.
Optionally, the confidence loss value determining unit is specifically configured to:
determining the confidence coefficient loss weight of the first prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result; the higher the confidence of the first prediction box is, the smaller the confidence loss weight is;
under the condition that the second prediction frame belongs to the foreground, determining a confidence coefficient loss value of the second prediction frame according to foreground super parameters, the confidence coefficient loss weight, the confidence coefficient of the second prediction frame and the confidence coefficient corresponding to the first prediction frame;
and under the condition that the second prediction frame belongs to the background, determining a confidence coefficient loss value of the second prediction frame according to the background hyperparameter, the confidence coefficient loss weight, the confidence coefficient of the second prediction frame and the confidence coefficient corresponding to the first prediction frame.
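A minimal sketch of this weighted confidence loss, assuming the loss weight is 1 minus the confidence of the first prediction box and the loss itself is a scaled squared difference (both forms are assumptions), is:

```python
import torch

def confidence_loss(second_conf, first_conf, is_foreground, fg_hyper=1.0, bg_hyper=0.1):
    """Confidence loss of each second (student) prediction box.

    second_conf, first_conf: (N,) confidences of matched second / first prediction boxes
    is_foreground:           (N,) boolean mask, True where the second box belongs to the foreground
    fg_hyper / bg_hyper:     foreground / background hyperparameters (default values are assumptions)
    """
    conf_weight = 1.0 - first_conf                    # higher first-box confidence -> smaller weight (assumed form)
    base = conf_weight * (second_conf - first_conf) ** 2
    scale = torch.full_like(second_conf, bg_hyper)    # background hyperparameter by default
    scale[is_foreground] = fg_hyper                   # foreground hyperparameter where applicable
    return scale * base
```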
The target detection model training device provided by the embodiment of the invention can execute the target detection model training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
FIG. 5 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 5, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as the target detection model training method.
In some embodiments, the object detection model training method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the object detection model training method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the object detection model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS (Virtual Private Server) services.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solution of the present invention can be achieved; no limitation is imposed herein.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for training a target detection model is characterized by comprising the following steps:
inputting a sample image into the trained first model and the original second model, and acquiring a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer;
determining a distillation loss value of a bottleneck layer of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result;
determining a distillation loss value of a detection layer of the original second model according to the first detection result and the second detection result;
and training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
2. The method of claim 1, wherein determining the bottleneck layer distillation loss value of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result comprises:
mapping the class probability of a first prediction box in the first detection result to a bottleneck layer of a trained first model to obtain a first probability mapping value, and mapping the class probability of a second prediction box in the second detection result to the bottleneck layer of an original second model to obtain a second probability mapping value;
determining a category loss value of the original second model at each position of the bottleneck layer according to the first probability mapping value and the second probability mapping value;
determining a characteristic loss value of the original second model at each position of the bottleneck layer according to the characteristic difference between the first characteristic value and the second characteristic value;
determining a bottleneck layer distillation loss value of the original second model based on the class loss value and the characteristic loss value.
3. The method of claim 2, wherein determining a bottleneck layer distillation loss value for the original second model based on the class loss value and the characteristic loss value comprises:
determining a comprehensive loss value of the original second model at each position of the bottleneck layer based on the category loss value and the characteristic loss value;
determining a foreground loss mean value and a background loss mean value of the bottleneck layer according to the comprehensive loss value and the calibration mask matrix; the calibration mask matrix is determined according to a foreground label and a background label calibrated in advance;
and forming a bottleneck layer distillation loss value of the original second model by the foreground loss mean value and the background loss mean value.
4. The method of claim 3, wherein determining the foreground loss mean of the bottleneck layer based on the composite loss value and the calibration mask matrix comprises:
determining a comprehensive loss value of each foreground position in the comprehensive loss value of each position of the bottleneck layer according to the calibration mask matrix;
mapping the confidence coefficient of a second prediction frame in the second detection result to a bottleneck layer of an original second model to obtain a second confidence coefficient mapping value, and determining the difficult-case weight of each foreground position in the bottleneck layer according to the second confidence coefficient mapping value; the higher the second confidence coefficient mapping value is, the smaller the corresponding difficult case weight is;
and calculating a weighted average value of the comprehensive loss values corresponding to the foreground positions in the bottleneck layer according to the difficult-case weight, and taking the weighted average value as the foreground loss average value of the bottleneck layer.
5. The method of claim 1, wherein determining a detection layer distillation loss value of the original second model based on the first and second detection results comprises:
determining a position loss value of each second prediction frame according to the position of the first prediction frame in the first prediction result and the position of the second prediction frame in the second prediction result;
determining a confidence coefficient loss value of each second prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result and the confidence coefficient of the second prediction frame in the second prediction result;
determining a detection layer distillation loss value of the original second model based on the position loss value and the confidence loss value.
6. The method of claim 5, wherein determining a detection layer distillation loss value for the original second model based on the position loss value and the confidence loss value comprises:
determining the position loss weight of the first prediction frame matched with each second prediction frame based on the position of the first prediction frame, the position of the corresponding labeling frame and the confidence coefficient of the first prediction frame in the first detection result;
and processing the position loss value through the position loss weight, and determining the distillation loss value of the detection layer of the original second model according to the confidence coefficient loss value and the processed position loss value.
7. The method of claim 5, wherein determining the confidence loss value for each second prediction box based on the confidence of the first prediction box in the first prediction result and the confidence of the second prediction box in the second prediction result comprises:
determining the confidence coefficient loss weight of the first prediction frame according to the confidence coefficient of the first prediction frame in the first prediction result; the higher the confidence of the first prediction box is, the smaller the confidence loss weight is;
under the condition that the second prediction frame belongs to the foreground, determining a confidence coefficient loss value of the second prediction frame according to foreground super parameters, the confidence coefficient loss weight, the confidence coefficient of the second prediction frame and the confidence coefficient corresponding to the first prediction frame;
and under the condition that the second prediction frame belongs to the background, determining a confidence coefficient loss value of the second prediction frame according to the background hyperparameter, the confidence coefficient loss weight, the confidence coefficient of the second prediction frame and the confidence coefficient corresponding to the first prediction frame.
8. The method of claim 1, further comprising:
acquiring an image to be detected;
inputting the image to be detected into the target detection model, and determining at least one target in the image to be detected based on the output of the target detection model; the object detection model is trained based on the object detection model training method according to any one of claims 1 to 7.
9. An object detection model training apparatus, comprising:
the characteristic value acquisition module is used for inputting the sample image into the trained first model and the original second model, acquiring a first characteristic value extracted by the bottleneck layer of the trained first model and a first detection result output by the detection layer, and acquiring a second characteristic value extracted by the bottleneck layer of the original second model and a second detection result output by the detection layer;
the bottleneck layer loss value determining module is used for determining a bottleneck layer distillation loss value of the original second model according to the first characteristic value, the second characteristic value, the first detection result and the second detection result;
the detection layer loss value determining module is used for determining a detection layer distillation loss value of the original second model according to the first detection result and the second detection result;
and the target detection model training module is used for training the original second model based on the bottleneck layer distillation loss value and the detection layer distillation loss value, and taking the trained original second model as a target detection model.
10. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the object detection model training method of any one of claims 1-8.
11. A computer-readable storage medium storing computer instructions for causing a processor to implement the method of object detection model training of any one of claims 1-8 when executed.
CN202211067342.6A 2022-09-01 2022-09-01 Target detection model training method, device, equipment and storage medium Pending CN115359322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211067342.6A CN115359322A (en) 2022-09-01 2022-09-01 Target detection model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211067342.6A CN115359322A (en) 2022-09-01 2022-09-01 Target detection model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115359322A true CN115359322A (en) 2022-11-18

Family

ID=84005298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211067342.6A Pending CN115359322A (en) 2022-09-01 2022-09-01 Target detection model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115359322A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071608A (en) * 2023-03-16 2023-05-05 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium
CN116071608B (en) * 2023-03-16 2023-06-06 浙江啄云智能科技有限公司 Target detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN115147687A (en) Student model training method, device, equipment and storage medium
CN112989995B (en) Text detection method and device and electronic equipment
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN115294332B (en) Image processing method, device, equipment and storage medium
CN116071608B (en) Target detection method, device, equipment and storage medium
CN114882321A (en) Deep learning model training method, target object detection method and device
CN113537192A (en) Image detection method, image detection device, electronic equipment and storage medium
CN113947700A (en) Model determination method and device, electronic equipment and memory
CN114821063A (en) Semantic segmentation model generation method and device and image processing method
CN115311469A (en) Image labeling method, training method, image processing method and electronic equipment
CN115359322A (en) Target detection model training method, device, equipment and storage medium
CN114037052A (en) Training method and device for detection model, electronic equipment and storage medium
CN117521768A (en) Training method, device, equipment and storage medium of image search model
CN116758280A (en) Target detection method, device, equipment and storage medium
CN112749978B (en) Detection method, apparatus, device, storage medium, and program product
CN115713683A (en) Target detection method, device, equipment and storage medium
CN115439916A (en) Face recognition method, apparatus, device and medium
CN113936158A (en) Label matching method and device
CN113033431A (en) Optical character recognition model training and recognition method, device, equipment and medium
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN113409288B (en) Image definition detection method, device, equipment and storage medium
CN114581751B (en) Training method of image recognition model, image recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination