CN114387483A - Target detection method, model training method, device, equipment and storage medium

Info

Publication number
CN114387483A
CN114387483A
Authority
CN
China
Prior art keywords
loss
training
detection model
trained
positive
Prior art date
Legal status
Pending
Application number
CN202210010287.0A
Other languages
Chinese (zh)
Inventor
李波 (Li Bo)
姚勇强 (Yao Yongqiang)
谭靖儒 (Tan Jingru)
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202210010287.0A priority Critical patent/CN114387483A/en
Publication of CN114387483A publication Critical patent/CN114387483A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The embodiment of the application discloses a target detection method, a model training method and apparatus, a device, and a storage medium, wherein the training method includes the following steps: performing target detection on an acquired training image set by using a detection model to be trained to obtain a predicted detection result; determining the degree of imbalance of positive and negative samples of different classes in the training image set; determining the loss of the predicted detection result in the different classes based on the predicted detection result and the degree of imbalance of positive and negative samples of the different classes; and updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets a convergence condition.

Description

Target detection method, model training method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to a computer vision technology, and relates to, but is not limited to, a target detection method, a training method and device of a model, equipment and a storage medium.
Background
The long-tail distribution of data is an important problem faced by target detection algorithms in the real world. General target detection algorithms assume a relatively balanced data distribution, and their accuracy is often greatly affected by the imbalance caused by long-tail distribution. Existing long-tail target detection algorithms provide solutions such as sample resampling and loss re-weighting to overcome the problem of long-tail distribution.
However, the existing target detection frameworks for long-tail-distributed data are mainly based on two-stage target detectors, and there is currently no corresponding work on single-stage detection for long-tail-distributed data.
Disclosure of Invention
In view of this, embodiments of the present application provide a target detection method, a training method and apparatus for a model, a device, and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a training method for a detection model, where the method includes: adopting a detection model to be trained to perform target detection on the obtained training image set to obtain a predicted detection result; determining the unbalance degree of positive and negative samples of different classes in the training image set; determining a loss of the predicted detection result in the different classes based on the predicted detection result and a degree of imbalance of positive and negative samples of the different classes; and updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
In this way, the problem of sample imbalance among the foreground classes can be solved by utilizing the imbalance degree of the positive and negative samples of different classes.
In some embodiments, the determining the degree of imbalance of positive and negative samples of different classes in the training image set comprises: determining cumulative gradient ratios of positive and negative samples of different classes in the training image set; determining the degree of imbalance of the positive and negative samples of the different classes based on the cumulative gradient ratios of the positive and negative samples of the different classes.
In this way, the imbalance degree of the positive and negative samples of different classes can be determined by utilizing the cumulative gradient ratios of the different classes.
In some embodiments, said determining a loss of said predicted detection result in said different class based on said predicted detection result and a degree of imbalance of positive and negative samples of said different class comprises: weighting the unbalance degrees of the positive and negative samples of different classes respectively by utilizing a first hyper-parameter to obtain a weighting result of each class in the different classes; the first hyper-parameter is used for controlling the learning strength of the detection model to be trained on the unbalance degree of the positive and negative samples of the rare classes in different classes; determining a loss of the predicted detection result in the each class based on the predicted detection result and the weighted result of the each class.
In this way, the imbalance degrees of the positive and negative samples of different classes can be weighted by using the hyper-parameter, so as to solve the problem of sample imbalance among the foreground classes.
In some embodiments, said determining a loss of said predicted detection result in said different class based on said predicted detection result and a degree of imbalance of positive and negative samples of said different class comprises: determining a loss of focus for the predicted detection result; and adjusting the focus loss by adopting the unbalance degree of the positive and negative samples of different classes to obtain the predicted loss of the detection result in the different classes.
In this way, the problem of sample imbalance among the foreground categories and the problem of foreground-background imbalance can be solved simultaneously, so that the detection model has advantages in simplicity and efficiency.
In some embodiments, the different categories include at least: a rare class and a frequent class, wherein the focus loss is adjusted by using the imbalance degree of the positive and negative samples of the different classes, and the predicted detection result loss in the different classes is obtained, and the method further comprises: determining a weighting factor based on a basic factor and the unbalance degree of the positive and negative samples of the rare class and the frequent class; wherein the base factor is used for balancing the loss contribution of the foreground region and the background region in the training image; based on the weight factor, adjusting the loss contribution of the rare category relative to the frequent category in the loss to obtain optimized loss; the updating the network parameters in the detection model to be trained by using the loss comprises: and updating the network parameters in the detection model to be trained by utilizing the optimized loss.
In this way, the problem of sample imbalance among the foreground classes and the problem of foreground-background imbalance can be solved while the loss contribution of the rare classes is increased.
In some embodiments, the updating the network parameters in the detection model to be trained by using the loss includes: determining, based on the loss, a gradient value output by the detection model to be trained in the updating process; performing gradient clipping on the gradient value when the gradient value is greater than or equal to a preset threshold value, to obtain a clipped gradient value; and updating the network parameters in the detection model based on the clipped gradient value.
In this way, the training of the detection model is more stable, and fluctuations in the training result are alleviated.
In some embodiments, the method further comprises: determining the iteration times of the iterative training of the detection model to be trained in a pre-training stage based on the distribution condition of the images in the training image set; pre-training the detection model to be trained based on the iteration times and the training image set to obtain a candidate detection model; the updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition includes: and updating the network parameters in the candidate detection model by using the loss until the updated detection model meets the convergence condition.
In this way, a proper number of pre-training iterations can be selected according to the distribution of the training data, so as to reduce the probability of NaN (Not a Number) values occurring in the training process.
In some embodiments, the determining, based on the distribution of the images in the training image set, the number of iterations of iterative training of the detection model to be trained in a pre-training stage includes: setting the iteration times of the detection model to be trained in a pre-training stage as a first preset value under the condition that the training image set is a long-tail distribution image set; setting the iteration times as a second preset value under the condition that the training image set is a non-long-tail distribution image set; wherein the first preset value is greater than the second preset value.
In this way, more iterations can be set for long-tail-distributed data, so as to reduce the probability of abnormal problems occurring with long-tail data during training.
In some embodiments, the performing, by using the detection model to be trained, target detection on the acquired training image set to obtain a predicted detection result includes: generating at least two prediction boxes on each region of a training image by using the detection model to be trained, where the at least two prediction boxes have different scales; determining the intersection-over-union (IoU) between the generated prediction boxes and the label boxes of the training image; and measuring, based on the IoU, the probability that the target in each prediction box is foreground, to obtain the predicted detection result of the training image.
In this way, the sampling strategy of the detection model can be adaptively selected and the implementation of the centerness branch in the detection model changed, thereby improving detection performance.
In some embodiments, the method further comprises: preprocessing each training image in the training image set to obtain a preprocessed training image; performing feature extraction on the preprocessed training image to obtain a feature map; the generating at least two prediction frames on each region of the training image by using the detection model to be trained comprises: generating at least two prediction frames on each pixel of the feature map by adopting a detection model to be trained; wherein the pixels of the feature map have a correspondence with regions in the training image.
In this way, a complete model training process can be realized.
In a second aspect, an embodiment of the present application provides a target detection method, where the method includes: acquiring an image to be detected; preprocessing the image to be detected to obtain a preprocessed image to be detected; performing feature extraction on the preprocessed image to be detected to obtain a feature map of the image to be detected; inputting the characteristic diagram of the image to be detected into a detection model to obtain a prediction result; the detection model is obtained by training based on the training method; and carrying out post-processing on the prediction result to obtain a detection result of the image to be detected.
In this way, the detection model obtained by the training method can be used to process the image to be detected, thereby improving the detection rate.
In some embodiments, the prediction result comprises a plurality of prediction boxes, and a target class corresponding to each prediction box; the post-processing the prediction result to obtain the detection result of the image to be detected comprises the following steps: and filtering the overlapped prediction frames in the plurality of prediction frames to obtain the filtered prediction frames and the target category corresponding to each filtered prediction frame.
In this way, the multiple prediction boxes obtained by the detection model can be de-duplicated to obtain a more accurate detection result.
In a third aspect, an embodiment of the present application provides a training apparatus for a detection model, where the apparatus includes: the first prediction unit is used for carrying out target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result; the degree determining unit is used for determining the unbalance degree of the positive and negative samples of different classes in the training image set; a loss determination unit configured to determine a loss of the predicted detection result in the different classes based on the predicted detection result and a degree of imbalance of positive and negative samples of the different classes; and the parameter updating unit is used for updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
In a fourth aspect, an embodiment of the present application provides an object detection apparatus, including: the acquisition unit is used for acquiring an image to be detected; the preprocessing unit is used for preprocessing the image to be detected to obtain a preprocessed image to be detected; the characteristic extraction unit is used for extracting the characteristics of the preprocessed image to be detected to obtain a characteristic diagram of the image to be detected; the second prediction unit is used for inputting the characteristic diagram of the image to be detected into the detection model to obtain a prediction result; the detection model is obtained by training based on the training method; and the post-processing unit is used for performing post-processing on the prediction result to obtain the detection result of the image to be detected.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor executes the computer program to implement the steps in the training method or the steps in the detection method.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the training method or implements the steps in the detection method.
The embodiment of the application provides a target detection method, a model training method, a device, equipment and a storage medium, wherein the target detection is carried out on an obtained training image set by adopting a detection model to be trained to obtain a predicted detection result; determining the unbalance degree of positive and negative samples of different classes in the training image set; determining a loss of the predicted detection result in the different classes based on the predicted detection result and a degree of imbalance of positive and negative samples of the different classes; and updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition, so that the problem of unbalanced samples among the foreground classes can be solved by using the unbalanced degree of the positive and negative samples of different classes.
Drawings
FIG. 1 is a first schematic flow chart illustrating an implementation of a training method for a detection model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a second implementation flow of a training method for a detection model according to an embodiment of the present application;
FIG. 3 is a third schematic flow chart illustrating an implementation process of the training method for detecting a model according to the embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a flow chart of an implementation of a target detection method according to an embodiment of the present application;
fig. 5A is a block diagram of a single-stage long-tailed target detection scheme according to an embodiment of the present application;
FIG. 5B is a graph comparing sample distributions of a single-stage detector and a two-stage detector according to an embodiment of the present invention;
FIG. 5C is a graph illustrating a ratio of the number of samples with different data distributions according to an embodiment of the present application;
FIG. 5D is a first diagram illustrating an equalizing focus loss function according to an embodiment of the present disclosure;
FIG. 5E is a diagram illustrating a second example of an equalizing focus loss function according to the present application;
FIG. 6 is a schematic diagram illustrating a structure of a training apparatus for a detection model according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating a structure of a target detection apparatus according to an embodiment of the present disclosure;
fig. 8 is a hardware entity diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution of the present application is further elaborated below with reference to the drawings and the embodiments. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for the convenience of description of the present application and have no specific meaning by themselves. Thus, "module", "component", and "unit" may be used interchangeably.
It should be noted that the terms "first/second/third" in the embodiments of the present application are only used to distinguish similar objects and do not imply a specific ordering of the objects. It should be understood that "first/second/third" may be interchanged in a specific order or sequence where permissible, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
The single-stage target detector has a concise form, high inference speed, and is easy to deploy. However, the performance of current single-stage target detectors often falls behind that of two-stage detectors, and a single-stage target detection model mainly faces two difficulties when the data follows a long-tail distribution. First, the long-tail distribution of the data introduces a sample imbalance problem between foreground classes, which is an inherent problem of the long-tail distribution scenario. Second, the single-stage target detection algorithm is based on a dense set of sample candidates, which introduces a foreground-background imbalance problem into the training process of the model.
These two imbalance problems jointly hinder the performance of a single-stage target detection model in a long-tail scene. Existing solutions generally focus only on the problem of sample imbalance between foreground classes, such as EQLv2, or only on the foreground-background imbalance problem, such as Focal Loss. The prior art lacks the ability to solve the two kinds of imbalance problems at the same time, and therefore cannot handle the single-stage long-tail target detection task well.
Based on this, the embodiment of the present application provides a training method for a detection model, where the method is applied to an electronic device, and functions implemented by the method may be implemented by a processor in the electronic device calling a program code, where of course, the program code may be stored in a storage medium of the electronic device. Fig. 1 is a first schematic flow chart of an implementation process of a training method for a detection model according to an embodiment of the present application, as shown in fig. 1, the method includes:
s101, performing target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result;
here, the electronic device may be various types of devices having information processing capability, such as a navigator, a smart phone, a tablet computer, a wearable device, a laptop portable computer, a kiosk and a desktop computer, a server cluster, and the like.
In this embodiment, the training image set may be a data set commonly used in the computer vision field, such as a COCO data set, an LVIS data set, and the like. Of course, the training image set may also be a set of images of a real scene, such as images captured in the field of automatic driving. And, the data distribution of the images in the training image set may be a long-tailed distribution.
Here, the detection model to be trained may be a single-stage detection model or a two-stage detection model, which is not limited in the embodiment of the present application. In the case that the detection model to be trained is a single-stage detection model, the single-stage detection model may include the RetinaNet single-stage detector (proposed in Focal Loss for Dense Object Detection), the ATSS (Adaptive Training Sample Selection) single-stage detector, the YOLO (You Only Look Once) single-stage detector, and the like. The predicted detection result may include the detected objects and the object class to which each object belongs.
S102, determining the unbalance degree of positive and negative samples of different types in the training image set;
For example, suppose there are three categories of target objects to be detected: a first category of target objects (e.g., cats), a second category of target objects (e.g., pandas), and a third category of target objects (e.g., snow leopards). The training image set contains 225 images of the first category, 115 images of the second category, and 5 images of the third category. The different classes in the training image set refer to these three classes; the first class is a frequent class, the second a common class, and the third a rare class, so the training image set can be regarded as a data set with a long-tail distribution.
Furthermore, the unbalanced degree of the positive and negative samples of different categories in the training image set means that when the training image set is a data set with a long tail distribution, not only the positive and negative samples of each category are unbalanced, but also the rare categories face a problem of more serious positive and negative sample unbalance than the frequent categories. That is, in the training image set, the imbalance degree of the positive and negative samples of the third category is greater than the imbalance degree of the positive and negative samples of the second category, and the imbalance degree of the positive and negative samples of the second category is greater than the imbalance degree of the positive and negative samples of the first category.
Here, the positive-negative sample imbalance means that the number of negative samples (background samples) is much greater than the number of positive samples (foreground samples).
Step S103, determining the loss of the predicted detection result in different classes based on the predicted detection result and the unbalance degree of the positive and negative samples of the different classes;
here, the loss of the detection result of the training sample set in the different classes may be determined based on the predicted detection result and the degree of imbalance between the positive and negative samples of the different classes. Such as loss of detection results of the training sample set in the first class, loss of detection results of the training sample set in the second class, and loss of detection results of the training sample set in the third class. That is, each of the different categories corresponds to a loss value.
In the embodiment of the application, the loss is determined according to the unbalanced degree of the positive and negative samples of different classes, so that the learning strength of the rare classes relative to the frequent classes is improved by the detection model, the problem of unbalanced samples among the foreground classes is solved, and the detection effect of the detection model on the rare classes is improved.
And step S104, updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
Here, the updated detection model satisfies the convergence condition, and can be implemented in the following three ways: the first is that the loss of the output of the detection model is less than a certain preset value; the second is that the change of the weight value between two iterations is less than a certain preset value; and thirdly, the iteration times reach the preset times.
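As an illustration, the three convergence criteria could be checked as in the following sketch; all names and threshold values are illustrative assumptions, not taken from the patent (loss is assumed to be a Python float):

```python
import torch

def has_converged(loss, prev_params, cur_params, iteration,
                  loss_eps=1e-4, weight_eps=1e-6, max_iters=90000):
    # 1) the loss output by the detection model is below a preset value
    if loss < loss_eps:
        return True
    # 2) the change of the weights between two iterations is below a preset value
    delta = sum((c - p).abs().sum().item()
                for c, p in zip(cur_params, prev_params))
    if delta < weight_eps:
        return True
    # 3) the number of iterations reaches the preset number
    return iteration >= max_iters
```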
In some embodiments, the step S102 of determining the imbalance degree of the positive and negative samples of different classes in the training image set includes:
s1021, determining the cumulative gradient ratio of positive and negative samples of different classes in the training image set;
step S1022, determining the imbalance degree of the positive and negative samples of the different classes based on the cumulative gradient ratio of the positive and negative samples of the different classes.
Here, the cumulative gradient ratio of the positive and negative samples of each category in the training process may be determined, and the degree of imbalance of the positive and negative samples of the corresponding category may be measured by using the cumulative gradient ratio of the positive and negative samples of each category.
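A minimal sketch of how such a per-class cumulative gradient ratio could be tracked is shown below; the tensor shapes and the clamping are illustrative assumptions rather than the patent's exact implementation:

```python
import torch

class GradientRatioTracker:
    """Accumulates the gradients of positive and negative samples per class.
    A ratio g_j close to 1 indicates balanced training of class j; a ratio
    close to 0 indicates severe positive-negative imbalance."""

    def __init__(self, num_classes):
        self.pos_grad = torch.zeros(num_classes)
        self.neg_grad = torch.zeros(num_classes)

    def update(self, grad_abs, pos_mask):
        # grad_abs: |gradient| of the classification output, shape (N, C)
        # pos_mask: 1.0 where a sample is positive for a class, shape (N, C)
        self.pos_grad += (grad_abs * pos_mask).sum(dim=0)
        self.neg_grad += (grad_abs * (1.0 - pos_mask)).sum(dim=0)

    def ratio(self):
        g = self.pos_grad / self.neg_grad.clamp(min=1e-12)
        return g.clamp(0.0, 1.0)  # the imbalance degree grows as g_j -> 0
```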
In some embodiments, the method further comprises:
step S11, determining the iteration times of the iterative training of the detection model to be trained in the pre-training stage based on the distribution condition of the images in the training image set;
step S12, pre-training the detection model to be trained based on the iteration times and the training image set to obtain a candidate detection model;
correspondingly, the step S104 of updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition includes: and updating the network parameters in the candidate detection model by using the loss until the updated detection model meets the convergence condition.
Here, the pre-training phase, i.e., the warm-up phase, trains at a very low learning rate when training starts, so that the network becomes familiar with the data; the learning rate then gradually increases until training proceeds at the set initial learning rate, after which the learning rate gradually decreases. That is to say, when warm-up is used, the number of warm-up iterations needs to be set. When the current iteration number is smaller than the set iteration number, the learning rate equals the current iteration number divided by the set iteration number, multiplied by the base learning rate. Because the current iteration number divided by the set iteration number is a value smaller than 1, the learning rate increases throughout the warm-up process, and when warm-up ends, training proceeds at the base learning rate.
Therefore, in the embodiment of the present application, the number of iterations for the detection model in the pre-training stage may be determined according to the distribution of the images in the training image set (for example, according to whether the data in the training image set follows a long-tail distribution), and pre-training is then performed, so as to reduce the probability of a series of abnormal problems occurring in the training process (for example, the probability of NaN values). If the training data follows a long-tail distribution, the preset number of iterations can be set to a larger value, so that the warm-up lasts longer.
For example, the original network reaches the learning rate of 0.01 after only 1000 iterations, whereas in the present application the learning rate of 0.01 is reached only after 6000 iterations. The warm-up time is thus increased, which in turn reduces the probability of NaN values occurring during training.
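A sketch of the linear warm-up schedule described above; the 6000/1000 iteration counts and the 0.01 base rate come from the examples in this document, while the helper function itself is an illustrative assumption:

```python
def warmup_lr(cur_iter, base_lr=0.01, warmup_iters=6000):
    # Before warm-up ends, lr = (cur_iter / warmup_iters) * base_lr, an
    # increasing value smaller than base_lr; afterwards the base rate is used.
    # warmup_iters would be the larger first preset value (e.g. 6000) for a
    # long-tail-distributed training set and the smaller second preset value
    # (e.g. 1000) otherwise.
    if cur_iter < warmup_iters:
        return base_lr * cur_iter / warmup_iters
    return base_lr
```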
In some embodiments, the step S11, determining, based on the distribution of the images in the training image set, the number of iterations of performing iterative training on the detection model to be trained in a pre-training stage, includes:
step S11a, under the condition that the training image set is a long-tail distribution image set, setting the iteration times of the detection model to be trained in a pre-training stage as a first preset value;
step S11b, setting the iteration times as a second preset value under the condition that the training image set is a non-long-tail distribution image set; wherein the first preset value is greater than the second preset value.
Based on the foregoing embodiment, an embodiment of the present application further provides a training method for a detection model, where the method is applied to an electronic device, and the method includes:
s111, performing target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result;
step S112, determining the cumulative gradient ratio of positive and negative samples of different classes in the training image set;
step S113, determining the unbalance degree of the positive and negative samples of different classes based on the cumulative gradient ratio of the positive and negative samples of different classes;
step S114, weighting the unbalance degrees of the positive and negative samples of different categories by utilizing a first super-parameter to obtain a weighting result of each category in the different categories; the first hyper-parameter is used for controlling the learning strength of the detection model to be trained on the unbalance degree of the positive and negative samples of the rare classes in different classes;
here, the cumulative gradient ratio is related to the degree of balance of the training process. Class of equilibrium trained, cumulative gradient ratio close to 1; in the class of trained imbalances, the cumulative gradient ratio is close to 0. The class of the training imbalance then needs to be weighted to control the maximum strength of learning the positive and negative sample imbalance problem for the class of the training imbalance (e.g., rare class).
Step S115, determining the loss of the predicted detection result in each category based on the predicted detection result and the weighted result of each category;
and S116, updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
Based on the foregoing embodiment, an embodiment of the present application further provides a training method for a detection model, where the method is applied to an electronic device, and the method includes:
step S121, carrying out target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result;
step S122, determining the unbalance degree of the positive and negative samples of different types in the training image set;
step S123, determining the predicted focus loss of the detection result;
Here, the focal loss of the predicted detection result may be determined using the focal loss function in the following formula (1):

FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t) ..................................(1);

wherein FL(p_t) is the focal loss, α_t is used to balance the relationship between positive and negative samples in each category, p_t is the predicted probability that the target in the prediction box is foreground, and γ is the focusing parameter used to make the model focus more on samples that are difficult to classify by reducing the weight of samples that are easy to classify.
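For reference, formula (1) can be written as the following sketch; α_t = 0.25 and γ = 2.0 are the common defaults from the focal loss literature and are not fixed by this document:

```python
import torch

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the
    # predicted probability that the target in the prediction box is foreground.
    return -alpha_t * (1.0 - p_t).pow(gamma) * torch.log(p_t.clamp(min=1e-12))
```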
Step S124, adjusting the focus loss by adopting the unbalance degrees of the positive and negative samples of different categories to obtain the predicted loss of the detection result in the different categories;
here, because the loss in the different categories is obtained by adjusting the focus loss by using the unbalanced degree of the positive and negative samples in the different categories, the detection model at the training position can not only balance the loss contribution of the foreground sample and the background sample, but also improve the learning strength of the rare category relative to the frequent category, thereby solving the problem of unbalanced foreground and background, and solving the problem of unbalanced samples between the foreground categories.
And step S125, updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
In some embodiments, a loss function including a focusing factor for adjusting learning strengths of different classes of samples differently according to a degree of imbalance of positive and negative samples of different classes in the training image set is introduced based on the above-mentioned focus loss. Wherein for rare classes, the focusing factor emphasizes increasing learning strength for positive samples while reducing learning attention for negative samples. While for the frequent category the focus factor maintains a similar behavior as the conventional focus loss. Therefore, the two unbalanced problems can be solved into the problem that the unbalanced degrees of the positive and negative samples of different classes are inconsistent, and the two unbalanced problems of the samples between the foreground classes and the foreground background are solved simultaneously through the provided loss function comprising the focusing factor.
The new loss function can be obtained by the following formula (2):

EFL(p_t) = -α_t · (1 - p_t)^(γ_b + s(1 - g_j)) · log(p_t) ..................................(2);

wherein γ_b is the base factor used to balance the loss contributions of the foreground region and the background region in the training image; s is the first hyper-parameter used to control the strength with which the detection model to be trained learns the positive-negative sample imbalance of the rare categories among the different categories; j is the category index of the different categories, and g_j is the cumulative gradient ratio of the positive and negative samples of the j-th category during training; γ_b + s(1 - g_j) is the focusing factor; and EFL(p_t) is the new loss function proposed in the embodiments of the present application.
Based on the foregoing embodiment, an embodiment of the present application further provides a training method for a detection model, where the method is applied to an electronic device, fig. 2 is a schematic diagram of an implementation flow of the training method for a detection model according to the embodiment of the present application, and as shown in fig. 2, the method includes:
step S201, carrying out target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result;
step S202, determining the unbalance degree of positive and negative samples of different types in the training image set;
step S203, determining the predicted focus loss of the detection result;
step S204, adjusting the focus loss by adopting the unbalance degrees of the positive and negative samples of different categories to obtain the predicted loss of the detection result in the different categories;
step S205, determining a weight factor based on the basic factor and the unbalance degree of the positive and negative samples of the rare class and the frequent class in the different classes; wherein the base factor is used for balancing the loss contribution of the foreground region and the background region in the training image;
here, the weighting factor is mainly used to boost the loss contribution of the rare class relative to the frequent class.
S206, based on the weight factor, adjusting the loss contribution of the rare category relative to the frequent category in the loss to obtain optimized loss;
here, the different categories at least include the rare category and the frequent category, and the loss contribution of the rare category with respect to the frequent category in the loss obtained in step S204 above may be adjusted based on the weighting factor.
And step S207, updating the network parameters in the detection model to be trained by using the optimized loss until the updated detection model meets a convergence condition.
In some embodiments, the new loss function in the above formula (2) is optimized to obtain an optimized new loss function. The optimized new loss function can be obtained by the following formula (3):

EFL(p_t) = -Σ_{j=1}^{C} ((γ_b + s(1 - g_j)) / γ_b) · α_t · (1 - p_t)^(γ_b + s(1 - g_j)) · log(p_t) ..................................(3);

wherein C is the total number of categories, and (γ_b + s(1 - g_j)) / γ_b is the weighting factor.
The reason for optimizing the new loss function in formula (2) above is as follows. Although the focusing factor can dynamically adjust the learning strength of the samples of each class according to the degree of imbalance of the positive and negative samples of that class, the focusing factor alone cannot bring the detection model to optimal performance. In a scenario where multiple classes are classified together, a larger focusing factor results in a smaller loss contribution during training; the rare classes need a large focusing factor to learn their extreme positive-negative sample imbalance, but thereby lose gradient contribution. Moreover, when hard samples of rare classes are learned together with hard samples of frequent classes, even if each class has a different focusing factor, the loss contributions of the two end up substantially the same. Therefore, the embodiment of the present application further provides the weighting factor on the basis of the loss function including the focusing factor, so that hard samples of rare classes make a larger loss contribution than hard samples of frequent classes.
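Putting the focusing factor and the weighting factor together, formula (3) could be sketched as follows; the default values of γ_b, s, and α_t are illustrative assumptions:

```python
import torch

def equalized_focal_loss(p_t, g, gamma_b=2.0, s=8.0, alpha_t=0.25):
    # p_t: (N, C) predicted probability of the true class per sample/category.
    # g:   (C,)  cumulative gradient ratio of positive/negative samples per
    #            category (close to 1 for frequent, close to 0 for rare classes).
    gamma_j = gamma_b + s * (1.0 - g)   # per-category focusing factor
    weight_j = gamma_j / gamma_b        # weighting factor (gamma_b + s(1-g_j)) / gamma_b
    loss = -weight_j * alpha_t * (1.0 - p_t).pow(gamma_j) \
           * torch.log(p_t.clamp(min=1e-12))
    return loss.sum(dim=1)              # sum over the C categories as in formula (3)
```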
Based on the foregoing embodiment, an embodiment of the present application further provides a training method for a detection model, where the method is applied to an electronic device, and the method includes:
s211, carrying out target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result;
step S212, determining the unbalance degree of the positive and negative samples of different categories in the training image set;
step S213, determining the loss of the predicted detection result in the different classes based on the predicted detection result and the unbalance degree of the positive and negative samples of the different classes;
step S214, determining a gradient value output by the detection model to be trained in the updating process based on the loss;
step S215, performing gradient clipping on the gradient value when the gradient value is greater than or equal to a preset threshold value, to obtain a clipped gradient value;
here, because an excessively large gradient propagated during training makes the training result unstable, the gradient needs to be processed: it is clipped directly when it becomes too large; for example, a gradient value of 100 is clipped to 35. Especially under a long-tail distribution the gradient is relatively unstable, so the embodiment of the present application adds a gradient clipping strategy in the training stage. This makes the training of the single-stage detector more stable and alleviates fluctuations in the training result.
and step S216, updating the network parameters in the detection model based on the clipped gradient value until the updated detection model meets the convergence condition.
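A sketch of one training step with gradient clipping; max_norm = 35 is the threshold quoted in the embodiments, while the helper itself is an illustrative assumption:

```python
import torch

def clipped_update(loss, model, optimizer, max_norm=35.0):
    # If the overall gradient norm exceeds the preset threshold, it is cut
    # down to that threshold (e.g. a norm of 100 is clipped to 35) before
    # the network parameters are updated.
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
```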
Based on the foregoing embodiment, an embodiment of the present application further provides a training method for a detection model, where the method is applied to an electronic device, fig. 3 is a schematic diagram of an implementation flow of the training method for a detection model in the embodiment of the present application, and as shown in fig. 3, the method includes:
s301, generating at least two prediction frames on each region of the obtained training image by adopting a detection model to be trained; wherein the at least two prediction boxes have different scales;
For example, the detection model to be trained includes an ATSS detector. The existing ATSS detector is built by improving on the RetinaNet network and the FCOS (Fully Convolutional One-Stage Object Detection) algorithm; it lays only 1 anchor box on each pixel of the feature map and selects 9 anchor boxes per layer as positive samples. In the embodiment of the present application, the existing ATSS detector is improved by laying at least 2 anchor boxes on each pixel of the feature map, where the at least 2 anchor boxes have different scales. For example, laying two anchor boxes at each point, one with scale 6 and the other with scale 8, and taking the first 18 samples of each layer as positive samples, can improve the performance of the ATSS detector.
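A hedged sketch of laying two anchor boxes of different scales on every feature-map pixel; the box layout details are illustrative assumptions:

```python
import torch

def make_anchors(feat_h, feat_w, stride, scales=(6, 8)):
    # One (cx, cy) center per feature-map pixel, mapped back to image space.
    ys = (torch.arange(feat_h, dtype=torch.float32) + 0.5) * stride
    xs = (torch.arange(feat_w, dtype=torch.float32) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    anchors = []
    for scale in scales:                      # two scales instead of one
        half = scale * stride / 2.0
        anchors.append(torch.stack(
            [cx - half, cy - half, cx + half, cy + half], dim=-1).reshape(-1, 4))
    # With two scales per location, the ATSS top-k per level doubles from 9 to 18.
    return torch.cat(anchors, dim=0)
```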
Step S302, determining the intersection-over-union (IoU) between the generated prediction boxes and the label boxes of the training image;
Here, the label box of the training image may be the GT box, i.e., the ground-truth box, of the training image.
Step S303, measuring, based on the IoU, the probability that the target in each prediction box is foreground, to obtain the predicted detection result of the training image;
For example, the detection model to be trained includes an ATSS detector. The existing ATSS detector includes three branches, namely a classification branch, a localization branch, and a centerness branch (center-ness) that estimates the probability of whether the target in each prediction box is foreground or background, so as to distinguish the foreground from the background; the existing implementation scores the location. In the embodiment of the present application, the existing ATSS detector is improved by changing the implementation of the centerness branch to one based on the Intersection over Union (IoU), as sketched below.
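A hedged sketch of the IoU-branch target: the quality score of each prediction box becomes its IoU with the assigned ground-truth box. torchvision's box_iou is used here for illustration, and boxes are assumed to be in (x1, y1, x2, y2) format:

```python
import torch
from torchvision.ops import box_iou

def iou_targets(pred_boxes, gt_boxes, assigned_gt_idx):
    # pred_boxes: (P, 4); gt_boxes: (G, 4); assigned_gt_idx: (P,) index of
    # the ground-truth box matched to each prediction box.
    ious = box_iou(pred_boxes, gt_boxes)                  # (P, G)
    return ious[torch.arange(pred_boxes.size(0)), assigned_gt_idx]
```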
Step S304, determining the unbalance degree of the positive and negative samples of different classes in the training image set;
step S305, determining the loss of the predicted detection result in the different classes based on the predicted detection result and the unbalance degree of the positive and negative samples of the different classes;
and S306, updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
Based on the foregoing embodiment, an embodiment of the present application further provides a training method for a detection model, where the method is applied to an electronic device, and the method includes:
step S311, preprocessing each training image in the acquired training image set to obtain a preprocessed training image;
here, the preprocessing of the training image includes, but is not limited to: scaling of image pixels, flipping of images, conversion of image data forms.
Step S312, extracting the features of the preprocessed training image to obtain a feature map;
here, it is necessary to extract features of the training image, and input the extracted feature map into the detection model for processing. It should be noted that, in the embodiment of the present application, the method used for performing feature extraction is not limited. For example, the Feature extraction may be performed by a method of combining backbone with FPN (Feature Pyramid Networks).
Step S313, generating at least two prediction frames on each pixel of the feature map by adopting a detection model to be trained; wherein the at least two prediction boxes have different scales; the pixels of the feature map have a corresponding relationship with the regions in the training image;
here, at least two prediction blocks may be generated on each pixel of the feature map, and since the size of the feature map is much smaller than that of the original training image, one pixel of the feature map corresponds to one region in the original training image.
Step S314, determining the intersection-over-union (IoU) between the generated prediction boxes and the label boxes of the training image;
Step S315, measuring, based on the IoU, the probability that the target in each prediction box is foreground, to obtain the predicted detection result of the training image;
step S316, determining the unbalance degree of the positive and negative samples of different types in the training image set;
step S317, determining the loss of the predicted detection result in the different classes based on the predicted detection result and the unbalance degree of the positive and negative samples of the different classes;
and step S318, updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
Based on the foregoing embodiments, the present application provides an object detection method, where the method is applied to an electronic device, and functions implemented by the method may be implemented by a processor in the electronic device calling a program code, where of course, the program code may be stored in a storage medium of the electronic device. Fig. 4 is a schematic flow chart of an implementation of a target detection method according to an embodiment of the present application, and as shown in fig. 4, the method includes:
s401, acquiring an image to be detected;
here, the image to be detected may be a test image in a data set commonly used in the computer vision field, such as a COCO data set, an LVIS data set, and the like. Of course, the image to be detected may also be an image in a real scene, for example, an image captured in the field of automatic driving.
S402, preprocessing the image to be detected to obtain a preprocessed image to be detected;
here, the preprocessing of the image to be detected may include adjustment of an image angle, filtering of the image, and the like, in addition to scaling of image pixels, flipping of the image, and conversion of an image data format.
Step S403, extracting the characteristics of the preprocessed image to be detected to obtain a characteristic diagram of the image to be detected;
s404, inputting the characteristic diagram of the image to be detected into a detection model to obtain a prediction result; the detection model is obtained by training based on the training method;
in the embodiment of the present application, the detection model trained by the training method may be used to process the image to be detected, so as to obtain a plurality of prediction frames and the category of each prediction frame. For example, after the trained detection model is used to process the image to be detected, 8 prediction frames are obtained, wherein 2 prediction frames are backgrounds, 6 prediction frames are foregrounds, and in the 6 foreground prediction frames, 4 foreground prediction frames comprise the same kitten, and 2 foreground prediction frames comprise the same puppy.
And S405, performing post-processing on the prediction result to obtain a detection result of the image to be detected.
For example, after the trained detection model is used to process the image to be detected, 8 prediction frames are obtained, wherein 2 prediction frames are backgrounds, 6 prediction frames are foregrounds, and in the 6 foreground prediction frames, 4 foreground prediction frames comprise the same kitten, and 2 foreground prediction frames comprise the same puppy. Processing the detection result, filtering the background frame and the overlapped foreground frame, only reserving one foreground prediction frame containing the kitten and one foreground prediction frame containing the puppy, and taking the 2 foreground prediction frames as the final detection result of the image to be detected.
In some embodiments, the prediction result comprises a plurality of prediction boxes, and a target class corresponding to each prediction box; the step S405, performing post-processing on the prediction result to obtain a detection result of the image to be detected, includes: and filtering the overlapped prediction frames in the plurality of prediction frames to obtain the filtered prediction frames and the target category corresponding to each filtered prediction frame.
For example, a Non-Maximum Suppression (NMS) method may be used to filter the prediction frames with overlap in the plurality of prediction frames, so as to obtain a detection result of the image to be detected.
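A sketch of this filtering step with torchvision's NMS; the 0.5 IoU threshold is an illustrative assumption:

```python
import torch
from torchvision.ops import nms

def filter_overlaps(boxes, scores, labels, iou_thr=0.5):
    # Keep the highest-scoring box among mutually overlapping prediction boxes.
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep], labels[keep]
```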
Based on the foregoing embodiment, an embodiment of the present application further provides a target detection method, where the target detection method is applied to a single-stage long-tail distribution scene.
The target detection method mainly comprises the following steps:
(1) For the problem of a high-performance baseline: based on several existing single-stage detectors, the embodiment of the present application provides an enhanced baseline that can achieve performance comparable to that of a two-stage detector and facilitates subsequent research.
(2) Aiming at the two imbalance problems of the imbalance of the samples between the foreground categories and the imbalance of the foreground and background samples, the embodiment of the application provides a new Loss function, namely EFL (Equalized Focal Loss), wherein the EFL comprises two key adjusting factors, namely a focusing factor and a weighting factor, and the two imbalance problems can be solved at the same time.
(3) By combining the enhanced baseline and the EFL, the embodiment of the application provides a complete single-stage long-tail target detection scheme which has superior performance in a long-tail scene.
Fig. 5A is a block diagram of a single-stage long-tailed target detection scheme according to an embodiment of the present application, and as shown in fig. 5A, the single-stage long-tailed target detection scheme mainly includes four parts, which are explained in detail below:
1) first portion 51: this section mainly includes three modules, namely an image reading module 511, a preprocessing module 512, and a feature extraction module 513. The image reading module 511 is mainly used for reading an original training image or an image to be detected; the preprocessing module 512 is mainly used for preprocessing an original training image or an image to be detected, where the preprocessing includes scaling of image pixels, flipping of the image, conversion of the data format of the image, and the like; the feature extraction module 513 is mainly used to extract features of the preprocessed image, and for example, the feature extraction module may include a backbone and a FPN.
Here, the image reading module 511, the preprocessing module 512, and the feature extraction module 513 may be multi-process parallel. The image read by the image reading module 511 is image data distributed along the long tail.
2) Second portion 52: the input to the second part is the feature map output by the first part; this part mainly comprises the ATSS single-stage detector 521, as well as the stability setting 522 and the enhancement setting 523 for the ATSS single-stage detector. The ATSS single-stage detector 521, the stability setting 522, and the enhancement setting 523 together constitute the enhanced baseline.
In a scenario where data is distributed with long tails, the performance of most single-stage detectors is inferior to that of two-stage detectors, so the embodiment of the present application improves this problem, and the corresponding improvement method includes:
the embodiment of the present application adds a stability setting 522 for stable training on the basis of the existing ATSS single-stage detector 521, where the stability setting 522 includes adding a gradient cutting strategy with a maximum regularization value being a first value, for example, the first value is 35, and when a gradient of 100 is trained, the gradient is cut into 35. The stability setting 522 also includes extending a pre-training phase (the warm up phase) from a first number of iterations to a second number of iterations, where the first number is less than the second number, e.g., extending the number of iterations of the pre-training phase from 1000 to 6000 iterations. The stability setting 522 makes the training process of the ATSS single-stage detector 521 more stable, and also alleviates the fluctuation of the training result and reduces a series of problems such as NaN values in the training process.
In the embodiment of the present application, an enhancement setting 523 is further added on the basis of the existing ATSS single-stage detector 521. The enhancement setting 523 includes replacing the centerness branch of the ATSS single-stage detector 521 with an IoU branch, adjusting the scales of the anchor boxes of the ATSS single-stage detector 521 from 8 to 6 and 8, and taking the first 18 samples as positive samples for sampling training according to such a sampling strategy.
Thus, the ATSS single-stage detector 521, combined with the stability setting 522 and the enhancement setting 523, forms the enhanced baseline proposed by the embodiments of the present application, which achieves performance comparable to that of a two-stage detector.
3) Third part 53: the embodiments of the present application provide a new loss function, the equalized focal loss (EFL).
Fig. 5B is a comparison of the sample distributions of a single-stage detector and a two-stage detector according to an embodiment of the present application. As shown in Fig. 5B, a two-stage detector generally does not suffer from the foreground-background imbalance problem, because its RPN (Region Proposal Network) filters out most background samples before the final classification, so the two-stage classifier operates on a relatively balanced distribution of foreground and background. A single-stage detector, however, performs dense prediction directly on the feature map, and therefore suffers from an extreme foreground-background imbalance, i.e., the number of negative samples far exceeds the number of positive samples.
Combined with the inter-class foreground sample imbalance introduced by long-tail distributed data, these two imbalance problems together hinder the performance of a single-stage long-tail detector.
Since the two imbalance problems above interact and are coupled in a single-stage detector, an algorithm is needed that solves both at once. Existing solutions typically address only the foreground-background imbalance or only the inter-class foreground sample imbalance.
Fig. 5C illustrates the ratio of sample numbers under different data distributions according to an embodiment of the present application: the horizontal axis is the category index, the vertical axis is the sample ratio, curve 501 is a long-tail distributed data set, and curve 502 is a uniformly distributed data set. Along the horizontal axis, curve 501 descends, i.e., categories ranked earlier have far more samples than categories ranked later, indicating the category imbalance of the long-tail data set. Along the vertical axis, the foreground-background imbalance is visible: negative samples are far too many and positive samples far too few.
Moreover, Fig. 5C allows the distribution of positive and negative samples of each category under the long-tail distribution to be analyzed. The analysis shows that although the positive and negative samples of every category are imbalanced under the long-tail distribution, rare categories clearly face a much more severe positive-negative imbalance than frequent categories. Therefore, the embodiments of the present application ultimately reduce the two imbalance problems of the single-stage detection model under long-tail distribution to a single problem: the degree of positive-negative sample imbalance differs across categories.
The focal loss treats the positive-negative sample imbalance of all categories equally, and therefore cannot handle the differing degrees of positive-negative imbalance across categories.
Based on this analysis, the embodiments of the present application propose an equalized focal loss (EFL) built on the above focal loss, which is obtained from the following equation (4):

EFL(p_t) = -α_t (1 - p_t)^{γ_j} log(p_t) ……………………(4);

wherein γ_j is the first important factor proposed by the embodiments of the present application, the focusing factor, which dynamically adjusts the learning strength for the samples of each category according to the degree of positive-negative sample imbalance of that category, and EFL(p_t) represents the loss function for the j-th category.
For rare categories, the above equalized focal loss places emphasis on increasing the learning strength of positive samples while decreasing the attention paid to negative samples; for frequent categories, it maintains behavior similar to that of the focal loss.
In the embodiments of the present application, the focusing factor is divided into two parts: a category-independent basic factor γ_b and a category-dependent factor γ_v^j. γ_b controls the basic behavior of the equalized focal loss and handles the positive-negative sample imbalance problem; γ_v^j, following the gradient-weighting mechanism proposed in EQLv2, is given a large value for rare categories, enlarging their attention to the positive-negative sample imbalance, and a small value for frequent categories, preserving the basic behavior of the loss function. The focusing factor is obtained from the following equation (5):

γ_j = γ_b + γ_v^j = γ_b + s(1 - g_j) ……………………(5);

wherein s is a hyper-parameter that controls the maximum learning strength the equalized focal loss applies to the positive-negative sample imbalance of rare categories, and g_j is the cumulative positive-negative gradient ratio of the j-th category during training.
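For illustration only (these values are assumptions for the example, not prescribed by this application): with γ_b = 2, s = 4, and a cumulative gradient ratio g_j = 0.25 for a rare category j, equation (5) gives γ_j = 2 + 4 × (1 − 0.25) = 5, while a frequent category with g_j = 0.9 receives γ_j = 2 + 4 × 0.1 = 2.4, remaining close to the basic behavior γ_b.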
Fig. 5D is a first schematic diagram of the equalized focal loss function according to an embodiment of the present application. As shown in Fig. 5D, the horizontal axis is x_t, the feature value before entering the classifier, where p_t = sigmoid(x_t), and the vertical axis is the loss; the different curves in Fig. 5D represent different sample categories. As can be seen from Fig. 5D, the focusing factor alone does not yet enable the single-stage model to achieve optimal performance, for two reasons. First, in a scenario where multiple categories are classified together, the larger γ_v^j is, the smaller the corresponding loss contribution becomes during training: a rare category needs a large γ_v^j to learn its extreme positive-negative sample imbalance, but the large γ_v^j in turn diminishes its gradient contribution. Second, when difficult samples of rare categories are learned together with difficult samples of frequent categories, the loss contributions of the two end up essentially identical even though each category has a different γ_v^j. In fact, since difficult samples of rare categories are scarce, they should receive a larger loss contribution in classification than difficult samples of frequent categories.
For these two reasons, the embodiments of the present application add a weighting factor to the equalized focal loss; the weighting factor raises the loss contribution of rare categories while maintaining the original loss contribution of frequent categories. The equalized focal loss after adding the weighting factor (the final EFL) is obtained from the following equation (6):

EFL(p_t) = -Σ_{j=1}^{C} (γ_j / γ_b) α_t (1 - p_t)^{γ_j} log(p_t) ……………………(6);

wherein C is the total number of categories, and γ_j / γ_b is the weighting factor of the j-th category.
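The following is a minimal sketch of the final equalized focal loss of equation (6), assuming a PyTorch environment with sigmoid-based per-category binary classification; the default values for γ_b, s, and α, the function name, and the externally maintained gradient-ratio statistics g are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def equalized_focal_loss(logits, targets, g, gamma_b=2.0, s=4.0, alpha=0.25):
    """Sketch of equation (6).

    logits, targets: (N, C) tensors; targets are 0/1 floats per category.
    g: (C,) cumulative positive-negative gradient ratios in [0, 1],
    assumed to be maintained elsewhere during training.
    """
    gamma_v = s * (1.0 - g)              # category-dependent part, eq. (5)
    gamma_j = gamma_b + gamma_v          # focusing factor per category
    w_j = gamma_j / gamma_b              # weighting factor per category
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    a_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # ce equals -log(p_t); broadcasting applies the per-category factors.
    return (a_t * w_j * (1.0 - p_t) ** gamma_j * ce).sum()
```

The per-category ratios g would be supplied by an accumulator such as the one sketched later in this document.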
Fig. 5E is a second schematic diagram of the equalized focal loss function according to an embodiment of the present application; as shown in Fig. 5E, the horizontal axis is x_t and the vertical axis is the loss. In both Fig. 5D and Fig. 5E, rare categories are given larger values of γ_v^j to attend to the foreground-background imbalance problem, while common categories are given smaller values. In Fig. 5D, as x_t becomes increasingly negative, the loss functions of all categories converge to the same value, which is not desirable, since a smaller x_t indicates a more difficult sample. In Fig. 5E, as x_t becomes increasingly negative, the degree of convergence differs across category samples: the rarer the category, the larger its loss; the more common the category, the smaller its loss; and the relative relationship between the losses of the categories also differs. Comparing Fig. 5D and Fig. 5E therefore shows that after the weighting factor is added to the equalized focal loss, rare categories receive more attention than common categories and also contribute more to the loss.
Thus, the equalized focal loss in equation (6) solves both the inter-class foreground sample imbalance and the foreground-background imbalance in the single-stage long-tail target detection scenario. Combined with the enhanced baseline, the single-stage long-tail target detection scheme provided by the embodiments of the present application surpasses all existing long-tail related algorithms. In addition, under a balanced distribution of foreground-category samples, the equalized focal loss of the embodiments of the present application is equivalent to the focal loss; this property allows it to be combined with any sampler (e.g., a random sampler, a class-balanced sampler, or a class-weighted sampler) and to work well on any data distribution.
4) Fourth part 54: if the first part receives an image to be detected as input, the second part outputs the prediction boxes corresponding to the image, including foreground boxes and background boxes, and correspondingly also outputs the category of the foreground in each foreground box. However, the output contains many redundant prediction boxes (e.g., highly overlapping ones), so the fourth part further processes the output of the second part. The fourth part mainly includes a post-processing module 541 and an output module 542: the post-processing module 541 is configured to post-process the output of the second part, for example by performing an NMS (Non-Maximum Suppression) operation on it; the output module 542 is configured to output the detection result of the image to be detected.
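A minimal sketch of the post-processing step, assuming PyTorch and torchvision's batched (per-category) NMS; the thresholds and the function name are illustrative assumptions:

```python
import torch
from torchvision.ops import batched_nms

def postprocess(boxes, scores, labels, score_thr=0.05, iou_thr=0.5):
    """Illustrative post-processing of the second part's output: drop
    low-confidence boxes, then class-aware NMS on the remainder."""
    keep = scores > score_thr
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = batched_nms(boxes, scores, labels, iou_thr)  # retained indices
    return boxes[kept], scores[kept], labels[kept]
```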
Here, the first part 51, the second part 52 and the third part 53 collectively constitute a training part of the single-stage long-tail target detection scheme of the embodiment of the present application; the first section 51, the second section 52 and the fourth section 54 collectively form the testing section of the single-stage long-tail target detection scheme of the embodiment of the present application.
The target detection method provided by the embodiments of the present application is the first long-tail single-stage target detection scheme: it improves the single-stage algorithm ATSS, adds the stability setting and the enhancement setting, and provides an enhanced baseline. In addition, the embodiments of the present application jointly consider the two problems of inter-class foreground sample imbalance and foreground-background imbalance, and propose the equalized focal loss to solve both problems simultaneously.
Thus, the target detection method in the embodiments of the present application achieves the following technical effects: 1) it is closer to real-world scenarios, has the capability to solve the long-tail problem, and inherits the fast inference and easy deployment of a single-stage detector, making it simple and efficient; 2) it provides an enhanced baseline for single-stage long-tail target detection, facilitating subsequent research; 3) it simultaneously solves the two problems of inter-class foreground sample imbalance and foreground-background imbalance in single-stage long-tail target detection, by reducing them to the single problem that the degree of positive-negative sample imbalance differs across categories under the long-tail scenario, and then proposing two adjustment factors, the focusing factor and the weighting factor, that resolve both at once; 4) based on the two adjustment factors, it proposes the equalized focal loss, which, combined with the enhanced baseline, delivers superior performance and outperforms existing long-tail target detection algorithms; 5) the equalized focal loss is loosely coupled to the model and can be applied to most single-stage models.
Therefore, usage scenarios of the target detection method in the embodiments of the present application include: 1) long-tail target detection tasks: the long-tail problem exists in many practical application scenarios, such as autonomous driving, smart cities, and industrial inspection, where common categories have many samples and rare categories have few; 2) training and deployment of single-stage detectors: the target detection method in the embodiments of the present application is a long-tail solution based entirely on a single-stage model, is closer to real-world scenarios, and combines the ease-of-deployment advantage of single-stage detectors.
Based on the foregoing embodiments, the present application further provides a training method for a detection model, where the training method uses a new loss function, and the following details are provided:
1) Focal loss;
In single-stage detectors, the focal loss is widely used to address the foreground-background imbalance problem. The focal loss redistributes the loss contributions of easy and difficult samples, thereby down-weighting most background samples. The conventional focal loss is obtained from the following equation (7):
FL(p_t) = -α_t (1 - p_t)^γ log(p_t) ……………………(7);
wherein p_t ranges from 0 to 1 and represents the predicted confidence score of a candidate object, and α_t is used to balance positive and negative samples. The modulation factor (1 - p_t)^γ is the key component of the focal loss: it down-weights easy samples and raises the learning attention given to difficult samples. Since negative samples are easy to classify relative to positive samples, the positive-negative imbalance problem can be regarded as an easy-difficult imbalance problem. If γ takes a larger value, the loss contribution of negative samples is greatly reduced, strengthening the influence of positive samples on the training result. That is, the more severe the imbalance between positive and negative samples, the larger the expected value of γ.
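For reference, a minimal sketch of the conventional focal loss of equation (7), assuming PyTorch and the commonly used defaults α = 0.25 and γ = 2:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sketch of the conventional focal loss, equation (7).
    logits, targets: (N, C); targets are 0/1 floats per category."""
    p = torch.sigmoid(logits)
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    a_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (a_t * (1.0 - p_t) ** gamma * ce).sum()   # ce = -log(p_t)
```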
Since the focal loss treats the learning process of all categories equally, all categories share the same modulation factor, and the focal loss therefore cannot handle the long-tail imbalance problem.
2) Equalized focal loss;
In addition to the foreground-background imbalance in long-tail datasets, single-stage detectors also suffer from imbalance between foreground categories. Research shows that rare categories suffer a more severe positive-negative sample imbalance on long-tail data. That is, most single-stage detectors perform worse on rare categories than on frequent categories, indicating that a single shared modulation factor cannot fit the differing degrees of positive-negative imbalance of all categories.
Based on the above analysis, the embodiments of the present application propose a category-dependent equalized focal loss (EFL) to handle the differing degrees of positive-negative sample imbalance. The equalized focal loss is category-dependent, and the equalized focal loss of the j-th category is obtained from the following equation (8):

EFL(p_t) = -α_t (1 - p_t)^{γ_j} log(p_t) ……………………(8);

wherein α_t and p_t are as in the conventional focal loss. The parameter γ_j is the focusing factor of category j; besides playing a role similar to that of the parameter γ in the conventional focal loss, it can also be used to mitigate the imbalance between the positive and negative samples of different categories. The focusing factor γ_j consists of two components: a category-independent parameter γ_b and a category-dependent parameter γ_v^j. The focusing factor γ_j is obtained from the following equation (9):

γ_j = γ_b + γ_v^j = γ_b + s(1 - g_j) ……………………(9);
wherein γ_b represents the part of the focusing factor that controls the basic behavior of the classifier on balanced data, and γ_v^j, whose value range is γ_v^j ≥ 0, is a parameter related to the degree of balance of category j: following the gradient-weighting mechanism proposed in EQLv2, it is given a large value for rare categories, enlarging their attention to the positive-negative sample imbalance, and a small value for frequent categories, preserving the basic behavior of the loss function. g_j denotes the cumulative positive-negative gradient ratio of category j: a larger g_j indicates that the training of category j is more balanced, while a smaller g_j indicates that it is more imbalanced. To satisfy the constraint on γ_v^j, the embodiments of the present application may restrict g_j to values in the range 0 to 1 and use 1 - g_j to invert its distribution. The hyper-parameter s is a scaling factor that determines the upper limit of γ_v^j in the EFL. Compared with the focal loss, the EFL proposed by the embodiments of the present application can handle the positive-negative sample imbalance of each category separately, thereby improving detector performance.
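The cumulative gradient ratio g_j could be maintained, for example, by a helper of the following form; this is a hypothetical sketch of an EQLv2-style accumulator (the class name, the way gradient magnitudes are split by sample sign, and the clamping are assumptions, not the mechanism prescribed by this application):

```python
import torch

class GradientRatioTracker:
    """Hypothetical EQLv2-style accumulator for the ratio g_j: accumulates
    per-category gradient magnitudes of positive and negative samples."""

    def __init__(self, num_classes, eps=1e-6):
        self.pos = torch.zeros(num_classes)
        self.neg = torch.zeros(num_classes)
        self.eps = eps

    def update(self, grad, targets):
        # grad: (N, C) gradient of the loss w.r.t. the logits;
        # targets: (N, C) 0/1 floats; split magnitudes by sample sign.
        mag = grad.abs()
        self.pos += (mag * targets).sum(dim=0)
        self.neg += (mag * (1.0 - targets)).sum(dim=0)

    def ratio(self):
        # Clamp to [0, 1] so that 1 - g_j in equation (9) stays non-negative.
        return (self.pos / (self.neg + self.eps)).clamp(0.0, 1.0)
```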
The focusing factor γ_j alone cannot yet achieve optimal performance of the detection model because, in a scenario where multiple categories are classified together, a larger γ_j makes the loss contribution of the corresponding category smaller during training: a rare category needs a large γ_j to learn its extreme positive-negative sample imbalance, but the large γ_j in turn diminishes its gradient contribution. Also, when difficult samples of rare categories are learned together with difficult samples of frequent categories, their loss contributions end up essentially identical even though each category has a different γ_j. In fact, since difficult samples of rare categories are scarce, they should contribute more to the loss than difficult samples of frequent categories.
Accordingly, the embodiments of the present application propose a weighting factor to mitigate the above situation by re-balancing the loss contributions of the different categories. Similarly to the focusing factor, the embodiments of the present application assign a large weighting-factor value to rare categories to increase their loss contribution, while keeping the weighting-factor value close to 1 for frequent categories. The embodiments of the present application use γ_j / γ_b to denote the weighting factor of category j, which is thus tied to the focusing factor. The final equalized focal loss is obtained from the following equation (10):

EFL(p_t) = -Σ_{j=1}^{C} (γ_j / γ_b) α_t (1 - p_t)^{γ_j} log(p_t) ……………………(10);
that is, equalizing focus loss after adding the weighting factors significantly increases the loss contribution of rare classes, while also improving the learning attention of rare difficult samples relative to frequent difficult samples.
In general, the focusing factor and the weighting factor constitute the category-dependent adjustment factors in the equalized focal loss EFL, which enable the detection model to dynamically adjust loss contributions. The EFL raises the learning strength of rare categories relative to frequent categories, balances the loss contributions of foreground and background samples, and increases the loss contribution of rare categories relative to frequent categories; it can therefore solve the inter-class foreground imbalance problem and the foreground-background imbalance problem simultaneously. Moreover, if the data is uniformly distributed, then for every category in the EFL γ_v^j = 0 and the weighting factor equals 1, so the EFL reduces to the focal loss; this property makes the EFL applicable to data of different distributions and to different data samplers.
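A short illustration of this reduction property under the sketches above; num_classes and the equivalence noted in the comments are illustrative:

```python
import torch

num_classes = 10                      # illustrative
g = torch.ones(num_classes)           # fully balanced statistics: g_j = 1
# Then gamma_v^j = s * (1 - g_j) = 0 and the weighting factor
# gamma_b / gamma_b = 1, so equalized_focal_loss(logits, targets, g)
# coincides with focal_loss(logits, targets) when alpha matches and
# gamma = gamma_b.
```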
Based on the foregoing embodiments, the present application provides a training apparatus for a detection model. The units included in the apparatus, the sub-units and modules included in the units, and the sub-modules and components included in the modules may be implemented by a processor in an electronic device; of course, they may also be implemented by specific logic circuits. In implementation, the processor may be a CPU (Central Processing Unit), an MPU (Microprocessor Unit), a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or the like.
Fig. 6 is a schematic structural diagram of a training apparatus for detecting a model according to an embodiment of the present application, and as shown in fig. 6, the training apparatus 600 includes:
a first prediction unit 601, configured to perform target detection on the acquired training image set by using a detection model to be trained, so as to obtain a predicted detection result;
a degree determining unit 602, configured to determine imbalance degrees of positive and negative samples of different classes in the training image set;
a loss determining unit 603 configured to determine a loss of the predicted detection result in the different classes based on the predicted detection result and the imbalance degree of the positive and negative samples of the different classes;
a parameter updating unit 604, configured to update the network parameter in the detection model to be trained by using the loss until the updated detection model meets a convergence condition.
In some embodiments, the degree determining unit 602 includes:
the cumulative gradient ratio determining module is used for determining the cumulative gradient ratio of positive and negative samples of different classes in the training image set;
and the degree determining module is used for determining the unbalance degree of the positive and negative samples of different classes based on the cumulative gradient ratio of the positive and negative samples of different classes.
In some embodiments, the loss determining unit 603 includes:
the weighting module is used for weighting the unbalance degrees of the positive and negative samples of different categories respectively by utilizing a first hyper-parameter to obtain a weighting result of each category in the different categories; the first hyper-parameter is used for controlling the learning strength of the detection model to be trained on the unbalance degree of the positive and negative samples of the rare classes in different classes;
a first loss determination module to determine a loss of the predicted detection result in the each class based on the predicted detection result and the weighted result of the each class.
In some embodiments, the loss determining unit 603 includes:
a focus loss determination module for determining a focus loss of the predicted detection result;
and the second loss determining module is used for adjusting the focus loss by adopting the unbalance degrees of the positive and negative samples of different categories to obtain the predicted loss of the detection result in the different categories.
In some embodiments, the different categories include at least: a rare category and a frequent category, the apparatus further comprising:
the weight determining unit is used for determining a weight factor based on a basic factor and the unbalance degree of the positive and negative samples of the rare class and the frequent class; wherein the base factor is used for balancing the loss contribution of the foreground region and the background region in the training image;
the optimization unit is used for adjusting the loss contribution of the rare category relative to the frequent category in the loss based on the weight factor to obtain the optimized loss;
the parameter updating unit 604 includes:
and the first parameter updating subunit is used for updating the network parameters in the detection model to be trained by using the optimized loss.
In some embodiments, the parameter updating unit 604 includes:
a first gradient determining module, configured to determine, based on the loss, a gradient value output by the detection model to be trained in an updating process;
the second gradient determining module is used for performing gradient cutting on the gradient value under the condition that the gradient value is greater than or equal to a preset threshold value to obtain a cut gradient value;
and the parameter updating module is used for updating the network parameters in the detection model based on the cut gradient value.
In some embodiments, the apparatus further comprises:
the iteration number determining unit is used for determining the iteration number of the iterative training of the detection model to be trained in a pre-training stage based on the distribution condition of the images in the training image set;
the pre-training unit is used for pre-training the detection model to be trained based on the iteration times and the training image set to obtain a candidate detection model;
the parameter updating unit 604 includes:
and the second parameter updating subunit is used for updating the network parameters in the candidate detection model by using the loss until the updated detection model meets the convergence condition.
In some embodiments, the iteration number determining unit includes:
the first iteration number determining module is used for setting the iteration number of the detection model to be trained in a pre-training stage as a first preset value under the condition that the training image set is a long-tail distribution image set;
the second iteration number determining module is used for setting the iteration number as a second preset value under the condition that the training image set is an image set which is not distributed in a long tail manner; wherein the first preset value is greater than the second preset value.
In some embodiments, the first prediction unit 601 includes:
the prediction frame generation module is used for generating at least two prediction frames on each region of the training image by adopting the detection model to be trained; wherein the at least two prediction boxes have different scales;
an intersection-over-union (IoU) determining module, configured to determine the IoU between each generated prediction box and the labeled box of the training image;
and a prediction module, configured to measure, based on the IoU, the probability that the target in a prediction box is foreground, obtaining the predicted detection result for the training image; an illustrative IoU computation is sketched below.
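A minimal sketch of the pairwise IoU computation such a module might perform, assuming PyTorch and (x1, y1, x2, y2) box coordinates; the function name is an assumption for the example:

```python
import torch

def pairwise_iou(boxes_a, boxes_b):
    """Illustrative IoU between prediction boxes (M, 4) and labeled boxes
    (N, 4), with boxes given as (x1, y1, x2, y2)."""
    tl = torch.max(boxes_a[:, None, :2], boxes_b[None, :, :2])   # (M, N, 2)
    br = torch.min(boxes_a[:, None, 2:], boxes_b[None, :, 2:])   # (M, N, 2)
    wh = (br - tl).clamp(min=0)                                  # overlap w, h
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)
```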
In some embodiments, the apparatus further comprises:
the preprocessing unit is used for preprocessing each training image in the training image set to obtain a preprocessed training image;
the first feature extraction unit is used for extracting features of the preprocessed training images to obtain a feature map;
the prediction box generation module comprises:
the prediction frame generation sub-module is used for generating at least two prediction frames on each pixel of the feature map by adopting a detection model to be trained; wherein the pixels of the feature map have a correspondence with regions in the training image.
Based on the foregoing embodiments, the present application provides an object detection apparatus. The units included in the apparatus, the sub-units and modules included in the units, and the sub-modules and components included in the modules may be implemented by a processor in an electronic device; of course, they may also be implemented by specific logic circuits. In implementation, the processor may be a CPU, an MPU, a DSP, an FPGA, or the like.
Fig. 7 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present application, and as shown in fig. 7, the target detection apparatus 700 includes:
an acquiring unit 701 configured to acquire an image to be detected;
a preprocessing unit 702, configured to preprocess the image to be detected, so as to obtain a preprocessed image to be detected;
a feature extraction unit 703, configured to perform feature extraction on the preprocessed image to be detected to obtain a feature map of the image to be detected;
the second prediction unit 704 is configured to input the feature map of the image to be detected to the detection model to obtain a prediction result; wherein the detection model is trained based on the method of any one of claims 1 to 10;
and a post-processing unit 705, configured to perform post-processing on the prediction result to obtain a detection result of the image to be detected.
In some embodiments, the prediction result comprises a plurality of prediction boxes, and a target class corresponding to each prediction box; the post-processing unit 705 includes:
and the post-processing subunit is used for filtering the overlapped prediction frames in the plurality of prediction frames to obtain the filtered prediction frames and the target category corresponding to each filtered prediction frame.
The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the training method or the detection method is implemented in the form of a software functional module and sold or used as a standalone product, the training method or the detection method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof contributing to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for causing an electronic device (which may be a personal computer, a server, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a ROM (Read Only Memory), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor executes the computer program to implement the steps in the training method or the steps in the detection method provided in the foregoing embodiment.
Correspondingly, the embodiment of the present application provides a readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the training method or the steps in the detection method.
Here, it should be noted that: the above description of the storage medium and platform embodiments is similar to the description of the method embodiments above, with similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and the platform of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that fig. 8 is a schematic diagram of a hardware entity of an electronic device according to an embodiment of the present application, and as shown in fig. 8, the hardware entity of the electronic device 800 includes: a processor 801, a communication interface 802, and a memory 803, wherein
The processor 801 generally controls the overall operation of the electronic device 800.
The communication interface 802 may enable the electronic device 800 to communicate with other servers or electronic devices or platforms via a network.
The memory 803 is configured to store instructions and applications executable by the processor 801, and may also buffer data to be processed or already processed by the processor 801 and by the modules in the electronic device 800 (e.g., image data, audio data, voice communication data, and video communication data); it may be implemented by a FLASH memory or a RAM (Random Access Memory);
wherein the various hardware entities within electronic device 800 are coupled together by a bus 804. It is understood that bus 804 is used to enable communications of connections between these hardware entities.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing module, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit. Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments. Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict. The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A training method for a detection model, the method comprising:
adopting a detection model to be trained to perform target detection on the obtained training image set to obtain a predicted detection result;
determining the unbalance degree of positive and negative samples of different classes in the training image set;
determining a loss of the predicted detection result in the different classes based on the predicted detection result and a degree of imbalance of positive and negative samples of the different classes;
and updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
2. The training method of claim 1, wherein the determining the degree of imbalance for different classes of positive and negative samples in the set of training images comprises:
determining cumulative gradient ratios of positive and negative samples of different classes in the training image set;
determining the degree of imbalance of the positive and negative samples of the different classes based on the cumulative gradient ratios of the positive and negative samples of the different classes.
3. The method of claim 1 or 2, wherein determining the loss of the predicted detection result in the different classes based on the predicted detection result and the degree of imbalance of the positive and negative samples of the different classes comprises:
weighting the unbalance degrees of the positive and negative samples of different classes respectively by utilizing a first hyper-parameter to obtain a weighting result of each class in the different classes; the first hyper-parameter is used for controlling the learning strength of the detection model to be trained on the unbalance degree of the positive and negative samples of the rare classes in different classes;
determining a loss of the predicted detection result in the each class based on the predicted detection result and the weighted result of the each class.
4. The method of any of claims 1 to 3, wherein said determining a loss of said predicted detection result in said different class based on said predicted detection result and a degree of imbalance of positive and negative samples of said different class comprises:
determining a loss of focus for the predicted detection result;
and adjusting the focus loss by adopting the unbalance degree of the positive and negative samples of different classes to obtain the predicted loss of the detection result in the different classes.
5. The method according to claim 4, characterized in that said different categories comprise at least: a rare class and a frequent class, wherein the focus loss is adjusted by using the imbalance degree of the positive and negative samples of the different classes, and the predicted detection result loss in the different classes is obtained, and the method further comprises:
determining a weighting factor based on a basic factor and the unbalance degree of the positive and negative samples of the rare class and the frequent class; wherein the base factor is used for balancing the loss contribution of the foreground region and the background region in the training image;
based on the weight factor, adjusting the loss contribution of the rare category relative to the frequent category in the loss to obtain optimized loss;
the updating the network parameters in the detection model to be trained by using the loss comprises: and updating the network parameters in the detection model to be trained by utilizing the optimized loss.
6. The method according to any one of claims 1 to 5, wherein the updating the network parameters in the detection model to be trained by using the loss comprises:
determining a gradient value output by the detection model to be trained in the updating process based on the loss;
performing gradient cutting on the gradient value under the condition that the gradient value is greater than or equal to a preset threshold value to obtain a cut gradient value;
and updating the network parameters in the detection model based on the cut gradient value.
7. The method according to any one of claims 1 to 6, further comprising:
determining the iteration times of the iterative training of the detection model to be trained in a pre-training stage based on the distribution condition of the images in the training image set;
pre-training the detection model to be trained based on the iteration times and the training image set to obtain a candidate detection model;
the updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition includes:
and updating the network parameters in the candidate detection model by using the loss until the updated detection model meets the convergence condition.
8. The method according to claim 7, wherein the determining the number of iterations for performing iterative training on the detection model to be trained in a pre-training stage based on the distribution of the images in the training image set comprises:
setting the iteration times of the detection model to be trained in a pre-training stage as a first preset value under the condition that the training image set is a long-tail distribution image set;
setting the iteration times as a second preset value under the condition that the training image set is a non-long-tail distribution image set; wherein the first preset value is greater than the second preset value.
9. The method according to any one of claims 1 to 8, wherein the performing target detection on the acquired training image set by using the detection model to be trained to obtain a predicted detection result comprises:
generating at least two prediction frames on each region of a training image by adopting a detection model to be trained; wherein the at least two prediction boxes have different scales;
determining an intersection-over-union between the generated prediction frames and a labeling frame of the training image;
and measuring the probability that the target in the prediction frame is the foreground based on the intersection-over-union to obtain the detection result of the training image prediction.
10. The method of claim 9, further comprising:
preprocessing each training image in the training image set to obtain a preprocessed training image;
performing feature extraction on the preprocessed training image to obtain a feature map;
the generating at least two prediction frames on each region of the training image by using the detection model to be trained comprises:
generating at least two prediction frames on each pixel of the feature map by adopting a detection model to be trained; wherein the pixels of the feature map have a correspondence with regions in the training image.
11. A method of object detection, the method comprising:
acquiring an image to be detected;
preprocessing the image to be detected to obtain a preprocessed image to be detected;
performing feature extraction on the preprocessed image to be detected to obtain a feature map of the image to be detected;
inputting the characteristic diagram of the image to be detected into a detection model to obtain a prediction result; wherein the detection model is trained based on the method of any one of claims 1 to 10;
and carrying out post-processing on the prediction result to obtain a detection result of the image to be detected.
12. The method of claim 11, wherein the prediction result comprises a plurality of prediction boxes, and a target class corresponding to each prediction box;
the post-processing the prediction result to obtain the detection result of the image to be detected comprises the following steps:
and filtering the overlapped prediction frames in the plurality of prediction frames to obtain the filtered prediction frames and the target category corresponding to each filtered prediction frame.
13. A training apparatus for a detection model, the apparatus comprising:
the first prediction unit is used for carrying out target detection on the obtained training image set by adopting a detection model to be trained to obtain a predicted detection result;
the degree determining unit is used for determining the unbalance degree of the positive and negative samples of different classes in the training image set;
a loss determination unit configured to determine a loss of the predicted detection result in the different classes based on the predicted detection result and a degree of imbalance of positive and negative samples of the different classes;
and the parameter updating unit is used for updating the network parameters in the detection model to be trained by using the loss until the updated detection model meets the convergence condition.
14. An object detection apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring an image to be detected;
the preprocessing unit is used for preprocessing the image to be detected to obtain a preprocessed image to be detected;
the characteristic extraction unit is used for extracting the characteristics of the preprocessed image to be detected to obtain a characteristic diagram of the image to be detected;
the second prediction unit is used for inputting the characteristic diagram of the image to be detected into the detection model to obtain a prediction result; wherein the detection model is trained based on the method of any one of claims 1 to 10;
and the post-processing unit is used for performing post-processing on the prediction result to obtain the detection result of the image to be detected.
15. An electronic device comprising a memory and a processor, the memory storing a computer program operable on the processor to perform the steps of the method of any one of claims 1 to 10 or the steps of the method of any one of claims 11 to 12 when the program is executed by the processor.
16. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 10 or carries out the steps of the method of any one of claims 11 to 12.
CN202210010287.0A 2022-01-06 2022-01-06 Target detection method, model training method, device, equipment and storage medium Pending CN114387483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210010287.0A CN114387483A (en) 2022-01-06 2022-01-06 Target detection method, model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210010287.0A CN114387483A (en) 2022-01-06 2022-01-06 Target detection method, model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114387483A true CN114387483A (en) 2022-04-22

Family

ID=81199561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210010287.0A Pending CN114387483A (en) 2022-01-06 2022-01-06 Target detection method, model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114387483A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116128882A (en) * 2023-04-19 2023-05-16 中汽数据(天津)有限公司 Motor bearing fault diagnosis method, equipment and medium based on unbalanced data set


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination