CN117197592B - Target detection model training method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN117197592B
Authority
CN
China
Prior art keywords
target detection
sample image
training
training sample
sample images
Prior art date
Legal status
Active
Application number
CN202311461357.5A
Other languages
Chinese (zh)
Other versions
CN117197592A (en)
Inventor
沈西
谭期友
黄世华
于永军
童小树
Current Assignee
Intelingda Information Technology Shenzhen Co ltd
Original Assignee
Intelingda Information Technology Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Intelingda Information Technology Shenzhen Co ltd
Priority to CN202311461357.5A
Publication of CN117197592A
Application granted
Publication of CN117197592B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The embodiments of the present application provide a target detection model training method and apparatus, an electronic device, and a medium, relating to the technical field of image processing. The method includes: acquiring a positive sample image set and a negative sample image set; performing target detection on each negative sample image in the negative sample image set with a target detection network to obtain the target detection result of each negative sample image and the confidence of that result; screening out a specified number of negative sample images based on those confidences; and training the target detection network based on the positive sample images in the positive sample image set and the screened negative sample images, then returning to the step of performing target detection on each negative sample image in the negative sample image set with the target detection network, until training of the target detection network is completed, the trained target detection network serving as the target detection model. The detection accuracy of the target detection model is thereby improved.

Description

Target detection model training method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and apparatus for training a target detection model, an electronic device, and a medium.
Background
The target detection task is one of the most widely applied tasks in the field of computer vision. It is typically implemented with a target detection model, for example to identify the location of a specified target in an image. Detection performance is generally measured by the detection rate (recall) and the accuracy (precision). The detection rate is the proportion of images containing the specified target in which the target detection model actually identifies that target. The accuracy is the proportion of images identified by the target detection model as containing the specified target that actually do contain it.
Accuracy is particularly important in large-scale target detection scenarios. For example, in a video auditing scenario, suppose there are tens of thousands of target detection service lines, each of which uses a target detection model to audit whether a video contains frames that include a specified target. If each service line generates one false alarm per day, tens of thousands of false alarms accumulate daily, and manually checking whether the videos flagged by the target detection model actually contain the specified target entails a huge workload. A false alarm refers to the situation where a video is identified by the target detection model as including the specified target, but the video does not actually include it.
Therefore, in order to improve accuracy and reduce the false alarm rate, the conventional approach is to directly take every video frame of a false-alarm video as a negative sample image, take video frames containing the specified target that were manually extracted from videos as positive sample images, and train the target detection model with these positive and negative samples. In this case, however, the negative sample images outnumber the positive sample images by several tens of times, so the ratio of negative to positive samples in the training data is extremely unbalanced. With such an excess of negative samples, the target detection model overemphasizes the negative sample images during training and neglects learning from the positive sample images, resulting in low detection accuracy of the target detection model.
Disclosure of Invention
An object of the embodiment of the application is to provide a method, a device, an electronic device and a medium for training a target detection model, so as to improve the detection accuracy of the target detection model. The specific technical scheme is as follows:
in a first aspect of an embodiment of the present application, there is provided a method for training a target detection model, the method including:
acquiring a positive sample image set and a negative sample image set;
Respectively carrying out target detection on each negative sample image included in the negative sample image set by utilizing a target detection network to obtain target detection results of the negative sample images and confidence degrees of the target detection results;
screening out a specified number of negative sample images based on the confidence level of the target detection result of each negative sample image;
and training the target detection network based on each positive sample image and the screened negative sample image included in the positive sample image set, and returning to the step of respectively carrying out target detection on each negative sample image included in the negative sample image set by using the target detection network until the training of the target detection network is completed, wherein the trained target detection network is used as a target detection model.
Optionally, the training the target detection network based on each positive sample image included in the positive sample image set and the screened negative sample image includes:
constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample images;
according to the arrangement sequence of each group of training sample images, a group of training sample images are obtained, and the target detection model is utilized to respectively carry out target detection on each training sample image included in the group of training sample images, so that the target detection result and the confidence coefficient of the target detection result of each training sample image included in the group of training sample images are obtained;
Calculating a loss value by using the target detection result of each training sample image included in the group of training sample images and the confidence coefficient of the target detection result;
determining whether the target detection network converges based on the loss value;
if yes, determining that the target detection network training is completed;
if not, the network parameters of the target detection network are adjusted based on the loss value, and the step of obtaining a group of training sample images according to the arrangement sequence of each group of training sample images is returned to, until every group of training sample images has been obtained.
Optionally, the target detection result includes: a detection frame in which a specified target in the image is located and a detection type to which the specified target in the image belongs; calculating a loss value by using the target detection result of each training sample image included in the set of training sample images and the confidence of the target detection result, including:
determining a position error between a detection frame in the training sample image and a label frame in which a specified target in the training sample image is actually positioned for each training sample image included in the group of training sample images, and determining regression loss according to the position error of each training sample image included in the group of training sample images;
Determining type errors between a detection type to which a specified target in the training sample image belongs and a tag type to which the specified target in the training sample image actually belongs for each training sample image included in the group of training sample images, and determining classification loss according to the type errors of each training sample image included in the group of training sample images;
for each training sample image included in the group of training sample images, determining a confidence error of the training sample image according to the tag type to which the specified target in the training sample image actually belongs, and determining a confidence loss according to the confidence errors of the training sample images included in the group of training sample images;
the loss value is determined based on the regression loss, the classification loss, and the confidence loss.
Optionally, the loss value is:
loss = L_reg + α × L_cls + β × L_conf;
wherein loss is the loss value, L_reg is the regression loss, α is the preset weight of the classification loss, L_cls is the classification loss, β is the preset weight of the confidence loss, and L_conf is the confidence loss.
Optionally, the constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample image includes:
Acquiring a first preset number of positive sample images from the positive sample image set, and acquiring a second preset number of negative sample images from the screened negative sample images;
performing stitching or overlapping treatment on the first preset number of positive sample images and the second preset number of negative sample images, taking the treated images as a training sample image, and returning to the step of acquiring the first preset number of positive sample images from the positive sample image set until the number of the training sample images reaches the preset sample number;
dividing the training sample images of the preset sample number into a plurality of groups.
Optionally, before the target detection is performed on each negative sample image included in the negative sample image set by using the target detection network, the method further includes:
setting the mode of the target detection network as a test mode;
after the target detection is performed on each negative sample image included in the negative sample image set by using the target detection network, the method further includes:
and setting the mode of the target detection network as a training mode.
Optionally, the screening out a specified number of negative sample images based on the confidence of the target detection result of each negative sample image includes:
And screening out the negative sample images with the specified number according to the order of the confidence of the target detection result of each negative sample image from high to low.
In a second aspect of the embodiments of the present application, there is provided a training apparatus for a target detection model, the apparatus including:
the acquisition module is used for acquiring a positive sample image set and a negative sample image set;
the detection module is used for respectively carrying out target detection on each negative sample image included in the negative sample image set by utilizing a target detection network to obtain target detection results of the negative sample images and confidence degrees of the target detection results;
the screening module is used for screening out the negative sample images with specified number based on the confidence level of the target detection result of each negative sample image;
the training module is used for training the target detection network based on each positive sample image and the screened negative sample image included in the positive sample image set, and returning to the step of respectively carrying out target detection on each negative sample image included in the negative sample image set by using the target detection network until the training of the target detection network is completed, and taking the trained target detection network as a target detection model.
Optionally, the training module is specifically configured to:
constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample images;
according to the arrangement sequence of each group of training sample images, a group of training sample images are obtained, and the target detection model is utilized to respectively carry out target detection on each training sample image included in the group of training sample images, so that the target detection result and the confidence coefficient of the target detection result of each training sample image included in the group of training sample images are obtained;
calculating a loss value by using the target detection result of each training sample image included in the group of training sample images and the confidence coefficient of the target detection result;
determining whether the target detection network converges based on the loss value;
if yes, determining that the target detection network training is completed;
if not, the network parameters of the target detection network are adjusted based on the loss value, and the step of obtaining a group of training sample images according to the arrangement sequence of each group of training sample images is returned to, until every group of training sample images has been obtained.
Optionally, the target detection result includes: a detection frame in which a specified target in the image is located and a detection type to which the specified target in the image belongs; the training module is specifically configured to:
Determining a position error between a detection frame in the training sample image and a label frame in which a specified target in the training sample image is actually positioned for each training sample image included in the group of training sample images, and determining regression loss according to the position error of each training sample image included in the group of training sample images;
determining type errors between a detection type to which a specified target in the training sample image belongs and a tag type to which the specified target in the training sample image actually belongs for each training sample image included in the group of training sample images, and determining classification loss according to the type errors of each training sample image included in the group of training sample images;
for each training sample image included in the group of training sample images, determining a confidence error of the training sample image according to the tag type to which the specified target in the training sample image actually belongs, and determining a confidence loss according to the confidence errors of the training sample images included in the group of training sample images;
the loss value is determined based on the regression loss, the classification loss, and the confidence loss.
Optionally, the loss value is:
loss = L_reg + α × L_cls + β × L_conf;
wherein loss is the loss value, L_reg is the regression loss, α is the preset weight of the classification loss, L_cls is the classification loss, β is the preset weight of the confidence loss, and L_conf is the confidence loss.
Optionally, the training module is specifically configured to:
acquiring a first preset number of positive sample images from the positive sample image set, and acquiring a second preset number of negative sample images from the screened negative sample images;
performing stitching or overlapping treatment on the first preset number of positive sample images and the second preset number of negative sample images, taking the treated images as a training sample image, and returning to the step of acquiring the first preset number of positive sample images from the positive sample image set until the number of the training sample images reaches the preset sample number;
dividing the training sample images of the preset sample number into a plurality of groups.
Optionally, the apparatus further includes:
the setting module is used for setting the mode of the target detection network as a test mode before the target detection is carried out on each negative sample image included in the negative sample image set by using the target detection network;
The setting module is further configured to set a mode of the target detection network as a training mode after the target detection is performed on each negative sample image included in the negative sample image set by using the target detection network.
Optionally, the screening module is specifically configured to:
and screening out the negative sample images with the specified number according to the order of the confidence of the target detection result of each negative sample image from high to low.
In a third aspect of the embodiments of the present application, an electronic device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor configured to implement the steps of the target detection model training method of any one of the first aspect when executing the program stored on the memory.
In a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the object detection model training method steps of any one of the first aspects.
In a fifth aspect of embodiments of the present application, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the object detection model training method of any of the above-described first aspects.
The beneficial effects of the embodiment of the application are that:
According to the target detection model training method and apparatus, electronic device, and medium provided by the embodiments of the present application, target detection can be performed on each negative sample image in the negative sample image set with the target detection network to obtain the target detection result of each negative sample image and its confidence, and a specified number of negative sample images are screened out based on those confidences, so that the target detection network is trained with the positive sample images and the screened negative sample images to obtain the target detection model. Because the negative sample images are screened, the number of negative samples used in model training is reduced, which alleviates the large gap between the proportions of negative and positive sample images in the training data and balances the attention the target detection network pays to positive and negative sample images during training, thereby improving the detection accuracy of the trained target detection model.
Of course, not all of the above-described advantages need be achieved simultaneously in practicing any one of the products or methods of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art may obtain other embodiments from these drawings.
FIG. 1 is a flowchart of a training method for a target detection model according to an embodiment of the present application;
FIG. 2 is a flowchart of another training method for a target detection model according to an embodiment of the present application;
FIG. 3 is an exemplary schematic diagram of a training process for a target detection model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a backbone network and a feature fusion network included in a target detection model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a CSPlayer included in a target detection model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a detection head network included in a target detection model according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a training device for a target detection model according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following clearly and completely describes the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments herein, all other embodiments obtained by a person of ordinary skill in the art fall within the protection scope of this disclosure.
In order to improve the detection accuracy of the target detection model, the embodiment of the application provides a target detection model training method which is applied to equipment with model training capability such as a server. As shown in fig. 1, the method comprises the steps of:
s101, acquiring a positive sample image set and a negative sample image set.
The positive sample images included in the positive sample image set all comprise the appointed target, and the negative sample images included in the negative sample image set all do not comprise the appointed target. The specified target can be set according to the actual application scene. For example, when the embodiment of the application is applied to an intelligent security monitoring scene, the specified target may be a face or a human body; when the embodiment of the application is applied to an automatic driving scene, the specified target can be a vehicle or a pedestrian or the like; when the embodiments of the present application are applied to intelligent medical scenarios, the designated target may be a surgical tracer or lesion, etc. Or the embodiment of the application may also be applied to other scenes, and the application scene and the specified target in the application scene are not particularly limited.
S102, respectively carrying out target detection on each negative sample image included in the negative sample image set by utilizing a target detection network to obtain target detection results of the negative sample images and confidence degrees of the target detection results.
The target detection result of each negative sample image may include whether the negative sample image includes a specified target, and a position of the specified target in the image in a case where the negative sample image includes the specified target.
The confidence (objectness) of each target detection result is within a preset confidence range, for example, the preset confidence range is 0-1.
S103, screening out the negative sample images with specified number based on the confidence of the target detection result of each negative sample image.
S104, training a target detection network based on each positive sample image and the screened negative sample image included in the positive sample image set, and returning to the step S101 of respectively carrying out target detection on each negative sample image included in the negative sample image set by utilizing the target detection network until the training of the target detection network is completed, wherein the trained target detection network is used as a target detection model.
According to the target detection model training method, target detection can be performed on each negative sample image in the negative sample image set with the target detection network to obtain the target detection result of each negative sample image and its confidence, and a specified number of negative sample images are screened out based on those confidences, so that the target detection network is trained with the positive sample images and the screened negative sample images to obtain the target detection model. Because the negative sample images are screened, the number of negative samples used in model training is reduced, which alleviates the large gap between the proportions of negative and positive sample images in the training data and balances the attention the target detection network pays to positive and negative sample images during training, thereby improving the detection accuracy of the trained target detection model.
In this embodiment of the present application, the step S103 of screening out the negative sample images with a specified number based on the confidence of the target detection result of each negative sample image may be implemented as follows: and screening out the negative sample images with the specified number according to the order of the confidence of the target detection result of each negative sample image from high to low.
Denoting the specified number as K, the first K negative sample images can be selected in descending order of confidence to obtain the screened negative sample image set {I_1, I_2, …, I_K}, where conf(I_j) denotes the confidence of the target detection result of negative sample image I_j and conf(I_1) ≥ conf(I_2) ≥ … ≥ conf(I_K).
Alternatively, the specified number may be N1 times the number of positive sample images included in the positive sample image set, e.g., N1 = 2. Alternatively, the specified number may be 1/N2 of the number of negative sample images included in the negative sample image set, e.g., N2 = 10. The specified number may also be another value, set according to actual requirements.
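As a rough illustration only, the confidence-ranked screening described above may be sketched as follows in Python; the NegativeSample class and the function name are invented for this example and are not taken from the patent:

from dataclasses import dataclass

@dataclass
class NegativeSample:
    image_path: str
    confidence: float  # confidence of the target detection result on this negative image

def screen_hard_negatives(negatives: list[NegativeSample], k: int) -> list[NegativeSample]:
    # Keep the k negatives the network is most confident about, i.e. the hard ones.
    return sorted(negatives, key=lambda s: s.confidence, reverse=True)[:k]

# k can be chosen as described above, e.g. twice the number of positive images
# or one tenth of the number of negative images.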
The target detection result may include: the detection frame in which the specified target in the image is located, and the detection type to which the specified target belongs. The detection frame can be represented by the pixel coordinates of its center point in the image together with the horizontal and vertical offset distances from that center point. For example, the target detection result includes the pixel coordinates [a, b] of the center point of the detection frame, a horizontal offset x and a vertical offset y, so the detection frame has a width of 2x and a height of 2y. The center point of the detection frame may be referred to as an anchor point, and the target detection model may determine the anchor point based on an Optimal Transport Assignment (OTA) or Simplified Optimal Transport Assignment (SimOTA) algorithm. The detection type to which the specified target in the image belongs is one of a plurality of preset types. For example, if the preset types are truck, bus and fire truck, the detection type may be truck.
The confidence of the target detection result is as follows: the likelihood that a specified object in the image belongs to the detection type. Therefore, the greater the confidence of the target detection result of the negative sample image, the greater the likelihood that the target detection model considers that the negative sample image includes the specified target, but the negative sample image actually does not include the specified target, so the greater the detection difficulty of the negative sample image for the target detection model.
Therefore, in the embodiment of the application, the negative sample images with larger confidence coefficient of the target detection result are screened out from the negative sample images, and the detection difficulty of the negative sample images on the target detection model is larger, so that the negative sample images are utilized to train the target detection model, and the detection accuracy of the target detection model can be improved more effectively.
In contrast, negative sample images whose target detection results have low confidence pose little detection difficulty for the target detection model. Training the model on them would consume substantial training resources, lengthen training, and lower training efficiency, while contributing little to recognition accuracy. The embodiments of the present application therefore forgo training the target detection model on such negative sample images, improving the detection accuracy of the target detection model while also improving training efficiency, and reducing the false detection rate while preserving the detection rate.
Alternatively, the manner of screening the negative sample image in S103 may be: negative sample images with the confidence of the target detection result in the same confidence interval are divided into a group, and a plurality of negative sample images are randomly acquired from each group of negative sample images.
The number of negative sample images acquired from each group may be the same or different. For example, negative sample images whose target detection results have a confidence in [0, 0.5] form a first group, those with a confidence in (0.5, 1] form a second group, and n1 negative sample images are randomly acquired from the first group and n2 from the second group.
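Continuing the sketch above, this interval-grouping alternative may be illustrated as follows; the bucket edges and per-group counts are only example values in the spirit of the text:

import random

def screen_by_confidence_bins(negatives, edges=(0.0, 0.5, 1.0), per_bin=(100, 100), seed=0):
    # Bucket negatives by the confidence of their detection result, then draw a
    # preset number at random from each bucket; the last bucket includes its upper edge.
    rng = random.Random(seed)
    picked = []
    for i, n in enumerate(per_bin):
        lo, hi = edges[i], edges[i + 1]
        last = (i == len(per_bin) - 1)
        group = [s for s in negatives
                 if lo <= s.confidence < hi or (last and s.confidence == hi)]
        picked.extend(rng.sample(group, min(n, len(group))))
    return picked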
Alternatively, the negative sample image may be screened in other manners, which are not specifically limited in the embodiments of the present application.
In this embodiment of the present application, referring to fig. 2, the step S104 of training the target detection network based on each positive sample image and the screened negative sample image included in the positive sample image set includes the following steps:
S201, constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample images.
A first preset number of positive sample images may be acquired from the positive sample image set, and a second preset number of negative sample images may be acquired from the screened negative sample images; the images may be selected randomly at each acquisition. Then, the first preset number of positive sample images and the second preset number of negative sample images are stitched or superimposed, the processed image is taken as one training sample image, and the step of acquiring the first preset number of positive sample images from the positive sample image set is returned to until the number of training sample images reaches the preset sample number. The training sample images of the preset sample number are then divided into a plurality of groups.
Wherein the first preset number and the second preset number may be the same or different. For example, the first preset number and the second preset number are both 2.
For example, a Mosaic (Mosaic) algorithm may be adopted to perform stitching on the first preset number of positive sample images and the second preset number of negative sample images in a random clipping, random scaling or random arrangement manner, and the stitched images are used as a training sample.
For another example, a hybrid (MixUp) algorithm may be employed: for each pixel position shared by a positive sample image and a negative sample image, the pixel value at that position in the positive sample image multiplied by α% is added to the pixel value at that position in the negative sample image multiplied by β%, and the result is taken as the pixel value at that position in the superimposed image. The value ranges of α and β are 0 to 100, and α and β may be preset.
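A minimal sketch of this MixUp-style superposition, assuming two images of identical shape stored as uint8 arrays; whether α and β must sum to 100 is not specified in the text, so the default values here are only one plausible choice:

import numpy as np

def mixup(pos: np.ndarray, neg: np.ndarray, alpha: float = 60.0, beta: float = 40.0) -> np.ndarray:
    # Blend pixel values: alpha% of the positive image plus beta% of the negative image.
    mixed = pos.astype(np.float32) * (alpha / 100.0) + neg.astype(np.float32) * (beta / 100.0)
    return np.clip(mixed, 0, 255).astype(np.uint8)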
Or the images may be spliced or superimposed in other manners, which is not specifically limited in the embodiment of the present application.
The obtained training sample images are then divided into multiple groups; each group of training sample images may be referred to as a small batch (mini-batch), and each training sample image may be referred to as a batch (batch).
By the method, each training sample obtained by the embodiment of the application comprises the positive sample image and the negative sample image, so that the situation that only the positive sample image or only the negative sample image is used in each iteration when the target detection model is trained is avoided, and the detection accuracy of the target detection network is improved.
S202, acquiring a group of training sample images according to the arrangement sequence of the groups of training sample images, and performing target detection on each training sample image included in the group with the target detection model, so as to obtain the target detection result and the confidence of the target detection result of each training sample image included in the group of training sample images.
When S202 is executed for the first time, the first group of training sample images is obtained and S202 to S206 are performed; when S202 is executed for the second time, the second group is obtained; and so on, until the last group of training sample images has been obtained and S202 to S206 performed, or until training of the target detection network is completed.
S203, calculating a loss value by using the target detection result of each training sample image included in the training sample image and the confidence coefficient of the target detection result.
The target detection result of each training sample image included in the group of training sample images and the confidence of the target detection result may be substituted into a preset loss function to calculate the loss value. For example, the preset loss function is a cross entropy loss function or an L1 loss function, where the L1 loss function is also known as the mean absolute error loss function.
S204, determining whether the target detection network converges or not based on the loss value. If yes, executing S205; if not, S206 is performed.
Optionally, it may be determined whether the loss value is smaller than a preset threshold; if so, the target detection network is determined to have converged, otherwise it is determined not to have converged. Alternatively, it may be determined whether the difference between the loss value calculated in the current iteration and the loss value calculated in the previous iteration, i.e., between the loss value from the current execution of S203 and that from the previous execution of S203, is smaller than a preset difference; if so, the target detection network is determined to have converged, otherwise it is determined not to have converged.
Alternatively, whether the target detection network converges may be determined in other manners, which is not specifically limited in the embodiments of the present application.
S205, determining that the target detection network training is completed.
If the target detection network converges, it is indicated that the detection accuracy of the target detection network is higher at this time, so that it can be determined that the training of the target detection network is completed.
S206, adjusting network parameters of the target detection network based on the loss value, and returning to S202 to acquire a group of training sample images according to the arrangement sequence of the training sample images of each group until each group of training samples is acquired.
If the target detection network has not converged, the detection accuracy of the network is still low at this time and training needs to continue, so the network parameters of the target detection network can be adjusted based on the loss value, and the step of acquiring a group of training sample images according to the arrangement sequence of the groups is returned to for the next iteration.
In this way, the training samples can be split into mini-batches, so that each mini-batch completes one training iteration of the target detection network, improving training efficiency. In addition, each mini-batch contains both positive and negative sample images, so the target detection network balances its attention between positive and negative sample images during training, which improves recognition accuracy and greatly reduces the false detection rate.
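For illustration, one epoch of the mini-batch loop of S202 to S206 might look as follows, assuming a PyTorch-style model and optimizer and a compute_loss function implementing the loss described below; the convergence test via the loss difference is one of the two options given above, and all names are assumptions of this sketch:

def train_one_epoch(model, mini_batches, optimizer, compute_loss, eps=1e-3):
    prev_loss = None
    for batch in mini_batches:                        # S202: take the groups in order
        preds = model(batch["images"])                # detection results and confidences
        loss = compute_loss(preds, batch["targets"])  # S203: loss over this group
        if prev_loss is not None and abs(prev_loss - loss.item()) < eps:
            return True                               # S204/S205: converged, training done
        optimizer.zero_grad()                         # S206: adjust the network parameters
        loss.backward()
        optimizer.step()
        prev_loss = loss.item()
    return False                                      # all groups used, not yet converged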
In this embodiment of the present application, the step S203 calculates the loss value by using the target detection result and the confidence coefficient of the target detection result of each training sample image included in the set of training sample images, and includes the following steps:
step one, determining the position error between a detection frame in the training sample image and a label frame in which a specified target in the training sample image is actually positioned for each training sample image included in the group of training sample images, and determining regression loss according to the position error of each training sample image included in the group of training sample images.
The label frame where the appointed target in each training sample image is actually located is: the training sample image includes a tag box in which a specified target in the positive sample image is actually located. The label frame of the positive sample image can be obtained through manual labeling. Further, the negative sample image does not include a label box.
The detection frame and the label frame of each training sample image included in the group of training sample images may be substituted into an intersection-over-union (Intersection over Union, IoU) loss function or an L1 loss function to calculate the regression loss. The IoU loss function calculates the ratio of the intersection and the union of the detection frame and the label frame in the training sample image.
Step two, determining type errors between the detection type of the designated target in the training sample image and the tag type of the designated target in the training sample image aiming at each training sample image included in the group of training sample images, and determining classification loss according to the type errors of each training sample image included in the group of training sample images.
The label types actually belonged to the appointed target in each training sample image are as follows: the training sample image includes a tag type to which the specified target in the positive sample image actually belongs. The label type of the appointed target in the positive sample image can be obtained through manual labeling. Further, the negative sample image does not include a tag type to which the specified target actually belongs.
The detection type and the tag type of each training sample image included in the group of training sample images may be substituted into a binary cross entropy (Binary Cross Entropy, BCE) loss function or a cross entropy loss function to calculate the classification loss.
Step three, determining, for each training sample image included in the group of training sample images, a confidence error of the training sample image according to the tag type to which the specified target in the training sample image actually belongs, and determining the confidence loss according to the confidence errors of the training sample images included in the group of training sample images.
It can be determined whether the tag type and the detection type of the training sample image are the same. If they are the same, the actual confidence of the tag type is determined to be the maximum value of the preset confidence interval; if they are different, it is determined to be the minimum value of the preset confidence interval.
And then, the confidence coefficient and the actual confidence coefficient detected by each training sample image included in the group of training sample images can be substituted into a two-class cross entropy loss function or a cross entropy loss function, so that the confidence coefficient loss is calculated.
And step four, determining a loss value based on the regression loss, the classification loss and the confidence loss.
Wherein, the loss value may be:
loss = L_reg + α × L_cls + β × L_conf;
wherein loss is the loss value, L_reg is the regression loss, α is the preset weight of the classification loss, L_cls is the classification loss, β is the preset weight of the confidence loss, and L_conf is the confidence loss. α and β may be preset.
Through the above formula, the embodiments of the present application can use α and β to adjust the proportions of the classification loss and the confidence loss in the loss value, thereby adjusting how much attention the target detection network pays to the type and/or position of the specified target in the image, so that the trained target detection network meets actual requirements.
In addition, when the target detection network is trained, the loss value is calculated by combining the detection type, the detection frame position and the confidence coefficient of the detection type of the specified target in the training sample image, so that the type, the position and the confidence coefficient of the specified target can be predicted more accurately by the target detection network after the target detection network is adjusted by using the loss value.
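A hedged sketch of the weighted combination above, together with the confidence-target rule of step three (maximum of the confidence interval when the detected type matches the tag type, minimum otherwise); tensor shapes and names are assumptions of this sketch:

import torch

def total_loss(l_reg: torch.Tensor, l_cls: torch.Tensor, l_conf: torch.Tensor,
               alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # loss = L_reg + alpha * L_cls + beta * L_conf
    return l_reg + alpha * l_cls + beta * l_conf

def confidence_target(detected_type: int, tag_type: int, conf_range=(0.0, 1.0)) -> float:
    # Actual confidence used in the confidence loss (step three above).
    return conf_range[1] if detected_type == tag_type else conf_range[0]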
The process of training the target detection network with all the training sample data as described above in fig. 2 is referred to as an epoch. After each epoch is performed, and before returning to S102 to perform target detection on each negative sample image in the negative sample image set with the target detection network, the mode of the target detection network may be set to a test (Eval) mode.
If the target detection network uses a dropout (Dropout) mechanism, the Dropout mechanism is suspended in Eval mode. The Dropout mechanism means that, while performing target detection on an image, the target detection network randomly discards the features extracted by some neurons in some network layers. In Eval mode, the target detection network does not randomly discard image features of the negative sample images.
Moreover, if the target detection network includes a Batch Normalization (BN) layer, updating of the sliding mean (running mean) and the sliding variance (running variance) of the BN layer is suspended in the Eval mode.
Therefore, when the negative sample image is screened, the unstable influence caused by a Dropout mechanism and a dynamic updating mechanism of a BN layer can be avoided, and the possibility that the screened negative sample image belongs to a difficult sample is improved.
Then, when S102 is performed to carry out target detection on each negative sample image in the negative sample image set with the target detection network, the confidence of the target detection result of each negative sample image under the current epoch is obtained. Here, f_i denotes the target detection network under epoch-i, where epoch-i denotes the i-th epoch; D_neg = {I_1, I_2, …, I_N} denotes the negative sample image set; and N, the number of negative sample images in D_neg, is typically more than ten times the number of positive sample images.
After S102 has been performed to carry out target detection on each negative sample image in the negative sample image set with the target detection network, the mode of the target detection network may be set back to training (train) mode, thereby restoring the Dropout mechanism of the target detection model and the dynamic update mechanism of the BN layer.
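As a sketch of this mode switching around the screening pass, assuming a PyTorch model, where eval() suspends Dropout and freezes the BN running statistics and train() restores them; reading the confidence from index 4 of the prediction vector is a common YOLO-style layout and an assumption here, not something stated in the patent:

import torch

@torch.no_grad()
def score_negatives(model: torch.nn.Module, negative_images):
    model.eval()                                  # test (Eval) mode for stable screening
    scores = []
    for img in negative_images:                   # img: tensor of shape (C, H, W)
        pred = model(img.unsqueeze(0))            # add a batch dimension
        scores.append(float(pred[..., 4].max()))  # assumed objectness/confidence slot
    model.train()                                 # back to training mode afterwards
    return scores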
Referring to fig. 3, the following describes an epoch procedure in the embodiment of the present application with reference to an actual scenario:
and (3) recording the current epoch as an epoch-i, and respectively carrying out target detection on each negative sample image included in the negative sample image set by utilizing a target detection model under the epoch-i to obtain a target detection result of each negative sample image and the confidence of the target detection result.
And screening out a specified number of negative sample images according to the order of the confidence of the target detection result of each negative sample image from high to low, and marking a set formed by the negative sample images as a difficult negative sample image set.
The target detection network is then trained based on each positive sample image included in the positive sample image set and each negative sample image included in the difficult negative sample image set, so as to obtain the target detection model under epoch-(i+1).
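Tying the pieces together, the per-epoch flow of fig. 3 might be sketched as follows, reusing the illustrative helpers above; load_image and build_mini_batches are assumed helpers standing in for image loading and the grouping of S201:

def fit(model, positives, negatives, optimizer, compute_loss, k, max_epochs=100):
    for _ in range(max_epochs):                              # one iteration = one epoch
        scores = score_negatives(model, [load_image(n.image_path) for n in negatives])
        for neg, s in zip(negatives, scores):
            neg.confidence = s                               # confidence under the current epoch
        hard = screen_hard_negatives(negatives, k)           # difficult negative sample set
        batches = build_mini_batches(positives, hard)        # S201 (assumed helper)
        if train_one_epoch(model, batches, optimizer, compute_loss):
            break                                            # converged: training completed
    return model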
In the embodiments of the present application, the target detection model may be any of various neural network models. For example, the target detection model may be YOLO-X (You Only Look Once-X).
YOLO-X includes: a backbone network, a feature fusion network, also known as the neck network, and a detection head network.
Fig. 4 shows the backbone network and the neck network of YOLO-X. The backbone network includes: a Focus module, a first two-dimensional convolution extraction (Conv2d + BatchNorm + SiLU, CBS) module, a first cross stage partial layer (Cross Stage Partial Layer, CSPLayer), a second CBS module, a second CSPLayer, a third CBS module, a third CSPLayer, a fourth CBS module, a spatial pyramid pooling (SPP) layer, and a fourth CSPLayer.
The neck network includes: a first splice (Concat) layer, a first upsample (Upsample) layer, a fifth CBS module, a fifth CSPLayer, a second Concat layer, a second Upsample layer, a sixth CBS module, a sixth CSPLayer, a first downsample (Downsample) layer, a third Concat layer, a seventh CSPLayer, a second Downsample layer, a fourth Concat layer, and an eighth CSPLayer.
The Focus module, also called an image slicing module, is configured to upsample image initial information obtained by downsampling an image, so that the object detection model focuses on features of a small object included in the image. The input data to the Focus module is an image.
The structure of each CBS module is the same, and each CBS module is also called a two-dimensional convolution extraction module and is used for extracting the characteristic information of the image. Each CBS module includes a convolutional layer, a BN layer, and an activation function (Sigmoid Linear Unit, siLU).
The CSPLayers share the same structure, each with a composite residual structure: as shown in fig. 5, the trunk branch of each CSPLayer includes 2 CBS modules and the residual branch includes 1 CBS module, after which the trunk branch and the residual branch are connected to a Concat layer followed by one CBS module. Each CSPLayer is used to extract feature information from image features.
The SPP layer is used for carrying out maximum pooling treatment on image features with different pooling kernel sizes so as to extract the image features and increase the receptive field of the target detection model.
The structures of the Concat layers are the same, each Concat layer is used for splicing the feature images in the channel dimension, namely, the feature images are unchanged in size after splicing, and the number of channels is increased.
Fig. 6 shows the detection head network of YOLO-X. Referring to fig. 6, the detection head network includes: a first convolution block (ConvBlock), a first convolution layer (ConvLayer), a second ConvBlock, a second ConvLayer, and a third ConvLayer. Each ConvBlock includes a convolution layer, a BN layer, and a rectified linear unit (ReLU) layer.
The input data of the first ConvBlock and the second ConvBlock are the spliced feature maps output by the sixth, seventh and eighth CSPLayers of the neck network. The output of the first ConvLayer is the detection type of the specified target included in the image, the output of the second ConvLayer is the detection frame of the specified target included in the image, and the output of the third ConvLayer is the confidence of the detection type of the specified target included in the image.
Alternatively, the target detection model in the embodiments of the present application may be: YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, or a Fast Region Convolutional Neural Network (Fast RCNN), etc.; the embodiments of the present application do not specifically limit this. The structure of the target detection network is the same as that of the target detection model.
The following comparison and explanation are carried out on the accuracy of the target detection model obtained by training the embodiment of the application through experiments:
Before the experiment, 770,000 negative sample images and 50,000 positive sample images were obtained. Referring to Table 1, experiment 1 trained the target detection network on the 50,000 positive sample images only; the resulting target detection model had a detection rate of 81% and a false alarm rate of 16%. Experiment 2 trained the target detection network directly on all 770,000 negative sample images and 50,000 positive sample images; the resulting model had a detection rate of 76% and a false alarm rate of 13%. Experiment 3 used the training method provided by the embodiments of the present application, screening the negative sample images and training the target detection network together with the positive sample images; the resulting model had a detection rate of 82% and a false alarm rate of 8%.
Wherein, the detection rate and the false alarm rate can be measured by average precision (mean Average Precision, mAP).
Table 1

Experiment | Training samples | Detection rate | False alarm rate
1 | 50,000 positive sample images only | 81% | 16%
2 | 770,000 negative + 50,000 positive sample images | 76% | 13%
3 | Screened negative sample images + positive sample images | 82% | 8%
Therefore, the training method provided by the embodiment of the application is used for training the target detection model, so that the detection rate can be ensured, and the false alarm rate can be reduced.
Based on the same inventive concept, the embodiment of the present application further provides a training device for a target detection model, as shown in fig. 7, where the device includes: an acquisition module 701, a detection module 702, a screening module 703 and a training module 704;
An acquiring module 701, configured to acquire a positive sample image set and a negative sample image set;
the detection module 702 is configured to perform target detection on each negative sample image included in the negative sample image set by using a target detection network, so as to obtain a target detection result of each negative sample image and a confidence coefficient of the target detection result;
a screening module 703, configured to screen out a specified number of negative sample images based on the confidence level of the target detection result of each negative sample image;
the training module 704 is configured to train the target detection network based on each positive sample image included in the positive sample image set and the screened negative sample image, and return to the step of respectively performing target detection on each negative sample image included in the negative sample image set by using the target detection network, until training of the target detection network is completed, and taking the trained target detection network as a target detection model.
Optionally, training module 704 is specifically configured to:
constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample images;
according to the arrangement sequence of each group of training sample images, a group of training sample images are obtained, and target detection is carried out on each training sample image included in the group of training sample images by utilizing a target detection model, so that a target detection result and a confidence coefficient of the target detection result of each training sample image included in the group of training sample images are obtained;
Calculating a loss value by using the target detection result of each training sample image included in the group of training sample images and the confidence coefficient of the target detection result;
determining whether the target detection network converges based on the loss value;
if yes, determining that the target detection network training is completed;
if not, adjusting the network parameters of the target detection network based on the loss value, and returning to the step of acquiring a group of training sample images according to the arrangement order of the groups of training sample images, until every group of training sample images has been acquired.
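A minimal sketch of this group-wise loop, assuming PyTorch; compute_loss is a hypothetical stand-in for the regression/classification/confidence loss detailed below, and each group is assumed to be a dict holding an image tensor and its labels.

```python
import torch

def train_on_groups(net, groups, optimizer, tol=1e-4):
    """Iterate over the groups of training sample images in their arranged
    order, update the network after each group, and report convergence
    when the loss stops changing appreciably."""
    prev_loss = float("inf")
    for group in groups:
        outputs = net(group["images"])                # detections + confidences
        loss = compute_loss(outputs, group["labels"])
        if abs(prev_loss - loss.item()) < tol:
            return True                               # treated as converged
        optimizer.zero_grad()
        loss.backward()                               # adjust network parameters
        optimizer.step()
        prev_loss = loss.item()
    return False                                      # all groups consumed, not yet converged
```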
Optionally, the target detection result includes: a detection frame in which a specified target in the image is located and a detection type to which the specified target in the image belongs; the training module 704 is specifically configured to:
for each training sample image included in the group of training sample images, determining a position error between the detection frame in the training sample image and the label frame in which the specified target in the training sample image is actually located, and determining a regression loss according to the position errors of the training sample images in the group;
for each training sample image included in the group of training sample images, determining a type error between the detection type to which the specified target in the training sample image belongs and the label type to which the specified target actually belongs, and determining a classification loss according to the type errors of the training sample images in the group;
for each training sample image included in the group of training sample images, determining a confidence error of the training sample image according to the label type to which the specified target in the training sample image actually belongs, and determining a confidence loss according to the confidence errors of the training sample images in the group;
determining the loss value based on the regression loss, the classification loss and the confidence loss.
Optionally, the loss value is:

$Loss = L_{reg} + \lambda_1 L_{cls} + \lambda_2 L_{conf}$

wherein $Loss$ is the loss value, $L_{reg}$ is the regression loss, $\lambda_1$ is the preset weight of the classification loss, $L_{cls}$ is the classification loss, $\lambda_2$ is the preset weight of the confidence loss, and $L_{conf}$ is the confidence loss.
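As a one-line illustration of this weighted combination (the default weight values are placeholders, not values from the source):

```python
def total_loss(l_reg, l_cls, l_conf, w_cls=1.0, w_conf=1.0):
    """Loss = L_reg + lambda1 * L_cls + lambda2 * L_conf, where w_cls and
    w_conf stand for the preset weights lambda1 and lambda2."""
    return l_reg + w_cls * l_cls + w_conf * l_conf
```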
Optionally, training module 704 is specifically configured to:
acquiring a first preset number of positive sample images from the positive sample image set, and acquiring a second preset number of negative sample images from the screened negative sample images;
performing stitching or overlapping processing on the first preset number of positive sample images and the second preset number of negative sample images, taking the processed image as one training sample image, and returning to the step of acquiring a first preset number of positive sample images from the positive sample image set, until the number of training sample images reaches a preset sample number;
dividing the preset sample number of training sample images into a plurality of groups.
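One plausible reading of the overlapping construction, sketched with NumPy; the per-group averaging and the default percentages are illustrative assumptions, and the alpha%/beta% blend follows the pixel formula given in claim 1 below.

```python
import numpy as np

def build_training_sample(pos_imgs, neg_imgs, alpha=70.0, beta=30.0):
    """Blend a first preset number of positive images with a second preset
    number of negative images pixel-wise. All images are assumed to share
    the same HxWxC float32 shape; alpha and beta are the preset blending
    percentages."""
    pos = np.mean(np.stack(pos_imgs), axis=0)   # aggregate the positives
    neg = np.mean(np.stack(neg_imgs), axis=0)   # aggregate the negatives
    # pixel value = positive pixel * alpha% + negative pixel * beta%
    return pos * (alpha / 100.0) + neg * (beta / 100.0)
```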
Optionally, the apparatus may further include:
a setting module, configured to set the mode of the target detection network to a test mode before the target detection network is used to perform target detection on each negative sample image included in the negative sample image set;
the setting module is further configured to set the mode of the target detection network to a training mode after the target detection network has been used to perform target detection on each negative sample image included in the negative sample image set.
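Assuming a PyTorch network, the setting module's two operations map directly onto the framework's built-in mode switch: eval() suspends Dropout and freezes the BN layers' sliding mean and variance, and train() restores both. A minimal sketch:

```python
import torch

def score_negative_set(net: torch.nn.Module, negatives):
    """Run target detection over the negative sample set in test mode,
    then restore training mode. `negatives` is assumed to be a list of
    CHW image tensors."""
    net.eval()                     # Dropout suspended, BN sliding stats frozen
    with torch.no_grad():          # no gradients needed while screening
        results = [net(img.unsqueeze(0)) for img in negatives]
    net.train()                    # Dropout and BN dynamic updates restored
    return results
```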
Optionally, the screening module 703 is specifically configured to:
screen out the specified number of negative sample images in descending order of the confidence of the target detection result of each negative sample image.
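A sketch of this descending-order screening, matching the screen_negatives helper assumed in the first sketch above; each element of scored is an (image, confidence) pair.

```python
def screen_negatives(scored, k):
    """Sort the negatives by the confidence of their target detection
    result, from high to low, and keep the first k: these are the
    negatives the current network most confidently mistakes for targets."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [img for img, _ in ranked[:k]]
```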
The embodiment of the present application further provides an electronic device. As shown in fig. 8, the electronic device includes a processor 801, a communication interface 802, a memory 803 and a communication bus 804, where the processor 801, the communication interface 802 and the memory 803 communicate with one another through the communication bus 804;
a memory 803 for storing a computer program;
the processor 801 is configured to implement the method steps in the above-described method embodiments when executing the program stored in the memory 803.
The communication bus mentioned in the above electronic device may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include a random access memory (Random Access Memory, RAM), or may include a non-volatile memory (Non-Volatile Memory, NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided herein, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the target detection model training methods described above.
In yet another embodiment provided herein, there is also provided a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the target detection model training methods of the above embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
It should be noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively brief, and reference may be made to the corresponding parts of the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A method for training a target detection model, the method comprising:
acquiring a positive sample image set and a negative sample image set, wherein each positive sample image included in the positive sample image set comprises a specified target, and each negative sample image included in the negative sample image set does not comprise the specified target;
performing target detection on each negative sample image included in the negative sample image set respectively by using a target detection network, to obtain a target detection result of each negative sample image and the confidence of the target detection result;
screening out a specified number of negative sample images based on the confidence of the target detection result of each negative sample image;
training the target detection network based on each positive sample image included in the positive sample image set and the screened negative sample images, and returning to the step of performing target detection on each negative sample image included in the negative sample image set by using the target detection network, until training of the target detection network is completed, and taking the trained target detection network as a target detection model;
Before the target detection is performed on each negative sample image included in the negative sample image set by using the target detection network, the method further includes:
setting the mode of the target detection network to a test mode, wherein a Dropout mechanism and a dynamic update mechanism of a BN layer are both suspended in the test mode, the Dropout mechanism referring to: the target detection network randomly discarding, in the process of performing target detection on an image, features extracted by some neurons in some network layers; and the dynamic update mechanism of the BN layer comprising updating the sliding mean and the sliding variance of the BN layer;
after the target detection is performed on each negative sample image included in the negative sample image set by using the target detection network, the method further includes:
setting the mode of the target detection network to a training mode, wherein the Dropout mechanism and the dynamic update mechanism of the BN layer are restored in the training mode;
the training the target detection network based on each positive sample image included in the positive sample image set and the screened negative sample image includes:
constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample images;
acquiring a group of training sample images according to the arrangement order of the groups of training sample images, and performing target detection on each training sample image included in the group by using the target detection model, to obtain the target detection result of each training sample image in the group and the confidence of the target detection result;
calculating a loss value by using the target detection result of each training sample image included in the group and the confidence of the target detection result;
determining whether the target detection network converges based on the loss value;
if yes, determining that the target detection network training is completed;
if not, adjusting network parameters of the target detection network based on the loss value, and returning to the step of acquiring a group of training sample images according to the arrangement order of the groups of training sample images, until every group of training sample images has been acquired;
the constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample image includes:
acquiring a first preset number of positive sample images from the positive sample image set, and acquiring a second preset number of negative sample images from the screened negative sample images;
performing overlapping processing on the first preset number of positive sample images and the second preset number of negative sample images, taking the image obtained after the overlapping as one training sample image, and returning to the step of acquiring a first preset number of positive sample images from the positive sample image set, until the number of training sample images reaches a preset sample number; wherein the pixel value of the pixel point at each position in the training sample image is: the pixel value of the pixel point at that position in the first preset number of positive sample images multiplied by alpha% plus the pixel value of the pixel point at that position in the second preset number of negative sample images multiplied by beta%, alpha and beta being preset;
dividing the preset sample number of training sample images into a plurality of groups.
2. The method of claim 1, wherein the target detection result comprises: a detection frame in which a specified target in the image is located and a detection type to which the specified target in the image belongs; and the calculating a loss value by using the target detection result of each training sample image included in the group of training sample images and the confidence of the target detection result comprises:
determining a position error between a detection frame in the training sample image and a label frame in which a specified target in the training sample image is actually positioned for each training sample image included in the group of training sample images, and determining regression loss according to the position error of each training sample image included in the group of training sample images;
Determining type errors between a detection type to which a specified target in the training sample image belongs and a tag type to which the specified target in the training sample image actually belongs for each training sample image included in the group of training sample images, and determining classification loss according to the type errors of each training sample image included in the group of training sample images;
for each training sample image included in the group of training sample images, determining a confidence coefficient error of the training sample image according to the label type actually belonged to the appointed target in the training sample image, and determining a confidence coefficient loss according to the confidence coefficient error of each training sample image included in the group of training sample images;
the loss value is determined based on the regression loss, the classification loss, and the confidence loss.
3. The method of claim 2, wherein the loss value is:

$Loss = L_{reg} + \lambda_1 L_{cls} + \lambda_2 L_{conf}$

wherein $Loss$ is the loss value, $L_{reg}$ is the regression loss, $\lambda_1$ is the preset weight of the classification loss, $L_{cls}$ is the classification loss, $\lambda_2$ is the preset weight of the confidence loss, and $L_{conf}$ is the confidence loss.
4. The method according to any one of claims 1-3, wherein the screening out a specified number of negative sample images based on the confidence of the target detection result of each negative sample image comprises:
screening out the specified number of negative sample images in descending order of the confidence of the target detection result of each negative sample image.
5. An object detection model training apparatus, the apparatus comprising:
an acquisition module, configured to acquire a positive sample image set and a negative sample image set, wherein each positive sample image included in the positive sample image set comprises a specified target, and each negative sample image included in the negative sample image set does not comprise the specified target;
a detection module, configured to perform target detection on each negative sample image included in the negative sample image set respectively by using a target detection network, to obtain a target detection result of each negative sample image and the confidence of the target detection result;
a screening module, configured to screen out a specified number of negative sample images based on the confidence of the target detection result of each negative sample image;
a training module, configured to train the target detection network based on each positive sample image included in the positive sample image set and the screened negative sample images, and return to the step of performing target detection on each negative sample image included in the negative sample image set by using the target detection network, until training of the target detection network is completed, and take the trained target detection network as a target detection model;
The apparatus further comprises:
a setting module, configured to set the mode of the target detection network to a test mode before the target detection network is used to perform target detection on each negative sample image included in the negative sample image set, wherein a Dropout mechanism and a dynamic update mechanism of a BN layer are both suspended in the test mode, the Dropout mechanism referring to: the target detection network randomly discarding, in the process of performing target detection on an image, features extracted by some neurons in some network layers; and the dynamic update mechanism of the BN layer comprising updating the sliding mean and the sliding variance of the BN layer;
the setting module is further configured to set the mode of the target detection network to a training mode after the target detection network has been used to perform target detection on each negative sample image included in the negative sample image set, wherein the Dropout mechanism and the dynamic update mechanism of the BN layer are restored in the training mode;
the training module is specifically configured to:
constructing a plurality of groups of training sample images based on each positive sample image included in the positive sample image set and the screened negative sample images;
acquiring a group of training sample images according to the arrangement order of the groups of training sample images, and performing target detection on each training sample image included in the group by using the target detection model, to obtain the target detection result of each training sample image in the group and the confidence of the target detection result;
calculating a loss value by using the target detection result of each training sample image included in the group and the confidence of the target detection result;
determining whether the target detection network converges based on the loss value;
if yes, determining that the target detection network training is completed;
if not, adjusting network parameters of the target detection network based on the loss value, and returning to the step of acquiring a group of training sample images according to the arrangement order of the groups of training sample images, until every group of training sample images has been acquired;
the training module is specifically configured to:
acquiring a first preset number of positive sample images from the positive sample image set, and acquiring a second preset number of negative sample images from the screened negative sample images;
performing overlapping processing on the first preset number of positive sample images and the second preset number of negative sample images, taking the image obtained after the overlapping as one training sample image, and returning to the step of acquiring a first preset number of positive sample images from the positive sample image set, until the number of training sample images reaches a preset sample number; wherein the pixel value of the pixel point at each position in the training sample image is: the pixel value of the pixel point at that position in the first preset number of positive sample images multiplied by alpha% plus the pixel value of the pixel point at that position in the second preset number of negative sample images multiplied by beta%, alpha and beta being preset;
dividing the preset sample number of training sample images into a plurality of groups.
6. The apparatus of claim 5, wherein the target detection result comprises: a detection frame in which a specified target in the image is located and a detection type to which the specified target in the image belongs; the training module is specifically configured to:
for each training sample image included in the group of training sample images, determining a position error between the detection frame in the training sample image and the label frame in which the specified target in the training sample image is actually located, and determining a regression loss according to the position errors of the training sample images in the group;
for each training sample image included in the group of training sample images, determining a type error between the detection type to which the specified target in the training sample image belongs and the label type to which the specified target actually belongs, and determining a classification loss according to the type errors of the training sample images in the group;
for each training sample image included in the group of training sample images, determining a confidence error of the training sample image according to the label type to which the specified target in the training sample image actually belongs, and determining a confidence loss according to the confidence errors of the training sample images in the group;
determining the loss value based on the regression loss, the classification loss and the confidence loss.
7. The apparatus of claim 6, wherein the loss value is:

$Loss = L_{reg} + \lambda_1 L_{cls} + \lambda_2 L_{conf}$

wherein $Loss$ is the loss value, $L_{reg}$ is the regression loss, $\lambda_1$ is the preset weight of the classification loss, $L_{cls}$ is the classification loss, $\lambda_2$ is the preset weight of the confidence loss, and $L_{conf}$ is the confidence loss.
8. The apparatus according to any one of claims 5-7, wherein the screening module is specifically configured to:
screening out the specified number of negative sample images in descending order of the confidence of the target detection result of each negative sample image.
9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method of any of claims 1-4 when executing a program stored on a memory.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, implements the method of any of claims 1-4.
CN202311461357.5A 2023-11-06 2023-11-06 Target detection model training method and device, electronic equipment and medium Active CN117197592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311461357.5A CN117197592B (en) 2023-11-06 2023-11-06 Target detection model training method and device, electronic equipment and medium


Publications (2)

Publication Number Publication Date
CN117197592A (en) 2023-12-08
CN117197592B (en) 2024-03-01

Family

ID=89001999

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311461357.5A Active CN117197592B (en) 2023-11-06 2023-11-06 Target detection model training method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117197592B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203788A (en) * 2017-06-20 2017-09-26 Anhui University Intermediate-vision drug image recognition method
CN109800778A (en) * 2018-12-03 2019-05-24 Zhejiang University of Technology Faster RCNN object detection method based on hard sample mining
CN110956225A (en) * 2020-02-25 2020-04-03 Zhejiang Zhuoyun Intelligent Technology Co., Ltd. Contraband detection method and system, computing device and storage medium
CN115546693A (en) * 2022-10-09 2022-12-30 Jinan Boguan Intelligent Technology Co., Ltd. Target detection model training method, device, equipment and storage medium
JP2023045296A (en) * 2021-09-21 2023-04-03 Canon Inc. Image processing apparatus, image processing method, and program


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on vehicle target detection algorithm based on SSD; Kang Ruijie; China Master's Theses Full-text Database, Engineering Science and Technology II (No. 5); pp. C034-196 *


Similar Documents

Publication Publication Date Title
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN110826379B (en) Target detection method based on feature multiplexing and YOLOv3
CN111860236B (en) Small sample remote sensing target detection method and system based on transfer learning
CN111709285A (en) Epidemic situation protection monitoring method and device based on unmanned aerial vehicle and storage medium
CN111967464A (en) Weak supervision target positioning method based on deep learning
CN110135428B (en) Image segmentation processing method and device
CN112101114A (en) Video target detection method, device, equipment and storage medium
CN114783021A (en) Intelligent detection method, device, equipment and medium for wearing of mask
CN113177554B (en) Thyroid nodule identification and segmentation method, system, storage medium and equipment
CN113221731B (en) Multi-scale remote sensing image target detection method and system
CN117197592B (en) Target detection model training method and device, electronic equipment and medium
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
CN112149698A (en) Method and device for screening difficult sample data
CN115424250A (en) License plate recognition method and device
CN112541469A (en) Crowd counting method and system based on self-adaptive classification
CN112733883B (en) Point supervision target detection method
Yan et al. Automated failure detection in computer vision systems
CN113762382B (en) Model training and scene recognition method, device, equipment and medium
CN116071625B (en) Training method of deep learning model, target detection method and device
CN115170809B (en) Image segmentation model training method, image segmentation device, image segmentation equipment and medium
US20230343082A1 (en) Encoding of training data for training of a neural network
Singh et al. Bff: Bi-stream feature fusion for object detection in hazy environment
Jiang et al. Multi-Scale Real-Time Object Detection With Densely Connected Network
CN117475253A (en) Model training method and device, electronic equipment and storage medium
CN115049960A (en) Target detection method, target detection device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant