CN114511731A - Training method and device of target detector, storage medium and electronic equipment


Info

Publication number
CN114511731A
Authority
CN
China
Prior art keywords
training
convolutional layer
pruning
neural network
network
Prior art date
Legal status
Pending
Application number
CN202111637544.5A
Other languages
Chinese (zh)
Inventor
王珏
唐光远
罗琴
李润静
Current Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202111637544.5A
Publication of CN114511731A
Legal status: Pending

Classifications

    • G06F18/241 Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention relates to the technical field of target detection, and in particular to a training method and apparatus for a target detector, a storage medium, and an electronic device. The method comprises the following steps: acquiring a sample data set and a pre-training target detector, wherein the backbone network of the pre-training target detector comprises a dense neural network; training the backbone network of the pre-training target detector based on the sample data set, and pruning the backbone network to generate a sparse neural network; and retraining the sparse neural network to obtain a dense neural network. The accuracy of an anchor-free target detector can thereby be increased without significantly increasing the computational load of the detector.

Description

Training method and device of target detector, storage medium and electronic equipment
Technical Field
The present invention relates to the field of target detection technologies, and in particular, to a training method and apparatus for a target detector, a storage medium, and an electronic device.
Background
Object detection is a fundamental but challenging task in the field of computer vision. Object detection determines whether any target objects of interest are present in a given image and, if so, gives the category and position of each target object. In contrast to image classification, object detection outputs the position parameters of a target object in addition to identifying its category.
Target detection methods fall into two categories: anchor-based methods and anchor-free methods. Anchor-based methods face the technical problem that high detection accuracy and high speed are difficult to balance. In contrast to anchor-based detectors, anchor-free detectors regress object positions directly, without anchor-related hyper-parameters. However, anchor-free detectors suffer from low detection accuracy in real application scenarios, and improving their accuracy has required manually designing network structures and increasing the amount of computation.
There is therefore a need in the art for a training scheme that increases the accuracy of anchor-free target detectors without increasing their computational load.
Disclosure of Invention
The invention provides a training method and apparatus for a target detector, a storage medium, and an electronic device, addressing the technical problem of low detection accuracy in existing schemes.
In a first aspect, the present invention provides a method for training a target detector, including:
acquiring a sample data set and a pre-training target detector, wherein a backbone network of the pre-training target detector comprises a dense neural network;
training the backbone network of the pre-training target detector based on the sample data set, and pruning the backbone network of the pre-training target detector to generate a sparse neural network;
and retraining the sparse neural network to obtain a dense neural network.
In some embodiments, pruning the backbone network of the pre-training target detector to generate a sparse neural network comprises:
acquiring all convolutional layer parameters of the backbone network of the pre-training target detector;
calculating an importance score of each convolutional layer parameter;
determining a pruning threshold according to the number of the convolutional layer parameters and a preset pruning rate;
pruning is performed according to the pruning threshold and the importance score to generate a sparse neural network.
In some embodiments, calculating an importance score for each convolutional layer parameter comprises:
computing the L2 norm of the product of each convolutional layer parameter's weight and its back-propagated gradient to obtain the importance score of each convolutional layer parameter.
In some embodiments, the back-propagated gradient is obtained by the expression:
g_i = ∂L / ∂w_i
where g_i denotes the gradient computed by back-propagation, L denotes the total loss value of the target detector, w_i denotes the weight of the convolutional layer parameter, and i denotes the index of the convolutional layer parameter.
In some embodiments, determining the pruning threshold according to the number of convolutional layer parameters and a preset pruning rate includes:
determining the pruning quantity N × p based on the number N of the convolutional layer parameters and a preset pruning rate p;
sorting the convolutional layer parameters according to the importance scores of the convolutional layer parameters;
and taking the importance score of the (N × p)-th sorted convolutional layer parameter as the pruning threshold.
In some embodiments, pruning is performed according to a pruning threshold and an importance score to generate a sparse neural network, including:
and traversing each convolutional layer parameter, and pruning the network connection corresponding to the convolutional layer parameter with the importance score smaller than the pruning threshold.
In some embodiments, pruning network connections corresponding to convolutional layer parameters whose importance scores are less than a pruning threshold comprises:
and setting the parameter of the convolutional layer with the importance score smaller than the pruning threshold value to be zero.
In some embodiments, the method further comprises: screening the bounding boxes using a non-maximum suppression strategy.
In some embodiments, screening the bounding boxes using the non-maximum suppression strategy comprises:
taking the bounding box with the highest confidence as the real box;
if the intersection-over-union (IoU) of a first bounding box and the real box is greater than the non-maximum suppression threshold, reducing the confidence of the first bounding box;
if the confidence of the first bounding box after the reduction is lower than the deletion threshold, deleting the first bounding box;
wherein the first bounding box is any bounding box other than the bounding box with the highest confidence.
In some embodiments, retraining the sparse neural network to obtain a dense neural network comprises:
and reactivating the network connection which is set to zero to obtain a dense neural network.
In a second aspect, the present invention provides an apparatus for training an object detector, comprising:
an acquisition module, configured to acquire a sample data set and a pre-training target detector, wherein the backbone network of the pre-training target detector comprises a dense neural network;
a first training module, configured to train the backbone network of the pre-training target detector based on the sample data set, and prune the backbone network of the pre-training target detector to generate a sparse neural network;
and a second training module, configured to retrain the sparse neural network to obtain a dense neural network.
In a third aspect, the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.
In a fourth aspect, the present invention provides an electronic device comprising a processor and a memory, the memory having a computer program stored thereon, the processor implementing the method of the first aspect when executing the computer program.
According to the training method and apparatus for a target detector, the storage medium, and the electronic device provided by the invention, a sample data set and a pre-training target detector are acquired, the backbone network of the pre-training target detector comprising a dense neural network; the backbone network is trained based on the sample data set and pruned to generate a sparse neural network; and the sparse neural network is retrained to obtain a dense neural network. The accuracy of an anchor-free target detector can thereby be increased without increasing the computational load of the detector.
Drawings
The invention will be described in more detail hereinafter on the basis of embodiments and with reference to the accompanying drawings:
FIG. 1 is a flow chart of a method for training a target detector according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a training apparatus of an object detector according to an embodiment of the present invention.
In the drawings, like parts are designated with like reference numerals, and the drawings are not drawn to scale.
Detailed description of the invention
In order to enable those skilled in the art to better understand the technical solutions of the present invention, how technical means are applied to solve the technical problems, and how the corresponding technical effects are achieved, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The embodiments of the present invention and the features of the embodiments can be combined with one another without conflict, and the resulting technical solutions fall within the scope of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall also fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
Object detection is a fundamental but challenging task in the field of computer vision. Object detection determines whether any target objects of interest are present in a given image and, if so, gives the category and position of each target object. In contrast to image classification, object detection outputs the position parameters of a target object in addition to identifying its category.
In deep-learning-based target detection, a training set and a validation set are obtained by labeling images. Each label frames a target object on the original image with a rectangular box, giving the position parameters of the box (the coordinates of its upper-left and lower-right corner points).
Target detection methods fall into two categories: anchor-based methods and anchor-free methods. Anchor-based methods face the technical problem that high detection accuracy and high speed are difficult to balance. In contrast to anchor-based detectors, anchor-free detectors regress object positions directly, without anchor-related hyper-parameters.
Anchor-free detection methods include center-point-based approaches and keypoint-based approaches.
However, anchor-free target detectors exhibit low detection capability in real application scenarios, and improving that capability has required manually designing network structures and increasing the amount of computation. There is therefore a need in the art for a solution that improves the training process and increases the accuracy and detection capability of the target detector without significantly increasing its computational load.
Hereinafter, the technical solutions of the present invention will be described in a plurality of examples, respectively.
Example one
The backbone network (Backbone) used by the anchor-free target detector FCOS is a deep residual network comprising at least three convolutional layers; candidate FCOS pre-trained backbones include ResNet-50, ResNet-101, ResNeXt-101, and the like.
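For concreteness, the following minimal sketch (not part of the patent text) shows how such a pre-trained detector and its dense backbone might be obtained; it assumes a PyTorch environment in which torchvision provides an FCOS entry point, and the "DEFAULT" weights argument follows the newer torchvision weights API:

    import torchvision

    # fcos_resnet50_fpn is available in recent torchvision releases.
    detector = torchvision.models.detection.fcos_resnet50_fpn(weights="DEFAULT")
    backbone = detector.backbone  # the dense, pre-trained ResNet-50 (+ FPN)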
Fig. 1 is a flowchart of a training method of a target detector according to an embodiment of the present invention. As shown in fig. 1, a training method of an object detector includes steps S110 to S130:
step S110, acquiring a sample data set and a pre-training target detector, wherein a main network of the pre-training target detector comprises a dense neural network;
step S120, training a trunk network of a pre-training target detector based on the sample data set, and pruning the trunk network of the pre-training target detector to generate a sparse neural network;
and step S130, retraining the sparse neural network to obtain a dense neural network.
In the embodiment, a sample data set and a pre-training target detector are obtained, and a backbone network of the pre-training target detector comprises a dense neural network; training a trunk network of a pre-training target detector based on the sample data set, and pruning the trunk network of the pre-training target detector to generate a sparse neural network; retraining the sparse neural network to obtain a dense neural network; the accuracy of the anchor-frame-less object detector can be increased without increasing the amount of calculation of the object detector.
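The three steps can be summarized in the following illustrative outline (a sketch only; train_fn and prune_fn stand in for the training and pruning procedures detailed in the examples below and are not names from the patent):

    def dense_sparse_dense_training(detector, dataset, train_fn, prune_fn):
        """Hypothetical driver for steps S110-S130: dense -> sparse -> dense."""
        train_fn(detector, dataset)   # S120: train the dense backbone
        prune_fn(detector.backbone)   # S120: zero low-importance weights (sparse)
        train_fn(detector, dataset)   # S130: retrain; zeroed weights reactivate
        return detector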
Example two
On the basis of the above embodiment, pruning the backbone network of the pre-training target detector to generate a sparse neural network includes steps S201 to S204:
Step S201, acquiring all convolutional layer parameters of the backbone network of the pre-training target detector; the backbone network is a dense pre-trained neural network comprising a plurality of convolutional layer parameters.
Step S202, calculating the importance score of each convolution layer parameter;
step S203, determining a pruning threshold according to the number of the convolutional layer parameters and a preset pruning rate;
and step S204, pruning is carried out according to the pruning threshold and the importance score so as to generate a sparse neural network.
In some implementations, step S202, calculating an importance score for each convolutional layer parameter includes:
computing the L2 norm of the product of each convolutional layer parameter's weight and its back-propagated gradient to obtain the importance score of each convolutional layer parameter.
In some implementations, the back-propagated gradient is obtained by the expression:
g_i = ∂L / ∂w_i
where g_i denotes the gradient computed by back-propagation, L denotes the total loss value of the target detector, w_i denotes the weight of the convolutional layer parameter, and i denotes the index of the convolutional layer parameter.
Thus, the importance score s_i of each convolutional layer parameter can be obtained from the following expression:
s_i = ||g_i · w_i||_2
where i is the index of a convolutional layer parameter in the anchor-free target detector, and ||a||_2 denotes the L2 norm of a.
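As an illustration (a sketch under the assumption that a backward pass has already populated the gradients; PyTorch is used here, and for a single scalar parameter the L2 norm reduces to an absolute value), the importance scores of all convolutional layer parameters could be gathered as follows:

    import torch

    def convolutional_importance_scores(backbone: torch.nn.Module) -> torch.Tensor:
        # s_i = ||g_i * w_i||_2, computed element-wise for every weight w_i of
        # every convolutional layer (4-D weight tensors are conv kernels).
        scores = [
            (w.grad * w).abs().flatten()
            for w in backbone.parameters()
            if w.dim() == 4 and w.grad is not None
        ]
        return torch.cat(scores)  # one score per convolutional layer parameter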
In some implementations, the step S203 of determining the pruning threshold according to the number of the convolutional layer parameters and the preset pruning rate includes steps S2031 to S2033:
step S2031, determining the pruning quantity N × p based on the quantity N of the convolutional layer parameters and a preset pruning rate p;
step S2032, sorting the convolutional layer parameters according to the importance scores of the convolutional layer parameters;
step S2033, the importance scores of the sorted nth × p convolutional layer parameters are used as pruning threshold values.
In some implementations, the pruning step S204 performs pruning according to the pruning threshold and the importance score to generate a sparse neural network, including:
and traversing each convolutional layer parameter, and pruning the network connection corresponding to the convolutional layer parameter with the importance score smaller than the pruning threshold.
In some implementations, pruning network connections corresponding to convolutional layer parameters whose importance scores are less than a pruning threshold includes:
and setting the parameter of the convolutional layer with the importance score smaller than the pruning threshold value to be zero.
Pruning means suppressing convolutional layer parameters so as to convert the dense pre-trained neural network into a sparse neural network, the sparse network being more efficient.
In this embodiment, assuming a total of N convolutional layer parameters, the N × p convolutional layer parameters with the lowest importance scores s_i need to be clipped; then:
S331, sorting the convolutional layer parameters by importance score s_i, and taking the importance score of the (N × p)-th convolutional layer parameter as the pruning threshold T.
S332, traversing all the convolutional layer parameters.
S333, when the importance score of a convolutional layer parameter is smaller than the pruning threshold T, clipping the connection corresponding to that parameter, thereby cutting off the connections with lower importance scores. Optionally, the weight of the convolutional layer parameter may be set to 0, so as to implement the clipping of the corresponding connection.
When the importance score of a convolutional layer parameter is larger than the threshold T, the connection corresponding to that parameter is retained.
After the traversal is completed, a sparse backbone network is obtained in which the convolutional layers retain N × (1 - p) parameters with nonzero weights.
In this embodiment, pruning the backbone network of the pre-training target detector yields a sparse neural network with only N × (1 - p) nonzero-weight parameters, so that the sparse neural network is more efficient.
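A minimal sketch of steps S331 to S333, under the same assumptions as above (populated gradients, PyTorch; the function and variable names are illustrative, not the patent's code):

    import torch

    def prune_convolutional_layers(backbone: torch.nn.Module, p: float) -> None:
        conv_weights = [w for w in backbone.parameters()
                        if w.dim() == 4 and w.grad is not None]
        # S331: the (N*p)-th lowest of the N importance scores is threshold T.
        scores = torch.cat([(w.grad * w).abs().flatten() for w in conv_weights])
        threshold = scores.kthvalue(max(1, int(scores.numel() * p))).values
        # S332/S333: traverse the parameters; zero those scoring below T.
        with torch.no_grad():
            for w in conv_weights:
                keep = ((w.grad * w).abs() >= threshold).to(w.dtype)
                w.mul_(keep)  # N*(1-p) nonzero-weight parameters remain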
EXAMPLE III
On the basis of the above embodiments, the training method of the target detector further includes: screening the bounding boxes using a non-maximum suppression strategy, comprising:
taking the bounding box with the highest confidence as the real box;
if the intersection-over-union (IoU) of a first bounding box and the real box is greater than the non-maximum suppression threshold, reducing the confidence of the first bounding box;
if the confidence of the first bounding box after the reduction is lower than the deletion threshold, deleting the first bounding box;
wherein the first bounding box is any bounding box other than the bounding box with the highest confidence.
Since the entire network generates a large number of bounding boxes, the numerous boxes include overlapping and falsely detected samples. Overlapping bounding boxes on the same target need to be culled, so that finally only the single highest-scoring bounding box is retained for each target.
Specifically, the intersection-over-union IoU of a bounding box and the real box is calculated as
IoU = area(BB ∩ GT) / area(BB ∪ GT)
where BB denotes a Bounding Box and GT denotes the real box (Ground Truth).
Assume that the set of all bounding boxes is S. The bounding boxes are sorted by confidence; the bounding box with the highest confidence is selected as the real box B and added to the result set S', and B is removed from the set S.
The remaining first bounding boxes A in the set S are then traversed. If the IoU of a first bounding box A and the real box B is less than or equal to the non-maximum suppression threshold, A is retained in S; if the IoU is greater than the non-maximum suppression threshold, the confidence of A is reduced. If the reduced confidence of A is lower than the deletion threshold, A is deleted; otherwise A is retained.
Then the bounding box with the highest confidence among the unprocessed boxes in S is selected, and the above flow is repeated until S is empty. The set S' is the result.
In this embodiment, when the IoU of a first bounding box and the real box is greater than the non-maximum suppression threshold, the confidence of the first bounding box is reduced instead of the box being deleted outright; the bounding boxes are thereby screened better, improving the accuracy of the target detector.
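The flow above can be sketched as follows (plain Python; the constant decay factor and the threshold values are assumptions, since the patent only states that the confidence is reduced):

    def iou(a, b):
        # IoU = area(BB ∩ GT) / area(BB ∪ GT); boxes are (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def soft_nms(boxes, scores, nms_thresh=0.5, delete_thresh=0.01, decay=0.5):
        boxes, scores = list(boxes), list(scores)
        remaining, kept = list(range(len(boxes))), []
        while remaining:
            best = max(remaining, key=lambda i: scores[i])  # real box B
            kept.append((boxes[best], scores[best]))        # add B to S'
            remaining.remove(best)
            for i in remaining[:]:
                if iou(boxes[best], boxes[i]) > nms_thresh:
                    scores[i] *= decay                      # reduce, don't delete
                    if scores[i] < delete_thresh:
                        remaining.remove(i)                 # delete only below threshold
        return kept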
Example four
On the basis of the above embodiment, retraining the sparse neural network to obtain a dense neural network includes:
and reactivating the network connection which is set to zero to obtain a dense neural network.
In the training process of the present embodiment, the convolutional neural network undergoes a "dense-sparse-dense" flow. The initialized dense neural network is a common pre-trained convolutional network (CNN). After network pruning, unimportant connections are zeroed out, resulting in a small sparse neural network. During the network retraining process, the zeroed network connections are reactivated, resulting in a dense neural network. Reactivation means that the convolutional layer parameters that have been zeroed during the thinning process can be continuously updated through back propagation during the retraining process.
In this embodiment, the neural network required by the target detector can be obtained by reactivating the zeroed network connections to obtain a dense neural network.
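A sketch of the retraining step (the data loader, optimizer, and loss function are assumed to be supplied by the surrounding training code and are not specified by the patent): because pruning only zeroed the weights and did not freeze them, an ordinary back-propagation loop updates them again, which is exactly the reactivation described above.

    def retrain_to_dense(detector, data_loader, optimizer, compute_loss, epochs=1):
        detector.train()
        for _ in range(epochs):
            for images, targets in data_loader:
                optimizer.zero_grad()
                loss = compute_loss(detector(images), targets)  # total loss L
                loss.backward()
                optimizer.step()  # zeroed weights receive gradients again -> dense
        return detector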
EXAMPLE five
Based on the above embodiments, fig. 2 is a schematic diagram of a training apparatus of an object detector according to an embodiment of the present invention. The present embodiment provides a training apparatus for an object detector, including:
an obtaining module 101, configured to obtain a sample data set and a pre-training target detector, where the backbone network of the pre-training target detector comprises a dense neural network;
a first training module 102, configured to train the backbone network of the pre-training target detector based on the sample data set, and prune the backbone network of the pre-training target detector to generate a sparse neural network;
and a second training module 103, configured to retrain the sparse neural network to obtain a dense neural network.
In this embodiment, a sample data set and a pre-training target detector are obtained, the backbone network of the pre-training target detector comprising a dense neural network; the backbone network is trained based on the sample data set and pruned to generate a sparse neural network; and the sparse neural network is retrained to obtain a dense neural network. The accuracy of an anchor-free target detector can thereby be increased without increasing the computational load of the detector.
On the basis of the foregoing embodiment, the first training module 102 prunes the backbone network of the pre-training target detector to generate a sparse neural network, and includes modules 201 to 204:
a first obtaining module 201, configured to obtain all convolutional layer parameters of the backbone network of the pre-training target detector; the backbone network is a dense pre-trained neural network comprising a plurality of convolutional layer parameters.
A calculating module 202, configured to calculate an importance score of each convolutional layer parameter;
a pruning threshold determining module 203, configured to determine a pruning threshold according to the number of the convolutional layer parameters and a preset pruning rate;
and a sparse module 204, configured to perform pruning according to the pruning threshold and the importance score to generate a sparse neural network.
In some implementations, calculating an importance score for each convolutional layer parameter includes:
computing the L2 norm of the product of each convolutional layer parameter's weight and its back-propagated gradient to obtain the importance score of each convolutional layer parameter.
In some implementations, the back-propagated gradient is obtained by the expression:
g_i = ∂L / ∂w_i
where g_i denotes the gradient computed by back-propagation, L denotes the total loss value of the target detector, w_i denotes the weight of the convolutional layer parameter, and i denotes the index of the convolutional layer parameter.
Thus, the importance score s_i of each convolutional layer parameter can be obtained from the following expression:
s_i = ||g_i · w_i||_2
where i is the index of a convolutional layer parameter in the anchor-free target detector, and ||a||_2 denotes the L2 norm of a.
In some implementations, determining the pruning threshold according to the number of convolutional layer parameters and a preset pruning rate includes:
determining the pruning quantity N × p based on the number N of the convolutional layer parameters and a preset pruning rate p;
sorting the convolutional layer parameters according to the importance scores of the convolutional layer parameters;
and taking the importance score of the (N × p)-th sorted convolutional layer parameter as the pruning threshold.
In some implementations, pruning is performed according to a pruning threshold and an importance score to generate a sparse neural network, including:
and traversing each convolutional layer parameter, and pruning the network connection corresponding to the convolutional layer parameter with the importance score smaller than the pruning threshold.
In some implementations, pruning network connections corresponding to convolutional layer parameters whose importance scores are less than a pruning threshold includes:
and setting the parameter of the convolutional layer with the importance score smaller than the pruning threshold value to be zero.
In this embodiment, the Dense-Sparse-Dense training strategy improves the training process of the anchor-free target detector and increases its accuracy; the improved pruning scheme makes the sparse network more efficient during training and improves the performance of the pruned backbone network; an improved softened non-maximum suppression strategy screens the bounding boxes better, further improving accuracy; and the training process is optimized using a deep learning scheme combined with a target detection sample data set, improving detection performance. This embodiment can be applied to practical target detection projects.
EXAMPLE six
On the basis of the above embodiments, the present embodiment provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the method of the above embodiments.
The storage medium may be a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc.
In this embodiment, a sample data set for a target detection task and an ImageNet pre-trained convolutional neural network are obtained; the pre-trained model is loaded as initialization; a sparse neural network is generated for the backbone network of the target detector according to a preset pruning strategy; and the backbone network is retrained with the anchor-free target detector, the clipped connections being recovered to obtain a dense neural network.
For other contents of the method, please refer to the foregoing embodiments, which are not described in detail in this embodiment.
EXAMPLE seven
On the basis of the foregoing embodiments, the present embodiment provides an electronic device, which includes a processor and a memory, where the memory stores a computer program, and the processor implements the method of the foregoing embodiments when executing the computer program.
For the content of the method, please refer to the foregoing embodiments, which are not repeated in this embodiment.
The Processor may be an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is configured to perform the method of the above embodiments. For the content of the method, please refer to the foregoing embodiments, which are not repeated in this embodiment.
The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that, in the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the recitation of an element by the phrase "comprising an … …" does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
Although the present invention has been described in terms of the above embodiments, the above embodiments are merely used for understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (13)

1. A method of training an object detector, comprising:
acquiring a sample data set and a pre-training target detector, wherein a backbone network of the pre-training target detector comprises a dense neural network;
training the backbone network of the pre-training target detector based on the sample data set, and pruning the backbone network of the pre-training target detector to generate a sparse neural network;
and retraining the sparse neural network to obtain a dense neural network.
2. The training method of claim 1, wherein pruning the backbone network of the pre-trained target detectors to generate a sparse neural network comprises:
acquiring all convolutional layer parameters of a backbone network of the pre-training target detector;
calculating an importance score of each convolutional layer parameter;
determining a pruning threshold according to the number of the convolutional layer parameters and a preset pruning rate;
and pruning according to the pruning threshold and the importance score to generate a sparse neural network.
3. The training method of claim 2, wherein the calculating an importance score for each convolutional layer parameter comprises:
the L2 norm of the product of the weight and back-propagation computed gradient of each convolutional layer parameter is computed to obtain the importance score of each convolutional layer parameter.
4. A training method as claimed in claim 3, characterized in that the back-propagation computation gradient is obtained by the expression:
g_i = ∂L / ∂w_i
wherein g_i denotes the gradient computed by back-propagation, L denotes the total loss value of the target detector, w_i denotes the weight of the convolutional layer parameter, and i denotes the index of the convolutional layer parameter.
5. The training method of claim 2, wherein the determining the pruning threshold according to the number of convolutional layer parameters and a preset pruning rate comprises:
determining the pruning quantity N × p based on the number N of the convolutional layer parameters and a preset pruning rate p;
sorting the convolutional layer parameters according to the importance scores of the convolutional layer parameters;
the importance score of the nth × p convolutional layer parameters in the ranking is used as a pruning threshold.
6. The training method of claim 2, wherein the pruning according to the pruning threshold and the importance score to generate a sparse neural network comprises:
and traversing each convolutional layer parameter, and pruning the network connection corresponding to the convolutional layer parameter with the importance score smaller than the pruning threshold.
7. The training method of claim 6, wherein pruning network connections corresponding to convolutional layer parameters whose importance scores are less than the pruning threshold comprises:
and setting the parameter of the convolutional layer with the importance score smaller than the pruning threshold value to be zero.
8. The training method of claim 1, further comprising: screening bounding boxes using a non-maximum suppression strategy.
9. The training method of claim 8, wherein the screening of the bounding boxes using the non-maximum suppression strategy comprises:
taking the bounding box with the highest confidence as a real box;
if the intersection-over-union of a first bounding box and the real box is greater than a non-maximum suppression threshold, reducing the confidence of the first bounding box;
if the confidence of the first bounding box after the reduction is lower than a deletion threshold, deleting the first bounding box;
wherein the first bounding box is any bounding box other than the bounding box with the highest confidence.
10. The training method of claim 1, wherein the retraining the sparse neural network to obtain a dense neural network comprises:
and reactivating the network connection which is set to zero to obtain a dense neural network.
11. An apparatus for training an object detector, comprising:
an acquisition module, configured to acquire a sample data set and a pre-training target detector, wherein the backbone network of the pre-training target detector comprises a dense neural network;
a first training module, configured to train a backbone network of the pre-training target detector based on the sample data set, and prune the backbone network of the pre-training target detector to generate a sparse neural network;
and the second training module is used for retraining the sparse neural network to obtain a dense neural network.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 10.
13. An electronic device comprising a processor and a memory, wherein the memory has stored thereon a computer program which, when executed by the processor, implements the method of any of claims 1 to 10.
CN202111637544.5A 2021-12-29 2021-12-29 Training method and device of target detector, storage medium and electronic equipment Pending CN114511731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111637544.5A CN114511731A (en) 2021-12-29 2021-12-29 Training method and device of target detector, storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN114511731A true CN114511731A (en) 2022-05-17

Family

ID=81548135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111637544.5A Pending CN114511731A (en) 2021-12-29 2021-12-29 Training method and device of target detector, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114511731A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116911384A (en) * 2023-06-13 2023-10-20 电子科技大学 Zero-suppression incremental knowledge optimization method and device and electronic equipment
CN116911384B (en) * 2023-06-13 2024-01-26 电子科技大学 Zero-suppression incremental knowledge optimization method and device and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination