CN112699914A - Target detection method and system based on heterogeneous composite backbone - Google Patents

Target detection method and system based on heterogeneous composite backbone

Info

Publication number
CN112699914A
Authority
CN
China
Prior art keywords
target detection
training
network
data
backbone
Prior art date
2020-12-02
Legal status
Granted
Application number
CN202011388828.0A
Other languages
Chinese (zh)
Other versions
CN112699914B (en)
Inventor
郑慧诚
严志伟
陈蔓薇
李烨
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
2020-12-02
Filing date
2020-12-02
Publication date
2021-04-23
2020-12-02: Application filed by Sun Yat Sen University
2020-12-02: Priority to CN202011388828.0A
2021-04-23: Publication of CN112699914A
Application granted
2023-09-22: Publication of CN112699914B
Current legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and system based on a heterogeneous composite backbone, wherein the method comprises the following steps: acquiring training data and preprocessing the training data to obtain preprocessed data; constructing a target detection network based on a heterogeneous composite backbone architecture; training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network; and acquiring data to be detected, inputting the data to be detected into the trained target detection network, and outputting a detection result. The system comprises: a preprocessing module, a network construction module, a training module and a detection module. With the method and system, the complementary features learned by two heterogeneous backbone networks are integrated while feature redundancy is avoided, thereby enhancing the overall feature representation and target detection performance of the detector. The target detection method and system based on the heterogeneous composite backbone can be widely applied in the field of target detection networks.

Description

Target detection method and system based on heterogeneous composite backbone
Technical Field
The invention belongs to the field of target detection networks, and particularly relates to a target detection method and a target detection system based on a heterogeneous composite backbone.
Background
Object detection is a fundamental and widely used task in the field of computer vision. As an essential component of many vision systems, it plays a significant role in the overall performance of those systems. With the research and application of deep learning in computer vision, the performance of target detectors has been improving continuously and substantially.
In a common deep-learning-based target detection network, the backbone network is mainly responsible for extracting features relevant to the target, and its output feature maps are the basis for target localization and recognition by the detection head, so they are crucial to the overall performance of the network. In a detector, the backbone network typically has more parameters than the detection head, so as to ensure adequate learning and expression of the data distribution. Existing target detectors generally adopt a network pre-trained on a classification task in order to make full use of the large number of training samples with class labels; however, a network structure designed for the classification task is not entirely suitable for the detection task, and a domain shift phenomenon often arises when such a structure is applied to a specific target detection task.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a target detection method and system based on a heterogeneous composite backbone, which address the domain shift problem of the backbone network.
The first technical scheme adopted by the invention is as follows: a target detection method based on a heterogeneous composite backbone, comprising the following steps:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a heterogeneous composite backbone architecture;
training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
Further, the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically includes:
acquiring training data according to a preset problem;
carrying out category and position marking on the training data to obtain marked training data;
and the information in the marked training data comprises an original material picture, and a marking record of a target position and a category in the picture.
Further, the target detection network comprises a detail extraction backbone, a depth backbone and a composite module, and the detail extraction backbone and the depth backbone realize backbone network composite through the composite module.
Further, the composite module includes a 1 × 1 convolutional layer and an addition unit.
Further, the detail extraction backbone is constructed based on the ResNet structure, and comprises a stem section with the first pooling layer removed and an exploration subnet in which the basic module is replaced with a narrow module.
Further, the narrow module comprises two 3 × 3 convolutional layers with a small number of parameters.
Further, the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network specifically includes:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
during training of the target detection network, taking the training set as input and computing the network output through convolution and other operations to obtain a set of prediction boxes;
according to the classification subtask and the positioning subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction-box category vector and the annotation-box category vector as the loss function;
for the positioning subtask, calculating the position loss between the prediction box and the annotation box through the Smooth L1 loss function;
calculating the gradients of the parameters in the convolutional layers, layer by layer, according to the computed loss and stochastic gradient descent, and updating the parameters of each layer in the network;
during training, at intervals of a fixed number of iterations, evaluating the generalization of the network by taking the validation set as input;
and after training is finished, evaluating the performance of the network by taking the test set as input, and at the same time saving parameters such as the convolution kernels and biases in the network, to obtain the trained target detection network.
Further, the step of acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting, through the convolutional layers, a sequence of 4-dimensional vectors representing prediction-box positions and a sequence of N-dimensional vectors representing class predictions;
the detector discards a portion of low-quality results from the N-dimensional class-prediction vector sequence according to a manually preset class confidence threshold, to obtain the remaining detection results;
and for the remaining detection results, calculating the overlap rate between prediction boxes from the position 4-dimensional vectors, and de-duplicating the prediction boxes based on the prediction-box confidences and a non-maximum suppression algorithm, to obtain and output the final detection result of the detector.
The method and the system have the following beneficial effects: through the composite module, the detection network integrates the complementary features learned by the two heterogeneous backbone networks while avoiding feature redundancy, thereby enhancing the overall feature representation and target detection performance of the detector; the network structure is also simplified to reduce the number of network parameters and the computational complexity.
Drawings
FIG. 1 is a network architecture of a target detection network based on heterogeneous composite backbone according to the present invention;
FIG. 2 is a flowchart illustrating steps of a method for detecting a target based on a heterogeneous composite backbone according to the present invention;
fig. 3 is a structural block diagram of a target detection system based on a heterogeneous composite backbone according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
As shown in fig. 1 and fig. 2, the present invention provides a target detection method based on a heterogeneous composite backbone, which comprises the following steps:
s1, acquiring training data and preprocessing the training data to obtain preprocessed data;
s2, constructing a target detection network based on the heterogeneous composite backbone architecture;
s3, training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and S4, acquiring the data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
Further, as a preferred embodiment of the method, the step of obtaining the training data and preprocessing the training data to obtain preprocessed data specifically includes:
acquiring training data according to a preset problem;
specifically, training data is collected according to the problem to be solved (such as general target detection, face detection, and floater detection).
Carrying out category and position marking on the training data to obtain marked training data;
and the information in the marked training data comprises an original material picture, and a marking record of a target position and a category in the picture.
Further, as a preferred embodiment of the method, the target detection network comprises a detail extraction backbone, a depth backbone and a composite module, and the detail extraction backbone and the depth backbone realize backbone network composite through the composite module.
As a further preferred embodiment of the method, the composite module comprises a 1 × 1 convolutional layer and an addition unit.
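By way of a non-limiting illustration, a minimal sketch of such a composite module is given below, assuming a PyTorch implementation; the channel counts used in the example are assumptions, since the text only specifies a 1 × 1 convolution followed by element-wise addition.

    import torch
    import torch.nn as nn

    class CompositeModule(nn.Module):
        """Fuses a detail-backbone feature map into the depth backbone."""
        def __init__(self, detail_channels, depth_channels):
            super().__init__()
            # the 1x1 convolution aligns the detail-backbone channels with the depth backbone
            self.align = nn.Conv2d(detail_channels, depth_channels, kernel_size=1)

        def forward(self, detail_feat, depth_feat):
            # element-wise addition combines the two heterogeneous feature maps
            return depth_feat + self.align(detail_feat)

    # example: fuse a 128-channel detail feature into a 256-channel depth feature map
    fuse = CompositeModule(detail_channels=128, depth_channels=256)
    out = fuse(torch.randn(1, 128, 64, 64), torch.randn(1, 256, 64, 64))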
Further, as a preferred embodiment of the method, the detail extraction backbone is constructed based on a ResNet structure, and comprises a stem part with the first pooling layer removed and an exploration subnet in which the basic module is replaced with a narrow module.
Specifically, the detail extraction backbone is based on a ResNet structure and mainly comprises a stem part and an exploration subnet. Compared with the ordinary ResNet, the difference is that the stem part of the detail extraction backbone proposed by the invention removes the first pooling layer of the original network. In the exploration subnet, the invention replaces the basic module of the original ResNet with a narrow module with fewer parameters, so as to reduce the network capacity and the video memory footprint.
In addition, the detail extraction backbone is used for fine-grained feature extraction and, as shown in fig. 1, comprises five convolution levels, i.e., conv_1, conv2_x, conv3_x, conv4_x and conv5_x. Compared with the original ResNet, the invention removes the first pooling layer, so the output features of each convolution level have a smaller stride, which effectively preserves local detail information.
The feature stride in the proposed detail extraction backbone is half that of ResNet-34, and the receptive field of each convolution level is smaller than the corresponding one in ResNet-34. Compared with ResNet-34, which is designed for the classification task, this network better preserves spatial local information in the image and is therefore more suitable for target detection.
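Purely for illustration, and assuming a PyTorch implementation with the standard ResNet stem hyperparameters (which the description does not spell out), the modification to the stem can be sketched as follows.

    import torch.nn as nn

    # standard ResNet stem, shown for comparison
    resnet_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    )

    # detail-extraction stem: the first max-pooling layer is removed, so later
    # convolution levels see features at half the stride (twice the resolution)
    detail_stem = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )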
Further, as a preferred embodiment of the method, the narrow module comprises two 3 × 3 convolutional layers with a small number of parameters.
Specifically, reducing the feature extraction stride facilitates the extraction of local detail information, but it also increases the computational burden. For this purpose, the invention designs a narrow module to replace the basic residual module (BasicBlock) in the original ResNet-34. As shown in fig. 1, the narrow module comprises two convolutional layers: for example, when the number of input channels is 256, the number of output channels of the first convolutional layer is 256/γ, where the compression ratio γ is a hyperparameter; the second convolutional layer then expands the feature channels back to 256. For both the narrow module and the BasicBlock, the computational complexity of a single convolutional layer is k²·C_i·C_o·H·W, where k, C_i, C_o, H and W are respectively the size of the square convolution kernel, the number of input feature channels, the number of output feature channels, and the height and width of the output feature map. The parameter amount and computational complexity of the two convolutional layers can therefore be adjusted through the compression ratio γ.
The invention places the narrow modules in the deep layers of the convolutional network, which contain most of the parameters, i.e. the exploration subnet part of fig. 1, so as to greatly reduce the network parameters without changing the stem structure, thereby making full use of the low-level pre-trained parameters, which have better generalization.
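By way of illustration only, a narrow module of this kind could be sketched in PyTorch as below; the use of batch normalization, the residual connection and the example compression ratio of 4 are assumptions carried over from the BasicBlock design rather than values fixed by the text.

    import torch
    import torch.nn as nn

    class NarrowModule(nn.Module):
        """Two 3x3 convolutions with a channel bottleneck controlled by gamma."""
        def __init__(self, channels, gamma=4):
            super().__init__()
            mid = channels // gamma                       # compressed channel count
            self.conv1 = nn.Conv2d(channels, mid, kernel_size=3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(mid)
            self.conv2 = nn.Conv2d(mid, channels, kernel_size=3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))      # compress: channels -> channels/gamma
            out = self.bn2(self.conv2(out))               # expand:   channels/gamma -> channels
            return self.relu(out + x)                     # residual connection

    block = NarrowModule(channels=256, gamma=4)
    y = block(torch.randn(1, 256, 32, 32))                # output shape preserved: (1, 256, 32, 32)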
As a preferred embodiment of the method, the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain the trained target detection network specifically includes:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
during training of the target detection network, taking the training set as input and computing the network output through convolution and other operations to obtain a set of prediction boxes;
Specifically, before training, a series of preprocessing rules for the input image are set; the mandatory preprocessing operations include image normalization, which stabilizes training, and image resizing, which controls the computational complexity. During training, on top of these necessary operations, a series of random preprocessing operations such as random cropping are introduced for data augmentation, so as to enhance the performance of the network.
according to the classification subtask and the positioning subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction-box category vector and the annotation-box category vector as the loss function;
for the positioning subtask, calculating the position loss between the prediction box and the annotation box through the Smooth L1 loss function;
calculating the gradients of the parameters in the convolutional layers, layer by layer, according to the computed loss and stochastic gradient descent, and updating the parameters of each layer in the network;
during training, at intervals of a fixed number of iterations, evaluating the generalization of the network by taking the validation set as input;
and after training is finished, evaluating the performance of the network by taking the test set as input, and at the same time saving parameters such as the convolution kernels and biases in the network, to obtain the trained target detection network.
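The training loop itself is conventional; the toy sketch below only illustrates the loss computation, the stochastic-gradient-descent update and the periodic validation, with random data and a placeholder model standing in for the detector and the preprocessed data set, and with the learning rate, batch size and validation interval chosen purely as assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # placeholder for the detector
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

    for step in range(1, 101):
        images = torch.randn(8, 3, 32, 32)                 # stand-in for a preprocessed training batch
        labels = torch.randint(0, 10, (8,))
        loss = F.cross_entropy(model(images), labels)      # classification part of the loss
        optimizer.zero_grad()
        loss.backward()                                    # gradients computed layer by layer
        optimizer.step()                                   # stochastic gradient descent update
        if step % 20 == 0:
            # every fixed number of iterations the validation set would be evaluated here
            print(f"step {step}: training loss {loss.item():.3f}")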
Specifically, in actual detection, the trained model can be restored simply by assigning each saved parameter value, according to its parameter name, to the parameter of the corresponding layer in the network; the restored model then serves as the basis for outputting detection results in the subsequent detection process.
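Assuming a PyTorch implementation, this save-and-restore by parameter name amounts to the state-dict round trip sketched below; the small sequential model is merely a placeholder for the full detection network.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())   # placeholder network
    torch.save(model.state_dict(), "detector_params.pth")              # convolution kernels and biases, keyed by parameter name

    restored = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
    restored.load_state_dict(torch.load("detector_params.pth"))        # values assigned to layers by parameter name
    restored.eval()                                                    # ready for the detection phase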
Specifically, the training strategy mainly comprises the following two parts:
(1) Three-way supervision: if the heterogeneous composite backbone network is trained directly, a suppression phenomenon may occur, that is, the pre-trained backbone converges faster and reaches a local optimum, while most features in the detail extraction network are optimized towards 0 and become ineffective. Simply increasing the learning rate of the detail extraction network only accelerates the suppression and cannot solve the problem. Therefore, the invention sets anchors on each of the three branches, namely the pre-trained depth network, the detail extraction network and the composite part, and adds a supervision signal to each; the total loss can be expressed as:
L_total = Σ_j Σ_i ( L_conf(p_ji, c_i) + [c_i ≥ 1] · L_loc(r_ji, g_i) )
where j is the branch index, p_ji and r_ji respectively denote the class confidence prediction and the positioning prediction of branch j on the i-th anchor, and c_i and g_i respectively denote the ground-truth class label and positioning label corresponding to the i-th anchor. The Iverson bracket [c_i ≥ 1] equals 1 if and only if c_i ≥ 1, and 0 otherwise. Through these additional supervision signals, the three-way supervision ensures that all features, including those of the detail extraction network, possess the discriminability required for detection, so that the network can more easily escape from the local optimum.
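A minimal sketch of this loss, assuming a PyTorch implementation and leaving aside any normalization terms the original formula may carry, is given below; the tensor layout is an assumption made for illustration.

    import torch
    import torch.nn.functional as F

    def three_way_loss(branch_cls, branch_loc, labels, boxes):
        """Sum of classification and localization losses over the three supervised branches.

        branch_cls: list of three tensors of shape (num_anchors, num_classes) with class logits
        branch_loc: list of three tensors of shape (num_anchors, 4) with positioning predictions
        labels:     (num_anchors,) ground-truth class index per anchor, 0 meaning background
        boxes:      (num_anchors, 4) ground-truth positioning targets
        """
        positive = labels >= 1                                      # Iverson bracket [c_i >= 1]
        total = torch.zeros(())
        for cls_logits, loc_pred in zip(branch_cls, branch_loc):
            conf = F.cross_entropy(cls_logits, labels, reduction="sum")
            loc = F.smooth_l1_loss(loc_pred[positive], boxes[positive], reduction="sum")
            total = total + conf + loc
        return total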
(2) Network decomposition: the invention provides a network decomposition training strategy, which divides the network into a shallow stem part and a deep exploration subnet part. The stem part is initialized with pre-trained parameters and is only fine-tuned during training, so that the generalization of the shallow features is fully exploited to reduce the optimization difficulty and accelerate convergence. In the exploration subnet, the basic modules of the original ResNet are replaced with narrow modules to reduce the number of parameters and improve computational efficiency; its parameters are randomly initialized and trained with a larger learning rate to encourage a broader search of the parameter space. The training cost of this decomposed training strategy is essentially the same as that of a fully pre-trained network.
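A sketch of the decomposed optimization, again assuming a PyTorch implementation: the stem keeps its pre-trained parameters and is only fine-tuned with a small learning rate, while the exploration subnet is trained with a larger one. The module layout and the learning-rate values are assumptions for the example.

    import torch
    import torch.nn as nn

    class DetailBackbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.stem = nn.Sequential(                    # shallow part, initialized from pre-trained weights
                nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True))
            self.explore_subnet = nn.Sequential(          # deep part built from narrow modules, randomly initialized
                nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True))

    backbone = DetailBackbone()
    optimizer = torch.optim.SGD(
        [{"params": backbone.stem.parameters(), "lr": 1e-4},             # fine-tune only
         {"params": backbone.explore_subnet.parameters(), "lr": 1e-2}],  # larger steps for a broader search
        momentum=0.9, weight_decay=5e-4)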
In addition, the loss function L_conf used for training the classification subtask is the cross entropy, and the loss function L_loc used for training the regression subtask is the Smooth L1 loss.
Further, as a preferred embodiment of the method, the step of acquiring the data to be detected, inputting the data to be detected to the trained target detection network, and outputting the detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting, through the convolutional layers, a sequence of 4-dimensional vectors representing prediction-box positions and a sequence of N-dimensional vectors representing class predictions;
the detector discards a portion of low-quality results from the N-dimensional class-prediction vector sequence according to a manually preset class confidence threshold, to obtain the remaining detection results;
and for the remaining detection results, calculating the overlap rate between prediction boxes from the position 4-dimensional vectors, and de-duplicating the prediction boxes based on the prediction-box confidences and a non-maximum suppression algorithm, to obtain and output the final detection result of the detector.
Specifically, the detector first discards a portion of the low-quality results from the N-dimensional class-prediction sequence according to a manually preset class confidence threshold. The remaining detection results are then de-duplicated with a non-maximum suppression (NMS) algorithm, using the prediction-box confidences and the overlap rate between prediction boxes computed from the position 4-dimensional vectors. The prediction boxes that remain at the end constitute the detection result of the detector.
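A compact sketch of this post-processing using torchvision's NMS operator is shown below; the assumption that class index 0 denotes background and the threshold values are illustrative only.

    import torch
    from torchvision.ops import nms

    def postprocess(boxes, class_logits, score_thresh=0.5, iou_thresh=0.5):
        """boxes: (M, 4) predicted box coordinates; class_logits: (M, N) class predictions."""
        scores, labels = torch.softmax(class_logits, dim=1).max(dim=1)
        keep = (scores >= score_thresh) & (labels > 0)        # discard low-confidence and background boxes
        boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
        kept = nms(boxes, scores, iou_thresh)                 # de-duplicate boxes by overlap rate
        return boxes[kept], scores[kept], labels[kept]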
As shown in fig. 3, a target detection system based on heterogeneous composite backbones includes the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on the heterogeneous composite backbone architecture;
the training module is used for training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and the detection module is used for acquiring data to be detected, inputting the data to be detected into the trained target detection network and outputting a detection result.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (9)

1. A target detection method based on a heterogeneous composite trunk is characterized by comprising the following steps:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a heterogeneous composite backbone architecture;
training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
2. The method according to claim 1, wherein the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically comprises:
acquiring training data according to a preset problem;
carrying out category and position marking on the training data to obtain marked training data;
and the information in the marked training data comprises an original material picture, and a marking record of a target position and a category in the picture.
3. The target detection method based on the heterogeneous composite backbone of claim 2, wherein the target detection network comprises a detail extraction backbone, a depth backbone and a composite module, and the detail extraction backbone and the depth backbone realize backbone network composite through the composite module.
4. The method of claim 3, wherein the composite module comprises a 1 x 1 convolutional layer and an addition unit.
5. The method of claim 4, wherein the detail extraction backbone is constructed based on a ResNet structure, and the detail extraction backbone comprises a stem part with the first pooling layer removed and an exploration subnet in which the basic module is replaced with a narrow module.
6. The method of claim 5, wherein the narrow module comprises two 3 x 3 convolutional layers with a small number of parameters.
7. The target detection method based on the heterogeneous composite backbone according to claim 6, wherein the step of training the target detection network based on the preprocessed data and the preset training strategy to obtain the trained target detection network specifically comprises:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
during training of the target detection network, taking the training set as input and computing the network output through convolution and other operations to obtain a set of prediction boxes;
according to the classification subtask and the positioning subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction-box category vector and the annotation-box category vector as the loss function;
for the positioning subtask, calculating the position loss between the prediction box and the annotation box through the Smooth L1 loss function;
calculating the gradients of the parameters in the convolutional layers, layer by layer, according to the computed loss and stochastic gradient descent, and updating the parameters of each layer in the network;
during training, at intervals of a fixed number of iterations, evaluating the generalization of the network by taking the validation set as input;
and after training is finished, evaluating the performance of the network by taking the test set as input, and at the same time saving parameters such as the convolution kernels and biases in the network, to obtain the trained target detection network.
8. The target detection method based on the heterogeneous composite backbone of claim 7, wherein the step of obtaining the data to be detected, inputting the data to be detected to the trained target detection network, and outputting the detection result specifically comprises:
acquiring data to be detected to obtain an image of a target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting, through the convolutional layers, a sequence of 4-dimensional vectors representing prediction-box positions and a sequence of N-dimensional vectors representing class predictions;
the detector discards a portion of low-quality results from the N-dimensional class-prediction vector sequence according to a manually preset class confidence threshold, to obtain the remaining detection results;
and for the remaining detection results, calculating the overlap rate between prediction boxes from the position 4-dimensional vectors, and de-duplicating the prediction boxes based on the prediction-box confidences and a non-maximum suppression algorithm, to obtain and output the final detection result of the detector.
9. A target detection system based on heterogeneous composite backbones is characterized by comprising the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on the heterogeneous composite backbone architecture;
the training module is used for training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and the detection module is used for acquiring data to be detected, inputting the data to be detected into the trained target detection network and outputting a detection result.

Priority Applications (1)

Application Number: CN202011388828.0A (granted as CN112699914B); Priority date: 2020-12-02; Filing date: 2020-12-02; Title: Target detection method and system based on heterogeneous composite backbone

Publications (2)

Publication Number Publication Date
CN112699914A 2021-04-23
CN112699914B CN112699914B (en) 2023-09-22

Family

ID: 75506104

Family Applications (1)

Application Number: CN202011388828.0A (granted as CN112699914B); Priority date: 2020-12-02; Filing date: 2020-12-02; Status: Active; Title: Target detection method and system based on heterogeneous composite backbone

Country Status (1)

CN: CN112699914B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN110188720A (en) * 2019-06-05 2019-08-30 上海云绅智能科技有限公司 A kind of object detection method and system based on convolutional neural networks
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A kind of small target deteection of Enhanced feature study and recognition methods
CN111797676A (en) * 2020-04-30 2020-10-20 南京理工大学 High-resolution remote sensing image target on-orbit lightweight rapid detection method

Also Published As

Publication number Publication date
CN112699914B (en) 2023-09-22


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant