CN112396126A - Target detection method and system based on detection backbone and local feature optimization - Google Patents

Target detection method and system based on detection backbone and local feature optimization

Info

Publication number
CN112396126A
CN112396126A
Authority
CN
China
Prior art keywords
network
target detection
training
data
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011388976.2A
Other languages
Chinese (zh)
Other versions
CN112396126B (en)
Inventor
郑慧诚
严志伟
黄梓轩
李烨
陈绿然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202011388976.2A priority Critical patent/CN112396126B/en
Publication of CN112396126A publication Critical patent/CN112396126A/en
Application granted granted Critical
Publication of CN112396126B publication Critical patent/CN112396126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and system based on a detection backbone and local feature optimization, wherein the method comprises the following steps: acquiring training data and preprocessing the training data to obtain preprocessed data; constructing a target detection network based on a long-neck backbone architecture and a local feature optimization module; training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network; and acquiring data to be detected, inputting the data to be detected into the trained target detection network, and outputting a detection result. The system comprises: a preprocessing module, a network construction module, a training module and a detection module. The invention ensures that the detector achieves satisfactory performance while remaining computationally friendly. The target detection method and system based on a detection backbone and local feature optimization can be widely applied to the field of target detection networks.

Description

Target detection method and system based on detection backbone and local feature optimization
Technical Field
The invention belongs to the field of target detection networks, and particularly relates to a target detection method and system based on a detection backbone and local feature optimization.
Background
Target detection, as a fundamental task of computer vision, has wide applications and is an active research area in both academia and industry. With the rise of deep learning, the field of target detection has advanced considerably. However, current detectors still perform poorly on small-scale targets, mainly because spatial information is lost too quickly in the backbone network and local information is insufficiently modeled by the detection head.
The backbone network, as the basic structure for feature extraction, plays a significant role in detection performance. Because training samples for target detection are generally scarce, most current detectors employ backbones pre-trained on large image classification datasets. The mismatch between the two tasks causes domain shift when the network is fine-tuned, and adopting a pre-trained network also restricts the structural design space of the backbone to a certain extent. Moreover, the commonly adopted backbone networks apply pooling too early, which loses spatial detail information and is unfavorable for the feature representation of small targets.
On the other hand, the detection head of current mainstream detectors usually takes a feature pyramid as input; the shallow features in the pyramid lack semantic information, while the spatial information of the deep features is severely degraded. How to enhance the feature representation and detection ability of the detection layers for small-scale targets is therefore a problem that urgently needs to be solved.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a target detection method and system based on a detection backbone and local feature optimization, which ensure that the detector achieves satisfactory performance while remaining computationally friendly.
The first technical scheme adopted by the invention is as follows: a target detection method based on a detection backbone and local feature optimization, comprising the following steps:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a long-neck backbone architecture and a local feature optimization module;
training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
Further, the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically includes:
collecting training data according to the problem domain and marking the training data to obtain marked training data;
the training data comprise public datasets from the Internet and images captured in the field, and the information in the training data comprises the original material pictures together with annotation records of the target positions and categories in the pictures.
Further, the target detection network comprises a long-neck residual backbone network and a local feature optimization module; the long-neck residual backbone network comprises six feature extraction convolution modules, and the local feature optimization module comprises a local fusion module and a scale supervision module.
Further, the feature extraction convolution module comprises an Inception module, and the Inception module comprises two branches.
Further, the local fusion module comprises a detail re-guiding branch, a local context branch and an original input mapping branch, wherein the detail re-guiding branch passes the input feature map sequentially through a 1 × 1 convolutional layer, a max pooling layer, a 3 × 3 convolutional layer and a batch normalization layer; the local context branch passes the input feature map sequentially through a 1 × 1 convolutional layer, a deconvolution layer, a 3 × 3 convolutional layer and a batch normalization layer; and the original input mapping branch passes the input feature map sequentially through a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a batch normalization layer.
Further, the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network specifically includes:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
taking the training set as the input of the target detection network during training and computing the network output through convolution and related operations to obtain a set of prediction boxes;
according to the classification subtask and the localization subtask, each prediction box in the set of prediction boxes comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction box category vector and the annotation box category vector as the loss function;
for the localization subtask, calculating the position loss between the prediction box and the annotation box with the Smooth L1 loss function;
calculating the gradients of the convolutional-layer parameters layer by layer from the computed loss using stochastic gradient descent, and updating the parameters of each layer in the network;
during training, evaluating the generalization of the network at fixed iteration intervals with the validation set as input;
and after training is finished, evaluating the performance of the network with the test set as input, while saving parameters such as the convolution kernels and biases in the network to obtain the trained target detection network.
Further, the step of acquiring data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting through the convolutional layers a sequence of 4-dimensional vectors representing prediction box positions and a sequence of N-dimensional vectors representing category predictions;
the detector discards a portion of low-quality results according to the N-dimensional category prediction vectors using a manually preset category confidence threshold, obtaining the remaining detection results;
and for the remaining detection results, computing the overlap between prediction boxes from the 4-dimensional position vectors and, together with the prediction box confidences, removing duplicate prediction boxes with a non-maximum suppression algorithm to obtain and output the final detection result of the detector.
The method and the system have the following beneficial effects: a local feature optimization module for fusing spatial local information is designed, which not only enhances the semantic information of the detection layers but also preserves the spatial local information of the detection-head features, and is particularly beneficial for small-target detection; in addition, to overcome the performance drop caused by random initialization of the backbone parameters, a suitable learning strategy is provided, ensuring that the detector achieves satisfactory performance while remaining computationally friendly.
Drawings
FIG. 1 is a network architecture of a target detection network based on detection backbone and local feature optimization according to the present invention;
FIG. 2 is a flowchart illustrating the steps of a target detection method based on a detection backbone and local feature optimization according to the present invention;
FIG. 3 is a block diagram of a target detection system based on a detection backbone and local feature optimization according to the present invention;
FIG. 4 illustrates a branch structure in a local fusion module according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
As shown in fig. 1 and fig. 2, the present invention provides a target detection method based on a detection backbone and local feature optimization, which includes the following steps:
s1, acquiring training data and preprocessing the training data to obtain preprocessed data;
s2, constructing a target detection network based on the long-neck backbone architecture and the local feature optimization module;
s3, training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
specifically, in order to overcome the performance reduction caused by no pre-training, the invention optimizes the training strategy to ensure that similar or even better performance is obtained under the same training resources, and the specific improvement is as follows: (1) differentiation learning rate: the part of the network before the local inclusion module is consistent with the existing ResNet structure, and meanwhile, the lower-layer visual features have stronger generalization capability, so that pre-training initialization parameters can be adopted. For the pre-trained network part, a smaller learning rate is adopted to keep the pre-training knowledge; for randomly initialized parameters, a large learning rate is employed to facilitate the search of the network in the parameter space. By adopting the difference learning strategy, the detection network not only can have generalization performance brought by pre-training, but also can ensure faster learning convergence speed. (2) Strengthening the stability of the initial stage of training: the network adopts a characteristic pyramid structure to carry out target detection, which is beneficial to enhancing the robustness of a target scale, but a high-resolution characteristic diagram in a detection layer easily generates overlarge gradient at the initial training stage, and the convergence of a learning process is influenced. The invention adopts the preheating technology, ensures the gradual optimization of the network by gradually increasing the learning rate in the initial training stage, and prevents the network from deviating from the optimization target too far in the initial stage, thereby ensuring the learning process to be more stable. By adopting preheating, the statistical characteristics obtained by the network at the initial training stage are more accurate, and the problem that the existing randomly initialized target detection network depends on large-batch learning is solved, so that the satisfactory performance can be obtained under the condition of smaller computing resource requirements.
And S4, acquiring the data to be detected, inputting the data to be detected to the trained target detection network, and outputting a detection result.
Further, as a preferred embodiment of the method, the step of obtaining the training data and preprocessing the training data to obtain preprocessed data specifically includes:
collecting training data according to the problem domain and marking the training data to obtain marked training data;
the training data comprise public datasets from the Internet and images captured in the field, and the information in the training data comprises the original material pictures together with annotation records of the target positions and categories in the pictures.
Specifically, an annotation box is generated here, containing an annotation box category vector and a position vector.
Further, as a preferred embodiment of the method, the target detection network comprises a long-neck residual backbone network and a local feature optimization module; the long-neck residual backbone network comprises six feature extraction convolution modules, and the local feature optimization module comprises a local fusion module and a scale supervision module.
Specifically, as shown in the upper half of fig. 1 ("long-neck residual backbone network"), the backbone basically adopts a residual structure but differs from a conventional ResNet in two places: (1) a local Inception module is added to obtain receptive fields with multiple aspect ratios; (2) the neck is longer, so richer spatial detail features can be extracted;
in addition, as shown in the upper left of fig. 1, the architecture of the long-neck trunk is based on a residual error network, and mainly includes 6 convolution levels responsible for feature extraction, one of which is a local inclusion module. Unlike the normal residual network, the long-neck backbone network cancels one of the largest pooling layers after the conv1 level, resulting in multiplication of the input profile resolution of the conv2_ x level and thereafter the backbone network. In addition, removal of the pooling layer also slows down the increase of the receptive field in the trunk, thereby facilitating capture of fine-grained features.
Simply removing the pooling layer increases the feature resolution and therefore adds a certain amount of computation. The invention thus also provides a lightweight version of the long-neck residual backbone network (LN-ResNet-light). Compared with LN-ResNet, LN-ResNet-light keeps the max pooling layer after conv1 from the original ResNet structure and reduces the convolution stride of the first residual block of conv3_x to 1, thereby reducing the overall computation.
The long-neck backbone network (LN-ResNet) provided by the invention is mainly used to extract fine-grained spatial information from an image. By extending the depth of the neck (the convolutional layers before the detection layers), the network enhances the extraction of high-resolution features, alleviates the overly fast loss of spatial detail information found in common backbone networks, and strengthens the feature representation of small-scale targets.
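As a rough, non-limiting sketch of the long-neck idea, the snippet below modifies a standard torchvision ResNet-50 by dropping the max pooling layer after conv1, which doubles the spatial resolution of conv2_x and all later stages. This approximates LN-ResNet but is not the exact network definition of the invention; the LN-ResNet-light variant would instead keep the pooling layer and set the stride of the first conv3_x block to 1.

```python
import torch.nn as nn
from torchvision.models import resnet50

def build_long_neck_resnet(pretrained_lower_layers=True):
    # Illustrative approximation of the "long neck": start from a standard
    # ResNet-50 and remove the max pooling layer after conv1 so spatial
    # detail is preserved longer and the receptive field grows more slowly.
    net = resnet50(weights="IMAGENET1K_V1" if pretrained_lower_layers else None)
    net.maxpool = nn.Identity()
    return net
```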
Further, as a preferred embodiment of the method, the feature extraction convolution module comprises an Inception module, and the Inception module comprises two branches.
Specifically, the local Inception module comprises two branches. In both branches, the input features first pass through a 1 × 1 convolutional layer that compresses the number of channels to reduce computation.
The two branches then apply a 1 × 3 convolution and a 3 × 1 convolution respectively. Unlike the serial arrangement in a common Inception module, these two convolutional layers operate in parallel and are mainly used to obtain receptive fields with different aspect ratios, so that targets of different aspect ratios are represented and modeled more effectively. In addition, these convolutional layers help enlarge the receptive field and deepen the network, thereby strengthening semantic representation.
Finally, the output features of the two branches are concatenated and fused through a 3 × 3 convolutional layer. The fused output is added to the input of the whole module to form a residual structure, ensuring effective propagation of gradients.
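A minimal PyTorch sketch of such a two-branch local Inception module is given below; the channel sizes and the placement of activations are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LocalInception(nn.Module):
    # Two parallel branches (1x1 compression followed by 1x3 and 3x1
    # convolutions), concatenation, 3x3 fusion and a residual connection.
    def __init__(self, channels, mid_channels=None):
        super().__init__()
        mid = mid_channels or channels // 4
        self.branch_a = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
        )
        # 3x3 convolution fusing the concatenated branch outputs back to the
        # input channel count so the residual addition is well defined.
        self.fuse = nn.Conv2d(2 * mid, channels, kernel_size=3, padding=1)

    def forward(self, x):
        fused = self.fuse(torch.cat([self.branch_a(x), self.branch_b(x)], dim=1))
        return fused + x  # residual connection for effective gradient flow
```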
As a further preferred embodiment of the method, the local fusion module comprises a detail re-guiding branch, a local context branch and an original input mapping branch; the detail re-guiding branch passes the input feature map sequentially through a 1 × 1 convolutional layer, a max pooling layer, a 3 × 3 convolutional layer and a batch normalization layer; the local context branch passes the input feature map sequentially through a 1 × 1 convolutional layer, a deconvolution layer, a 3 × 3 convolutional layer and a batch normalization layer; and the original input mapping branch passes the input feature map sequentially through a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a batch normalization layer.
Specifically, as shown in fig. 4, the detail re-guiding branch is designed primarily to alleviate the loss of detail information caused by pooling. It takes as input the shallowest feature map in the stage adjacent to and preceding the detection layer, whose spatial resolution is twice that of the detection layer, so as to preserve spatial detail as much as possible. The input feature map first passes through a 1 × 1 convolutional layer to compress the channels, and its resolution is then reduced with a max pooling layer (Maxpooling) to obtain a feature map with the same resolution as the middle branch; finally, a convolutional layer and a batch normalization (BN) layer perform further feature transformation. The local context branch assists the localization and identification of the target by introducing local context information around it. Its input comes from the stage following the current detection layer, whose spatial resolution is half that of the detection-layer feature map. The input feature map first passes through a 1 × 1 convolutional layer to reduce the number of channels, a deconvolution layer then up-samples it to the same spatial resolution as the detection layer, and finally it passes through a 3 × 3 convolutional layer and a batch normalization layer. Unlike a common hourglass structure, the input of this branch is a feature stage adjacent to the detection layer, so the locality of the context features is preserved while the detection-layer semantics are enhanced. The original input mapping branch feeds the original feature map through a 1 × 1 convolutional layer for channel compression and a 3 × 3 convolutional layer for feature transformation before fusion, so as to control the extra computation the local fusion module might introduce and to fuse better with the features of the other two branches.
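The following sketch illustrates the three branches described above, assuming an element-wise sum as the fusion operator (the text above does not fix the operator) and treating all channel numbers as illustrative assumptions.

```python
import torch.nn as nn

class LocalFusion(nn.Module):
    # Inputs: a shallower map at twice the detection-layer resolution
    # (detail re-guiding branch), the detection-layer map itself (original
    # input mapping branch) and a deeper map at half the resolution
    # (local context branch).
    def __init__(self, shallow_ch, current_ch, deep_ch, out_ch):
        super().__init__()
        # Detail re-guiding branch: 1x1 conv -> max pooling -> 3x3 conv -> BN.
        self.detail = nn.Sequential(
            nn.Conv2d(shallow_ch, out_ch, kernel_size=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Local context branch: 1x1 conv -> deconvolution (x2) -> 3x3 conv -> BN.
        self.context = nn.Sequential(
            nn.Conv2d(deep_ch, out_ch, kernel_size=1),
            nn.ConvTranspose2d(out_ch, out_ch, kernel_size=2, stride=2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Original input mapping branch: 1x1 conv -> 3x3 conv -> BN.
        self.identity = nn.Sequential(
            nn.Conv2d(current_ch, out_ch, kernel_size=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, shallow, current, deep):
        # All three outputs share the detection-layer resolution, so they can
        # be fused element-wise before being passed to the detection head.
        return self.detail(shallow) + self.identity(current) + self.context(deep)
```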
As a preferred embodiment of the method, the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain the trained target detection network specifically includes:
dividing the data into a training set, a validation set and a test set according to a certain proportion;
taking the training set as the input of the target detection network during training and computing the network output through convolution and related operations to obtain a set of prediction boxes;
specifically, before training, a series of preprocessing rules for the input image are set, wherein the preprocessing operations that must be included include image normalization for stable training and changing image size to control computational complexity. During training, on the basis of necessary operation, a series of random preprocessing operations such as random clipping are introduced to achieve the purpose of data augmentation and enhance the performance of the network.
According to the classification subtask and the localization subtask, each prediction box in the set of prediction boxes comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction box category vector and the annotation box category vector as the loss function;
for the localization subtask, calculating the position loss between the prediction box and the annotation box with the Smooth L1 loss function (a minimal sketch of these two losses is given below);
calculating the gradients of the convolutional-layer parameters layer by layer from the computed loss using stochastic gradient descent, and updating the parameters of each layer in the network;
during training, evaluating the generalization of the network at fixed iteration intervals with the validation set as input, so as to guard against overfitting;
and after training is finished, evaluating the performance of the network with the test set as input, while saving parameters such as the convolution kernels and biases in the network to obtain the trained target detection network.
Specifically, in actual detection, the trained model can be restored simply by assigning each saved parameter value, by parameter name, to the parameter of the corresponding layer in the network; the restored model then serves as the basis for outputting detection results in the subsequent detection process.
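A minimal sketch of the two training losses mentioned above (cross entropy for classification, Smooth L1 for localization) is shown below; the matching of prediction boxes to annotation boxes and the loss weighting are assumed to be handled elsewhere and are not specified here.

```python
import torch.nn.functional as F

def detection_loss(cls_logits, cls_targets, box_preds, box_targets, loc_weight=1.0):
    # cls_logits: (num_boxes, num_classes); cls_targets: (num_boxes,) class ids.
    cls_loss = F.cross_entropy(cls_logits, cls_targets)
    # box_preds / box_targets: (num_matched_boxes, 4) position vectors.
    loc_loss = F.smooth_l1_loss(box_preds, box_targets)
    return cls_loss + loc_weight * loc_loss
```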
Further, as a preferred embodiment of the method, the step of acquiring the data to be detected, inputting the data to be detected to the trained target detection network, and outputting the detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting the image of the target to be detected into the trained target detection network, and outputting through the convolutional layers a sequence of 4-dimensional vectors representing prediction box positions and a sequence of N-dimensional vectors representing category predictions;
the detector discards a portion of low-quality results according to the N-dimensional category prediction vectors using a manually preset category confidence threshold, obtaining the remaining detection results;
and for the remaining detection results, computing the overlap between prediction boxes from the 4-dimensional position vectors and, together with the prediction box confidences, removing duplicate prediction boxes with a non-maximum suppression algorithm to obtain and output the final detection result of the detector.
Specifically, the detector first discards a portion of the low-quality results from the N-dimensional category prediction sequence using a manually preset category confidence threshold. The remaining detection results are then de-duplicated with a non-maximum suppression (NMS) algorithm, using the prediction box confidences and the overlap between prediction boxes computed from the 4-dimensional position vectors. The prediction boxes that remain constitute the final detection result of the detector.
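A minimal sketch of this post-processing, using the NMS implementation from torchvision, is shown below; the confidence and IoU thresholds are illustrative assumptions.

```python
from torchvision.ops import nms

def postprocess(boxes, class_scores, score_thresh=0.5, iou_thresh=0.45):
    # boxes: (num_boxes, 4) as (x1, y1, x2, y2); class_scores: (num_boxes, N).
    scores, labels = class_scores.max(dim=1)
    keep = scores > score_thresh              # discard low-quality results
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    kept = nms(boxes, scores, iou_thresh)     # de-duplicate overlapping boxes
    return boxes[kept], scores[kept], labels[kept]
```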
As shown in fig. 3, a target detection system based on a detection backbone and local feature optimization comprises the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on the long-neck backbone architecture and the local feature optimization module;
the training module is used for training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
and the detection module is used for acquiring data to be detected, inputting the data to be detected into the trained target detection network and outputting a detection result.
The contents of the above method embodiments all apply to the present system embodiment; the functions implemented by the present system embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are likewise the same.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A target detection method based on a detection backbone and local feature optimization, characterized by comprising the following steps: acquiring training data and preprocessing the training data to obtain preprocessed data; constructing a target detection network based on a long-neck backbone architecture and a local feature optimization module; training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network; and acquiring data to be detected, inputting the data to be detected into the trained target detection network, and outputting a detection result.

2. The target detection method based on a detection backbone and local feature optimization according to claim 1, characterized in that the step of acquiring training data and preprocessing the training data to obtain preprocessed data specifically comprises: collecting training data according to the problem domain and annotating the training data to obtain annotated training data; the training data comprise public datasets from the Internet and images captured in the field, and the information in the training data comprises the original material pictures together with annotation records of the target positions and categories in the pictures.

3. The target detection method based on a detection backbone and local feature optimization according to claim 2, characterized in that the target detection network comprises a long-neck residual backbone network and a local feature optimization module, the long-neck residual backbone network comprises six feature extraction convolution modules, and the local feature optimization module comprises a local fusion module and a scale supervision module.

4. The target detection method based on a detection backbone and local feature optimization according to claim 3, characterized in that the feature extraction convolution module comprises an Inception module, and the Inception module comprises two branches.

5. The target detection method based on a detection backbone and local feature optimization according to claim 4, characterized in that the local fusion module comprises a detail re-guiding branch, a local context branch and an original input mapping branch; the detail re-guiding branch passes the input feature map sequentially through a 1×1 convolutional layer, a max pooling layer, a 3×3 convolutional layer and a batch normalization layer; the local context branch passes the input feature map sequentially through a 1×1 convolutional layer, a deconvolution layer, a 3×3 convolutional layer and a batch normalization layer; and the original input mapping branch passes the input feature map sequentially through a 1×1 convolutional layer, a 3×3 convolutional layer and a batch normalization layer.

6. The target detection method based on a detection backbone and local feature optimization according to claim 5, characterized in that the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network specifically comprises: dividing the data into a training set, a validation set and a test set according to a certain proportion; taking the training set as the input of the target detection network during training and computing the network output through convolution and related operations to obtain a set of prediction boxes; according to the classification subtask and the localization subtask, each prediction box in the set of prediction boxes comprises a category vector and a position vector; for the classification subtask, using the cross entropy between the prediction box category vector and the annotation box category vector as the loss function; for the localization subtask, calculating the position loss between the prediction box and the annotation box with the Smooth L1 loss function; calculating the gradients of the convolutional-layer parameters layer by layer from the computed loss using stochastic gradient descent, and updating the parameters of each layer in the network; during training, evaluating the generalization of the network at fixed iteration intervals with the validation set as input; and after training is finished, evaluating the performance of the network with the test set as input, while saving parameters such as the convolution kernels and biases in the network to obtain the trained target detection network.

7. The target detection method based on a detection backbone and local feature optimization according to claim 3, characterized in that the step of acquiring the data to be detected, inputting the data to be detected into the trained target detection network, and outputting a detection result specifically comprises: acquiring the data to be detected to obtain an image of the target to be detected; inputting the image of the target to be detected into the trained target detection network, and outputting through the convolutional layers a sequence of 4-dimensional vectors representing prediction box positions and a sequence of N-dimensional vectors representing category predictions; the detector discards a portion of low-quality results according to the N-dimensional category prediction vectors using a manually preset category confidence threshold, obtaining the remaining detection results; and for the remaining detection results, computing the overlap between prediction boxes from the 4-dimensional position vectors and, together with the prediction box confidences, removing duplicate prediction boxes with a non-maximum suppression algorithm to obtain and output the final detection result of the detector.

8. A target detection system based on a detection backbone and local feature optimization, characterized by comprising the following modules: a preprocessing module, configured to acquire training data and preprocess the training data to obtain preprocessed data; a network construction module, configured to construct a target detection network based on a long-neck backbone architecture and a local feature optimization module; a training module, configured to train the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network; and a detection module, configured to acquire data to be detected, input the data to be detected into the trained target detection network, and output a detection result.
CN202011388976.2A 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization Active CN112396126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011388976.2A CN112396126B (en) 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011388976.2A CN112396126B (en) 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization

Publications (2)

Publication Number Publication Date
CN112396126A true CN112396126A (en) 2021-02-23
CN112396126B CN112396126B (en) 2023-09-22

Family

ID=74604938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011388976.2A Active CN112396126B (en) 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization

Country Status (1)

Country Link
CN (1) CN112396126B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554125A (en) * 2021-09-18 2021-10-26 四川翼飞视科技有限公司 Object detection apparatus, method and storage medium combining global and local features
CN114818931A (en) * 2022-04-27 2022-07-29 重庆邮电大学 Fruit image classification method based on small sample element learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN109784386A (en) * 2018-12-29 2019-05-21 天津大学 A method of it is detected with semantic segmentation helpers
CN110163875A (en) * 2019-05-23 2019-08-23 南京信息工程大学 One kind paying attention to pyramidal semi-supervised video object dividing method based on modulating network and feature
CN110188720A (en) * 2019-06-05 2019-08-30 上海云绅智能科技有限公司 A kind of object detection method and system based on convolutional neural networks
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A Small Target Detection and Recognition Method Based on Enhanced Feature Learning
CN111144329A (en) * 2019-12-29 2020-05-12 北京工业大学 A lightweight and fast crowd counting method based on multi-label

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN109784386A (en) * 2018-12-29 2019-05-21 天津大学 A method of it is detected with semantic segmentation helpers
CN110163875A (en) * 2019-05-23 2019-08-23 南京信息工程大学 One kind paying attention to pyramidal semi-supervised video object dividing method based on modulating network and feature
CN110188720A (en) * 2019-06-05 2019-08-30 上海云绅智能科技有限公司 A kind of object detection method and system based on convolutional neural networks
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A Small Target Detection and Recognition Method Based on Enhanced Feature Learning
CN111144329A (en) * 2019-12-29 2020-05-12 北京工业大学 A lightweight and fast crowd counting method based on multi-label

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554125A (en) * 2021-09-18 2021-10-26 四川翼飞视科技有限公司 Object detection apparatus, method and storage medium combining global and local features
CN113554125B (en) * 2021-09-18 2021-12-17 四川翼飞视科技有限公司 Object detection apparatus, method and storage medium combining global and local features
CN114818931A (en) * 2022-04-27 2022-07-29 重庆邮电大学 Fruit image classification method based on small sample element learning

Also Published As

Publication number Publication date
CN112396126B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
AU2019213369B2 (en) Non-local memory network for semi-supervised video object segmentation
CN111598174B (en) Model training method and image change analysis method based on semi-supervised adversarial learning
WO2021227366A1 (en) Method for automatically and accurately detecting plurality of small targets
CN111860235B (en) Generation method and system of attention remote sensing image description based on high and low level feature fusion
CN104217225B (en) A kind of sensation target detection and mask method
CN112541904B (en) Unsupervised remote sensing image change detection method, storage medium and computing device
CN111476302A (en) Faster-RCNN target object detection method based on deep reinforcement learning
CN111563508A (en) A Semantic Segmentation Method Based on Spatial Information Fusion
CN111460936A (en) Remote sensing image building extraction method, system and electronic equipment based on U-Net network
CN112541508A (en) Fruit segmentation and recognition method and system and fruit picking robot
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
CN110008853B (en) Pedestrian detection network and model training method, detection method, medium, equipment
CN108182260B (en) Multivariate time sequence classification method based on semantic selection
CN105678284A (en) Fixed-position human behavior analysis method
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
JP2024513596A (en) Image processing method and apparatus and computer readable storage medium
CN113505670B (en) Weakly supervised building extraction method from remote sensing images based on multi-scale CAM and superpixels
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN115410059B (en) Remote sensing image part supervision change detection method and device based on contrast loss
CN113283282A (en) Weak supervision time sequence action detection method based on time domain semantic features
CN117033657A (en) An information retrieval method and device
CN112396126A (en) Target detection method and system based on detection of main stem and local feature optimization
CN117872127A (en) Motor fault diagnosis method and equipment
CN116434076A (en) A Target Recognition Method of Remote Sensing Image Integrating Prior Knowledge
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant