CN112396126B - Target detection method and system based on detection trunk and local feature optimization - Google Patents

Info

Publication number: CN112396126B
Application number: CN202011388976.2A
Other versions: CN112396126A
Original language: Chinese (zh)
Inventors: 郑慧诚, 严志伟, 黄梓轩, 李烨, 陈绿然
Assignee: Sun Yat Sen University (original and current)
Legal status: Active (granted)

Classifications

    • G06F18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/25: Fusion techniques
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06V2201/07: Target detection

Abstract

The application discloses a target detection method and system based on detection trunk and local feature optimization. The method comprises the following steps: acquiring training data and preprocessing it to obtain preprocessed data; constructing a target detection network based on a long-neck trunk architecture and a local feature optimization module; training the target detection network with the preprocessed data under a preset training strategy to obtain a trained target detection network; and acquiring data to be detected, inputting it into the trained target detection network, and outputting a detection result. The system comprises a preprocessing module, a network construction module, a training module and a detection module. The application ensures satisfactory detector performance at a modest computational cost. The target detection method and system based on detection trunk and local feature optimization can be widely applied in the field of target detection networks.

Description

Target detection method and system based on detection trunk and local feature optimization
Technical Field
The application belongs to the field of target detection networks, and particularly relates to a target detection method and system based on detection trunk and local feature optimization.
Background
Object detection, as a fundamental task of computer vision, is widely applied and is a hot research area in both academia and industry. With the rise of deep learning, the field of target detection has developed greatly. However, current detectors perform poorly on small-scale targets, mainly because information is lost too quickly in the backbone network and the detection head models local information insufficiently.
The backbone network, as the basic feature-extraction structure, plays a decisive role in the target detection effect. Because training samples for target detection are generally scarce, current detectors mostly adopt a backbone pre-trained on a large image-classification dataset. The task difference causes a domain-shift problem during network fine-tuning, and adopting a pre-trained network also limits the structural design space of the backbone to a certain extent. Moreover, the commonly adopted backbones perform pooling too early, losing spatial detail information, which is unfavorable for the feature expression of small targets.
On the other hand, the detection head of current mainstream detectors generally takes a feature pyramid as input; the semantic information of the shallow features in the pyramid is insufficient, and the spatial information of the deep features is severely lost. How to enhance the feature expression of the detection layers and the detection of small-scale targets is therefore a problem that needs to be solved.
Disclosure of Invention
In order to solve the above technical problems, the application aims to provide a target detection method and system based on detection trunk and local feature optimization, which ensure that the detector achieves satisfactory performance at a modest computational cost.
The first technical scheme adopted by the application is as follows: a target detection method based on detection trunk and local feature optimization comprises the following steps:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a long-neck trunk architecture and a local feature optimization module;
training the target detection network based on the preprocessing data and a preset training strategy to obtain a trained target detection network;
and acquiring data to be detected, inputting the data to the trained target detection network, and outputting a detection result.
Further, the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically includes:
collecting training data according to the problem domain and marking to obtain marked training data;
the training data comprises a public data set from the Internet and an in-situ photographed image, and the information in the training data comprises original material pictures and labeling records of target positions and categories in the pictures.
Further, the target detection network comprises a long-neck residual backbone network and a local feature optimization module; the long-neck residual backbone network comprises six feature-extraction convolution modules, and the local feature optimization module comprises a local fusion module and a scale supervision module.
Further, the feature-extraction convolution module comprises an Inception module, and the Inception module comprises two branches.
Further, the local fusion module comprises a detail re-direction branch, a local context branch and an original input mapping branch. The detail re-direction branch passes the input feature map sequentially through a 1×1 convolution layer, a max-pooling layer, a 3×3 convolution layer and a batch-normalization layer; the local context branch passes the input feature map sequentially through a 1×1 convolution layer, a deconvolution layer, a 3×3 convolution layer and a batch-normalization layer; and the original input mapping branch passes the input feature map sequentially through a 1×1 convolution layer, a 3×3 convolution layer and a batch-normalization layer.
Further, the step of training the target detection network based on the preprocessing data and a preset training strategy to obtain a trained target detection network specifically includes:
dividing the data into a training set, a verification set and a test set according to a certain proportion;
the training set is used as input in the target detection network training process, and the network output is calculated through convolution and other operations to obtain a prediction frame set;
according to the classification subtask and the positioning subtask, each prediction frame in the prediction frame set comprises a category vector and a position vector;
for the classification subtasks, using cross entropy between the prediction frame class vector and the annotation frame class vector as a loss function;
for a positioning subtask, calculating the position loss of the prediction frame and the annotation frame through a Smooth L1 loss function;
calculating the gradient of the parameters in the convolution layer by layer according to the calculated loss and a random gradient descent method, and updating the parameters of each layer in the network;
in the training process, evaluating the generalization ability of the network on the validation set every fixed number of iterations;
after training, the performance of the network is evaluated with the test set as input, and parameters such as the convolution kernels and biases in the network are saved, yielding the trained target detection network.
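The two loss functions named in the training steps above can be sketched as minimal scalar functions (a hypothetical illustration in plain Python, not the patent's actual network code; the example values are arbitrary):

```python
import math

def cross_entropy(pred_probs, true_class):
    # Classification subtask: cross entropy between the prediction-frame
    # class vector and the annotation-frame class (negative log-likelihood
    # of the annotated class).
    return -math.log(pred_probs[true_class])

def smooth_l1(pred_box, gt_box):
    # Positioning subtask: Smooth L1 loss summed over the 4 coordinates
    # of the position vector; quadratic for small errors, linear otherwise.
    total = 0.0
    for p, g in zip(pred_box, gt_box):
        d = abs(p - g)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

# Illustrative values: a fairly confident correct class prediction
# and a prediction frame close to the annotation frame.
cls_loss = cross_entropy([0.1, 0.8, 0.1], true_class=1)
loc_loss = smooth_l1([10.0, 10.0, 50.0, 50.0], [10.5, 10.0, 50.0, 52.0])
```

In the actual network these losses are computed over the whole prediction-frame set and combined into the objective that drives the stochastic gradient descent step above.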
Further, the step of acquiring the data to be detected and inputting the data to the trained target detection network and outputting the detection result specifically includes:
acquiring the data to be detected to obtain an image of the target to be detected;
inputting an image of a target to be detected into a trained target detection network, and outputting a 4-dimensional vector sequence representing the position of a predicted frame and an N-dimensional vector sequence expressing category prediction through a convolution layer;
according to the N-dimensional vector sequence of category predictions, the detector discards a part of the low-quality results through a manually preset category confidence threshold, obtaining the remaining detection results;
and de-duplicating the prediction frames with a non-maximum suppression algorithm, using the prediction-frame confidences and the overlap rates between prediction frames calculated from the 4-dimensional position vectors, to obtain and output the final detection result of the detector.
The beneficial effects of the method and system are as follows: the local feature optimization module, designed for spatial local information fusion, not only enhances the semantic information of the detection layers but also preserves the spatial local information of the detection-head features, which is particularly beneficial for small-target detection. A suitable learning strategy is further provided to overcome the performance degradation caused by random initialization of the backbone parameters, ensuring that the detector achieves satisfactory performance at a modest computational cost.
Drawings
FIG. 1 is a network architecture of a target detection network based on detection backbone and local feature optimization of the present application;
FIG. 2 is a flow chart of steps of a target detection method based on detection of backbone and local feature optimization in accordance with the present application;
FIG. 3 is a block diagram of a target detection system based on detection backbone and local feature optimization in accordance with the present application;
FIG. 4 is a branching structure in a local fusion module in accordance with an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
As shown in fig. 1 and 2, the present application provides a target detection method based on detection of a trunk and local feature optimization, the method comprising the steps of:
s1, acquiring training data and preprocessing the training data to obtain preprocessed data;
s2, constructing a target detection network based on a long-neck trunk architecture and a local feature optimization module;
s3, training the target detection network based on the preprocessing data and a preset training strategy to obtain a trained target detection network;
specifically, in order to overcome performance degradation caused by no pre-training, the training strategy is optimized to ensure that similar or even better performance is obtained under the same training resources, and the specific improvement is as follows: (1) differential learning rate: the part of the network in front of the local acceptance module is consistent with the existing ResNet structure, and meanwhile, the low-level visual features have stronger generalization capability, so that the pre-training initialization parameters can be adopted. For the pre-trained network part, adopting a smaller learning rate to maintain pre-training knowledge; for randomly initialized parameters, a large learning rate is employed to facilitate searching of the network in the parameter space. By adopting the differential learning strategy, the detection network not only can have generalization performance brought by pre-training, but also can ensure faster learning convergence speed. (2) enhancing initial training stability: the network adopts the feature pyramid structure to detect the target, which is beneficial to enhancing the robustness to the target scale, but the high-resolution feature map in the detection layer easily generates overlarge gradient in the initial stage of training, and influences the convergence of the learning process. The application adopts the preheating technology, ensures the gradual optimization of the network by gradually increasing the learning rate in the initial stage of training, and prevents the deviation from the optimization target too far in the initial stage, thereby ensuring the learning process to be more stable. 
With warm-up, the statistics the network obtains at the initial stage of training are more accurate, which alleviates the dependence of existing randomly initialized target detection networks on large-batch learning, so satisfactory performance can be obtained with smaller computational resource requirements.
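The warm-up and differential learning rate described above might be combined into a single schedule function like the following sketch (all names, step counts and ratios are illustrative assumptions, not values from the patent):

```python
def learning_rate(step, group, base_lr=0.01, warmup_steps=500,
                  pretrained_scale=0.1):
    # Warm-up: ramp the learning rate linearly from ~0 to base_lr over
    # the first warmup_steps iterations, so the high-resolution detection
    # layers do not produce destabilizing gradients early in training.
    warm = min(1.0, (step + 1) / warmup_steps)
    # Differential learning rate: pre-trained layers (before the local
    # Inception module) learn slowly to retain pre-trained knowledge;
    # randomly initialized layers learn at the full rate.
    scale = pretrained_scale if group == "pretrained" else 1.0
    return base_lr * warm * scale

lr_random_end = learning_rate(499, "random")        # full rate after warm-up
lr_pretrained_start = learning_rate(0, "pretrained")  # tiny initial rate
```

Each parameter group would be assigned its rate at every optimizer step; the exact ramp shape and scale factor are design choices left open by the text.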
S4, acquiring data to be detected, inputting the data to be detected to a trained target detection network, and outputting a detection result.
Further as a preferred embodiment of the method, the step of obtaining training data and preprocessing the training data to obtain preprocessed data specifically includes:
collecting training data according to the problem domain and marking to obtain marked training data;
the training data comprises a public data set from the Internet and an in-situ photographed image, and the information in the training data comprises original material pictures and labeling records of target positions and categories in the pictures.
Specifically, a label box is generated here, containing a label box category vector and a location vector.
Further as a preferred embodiment of the method, the target detection network includes a long-neck residual backbone network and a local feature optimization module, the long-neck residual backbone network includes six feature extraction convolution modules, and the local feature optimization module includes a local fusion module and a scale supervision module.
Specifically, as shown in the upper half of fig. 1, the backbone is a long-neck residual backbone network. It basically adopts a residual structure but differs from a general ResNet in two points: (1) a local Inception module is added to obtain receptive fields with multiple aspect ratios; (2) the neck part is longer, which facilitates the extraction of richer spatial detail features;
in addition, as shown in the upper left of fig. 1, the long-neck trunk architecture is based on a residual network and mainly comprises six convolution levels responsible for feature extraction, one of which is the local Inception module. Unlike a common residual network, the long-neck backbone removes the max-pooling layer after the conv1 level, doubling the resolution of the input feature maps of conv2_x and all subsequent levels. Removing the pooling layer also slows the growth of the receptive field in the backbone, which facilitates the capture of fine-grained features.
Simply removing the pooling layer increases feature resolution and therefore increases computation to a certain extent. The application also provides a simplified version of the long-neck residual backbone (LN-ResNet-light). Compared with LN-ResNet, LN-ResNet-light retains the max-pooling layer after conv1 from the original ResNet structure, while reducing the convolution stride of the first residual block of conv3_x to 1, thereby reducing the overall computational cost.
The long-neck backbone network (LN-ResNet) provided by the application is mainly used for extracting fine-grained spatial information from an image. By extending the depth of the neck (the convolution layers before each detection layer), the network strengthens the extraction of high-resolution features, alleviates the overly fast loss of spatial detail information in common backbone networks, and strengthens the feature expression of small-scale targets.
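The resolution effect of removing the max-pooling layer can be checked with simple stride arithmetic (a hypothetical sketch; the 512-pixel input size is an arbitrary example, not from the patent):

```python
def feature_size(size, strides):
    # Spatial size after a sequence of downsampling stages,
    # each dividing the resolution by its stride.
    for s in strides:
        size //= s
    return size

# Standard ResNet stem: conv1 (stride 2) followed by a max-pooling
# layer (stride 2), so conv2_x sees a 4x-downsampled feature map.
resnet_conv2_in = feature_size(512, [2, 2])
# LN-ResNet removes the max-pool after conv1, so conv2_x and all
# subsequent levels see feature maps of twice the resolution.
ln_resnet_conv2_in = feature_size(512, [2])
```

This doubled resolution is exactly what preserves the fine-grained spatial detail for small targets, at the computational cost that LN-ResNet-light then trades back.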
Further, as a preferred embodiment of the method, the feature-extraction convolution module includes an Inception module, and the Inception module includes two branches.
Specifically, the local Inception module contains two branches. In both branches, the input features first pass through a 1×1 convolution layer, which compresses the number of channels to reduce computation.
After that, the two branches contain a 1×3 convolution and a 3×1 convolution respectively. These two parallel convolution layers, unlike the serial arrangement in a common Inception module, are mainly used to obtain receptive fields with different aspect ratios, and thus model targets of different aspect ratios more effectively. In addition, these convolution layers help enlarge the receptive field and deepen the network, thereby enhancing semantic expression.
Finally, the output features of the two branches are concatenated and then fused through a 3×3 convolution layer. The fused output is added to the input of the whole module to form a residual structure, ensuring effective propagation of gradients.
Further, as a preferred embodiment of the method, the local fusion module includes a detail re-direction branch, a local context branch and an original input mapping branch. The detail re-direction branch passes the input feature map sequentially through a 1×1 convolution layer, a max-pooling layer, a 3×3 convolution layer and a batch-normalization layer; the local context branch passes the input feature map sequentially through a 1×1 convolution layer, a deconvolution layer, a 3×3 convolution layer and a batch-normalization layer; and the original input mapping branch passes the input feature map sequentially through a 1×1 convolution layer, a 3×3 convolution layer and a batch-normalization layer.
Specifically, as shown in fig. 4, the detail re-direction branch is designed mainly to alleviate the loss of detail information caused by pooling. It takes as input the shallowest feature map of the level immediately preceding the detection layer, which has twice the spatial resolution, so as to preserve spatial detail as much as possible. The input feature map first passes through a 1×1 convolution layer to compress the channels, and a max-pooling layer then reduces the resolution to match that of the detection layer. Finally, a convolution layer and a batch-normalization (BN) layer further transform the features. The local context branch assists the localization and identification of the target by introducing its local context information. Its input comes from the level following the current detection layer, with half the spatial resolution of the detection-layer feature map. The number of channels of the input feature map is first reduced by a 1×1 convolution layer; a deconvolution layer then upsamples the feature map to the same spatial resolution as the detection layer; finally, the feature map passes through a 3×3 convolution layer and a batch-normalization layer. Unlike a common hourglass structure, the input of this branch is a feature level adjacent to the detection layer, which enhances the semantics of the detection layer while keeping the context features local. The original input mapping branch feeds the original feature map through a 1×1 convolution layer and a 3×3 convolution layer for channel compression and feature transformation before fusion, so as to control the additional computation the local fusion module may introduce and to fuse better with the features of the other two branches.
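The three branches must produce feature maps of equal resolution before fusion; the alignment follows from simple stride arithmetic (a hypothetical sketch with an illustrative detection-layer size of 64):

```python
def maxpool2x(size):
    # A stride-2 max-pooling layer halves spatial resolution.
    return size // 2

def deconv2x(size):
    # A stride-2 deconvolution layer doubles spatial resolution.
    return size * 2

det = 64                  # detection-layer resolution (illustrative)
detail_in = 2 * det       # detail re-direction input: shallower level, 2x resolution
context_in = det // 2     # local context input: deeper level, half resolution

# After each branch's resolution-changing layer, all three branches
# (the original input mapping branch needs no change) match the
# detection layer, so their outputs can be fused.
aligned = (maxpool2x(detail_in), deconv2x(context_in), det)
```

The max-pooling and deconvolution layers thus serve opposite roles: one pulls detail down from the shallower level, the other lifts context up from the deeper level.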
Further as a preferred embodiment of the method, the step of training the target detection network based on the preprocessing data and a preset training policy to obtain a trained target detection network specifically includes:
dividing the data into a training set, a verification set and a test set according to a certain proportion;
the training set is used as input in the target detection network training process, and the network output is calculated through convolution and other operations to obtain a prediction frame set;
specifically, a series of preprocessing rules for the input image are set prior to training, wherein the preprocessing operations that must be involved include stabilizing the trained image normalization and controlling the changing image size of the computational complexity. During training, a series of random preprocessing operations such as random clipping are introduced on the basis of necessary operations to achieve the purpose of data augmentation and enhance the performance of the network.
According to the classification subtask and the positioning subtask, each prediction frame in the prediction frame set comprises a category vector and a position vector;
for the classification subtasks, using cross entropy between the prediction frame class vector and the annotation frame class vector as a loss function;
for a positioning subtask, calculating the position loss of the prediction frame and the annotation frame through a Smooth L1 loss function;
calculating the gradient of the parameters in the convolution layer by layer according to the calculated loss and a random gradient descent method, and updating the parameters of each layer in the network;
in the training process, the generalization ability of the network is evaluated on the validation set every fixed number of iterations, to guard against overfitting;
after training, the performance of the network is evaluated with the test set as input, and parameters such as the convolution kernels and biases in the network are saved, yielding the trained target detection network.
Specifically, in actual detection, the trained model can be restored simply by assigning the saved values to the corresponding layer parameters by name; these parameters then serve as the basis for outputting detection results in the subsequent detection process.
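The essential and random preprocessing operations described in the training steps above (normalization, resizing, random cropping) can be sketched as follows (a hypothetical minimal illustration; the mean/std values and crop sizes are arbitrary examples, not from the patent):

```python
import random

def normalize(pixels, mean, std):
    # Image normalization: standardize pixel values to stabilize
    # training (flattened single-channel list for illustration).
    return [(p - mean) / std for p in pixels]

def random_crop(width, height, crop_w, crop_h, rng=random):
    # Data augmentation: pick a crop window uniformly at random
    # inside the image bounds.
    x = rng.randint(0, width - crop_w)
    y = rng.randint(0, height - crop_h)
    return x, y, crop_w, crop_h

norm = normalize([0, 128, 255], mean=127.5, std=127.5)
crop = random_crop(100, 100, 50, 50)
```

A full pipeline would also resize the cropped image to a fixed input size and remap annotation-frame coordinates into the crop, steps omitted here for brevity.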
Further, as a preferred embodiment of the method, the step of acquiring the data to be tested, inputting the data to the trained target detection network, and outputting the detection result specifically includes:
obtaining an image of a target to be detected by taking data to be detected;
inputting an image of a target to be detected into a trained target detection network, and outputting a 4-dimensional vector sequence representing the position of a predicted frame and an N-dimensional vector sequence expressing category prediction through a convolution layer;
the detector discards a part of the low-quality results through a manually preset category confidence threshold according to the N-dimensional vector sequence of category predictions, obtaining the remaining detection results;
and de-duplicating the prediction frames with a non-maximum suppression algorithm, using the prediction-frame confidences and the overlap rates between prediction frames calculated from the 4-dimensional position vectors, to obtain and output the final detection result of the detector.
Specifically, the detector first discards a portion of the low-quality results from the N-dimensional sequence of category-prediction vectors using a manually preset category confidence threshold. The remaining detection results are then de-duplicated with a non-maximum suppression (NMS) algorithm, using the prediction-frame confidences and the overlap rate between prediction frames computed from the 4-dimensional position vectors. The prediction frames that remain constitute the final detection result of the detector.
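The confidence filtering and NMS de-duplication just described can be sketched in plain Python (a hypothetical minimal implementation; the 0.5 thresholds are illustrative, not values from the patent):

```python
def iou(a, b):
    # Overlap rate (intersection over union) of two prediction frames
    # given as (x1, y1, x2, y2) position vectors.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thr=0.5, iou_thr=0.5):
    # 1) Discard low-quality results below the confidence threshold.
    # 2) Greedily keep the highest-scoring remaining frame, suppressing
    #    any other frame whose overlap with a kept frame exceeds iou_thr.
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping frames of the same object collapse to the single higher-confidence frame, while a distant frame survives untouched.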
As shown in fig. 3, a target detection system based on detection trunk and local feature optimization includes the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on the long-neck trunk architecture and the local feature optimization module;
the training module is used for training the target detection network based on the preprocessing data and a preset training strategy to obtain a trained target detection network;
the detection module is used for acquiring the data to be detected, inputting the data to the trained target detection network and outputting a detection result.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (5)

1. The target detection method based on detection trunk and local feature optimization is characterized by comprising the following steps of:
acquiring training data and preprocessing the training data to obtain preprocessed data;
constructing a target detection network based on a long-neck trunk architecture and a local feature optimization module;
training the target detection network based on the preprocessing data and a preset training strategy to obtain a trained target detection network;
obtaining data to be detected, inputting the data to a trained target detection network, and outputting a detection result;
the target detection network comprises a long-neck residual backbone network and a local feature optimization module, wherein the long-neck residual backbone network comprises six feature extraction convolution modules, and the local feature optimization module comprises a local fusion module and a scale supervision module;
the local fusion module comprises a detail re-direction branch, a local context branch and an original input mapping branch, wherein the detail re-direction branch sequentially passes an input feature map through a 1×1 convolution layer, a max-pooling layer, a 3×3 convolution layer and a batch normalization layer, the local context branch sequentially passes the input feature map through a 1×1 convolution layer, a deconvolution layer, a 3×3 convolution layer and a batch normalization layer, and the original input mapping branch sequentially passes the input feature map through a 1×1 convolution layer, a 3×3 convolution layer and a batch normalization layer;
the step of acquiring the data to be detected and inputting the data to the trained target detection network and outputting the detection result specifically comprises the following steps:
acquiring data to be detected to obtain an image of a target to be detected;
inputting an image of a target to be detected into a trained target detection network, and outputting a 4-dimensional vector sequence representing the position of a predicted frame and an N-dimensional vector sequence expressing category prediction through a convolution layer;
the detector discards a part of low-quality results through a manually preset category confidence threshold value according to the N-dimensional vector sequence of category prediction to obtain residual detection results;
and calculating the overlapping rate between the prediction frames through the confidence coefficient of the prediction frames and the position 4-dimensional vector, de-duplicating the prediction frames based on a non-maximum suppression algorithm, and obtaining and outputting the final detection result of the detector.
2. The target detection method based on a detection backbone and local feature optimization according to claim 1, wherein the step of acquiring training data and preprocessing it to obtain preprocessed data specifically comprises:
collecting training data according to the problem domain and annotating it to obtain annotated training data;
the training data comprises public datasets from the Internet and images photographed on site, and the information in the training data comprises the original material pictures and annotation records of the target positions and categories in the pictures.
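An annotation record of the kind described in claim 2 (an original picture plus the positions and categories of its targets) might be represented as below; the field names and the (x1, y1, x2, y2) box convention are illustrative assumptions, not part of the claim.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One labeled target: a category name and its box (x1, y1, x2, y2)."""
    category: str
    box: tuple

@dataclass
class TrainingSample:
    """One annotated training image, per the record described in claim 2."""
    image_path: str
    annotations: list = field(default_factory=list)

# Hypothetical example record for one on-site photograph.
sample = TrainingSample("images/0001.jpg",
                        [Annotation("person", (12, 30, 84, 190))])
```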
3. The target detection method based on a detection backbone and local feature optimization according to claim 2, wherein the feature extraction convolution module comprises an Inception module, and the Inception module comprises two branches.
4. The target detection method based on a detection backbone and local feature optimization according to claim 3, wherein the step of training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network specifically comprises the following steps:
dividing the data into a training set, a validation set and a test set in a predetermined ratio;
using the training set as input during target detection network training, and computing the network output through convolution and other operations to obtain a set of prediction boxes;
according to the classification subtask and the localization subtask, each prediction box in the set comprises a category vector and a position vector;
for the classification subtask, using the cross entropy between the prediction box category vector and the annotation box category vector as the loss function;
for the localization subtask, computing the position loss between the prediction box and the annotation box with the Smooth L1 loss function;
computing the gradients of the convolution parameters layer by layer from the calculated loss via stochastic gradient descent, and updating the parameters of each layer in the network;
during training, evaluating the generalization ability of the network on the validation set every fixed number of iterations;
after training, evaluating the performance of the network with the test set as input, and saving the network parameters such as convolution kernels and biases, thereby obtaining the trained target detection network.
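A minimal sketch of the two per-box loss terms named in claim 4: cross entropy for the classification subtask and Smooth L1 for the localization subtask. The anchor-matching scheme and the weighting between the two terms are not given in the claims, so only the loss formulas themselves are illustrated.

```python
import math

def cross_entropy(pred_probs, true_class):
    """Classification loss: negative log-likelihood of the annotated class."""
    return -math.log(max(pred_probs[true_class], 1e-12))

def smooth_l1(pred_box, true_box):
    """Localization loss: Smooth L1 summed over the 4 position coordinates.

    Quadratic for small errors (|d| < 1), linear for large ones, which makes
    it less sensitive to outlier boxes than a plain L2 loss.
    """
    total = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total
```

In a full training loop these two values would be summed per matched box and backpropagated through the network by stochastic gradient descent, as the claim's subsequent step describes.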
5. A target detection system based on a detection backbone and local feature optimization, characterized by comprising the following modules:
the preprocessing module is used for acquiring training data and preprocessing the training data to obtain preprocessed data;
the network construction module is used for constructing a target detection network based on a long-neck backbone architecture and a local feature optimization module;
the training module is used for training the target detection network based on the preprocessed data and a preset training strategy to obtain a trained target detection network;
the detection module is used for acquiring data to be detected, inputting it into the trained target detection network and outputting the detection result;
the target detection network comprises a long-neck residual backbone network and a local feature optimization module, wherein the long-neck residual backbone network comprises six feature extraction convolution modules, and the local feature optimization module comprises a local fusion module and a scale supervision module;
the local fusion module comprises a detail re-leading branch, a local context branch and an original input mapping branch, wherein the detail re-leading branch sequentially passes the input feature map through a 1×1 convolution layer, a max-pooling layer, a 3×3 convolution layer and a batch normalization layer; the local context branch sequentially passes the input feature map through a 1×1 convolution layer, a deconvolution layer, a 3×3 convolution layer and a batch normalization layer; and the original input mapping branch sequentially passes the input feature map through a 1×1 convolution layer, a 3×3 convolution layer and a batch normalization layer;
the acquiring of the data to be detected, inputting it into the trained target detection network and outputting the detection result specifically comprises the following steps:
acquiring the data to be detected to obtain an image of the target to be detected; inputting the image of the target to be detected into the trained target detection network, which outputs, through a convolution layer, a sequence of 4-dimensional vectors representing predicted box positions and a sequence of N-dimensional vectors representing category predictions; according to the N-dimensional category prediction vectors, the detector discards low-quality results whose category confidence falls below a manually preset threshold, retaining the remaining detection results; and computing the overlap between prediction boxes from their 4-dimensional position vectors, removing duplicate boxes by a non-maximum suppression algorithm according to their confidence scores, and obtaining and outputting the final detection result of the detector.
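The three parallel branches of the local fusion module described in claim 5 can be sketched in PyTorch roughly as follows. The channel counts, the pooling/deconvolution strides, and the way the branch outputs are merged (resizing to the input resolution and summing is assumed here) are not specified in the claims, so they are illustrative choices only.

```python
import torch
import torch.nn as nn

class LocalFusionModule(nn.Module):
    """Sketch of the three-branch local fusion module (illustrative sizes)."""

    def __init__(self, in_ch: int, mid_ch: int = 64):
        super().__init__()
        # Detail re-leading branch: 1x1 conv -> max pool -> 3x3 conv -> BN.
        self.detail = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
        )
        # Local context branch: 1x1 conv -> deconvolution -> 3x3 conv -> BN.
        self.context = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.ConvTranspose2d(mid_ch, mid_ch, kernel_size=2, stride=2),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
        )
        # Original input mapping branch: 1x1 conv -> 3x3 conv -> BN.
        self.identity = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        # Resize the down-/up-sampled branch outputs back to the input size
        # so the three outputs can be fused (here: element-wise sum).
        d = nn.functional.interpolate(self.detail(x), size=(h, w))
        c = nn.functional.interpolate(self.context(x), size=(h, w))
        return d + c + self.identity(x)
```

The intent of the structure is that the pooled branch re-emphasizes fine detail at a coarser scale while the deconvolution branch gathers local context at a finer scale, with the 1×1/3×3 branch preserving the original mapping.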
CN202011388976.2A 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization Active CN112396126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011388976.2A CN112396126B (en) 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization


Publications (2)

Publication Number Publication Date
CN112396126A CN112396126A (en) 2021-02-23
CN112396126B true CN112396126B (en) 2023-09-22

Family

ID=74604938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011388976.2A Active CN112396126B (en) 2020-12-02 2020-12-02 Target detection method and system based on detection trunk and local feature optimization

Country Status (1)

Country Link
CN (1) CN112396126B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554125B (en) * 2021-09-18 2021-12-17 四川翼飞视科技有限公司 Object detection apparatus, method and storage medium combining global and local features

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875595A (en) * 2018-05-29 2018-11-23 重庆大学 A kind of Driving Scene object detection method merged based on deep learning and multilayer feature
CN109784386A (en) * 2018-12-29 2019-05-21 天津大学 A method of it is detected with semantic segmentation helpers
CN110163875A (en) * 2019-05-23 2019-08-23 南京信息工程大学 One kind paying attention to pyramidal semi-supervised video object dividing method based on modulating network and feature
CN110188720A (en) * 2019-06-05 2019-08-30 上海云绅智能科技有限公司 A kind of object detection method and system based on convolutional neural networks
CN110503112A (en) * 2019-08-27 2019-11-26 电子科技大学 A kind of small target deteection of Enhanced feature study and recognition methods
CN111144329A (en) * 2019-12-29 2020-05-12 北京工业大学 Light-weight rapid crowd counting method based on multiple labels



Similar Documents

Publication Publication Date Title
CN111126472B (en) SSD (solid State disk) -based improved target detection method
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
WO2021227366A1 (en) Method for automatically and accurately detecting plurality of small targets
CN114022432B (en) Insulator defect detection method based on improved yolov5
US11538286B2 (en) Method and apparatus for vehicle damage assessment, electronic device, and computer storage medium
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN109472193A (en) Method for detecting human face and device
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN112381763A (en) Surface defect detection method
CN112529931B (en) Method and system for foreground segmentation
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN111368634A (en) Human head detection method, system and storage medium based on neural network
CN112200772A (en) Pox check out test set
CN112396126B (en) Target detection method and system based on detection trunk and local feature optimization
CN113627504B (en) Multi-mode multi-scale feature fusion target detection method based on generation of countermeasure network
CN111027542A (en) Target detection method improved based on fast RCNN algorithm
CN112633100B (en) Behavior recognition method, behavior recognition device, electronic equipment and storage medium
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN117058716A (en) Cross-domain behavior recognition method and device based on image pre-fusion
CN110956097A (en) Method and module for extracting occluded human body and method and device for scene conversion
CN116110005A (en) Crowd behavior attribute counting method, system and product
CN115147727A (en) Method and system for extracting impervious surface of remote sensing image
CN115937565A (en) Hyperspectral image classification method based on self-adaptive L-BFGS algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant