CN112464954A - Lightweight target detection network applied to embedded equipment and training method - Google Patents

Lightweight target detection network applied to embedded equipment and training method

Info

Publication number
CN112464954A
CN112464954A
Authority
CN
China
Prior art keywords
network
size
target
feature
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011411218.8A
Other languages
Chinese (zh)
Inventor
王伟栋
沈修平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI ULUCU ELECTRONIC TECHNOLOGY CO LTD
Original Assignee
SHANGHAI ULUCU ELECTRONIC TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI ULUCU ELECTRONIC TECHNOLOGY CO LTD filed Critical SHANGHAI ULUCU ELECTRONIC TECHNOLOGY CO LTD
Priority to CN202011411218.8A priority Critical patent/CN112464954A/en
Publication of CN112464954A publication Critical patent/CN112464954A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a lightweight target detection network applied to embedded devices and a corresponding training method. The detection network comprises a backbone network, a prediction network, and a neck network connecting the backbone network to the prediction network. The backbone and neck networks compress the parameter count while retaining strong feature expressiveness, balancing detection accuracy and efficiency. The backbone network maps the input image into feature maps of different scales for detecting targets of different sizes, and fuses each small-size feature map with the adjacent large-size feature map by up-sampling to supplement the semantic information lacking in the large-size features. The fused feature maps are further refined by lightweight residual blocks in the neck network, after which the network predicts, at each pixel of the feature map, the probability that a target center is present together with the corresponding center and width-height offsets.

Description

Lightweight target detection network applied to embedded equipment and training method
Technical Field
The invention belongs to the field of target detection in computer vision, and particularly relates to a lightweight target detection network suitable for embedded devices and a corresponding training method.
Background
With the advent of deep learning techniques, object detection, an intensely researched direction in computer vision, has made tremendous progress in recent years. Compared with traditional target detection techniques, target detection based on deep neural networks offers both high speed and high accuracy. From the early two-stage detection networks such as Fast/Faster R-CNN to single-stage detection networks such as SSD and YOLO, detection networks have markedly improved detection efficiency while maintaining detection accuracy.
The main components of a deep neural network are convolution, pooling and activation, which are combined to construct a series of nonlinear transformations. A larger number of parameters generally means stronger nonlinear expression and generalization capability, but it also increases computational complexity, so the good performance of deep learning usually depends on devices with strong computing power. Limited computing power and memory have made it impossible for mainstream deep neural networks to run on embedded devices.
Because neural networks are computationally expensive and place heavy demands on device computing power, target detection networks generally need to run on devices equipped with high-performance GPUs (such as cloud servers). However, given limited network bandwidth and the need for real-time response, applications with strict real-time requirements cannot rely on cloud-based processing. By contrast, computation on the embedded device itself does not depend on the network and avoids transmission delay.
However, embedded devices have limited computing power, and running real-time target detection on them relies on an efficient neural network, which means the network must undergo structural simplification and parameter compression.
Disclosure of Invention
The invention aims to provide a lightweight target detection network applied to embedded devices and a corresponding training method, so that an embedded device (such as a camera) can detect and localize targets in real time.
In a first aspect, the present invention provides a lightweight target detection network for embedded devices, the network comprising a backbone network, a neck network and a prediction network;
the input to the backbone network in the detection network is a preprocessed image to be detected; the backbone network maps the input image to feature maps of multiple sizes for detecting targets of different sizes. The neck network is composed of a plurality of lightweight residual blocks; its input end is connected to the output end of the backbone network, feature maps of adjacent sizes are fused in turn, each residual block inside the neck network further enhances the feature expressiveness of the feature map of the corresponding size, and the fused feature maps are finally output to the subsequent prediction network for result prediction.
The backbone network is composed of a plurality of two-way dense modules that extract features of the input feature map under receptive fields of different scales; the feature map produced by each output layer is 1/2 the size of the previous output feature map. The neck network up-samples the feature map of the current size by nearest-neighbor interpolation and then fuses it with the corresponding feature map output by the previous layer through concatenation. The prediction network is similar to that of conventional single-stage target detection and is a two-branch network: the two branches respectively predict the category of the target and the offsets of the target center and width-height relative to the current pixel position.
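For reference, the fusion step described above (nearest-neighbor up-sampling of the smaller map followed by channel-wise concatenation with the adjacent larger map) can be sketched as follows; the channel counts and tensor shapes are illustrative assumptions and are not taken from the patent text:

```python
import torch
import torch.nn.functional as F

def fuse_adjacent(feat_small: torch.Tensor, feat_large: torch.Tensor) -> torch.Tensor:
    """Up-sample the smaller feature map to the larger map's resolution and concatenate."""
    up = F.interpolate(feat_small, size=feat_large.shape[2:], mode="nearest")
    return torch.cat([up, feat_large], dim=1)   # fuse by channel-wise concatenation

feat_28 = torch.randn(1, 256, 28, 28)    # smaller map from the deeper stage (shapes assumed)
feat_56 = torch.randn(1, 128, 56, 56)    # larger map from the earlier stage
fused = fuse_adjacent(feat_28, feat_56)  # -> [1, 384, 56, 56]
```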
Finally, all prediction results are processed with non-maximum suppression to filter out coordinate boxes that contain no object and redundant coordinate boxes surrounding the same target, as sketched below.
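The suppression step is the standard greedy procedure; a minimal sketch follows, where the 0.45 IoU threshold is an illustrative assumption (the patent also allows improved variants such as Soft-NMS):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.45) -> list:
    """boxes: [K, 4] as (x1, y1, x2, y2); returns indices of the kept boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        # intersection of the current best box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = order[1:][iou < iou_thresh]   # drop boxes that overlap too much
    return keep
```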
As an optional implementation manner, according to the distribution of the size of the detection target and the detection efficiency, the backbone network may be made to output feature maps of 2 to 3 sizes.
As an optional implementation manner, the size of the backbone network input layer may be adjusted based on the aspect-ratio distribution of the detection target coordinate boxes.
As an alternative implementation, the size of the first output feature map of the backbone network may be 1/8 or 1/4 of the input size, depending on the size of the detection target relative to the size of the original image.
As an alternative implementation manner, a spatial attention mechanism can be added to the prediction network to enhance its ability to predict the target center position.
As an alternative implementation, based on the degree of occlusion between detection targets of the same kind, an improved non-maximum suppression method may be used, such as Soft-NMS or DIoU-NMS.
In a second aspect, the present invention provides a training method for the lightweight network, including the following steps:
1) randomly extracting batch_size × 2 training sample images, stitching them together in pairs, with the stitching direction chosen randomly between horizontal and vertical (see the sketch after this list);
2) scaling the stitched image to a fixed size and normalizing the pixel values to the range 0-1;
3) inputting the preprocessed training sample images into a backbone network to obtain feature maps of a plurality of sizes;
4) obtaining cluster centers of the target box sizes in the training samples with the k-means clustering algorithm; for different numbers of cluster centers, computing the average overlap between the width-height of the coordinate boxes and the width-height of the cluster centers; from the curve of average overlap versus number of cluster centers, selecting the largest number of cluster centers at which the gradient still changes noticeably; and assigning the corresponding cluster-center sizes, from small to large, evenly to the feature maps of corresponding sizes;
5) taking each pixel in the feature map as a center and constructing the corresponding number of prior boxes according to the assigned cluster-center sizes;
6) computing the overlap between the sample's target boxes and the corresponding prior boxes at each size, and labeling a prior box whose overlap with a target box is greater than or equal to 0.5 as a positive sample containing the target, and otherwise as a background sample;
7) computing the classification error between each prior box's predicted category and its true category from 6), and, for prior boxes containing a target, the offset error of the center point and width-height relative to the true target box;
8) optimizing the network parameters with the Adam algorithm to minimize the two errors in 7).
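A minimal sketch of the stitching and normalization in steps 1) and 2) is given below; the helper name, the use of OpenCV, and the 448 input size (taken from the embodiment) are assumptions, and the re-mapping of target box coordinates onto the stitched image is omitted:

```python
import numpy as np
import cv2

def stitch_pair(img_a: np.ndarray, img_b: np.ndarray, size: int = 448) -> np.ndarray:
    """Stitch two samples along a randomly chosen axis, then resize and normalize."""
    axis = np.random.randint(2)                    # 0: vertical stacking, 1: horizontal
    h = min(img_a.shape[0], img_b.shape[0])
    w = min(img_a.shape[1], img_b.shape[1])
    a = cv2.resize(img_a, (w, h))                  # bring the pair to a common size first
    b = cv2.resize(img_b, (w, h))
    stitched = np.concatenate([a, b], axis=axis)
    stitched = cv2.resize(stitched, (size, size))  # scale to the fixed network input size
    return stitched.astype(np.float32) / 255.0     # normalize pixel values to [0, 1]
```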
As an alternative implementation, data pre-processing may additionally include flipping and color jittering.
As an alternative implementation, the classification error may be calculated using either focal loss or cross-entropy (CE) loss.
As an alternative implementation, the offset error may be calculated using Euclidean distance, Smooth L1 loss, GIoU loss, etc.
As an alternative implementation, all prior boxes are used to calculate the classification error.
As an alternative implementation, a certain number of prior boxes are randomly selected to calculate the classification error.
As an alternative implementation, prior boxes that are hard to classify, found using OHEM (online hard example mining), are used to calculate the classification error.
As an alternative implementation, prior boxes that are hard to classify are found using OHEM and, together with a randomly selected portion of the remaining easy-to-classify prior boxes, are used to calculate the classification error.
Drawings
Fig. 1 is a diagram of a backbone network structure of a lightweight detection network.
Fig. 2 is a block diagram of a two-way dense module (ddm) in a backbone network.
Fig. 3 is a diagram of the neck network architecture of the lightweight detection network.
Fig. 4 is a diagram of a lightweight residual block (lrm).
Fig. 5 is a diagram of a predictive network architecture for a lightweight detection network.
Fig. 6 is a flow chart of a network training method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
The invention provides a lightweight target detection network applied to embedded equipment and a training method.
The network input size is 448x448. The pixel values of the image to be detected are converted from unsigned 8-bit integers to 32-bit floating point and divided by 255, normalizing them to between 0 and 1 before they are fed into the backbone network of the detection network.
Fig. 1 and 2 are block diagrams of the backbone network and the two-way dense module. As shown in the figures, the backbone network comprises a plurality of two-way dense modules. After receptive-field information at different scales is obtained through several serially connected two-way dense modules, a down-sampled feature map is output using 2x2 average pooling. For the input image, the backbone network outputs feature maps of two sizes, 56x56 and 28x28, in sequence.
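Fig. 2 is not reproduced in this text, so the internal layout of the two-way dense module is not visible here; the following is only a rough sketch of one plausible reading (two parallel branches with different receptive fields whose outputs are densely concatenated with the input), and its channel widths, branch depths and normalization layers are all assumptions rather than the patented structure:

```python
import torch
import torch.nn as nn

class TwoWayDenseModule(nn.Module):
    """Illustrative sketch only; the real module layout follows the patent's Fig. 2."""
    def __init__(self, in_ch: int, growth: int = 32):
        super().__init__()
        # branch 1: smaller receptive field (single 3x3)
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, growth, 1, bias=False), nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
            nn.Conv2d(growth, growth, 3, padding=1, bias=False), nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
        )
        # branch 2: larger receptive field (two stacked 3x3 convolutions)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, growth, 1, bias=False), nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
            nn.Conv2d(growth, growth, 3, padding=1, bias=False), nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
            nn.Conv2d(growth, growth, 3, padding=1, bias=False), nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # dense connection: concatenate the input with both branch outputs
        return torch.cat([x, self.branch1(x), self.branch2(x)], dim=1)
```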
As shown in fig. 3, the feature maps output by the backbone network are used as the input of the neck network: feature map 1 is up-sampled and fused with the previous-stage feature to obtain feature map 2, and the two feature maps are fed into the corresponding lightweight residual blocks to enhance feature expressiveness. Fig. 4 is a block diagram of a lightweight residual block, in which one branch is a single ordinary 1x1 convolution and the other branch is a cascade of two ordinary 1x1 convolutions and a 3x3 depthwise separable convolution; this guarantees strong feature representation without adding much computation.
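A sketch of such a two-branch block is given below, assuming a 1x1 shortcut branch and a 1x1 → 3x3 depthwise → 1x1 main branch whose outputs are added residually; the channel widths and the use of batch normalization are assumptions not stated in the text:

```python
import torch
import torch.nn as nn

class LightweightResidualBlock(nn.Module):
    """Sketch of the two-branch block; exact widths and normalization are assumed."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        mid = out_ch // 2
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)   # single 1x1 branch
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            # 3x3 depthwise convolution (groups == channels), the cheap half of a depthwise-separable pair
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.shortcut(x) + self.main(x))   # residual addition of the two branches
```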
The two branches of the prediction network in fig. 5 respectively output the probability of the category to which each prior box belongs and the offsets of the prior box center and width-height; combining these with the prior box center and width-height, the coordinates of the upper-left and lower-right corners of the target box are obtained from the following formulas:
Center(x, y) = pred(Δx, Δy) × prior(x, y) + prior(x, y)
(w, h) = prior(w, h) × e^pred(Δw, Δh)
top_left(x, y) = Center(x, y) - 0.5 × (w, h)
bottom_right(x, y) = Center(x, y) + 0.5 × (w, h)
In the above formulas, pred(Δx, Δy) and pred(Δw, Δh) represent the network-predicted offsets of the prior box center and width-height; prior(x, y) and prior(w, h) represent the initial center coordinates and width-height of the prior box; Center(x, y) and (w, h) represent the center and width-height of the target box obtained after the adjustment.
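The decoding can be transcribed directly from the four formulas above (including the multiplication of the (Δx, Δy) offset by the prior center, as written); the function name and array shapes are illustrative:

```python
import numpy as np

def decode_boxes(pred_xy, pred_wh, prior_xy, prior_wh):
    """Apply the decoding equations above; all arrays have shape [K, 2]."""
    center = pred_xy * prior_xy + prior_xy      # Center(x, y)
    wh = prior_wh * np.exp(pred_wh)             # (w, h) = prior(w, h) * e^pred(Δw, Δh)
    top_left = center - 0.5 * wh
    bottom_right = center + 0.5 * wh
    return top_left, bottom_right
```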
With the backbone network outputting two feature map sizes, the model size is about 3.5 MB. For a 448x448 input image, the detection speed is about 200 FPS on a GTX 1080 Ti and about 35 FPS on the HiSilicon AI chip Hi3516CV500, which satisfies real-time detection on the embedded device.
In the training stage, based on the input image size defined by the detection network, the k-means clustering method is used to compute the cluster centers of the target box sizes in the training samples (each center being a width-height pair) and the average overlap ratio between the width-height of the coordinate boxes and the width-height of the cluster centers. The average overlap ratio is computed as follows:
AvgIoU = (1/N) × Σ_{i=1..n} Σ_{j=1..N_i} IoU(s_{i,j}, c_i)
In the above formula, N represents the total number of target boxes in the training samples, n represents the number of cluster centers, N_j represents the number of target boxes contained in the jth class, s_{i,j} represents the jth target box in the ith class, and c_i represents the ith cluster center. As the number of cluster centers increases, the gradient of the average overlap gradually tends to 0; on this premise, the cluster centers with the highest average overlap are selected as the prior box sizes.
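Assuming the overlap is the width-height IoU (positions ignored), the average overlap for a given set of cluster centers can be sketched as follows; the function names are hypothetical, and the cluster centers themselves would come from a separate k-means step:

```python
import numpy as np

def wh_iou(wh: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between box sizes [N, 2] and cluster centers [n, 2], ignoring positions."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, None, 0] * wh[:, None, 1] + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / (union + 1e-9)

def avg_overlap(wh: np.ndarray, centers: np.ndarray) -> float:
    """Average IoU of each target box with its assigned (best-matching) cluster center."""
    iou = wh_iou(wh, centers)              # [N, n]
    return float(iou.max(axis=1).mean())   # (1/N) * sum over per-box best clusters
```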
In the current implementation, each batch of training images is stitched together in pairs, resized to 448x448, and then color-jittered or horizontally flipped. Because the backbone network outputs feature maps of two sizes, the cluster centers obtained by k-means are divided into two groups by area, from small to large: the smaller group is used to construct the prior boxes of the large-size feature map, and the larger group is used to construct the prior boxes of the small-size feature map. Then, for each element of every feature map, the row and column indices plus 0.5 are taken as the prior box center position, and the corresponding number of prior boxes is constructed around that center according to the previously assigned cluster centers. The centers and width-heights of the real target boxes in the sample data are mapped onto all prior boxes of the 56x56 and 28x28 feature maps to compute the overlap. Using 0.5 as the threshold, a prior box whose overlap with any real target box exceeds 0.5 is taken as a positive sample (containing a target), and a prior box below the threshold for every real target box is taken as a negative sample (background). For positive samples, the offsets of the center coordinates and width-height relative to the real target box are further computed.
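A sketch of the prior box construction and IoU-based matching described in this paragraph follows; the anchor sizes, the example ground-truth box, and the helper names are illustrative assumptions:

```python
import numpy as np

def build_priors(feat_size: int, input_size: int, anchor_whs) -> np.ndarray:
    """Priors for one feature map: cell centers at (col + 0.5, row + 0.5) * stride,
    one prior per assigned cluster-center size. Returns [H*W*A, 4] as (cx, cy, w, h)."""
    stride = input_size / feat_size
    priors = []
    for row in range(feat_size):
        for col in range(feat_size):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for w, h in anchor_whs:
                priors.append((cx, cy, w, h))
    return np.asarray(priors, dtype=np.float32)

def iou_cxcywh(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """IoU between prior boxes a [K, 4] and ground-truth boxes b [M, 4] in (cx, cy, w, h)."""
    a1, a2 = a[:, None, :2] - a[:, None, 2:] / 2, a[:, None, :2] + a[:, None, 2:] / 2
    b1, b2 = b[None, :, :2] - b[None, :, 2:] / 2, b[None, :, :2] + b[None, :, 2:] / 2
    inter = np.clip(np.minimum(a2, b2) - np.maximum(a1, b1), 0, None).prod(-1)
    union = a[:, None, 2] * a[:, None, 3] + b[None, :, 2] * b[None, :, 3] - inter
    return inter / (union + 1e-9)

# Priors whose best IoU with any ground-truth box is >= 0.5 become positives; the rest are background.
priors = build_priors(56, 448, anchor_whs=[(24, 32), (48, 64)])   # anchor sizes are placeholders
gt = np.array([[224.0, 224.0, 60.0, 80.0]])                       # (cx, cy, w, h), illustrative
positive_mask = iou_cxcywh(priors, gt).max(axis=1) >= 0.5
```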
Since the ratio of positive to negative samples in the self-built dataset is unbalanced, the classification error and the offset error are calculated with focal loss and Smooth L1 loss, respectively, using the following formulas:
L_cls(p, p') = -α × p × (1 - p')^γ × log(p') - (1 - α) × (1 - p) × p'^γ × log(1 - p')
where α is used to balance the ratio of positive to background samples; γ is used to distinguish easy-to-classify from hard-to-classify prior boxes, so that error optimization focuses more on the classification results of the hard-to-classify prior boxes; p and p' represent the true class of the prior box and the class predicted by the network, respectively.
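A sketch of the binary focal loss in this form is given below; the default values α = 0.25 and γ = 2 are the common choices from the focal loss literature, not values stated in this patent:

```python
import numpy as np

def focal_loss(p_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss per prior; p_true in {0, 1}, p_pred is the predicted probability."""
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    pos = -alpha * (1.0 - p_pred) ** gamma * p_true * np.log(p_pred)          # target priors
    neg = -(1.0 - alpha) * p_pred ** gamma * (1.0 - p_true) * np.log(1.0 - p_pred)  # background priors
    return pos + neg
```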
Smooth_L1(x', x) = 0.5 × (x' - x)^2,  if |x' - x| < 1
Smooth_L1(x', x) = |x' - x| - 0.5,    otherwise
where x' and x ∈ {Δx, Δy, Δw, Δh} represent the prior box offset predicted by the detection network and the true prior box offset, respectively.
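A corresponding sketch of the Smooth L1 term, with the conventional threshold β = 1 (an assumption, as the patent does not state it):

```python
import numpy as np

def smooth_l1(x_pred, x_true, beta=1.0):
    """Smooth L1 per offset component: quadratic near zero, linear elsewhere."""
    d = np.abs(x_pred - x_true)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)
```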
The classification error is calculated on both positive and negative samples, while the offset error is calculated only on the offsets of the positive samples. To further reduce the adverse effect on training caused by negative samples far outnumbering positive samples, only part of the negative samples participates in the total classification error, which also prevents the error optimization direction from being dominated by easily classified negative samples. For each batch of input images, the negative samples with the largest classification errors are selected, their number being 3 times the number of positive samples; in addition, a portion of the remaining negative samples is randomly selected and used together with these hard negative samples to calculate the total classification error.
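The negative sample selection can be sketched as follows; the fraction of randomly added easy negatives is an assumption, since the text only says that a part of the remaining negatives is randomly selected:

```python
import numpy as np

def select_negatives(cls_loss, positive_mask, hard_ratio=3, random_extra=0.1):
    """Pick negatives for the classification loss: the hardest negatives (3x the positives)
    plus a random slice of the remaining easy negatives. Returns a boolean mask over priors."""
    neg_idx = np.flatnonzero(~positive_mask)
    neg_loss = cls_loss[neg_idx]
    n_hard = min(hard_ratio * int(positive_mask.sum()), neg_idx.size)
    hard = neg_idx[np.argsort(-neg_loss)[:n_hard]]             # negatives with the largest loss
    rest = np.setdiff1d(neg_idx, hard)
    extra = (np.random.choice(rest, size=int(random_extra * rest.size), replace=False)
             if rest.size else rest)                           # random slice of easy negatives
    selected = np.zeros_like(positive_mask)
    selected[np.concatenate([hard, extra])] = True
    return selected
```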

Claims (5)

1. A lightweight target detection network applied to embedded equipment, characterized in that the network consists of a backbone network, a neck network and a prediction network;
the input to the backbone network in the detection network is a preprocessed image to be detected, and the backbone network maps the input image to feature maps of multiple sizes for detecting targets of different sizes; the neck network is composed of a plurality of lightweight residual blocks, its input end is connected to the output end of the backbone network, feature maps of adjacent sizes are fused in turn, each residual block inside the neck network further enhances the feature expressiveness of the feature map of the corresponding size, and the fused feature maps are finally output to the subsequent prediction network for result prediction;
the backbone network consists of a plurality of two-way dense modules that extract features of the input feature map under receptive fields of different scales, and the feature map output by each output layer is 1/2 the size of the previous output feature map; the neck network up-samples the feature map of the current size by nearest-neighbor interpolation and then fuses it with the corresponding feature map output by the previous layer through concatenation; the prediction network is similar to that of conventional single-stage target detection and is a two-branch network, the two branches respectively predicting the category of the target and the offsets of the target center and width-height relative to the current pixel position;
by processing all prediction results with non-maximum suppression, coordinate boxes containing no object and redundant coordinate boxes surrounding the same target are filtered out.
2. The lightweight object detection network applied to the embedded device according to claim 1, wherein the backbone network outputs feature maps of 2-3 sizes according to the distribution of the sizes of the detected objects and the detection efficiency.
3. The lightweight target detection network applied to the embedded device according to claim 1, wherein the size of the input layer of the backbone network is automatically adjusted based on the aspect ratio distribution of the detection target coordinate frame.
4. The lightweight object detection network applied to embedded device of claim 3, wherein the size of the first output feature map of the backbone network is 1/8 or 1/4 of the input size according to the size of the detected object relative to the size of the original image.
5. A training method of a lightweight target detection network applied to an embedded device is characterized by comprising the following steps:
(1) randomly extracting batch_size × 2 training sample images, stitching them together in pairs, with the stitching direction chosen randomly between horizontal and vertical;
(2) scaling the stitched image to a fixed size and normalizing the pixel values to the range 0-1;
(3) inputting the preprocessed training sample images into a backbone network to obtain feature maps of a plurality of sizes;
(4) obtaining cluster centers of the target box sizes in the training samples with the k-means clustering algorithm; for different numbers of cluster centers, computing the average overlap between the width-height of the coordinate boxes and the width-height of the cluster centers; from the curve of average overlap versus number of cluster centers, selecting the largest number of cluster centers at which the gradient still changes noticeably; and assigning the corresponding cluster-center sizes, from small to large, evenly to the feature maps of corresponding sizes;
(5) taking each pixel in the feature map as a center and constructing the corresponding number of prior boxes according to the assigned cluster-center sizes;
(6) computing the overlap between the sample's target boxes and the corresponding prior boxes at each size, and labeling a prior box whose overlap with a target box is greater than or equal to 0.5 as a positive sample containing the target, and otherwise as a background sample;
(7) computing the classification error between each prior box's predicted category and its true category from (6), and, for prior boxes containing a target, the offset error of the center point and width-height relative to the true target box;
(8) optimizing the network parameters with the Adam algorithm to minimize the two errors in (7).
CN202011411218.8A 2020-12-06 2020-12-06 Lightweight target detection network applied to embedded equipment and training method Pending CN112464954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011411218.8A CN112464954A (en) 2020-12-06 2020-12-06 Lightweight target detection network applied to embedded equipment and training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011411218.8A CN112464954A (en) 2020-12-06 2020-12-06 Lightweight target detection network applied to embedded equipment and training method

Publications (1)

Publication Number Publication Date
CN112464954A true CN112464954A (en) 2021-03-09

Family

ID=74805533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011411218.8A Pending CN112464954A (en) 2020-12-06 2020-12-06 Lightweight target detection network applied to embedded equipment and training method

Country Status (1)

Country Link
CN (1) CN112464954A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326749A (en) * 2021-05-17 2021-08-31 合肥高维数据技术有限公司 Target detection method and device, storage medium and electronic equipment
CN113344877A (en) * 2021-06-08 2021-09-03 武汉工程大学 Reinforcing steel bar model training method and device based on convolutional neural network
CN113658273A (en) * 2021-08-19 2021-11-16 上海新氦类脑智能科技有限公司 Scene self-adaptive target positioning method and system based on spatial perception
CN113658273B (en) * 2021-08-19 2024-04-26 上海新氦类脑智能科技有限公司 Scene self-adaptive target positioning method and system based on space perception
WO2023138300A1 (en) * 2022-01-20 2023-07-27 城云科技(中国)有限公司 Target detection method, and moving-target tracking method using same

Similar Documents

Publication Publication Date Title
CN112464954A (en) Lightweight target detection network applied to embedded equipment and training method
CN114202672A (en) Small target detection method based on attention mechanism
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN112800838A (en) Channel ship detection and identification method based on deep learning
CN110222718B (en) Image processing method and device
WO2021103731A1 (en) Semantic segmentation method, and model training method and apparatus
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN109145747A (en) A kind of water surface panoramic picture semantic segmentation method
CN110598788A (en) Target detection method and device, electronic equipment and storage medium
CN111860683B (en) Target detection method based on feature fusion
CN111079739A (en) Multi-scale attention feature detection method
CN116630608A (en) Multi-mode target detection method for complex scene
CN113066018A (en) Image enhancement method and related device
CN115482529A (en) Method, equipment, storage medium and device for recognizing fruit image in near scene
CN113901928A (en) Target detection method based on dynamic super-resolution, and power transmission line component detection method and system
CN116664859A (en) Mobile terminal real-time target detection method, terminal equipment and storage medium
CN111199255A (en) Small target detection network model and detection method based on dark net53 network
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
CN111222534A (en) Single-shot multi-frame detector optimization method based on bidirectional feature fusion and more balanced L1 loss
CN109447239B (en) Embedded convolutional neural network acceleration method based on ARM
CN117710965A (en) Small target detection method based on improved YOLOv5
KR20090086660A (en) Computer architecture combining neural network and parallel processor, and processing method using it
CN114333062A (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
Wang et al. Insulator defect detection based on improved you-only-look-once v4 in complex scenarios

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination