CN110287849B - Lightweight depth network image target detection method suitable for raspberry pi - Google Patents

Lightweight depth network image target detection method suitable for raspberry pi

Info

Publication number
CN110287849B
CN110287849B CN201910534572.0A
Authority
CN
China
Prior art keywords
network
feature
target detection
image
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910534572.0A
Other languages
Chinese (zh)
Other versions
CN110287849A (en)
Inventor
任坤
黄泷
范春奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Industrial Internet (Beijing) Technology Group Co.,Ltd.
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910534572.0A priority Critical patent/CN110287849B/en
Publication of CN110287849A publication Critical patent/CN110287849A/en
Application granted granted Critical
Publication of CN110287849B publication Critical patent/CN110287849B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/29 - Graphical models, e.g. Bayesian networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A lightweight deep network image target detection method suitable for the Raspberry Pi belongs to the field of deep learning and target detection. The method first collects images containing the targets to be detected and preprocesses the collected images for network training. The preprocessed images are then input into a depthwise separable dilated convolutional neural network for feature extraction, producing feature maps of different resolutions. The feature maps of different resolutions are input into a feature pyramid network for feature fusion, generating fused feature maps that carry richer information. A detection network then classifies and localizes the targets to be detected on the fused feature maps, and non-maximum suppression is finally applied to obtain the optimal target detection result. The invention overcomes two difficulties: image target detection methods based on deep neural networks are hard to implement on the Raspberry Pi platform, and existing lightweight-network detection methods achieve low accuracy on that platform.

Description

Lightweight depth network image target detection method suitable for raspberry pi
Technical Field
The invention belongs to the field of deep learning and target detection, and particularly relates to a lightweight deep network image target detection method suitable for the Raspberry Pi.
Background
Object detection is a fundamental task in computer vision. Its main purpose is to locate objects of interest in an input image or video, accurately classify each object, and provide a bounding box for it. Early target detection techniques relied on hand-crafted feature extraction, combining the manually extracted features with a classifier to perform the detection task. Designing features by hand is not only laborious, but the resulting features also lack expressive power and robustness, so researchers proposed target detection methods based on convolutional neural networks. A convolutional neural network can learn useful image features autonomously, which removes the limitations of hand-designed features and improves detection accuracy. These advantages have made convolutional-neural-network-based methods rapidly replace traditional ones as the mainstream research direction in target detection.
At present, image target detection models based on convolutional neural networks improve detection accuracy mainly by deepening the network. As networks deepen, the hardware required for training shifts from ordinary platforms to large high-performance servers, and the massive, dense computation makes it difficult to deploy such deep detection models on resource-constrained micro computing platforms such as the Raspberry Pi. Existing technical solutions address this by compressing and accelerating the deep convolutional neural network to reduce parameters and computation so that the model's memory footprint and compute demand fit a low-end configuration, but at the cost of a large drop in detection accuracy.
Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide a lightweight deep network image target detection method suitable for the Raspberry Pi, overcoming the difficulties that image target detection methods based on deep neural networks are hard to implement on the Raspberry Pi platform and that lightweight-network detection methods achieve low accuracy on that platform.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
a lightweight deep network image target detection method suitable for the Raspberry Pi comprises the following steps:
(1) collecting images containing targets to be detected, and preprocessing the collected images for network training;
(2) inputting the images preprocessed in step (1) into a depthwise separable dilated convolutional neural network for feature extraction to obtain feature maps of different resolutions;
(3) selecting feature maps of different resolutions obtained in step (2) and inputting them into a feature pyramid network for feature fusion, generating fused feature maps that carry richer information;
(4) inputting the fused feature maps generated in step (3) into a detection network to classify and localize the targets to be detected, and finally applying non-maximum suppression to obtain the optimal target detection result.
Further, the specific process of step (1) is as follows:
(a) selecting the categories of targets to be detected, collecting images containing targets of those categories, and annotating them, i.e., marking the bounding box and category of every target to be detected appearing in each image;
(b) when the number of collected images is small, performing data enhancement on the existing images: flipping, translation, rotation, or noise injection is used to create more images, so that the trained neural network performs better;
(c) uniformly converting the image resolution to 224 × 224 to match the network input size;
(d) adjusting the image set according to the numbers of positive and negative samples, and dividing it into a training image set and a testing image set (a preprocessing sketch is given below).
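The following is a minimal preprocessing sketch in Python; the function name, the box format, and the choice of horizontal flipping as the enhancement are illustrative assumptions rather than details given in the patent. It resizes an annotated image to the 224 × 224 input size and applies a random flip while keeping the bounding boxes consistent.

```python
import random
from PIL import Image

INPUT_SIZE = 224  # network input resolution used in step (c)

def preprocess(image_path, boxes, augment=True):
    """boxes: list of (xmin, ymin, xmax, ymax, class_id) in pixel coordinates."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size

    # Resize to 224 x 224 and scale the boxes into the new coordinate frame.
    sx, sy = INPUT_SIZE / w, INPUT_SIZE / h
    img = img.resize((INPUT_SIZE, INPUT_SIZE))
    boxes = [(x1 * sx, y1 * sy, x2 * sx, y2 * sy, c) for x1, y1, x2, y2, c in boxes]

    # Simple data enhancement: random horizontal flip with matching box update.
    if augment and random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
        boxes = [(INPUT_SIZE - x2, y1, INPUT_SIZE - x1, y2, c)
                 for x1, y1, x2, y2, c in boxes]
    return img, boxes
```

Translation, rotation, and noise injection would be handled analogously, each with a matching update of the box coordinates.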
Further, the specific process of step (2) is as follows:
(A) first performing initial feature extraction on the input image with a 7 × 7 standard convolution block to obtain a 112 × 112 × 64 feature map, where 64 is the number of channels of the feature map;
(B) passing the 112 × 112 × 64 feature map obtained in step (A) through 3 depthwise separable convolution blocks in sequence for deep feature extraction, obtaining feature maps of 56 × 56 × 256, 28 × 28 × 512 and 14 × 14 × 1024, respectively;
(C) performing final feature extraction on the 14 × 14 × 1024 feature map obtained in step (B) with a depthwise separable dilated convolution block, obtaining a feature map of resolution 14 × 14 × 1024.
The depthwise separable convolution blocks in step (B) greatly compress the number of network parameters, which is explained as follows:
A 3 × 3 standard convolution takes an input tensor L_i of size H_i * W_i * M and applies a convolution kernel K_s of size 3 * 3 * M * N to obtain an output tensor L_j of size H_i * W_i * N, where H_i and W_i are the height and width of the input feature map, M is the number of channels of the input feature map, N is the number of channels of the output feature map, and 3 * 3 is the spatial size of the convolution kernel. The computational cost of the 3 × 3 standard convolution is:
H_i * W_i * M * N * 3 * 3.
the depth separable convolution decomposes the standard convolution into two steps: 3 x 3 depth convolution and 1 x 1 point-by-point convolution. 3 x 3 depth convolution the input feature maps are each convolved using only a single convolution kernel. Point-by-point convolution then linearly combines the output of the depth convolution layer with a 1 x 1 convolution kernel.
The depthwise separable convolution takes the same input tensor L_i of size H_i * W_i * M, applies a depthwise convolution kernel K_d of size 3 * 3 * 1 * M to obtain an intermediate tensor L_j of size H_i * W_i * M, and then applies a pointwise convolution kernel K_p of size 1 * 1 * M * N to obtain the output tensor L_k of size H_i * W_i * N. The computational cost of the depthwise separable convolution is:
H_i * W_i * M * 3 * 3 + H_i * W_i * M * N
By representing the convolution as separate filtering and combining steps, the computational cost of the depthwise separable convolution is only
(H_i * W_i * M * 3 * 3 + H_i * W_i * M * N) / (H_i * W_i * M * N * 3 * 3) = 1/N + 1/9
of that of the conventional convolution.
A ReLU layer follows the pointwise (1 × 1) convolution to add non-linearity, avoid vanishing gradients, and increase network sparsity against overfitting. No ReLU layer is added after the depthwise (3 × 3) convolution, which preserves information flow between feature maps and reduces computation.
In addition, the depthwise separable dilated convolution block in step (C) effectively enlarges the receptive field of the convolution kernel and improves bounding-box regression and localization accuracy for targets without increasing the number of network parameters; a code sketch of such a block is given below.
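As an illustration, the block below is a PyTorch sketch of a depthwise separable convolution and its dilated variant; the class name and the absence of normalization layers are assumptions, since the patent gives no layer-level code. The 3 × 3 depthwise convolution uses one filter per input channel and is not followed by ReLU, while the 1 × 1 pointwise convolution is followed by ReLU, matching the description above; a dilation rate of 2 enlarges the receptive field without adding parameters.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (no ReLU) followed by 1x1 pointwise conv + ReLU."""
    def __init__(self, in_ch, out_ch, stride=1, dilation=1):
        super().__init__()
        # 3x3 depthwise convolution: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=dilation, dilation=dilation,
                                   groups=in_ch, bias=False)
        # 1x1 pointwise convolution linearly combines the depthwise outputs.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.depthwise(x)      # no ReLU here, per the description above
        x = self.pointwise(x)
        return self.relu(x)

# Cost check for a 14 x 14 input with M = N = 1024 channels (multiply-adds):
#   standard 3x3 conv:   14*14*1024*1024*9              ~ 1.85e9
#   depthwise separable: 14*14*1024*9 + 14*14*1024*1024 ~ 2.07e8  (= (1/N + 1/9) of the above)
block = DepthwiseSeparableConv(1024, 1024, dilation=2)   # dilation=2 enlarges the receptive field
out = block(torch.randn(1, 1024, 14, 14))                # -> shape (1, 1024, 14, 14)
```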
Further, the specific process of step (3) is as follows:
(I) applying a 1 × 1 convolution to each of the 28 × 28 × 512 and 14 × 14 × 1024 feature maps obtained from the feature extraction in step (2), unifying the channel number to 256 to obtain 28 × 28 × 256 and 14 × 14 × 256 feature maps;
(II) adjusting the feature maps of different spatial resolutions obtained in step (I) to the same resolution by upsampling and then splicing them, generating fused feature maps of 56 × 56 × 256, 28 × 28 × 256 and 14 × 14 × 256 that carry richer information (a fusion sketch is given after this list).
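A sketch of the fusion step is shown below; the module and variable names are assumptions. It uses 1 × 1 lateral convolutions to unify the channel number to 256 and a top-down pathway that upsamples the deeper map and merges it with the shallower one by element-wise addition, as in a standard feature pyramid. If the splicing in step (II) is read as channel concatenation instead, a further 1 × 1 convolution would be needed to bring the channel number back to 256.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024)):
        super().__init__()
        # 1x1 lateral convolutions: unify the channel number to 256.
        self.lateral = nn.ModuleList([nn.Conv2d(c, 256, kernel_size=1) for c in in_channels])

    def forward(self, c2, c3, c4):
        # c2: 56x56x256, c3: 28x28x512, c4: 14x14x1024 from the backbone.
        p4 = self.lateral[2](c4)                                       # 14x14x256
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2)   # 28x28x256
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2)   # 56x56x256
        return p2, p3, p4

fused = FeaturePyramid()(torch.randn(1, 256, 56, 56),
                         torch.randn(1, 512, 28, 28),
                         torch.randn(1, 1024, 14, 14))
```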
Further, the specific process of step (4) is as follows:
(i) taking the fused feature maps obtained in step (II) as input, generating several default boxes for each pixel of the input feature map, and then running the localization sub-network and the classification sub-network separately; the detection output has two parts: bounding box positions and category confidences;
(ii) the localization sub-network predicts a bounding box for each default box; the classification sub-network predicts, for each default box, the confidence of every category;
(iii) applying non-maximum suppression to the category confidences of the prediction boxes and to their position offsets relative to the default boxes, and selecting the prediction box with the smallest target loss as the optimal prediction box, from which the object category and box position are obtained.
The target loss function L(x, c, l, g) of the detection network in step (iii) is composed of a classification loss function L_conf(x, c) and a localization loss function L_loc(x, l, g):

L(x, c, l, g) = (1/N) * (L_conf(x, c) + α * L_loc(x, l, g))

where x denotes the default boxes on the feature map and their matching to ground-truth boxes, l the predicted boxes, c the predicted category confidences of the default boxes, and g the ground-truth boxes; L_conf(x, c) is the softmax classification loss of the default boxes over the category scores c, L_loc(x, l, g) is the position loss, N is the number of default boxes matched to ground-truth boxes, and the weight coefficient α is set to 1 by cross-validation.
By optimizing this loss function, the detection network achieves more accurate target localization and classification; a simplified sketch of the loss follows.
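The block below is a simplified sketch of this loss; the tensor shapes, the smooth-L1 position loss, and the omission of hard negative mining are assumptions for illustration. It combines a softmax/cross-entropy classification term over the default boxes with a localization term over the boxes matched to ground truth, summed and divided by the number N of matched boxes, with α = 1.

```python
import torch
import torch.nn.functional as F

def multibox_loss(cls_logits, loc_preds, cls_targets, loc_targets, alpha=1.0):
    """
    cls_logits : (B, D, num_classes) class scores for every default box
    loc_preds  : (B, D, 4) predicted offsets w.r.t. the default boxes
    cls_targets: (B, D) long tensor, class index per default box, 0 = background
    loc_targets: (B, D, 4) encoded ground-truth offsets for matched boxes
    """
    pos = cls_targets > 0                          # matched (positive) default boxes
    num_pos = pos.sum().clamp(min=1).float()       # N, the number of matched boxes

    # Softmax classification loss over all default boxes.
    conf_loss = F.cross_entropy(cls_logits.reshape(-1, cls_logits.size(-1)),
                                cls_targets.reshape(-1), reduction="sum")
    # Position loss only on the matched boxes.
    loc_loss = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction="sum")

    return (conf_loss + alpha * loc_loss) / num_pos   # L = (L_conf + α·L_loc) / N
```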
Beneficial effects brought by the above technical solution:
the invention provides a method adopting deep separable convolution, which reduces redundant information in a characteristic diagram, realizes the great compression of network parameters under the condition of extremely small precision loss, and reduces the requirements on hardware memory and computing power; the depth separable expansion convolution is introduced to increase the receptive field of the characteristic diagram, and the small target detection effect and the target positioning precision are enhanced on the premise of not increasing network parameters; and the characteristic pyramid is used for carrying out multi-scale characteristic fusion, so that the characteristics under all scales have abundant image information, and the detection and target positioning precision of the small target is further improved. The method has the advantages of low memory occupation and low calculation power requirement, and can realize the target detection task on the raspberry dispatching platform.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a model block diagram of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is explained below with reference to the accompanying drawings and examples, but is not limited thereto:
step 1, collecting images containing targets to be detected, and preprocessing the collected images for network training.
The categories of targets to be detected are selected, a large number of images containing targets of those categories are collected, and the targets are annotated, i.e., the bounding box and category of every target to be detected appearing in each image are marked.
When the number of collected images is small, data enhancement is performed on the existing images: flipping, translation, rotation, or noise injection is used to create more images, so that the trained neural network performs better.
The image resolution is uniformly converted to 224 × 224 to match the network input size.
The images are adjusted according to the numbers of positive and negative samples and divided into a training image set and a testing image set.
Step 2, the images preprocessed in step 1 are input into the depthwise separable dilated convolutional neural network for feature extraction to obtain feature maps of different resolutions.
In stage 1, the 224 × 224 input image is down-sampled by a 7 × 7 standard convolution with stride 2, outputting a 112 × 112 × 64 feature map.
In stage 2, the 112 × 112 × 64 input feature map is down-sampled by a 3 × 3 max pooling layer, features are extracted through 3 depthwise separable convolution layers, and a 56 × 56 × 256 feature map is output.
In stage 3, the 56 × 56 × 256 input feature map is down-sampled by a 3 × 3 depthwise separable convolution layer with stride 2, features are extracted through 3 depthwise separable convolution layers, and a 28 × 28 × 512 feature map is output.
In stage 4, the 28 × 28 × 512 input feature map is down-sampled by a 3 × 3 depthwise separable convolution layer with stride 2, features are extracted through 5 depthwise separable convolution layers, and a 14 × 14 × 1024 feature map is output.
In stage 5, the 14 × 14 input feature map is convolved with a depthwise separable layer with a dilation rate of 2; the receptive field is enlarged while the spatial resolution is kept unchanged, and a 14 × 14 × 1024 feature map is output (a structural sketch of the backbone follows).
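The five stages can be composed as in the following structural sketch; the layer counts and output resolutions follow the description above, while the stem padding, the exact channel widths of intermediate layers, and the helper name ds_conv are assumptions.

```python
import torch
import torch.nn as nn

def ds_conv(in_ch, out_ch, stride=1, dilation=1):
    """3x3 depthwise conv (no ReLU) followed by 1x1 pointwise conv + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=dilation,
                  dilation=dilation, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.ReLU(inplace=True),
    )

class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1: 7x7 stride-2 stem, 224x224x3 -> 112x112x64.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3),
                                    nn.ReLU(inplace=True))
        # Stage 2: max pooling then 3 depthwise separable layers, -> 56x56x256.
        self.stage2 = nn.Sequential(nn.MaxPool2d(3, stride=2, padding=1),
                                    ds_conv(64, 256), ds_conv(256, 256), ds_conv(256, 256))
        # Stage 3: stride-2 depthwise separable layer then 3 more, -> 28x28x512.
        self.stage3 = nn.Sequential(ds_conv(256, 512, stride=2),
                                    ds_conv(512, 512), ds_conv(512, 512), ds_conv(512, 512))
        # Stage 4: stride-2 depthwise separable layer then 5 more, -> 14x14x1024.
        self.stage4 = nn.Sequential(ds_conv(512, 1024, stride=2),
                                    *[ds_conv(1024, 1024) for _ in range(5)])
        # Stage 5: dilated depthwise separable layer, resolution kept at 14x14.
        self.stage5 = ds_conv(1024, 1024, dilation=2)

    def forward(self, x):
        x = self.stage1(x)
        c2 = self.stage2(x)      # 56x56x256
        c3 = self.stage3(c2)     # 28x28x512
        c4 = self.stage4(c3)     # 14x14x1024
        c5 = self.stage5(c4)     # 14x14x1024, enlarged receptive field
        return c2, c3, c4, c5

feats = Backbone()(torch.randn(1, 3, 224, 224))
```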
Step 3, the feature maps of different resolutions obtained in step 2 are input into the feature pyramid network for feature fusion, generating fused feature maps that carry richer information.
The feature maps finally output by stages 2 to 5 are each passed through a 1 × 1 convolution to unify the channel number to 256.
Feature map A, the stage-5 output after its 1 × 1 convolution, is fused with the 14 × 14 feature map B output by stage 4 to obtain a 14 × 14 feature map AB.
The feature map AB is upsampled to 28 × 28 and then fused with the 28 × 28 feature map C output by stage 3 to obtain feature map ABC.
The feature map ABC is upsampled to 56 × 56 and then fused with the 56 × 56 feature map D output by stage 2 to obtain feature map ABCD.
Step 4, the fused feature maps generated in step 3 are input into the detection network to classify and localize the targets to be detected, and non-maximum suppression is finally applied to obtain the optimal target detection result.
The fused feature maps obtained in step 3 are taken as input; 4 default boxes are generated for each pixel of the input feature map, and then the localization sub-network and the classification sub-network perform detection separately. The detection output has two parts: bounding box positions and category confidences.
The localization sub-network generates a prediction box for each default box; the classification sub-network predicts, for each default box, the confidence of every category.
Non-maximum suppression is applied to the category confidences of the prediction boxes and to their position offsets relative to the default boxes, and the prediction box with the smallest target loss is selected as the optimal prediction box, from which the object category and box position are obtained (a non-maximum suppression sketch follows).
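The following is a minimal per-class non-maximum suppression sketch; the IoU threshold is an assumed value, and the patent's criterion of keeping the box with the smallest target loss is represented here by keeping the highest-confidence box, which is the standard NMS choice. Overlapping prediction boxes of the same class are compared and only the best-scoring one is retained.

```python
import torch

def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (K, 4) as (x1, y1, x2, y2); scores: (K,). Returns indices of kept boxes."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection-over-union of the best box with the remaining boxes.
        x1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]   # discard boxes overlapping the kept one too much
    return keep
```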
The target loss function L(x, c, l, g) is composed of a classification loss function L_conf(x, c) and a localization loss function L_loc(x, l, g):

L(x, c, l, g) = (1/N) * (L_conf(x, c) + α * L_loc(x, l, g))

where x denotes the default boxes on the feature map and their matching to ground-truth boxes, l the predicted boxes, c the predicted category confidences of the default boxes, and g the ground-truth boxes; L_conf(x, c) is the softmax classification loss of the default boxes over the category scores c, L_loc(x, l, g) is the position loss, N is the number of default boxes matched to ground-truth boxes, and the weight coefficient α is set to 1 by cross-validation.
The above embodiments merely illustrate the technical idea of the present invention and do not limit it; any modification made on the basis of this technical idea falls within the scope of the present invention.

Claims (1)

1. A lightweight deep network image target detection method suitable for the Raspberry Pi, characterized by comprising the following steps:
(1) collecting images containing targets to be detected, and preprocessing the collected images for network training;
(2) inputting the images preprocessed in step (1) into a depthwise separable dilated convolutional neural network for feature extraction to obtain feature maps of different resolutions;
(3) selecting feature maps of different resolutions obtained in step (2) and inputting them into a feature pyramid network for feature fusion to generate fused feature maps carrying richer information;
(4) inputting the fused feature maps generated in step (3) into a detection network to classify and localize the targets to be detected, and finally applying non-maximum suppression to obtain the optimal target detection result;
the specific process of the step (1) is as follows:
(A) firstly, carrying out primary feature extraction on the 224 × 224 input image through a 7 × 7 standard volume block to obtain a 112 × 64 feature map, wherein 64 represents the channel number of the feature map;
(B) sequentially extracting the depth features of the 3 depth separable volume blocks from the 112 × 64 feature maps obtained in the step (a) to obtain feature maps of 56 × 256, 28 × 512 and 14 × 1024 respectively;
(C) performing final feature extraction on the 14 × 1024 feature map obtained in the step (B) through a depth separable extended volume block to obtain a feature map with a resolution of 14 × 1024;
the specific process of the step (2) is as follows:
(I) respectively carrying out 1 × 1 convolution operation on the 28 × 512 and 14 × 1024 characteristic graphs obtained by the characteristic extraction in the step (1), and unifying the number of channels into 256 to obtain 28 × 256 and 14 × 256 characteristic graphs;
and (II) adjusting the multiple feature maps with different spatial resolutions obtained in the step (I) to the same resolution through upsampling, then performing splicing processing to generate fused feature maps 56 x 256, 28 x 256 and 14 x 256 which carry more abundant information, and completing target detection by using the multi-scale fused feature maps.
CN201910534572.0A 2019-06-20 2019-06-20 Lightweight depth network image target detection method suitable for raspberry pi Active CN110287849B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910534572.0A CN110287849B (en) 2019-06-20 2019-06-20 Lightweight depth network image target detection method suitable for raspberry pi

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910534572.0A CN110287849B (en) 2019-06-20 2019-06-20 Lightweight depth network image target detection method suitable for raspberry pi

Publications (2)

Publication Number Publication Date
CN110287849A CN110287849A (en) 2019-09-27
CN110287849B true CN110287849B (en) 2022-01-07

Family

ID=68004845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910534572.0A Active CN110287849B (en) 2019-06-20 2019-06-20 Lightweight depth network image target detection method suitable for raspberry pi

Country Status (1)

Country Link
CN (1) CN110287849B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111008562B (en) * 2019-10-31 2023-04-18 北京城建设计发展集团股份有限公司 Human-vehicle target detection method with feature map depth fusion
CN111047630B (en) * 2019-11-13 2023-06-13 芯启源(上海)半导体科技有限公司 Neural network and target detection and depth prediction method based on neural network
CN110991305B (en) * 2019-11-27 2023-04-07 厦门大学 Airplane detection method under remote sensing image and storage medium
CN111191508A (en) * 2019-11-28 2020-05-22 浙江省北大信息技术高等研究院 Face recognition method and device
CN111028282A (en) * 2019-11-29 2020-04-17 浙江省北大信息技术高等研究院 Unsupervised pose and depth calculation method and system
CN111199227A (en) * 2019-12-20 2020-05-26 广西柳州联耕科技有限公司 High-precision image identification method
CN111242122B (en) * 2020-01-07 2023-09-08 浙江大学 Lightweight deep neural network rotating target detection method and system
CN111199220B (en) * 2020-01-21 2023-04-28 北方民族大学 Light-weight deep neural network method for personnel detection and personnel counting in elevator
CN111204452B (en) * 2020-02-10 2021-07-16 北京建筑大学 Target detection system based on miniature aircraft
CN111340141A (en) * 2020-04-20 2020-06-26 天津职业技术师范大学(中国职业培训指导教师进修中心) Crop seedling and weed detection method and system based on deep learning
KR102497361B1 (en) * 2020-05-20 2023-02-10 한국전자통신연구원 Object detecting system and method
CN111666836B (en) * 2020-05-22 2023-05-02 北京工业大学 High-resolution remote sensing image target detection method of M-F-Y type light convolutional neural network
CN112115970B (en) * 2020-08-12 2023-03-31 南京理工大学 Lightweight image detection agricultural bird repelling method and system based on hierarchical regression
CN112183203B (en) * 2020-08-26 2024-05-28 北京工业大学 Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN112132001B (en) * 2020-09-18 2023-09-08 深圳大学 Automatic tracking and quality control method for iPSC and terminal equipment
CN112183291B (en) * 2020-09-22 2024-09-10 蜜度科技股份有限公司 Method and system for detecting minimum object in image, storage medium and terminal
CN112115914B (en) * 2020-09-28 2023-04-07 北京市商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium
CN112347936A (en) * 2020-11-07 2021-02-09 南京天通新创科技有限公司 Rapid target detection method based on depth separable convolution
CN112435236B (en) * 2020-11-23 2022-08-16 河北工业大学 Multi-stage strawberry fruit detection method
CN112507872B (en) * 2020-12-09 2021-12-28 中科视语(北京)科技有限公司 Positioning method and positioning device for head and shoulder area of human body and electronic equipment
CN113270156B (en) * 2021-04-29 2022-11-15 甘肃路桥建设集团有限公司 Detection modeling and detection method and system of machine-made sandstone powder based on image processing
CN113468992B (en) * 2021-06-21 2022-11-04 四川轻化工大学 Construction site safety helmet wearing detection method based on lightweight convolutional neural network
CN113420651B (en) * 2021-06-22 2023-05-05 四川九洲电器集团有限责任公司 Light weight method, system and target detection method for deep convolutional neural network
CN113487551B (en) * 2021-06-30 2024-01-16 佛山市南海区广工大数控装备协同创新研究院 Gasket detection method and device for improving dense target performance based on deep learning
CN113642662B (en) * 2021-08-24 2024-02-20 凌云光技术股份有限公司 Classification detection method and device based on lightweight classification model
CN113971731A (en) * 2021-10-28 2022-01-25 燕山大学 Target detection method and device and electronic equipment
CN114170526B (en) * 2021-11-22 2024-09-10 中国电子科技集团公司第十五研究所 Remote sensing image multi-scale target detection and identification method based on lightweight network
CN114841307A (en) * 2022-03-01 2022-08-02 北京交通大学 Training method for binaryzation target detection neural network structure and model
CN114462555B (en) * 2022-04-13 2022-08-16 国网江西省电力有限公司电力科学研究院 Multi-scale feature fusion power distribution network equipment identification method based on raspberry group
CN115719445A (en) * 2022-12-20 2023-02-28 齐鲁工业大学 Seafood identification method based on deep learning and raspberry type 4B module

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018042388A1 (en) * 2016-09-02 2018-03-08 Artomatix Ltd. Systems and methods for providing convolutional neural network based image synthesis using stable and controllable parametric models, a multiscale synthesis framework and novel network architectures
CN108288075B (en) * 2018-02-02 2019-06-14 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN108229442B (en) * 2018-02-07 2022-03-11 西南科技大学 Method for rapidly and stably detecting human face in image sequence based on MS-KCF
CN109214406B (en) * 2018-05-16 2021-07-09 长沙理工大学 Image classification method based on D-MobileNet neural network
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109784298A (en) * 2019-01-28 2019-05-21 南京航空航天大学 A kind of outdoor on-fixed scene weather recognition methods based on deep learning

Also Published As

Publication number Publication date
CN110287849A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN110287849B (en) Lightweight depth network image target detection method suitable for raspberry pi
CN112287940B (en) Semantic segmentation method of attention mechanism based on deep learning
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN110210551B (en) Visual target tracking method based on adaptive subject sensitivity
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN115063573B (en) Multi-scale target detection method based on attention mechanism
CN109035251B (en) Image contour detection method based on multi-scale feature decoding
CN113240691A (en) Medical image segmentation method based on U-shaped network
CN113239954B (en) Attention mechanism-based image semantic segmentation feature fusion method
CN112183203A (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN113222124B (en) SAUNet + + network for image semantic segmentation and image semantic segmentation method
CN112950780B (en) Intelligent network map generation method and system based on remote sensing image
CN113554032B (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114612660A (en) Three-dimensional modeling method based on multi-feature fusion point cloud segmentation
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN110097047B (en) Vehicle detection method based on deep learning and adopting single line laser radar
CN114820655A (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN116109678A (en) Method and system for tracking target based on context self-attention learning depth network
CN110633706B (en) Semantic segmentation method based on pyramid network
CN117727046A (en) Novel mountain torrent front-end instrument and meter reading automatic identification method and system
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN114119669A (en) Image matching target tracking method and system based on Shuffle attention
CN113361496A (en) City built-up area statistical method based on U-Net
CN117788810A (en) Learning system for unsupervised semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231009

Address after: No. 202-126, 2nd Floor, Building 18, Zone 17, No. 188 South Fourth Ring West Road, Fengtai District, Beijing, 100070

Patentee after: China Industrial Internet (Beijing) Technology Group Co.,Ltd.

Address before: 100124 No. 100 Chaoyang District Ping Tian Park, Beijing

Patentee before: Beijing University of Technology
