CN111898668A - Small target object detection method based on deep learning - Google Patents

Small target object detection method based on deep learning

Info

Publication number
CN111898668A
Authority
CN
China
Prior art keywords
image
small target
training
target object
objects
Prior art date
Legal status
Pending
Application number
CN202010723829.XA
Other languages
Chinese (zh)
Inventor
杨海东
巴姗姗
黄坤山
Current Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Original Assignee
Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Foshan Guangdong University CNC Equipment Technology Development Co. Ltd
Priority date: 2020-07-24
Filing date: 2020-07-24
Publication date: 2020-11-06
Application filed by Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute, Foshan Guangdong University CNC Equipment Technology Development Co. Ltd filed Critical Foshan Nanhai Guangdong Technology University CNC Equipment Cooperative Innovation Institute
Priority to CN202010723829.XA
Publication of CN111898668A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small target object detection method based on deep learning, which overcomes problems of existing small-target detection methods such as low detection efficiency and poor accuracy. First, images containing no small target objects are extracted from the COCO data set, resized, and stitched together; the stitched images and the COCO images that do contain small target objects form a new data set, which is divided into a training set and a test set at a ratio of 4:1. Next, the basic feature extraction network of Faster-RCNN is modified to perform feature fusion. Candidate regions are then selected from each level of fused features through an RPN. The improved network is trained on the training set to obtain a trained model, and finally the test set is input into the trained model for target detection.

Description

Small target object detection method based on deep learning
Technical Field
The invention relates to the technical field of target detection, in particular to a small target object detection method based on deep learning.
Background
Object detection is a fundamental computer vision task that combines two sub-tasks, object localization and recognition: it aims to find the multiple objects present against the complex background of an image, provide an accurate bounding box for each object, and judge the category of the object inside the box. Target detection technology is widely applied in daily life, for example in target tracking and recognition, face recognition, text detection, pedestrian detection, medical diagnosis and intelligent monitoring systems, and as a basic task its rapid development in recent years has promoted progress in other vision tasks. Although deep-learning-based methods work well on generic target detection data sets, they still do not solve small target detection well, mainly because small target detection suffers from two problems:
(1) Insufficient information: the target occupies a very small area in the image, so the information that the pixels of the corresponding region can convey is very limited.
(2) Scarce data: few images in the data set contain small objects, which leaves the whole training set unbalanced across scales. For example, in the COCO data set, although small, medium and large objects account for roughly 42%, 34% and 24% of all objects respectively, only about 52% of the images contain small objects, whereas 71% and 83% of the images contain medium and large objects respectively. In other words, in some images most objects are small, yet only about half of the images contain any small object at all; this severe imbalance during training causes small targets to be detected with much lower accuracy than medium and large objects.
Small targets exist in small numbers in ordinary images, and also appear in images captured by unmanned-aerial-vehicle cameras, communication-base-station cameras and other image capturing devices mounted at greater heights; research on small target detection is therefore very important for analyzing and exploiting such images. Accordingly, further improvement is needed in the art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a small target object detection method based on deep learning.
The purpose of the invention is realized by the following technical scheme:
a small target object detection method based on deep learning mainly comprises the following specific steps:
step S1: extracting an image without a small target object based on the COCO data set, splicing the image after adjusting the size of the image, forming a new data set by the spliced image and the image with the small target object in the COCO data set, and carrying out image splicing according to the ratio of 4: a scale of 1 divides the data set into a training set and a test set.
Step S2: and modifying the basic feature extraction network of the Faster-RCNN to perform feature fusion.
Step S3: and selecting a candidate region through the RPN according to each level of fusion characteristics after fusion.
Step S4: and inputting the training images into an improved network for training, and constructing a loss function according to target classification and regression.
Step S5: and repeatedly selecting the training picture until the loss function is converged and storing the training model.
Step S6: and inputting the test set into the trained model for testing.
Further, in step S1, the images in the COCO data set that contain no small target objects are resized and the resized images are stitched. This addresses the imbalance of target-object sizes during training: the stitched image has the same size as a regular image, and in this way large and medium objects are reduced to medium and small objects, balancing the distribution of objects of different scales during training.
Furthermore, in the image stitching process, k regular images of uniform resolution (size W × H) are scaled by nearest-neighbour interpolation and then combined into a stitched image. To preserve the properties of the original images, each scaled image keeps the original W : H aspect ratio and is resized to (W/√k) × (H/√k); in particular, when k = 1 the stitched image reduces to a regular image.
Further, the nearest-neighbour interpolation process is as follows: assume the original image has pixel size W × H and the scaled image has pixel size w × h, and each pixel of the original image has integer coordinates. A pixel (x, y) of the scaled image corresponds to the point (W/w · x, H/h · y) in the original image; because of the scaling, this point is not necessarily integer-valued, so its coordinates are rounded to integers, and, writing g for the pixel value, g(x, y) = g([W/w · x], [H/h · y]).
As a preferred embodiment of the present invention, in step S2 the basic feature extraction network module of Faster-RCNN is the residual network ResNet-50, comprising input, conv1, max pooling, conv2_x, conv3_x, conv4_x and conv5_x. The conv1 layer contains one convolution with a 7 × 7 kernel and stride 2; the conv2_x, conv3_x, conv4_x and conv5_x layers contain 3, 4, 6 and 3 residual blocks respectively; each residual block contains three convolution layers with kernel sizes 1 × 1, 3 × 3 and 1 × 1 in sequence. The 3 × 3 convolution layer of the first residual block of the conv3_x, conv4_x and conv5_x layers has stride 2, so that downsampling reduces the resolution while the depth increases; all remaining convolution layers have stride 1.
As a preferred scheme of the present invention, in step S2 the improved Faster-RCNN feature extraction network adopts, on top of the basic feature extraction network, a top-down feature fusion structure with lateral connections. In the basic feature extraction network module, although the top-level features carry highly abstract semantic information, repeated pooling and downsampling make them insensitive to features such as object edges and geometric information, so small target objects are harder to characterize. Notably, shallow feature maps usually have high resolution and rich geometric detail, while top-level feature maps have stronger semantic abstraction and robustness to object pose and position changes but lower resolution. Therefore, on the basis of the bottom-up network structure, a top-down feature fusion structure with lateral connections is adopted: the top-level feature maps are enlarged in resolution by upsampling and merged with the shallow feature maps to generate feature maps that are both high-resolution and semantically rich. After fused feature maps of the different levels are obtained through feature fusion, a convolution with a 3 × 3 kernel is applied to the fused feature map of each level to remove the aliasing effect caused by upsampling.
Further, the upsampling enlarges the spatial size of the deep features through bilinear interpolation. Bilinear interpolation performs one-dimensional linear interpolation twice, once along the u direction and once along the v direction. Suppose the point (u0, v0) of the original image corresponding to a pixel of the new image has non-integer coordinates u0 and v0; it must then fall among four pixels of the original image, namely (u′, v′), (u′, v′+1), (u′+1, v′) and (u′+1, v′+1). First, one-dimensional linear interpolation between (u′, v′) and (u′+1, v′) gives g(u0, v′); then one-dimensional linear interpolation between (u′, v′+1) and (u′+1, v′+1) gives g(u0, v′+1); finally, one-dimensional linear interpolation between the obtained points (u0, v′) and (u0, v′+1) gives g(u0, v0).
As a preferred embodiment of the present invention, in step S3 the obtained feature maps are input into the RPN to locate candidate targets: a Softmax binary classifier judges whether each obtained candidate target belongs to the foreground or the background, while a bounding-box regressor corrects the positions of the candidate targets, thereby obtaining target candidate regions. Finally, the final feature maps and the candidate regions generated by the RPN are sent into the Fast-RCNN network to realize the classification and regression of the targets.
As a preferable aspect of the present invention, in step S4 a loss function is established based on target classification and regression. The classification loss is defined as:

L_cls(p_i, p_i*) = -log[p_i* · p_i + (1 - p_i*)(1 - p_i)]

For regression of the bounding box, the smooth-L1 loss is adopted, defined as:

L_reg(t_i, t_i*) = smooth_L1(t_i - t_i*), with smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise

The loss function is therefore:

L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)

wherein N_cls is the number of anchor boxes used in training the RPN; p_i is the probability that anchor i is predicted as a target, with p_i* = 1 when the prediction result is a positive sample and p_i* = 0 when it is a negative sample; N_reg is the number of anchor points; t_i is the offset predicted in the RPN training phase and t_i* is the offset relative to the real box; λ is a balance parameter between the two terms.
As a preferred scheme of the present invention, in step S5, when pictures are repeatedly selected for training, the loss in the current iteration is used as feedback to adaptively determine the input selection of the next iteration. In the current iteration t, if the loss of small objects is negligible, i.e. the small-target loss ratio r_t = L_t^small / L_t is less than a certain threshold, the input of iteration t+1 is a stitched image; otherwise the input remains a regular image, as in the default setting.
The working process and principle of the invention are as follows: the invention discloses a small target object detection method based on deep learning, which overcomes problems of existing small-target detection methods such as low detection efficiency and poor accuracy. First, images containing no small target objects are extracted from the COCO data set, resized and stitched; the stitched images and the COCO images containing small target objects form a new data set, which is divided into a training set and a test set at a ratio of 4:1. Next, the basic feature extraction network of Faster-RCNN is modified to perform feature fusion; candidate regions are then selected from each level of fused features through an RPN; the improved network is trained on the training set to obtain a trained model; finally, the test set is input into the trained model for target detection.
Compared with the prior art, the invention also has the following advantages:
(1) The small target object detection method based on deep learning provided by the invention uses stitched images, turning some large and medium objects into medium and small objects and balancing the distribution of objects of different scales during training.
(2) The method improves the basic feature extraction network of Faster-RCNN, fusing shallow and deep features to generate feature maps that are both high-resolution and semantically rich, and applying a convolution to the fused features to remove the aliasing effect caused by upsampling.
(3) During training, the method uses the loss in the current iteration as feedback to adaptively determine the input of the next iteration: the small-target loss ratio decides whether the next input is a regular image or a stitched image, which improves the accuracy of small target detection.
Drawings
Fig. 1 is an overall flowchart of a small target object detection method based on deep learning according to the present invention.
Fig. 2 is a schematic diagram of an improved feature extraction network provided by the present invention.
FIG. 3 is a flow chart of input image class selection in a training process provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and examples.
Example 1:
as shown in fig. 1 to 3, the present embodiment discloses a small target object detection method based on deep learning, which includes the following steps:
step 1: extracting images containing no small target objects based on the COCO data set, resizing and stitching them, forming a new data set from the stitched images and the COCO images containing small target objects, and dividing the data set into a training set and a test set at a ratio of 4:1;
step 2: modifying the basic feature extraction network of Faster-RCNN to perform feature fusion;
step 3: selecting candidate regions from each level of fused features through an RPN;
step 4: inputting the training images into the improved network for training, and constructing a loss function according to target classification and regression;
step 5: repeatedly selecting training pictures until the loss function converges, and saving the trained model;
step 6: inputting the test set into the trained model for testing.
In step 1, the images in the COCO data set that contain no small target objects are resized, and the resized images are stitched together. This addresses the imbalance of target-object sizes during training: the stitched image has the same size as a regular image, and in this way large and medium objects are reduced to medium and small objects, balancing the distribution of objects of different scales during training. In regular images, objects may be blurred by photographic problems such as defocus or motion blur; although resizing a regular image also shrinks its medium-sized and larger objects, their outlines and details remain clearer than those of native small objects.
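As an illustration of this data preparation, the sketch below separates COCO images by whether they contain a small object and performs the 4:1 split; it assumes pycocotools and a local annotation file named instances_train2017.json (both our assumptions), and follows the COCO convention that objects with area under 32 × 32 count as small. The stitching of the images without small objects is sketched further below.

```python
import random
from pycocotools.coco import COCO

SMALL_AREA = 32 * 32  # COCO convention for "small" objects

coco = COCO("instances_train2017.json")  # assumed local annotation file
with_small, without_small = [], []
for img_id in coco.getImgIds():
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    if any(a["area"] < SMALL_AREA for a in anns):
        with_small.append(img_id)    # kept as-is in the new data set
    else:
        without_small.append(img_id)  # pool of images to resize and stitch

# The stitched results plus the images that already contain small objects
# form the new data set, divided 4:1 into training and test sets.
new_dataset = with_small + without_small
random.shuffle(new_dataset)
split = int(len(new_dataset) * 4 / 5)
train_ids, test_ids = new_dataset[:split], new_dataset[split:]
```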
In the image stitching process, k regular images of uniform resolution (size W × H) are scaled by nearest-neighbour interpolation and then combined into a stitched image. To preserve the properties of the original images, each scaled image keeps the original W : H aspect ratio and is resized to (W/√k) × (H/√k); in particular, when k = 1 the stitched image reduces to a regular image.
The principle of nearest-neighbour interpolation is as follows: assume the original image has pixel size W × H and the scaled image has pixel size w × h, and each pixel of the original image has integer coordinates. A pixel (x, y) of the scaled image corresponds to the point (W/w · x, H/h · y) in the original image; because of the scaling, this point is not necessarily integer-valued, so its coordinates are rounded to integers, and, writing g for the pixel value, g(x, y) = g([W/w · x], [H/h · y]).
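A minimal sketch of this scaling-and-stitching for k = 4 follows, assuming the inputs are numpy arrays of one common size; the helper names are ours, not the patent's.

```python
import numpy as np

def nearest_resize(img, w, h):
    """Nearest-neighbour scaling: g(x, y) = g(round(W/w * x), round(H/h * y))."""
    H_src, W_src = img.shape[:2]
    xs = np.minimum((np.arange(w) * W_src / w).round().astype(int), W_src - 1)
    ys = np.minimum((np.arange(h) * H_src / h).round().astype(int), H_src - 1)
    return img[ys[:, None], xs[None, :]]

def stitch4(imgs):
    """Scale 4 regular W x H images to W/2 x H/2 (aspect ratio preserved,
    since k = 4 gives a 1/sqrt(k) = 1/2 factor) and tile them into one
    stitched image of the same size as a regular image."""
    H, W = imgs[0].shape[:2]
    tiles = [nearest_resize(im, W // 2, H // 2) for im in imgs]
    top = np.concatenate(tiles[:2], axis=1)
    bottom = np.concatenate(tiles[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)
```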
In step 2, the basic feature extraction network module of Faster-RCNN is the residual network ResNet-50, comprising input, conv1, max pooling, conv2_x, conv3_x, conv4_x and conv5_x. The conv1 layer contains one convolution with a 7 × 7 kernel and stride 2; the conv2_x, conv3_x, conv4_x and conv5_x layers contain 3, 4, 6 and 3 residual blocks respectively; each residual block contains three convolution layers with kernel sizes 1 × 1, 3 × 3 and 1 × 1 in sequence. The 3 × 3 convolution layer of the first residual block of the conv3_x, conv4_x and conv5_x layers has stride 2, so that downsampling reduces the resolution while the depth increases; all remaining convolution layers have stride 1.
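For reference, the same stage layout can be traced through torchvision's ResNet-50, a stand-in implementation not named by the patent (torchvision ≥ 0.13 assumed for the weights argument); layer1 to layer4 correspond to conv2_x to conv5_x.

```python
import torch
import torchvision

resnet = torchvision.models.resnet50(weights=None)  # random weights suffice here
x = torch.randn(1, 3, 800, 800)

x = resnet.conv1(x)     # 7x7 conv, stride 2
x = resnet.bn1(x)
x = resnet.relu(x)
x = resnet.maxpool(x)   # max pooling, stride 2
c2 = resnet.layer1(x)   # conv2_x: 3 residual blocks, stride 1
c3 = resnet.layer2(c2)  # conv3_x: 4 blocks, 3x3 conv of first block has stride 2
c4 = resnet.layer3(c3)  # conv4_x: 6 blocks, stride 2 in first block
c5 = resnet.layer4(c4)  # conv5_x: 3 blocks, stride 2 in first block
print(c2.shape, c3.shape, c4.shape, c5.shape)
# (1, 256, 200, 200) (1, 512, 100, 100) (1, 1024, 50, 50) (1, 2048, 25, 25)
```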
In step 2, the improved Faster-RCNN feature extraction network adopts, on top of the basic feature extraction network, a top-down feature fusion structure with lateral connections. In the basic feature extraction network module, although the top-level features carry highly abstract semantic information, repeated pooling and downsampling make them insensitive to features such as object edges and geometric information, so small target objects are harder to characterize. Notably, shallow feature maps usually have high resolution and rich geometric detail, while top-level feature maps have stronger semantic abstraction and robustness to object pose and position changes but lower resolution. Therefore, on the basis of the bottom-up network structure, a top-down feature fusion structure with lateral connections can be adopted: the top-level feature maps are enlarged in resolution by upsampling and merged with the shallow feature maps to generate feature maps that are both high-resolution and semantically rich. After the fused feature maps of different levels are obtained through feature fusion, a convolution with a 3 × 3 kernel is applied to the fused feature map of each level to remove the aliasing effect caused by upsampling; the specific structure is shown in fig. 2.
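A minimal PyTorch sketch of this top-down fusion with lateral connections follows; the 256-channel width of the fused maps and the module names are our assumptions, while the input channel counts match the ResNet-50 stages above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align the channel depth of c2..c5
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        # 3x3 convolutions smooth each fused map (removes upsampling aliasing)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats = (c2, c3, c4, c5), shallow to deep
        maps = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(maps) - 2, -1, -1):  # walk top-down
            maps[i] = maps[i] + F.interpolate(
                maps[i + 1], size=maps[i].shape[-2:],
                mode="bilinear", align_corners=False)
        return [sm(m) for sm, m in zip(self.smooth, maps)]
```

Feeding the (c2, c3, c4, c5) maps from the previous sketch through TopDownFusion() yields one fused, smoothed map per level.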
By upsampling, the spatial size of the deep features is enlarged through bilinear interpolation. Bilinear interpolation performs one-dimensional linear interpolation twice, once along the u direction and once along the v direction. Suppose the point (u0, v0) of the original image corresponding to a pixel of the new image has non-integer coordinates u0 and v0; it must then fall among four pixels of the original image, namely (u′, v′), (u′, v′+1), (u′+1, v′) and (u′+1, v′+1). First, one-dimensional linear interpolation between (u′, v′) and (u′+1, v′) gives g(u0, v′) = (1-α) · g(u′, v′) + α · g(u′+1, v′); then one-dimensional linear interpolation between (u′, v′+1) and (u′+1, v′+1) gives g(u0, v′+1) = (1-α) · g(u′, v′+1) + α · g(u′+1, v′+1); finally, one-dimensional linear interpolation between the obtained points (u0, v′) and (u0, v′+1) gives g(u0, v0) = (1-β) · g(u0, v′) + β · g(u0, v′+1). Therefore:

g(u0, v0) = (1-α)(1-β) · g(u′, v′) + α(1-β) · g(u′+1, v′) + β(1-α) · g(u′, v′+1) + α·β · g(u′+1, v′+1)

wherein α = u0 - u′ and β = v0 - v′.
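Transcribed directly into numpy, the interpolation reads as below (a sketch; treating u as the column index and clamping at the image border are our choices, which the derivation leaves unspecified).

```python
import numpy as np

def bilinear_sample(img, u0, v0):
    """g(u0,v0) = (1-a)(1-b) g(u',v') + a(1-b) g(u'+1,v')
                + b(1-a) g(u',v'+1) + a*b g(u'+1,v'+1)."""
    u, v = int(np.floor(u0)), int(np.floor(v0))
    a, b = u0 - u, v0 - v                # a = alpha, b = beta
    u1 = min(u + 1, img.shape[1] - 1)    # clamp right border
    v1 = min(v + 1, img.shape[0] - 1)    # clamp bottom border
    return ((1 - a) * (1 - b) * img[v, u] + a * (1 - b) * img[v, u1]
            + b * (1 - a) * img[v1, u] + a * b * img[v1, u1])
```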
In step 3, the feature maps obtained in step 2 are input into the RPN to locate candidate targets; a Softmax binary classifier judges whether each obtained candidate target belongs to the foreground or the background, while a bounding-box regressor corrects the positions of the candidate targets, thereby obtaining target candidate regions; finally, the final feature maps and the candidate regions generated by the RPN are sent into the Fast-RCNN network, finally realizing the classification and regression of the targets.
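At the usage level, this RPN-plus-Fast-RCNN pipeline over fused feature levels resembles torchvision's Faster R-CNN with a ResNet-50 FPN backbone; the sketch below uses that stock model as a stand-in for the patent's modified network (torchvision ≥ 0.13 assumed for the weights argument).

```python
import torch
import torchvision

# The RPN proposes candidate regions on every fused feature level; the
# Fast-RCNN head then classifies each proposal and regresses its box.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=91)  # 91 = COCO classes incl. background
model.eval()
with torch.no_grad():
    out = model([torch.rand(3, 600, 800)])[0]
print(out["boxes"].shape, out["scores"].shape)  # per-image detections
```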
In step 4, a loss function is established based on target classification and regression. The classification loss is defined as:

L_cls(p_i, p_i*) = -log[p_i* · p_i + (1 - p_i*)(1 - p_i)]

For regression of the bounding box, the smooth-L1 loss is adopted, defined as:

L_reg(t_i, t_i*) = smooth_L1(t_i - t_i*), with smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise

The loss function is therefore:

L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)

wherein N_cls is the number of anchor boxes used in training the RPN; p_i is the probability that anchor i is predicted as a target, with p_i* = 1 when the prediction result is a positive sample and p_i* = 0 when it is a negative sample; N_reg is the number of anchor points; t_i is the offset predicted in the RPN training phase and t_i* is the offset relative to the real box; λ is a balance parameter between the two terms.
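A PyTorch sketch of this loss is given below. For p_i* in {0, 1}, binary cross-entropy equals the log loss above; the value of λ and the exact normalizers are placeholders, since the text leaves them open.

```python
import torch
import torch.nn.functional as F

def rpn_loss(p, p_star, t, t_star, lam=10.0):
    """p: (N,) predicted objectness; p_star: (N,) 1 = positive, 0 = negative;
    t, t_star: (N, 4) predicted / ground-truth box offsets."""
    n_cls = p.numel()  # N_cls: anchors used for training
    l_cls = F.binary_cross_entropy(p, p_star, reduction="sum") / n_cls
    pos = p_star > 0   # the p_i* factor keeps regression to positives only
    n_reg = p.numel()  # N_reg: number of anchor points (placeholder choice)
    l_reg = F.smooth_l1_loss(t[pos], t_star[pos], reduction="sum") / n_reg
    return l_cls + lam * l_reg

# e.g. rpn_loss(torch.rand(256), (torch.rand(256) > 0.5).float(),
#               torch.randn(256, 4), torch.randn(256, 4))
```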
In step 5, when pictures are repeatedly selected for training, the loss in the current iteration is used as feedback to adaptively determine the input selection of the next iteration, as shown in fig. 3. In the current iteration t, if the loss of small objects is negligible, i.e. the small-target loss ratio r_t (defined below) is less than a certain threshold, the input of iteration t+1 is a stitched image; otherwise the input remains a regular image, as in the default setting.
The loss proportion of small objects is calculated as follows:

r_t = ( Σ_{o : A_o ≤ A_s} L_t^o ) / L_t

wherein A_o = w_o × h_o is the area of object o with width w_o and height h_o, the numerator sums the losses L_t^o of all small objects whose area A_o does not exceed A_s (here A_s = 1024), and L_t denotes the loss of all objects in the current image. The ratio r_t serves as feedback to guide the learning of the next iteration.
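This feedback rule can be sketched as follows, assuming the training step exposes per-object losses and areas; the threshold value is a placeholder for the "certain threshold" in the text, and the loader names are hypothetical.

```python
def next_input_kind(obj_losses, obj_areas, threshold=0.1, area_small=1024):
    """Return "stitched" when small objects contribute too little loss."""
    total = sum(obj_losses)
    small = sum(l for l, a in zip(obj_losses, obj_areas) if a <= area_small)
    r_t = small / total if total > 0 else 0.0  # small-target loss ratio
    return "stitched" if r_t < threshold else "regular"

# Inside the training loop (schematically):
#   losses, areas = train_step(batch)  # hypothetical helper
#   kind = next_input_kind(losses, areas)
#   batch = next(stitched_loader) if kind == "stitched" else next(regular_loader)
```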
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A small target object detection method based on deep learning is characterized by comprising the following steps:
step S1: extracting images containing no small target objects based on the COCO data set, resizing and stitching them, forming a new data set from the stitched images and the COCO images containing small target objects, and dividing the data set into a training set and a test set at a ratio of 4:1;
step S2: modifying a basic feature extraction network of the Faster-RCNN to perform feature fusion;
step S3: selecting candidate regions from each level of fused features through an RPN;
step S4: inputting the training image into an improved network for training, and constructing a loss function according to target classification and regression;
step S5: repeatedly selecting a training picture until the loss function is converged and storing the training model;
step S6: and inputting the test set into the trained model for testing.
2. The method for detecting small target objects based on deep learning of claim 1, wherein in step S1, the images in the COCO data set that do not include small target objects are resized and the resized images are stitched together; this addresses the imbalance of target-object sizes during training: the stitched image has the same size as a regular image, and in this way large and medium objects are reduced to medium and small objects, so that the distribution of objects of different scales during training is balanced.
3. The method for detecting the small target object based on the deep learning of claim 2, wherein in the image stitching process, k regular images of uniform resolution (size W × H) are scaled by nearest-neighbour interpolation and then combined into a stitched image; to preserve the properties of the original images, each scaled image keeps the original W : H aspect ratio and is resized to (W/√k) × (H/√k); in particular, when k = 1 the stitched image reduces to a regular image.
4. The deep learning-based small target object detection method according to claim 3, wherein the nearest-neighbour interpolation process is: first, assume the original image has pixel size W × H and the scaled image has pixel size w × h, with each pixel of the original image having integer coordinates; a pixel (x, y) of the scaled image corresponds to the point (W/w · x, H/h · y) in the original image; because of the scaling, this point is not necessarily integer-valued, so its coordinates are rounded to integers, and, writing g for the pixel value, g(x, y) = g([W/w · x], [H/h · y]).
5. The deep learning-based small target object detection method according to claim 1, wherein in step S2 the basic feature extraction network module of Faster-RCNN is the residual network ResNet-50, comprising input, conv1, max pooling, conv2_x, conv3_x, conv4_x and conv5_x, wherein the conv1 layer comprises one convolution with a 7 × 7 kernel and stride 2; the conv2_x, conv3_x, conv4_x and conv5_x layers respectively comprise 3, 4, 6 and 3 residual blocks; each residual block comprises three convolution layers with kernel sizes 1 × 1, 3 × 3 and 1 × 1 in sequence; the 3 × 3 convolution layer of the first residual block of the conv3_x, conv4_x and conv5_x layers has stride 2, in order to downsample so that the resolution decreases while the depth increases, and all remaining convolution layers have stride 1.
6. The method for detecting small target objects based on deep learning of claim 1, wherein in step S2 the improved Faster-RCNN feature extraction network adopts, on top of its basic feature extraction network, a top-down feature fusion structure with lateral connections; in the basic feature extraction network module, although the top-level features carry highly abstract semantic information, repeated pooling and downsampling make them insensitive to features such as object edges and geometric information, so small target objects are harder to characterize; notably, shallow feature maps usually have high resolution and rich geometric detail, while top-level feature maps have stronger semantic abstraction and robustness to object pose and position changes but lower resolution; therefore, on the basis of the bottom-up network structure, a top-down feature fusion structure with lateral connections is adopted: the top-level feature maps are enlarged in resolution by upsampling and merged with the shallow feature maps to generate feature maps that are both high-resolution and semantically rich; after fused feature maps of the different levels are obtained through feature fusion, a convolution with a 3 × 3 kernel is applied to the fused feature map of each level to remove the aliasing effect caused by upsampling.
7. The method for detecting the small target object based on the deep learning as claimed in claim 6, wherein the spatial size of the deep features is enlarged through bilinear interpolation by upsampling; bilinear interpolation performs one-dimensional linear interpolation twice, once along the u direction and once along the v direction; suppose the point (u0, v0) of the original image corresponding to a pixel of the new image has non-integer coordinates u0 and v0; it must then fall among four pixels of the original image, namely (u′, v′), (u′, v′+1), (u′+1, v′) and (u′+1, v′+1); first, one-dimensional linear interpolation between (u′, v′) and (u′+1, v′) gives g(u0, v′); then one-dimensional linear interpolation between (u′, v′+1) and (u′+1, v′+1) gives g(u0, v′+1); finally, one-dimensional linear interpolation between the obtained points (u0, v′) and (u0, v′+1) gives g(u0, v0).
8. The method for detecting small target objects based on deep learning of claim 1, wherein in step S3 the obtained feature maps are input into the RPN to locate candidate targets; a Softmax binary classifier judges whether each obtained candidate target belongs to the foreground or the background, while a bounding-box regressor corrects the positions of the candidate targets, thereby obtaining target candidate regions; finally, the final feature maps and the candidate regions generated by the RPN are sent into the Fast-RCNN network to finally realize the classification and regression of the targets.
9. The deep learning-based small target object detection method according to claim 1, wherein in step S4 a loss function is established according to target classification and regression, and the classification loss is defined as:

L_cls(p_i, p_i*) = -log[p_i* · p_i + (1 - p_i*)(1 - p_i)]

for regression of the bounding box, the smooth-L1 loss is adopted, defined as:

L_reg(t_i, t_i*) = smooth_L1(t_i - t_i*), with smooth_L1(x) = 0.5x² if |x| < 1 and |x| - 0.5 otherwise

the loss function is therefore:

L({p_i}, {t_i}) = (1/N_cls) · Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) · Σ_i p_i* · L_reg(t_i, t_i*)

wherein N_cls is the number of anchor boxes used in training the RPN; p_i is the probability that anchor i is predicted as a target, with p_i* = 1 when the prediction result is a positive sample and p_i* = 0 when it is a negative sample; N_reg is the number of anchor points; t_i is the offset predicted in the RPN training phase and t_i* is the offset relative to the real box; λ is a balance parameter between the two terms.
10. The method for detecting small target objects based on deep learning of claim 1, wherein in step S5, when pictures are repeatedly selected for training, the loss in the current iteration is used as feedback to adaptively determine the input selection of the next iteration; in the current iteration t, if the loss of small objects is negligible, i.e. the small-target loss ratio r_t = L_t^small / L_t is less than a certain threshold, the input of iteration t+1 is a stitched image; otherwise the input remains a regular image, as in the default setting.
CN202010723829.XA 2020-07-24 2020-07-24 Small target object detection method based on deep learning Pending CN111898668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010723829.XA CN111898668A (en) 2020-07-24 2020-07-24 Small target object detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010723829.XA CN111898668A (en) 2020-07-24 2020-07-24 Small target object detection method based on deep learning

Publications (1)

Publication Number Publication Date
CN111898668A true CN111898668A (en) 2020-11-06

Family

ID=73189910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010723829.XA Pending CN111898668A (en) 2020-07-24 2020-07-24 Small target object detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN111898668A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712500A (en) * 2020-12-28 2021-04-27 同济大学 Remote sensing image target extraction method based on deep neural network
CN112819008A (en) * 2021-01-11 2021-05-18 腾讯科技(深圳)有限公司 Method, device, medium and electronic equipment for optimizing instance detection network
CN112926637A (en) * 2021-02-08 2021-06-08 天津职业技术师范大学(中国职业培训指导教师进修中心) Method for generating text detection training set
CN112949520A (en) * 2021-03-10 2021-06-11 华东师范大学 Aerial photography vehicle detection method and detection system based on multi-scale small samples
CN112950481A (en) * 2021-04-22 2021-06-11 上海大学 Water bloom shielding image data collection method based on image mosaic network
CN113128564A (en) * 2021-03-23 2021-07-16 武汉泰沃滋信息技术有限公司 Typical target detection method and system based on deep learning under complex background
CN113487551A (en) * 2021-06-30 2021-10-08 佛山市南海区广工大数控装备协同创新研究院 Gasket detection method and device for improving performance of dense target based on deep learning
CN113628250A (en) * 2021-08-27 2021-11-09 北京澎思科技有限公司 Target tracking method and device, electronic equipment and readable storage medium
CN113869361A (en) * 2021-08-20 2021-12-31 深延科技(北京)有限公司 Model training method, target detection method and related device
CN113902024A (en) * 2021-10-20 2022-01-07 浙江大立科技股份有限公司 Small-volume target detection and identification method based on deep learning and dual-band fusion
CN116912604A (en) * 2023-09-12 2023-10-20 浙江大华技术股份有限公司 Model training method, image recognition device and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110930310A (en) * 2019-12-09 2020-03-27 中国科学技术大学 Panoramic image splicing method
CN111144398A (en) * 2018-11-02 2020-05-12 银河水滴科技(北京)有限公司 Target detection method, target detection device, computer equipment and storage medium
CN112148812A (en) * 2019-06-26 2020-12-29 丰图科技(深圳)有限公司 Method, device and equipment for extracting road center line and storage medium thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN111144398A (en) * 2018-11-02 2020-05-12 银河水滴科技(北京)有限公司 Target detection method, target detection device, computer equipment and storage medium
CN112148812A (en) * 2019-06-26 2020-12-29 丰图科技(深圳)有限公司 Method, device and equipment for extracting road center line and storage medium thereof
CN110930310A (en) * 2019-12-09 2020-03-27 中国科学技术大学 Panoramic image splicing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUKANG CHEN et al.: "Stitcher: Feedback-driven Data Provider for Object Detection", arXiv, pages 1-7 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712500A (en) * 2020-12-28 2021-04-27 同济大学 Remote sensing image target extraction method based on deep neural network
CN112819008A (en) * 2021-01-11 2021-05-18 腾讯科技(深圳)有限公司 Method, device, medium and electronic equipment for optimizing instance detection network
CN112926637A (en) * 2021-02-08 2021-06-08 天津职业技术师范大学(中国职业培训指导教师进修中心) Method for generating text detection training set
CN112949520B (en) * 2021-03-10 2022-07-26 华东师范大学 Aerial photography vehicle detection method and detection system based on multi-scale small samples
CN112949520A (en) * 2021-03-10 2021-06-11 华东师范大学 Aerial photography vehicle detection method and detection system based on multi-scale small samples
CN113128564A (en) * 2021-03-23 2021-07-16 武汉泰沃滋信息技术有限公司 Typical target detection method and system based on deep learning under complex background
CN112950481A (en) * 2021-04-22 2021-06-11 上海大学 Water bloom shielding image data collection method based on image mosaic network
CN113487551A (en) * 2021-06-30 2021-10-08 佛山市南海区广工大数控装备协同创新研究院 Gasket detection method and device for improving performance of dense target based on deep learning
CN113487551B (en) * 2021-06-30 2024-01-16 佛山市南海区广工大数控装备协同创新研究院 Gasket detection method and device for improving dense target performance based on deep learning
CN113869361A (en) * 2021-08-20 2021-12-31 深延科技(北京)有限公司 Model training method, target detection method and related device
CN113628250A (en) * 2021-08-27 2021-11-09 北京澎思科技有限公司 Target tracking method and device, electronic equipment and readable storage medium
CN113902024A (en) * 2021-10-20 2022-01-07 浙江大立科技股份有限公司 Small-volume target detection and identification method based on deep learning and dual-band fusion
CN113902024B (en) * 2021-10-20 2024-06-04 浙江大立科技股份有限公司 Small-volume target detection and identification method based on deep learning and dual-band fusion
CN116912604A (en) * 2023-09-12 2023-10-20 浙江大华技术股份有限公司 Model training method, image recognition device and computer storage medium
CN116912604B (en) * 2023-09-12 2024-01-16 浙江大华技术股份有限公司 Model training method, image recognition device and computer storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination