CN110782430A - Small target detection method and device, electronic equipment and storage medium - Google Patents

Small target detection method and device, electronic equipment and storage medium

Info

Publication number
CN110782430A
Authority
CN
China
Prior art keywords
image
layer
dimension
shared
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910933275.3A
Other languages
Chinese (zh)
Inventor
徐明亮
吕培
崔丽莎
姜晓恒
张晨民
闫杰
李丙涛
王彦辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Jinhui Computer System Engineering Co Ltd
Original Assignee
Zhengzhou Jinhui Computer System Engineering Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Jinhui Computer System Engineering Co Ltd filed Critical Zhengzhou Jinhui Computer System Engineering Co Ltd
Priority to CN201910933275.3A priority Critical patent/CN110782430A/en
Publication of CN110782430A publication Critical patent/CN110782430A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of target detection, and in particular to a small target detection method and device, electronic equipment and a storage medium. The detection method comprises the following steps: extracting the characteristics of an original image to obtain an image to be detected; processing the image to be detected through a context sensing module, which keeps the spatial resolution of the image to be detected, expands the receptive field and outputs a plurality of characteristic images with different receptive fields; predicting the size of a bounding box according to each characteristic image; and outputting the bounding box. According to the embodiment of the invention, the context sensing module expands the receptive field while keeping the resolution unchanged and senses different context information, thereby capturing target information at different scales and solving the technical problem that the underlying feature map lacks semantic information.

Description

Small target detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of target detection, in particular to a small target detection method and device, electronic equipment and a storage medium.
Background
Inspection of workpiece surfaces using computer vision techniques is an important task in industrial manufacturing. Because some surface defects of a workpiece are extremely tiny, they pose great challenges for detection. In recent years, although the performance of general target detection has improved significantly, the detection of small targets still faces great challenges.
Currently, small target detection networks can be roughly divided into two types: bottom-up approaches and top-down approaches. Among them, bottom-up structures (SSD, MS-CNN, etc.) use a traditional forward-propagation network and continue to down-sample until the feature map becomes very small (e.g., 1 × 1), by which point small target information has been substantially lost. Such methods therefore predict small targets directly on the larger, lower-level feature maps, but the effect is not ideal because the underlying feature maps lack semantic information.
Disclosure of Invention
In order to solve the above technical problems, an object of the present invention is to provide a method, an apparatus, an electronic device and a storage medium for detecting a small target, wherein the adopted technical solution is as follows:
in a first aspect, an embodiment of the present invention provides a method for detecting a small target, where the method includes the following steps:
extracting the characteristics of the original image to obtain an image to be detected;
processing the image to be detected through a context sensing module, keeping the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images of different receptive fields;
predicting the size of a bounding box according to each characteristic image;
and outputting the bounding box.
Further, the context sensing module comprises a plurality of sensing branches, each sensing branch comprising a 1 × 1 convolution dimension reduction layer, an expansion convolutional layer of 3 × 3 convolutions and a 1 × 1 convolution dimension recovery layer, and the processing procedure of each branch comprises the following steps:
taking the image to be detected as an input feature map, and reducing the data dimension of the input feature map through a dimension reduction layer to obtain a dimension reduction image;
obtaining a perception image of an enlarged receptive field by using an expansion convolutional layer formed by a plurality of stacked convolutions with different expansion rates on the dimension-reduced image, wherein the expansion convolutional layer comprises at least one convolution layer with 3 x 3 convolution;
and restoring the data dimension of the perception image through a dimension restoring layer to obtain a characteristic image with the same data dimension as the input characteristic image.
Further, the context sensing module comprises a shared dimension reduction layer, a shared expansion convolutional layer and a plurality of dimension recovery layers, wherein the shared expansion convolutional layer comprises a plurality of shared branches, each shared branch is composed of at least one convolutional layer, and adjacent shared branches share the same convolutional layer; the shared branches share one shared dimension reduction layer, each shared branch corresponds to one dimension recovery layer, and the processing process comprises the following steps:
taking the image to be detected as an input feature map, and reducing the data dimension of the input feature map through the shared dimension reduction layer to obtain a dimension reduction image;
passing the dimension-reduced image through the shared expansion convolutional layer to obtain a plurality of perception images with enlarged receptive fields;
and restoring the data dimension of each perception image through a corresponding dimension restoring layer to obtain a plurality of characteristic images with the same data dimension as the input characteristic image.
Further, after obtaining the characteristic image with the same data dimension as the image to be detected, the method further comprises the following steps:
adding residual error connection into each characteristic image to obtain an optimized characteristic image;
and processing each optimized characteristic image through a corresponding enhancement layer to enhance the discrimination of the characteristics, obtaining an enhanced characteristic image.
In a second aspect, an embodiment of the present invention provides a small target detection apparatus, including:
the characteristic extraction module is used for extracting the characteristics of the original image to obtain an image to be detected;
the context sensing module is used for processing the image to be detected through the context sensing module, maintaining the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images with different receptive fields;
a predicted bounding box module, configured to predict a size of a bounding box according to each of the feature images; and
and the output module is used for outputting the bounding box.
Further, the context awareness module comprises a plurality of awareness branches, each awareness branch comprising:
the dimensionality reduction layer is used for taking the image to be detected as an input feature map, reducing the data dimensionality of the input feature map and obtaining a dimensionality reduction image;
the expansion convolutional layer is used for enabling the dimension reduction image to pass through an expansion convolutional layer formed by a plurality of stacked convolutions with different expansion rates to obtain a perception image with an expanded receptive field, and the expansion convolutional layer comprises at least one convolution layer with 3 x 3 convolution; and
and the dimension recovery layer is used for recovering the dimension of the perception image through the dimension recovery layer to obtain the characteristic image with the same data dimension as the input characteristic image.
Further, the context awareness module comprises:
the shared dimensionality reduction layer is used for taking the image to be detected as an input feature map, reducing the data dimensionality of the input feature map and obtaining a dimensionality reduction image;
a shared expansion convolutional layer comprising a plurality of shared branches, each shared branch being composed of at least one convolutional layer, adjacent shared branches sharing the same convolutional layer; the shared branches share one shared dimension reduction layer, and each shared branch corresponds to one dimension recovery layer; the shared expansion convolutional layer is used for obtaining a plurality of perception images with enlarged receptive fields from the dimension-reduced image; and
and the dimension recovery layer is used for recovering the data dimension of the perception image through the dimension recovery layer corresponding to the sharing branch to obtain a plurality of characteristic images with the same data dimension as the input characteristic image.
Further, the detection device further comprises:
the residual error connecting module is used for adding residual error connection into each characteristic image to obtain an optimized characteristic image; and
and the discrimination enhancement module is used for processing each optimized characteristic image through a corresponding enhancement layer, enhancing the discrimination of the characteristics to obtain an enhanced characteristic image.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any one of the small target detection methods described above.
In a fourth aspect, an embodiment of the present invention provides a storage medium in which computer-readable program instructions are stored; when the program instructions are executed by a processor, any one of the small target detection methods described above is implemented.
The invention has the following beneficial effects:
the embodiment of the invention discloses a small target detection method, which comprises the steps of firstly, extracting the characteristics of an original image to obtain an image to be detected; processing the image to be detected through a context sensing module, keeping the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images of different receptive fields; predicting the size of a bounding box according to each characteristic image; and outputting the enclosure frame. According to the embodiment of the invention, the context sensing module expands the receptive field and keeps the resolution unchanged, and different context information is sensed, so that target information with different scales is captured, and the technical problem that the underlying characteristic diagram lacks semantic information is solved.
Drawings
Fig. 1 is a flowchart of a method for detecting a small target according to an embodiment of the present invention;
FIG. 2 is a flowchart of a process of a context awareness module according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a basic structure of a context awareness module according to an embodiment of the present invention;
FIG. 4 is a block diagram of an instantiation of a context awareness module according to an embodiment of the present invention;
FIG. 5 is a block diagram of an instantiation of an optimal context awareness module according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a network architecture for small target detection according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a small target detection apparatus according to another embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means adopted by the present invention to achieve the predetermined objects and their effects, the small target detection method, apparatus, electronic device and storage medium according to the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments, covering their specific implementation, structure, features and effects. In the following description, different instances of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The following describes specific schemes of a small target detection method, a small target detection device, an electronic device, and a storage medium according to the present invention in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a method for detecting a small target according to an embodiment of the present invention is shown. The method obtains feature images by downsampling the image and expanding the receptive field of neurons through a context sensing module, and finally predicts bounding boxes on the feature images through convolution. The method comprises the following specific steps:
and S001, extracting the characteristics of the original image to obtain an image to be detected.
The original image is adjusted to the required resolution and input into the network constructed by the embodiment of the invention for forward propagation and feature extraction. The embodiment of the invention first applies three rounds of convolution and spatial pooling to the original image, down-sampling it to 1/8 of its original resolution (that is, performing three pooling operations) so as to preserve the spatial information of small targets. The resulting image to be detected retains a relatively large resolution, and the reduced number of parameters lowers the complexity of the model.
Specifically, for example, after the original image is adjusted to a 512 × 512 resolution image, the three convolution and spatial pooling operations are performed to obtain an image to be detected with a resolution of 64 × 64.
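The resolution arithmetic of these three pooling steps can be sketched as follows (the helper name is illustrative, not from the patent):

```python
def resolution_after_pooling(size: int, num_pools: int, stride: int = 2) -> int:
    """Spatial resolution after repeated stride-2 pooling (assumes exact division)."""
    for _ in range(num_pools):
        size //= stride
    return size

# Three pooling operations down-sample to 1/8 of the input resolution:
print(resolution_after_pooling(512, 3))  # 64
```

Each stride-2 pooling halves the resolution, so three of them give 512 / 2³ = 64, matching the example above.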
And S002, processing the image to be detected through a context sensing module, keeping the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images with different receptive fields.
Arranged after the spatial pooling in step S001, the context sensing module can enhance the semantic information of the high-resolution feature map and improve feature expression capability. The context sensing module comprises a plurality of sensing branches; each sensing branch comprises a dimension reduction layer, an expansion convolutional layer and a dimension recovery layer, and adjacent expansion convolutional layers add convolutions with different expansion rates. After the dimension of the image to be detected is reduced through the dimension reduction layer, the image is processed by each expansion convolutional layer, and each processed image passes through the corresponding dimension recovery layer to obtain a characteristic image. The multiple sensing branches thus yield multiple characteristic images with the same resolution but different receptive fields.
And step S003, predicting the size of the surrounding frame according to each characteristic image.
On the plurality of sensing branches output by the context sensing module, the class confidences and coordinate offsets of bounding boxes with different scales and proportions are predicted simultaneously using the expansion convolutional layers. For the multiple outputs of the context sensing module, bounding boxes of different scales are predicted: smaller target boxes are predicted for characteristic images with smaller receptive fields, and medium or large target boxes for characteristic images with larger receptive fields.
As shown in FIG. 4, in order of increasing receptive field, the 4 branches output by the context sensing module are responsible for predicting bounding boxes of different sizes: the scales are {15 × 15, 35 × 35, 76 × 76, 153 × 153} respectively, and the aspect ratios of the bounding boxes at each given scale are {1:1, 1:2, 2:1, 1:3, 3:1}. For example, for the 1:1 bounding box of 15 × 15, corresponding boxes of (15/√2) × (15√2) for ratio 1:2, (15√2) × (15/√2) for ratio 2:1, (15/√3) × (15√3) for ratio 1:3 and (15√3) × (15/√3) for ratio 3:1 would additionally be generated. Thus, 5 bounding boxes with different proportions are predicted with each pixel point on the feature map as a center. This multi-scale prediction helps improve target detection precision, and the small target detection performance is remarkably improved.
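A hedged sketch of this anchor-box generation, assuming the standard SSD convention of area-preserving boxes (for ratio w:h = 1:a the box is scale/√a wide and scale·√a tall; function and parameter names are illustrative):

```python
import math

def boxes_for_scale(scale, ratios=((1, 1), (1, 2), (2, 1), (1, 3), (3, 1))):
    """SSD-style anchor boxes: for ratio w:h = 1:a, use (scale/sqrt(a), scale*sqrt(a)),
    which keeps the box area equal to scale * scale."""
    boxes = []
    for rw, rh in ratios:
        a = rh / rw                      # aspect ratio h/w
        w = scale / math.sqrt(a)
        h = scale * math.sqrt(a)
        boxes.append((w, h))
    return boxes

for w, h in boxes_for_scale(15):
    print(round(w, 1), round(h, 1))      # all five boxes have area 15 * 15 = 225
```

Under this convention, the five boxes at scale 15 share the same area but span the five listed aspect ratios.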
Step S004, outputting the bounding box.
In summary, the embodiment of the present invention discloses a method for detecting a small target, which firstly extracts the characteristics of an original image to obtain an image to be detected; processes the image to be detected through a context sensing module, keeping the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images with different receptive fields; predicts the size of a bounding box according to each characteristic image; and outputs the bounding box. The context sensing module expands the receptive field while keeping the resolution unchanged and senses different context information, thereby capturing target information at different scales and solving the technical problem that the underlying feature map lacks semantic information.
As a preferred embodiment, in order to make the model more robust, operations such as data augmentation and hard example mining are applied in the training process to further improve the precision of small target detection. The objective function of the whole model training is:

L(x, c, l, g) = (1/N) · (L_conf(x, c) + α · L_loc(x, l, g))

wherein x indicates the jaccard-overlap matching of prediction boxes with ground-truth boxes, c is the classification confidence, l holds the parameters of the prediction box (the center coordinates, width and height of the box), g holds the parameters of the ground-truth box, α is a weight, N is the number of bounding boxes matched to ground-truth boxes with jaccard overlap above the 0.5 threshold, L_conf is the confidence loss and L_loc is the localization loss. The confidence loss is the softmax loss and the localization loss is the Smooth L1 loss function.
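A minimal numpy sketch of the loss components named here (softmax confidence loss, Smooth L1 localization loss, and their weighted combination normalized by N); the function names and the N = 0 convention are illustrative assumptions, not from the patent:

```python
import numpy as np

def smooth_l1(pred: np.ndarray, target: np.ndarray) -> float:
    """Smooth L1 (Huber) loss used for bounding-box localization."""
    d = np.abs(pred - target)
    per_elem = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)
    return float(per_elem.sum())

def softmax_conf_loss(logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy on the class confidences of one box."""
    z = logits - logits.max()                 # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def multibox_objective(conf_losses, loc_losses, n_matched: int, alpha: float = 1.0) -> float:
    """L = (1/N) * (L_conf + alpha * L_loc); conventionally zero when no box matches."""
    if n_matched == 0:
        return 0.0
    return (sum(conf_losses) + alpha * sum(loc_losses)) / n_matched
```

Note the Smooth L1 transition at |d| = 1: below it the loss is quadratic (0.5·d²), above it linear (d − 0.5), which keeps gradients bounded for badly localized boxes.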
As a preferred embodiment of the present invention, referring to fig. 2, a processing flow diagram of the context sensing module is shown. The context sensing module includes multiple sensing branches, each of which includes a dimension reduction layer, an expansion convolutional layer and a dimension recovery layer. The sensing branches can increase the receptive field of a neuron while maintaining the spatial resolution of the output characteristic image, that is, enhance its feature expression capability, and the characteristic images output by the multiple sensing branches perceive different context information, thereby capturing target information at different scales. The processing procedure of each branch of the context sensing module comprises the following steps:
step 201, reducing the data dimension of the image to be detected through a dimension reduction layer to obtain a dimension reduction image.
The image to be detected is taken as input and reduced in dimension through the convolution operation designed in the dimension reduction layer, so as to reduce the complexity of the model. For example, after an image to be detected with width, height and channel number w × h × c is input into the dimension reduction layer, the layer performs a 1 × 1 convolution operation that reduces the channel number from c to c/2, yielding the dimension-reduced image.
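As a sketch, a 1 × 1 convolution is just a per-pixel linear map over channels, which is why it changes the channel count but not the spatial resolution (shapes follow the w × h × c example above; the weights are random placeholders):

```python
import numpy as np

def conv1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 convolution as a per-pixel channel map.
    x: (h, w, c_in), weight: (c_in, c_out) -> (h, w, c_out)."""
    return x @ weight

h, w, c = 64, 64, 32
x = np.random.rand(h, w, c)
reduce_w = np.random.rand(c, c // 2)     # dimension reduction: c -> c/2
restore_w = np.random.rand(c // 2, c)    # dimension recovery: c/2 -> c
y = conv1x1(x, reduce_w)
z = conv1x1(y, restore_w)
print(y.shape, z.shape)                  # spatial size 64 x 64 is unchanged
```

The same operation, with the transposed weight shape, implements the dimension recovery layer described later.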
Step 202, passing the dimension reduction image through an expansion convolution layer to obtain a perception image with an expanded receptive field, wherein the expansion convolution layer comprises at least one convolution layer.
The dimension-reduced image is processed by n stacked dilated convolutions (dilation rates r = 1, 3, 5, …, 2n−1) in the expansion convolutional layer, so that context information is captured over a larger area.
Specifically, let n be the number of stacked k × k dilated convolutions (k > 1 is the convolution kernel size), with the dilation rate of the nth convolutional layer being 2n−1. The effective kernel size of the nth expanded convolution is then:

k′ = (k−1)(2n−1) + 1

In the invention, k is set to 3. Let R_n denote the effective receptive field of the nth dilated convolution, defined as:

R_n = R_(n−1) + (k′ − 1) · ∏ s_i

wherein R_(n−1) is the receptive field of layer n−1 and s_i is the stride of the ith convolutional layer (the product runs over the layers below layer n). When k = 3 and s_i = 1, the receptive fields of the stacked dilated convolutions are shown in table 1:
TABLE 1 Receptive fields of stacked dilated convolutions

Layer                1      2      3       4       5       n
Kernel (k)           3×3    3×3    3×3     3×3     3×3     3×3
Stride (s)           1      1      1       1       1       1
Dilation rate (r)    1      3      5       7       9       2n−1
Effective kernel     3×3    7×7    11×11   15×15   19×19   (4n−1)×(4n−1)
Receptive field      3      9      19      33      51      2n²+1
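The receptive-field row of Table 1 can be reproduced directly from the recurrence above (a small sketch; the closed form 2n² + 1 is checked against it):

```python
def stacked_dilated_receptive_fields(n: int, k: int = 3):
    """Receptive fields of n stacked k x k dilated convolutions with
    dilation rates 1, 3, 5, ..., 2n-1 and stride 1.
    Effective kernel of layer m: k' = (k - 1) * (2m - 1) + 1.
    Recurrence at stride 1:     R_m = R_(m-1) + k' - 1, with R_0 = 1."""
    fields, r = [], 1
    for m in range(1, n + 1):
        k_eff = (k - 1) * (2 * m - 1) + 1
        r = r + k_eff - 1
        fields.append(r)
    return fields

print(stacked_dilated_receptive_fields(5))   # [3, 9, 19, 33, 51], as in Table 1
```

Summing the increments 2 + 6 + 10 + … + (4n−2) gives 2n², hence the closed form R_n = 2n² + 1.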
As a preferred embodiment of the present invention, each n in the context sensing module has a corresponding effective receptive field, and the receptive field increases with n so as to obtain multi-scale information. Denoting the number of branches in the context sensing module by N, different values of N were set for comparison; the experimental results are shown in table 2:
TABLE 2 Comparison of the effects of different numbers of branches N

N    Precision (mAP, %)    Speed (FPS)
2    72.5                  32.7
3    75.6                  30.2
4    78.0                  26.5
5    78.2                  20.6
Considering the accuracy and speed of target detection together, the embodiment of the present invention sets N to 4.
And 203, restoring the data dimension of the perception image through a dimension restoring layer to obtain a characteristic image with the same data dimension as the image to be detected.
The number of channels of the feature image is restored through a convolution operation in the dimension recovery layer, and the characteristic image is output. The characteristic images output by the sensing branches have the same resolution but contain different semantic information, and serve as input for the later stage to predict the target object.
The image to be detected input by the context sensing module mainly comprises bottom layer information such as the boundary, lines and the like of an object in the image; the characteristic image output by the context sensing module learns more high-level semantic information through a network, and meanwhile, the spatial position information of the object is not damaged, so that the small target is favorably positioned and classified.
Referring to FIG. 3, a basic structure diagram of the context sensing module according to an embodiment of the present invention is shown. From the basic structure it can be seen that after an image to be detected with width, height and channel number w × h × c is input, it is reduced in dimension through a Conv 1 × 1 × (c/2) layer and processed by the 3 × 3 × (c/2) expansion convolutional layers; after the characteristic image with data dimension w × h × c is recovered through a Conv 1 × 1 × c convolution operation, a residual connection is added; finally, the discrimination of the features is increased through another Conv 1 × 1 × c convolution, and the characteristic image is output.
For ease of understanding, please refer to fig. 4, which illustrates a schematic diagram of an instantiation of the context sensing module according to an embodiment of the present invention. The context sensing module includes a plurality of sensing branches, and each sensing branch includes a dimension reduction layer, an expansion convolutional layer and a dimension recovery layer. The expansion convolutional layer in each sensing branch comprises at least one convolutional layer; convolutional layers are stacked within each expansion convolutional layer according to the expansion rate values, and a perception image with an enlarged receptive field is obtained after the dimension-reduced image is processed by the stacked convolutional layers.
Specifically, after an image to be detected with width, height and channel number w × h × c is input into the dimension reduction layer, the dimension reduction layer performs a Conv 1 × 1 × (c/2) convolution operation to reduce the channel number from c to c/2, obtaining a dimension-reduced image. The expansion convolutional layer in the first sensing branch comprises a 3 × 3 convolutional layer with an expansion rate of 1: a first perception image with an enlarged receptive field is obtained after the dimension-reduced image is processed by this layer. The expansion convolutional layer in the second, parallel sensing branch comprises a 3 × 3 convolutional layer with an expansion rate of 1 and a 3 × 3 convolutional layer with an expansion rate of 3; the dimension-reduced image is processed by these two layers in sequence to obtain a second perception image with an enlarged receptive field. The third expansion convolutional layer, in the third parallel sensing branch, comprises 3 × 3 convolutional layers with expansion rates of 1, 3 and 5, applied in sequence to obtain a third perception image with an enlarged receptive field. The fourth expansion convolutional layer, in the fourth parallel sensing branch, comprises 3 × 3 convolutional layers with expansion rates of 1, 3, 5 and 7, applied in sequence to obtain a fourth perception image with an enlarged receptive field. The perception image output by each sensing branch is passed through a Conv 1 × 1 × c operation in the dimension recovery layer, which restores the channel number from c/2 back to c, yielding the characteristic image. The four sensing branches output 4 characteristic images of the same resolution but containing different semantic information.
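A key property of the dilated 3 × 3 convolutions used in these branches is that, with "same" zero padding, the spatial resolution is unchanged at any dilation rate. This can be checked with a naive single-channel numpy sketch (an illustration, not the patent's implementation):

```python
import numpy as np

def dilated_conv2d(x: np.ndarray, kernel: np.ndarray, rate: int) -> np.ndarray:
    """Naive dilated convolution with 'same' zero padding on a single-channel map."""
    k = kernel.shape[0]
    pad = rate * (k - 1) // 2            # padding that preserves the resolution
    xp = np.pad(x, pad)
    h, w = x.shape
    span = rate * (k - 1) + 1            # effective kernel extent: (k-1)*r + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # sample k x k taps spaced `rate` pixels apart
            patch = xp[i:i + span:rate, j:j + span:rate]
            out[i, j] = float((patch * kernel).sum())
    return out

x = np.arange(64, dtype=float).reshape(8, 8)
for rate in (1, 3, 5, 7):                # the rates used by the four branches
    print(rate, dilated_conv2d(x, np.ones((3, 3)), rate).shape)  # always (8, 8)
```

Only the spacing of the sampled taps changes with the rate, so deeper branches see a wider context over the same 8 × 8 grid.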
Referring to fig. 5 as a preferred embodiment of the present invention, a schematic structural diagram of a preferred embodiment of instantiation of a perceptual context awareness module according to an embodiment of the present invention is shown, where a network structure of the perceptual context awareness module includes a shared dimension reduction layer, a shared expansion convolution layer, and multiple dimension recovery layers. The shared expansion convolution comprises a plurality of shared branches, each shared branch is composed of at least one convolution layer, and adjacent shared branches share the same convolution layer; the shared branches share one shared dimensionality reduction layer, and each shared branch corresponds to one dimensionality recovery layer.
Because convolutions with the same structure in the shared expansion convolutional layer are shared, the parameter quantity is reduced, the complexity of the model is reduced, and the detection speed is improved. The shared dimension reduction layer is used for reducing the data dimension of the image to be detected: the different shared branches of the context sensing module share one 1 × 1 convolution dimension reduction layer to obtain a dimension-reduced image. The dimension-reduced image is then input into the shared expansion convolutional layer, which comprises four shared branches. The first shared branch comprises a 3 × 3 convolutional layer with an expansion rate of 1; the second shared branch adds a 3 × 3 convolutional layer with an expansion rate of 3 after the 3 × 3 convolution of the first shared branch; the third shared branch adds a 3 × 3 convolutional layer with an expansion rate of 5 after the second shared branch; and the fourth shared branch adds a 3 × 3 convolutional layer with an expansion rate of 7 after the third shared branch, thereby realizing the sharing of the expansion convolutional layers. The dimension recovery layer processes the perception image output by the expansion convolutional layer through the corresponding dimension recovery layer to obtain a feature image with the same data dimension as that of the image to be detected.
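The claimed parameter saving can be illustrated with a rough count. This sketch is not from the patent: it assumes c input channels reduced to c/2 inside every branch, counts only convolution weights (no biases), and compares the parallel design of fig. 4 (10 separate dilated 3 × 3 layers, 4 separate reduction layers) with the shared design of fig. 5 (4 shared dilated layers, 1 shared reduction layer).

```python
# Sketch (illustrative, not from the patent): comparing convolution
# weight counts in the parallel design vs. the shared design.
def conv_params(k, c_in, c_out):
    # Weight count of a single k x k convolution, biases ignored.
    return k * k * c_in * c_out

def parallel_params(c):
    h = c // 2
    # 4 separate 1x1 dimension reduction layers,
    # 1 + 2 + 3 + 4 = 10 dilated 3x3 layers across the 4 branches,
    # 4 separate 1x1 dimension recovery layers.
    return (4 * conv_params(1, c, h)
            + 10 * conv_params(3, h, h)
            + 4 * conv_params(1, h, c))

def shared_params(c):
    h = c // 2
    # 1 shared 1x1 reduction, only 4 shared dilated 3x3 layers,
    # 4 per-branch 1x1 recovery layers.
    return (conv_params(1, c, h)
            + 4 * conv_params(3, h, h)
            + 4 * conv_params(1, h, c))

c = 256  # assumed channel count, for illustration only
print(parallel_params(c), shared_params(c))
```

Under these assumptions, sharing cuts the weight count by more than half, consistent with the reduced model complexity described above.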
Specifically, referring to fig. 5 again, after the image to be detected with width, height and channel number w × h × c is input into the shared dimensionality reduction layer as the input feature map, the shared dimensionality reduction layer performs 1 × 1 convolution operation:
Conv1 × 1 × (c/2)
and reducing the dimension, namely reducing the input feature map from c channels to c/2, to obtain a dimension-reduced image. The dimension-reduced image is input into the shared expansion convolutional layer, where it first passes through a 3 × 3 convolutional layer with an expansion rate of 1:
Conv3 × 3 (expansion rate 1)
to obtain a first perception image of the enlarged receptive field. The first perception image is processed by a 3 × 3 convolutional layer with an expansion rate of 3 to obtain a second perception image of the enlarged receptive field; the second perception image is processed by a 3 × 3 convolutional layer with an expansion rate of 5 to obtain a third perception image; and the third perception image is processed by a 3 × 3 convolutional layer with an expansion rate of 7 to obtain a fourth perception image, so that a plurality of perception images of the enlarged receptive field are obtained. Each perception image then passes through its corresponding dimension recovery layer: Conv1 × 1 × c restores the channel number, namely the data dimension, of the feature image from c/2 back to c, giving a plurality of feature images with the same data dimension as the input image. The context sensing module outputs 4 feature images of the same resolution but containing different semantic information.
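That every branch keeps the spatial resolution unchanged follows from the padding convention of dilated convolution. The following naive sketch (illustrative, not from the patent) shows that for a stride-1 k × k convolution with dilation d, zero padding of d · (k − 1) / 2 pixels preserves the input resolution for every expansion rate used here:

```python
# Sketch (illustrative): a naive 2D dilated convolution demonstrating
# that "same" zero padding of d * (k - 1) / 2 pixels keeps the spatial
# resolution unchanged for every dilation rate.
def dilated_conv2d(image, kernel, dilation):
    k = len(kernel)                     # kernel is k x k, k odd
    pad = dilation * (k - 1) // 2       # "same" padding for stride 1
    h, w = len(image), len(image[0])

    def px(y, x):                       # zero padding outside the image
        return image[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for i in range(k):
                for j in range(k):
                    s += kernel[i][j] * px(y + i * dilation - pad,
                                           x + j * dilation - pad)
            out[y][x] = s
    return out

image = [[1.0] * 8 for _ in range(8)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
for d in (1, 3, 5, 7):
    out = dilated_conv2d(image, identity, d)
    assert len(out) == 8 and len(out[0]) == 8   # resolution preserved
```

In a real network the loops would of course be replaced by a framework convolution; only the padding relationship matters here.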
As a preferred embodiment, referring to fig. 4 and 5 again, after obtaining the feature image with the same data dimension as the image to be detected, the method further includes the following steps:
In step 204, residual connection is added to each feature image to obtain an optimized feature image.
Residual connection performs pixel-level fusion of the input feature map with the perception image, as shown in fig. 3. Residual connection enables the feature images to fuse richer information, and also benefits back propagation when training the network.
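The pixel-level fusion of step 204 reduces to an elementwise addition of two tensors of identical w × h × c shape. A minimal sketch (illustrative, with a hypothetical toy feature map):

```python
# Sketch (illustrative): the residual connection of step 204 as a
# pixel-level addition of the input feature map and the feature image
# produced by a sensing branch; both must share the same shape.
def residual_fuse(input_map, feature_image):
    assert len(input_map) == len(feature_image)
    return [
        [
            [a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(plane_a, plane_b)
        ]
        for plane_a, plane_b in zip(input_map, feature_image)
    ]

x = [[[1.0, 2.0], [3.0, 4.0]]]          # 1-channel 2x2 toy feature map
f = [[[0.5, 0.5], [0.5, 0.5]]]
print(residual_fuse(x, f))              # [[[1.5, 2.5], [3.5, 4.5]]]
```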
In step 205, each optimized feature image is processed by a corresponding enhancement layer to increase the discrimination of the features. Specifically, each optimized feature image is further subjected to a 1 × 1 convolution (Conv1 × 1 × c) to increase the discrimination of the features.
For ease of understanding, take fig. 6 as an example. Fig. 6 is a schematic diagram of a network structure for small target detection according to an embodiment of the present invention. First, an image to be detected with a resolution of 512 × 512 is downsampled by Conv1, Conv2, Conv3 and Conv4 to obtain an image to be detected with a resolution of 64 × 64. This image is then processed by the context sensing module, which adopts 4 sensing branches and outputs feature images after the steps of dimension reduction, expansion convolution, dimension recovery, residual connection and discrimination enhancement. Finally, the size of the bounding box is predicted from the feature images, and the bounding box is output.
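The resolutions in fig. 6 can be traced with a short sketch. Only the endpoints are fixed by the description (512 × 512 in, 64 × 64 into the context sensing module); the per-stage strides below are an assumption chosen to match that overall downsampling factor of 8, not a statement of the actual Conv1-Conv4 configuration.

```python
# Sketch (illustrative): tracing spatial resolution through the fig. 6
# pipeline.  The assumed per-stage strides are hypothetical; only the
# overall factor of 8 (512 -> 64) is given by the description.
assumed_strides = {"Conv1": 2, "Conv2": 2, "Conv3": 2, "Conv4": 1}

res = 512
for name, stride in assumed_strides.items():
    res //= stride
print(res)          # input resolution of the context sensing module

# The context sensing module preserves resolution, so each of its
# 4 sensing branches outputs a feature image at the same resolution.
branch_resolutions = [res] * 4
assert branch_resolutions == [64, 64, 64, 64]
```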
Referring to fig. 7, a schematic structural diagram of a small target detection apparatus according to another embodiment of the present invention is provided. The detection apparatus includes a feature extraction module 701, a context sensing module 702, a predicted bounding box module 703 and an output module 704. Specifically: the feature extraction module 701 is configured to perform feature extraction on the original image to obtain an image to be detected. The context sensing module 702 is configured to process the image to be detected, maintain its spatial resolution, enlarge the receptive field, and output a plurality of feature images with different receptive fields. The predicted bounding box module 703 is configured to predict the size of the bounding box according to each feature image. The output module 704 is configured to output the bounding box. In the embodiment of the invention, the context sensing module enlarges the receptive field while keeping the resolution unchanged and senses different context information, so that target information of different scales is captured, solving the technical problem that the underlying feature map lacks semantic information.
As a preferred embodiment of the present invention, the context awareness module comprises a plurality of sensing branches, each of which comprises a dimension reduction layer, an expansion convolution layer, and a dimension restoration layer. Specifically, the dimension reduction layer is used for taking the image to be detected as an input characteristic diagram, reducing the data dimension of the image to be detected and obtaining the dimension reduction image. And the expansion convolutional layer is used for enabling the dimension reduction image to pass through an expansion convolutional layer formed by a plurality of convolutions with different expansion rates in a stacking mode to obtain a perception image with an expanded receptive field, and the expansion convolutional layer comprises at least one convolution layer with 3 x 3 convolution. The dimension restoring layer is used for restoring the dimension of the perception image through the dimension restoring layer to obtain the feature image with the same data dimension as the input feature image.
As another preferred embodiment of the present invention, the context sensing module includes a shared dimension reduction layer, a shared expansion convolutional layer, and a plurality of dimension recovery layers. Specifically, the shared dimension reduction layer is used for taking the image to be detected as an input feature map and reducing its data dimension; the different shared branches of the context sensing module share one 1 × 1 convolution dimension reduction layer to obtain a dimension-reduced image. The shared expansion convolutional layer comprises a plurality of shared branches, each shared branch is composed of at least one convolutional layer, and adjacent shared branches share the same convolutional layers; the shared branches share one shared dimension reduction layer, and each shared branch corresponds to one dimension recovery layer. The dimension-reduced image is processed by the shared expansion convolutional layer to obtain a plurality of perception images of the enlarged receptive field. The dimension recovery layer is used for recovering the data dimension of each perception image through the dimension recovery layer corresponding to its shared branch, obtaining a plurality of feature images with the same data dimension as the input feature image.
Specifically, in the present embodiment, the shared expansion convolutional layer includes four shared branches, where the first shared branch includes a 3 × 3 convolutional layer with an expansion rate of 1; the second shared branch adds a 3 × 3 convolutional layer with an expansion rate of 3 after the 3 × 3 convolution of the first branch; the third shared branch adds a 3 × 3 convolutional layer with an expansion rate of 5 after the second branch; and the fourth shared branch adds a 3 × 3 convolutional layer with an expansion rate of 7 after the third branch, thereby realizing the sharing of the expansion convolutional layers.
As another preferred embodiment of the present invention, the context sensing module further comprises a residual connection module and a discrimination enhancement module. Specifically, the residual connection module is configured to add residual connection to each feature image to obtain an optimized feature image. The discrimination enhancement module is configured to process each optimized feature image through a corresponding enhancement layer to enhance the discrimination of the features.
Referring to fig. 8, a schematic diagram of an electronic device is shown based on the same inventive concept, the electronic device includes a memory 801 and a processor 802, wherein:
the memory 801 is used to store instructions required by the processor 802 to perform tasks.
The processor 802 is configured to execute the instructions stored in the memory 801: performing feature extraction on the original image to obtain an image to be detected; processing the image to be detected through a context sensing module, keeping the spatial resolution of the image to be detected, enlarging the receptive field, and outputting a plurality of feature images with different receptive fields; predicting the size of the bounding box according to each feature image; and outputting the bounding box.
In other embodiments, the electronic device further comprises a communication interface 803 for enabling the electronic device to communicate with other devices or communication networks.
Preferably, the processor 802 is configured to execute the instructions stored in the memory 801 to perform the method for detecting the small target provided in any of the above embodiments.
The embodiment of the invention also provides a storage medium, wherein the storage medium can store a program readable by a computer, and the program executes the method for detecting the small target provided by any one of the above embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting a small target, the method comprising the steps of:
extracting the characteristics of the original image to obtain an image to be detected;
processing the image to be detected through a context sensing module, keeping the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images of different receptive fields;
predicting the size of a bounding box according to each characteristic image;
and outputting the bounding box.
2. The method for detecting the small target according to claim 1, wherein the context sensing module comprises a plurality of sensing branches, each sensing branch comprises a dimensionality reduction layer of 1 x 1 convolution, an expansion convolution layer of 3 x 3 convolution and a dimensionality recovery layer of 1 x 1 convolution, and the processing procedure of each branch comprises the following steps:
taking the image to be detected as an input feature map, and reducing the data dimension of the input feature map through a dimension reduction layer to obtain a dimension reduction image;
obtaining a perception image of an enlarged receptive field by using an expansion convolutional layer formed by a plurality of stacked convolutions with different expansion rates on the dimension-reduced image, wherein the expansion convolutional layer comprises at least one convolution layer with 3 x 3 convolution;
and restoring the data dimension of the perception image through a dimension restoring layer to obtain a characteristic image with the same data dimension as the input characteristic image.
3. The method of claim 1, wherein the context-aware module comprises a shared dimension-reduction layer, a shared expanded convolutional layer, and a plurality of dimension-recovery layers, and wherein the shared expanded convolutional layer comprises a plurality of shared branches, each shared branch is composed of at least one convolutional layer, and adjacent shared branches share the same convolutional layer; the multiple sharing branches share one sharing dimensionality reduction layer, each sharing branch corresponds to one dimensionality recovery layer, and the processing process comprises the following steps:
taking the image to be detected as an input feature map, and reducing the data dimension of the input feature map through the shared dimension reduction layer to obtain a dimension reduction image;
processing the dimension reduction image through the shared expansion convolutional layer to obtain a plurality of perception images of the enlarged receptive field;
and restoring the data dimension of each perception image through a corresponding dimension restoring layer to obtain a plurality of characteristic images with the same data dimension as the input characteristic image.
4. The method for detecting the small target according to claim 2 or 3, characterized by further comprising the following steps after obtaining the feature image with the same data dimension as the image to be detected:
adding residual error connection into each characteristic image to obtain an optimized characteristic image;
and processing each optimized characteristic image through a corresponding enhancement layer to enhance the discrimination of the characteristics and obtain an optimized characteristic image.
5. A device for detecting small objects, the device comprising:
the characteristic extraction module is used for extracting the characteristics of the original image to obtain an image to be detected;
the context sensing module is used for processing the image to be detected through the context sensing module, maintaining the spatial resolution of the image to be detected, expanding the receptive field and outputting a plurality of characteristic images with different receptive fields;
a predicted bounding box module, configured to predict a size of a bounding box according to each of the feature images; and
and the output module is used for outputting the bounding box.
6. The apparatus for detecting small objects according to claim 5, wherein the context awareness module comprises a plurality of awareness branches, each awareness branch comprising:
the dimensionality reduction layer is used for taking the image to be detected as an input feature map, reducing the data dimensionality of the input feature map and obtaining a dimensionality reduction image;
the expansion convolutional layer is used for enabling the dimension reduction image to pass through an expansion convolutional layer formed by a plurality of stacked convolutions with different expansion rates to obtain a perception image with an expanded receptive field, and the expansion convolutional layer comprises at least one convolution layer with 3 x 3 convolution; and
and the dimension recovery layer is used for recovering the dimension of the perception image through the dimension recovery layer to obtain the characteristic image with the same data dimension as the input characteristic image.
7. The apparatus for detecting small objects according to claim 6, wherein the context awareness module comprises:
the shared dimensionality reduction layer is used for taking the image to be detected as an input feature map, reducing the data dimensionality of the input feature map and obtaining a dimensionality reduction image;
a shared expansion convolutional layer comprising a plurality of shared branches, each shared branch being composed of at least one convolutional layer, adjacent shared branches sharing the same convolutional layer; the shared branches share one shared dimensionality reduction layer, and each shared branch corresponds to one dimensionality recovery layer; the shared expansion convolutional layer is used for processing the dimension reduction image to obtain a plurality of perception images of the enlarged receptive field;
and the dimension recovery layer is used for recovering the data dimension of the perception image through the dimension recovery layer corresponding to the sharing branch to obtain a plurality of characteristic images with the same data dimension as the input characteristic image.
8. The apparatus for detecting small objects according to claim 6 or 7, further comprising:
the residual error connecting module is used for adding residual error connection into each characteristic image to obtain an optimized characteristic image; and
and the discrimination enhancement module is used for processing each optimized feature image through a corresponding enhancement layer, enhancing the discrimination of the features and obtaining an optimized feature image.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the method of any one of claims 1 to 4.
10. A storage medium having computer-readable program instructions stored therein, which when executed by a processor implement the method of any one of claims 1 to 4.
CN201910933275.3A 2019-09-29 2019-09-29 Small target detection method and device, electronic equipment and storage medium Pending CN110782430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910933275.3A CN110782430A (en) 2019-09-29 2019-09-29 Small target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910933275.3A CN110782430A (en) 2019-09-29 2019-09-29 Small target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110782430A true CN110782430A (en) 2020-02-11

Family

ID=69384698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910933275.3A Pending CN110782430A (en) 2019-09-29 2019-09-29 Small target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110782430A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401210A (en) * 2020-03-11 2020-07-10 北京航天自动控制研究所 Method for improving small target detection stability based on template frame augmentation
CN111797834A (en) * 2020-05-28 2020-10-20 华南理工大学 Text recognition method and device, computer equipment and storage medium
CN112634174A (en) * 2020-12-31 2021-04-09 上海明略人工智能(集团)有限公司 Image representation learning method and system
WO2022213395A1 (en) * 2021-04-06 2022-10-13 中国科学院深圳先进技术研究院 Light-weighted target detection method and device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150046118A1 (en) * 2013-08-11 2015-02-12 Kla-Tencor Corporation Differential methods and apparatus for metrology of semiconductor targets
CN107610113A (en) * 2017-09-13 2018-01-19 北京邮电大学 The detection method and device of Small object based on deep learning in a kind of image
CN109271856A (en) * 2018-08-03 2019-01-25 西安电子科技大学 Remote sensing image object detection method based on expansion residual error convolution
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN110111313A (en) * 2019-04-22 2019-08-09 腾讯科技(深圳)有限公司 Medical image detection method and relevant device based on deep learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150046118A1 (en) * 2013-08-11 2015-02-12 Kla-Tencor Corporation Differential methods and apparatus for metrology of semiconductor targets
CN107610113A (en) * 2017-09-13 2018-01-19 北京邮电大学 The detection method and device of Small object based on deep learning in a kind of image
CN109271856A (en) * 2018-08-03 2019-01-25 西安电子科技大学 Remote sensing image object detection method based on expansion residual error convolution
CN109522966A (en) * 2018-11-28 2019-03-26 中山大学 A kind of object detection method based on intensive connection convolutional neural networks
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion
CN110111313A (en) * 2019-04-22 2019-08-09 腾讯科技(深圳)有限公司 Medical image detection method and relevant device based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zheng Dong; Li Xiangqun; Xu Xinzheng: "Vehicle and Pedestrian Detection Network Based on Lightweight SSD", Journal of Nanjing Normal University (Natural Science Edition), no. 01 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401210A (en) * 2020-03-11 2020-07-10 北京航天自动控制研究所 Method for improving small target detection stability based on template frame augmentation
CN111401210B (en) * 2020-03-11 2023-08-04 北京航天自动控制研究所 Method for improving small target detection stability based on template frame augmentation
CN111797834A (en) * 2020-05-28 2020-10-20 华南理工大学 Text recognition method and device, computer equipment and storage medium
CN111797834B (en) * 2020-05-28 2021-06-15 华南理工大学 Text recognition method and device, computer equipment and storage medium
CN112634174A (en) * 2020-12-31 2021-04-09 上海明略人工智能(集团)有限公司 Image representation learning method and system
CN112634174B (en) * 2020-12-31 2023-12-12 上海明略人工智能(集团)有限公司 Image representation learning method and system
WO2022213395A1 (en) * 2021-04-06 2022-10-13 中国科学院深圳先进技术研究院 Light-weighted target detection method and device, and storage medium

Similar Documents

Publication Publication Date Title
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
CN109740534B (en) Image processing method, device and processing equipment
CN108710830B (en) Human body 3D posture estimation method combining dense connection attention pyramid residual error network and isometric limitation
CN110188768B (en) Real-time image semantic segmentation method and system
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN110084274B (en) Real-time image semantic segmentation method and system, readable storage medium and terminal
CN110782420A (en) Small target feature representation enhancement method based on deep learning
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN111860398A (en) Remote sensing image target detection method and system and terminal equipment
WO2022152104A1 (en) Action recognition model training method and device, and action recognition method and device
US20220101539A1 (en) Sparse optical flow estimation
US11822900B2 (en) Filter processing device and method of performing convolution operation at filter processing device
CN111709415B (en) Target detection method, device, computer equipment and storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN113743521B (en) Target detection method based on multi-scale context awareness
KR102576157B1 (en) Method and apparatus for high speed object detection using artificial neural network
US20230039592A1 (en) Image sensor with integrated efficient multiresolution hierarchical deep neural network (dnn)
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium
CN117037215B (en) Human body posture estimation model training method, estimation device and electronic equipment
Xu et al. Scale-aware squeeze-and-excitation for lightweight object detection
CN114298289A (en) Data processing method, data processing equipment and storage medium
CN116468902A (en) Image processing method, device and non-volatile computer readable storage medium
CN111179212A (en) Method for realizing micro target detection chip integrating distillation strategy and deconvolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination