CN111985503A - Target detection method and device based on improved characteristic pyramid network structure - Google Patents
Target detection method and device based on improved characteristic pyramid network structure
- Publication number
- CN111985503A CN111985503A CN202010825554.0A CN202010825554A CN111985503A CN 111985503 A CN111985503 A CN 111985503A CN 202010825554 A CN202010825554 A CN 202010825554A CN 111985503 A CN111985503 A CN 111985503A
- Authority
- CN
- China
- Prior art keywords
- feature
- layer
- characteristic
- layers
- resolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
Abstract
According to the scheme of the invention, a final feature layer is obtained by performing fusion, selection and residual operations on the multi-scale features extracted from a backbone network, and target classification and position regression are performed on the final feature layer to obtain the final result. The method can correctly detect a target even when the target is partially occluded, and has high robustness and high performance.
Description
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a target detection method and device based on an improved characteristic pyramid network structure.
Background
In the target detection of pictures, the feature pyramid network is a structure capable of greatly improving the network performance at a low cost, and has been used in various mainstream target detection network structures due to its excellent performance.
In a feature pyramid network, higher-level features from the backbone network, which are coarser but carry stronger semantic information, are upsampled along a top-down pathway. These features are then merged through lateral connections with bottom-up feature maps of the same spatial size, enhancing the semantic information of the lower levels. However, such sequentially integrated features tend to focus on features at adjacent resolutions and dilute features at non-adjacent resolutions, limiting the performance of the network. How to better integrate the features of the backbone network has therefore become a hot issue in academia and industry.
Disclosure of Invention
In view of the above-mentioned deficiencies of the prior art, it is an object of the present invention to provide a target detection method and device based on an improved feature pyramid network structure.
The embodiment of the invention discloses a target detection method based on an improved feature pyramid network structure, which comprises the steps of obtaining a plurality of first feature layers with different sizes of a detected picture through a feature extraction network, and zooming the first feature layers to a preset resolution; fusing the plurality of zoomed first feature layers to obtain a second feature layer, and performing global average pooling and dimension reduction operations on the second feature layer to obtain a third feature layer; obtaining a selected fourth characteristic layer according to the weights of the third characteristic layer and different characteristic layers; processing the fourth characteristic layer in a residual error mode to obtain a pyramid characteristic diagram; and performing target classification and position regression on the pyramid feature map to output a detection frame.
In one possible embodiment, a bilinear interpolation method is adopted to perform amplification operation on a first feature layer with a resolution smaller than a preset resolution to reach the preset resolution; performing reduction operation on the first characteristic layer with the resolution greater than the preset resolution by adopting a maximum pooling method to achieve the preset resolution; wherein the preset resolution is centered in resolution among the plurality of first feature layers.
In one possible embodiment, global average pooling is used to calculate the statistics of the whole channel domain of the second feature layer, and a fully connected layer then uses these statistics to calculate the channel-domain dependency relationship, yielding the third feature layer.
In one possible embodiment, the third feature layer is expanded to n × d dimensions by 1 × 1 convolution, where n is the number of feature pyramid layers used and d is the dimension of each feature layer; and the weight of each channel is obtained through softmax operation, the weight is multiplied by the third feature layer to obtain the selected feature layer, and the feature layers are added pixel by pixel to obtain a finally selected fourth feature map.
In one possible embodiment, the pyramid feature map is obtained by scaling the fourth feature layer to a resolution corresponding to the plurality of first feature layers by an inverse operation of rescaling and adding the fourth feature layer to the corresponding first feature layer.
An object detection apparatus based on an improved feature pyramid network structure, comprising: the zooming module is used for acquiring a plurality of first feature layers with different sizes of the detected picture through a feature extraction network and zooming the first feature layers to a preset resolution; the fusion module is used for fusing the plurality of zoomed first feature layers to obtain a second feature layer, and performing global average pooling and dimension reduction operation on the second feature layer to obtain a third feature layer; the selection module is used for obtaining a selected fourth characteristic layer according to the weights of the third characteristic layer and different characteristic layers; the residual error module is used for processing the fourth characteristic layer in a residual error mode to obtain a pyramid characteristic diagram; and the position regression module is used for carrying out target classification and position regression on the pyramid feature map so as to output a detection frame.
In one possible embodiment, the scaling module is further configured to: performing amplification operation on the first characteristic layer with the resolution smaller than the preset resolution by adopting a bilinear interpolation method to achieve the preset resolution; performing reduction operation on the first characteristic layer with the resolution greater than the preset resolution by adopting a maximum pooling method to achieve the preset resolution; wherein the preset resolution is centered in resolution among the plurality of first feature layers.
In one possible embodiment, the fusion module is further configured to: and calculating the statistical information of the whole channel domain of the second characteristic layer by adopting global average pooling, and calculating the dependency relationship of the channel domain by adopting the full connection layer by utilizing the statistical information to obtain a third characteristic layer.
In one possible embodiment, the selection module is further configured to: amplifying the third characteristic layer to n x d dimension by 1 x 1 convolution, wherein n is the number of the adopted characteristic pyramid layers, and d is the dimension of each characteristic layer; and the weight of each channel is obtained through softmax operation, the weight is multiplied by the third feature layer to obtain the selected feature layer, and the feature layers are added pixel by pixel to obtain a finally selected fourth feature map.
In a possible embodiment, the residual module is further configured to scale the fourth feature layer to a resolution corresponding to the plurality of first feature layers by an inverse operation of the rescaling and add the fourth feature layer to the corresponding first feature layers to obtain the pyramid feature map.
A computer storage medium storing a computer program which, when executed, implements the method as hereinbefore described.
Compared with the prior art, the invention has the following beneficial effects:
the scheme of the invention has high robustness: by dynamically selecting features of the backbone network, the network gains a stronger ability to extract target features. Under extreme conditions, for example when target features are degraded by weak light or occlusion, the network enhances the target features by aggregating feature layers of different resolutions, which improves detection accuracy, so that a target can still be correctly detected even when partially occluded. The network performance is stronger, the number of introduced parameters is small, the inference speed is not significantly reduced, and the accuracy is improved.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
FIG. 2 is a network structure diagram of a convergence phase according to an embodiment of the present invention;
FIG. 3 is a network architecture diagram of a selection phase according to an embodiment of the present invention;
FIG. 4 is a diagram of a network structure in a residual error phase according to an embodiment of the present invention;
fig. 5(a) and 5(b) are comparison graphs of detection results on an optical distribution frame (ODF) port test data set according to an embodiment of the present invention.
Detailed Description
To facilitate understanding by those skilled in the art, the present invention is further described below with reference to embodiments and the accompanying drawings, which are not intended to limit the invention.
The method can dynamically select feature layers. Specifically, fusion, selection and residual operations are performed on the multi-scale features extracted from a backbone network to obtain a final feature layer, and target classification and position regression are performed on the final feature layer to obtain the final result. The method has high robustness and high performance in target detection, and can correctly detect a target even when the target is partially occluded.
Specifically, with reference to fig. 1, an embodiment of the present invention discloses a target detection method based on an improved feature pyramid network structure, including:
s101, a plurality of first feature layers with different sizes of the detected picture are obtained through a feature extraction network, and the first feature layers are scaled to a preset resolution.
The picture to be detected is input into a backbone network for feature extraction, where the backbone network, i.e. the feature extraction network, may be ResNet, VGGNet, or the like.
Next, referring to FIG. 2, the multi-scale features {C3, C4, C5, C6, C7} extracted from the backbone network, i.e. the plurality of first feature layers of different sizes, are scaled to the size of the middle layer C5: feature layers with resolution smaller than the C5 layer are enlarged by bilinear interpolation, and feature layers with resolution greater than the C5 layer are reduced by max pooling, yielding {R3, R4, R5, R6, R7}.
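The rescaling step above can be sketched in PyTorch as follows; this is a minimal illustration, and the function name, channel count, and level sizes are assumptions for demonstration, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def rescale_to_middle(features):
    """Scale every feature map (listed finest to coarsest) to the spatial
    size of the middle level: bilinear interpolation enlarges maps that are
    smaller than the middle level, max pooling shrinks maps that are larger."""
    mid = len(features) // 2
    target = features[mid].shape[-2:]  # (H, W) of the middle level, e.g. C5
    rescaled = []
    for f in features:
        if f.shape[-2:] == target:
            rescaled.append(f)
        elif f.shape[-1] < target[-1]:
            # coarser than the middle level -> enlarge by bilinear interpolation
            rescaled.append(F.interpolate(f, size=target, mode="bilinear",
                                          align_corners=False))
        else:
            # finer than the middle level -> shrink by max pooling
            rescaled.append(F.adaptive_max_pool2d(f, target))
    return rescaled

# Illustrative multi-scale features standing in for {C3..C7}
feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8, 4)]
rescaled = rescale_to_middle(feats)
assert all(r.shape == (1, 256, 16, 16) for r in rescaled)
```

All five rescaled maps {R3..R7} then share the resolution of the middle layer, so they can be added pixel by pixel in the fusion stage.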
And S102, fusing the scaled first feature layers to obtain a second feature layer, and performing global average pooling and dimension reduction operations on the second feature layer to obtain a third feature layer.
Referring to FIG. 2, the scaled features are then fused by a simple pixel-by-pixel addition to generate a fused feature layer R_s, i.e. the second feature layer. Global average pooling is then used to obtain the statistics of the whole channel domain, as shown in formula (1):

z_c = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} R_s^c(i, j) (1)

where H and W denote the height and width of the feature map, (i, j) denotes the pixel coordinates, z_c denotes the statistics of the c-th channel, and R_s^c is the c-th channel of the fused feature layer. To make full use of the information in z, a fully connected layer is used to calculate the dependency relationship of the channel domain while reducing the dimensionality to improve the efficiency of the network, giving the third feature layer p, as shown in formula (2):

p = F_fc(z) (2)
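The fusion and squeeze steps of S102 (formulas (1) and (2)) can be sketched as below; the reduced dimension `d_r` and the channel count are illustrative hyper-parameters of this sketch, not values specified by the patent.

```python
import torch
import torch.nn as nn

class FuseAndSqueeze(nn.Module):
    """Fuse rescaled layers by pixel-wise addition, then squeeze the result
    into a channel-dependency vector via global average pooling and a
    fully connected layer."""
    def __init__(self, channels=256, d_r=64):
        super().__init__()
        self.fc = nn.Linear(channels, d_r)  # dimensionality reduction, formula (2)

    def forward(self, rescaled):
        # pixel-by-pixel addition of the rescaled layers -> second layer R_s
        r_s = torch.stack(rescaled, dim=0).sum(dim=0)
        # global average pooling over H and W -> channel statistics z, formula (1)
        z = r_s.mean(dim=(-2, -1))
        # fully connected layer models channel-domain dependencies -> third layer p
        return r_s, self.fc(z)
```

A forward pass over five rescaled 16×16 maps with 256 channels returns R_s of shape (B, 256, 16, 16) and p of shape (B, d_r).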
And S103, obtaining the selected fourth characteristic layer according to the weights of the third characteristic layer and different characteristic layers.
In order to dynamically select the appropriate feature layer information, as shown in fig. 3, the network needs to adaptively assign weights to the different feature layers. To achieve this, the third feature layer p is first expanded to n × d dimensions by 1 × 1 convolution, where n is the number of feature pyramid layers used (set to 5 here) and d is the dimension of each feature layer. The weight of each channel is then obtained by a softmax operation and denoted A, that is:
A=softmax(conv(p)) (3)
after the weights of all channels are obtained, they are multiplied by the original rescaled features to obtain the selected feature layers, which are then added pixel by pixel to obtain the finally selected fourth feature layer q:

q = Σ_i A_i · R_i (4)

where R_i ∈ {R3, R4, R5, R6, R7}.
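The selection stage (formulas (3) and (4)) can be sketched as follows; the softmax is taken across the n pyramid levels so that, per channel, the weights over the levels sum to one. Module and parameter names are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class SelectLayers(nn.Module):
    """Expand p to n*d weights via 1x1 convolution, softmax across the n
    levels, and form the weighted pixel-by-pixel sum of the rescaled layers."""
    def __init__(self, n=5, d=256, d_r=64):
        super().__init__()
        self.expand = nn.Conv2d(d_r, n * d, kernel_size=1)  # formula (3)
        self.n, self.d = n, d

    def forward(self, p, rescaled):
        # p: (B, d_r) channel-dependency vector; treat it as a 1x1 feature map
        logits = self.expand(p[:, :, None, None])        # (B, n*d, 1, 1)
        logits = logits.view(-1, self.n, self.d, 1, 1)
        weights = torch.softmax(logits, dim=1)           # across levels, formula (3)
        stacked = torch.stack(rescaled, dim=1)           # (B, n, d, H, W)
        # weight each rescaled layer, then add pixel by pixel, formula (4)
        return (weights * stacked).sum(dim=1)
```

The output q has the same shape as each rescaled layer, ready for the residual stage.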
And S104, processing the fourth characteristic layer in a residual error mode to obtain a pyramid characteristic diagram.
Referring to fig. 4, the selected features enhance the original features in a residual manner, which makes the network train faster and the learned features more robust. Specifically, q is scaled by the inverse of the rescaling operation to the resolution of the corresponding first feature layer C_i and added to it, yielding the final pyramid feature map, as shown in formula (5):

P_i = Rescale_i(q) + C_i (5)
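The residual stage of formula (5) can be sketched as below, reusing the same bilinear/max-pooling pair as the forward rescaling; the function name is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def residual_pyramid(q, originals):
    """Rescale the selected map q back to each C_i's resolution (inverse of
    the earlier rescaling) and add it, giving P_i = Rescale_i(q) + C_i."""
    pyramid = []
    for c in originals:
        target = c.shape[-2:]
        if q.shape[-2:] == target:
            r = q
        elif q.shape[-1] < target[-1]:
            # C_i is finer than q -> enlarge q by bilinear interpolation
            r = F.interpolate(q, size=target, mode="bilinear", align_corners=False)
        else:
            # C_i is coarser than q -> shrink q by max pooling
            r = F.adaptive_max_pool2d(q, target)
        pyramid.append(r + c)  # residual addition, formula (5)
    return pyramid
```

Each P_i keeps the resolution of its C_i, so the resulting pyramid plugs into the classification and regression heads unchanged.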
and S105, performing target classification and position regression on the pyramid feature map to output a detection frame.
Taking the one-stage detection network FCOS as an example, its feature pyramid network structure is replaced with that of the present method, and the port detection data set is tested; the results are shown in fig. 5(a) and fig. 5(b). Because ODF ports are densely arranged, they often occlude one another, which makes detection of occluded ports difficult. Nevertheless, FCOS using the present method, shown in fig. 5(b), can still correctly detect partially occluded targets, whereas the original FCOS, shown in fig. 5(a), cannot; the detection effect is therefore better.
The embodiment of the invention also discloses a target detection device based on the improved characteristic pyramid network structure, which comprises a zooming module, a feature extraction module and a target detection module, wherein the zooming module is used for acquiring a plurality of first characteristic layers with different sizes of a detected picture through a characteristic extraction network and zooming the plurality of first characteristic layers to a preset resolution; the fusion module is used for fusing the plurality of zoomed first feature layers to obtain a second feature layer, and performing global average pooling and dimension reduction operation on the second feature layer to obtain a third feature layer; the selection module is used for obtaining a selected fourth characteristic layer according to the weights of the third characteristic layer and different characteristic layers; the residual error module is used for processing the fourth characteristic layer in a residual error mode to obtain a pyramid characteristic diagram; and the position regression module is used for carrying out target classification and position regression on the pyramid feature map so as to output a detection frame. The embodiments of the present invention correspond to the embodiments of the method described above, and specific contents thereof may refer to the embodiments of the method.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described embodiments are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
While the invention has been described in terms of its preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
Claims (10)
1. A target detection method based on an improved characteristic pyramid network structure is characterized in that,
acquiring a plurality of first feature layers with different sizes of a detected picture through a feature extraction network, and scaling the plurality of first feature layers to a preset resolution;
fusing the plurality of zoomed first feature layers to obtain a second feature layer, and performing global average pooling and dimension reduction operations on the second feature layer to obtain a third feature layer;
obtaining a selected fourth characteristic layer according to the weights of the third characteristic layer and different characteristic layers;
processing the fourth characteristic layer in a residual error mode to obtain a pyramid characteristic diagram;
and performing target classification and position regression on the pyramid feature map to output a detection frame.
2. The method of claim 1, wherein scaling the plurality of first feature layers to a preset resolution size comprises: performing amplification operation on the first characteristic layer with the resolution smaller than the preset resolution by adopting a bilinear interpolation method to achieve the preset resolution; performing reduction operation on the first characteristic layer with the resolution greater than the preset resolution by adopting a maximum pooling method to achieve the preset resolution; wherein the preset resolution is centered in resolution among the plurality of first feature layers.
3. The method of claim 1, wherein performing global average pooling and dimensionality reduction operations on the second feature layer to obtain a third feature layer comprises: and calculating the statistical information of the whole channel domain of the second characteristic layer by adopting global average pooling, and calculating the dependency relationship of the channel domain by adopting the full connection layer by utilizing the statistical information to obtain a third characteristic layer.
4. The method of claim 1, wherein obtaining the selected fourth feature layer based on the weights of the third feature layer and the different feature layers comprises: amplifying the third characteristic layer to n x d dimension by 1 x 1 convolution, wherein n is the number of the adopted characteristic pyramid layers, and d is the dimension of each characteristic layer; and the weight of each channel is obtained through softmax operation, the weight is multiplied by the third feature layer to obtain the selected feature layer, and the feature layers are added pixel by pixel to obtain a finally selected fourth feature map.
5. The method of claim 1, wherein the pyramid feature map is obtained by scaling the fourth feature layer to a resolution corresponding to the plurality of first feature layers by an inverse operation of rescaling and adding to the corresponding first feature layers.
6. An object detection device based on an improved characteristic pyramid network structure is characterized in that,
the zooming module is used for acquiring a plurality of first feature layers with different sizes of the detected picture through a feature extraction network and zooming the first feature layers to a preset resolution;
the fusion module is used for fusing the plurality of zoomed first feature layers to obtain a second feature layer, and performing global average pooling and dimension reduction operation on the second feature layer to obtain a third feature layer;
the selection module is used for obtaining a selected fourth characteristic layer according to the weights of the third characteristic layer and different characteristic layers;
the residual error module is used for processing the fourth characteristic layer in a residual error mode to obtain a pyramid characteristic diagram;
and the position regression module is used for carrying out target classification and position regression on the pyramid feature map so as to output a detection frame.
7. The apparatus of claim 6, wherein the scaling module is further to: performing amplification operation on the first characteristic layer with the resolution smaller than the preset resolution by adopting a bilinear interpolation method to achieve the preset resolution; performing reduction operation on the first characteristic layer with the resolution greater than the preset resolution by adopting a maximum pooling method to achieve the preset resolution; wherein the preset resolution is centered in resolution among the plurality of first feature layers.
8. The apparatus of claim 6, wherein the fusion module is further to: and calculating the statistical information of the whole channel domain of the second characteristic layer by adopting global average pooling, and calculating the dependency relationship of the channel domain by adopting the full connection layer by utilizing the statistical information to obtain a third characteristic layer.
9. The apparatus of claim 6, wherein the selection module is further to: amplifying the third characteristic layer to n x d dimension by 1 x 1 convolution, wherein n is the number of the adopted characteristic pyramid layers, and d is the dimension of each characteristic layer; and the weight of each channel is obtained through softmax operation, the weight is multiplied by the third feature layer to obtain the selected feature layer, and the feature layers are added pixel by pixel to obtain a finally selected fourth feature map.
10. The apparatus of claim 6, wherein the residual module is further configured to scale a fourth feature layer to a resolution corresponding to the plurality of first feature layers by an inverse of the rescaling operation and to add the fourth feature layer to the corresponding first feature layer to obtain a pyramid feature map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010825554.0A CN111985503B (en) | 2020-08-17 | 2020-08-17 | Target detection method and device based on improved feature pyramid network structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010825554.0A CN111985503B (en) | 2020-08-17 | 2020-08-17 | Target detection method and device based on improved feature pyramid network structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111985503A true CN111985503A (en) | 2020-11-24 |
CN111985503B CN111985503B (en) | 2024-04-26 |
Family
ID=73434028
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010825554.0A Active CN111985503B (en) | 2020-08-17 | 2020-08-17 | Target detection method and device based on improved feature pyramid network structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985503B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220318541A1 (en) * | 2021-04-05 | 2022-10-06 | Microsoft Technology Licensing, Llc | Dynamic head for object detection |
CN116257038A (en) * | 2023-05-15 | 2023-06-13 | 深圳市瓴鹰智能科技有限公司 | Steering engine control and diagnosis method and device based on lightweight convolutional neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845510A (en) * | 2016-11-07 | 2017-06-13 | 中国传媒大学 | Chinese tradition visual culture Symbol Recognition based on depth level Fusion Features |
CN110633661A (en) * | 2019-08-31 | 2019-12-31 | 南京理工大学 | Semantic segmentation fused remote sensing image target detection method |
CN111210443A (en) * | 2020-01-03 | 2020-05-29 | 吉林大学 | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance |
CN111242071A (en) * | 2020-01-17 | 2020-06-05 | 陕西师范大学 | Attention remote sensing image target detection method based on anchor frame |
CN111507359A (en) * | 2020-03-09 | 2020-08-07 | 杭州电子科技大学 | Self-adaptive weighting fusion method of image feature pyramid |
-
2020
- 2020-08-17 CN CN202010825554.0A patent/CN111985503B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106845510A (en) * | 2016-11-07 | 2017-06-13 | 中国传媒大学 | Chinese tradition visual culture Symbol Recognition based on depth level Fusion Features |
CN110633661A (en) * | 2019-08-31 | 2019-12-31 | 南京理工大学 | Semantic segmentation fused remote sensing image target detection method |
CN111210443A (en) * | 2020-01-03 | 2020-05-29 | 吉林大学 | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance |
CN111242071A (en) * | 2020-01-17 | 2020-06-05 | 陕西师范大学 | Attention remote sensing image target detection method based on anchor frame |
CN111507359A (en) * | 2020-03-09 | 2020-08-07 | 杭州电子科技大学 | Self-adaptive weighting fusion method of image feature pyramid |
Non-Patent Citations (1)
Title |
---|
JIANGMIAO PANG ET AL.: ""Libra R-CNN: Towards Balanced Learning for Object Detection"", 《IEEE》, pages 821 - 830 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220318541A1 (en) * | 2021-04-05 | 2022-10-06 | Microsoft Technology Licensing, Llc | Dynamic head for object detection |
US11989956B2 (en) * | 2021-04-05 | 2024-05-21 | Microsoft Technology Licensing, Llc | Dynamic head for object detection |
CN116257038A (en) * | 2023-05-15 | 2023-06-13 | 深圳市瓴鹰智能科技有限公司 | Steering engine control and diagnosis method and device based on lightweight convolutional neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111985503B (en) | 2024-04-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Amanatiadis et al. | A survey on evaluation methods for image interpolation | |
JP6075295B2 (en) | Dictionary creation device, image processing device, image processing system, dictionary creation method, image processing method, and program | |
CN110136056A (en) | The method and apparatus of image super-resolution rebuilding | |
US20200184697A1 (en) | Image Modification Using Detected Symmetry | |
CN109934773A (en) | A kind of image processing method, device, electronic equipment and computer-readable medium | |
Munoz-Mejias et al. | A low-complexity pre-processing system for restoring low-quality QR code images | |
CN103390275B (en) | The method of dynamical image joining | |
CN111985503A (en) | Target detection method and device based on improved characteristic pyramid network structure | |
CN108876716B (en) | Super-resolution reconstruction method and device | |
Patel et al. | Accelerated seam carving for image retargeting | |
US9076232B2 (en) | Apparatus and method for interpolating image, and apparatus for processing image using the same | |
JP2020098455A (en) | Object identification system, object identification method, and image identification program | |
Pan et al. | LPSRGAN: Generative adversarial networks for super-resolution of license plate image | |
Zhou et al. | Cross-scale collaborative network for single image super resolution | |
US20220020113A1 (en) | Image resizing using seam carving | |
CN113536971B (en) | Target detection method based on incremental learning | |
CN113807354B (en) | Image semantic segmentation method, device, equipment and storage medium | |
Wang et al. | A multi-scale attentive recurrent network for image dehazing | |
CN113516697A (en) | Image registration method and device, electronic equipment and computer-readable storage medium | |
WO2022247394A1 (en) | Image splicing method and apparatus, and storage medium and electronic device | |
CN114372944B (en) | Multi-mode and multi-scale fused candidate region generation method and related device | |
CN112070853A (en) | Image generation method and device | |
CN115619678A (en) | Image deformation correction method and device, computer equipment and storage medium | |
CN101989350A (en) | Image enhancement method, image enhancement device and image processing circuit | |
CN114897919B (en) | Deplabv3+ semantic segmentation method based on edge gradient guide interpolation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |