CN113255837A - Improved CenterNet network-based target detection method in industrial environment - Google Patents

Improved CenterNet network-based target detection method in industrial environment

Info

Publication number
CN113255837A
Authority
CN
China
Prior art keywords
network
target
industrial environment
improved
centernet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110723531.3A
Other languages
Chinese (zh)
Inventor
孙小惟
邓承志
唐聪
徐晨光
汪胜前
吴朝明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202110723531.3A
Publication of CN113255837A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30108 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention discloses a target detection method for industrial environments based on an improved CenterNet network, which comprises the following steps: collecting a plurality of images containing the targets to be detected and constructing a training data set for the CenterNet network; improving the CenterNet network by replacing its ResNet-18 deep residual neural network with ResNet-50, and training the network on the training data set to obtain a target detection model; and using the target detection model to detect targets in the industrial environment, obtaining their position information and classification information. By using a deeper residual network, the invention not only maintains detection speed but also improves detection accuracy, in particular enhancing the detection and recognition of multiple targets and small objects in industrial environments, and it offers better generality than traditional algorithms.

Description

Improved CenterNet network-based target detection method in industrial environment
Technical Field
The invention relates to the field of artificial intelligence, in particular to target detection, and more particularly to a target detection method for industrial environments based on an improved CenterNet network.
Background
Object detection is one of the important research directions in computer vision. In modern industrial production it has very broad application prospects. In current practice, most industrial target detection still relies on human visual inspection, and many existing methods detect industrial targets by template matching; however, manual inspection has high cost and a high false-detection rate, and template matching struggles to adapt to scenes with complex illumination and multi-class small targets. Early target detection algorithms worked by cascading strong classifiers to discriminate targets, but they could not detect non-rigid targets such as people. To address this, some researchers proposed the HOG + SVM structure, which achieved great early success in road detection and pedestrian detection but cannot reach the real-time performance and accuracy required in industrial environments.
With the rapid development of deep learning, target detection based on deep neural networks can be divided into anchor-based and anchor-free algorithms. Anchor-based algorithms cluster anchors over the data set before training, then perform feature extraction followed by classification and regression, which consumes a large amount of time.
Compared with the RCNN and YOLO families in deep learning, which must generate anchors before classification and regression, the CenterNet structure requires no anchors, so training and inference time can be reduced; however, its detection of small targets and closely spaced objects is poor.
Disclosure of Invention
The object of the invention is to provide a target detection method for industrial environments with low detection cost, high accuracy and good generality, addressing the shortcomings of the prior art.
To achieve this object, the technical scheme of the invention, a target detection method based on an improved CenterNet network, comprises the following steps:
step S1: collecting a plurality of images containing a target to be detected;
step S2: constructing a training data set of the CenterNet network;
step S3: improving the CenterNet network, including replacing the ResNet-18 deep residual neural network with ResNet-50;
step S4: training the CenterNet network by using the training data set to obtain a target detection model;
step S5: and detecting the target to be detected in the industrial environment by using the target detection model to obtain the position information and the classification information of the target.
In step S2, constructing the training data set of the CenterNet network includes adding a plurality of images of the target to be detected and a plurality of images produced by data enhancement; the data enhancement includes random cropping, flipping, and translation. Constructing the training data set in step S2 further includes adding labeling information for each image, the labeling information comprising: whether the image contains a target to be detected, and the position information and classification information of the target to be detected.
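As a minimal illustration of the data-enhancement step (the patent names random cropping, flipping, and translation but gives no implementation; the crop size, shift range, and seed below are hypothetical choices), the three transforms might be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (hypothetical choice)

def random_flip(img):
    """Horizontally flip the image with probability 0.5."""
    return img[:, ::-1].copy() if rng.random() < 0.5 else img

def random_crop(img, crop_h, crop_w):
    """Cut a random crop_h x crop_w window out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def random_translate(img, max_shift=8):
    """Shift the image by up to max_shift pixels, zero-filling the border."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out

img = np.arange(10000, dtype=np.float32).reshape(100, 100)
assert random_crop(img, 80, 80).shape == (80, 80)
assert random_flip(img).shape == img.shape
assert random_translate(img).shape == img.shape
```

In practice each transform must also update the bounding-box annotations described in step S2 by the same crop offset, flip, or shift.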
The ResNet-50 deep residual neural network in step S3 comprises 1 input convolutional layer, 1 max-pooling layer, 4 ConvBlock residual blocks and 12 IdentityBlock residual blocks. A ConvBlock is a residual block containing 3 convolutional layers that changes the input and output dimensions: the first part comprises a 1 × 1 convolutional layer for changing the input dimension; the second part comprises a 3 × 3 convolutional layer for feature extraction; the third part comprises a 1 × 1 convolutional layer for changing the output dimension; and the fourth part comprises a 1 × 1 convolutional layer that changes the picture input dimension and is skip-connected to the output. An IdentityBlock is a residual block containing 3 convolutional layers for stacking networks in series: the first part comprises a 1 × 1 convolutional layer for changing the input dimension; the second part comprises a 3 × 3 convolutional layer for feature extraction; the third part comprises a 1 × 1 convolutional layer for changing the output dimension; and the fourth part skip-connects the input picture directly to the output.
First, the input industrial target sample picture is convolved once to obtain a 2× down-sampled feature map; then max pooling is performed once without down-sampling; the feature map is then convolved by a deep residual block consisting of 1 ConvBlock and 2 IdentityBlocks, down-sampling it 2×; then by a block of 1 ConvBlock and 3 IdentityBlocks, down-sampling 2×; then by a block of 1 ConvBlock and 5 IdentityBlocks, down-sampling 2×; and finally by a block of 1 ConvBlock and 2 IdentityBlocks, down-sampling 2×. ResNet-50 thus yields a 32× down-sampled feature map. Compared with the original ResNet-18 network, the generalization ability, feature extraction ability and detection accuracy of the model are improved. Moreover, because the ConvBlock and IdentityBlock structures first reduce dimensionality and then extract features with a single 3 × 3 convolutional layer, whereas the prior-art ResNet-18 deep residual neural network extracts features with two 3 × 3 convolutional layers without dimensionality reduction, the number of network parameters at the same depth is reduced and the training speed of the network is improved.
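A rough PyTorch sketch of the two residual block types described above (a minimal illustration assuming a standard bottleneck layout with batch normalization and ReLU; the channel widths in the example are hypothetical):

```python
import torch
import torch.nn as nn

class IdentityBlock(nn.Module):
    """1x1 reduce -> 3x3 extract -> 1x1 restore; identity skip, dimensions unchanged."""
    def __init__(self, channels, mid):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip-connect the input straight to the output

class ConvBlock(nn.Module):
    """Same three-layer body, but a 1x1 projection on the skip path changes the
    output dimension (and stride), so input and output dimensions differ."""
    def __init__(self, in_ch, mid, out_ch, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, stride=stride), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1), nn.BatchNorm2d(out_ch))
        self.proj = nn.Sequential(  # 1x1 layer on the skip path changes the input dimension
            nn.Conv2d(in_ch, out_ch, 1, stride=stride), nn.BatchNorm2d(out_ch))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

x = torch.randn(1, 256, 64, 64)
y = ConvBlock(256, 128, 512)(x)   # changes dimensions: 2x down-sampled, wider channels
z = IdentityBlock(512, 128)(y)    # dimensions preserved
assert y.shape == (1, 512, 32, 32) and z.shape == y.shape
```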
Improving the CenterNet network in step S3 further includes adding a transposed convolutional layer to the neck feature extraction part of the neural network. The 32× down-sampled feature map is input to the transposed convolutional layers, and the added layer yields a high-resolution feature map up-sampled 16×.
In step S5, when detecting the target in the industrial environment with the target detection model, the high-resolution feature map, once obtained, is fed into the prediction heads of the network prediction part; the prediction heads comprise three parts, namely the heatmap loss, the center-point offset loss, and the center-point width-height loss. The total loss function is
L_det = L_k + λ_off L_off + λ_size L_size

where L_k is the heatmap loss, L_off is the center-point offset loss, L_size is the center-point width-height loss, and λ_off and λ_size are the weights of the different loss functions.
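As a rough sketch of the three prediction heads just named (the head width, class count, and feature-map size below are hypothetical; the patent does not specify them):

```python
import torch
import torch.nn as nn

def make_head(in_ch, out_ch, mid=64):
    """A small convolutional head: 3x3 feature conv followed by a 1x1 output conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid, out_ch, 1))

num_classes = 3                    # hypothetical number of target classes
feat = torch.randn(1, 64, 32, 32)  # high-resolution feature map from the neck

heatmap = torch.sigmoid(make_head(64, num_classes)(feat))  # per-class centre-point heatmap
offset = make_head(64, 2)(feat)    # centre-point offset (dx, dy)
size = make_head(64, 2)(feat)      # centre-point width and height

assert heatmap.shape == (1, num_classes, 32, 32)
assert offset.shape == (1, 2, 32, 32) and size.shape == (1, 2, 32, 32)
assert float(heatmap.min()) >= 0.0 and float(heatmap.max()) <= 1.0
```

Each head's output is compared against the corresponding ground truth by one of the three loss terms above.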
Improving the CenterNet network in step S3 further includes adding NMS non-maximum suppression to the prediction part of the neural network, so that when the network predicts the target to be detected in step S5, redundant local boxes can be removed and high-quality detection boxes obtained. Establishing the center point is critical to the CenterNet network, but a large target carries a great deal of local information, making it difficult to determine a single center point for the same large target; the max-pooling approach cannot remove these local boxes, so extra local boxes remain in the final detection result, which both makes the result inaccurate and increases the overall detection time. NMS non-maximum suppression is therefore added to remove the extra local boxes and obtain good-quality detection boxes.
The labeling information is annotated with the software labelImg.
Compared with the prior art, the invention has the following beneficial effects: 1) a deeper residual network is used as the algorithm's backbone, strengthening the algorithm's ability to extract features of the detected object and improving detection accuracy while maintaining detection speed; because the ConvBlock and IdentityBlock structures first reduce dimensionality and then extract features with a single 3 × 3 convolutional layer, whereas the prior-art ResNet-18 deep residual neural network extracts features with two 3 × 3 convolutional layers without dimensionality reduction, the number of network parameters at the same depth is reduced and the training speed of the network is improved; 2) the feature extraction part of the CenterNet network is improved by adding a transposed convolutional layer, yielding a high-resolution feature map with twice the resolution of the original, improving target detection and in particular the detection and recognition of multiple targets and small objects in industrial environments; 3) NMS non-maximum suppression is added to the neural network prediction part to extract high-quality detection boxes, further improving the CenterNet network's classification ability and accuracy on detected objects; 4) the invention has a wide application range, high accuracy and good generality.
Drawings
FIG. 1 is a flow diagram of target detection based on the improved CenterNet algorithm in an industrial environment in the embodiment;
fig. 2 is a schematic structural diagram of the backbone network ConvBlock in the embodiment;
fig. 3 is a schematic structural diagram of the backbone network IdentityBlock in the embodiment;
fig. 4 is a schematic structural diagram of the improved CenterNet network model in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In this embodiment, with reference to fig. 1, a target detection method based on an improved CenterNet network in an industrial environment is provided, the method comprising the following steps:
step S1: collecting a plurality of images containing a target to be detected;
step S2: constructing a training data set of the CenterNet network;
step S3: improving the CenterNet network, including replacing the ResNet-18 deep residual neural network with ResNet-50, adding a transposed convolutional layer to the neck feature extraction part of the network, and adding NMS non-maximum suppression to the prediction part of the network;
step S4: training the CenterNet network by using the training data set to obtain a target detection model;
step S5: and detecting the target to be detected in the industrial environment by using the target detection model to obtain the position information and the classification information of the target.
Further, in this embodiment, constructing the CenterNet network training data set in step S2 includes adding a plurality of images of the target to be detected and a plurality of images produced by data enhancement, where the data enhancement includes random cropping, flipping, and translation. Constructing the training data set in step S2 further includes adding labeling information for each image, the labeling information comprising: whether the image contains a target to be detected, and the position information and classification information of the target to be detected. The labeling information can be annotated with the software labelImg, which can produce annotation files in a format suitable for CenterNet network training.
Further, in this embodiment, with reference to figs. 2 to 4, the CenterNet network is improved in step S3. The ResNet-50 deep residual neural network comprises 1 input convolutional layer, 1 max-pooling layer, 4 ConvBlock residual blocks and 12 IdentityBlock residual blocks. A ConvBlock is a residual block containing 3 convolutional layers that changes the input and output dimensions: the first part comprises a 1 × 1 convolutional layer for changing the input dimension; the second part comprises a 3 × 3 convolutional layer for feature extraction; the third part comprises a 1 × 1 convolutional layer for changing the output dimension; and the fourth part comprises a 1 × 1 convolutional layer that changes the picture input dimension and is skip-connected to the output. The output dimension of the third part's 1 × 1 convolutional layer is deliberately set smaller than the input dimension of the ConvBlock residual block, achieving dimensionality reduction. An IdentityBlock is a residual block containing 3 convolutional layers for stacking networks in series: the first part comprises a 1 × 1 convolutional layer for changing the input dimension; the second part comprises a 3 × 3 convolutional layer for feature extraction; the third part comprises a 1 × 1 convolutional layer for changing the output dimension, whose output dimension is unchanged relative to the IdentityBlock's input dimension; and the fourth part skip-connects the input picture directly to the output.
First, the input industrial target sample picture is convolved once to obtain a 2× down-sampled feature map; then max pooling is performed once without down-sampling; the feature map is then convolved by a deep residual block consisting of 1 ConvBlock and 2 IdentityBlocks, down-sampling it 2×; then by a block of 1 ConvBlock and 3 IdentityBlocks, down-sampling 2×; then by a block of 1 ConvBlock and 5 IdentityBlocks, down-sampling 2×; and finally by a block of 1 ConvBlock and 2 IdentityBlocks, down-sampling 2×. ResNet-50 thus yields a 32× down-sampled feature map. Compared with the original ResNet-18 network, the generalization ability, feature extraction ability and detection accuracy of the model are improved. Moreover, because the ConvBlock and IdentityBlock structures first reduce dimensionality and then extract features with a single 3 × 3 convolutional layer, whereas the prior-art ResNet-18 deep residual neural network extracts features with two 3 × 3 convolutional layers without dimensionality reduction, the number of network parameters at the same depth is reduced and the training speed of the network is improved.
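The stage layout above can be checked with simple arithmetic: each stage is 1 ConvBlock followed by some IdentityBlocks, which reproduces the standard ResNet-50 stage depths.

```python
# IdentityBlocks per stage as described above; each stage also starts with 1 ConvBlock
identity_per_stage = [2, 3, 5, 2]

conv_blocks = len(identity_per_stage)
identity_blocks = sum(identity_per_stage)
assert conv_blocks == 4 and identity_blocks == 12

# Blocks per stage: 3, 4, 6, 3 -- the standard ResNet-50 layout
assert [1 + n for n in identity_per_stage] == [3, 4, 6, 3]

# 3 conv layers per residual block plus the input conv layer = 49 conv layers
assert 3 * (conv_blocks + identity_blocks) + 1 == 49

# Down-sampling schedule described above: input conv /2, max pooling without
# down-sampling, then /2 at each of the four residual stages -> 32x in total
factor = 2
for _ in identity_per_stage:
    factor *= 2
assert factor == 32
print(512 // factor)  # prints 16: a 512 x 512 input yields a 16 x 16 feature map
```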
Further, in this embodiment, with reference to fig. 4, the CenterNet network is improved in step S3 by adding a transposed convolutional layer to the neck feature extraction part of the neural network, i.e., a fourth transposed convolutional layer is appended after the original three. The 32× down-sampled feature map is input to the transposed convolutional layers, and the added layer yields a high-resolution feature map up-sampled 16×.
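A minimal sketch of such a neck (assuming 4 × 4 kernels with stride 2, a common configuration for 2× up-sampling; the channel widths are hypothetical): three transposed convolutional layers plus the appended fourth one take the 32× down-sampled map to 16× up-sampled.

```python
import torch
import torch.nn as nn

def deconv(in_ch, out_ch):
    # a 4x4 transposed convolution with stride 2 doubles the spatial resolution
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

neck = nn.Sequential(
    deconv(2048, 256),   # original layer 1
    deconv(256, 128),    # original layer 2
    deconv(128, 64),     # original layer 3
    deconv(64, 64))      # the added fourth transposed convolutional layer

x = torch.randn(1, 2048, 16, 16)     # 32x down-sampled map of a 512x512 input
y = neck(x)
assert y.shape == (1, 64, 256, 256)  # 16x up-sampled relative to x
```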
Further, in this embodiment, after the high-resolution feature map is obtained, it is fed into the prediction heads of the deep residual neural network prediction part; the prediction heads comprise three parts, namely the heatmap loss, the center-point offset loss, and the center-point width-height loss. The total loss function is
L_det = L_k + λ_off L_off + λ_size L_size

Among these, L_k is the heatmap prediction loss:

L_k = −(1/N) Σ_xyc { (1 − Ŷ_xyc)^α log(Ŷ_xyc),  if Y_xyc = 1
                     (1 − Y_xyc)^β (Ŷ_xyc)^α log(1 − Ŷ_xyc),  otherwise }

where N is the number of positive samples; α and β are hyper-parameters, taken here as 2 and 4 respectively; Y_xyc is the ground-truth value of class c at location (x, y), with c ∈ {1, …, C} and C the number of target classes; and Ŷ_xyc is the predicted value of class c at location (x, y).

L_off is the center-point offset loss:

L_off = (1/N) Σ_p | Ô_p̃ − (p/R − p̃) |

where N is the number of positive samples; Ô_p̃ is the offset predicted by the model for each target's center point; p is the true center point of the image; R is the down-sampling factor; and p̃ = ⌊p/R⌋ is the true target center point after down-sampling and rounding.

L_size is the center-point width-height loss:

L_size = (1/N) Σ_{k=1..N} | Ŝ_pk − s_k |

where N is the number of positive samples; k indexes the targets; s_k is the true width and height at each center point; and Ŝ_pk is the predicted center-point width and height.

λ_off and λ_size are the weights of the different loss functions, taken as 1 and 0.1 respectively.
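The three loss terms above can be sketched in NumPy roughly as follows (a minimal illustration, with the offset and width-height predictions shown already gathered at the positive center-point locations; the toy values are hypothetical):

```python
import numpy as np

def heatmap_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over the centre-point heatmap (alpha=2, beta=4)."""
    pos = gt == 1.0
    n = max(pos.sum(), 1)  # N: number of positive samples
    pos_term = ((1.0 - pred[pos]) ** alpha * np.log(pred[pos] + eps)).sum()
    neg_term = ((1.0 - gt[~pos]) ** beta * pred[~pos] ** alpha
                * np.log(1.0 - pred[~pos] + eps)).sum()
    return -(pos_term + neg_term) / n

def l1_loss(pred, target):
    """Mean absolute error over the positive locations (offset / width-height loss)."""
    return float(np.abs(pred - target).mean())

def total_loss(l_k, l_off, l_size, lam_off=1.0, lam_size=0.1):
    return l_k + lam_off * l_off + lam_size * l_size

# toy example: one class, one 4x4 heatmap with a single centre point
gt = np.zeros((4, 4)); gt[1, 2] = 1.0
pred = np.full((4, 4), 0.1); pred[1, 2] = 0.8

l_k = heatmap_loss(pred, gt)
l_off = l1_loss(np.array([0.3, 0.4]), np.array([0.25, 0.5]))      # predicted vs true offset
l_size = l1_loss(np.array([12.0, 20.0]), np.array([10.0, 22.0]))  # predicted vs true w, h
assert total_loss(l_k, l_off, l_size) > 0.0
assert abs(heatmap_loss(gt, gt)) < 1e-5  # a perfect heatmap gives (near-)zero loss
```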
Further, in this embodiment, NMS non-maximum suppression is added to the prediction part of the neural network, so that when the network predicts the target to be detected in step S5, redundant local boxes can be removed and high-quality detection boxes obtained. Establishing the center point is critical to the CenterNet network, but a large target carries a great deal of local information, making it difficult to determine a single center point for the same large target; the max-pooling approach cannot remove these local boxes, so extra local boxes remain in the final detection result, which both makes the result inaccurate and increases the overall detection time. NMS non-maximum suppression is therefore added here to remove the extra local boxes and obtain good-quality detection boxes.
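A minimal NumPy sketch of the NMS step (standard greedy IoU-based suppression; the 0.5 threshold is a hypothetical choice):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        overlap = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlap <= iou_thresh]
    return keep

# two heavily overlapping boxes on one target plus one distant box
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
assert nms(boxes, scores) == [0, 2]  # the redundant local box (index 1) is removed
```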
According to the invention, using the deeper ResNet-50 deep residual network as the algorithm's backbone strengthens the algorithm's ability to extract features of the detected object, maintaining detection speed while improving detection accuracy; improving the feature extraction part of the CenterNet network by adding a transposed convolutional layer yields a high-resolution feature map with twice the resolution of the original, improving target detection and in particular the detection and recognition of multiple targets and small objects in industrial environments; and adding NMS non-maximum suppression to the neural network prediction part extracts high-quality detection boxes, further improving the detection algorithm's classification ability and accuracy on detected objects. In summary, the algorithm has higher detection accuracy, a wide application range and good generality.
The foregoing shows and describes the general principles and features of the present invention, together with its advantages. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A target detection method based on an improved CenterNet network in an industrial environment, characterized in that it comprises the following steps:
step S1: collecting a plurality of images containing a target to be detected;
step S2: constructing a training data set of the CenterNet network;
step S3: improving the CenterNet network, including replacing the ResNet-18 deep residual neural network with ResNet-50 and adding a transposed convolutional layer to the neck feature extraction part of the neural network, wherein the ResNet-50 deep residual neural network comprises 1 input convolutional layer, 1 max-pooling layer, 4 ConvBlock residual blocks and 12 IdentityBlock residual blocks, a ConvBlock being a residual block containing 3 convolutional layers for changing the input and output dimensions, an IdentityBlock being a residual block containing 3 convolutional layers for stacking networks in series, and the transposed convolutional layer converting the 32× down-sampled feature map into a 16× up-sampled high-resolution feature map;
step S4: training the CenterNet network by using the training data set to obtain a target detection model;
step S5: and detecting the target to be detected in the industrial environment by using the target detection model to obtain the position information and the classification information of the target.
2. The target detection method based on an improved CenterNet network in an industrial environment according to claim 1, characterized in that constructing the training data set of the CenterNet network in step S2 includes adding a plurality of images of the target to be detected and a plurality of images produced by data enhancement, the data enhancement including random cropping, flipping, and translation; and adding labeling information for each image, the labeling information comprising whether the image contains a target to be detected, and the position information and classification information of the target to be detected.
3. The target detection method based on an improved CenterNet network in an industrial environment according to claim 1, characterized in that detecting the target to be detected in the industrial environment with the target detection model in step S5 comprises feeding the 16× up-sampled high-resolution feature map into the prediction heads of the neural network prediction part, the prediction heads comprising three parts, namely the heatmap loss, the center-point offset loss and the center-point width-height loss, the total loss function being
L_det = L_k + λ_off L_off + λ_size L_size

where L_k is the heatmap prediction loss, L_off is the center-point offset loss, L_size is the center-point width-height loss, and λ_off and λ_size are the weights of the different loss functions.
4. The target detection method based on an improved CenterNet network in an industrial environment according to claim 1, characterized in that improving the CenterNet network in step S3 further comprises adding NMS non-maximum suppression to the prediction part of the neural network.
5. The target detection method based on an improved CenterNet network in an industrial environment according to claim 2, characterized in that the labeling information is annotated with the software labelImg.
6. The target detection method based on an improved CenterNet network in an industrial environment according to claim 1, characterized in that the ConvBlock residual block comprises four parts: a first part comprising a 1 × 1 convolutional layer for changing the input dimension; a second part comprising a 3 × 3 convolutional layer for feature extraction; a third part comprising a 1 × 1 convolutional layer for changing the output dimension; and a fourth part comprising a 1 × 1 convolutional layer that changes the picture input dimension and is skip-connected to the output.
7. The target detection method based on an improved CenterNet network in an industrial environment according to claim 1, characterized in that the IdentityBlock residual block comprises four parts: a first part comprising a 1 × 1 convolutional layer for changing the input dimension; a second part comprising a 3 × 3 convolutional layer for feature extraction; a third part comprising a 1 × 1 convolutional layer for changing the output dimension; and a fourth part skip-connecting the input picture to the output.
CN202110723531.3A 2021-06-29 2021-06-29 Improved CenterNet network-based target detection method in industrial environment Pending CN113255837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110723531.3A CN113255837A (en) 2021-06-29 2021-06-29 Improved CenterNet network-based target detection method in industrial environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110723531.3A CN113255837A (en) 2021-06-29 2021-06-29 Improved CenterNet network-based target detection method in industrial environment

Publications (1)

Publication Number Publication Date
CN113255837A true CN113255837A (en) 2021-08-13

Family

ID=77190035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110723531.3A Pending CN113255837A (en) 2021-06-29 2021-06-29 Improved CenterNet network-based target detection method in industrial environment

Country Status (1)

Country Link
CN (1) CN113255837A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626120A (en) * 2020-04-24 2020-09-04 南京理工大学 Target detection method based on improved YOLO-6D algorithm in industrial environment
CN112163602A (en) * 2020-09-14 2021-01-01 湖北工业大学 Target detection method based on deep neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Shi Yaqiu: "Research on the Application of Machine Vision Technology in Mold Detection and Positioning", China Master's Theses Full-Text Database, Engineering Science and Technology I *
Luo Qing et al.: "Dermoscopy Image Classification Method Based on FL-ResNet50", Laser & Optoelectronics Progress *
Chen Huiyan et al.: "Theory and Application of Intelligent Vehicles", 31 July 2018 *
Chen Shikun et al.: "Detection of Coverings on Photovoltaic Arrays Based on CenterNet", Video Engineering *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869246A (en) * 2021-09-30 2021-12-31 安徽大学 Wheat stripe rust germ summer spore microscopic image detection method based on improved CenterNet technology
CN114596876A (en) * 2022-01-21 2022-06-07 中国科学院自动化研究所 Sound source separation method and device
CN114241589A (en) * 2022-02-28 2022-03-25 深圳市城市交通规划设计研究中心股份有限公司 Bus driver violation judgment method and device based on vehicle-mounted video
CN114241589B (en) * 2022-02-28 2022-08-23 深圳市城市交通规划设计研究中心股份有限公司 Bus driver violation behavior determination method and device based on vehicle-mounted video
CN115170923A (en) * 2022-07-19 2022-10-11 哈尔滨市科佳通用机电股份有限公司 Fault identification method for loss of railway wagon supporting plate nut
CN116309586A (en) * 2023-05-22 2023-06-23 杭州百子尖科技股份有限公司 Defect detection method, device, equipment and medium based on convolutional neural network
CN117690165A (en) * 2024-02-02 2024-03-12 四川泓宝润业工程技术有限公司 Method and device for detecting personnel passing between drill rod and hydraulic pliers

Similar Documents

Publication Publication Date Title
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN111047551B (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111612807B (en) Small target image segmentation method based on scale and edge information
WO2019237646A1 (en) Image retrieval method based on deep learning and semantic segmentation
Liu et al. An attention-based approach for single image super resolution
CN107808376B (en) Hand raising detection method based on deep learning
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN112183203A (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN112560865B (en) Semantic segmentation method for point cloud under outdoor large scene
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN115512103A (en) Multi-scale fusion remote sensing image semantic segmentation method and system
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN108564582B (en) MRI brain tumor image automatic optimization method based on deep neural network
CN111626120A (en) Target detection method based on improved YOLO-6D algorithm in industrial environment
CN111881743A (en) Human face feature point positioning method based on semantic segmentation
CN113743505A (en) Improved SSD target detection method based on self-attention and feature fusion
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN112270366A (en) Micro target detection method based on self-adaptive multi-feature fusion
CN111462090A (en) Multi-scale image target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210813