CN113420819B - Lightweight underwater target detection method based on CenterNet - Google Patents

Lightweight underwater target detection method based on CenterNet

Info

Publication number
CN113420819B
CN113420819B (application CN202110723096.4A)
Authority
CN
China
Prior art keywords
target
image
underwater
detection
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110723096.4A
Other languages
Chinese (zh)
Other versions
CN113420819A (en)
Inventor
沈钧戈
毛昭勇
丁文俊
姜旭阳
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202110723096.4A
Publication of CN113420819A
Application granted
Publication of CN113420819B

Classifications

    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention provides a lightweight underwater target detection method based on CenterNet. The method comprises: shooting target images underwater and making them into a data set; dividing the data set into a training set and a test set and labeling the training set; selecting ResNet18 as the feature extraction network and building a feature pyramid for multi-scale feature fusion, with the fused feature map of the largest image size output to the detection head; performing deep learning training on the images and label information in the training set with the CenterNet algorithm to obtain a trained model; and performing target detection to obtain the classification and position information of the targets to be detected in an image. The method is lighter, suitable for embedded devices, and more accurate, further improving the detection accuracy for multi-scale targets in underwater optical images; it reduces part of the required computation and increases inference speed, making the algorithm lighter and more real-time.

Description

Lightweight underwater target detection method based on CenterNet
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an underwater target detection method.
Background
With the development of human civilization, mankind's use of marine resources has become ever deeper and more frequent, and the types and number of marine facilities ever more abundant. Marine facilities are difficult to build and often play an important role at the military-strategic level; once damaged, the loss is enormous and repair is difficult. These characteristics make them extremely vulnerable to sabotage by hostile states and terrorists, so protecting the safety of marine facilities is extremely important; at the same time, the particular geographical locations of marine facilities make protecting them especially difficult.
Underwater target detection is the "eye" with which humans observe the ocean, and is of great significance for marine resource development and marine facility protection. Traditional underwater target detection mainly extracts features of the target's radiated noise by hand, then builds a classifier that categorizes and identifies targets from the extracted features. In recent years, with the great progress of artificial intelligence in image recognition, the application of deep learning to underwater target detection has also received more and more research. Deep-learning-based detection algorithms fall into two categories: two-stage detectors, such as R-CNN, and one-stage detectors, such as SSD and YOLO. One-stage detectors output the category and position of a target directly from the backbone network, without an RPN, and are therefore faster. So far, one-stage and two-stage detectors have seen some practice in underwater target detection, but anchor-free algorithms have not been used. CenterNet is a one-stage, anchor-free algorithm: it avoids the complicated work of designing anchor boxes, needs no non-maximum suppression (NMS), and has a simpler structure than many anchor-free algorithms, so it is faster and more real-time than other algorithms; it also demands less of the GPU, making it better suited to embedded devices with limited computing power. The algorithm consists of two parts: the feature extraction network, Hourglass, which extracts features, and the detection head, which localizes and classifies targets based on center points. However, underwater targets tend to be small and densely distributed.
The Hourglass feature extraction network used by CenterNet has an overly large receptive field due to its special nested structure, and its layers are deep, so a large amount of small-target information is lost and its detection of small and dense targets is poor; moreover, its structure is complex, its computation heavy and its inference slow, making it unsuitable for a lightweight algorithm. Current underwater optical-image target detection has the following shortcomings:
1. underwater targets are often small and dense, and existing underwater optical image target detection algorithms cannot well detect such targets.
2. Existing underwater target detection algorithms are not both lightweight and highly accurate. The present invention solves the two problems described above.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a lightweight underwater target detection method based on CenterNet.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step 1, shooting a target image underwater to make a data set;
placing the target to be detected under water, mounting a camera on a remotely operated vehicle (ROV), shooting multi-scale, multi-azimuth images of the target to be detected to obtain target images, and making the target images into a data set;
step 2, dividing the data set into a training set and a test set, and labeling the training set;
step 3, selecting ResNet18, which detects small and dense targets better and is more suitable for a lightweight algorithm, as the feature extraction network; building a feature pyramid, performing multi-scale feature fusion on the last layers with 128, 256 and 512 convolution channels in ResNet18, and outputting the fused feature map of the largest image size to the detection head;
step 4, performing deep learning training on the images and the labeled information in the training set by using a CenterNet algorithm to obtain a trained CenterNet algorithm model;
and 5, performing target detection on the images in the test set or the actually shot images by using a CenterNet algorithm, and acquiring the classification information and the position information of the target to be detected in the images.
The step 2 comprises the following steps:
dividing the images into a training set and a test set at a ratio between 7:3 and 9:1, with more than 500 images in the training set so as to ensure the generalization performance of the detection algorithm; annotating the collected training-set images with the labelImg software, the annotation information being the position and category of the targets to be detected in each image, to obtain an underwater optical image data set;
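As a minimal illustration of the partition described above (the 8:2 ratio, which lies inside the 7:3 to 9:1 range, and the file names are hypothetical), the training/test split can be sketched as:

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Shuffle image paths and split them into a training and a test set."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    n_train = int(len(paths) * train_ratio)
    return paths[:n_train], paths[n_train:]

# Hypothetical file names standing in for the captured underwater images.
images = [f"underwater_{i:04d}.jpg" for i in range(1000)]
train_set, test_set = split_dataset(images, train_ratio=0.8)
```

With 1000 captured images this yields 800 training and 200 test images, comfortably above the 500-image floor the method requires for generalization.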
the step 3 comprises the following steps:
A network with too many layers is unsuitable for underwater targets and for a lightweight algorithm, since too many layers cause heavy loss of small-target information and increase the required computation. Therefore, given the small size and dense distribution of underwater targets and the limited computing power of embedded devices, ResNet18 is selected as the feature extraction network. Meanwhile, to improve the detection of small targets, multi-scale feature fusion is applied to ResNet18: a feature pyramid is built, multi-scale feature fusion is performed on the last layers with 128, 256 and 512 convolution channels in ResNet18, and the fused feature map of image size 64 × 64 is output to the detection head, which markedly improves detection accuracy. Compared with a standard feature pyramid, the invention deletes the two output channels of the feature pyramid whose output pictures are smaller, reducing the computation of the detection algorithm while preserving CenterNet's advantage of needing no non-maximum suppression, so the algorithm is lighter and more real-time.
Meanwhile, the original CenterNet algorithm enlarges its 16 × 16 output picture to 128 × 128 through three deconvolution layers and outputs it to the detection head to obtain the position, width, height and offset of targets in the image. Because the invention performs multi-scale feature fusion with the feature pyramid and keeps only the channel with the largest output image size (64 × 64), three deconvolution layers are unnecessary: only one deconvolution layer is kept and the other two are deleted, which still raises the final output image size to 128 × 128, reducing the required computation and making the algorithm lighter.
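The size bookkeeping above can be checked with the standard transposed-convolution output formula; the kernel size 4, stride 2 and padding 1 used here are the usual choices in CenterNet-style upsampling heads and are assumptions, not values stated in the patent:

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a transposed convolution (deconvolution) layer."""
    return (size - 1) * stride - 2 * pad + kernel

# Original CenterNet: three deconvolution layers take a 16x16 map up to 128x128.
s = 16
for _ in range(3):
    s = deconv_out(s)   # 16 -> 32 -> 64 -> 128

# Modified network: one deconvolution takes the fused 64x64 map to 128x128.
t = deconv_out(64)
```

Starting from the larger fused 64 × 64 map, a single layer reaches the same 128 × 128 output, which is why the other two layers can be deleted.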
The step 4 comprises the following steps:
storing the annotation files and the images of the data set in two folders, moving them together into the algorithm's data folder, and running the algorithm's main file from the terminal to train the network. During training, the feature extraction network (including the multi-scale fusion) is first called to extract features from the underwater optical images in the training set; the detection-head files are then called, and the feature map output by the feature extraction network is fed into the loss function in the detection head to compute a value, completing one forward pass. The convolutional neural network then adjusts the model parameters according to the change of the loss value until the loss reaches its minimum; this entire automatic process is how the deep-learning algorithm trains the model, driving the training loss toward its minimum. The algorithm produces a continuously updated model during training; over repeated forward and backward passes the loss gradually converges to its minimum, i.e. the loss-versus-time curve flattens and no longer declines, and the model at that point is the optimal underwater optical-image target detection model, completing the training part of the CenterNet algorithm.
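The forward/backward loop described above (compute the loss, adjust parameters, repeat until the loss stops declining) can be sketched on a toy one-parameter quadratic loss; the loss function, starting point and learning rate here are stand-ins, not the CenterNet loss:

```python
def train(w0=5.0, lr=0.1, steps=200):
    """Minimise the toy loss L(w) = (w - 2)^2 by gradient descent,
    mimicking repeated forward and backward passes until the loss converges."""
    w = w0
    history = []
    for _ in range(steps):
        loss = (w - 2.0) ** 2      # forward pass: evaluate the loss value
        grad = 2.0 * (w - 2.0)     # backward pass: gradient of the loss
        w -= lr * grad             # parameter update
        history.append(loss)
    return w, history

w_final, losses = train()
```

The history shows exactly the convergence criterion the patent describes: the loss drops steeply at first and then the curve flattens, at which point the parameters are taken as the trained model.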
The step 5 comprises the following steps:
reading the images in the test set or the actually shot images into a trained underwater optical image target detection training model, and then detecting the images through an underwater target detection algorithm;
In the detection stage, the input image is first scaled to 512 × 512, and features are extracted from the scaled image by the feature extraction network, whose parameters are now those of the trained model, so the best feature information in the underwater optical image is extracted; the model's loss has converged at this point, which means its parameters are the most suitable ones for extracting features (were they not optimal, the loss would not have converged and would still be rising or falling). The feature information is input to the detection head, whose specific detection procedure is as follows:
suppose the input image is I e R W×H×3 Where W and H are the width and height of the image, respectively, at the time of detection, a hotspot map (keypoint heat map) of the key point is generated by the gaussian kernel:
Figure BDA0003134518570000041
Figure BDA0003134518570000042
representing the value of each point in the hot spot diagram, wherein R is the step length of outputting the corresponding original drawing, and is set to be 4, C represents the category of target detection objects, and if 4 underwater targets to be detected exist, C =4; in this way it is possible to obtain,
Figure BDA0003134518570000043
is a predicted value of the detected object for
Figure BDA0003134518570000044
Indicating that for class C, an object of this class is detected in the current (x.y) coordinates, and
Figure BDA0003134518570000045
then it means that there is no object with category C at this coordinate point currently;
the hot spots of each class in the output graph are extracted separately in the following way:
finally predicted from the model trained in step 4
Figure BDA0003134518570000046
The value of (a), that is, the probability value of the object existing at the center point of the current prediction target, selects the center point; detecting the value of the current hot spot by adopting maximum pooling (MaxPool) of 3x3
Figure BDA0003134518570000047
The points which are larger than or equal to the surrounding eight adjacent points (eight directions) are taken, and then the numerical value in all the points is taken
Figure BDA0003134518570000048
The largest first m points, m being less than or equal to 100, produce an effect similar to non-maximum suppression in anchor-based detection; predicting the position of a target object through the position of the central point by using m central points selected from the image to obtain m prediction frames, and judging whether the prediction frames are accurate or not so as to obtain an estimated value as a confidence coefficient;
Figure BDA0003134518570000049
in order to be able to detect the point,
Figure BDA00031345185700000410
representing the detected points in the class C, and expressing the position of each key point (namely the central point of the target to be detected) as an integer coordinate
Figure BDA00031345185700000411
Then use
Figure BDA00031345185700000412
Representing the probability that the current point is the center point, then a prediction box is generated using the coordinates:
Figure BDA00031345185700000413
wherein
Figure BDA00031345185700000414
Is the offset of the current point to the original image,
Figure BDA00031345185700000415
representing the length and width of the predicted corresponding target of the current point;
deleting the prediction target with the confidence coefficient smaller than the threshold 0.3, reserving the position of the prediction frame with the confidence coefficient larger than or equal to the threshold 0.3 as final position information, and using the classification of the heat point map as final classification information.
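A minimal numpy sketch of the heatmap described above: one Gaussian peak per ground-truth center, per class channel. The fixed sigma and the target coordinates are illustrative assumptions (CenterNet derives the Gaussian radius from the box size):

```python
import numpy as np

def draw_gaussian(heatmap, cx, cy, sigma=2.0):
    """Splat a Gaussian peak (value 1 at the center) onto one heatmap channel."""
    h, w = heatmap.shape
    ys, xs = np.ogrid[:h, :w]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    np.maximum(heatmap, g, out=heatmap)  # keep the larger value where peaks overlap

# Output stride R = 4 on a 512x512 input gives a 128x128 map; C = 4 classes.
C, size = 4, 128
Y = np.zeros((size, size, C), dtype=np.float32)
draw_gaussian(Y[:, :, 0], cx=30, cy=40)   # hypothetical target of class 0
draw_gaussian(Y[:, :, 2], cx=90, cy=100)  # hypothetical target of class 2
```

Each channel stays in [0, 1], peaks at exactly 1 on a center point, and decays toward 0 away from it, matching the Ŷ_{x,y,c} semantics above.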
The advantage of the invention is that the lightweight underwater target detection method based on CenterNet is an underwater optical-image target detection algorithm that can be carried on embedded devices with limited computing resources and can detect, in real time and with high accuracy, the category and position of targets in underwater optical images:
1. The detection head selected by the invention is that of CenterNet, a one-stage detector and an anchor-free target detection algorithm requiring no non-maximum suppression (NMS). It is faster than two-stage detectors and than anchor-based algorithms that need an NMS step, giving it real-time performance; it also requires less computation than other detectors, so it is lighter and suitable for embedded devices.
2. The invention selects ResNet18 as the feature extraction network. The network has a simple structure, few layers and little required computation, yet detects underwater optical-image targets (characteristically small and densely distributed) with high accuracy, so the algorithm is lighter and can detect underwater targets accurately and in real time even when carried on embedded devices with limited computing resources.
3. The invention uses an improved feature pyramid for multi-scale feature fusion, further improving the algorithm's detection accuracy for multi-scale targets in underwater optical images while preserving CenterNet's advantage of needing no non-maximum suppression.
4. On the basis of the original CenterNet, the invention deletes two deconvolution layers, reducing part of the required computation, increasing inference speed and making the algorithm lighter and more real-time.
Drawings
Fig. 1 is an exemplary view of a photographed underwater optical image.
Fig. 2 is a schematic structural diagram of ResNet18 according to the present invention.
Fig. 3 is a schematic diagram of the general structure of the algorithm of the present invention.
Figure 4 is a schematic diagram of the CenterNet detection head structure of the present invention.
Fig. 5 shows two example output images according to the present invention: fig. 5(a) is the output for the first example image and fig. 5(b) for the second.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
In order to solve the limitations and defects of the prior art, the invention provides a lightweight underwater target detection method based on the CenterNet.
Therefore, ResNet18 is used as the feature extraction network, and a distinctive multi-scale feature fusion scheme fuses the multi-scale information in ResNet18 and outputs the feature map of maximum resolution to the detection head. These two structures markedly improve the algorithm's detection of small and dense targets and reduce the computation and model size, allowing the algorithm to detect small and dense underwater targets accurately and quickly on embedded devices with limited computing power.
This target detection method is easier to carry in a small underwater vehicle, a passive underwater monitor or a mobile monitoring station. If the server can be carried inside an underwater monitor, even a sonar image need not be transmitted back to the monitoring station: the monitor can issue a countermeasure command itself and counter the target faster. Higher processing speed is precious in military use, where every second counts, and even in a civilian underwater robot it yields faster reactions and thus higher working efficiency.
A lightweight underwater target detection method based on CenterNet comprises the following steps:
step 1, shooting a target image underwater, and making a data set;
step 2, dividing the data set into a training set and a test set, and labeling the training set;
step 3, selecting ResNet18, which detects small and dense targets better and is more suitable for a lightweight algorithm, as the feature extraction network; building a feature pyramid, performing multi-scale feature fusion on the last layers with 128, 256 and 512 convolution channels in ResNet18, and outputting the fused feature map of image size 64 × 64 to the detection head;
step 4, performing deep learning training on the images and the labeled information in the training set by using a CenterNet algorithm;
and 5, performing target detection on the images in the test set or the actually shot images by using a CenterNet algorithm to acquire classification information and position information of the target to be detected in the images.
The step 1 comprises the following steps:
placing the target to be detected under water, mounting a camera on a remotely operated vehicle (ROV), and shooting multi-scale, multi-azimuth images of the target to be detected; a sample photograph is shown in FIG. 1;
the step 2 comprises the following steps:
dividing the images into a training set and a test set at a ratio between 7:3 and 9:1; the number of training images should not be too small (preferably more than 500) so as to ensure the generalization performance of the detection algorithm. The collected training-set images are annotated with the labelImg software, the annotation information being the position and category of the targets to be detected in each image, yielding an underwater optical image data set;
the step 3 comprises the following steps:
A network with too many layers is unsuitable for underwater targets and for a lightweight algorithm, since too many layers cause heavy loss of small-target information and increase the required computation. Therefore, given the small size and dense distribution of underwater targets and the limited computing power of embedded devices, ResNet18 is selected as the feature extraction network; the conventional ResNet18 structure is shown in FIG. 2. In this scheme the input pictures are 512 × 512, and the output feature maps of conv3_x, conv4_x and conv5_x have 128, 256 and 512 channels respectively. To improve the detection of small targets, multi-scale feature fusion is applied to ResNet18 to build a feature pyramid: the last layers with 128, 256 and 512 convolution channels in ResNet18 are fused across scales, and the fused feature map of image size 64 × 64 is output to the detection head, which noticeably improves detection accuracy; the overall structure of the algorithm is shown in FIG. 3. The fusion proceeds as follows: first, a 1 × 1 convolution is applied to each of the three input feature maps of different sizes, reducing the number of channels to 128, which allows the different feature maps to be fused and reduces computation; then each of the three channel-reduced feature maps is brought to the common spatial size of 64 × 64; finally, the three 64 × 64 × 128 feature maps are added element-wise, and the result is output to the detection head.
Unlike a standard feature pyramid, the method deletes the two output channels of the feature pyramid whose output pictures are smaller, reducing the computation of the detection algorithm while preserving CenterNet's advantage of needing no non-maximum suppression, so the algorithm is lighter and more real-time. Meanwhile, the original CenterNet algorithm enlarges its 16 × 16 output picture to 128 × 128 through three deconvolution layers and outputs it to the detection head to obtain the position, width, height and offset of targets in the image. Because the method performs multi-scale feature fusion with the feature pyramid and keeps only the channel with the largest output image size (64 × 64), three deconvolution layers are unnecessary: only one deconvolution layer is kept and the other two are deleted, which still raises the final output image size to 128 × 128, reducing the required computation and making the algorithm lighter.
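The fusion just described (1 × 1 convolution down to 128 channels, resize to a common resolution, element-wise addition) can be sketched in numpy. The weights are random stand-ins rather than trained parameters, and nearest-neighbour upsampling stands in for whatever interpolation the trained network uses:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: per-pixel channel mixing. x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) feature map."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

rng = np.random.default_rng(0)
# Hypothetical ResNet18 outputs for a 512x512 input: 64x64x128, 32x32x256, 16x16x512.
f3 = rng.standard_normal((64, 64, 128))
f4 = rng.standard_normal((32, 32, 256))
f5 = rng.standard_normal((16, 16, 512))

# Reduce every map to 128 channels with (randomly initialised) 1x1 convolutions.
p3 = conv1x1(f3, rng.standard_normal((128, 128)))
p4 = conv1x1(f4, rng.standard_normal((256, 128)))
p5 = conv1x1(f5, rng.standard_normal((512, 128)))

# Bring all maps to 64x64 and add them element-wise; this fused map feeds the head.
fused = p3 + upsample_nearest(p4, 2) + upsample_nearest(p5, 4)
```

The design choice the patent highlights is visible here: only the largest-resolution output (64 × 64) is produced, so no smaller pyramid outputs need separate heads or an NMS merge step.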
The step 4 comprises the following steps:
and respectively storing the annotation files and the images in the data set in two folders, and jointly moving the annotation files and the images into the data folders of the algorithm. Py file to run algorithm by command at terminal to train the network. In the training process, the algorithm firstly calls a feature extraction network comprising a multi-scale fusion mode to perform feature extraction on an underwater optical image in a training set, then calls a detection head related file, outputs a feature diagram output by the feature extraction network into a Loss Function (Loss Function) in a detection head to calculate a numerical value, and completes one-time forward propagation; and then, the convolutional neural network adjusts parameters in the model according to the change condition of the Loss function value, so that the Loss of the training model changes to the minimum value. The algorithm can generate a continuously updated training model in the training process, loss gradually converges to the minimum value in the repeated forward propagation and backward propagation processes, and the training model at the moment is the optimal underwater optical image target detection training model, namely the training part of the CenterNet algorithm is completed.
The step 5 comprises the following steps:
and inputting the underwater optical image into an underwater target detection algorithm, and reading the trained underwater optical image target detection training model. In the detection stage, firstly, the input image is zoomed to 512 x 512, and then the zoomed image is subjected to feature extraction through a feature extraction network, at the moment, parameters in the feature extraction network are parameters of a training model, so that feature information in the underwater optical image can be optimally extracted. The characteristic information is input into the detection head, the schematic diagram of the detection head structure is shown in fig. 4, and the specific detection mode of the detection head is as follows.
Suppose the input image is I e R W×H×3 Where W and H are the width and height of the image, respectively, upon detection, a hot spot map (keypoint heat map) of the key point is generated:
Figure BDA0003134518570000081
wherein R is a step length for outputting the corresponding original image, which is set to 4, and C represents a category of the target detection object, and if 4 types of targets to be detected underwater exist, C =4; in this way it is possible to obtain,
Figure BDA0003134518570000082
is a predicted value of the detected object for
Figure BDA0003134518570000083
Indicating that for category c, an object of this category is detected in the current (x.y) coordinates, and
Figure BDA0003134518570000084
it means that there is no object of the category c at this coordinate point at present. The hot spots of each class in the output graph are extracted individually. The extraction method comprises the following steps: detecting the value of the current hot spot by adopting maximum pooling (MaxPool) of 3x3
Figure BDA0003134518570000085
Points (or equal) larger than the surrounding eight neighboring points (eight orientations), and then taking the value of all points
Figure BDA0003134518570000086
The first 100 points of maximum, produce effects similar to non-maximum suppression in the anchor-based assay.
Suppose $\hat{P}_c = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{n}$ is the set of detected points, where $(\hat{x}_i, \hat{y}_i)$ represents a point detected in class c. The position of each keypoint is expressed by integer coordinates $(x_i, y_i)$, and $\hat{Y}_{x_i y_i c}$ represents the confidence of the current point; these coordinates are then used to generate a calibration box:

$(\hat{x}_i + \delta\hat{x}_i - \hat{w}_i/2, \; \hat{y}_i + \delta\hat{y}_i - \hat{h}_i/2, \; \hat{x}_i + \delta\hat{x}_i + \hat{w}_i/2, \; \hat{y}_i + \delta\hat{y}_i + \hat{h}_i/2)$

where $(\delta\hat{x}_i, \delta\hat{y}_i)$ is the offset of the current point relative to the original image, and $(\hat{w}_i, \hat{h}_i)$ represents the width and height of the target corresponding to the predicted current point. Finally, from the values $\hat{Y}_{x_i y_i c}$ predicted by the model, i.e. the probability that an object exists at the current center point, the 100 points with the largest values are selected as possible center points; in the present invention a threshold of 0.3 is set, that is, the center points among the 100 selected results whose value is greater than the threshold are taken as the final result. The detection results finally output for the two images are shown in fig. 5.
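The box-generation formula above can be illustrated as follows (a minimal sketch under assumed inputs; the point list, offsets, sizes and scores are hypothetical values, not outputs of the trained model):

```python
def decode_boxes(points, offsets, sizes, scores, threshold=0.3):
    """Turn center points into (x1, y1, x2, y2, score) boxes in heat-map
    coordinates, keeping only points whose probability passes the threshold."""
    boxes = []
    for (x, y), (dx, dy), (w, h), s in zip(points, offsets, sizes, scores):
        if s < threshold:
            continue  # center point rejected, as in the 0.3-threshold step
        cx, cy = x + dx, y + dy  # integer coordinates plus predicted offset
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, s))
    return boxes

boxes = decode_boxes(
    points=[(10, 12), (3, 3)],
    offsets=[(0.4, 0.6), (0.0, 0.0)],
    sizes=[(4.0, 6.0), (2.0, 2.0)],
    scores=[0.8, 0.2],  # second point falls below the 0.3 threshold
)
print(boxes)
```

Multiplying the resulting coordinates by the output stride R = 4 maps the boxes back to the scale of the input image.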

Claims (3)

1. A lightweight underwater target detection method based on CenterNet is characterized by comprising the following steps:
step 1, shooting a target image underwater, and making a data set;
placing the target to be detected underwater, mounting a camera on a remotely operated unmanned submersible, capturing multi-scale, multi-azimuth images of the target to obtain target images, and making the target images into a data set;
step 2, dividing the data set into a training set and a test set, and labeling the training set;
step 3, selecting ResNet18, which detects small and dense targets better and is more suitable for a lightweight algorithm, as the feature extraction network; building a feature pyramid that performs multi-scale feature fusion on the last layers with 128, 256 and 512 convolution channels in ResNet18 and outputs a 64×64 feature map to the CenterNet detection head; the two output channels of the feature pyramid with the smaller output-image sizes are deleted, multi-scale feature fusion is performed with the feature pyramid while only the channel with the largest output-image size is retained, so that three layers of deconvolution are no longer needed: only one deconvolution layer is retained, the other two are deleted, and the size of the final output image is increased to 128×128;
step 4, performing deep learning training on the images and the labeled information in the training set by using a CenterNet algorithm to obtain a trained CenterNet algorithm model;
step 5, performing target detection on the images in the test set or on actually captured images by using the CenterNet algorithm to obtain the classification information and position information of the target to be detected in the image: the images in the test set or the actually captured images are read into the trained underwater optical image target detection model, and detection is then performed with the underwater target detection algorithm;
in the detection stage, the input image is first scaled to 512×512, and feature extraction is then performed on the scaled image through the feature extraction network; the parameters in the feature extraction network are at this point the parameters of the trained model, so the optimal feature information in the underwater optical image is extracted and the LOSS of the model has converged to its optimum; the feature information is input to the detection head, whose specific detection mode is as follows:
the input image is $I \in R^{W \times H \times 3}$, where W and H are the width and height of the image, respectively; during detection, a keypoint heat map is generated through a Gaussian kernel:

$\hat{Y} \in [0, 1]^{(W/R) \times (H/R) \times C}$

$\hat{Y}_{x,y,c}$ represents the value of each point in the heat map, where R is the output stride relative to the original image, set to 4, and C represents the number of categories of target detection objects; $\hat{Y}$ is the prediction for the detected object: $\hat{Y}_{x,y,c} = 1$ indicates that, for class c, an object of this class is detected at the current (x, y) coordinates, while $\hat{Y}_{x,y,c} = 0$ indicates that no object of class c is currently present at this coordinate point;
the hot spots of each class in the output map are extracted separately in the following way:
from the values $\hat{Y}_{x,y,c}$ predicted by the model trained in step 4, i.e. the probability that an object exists at the center point of the current predicted target, center points are selected: a 3×3 max pooling detects the points whose current hot-spot value $\hat{Y}_{x,y,c}$ is greater than or equal to the eight surrounding neighboring points, and from all such points the m points with the largest $\hat{Y}_{x,y,c}$ values are taken, m ≤ 100; the positions of the target objects are predicted from the m center points selected in the image to obtain m prediction boxes, and an estimate of whether each prediction box is accurate is obtained as its confidence;
suppose $\hat{P}_c = \{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{n}$ is the set of detected points, where $(\hat{x}_i, \hat{y}_i)$ represents a point detected in class c; the position of each keypoint is expressed by integer coordinates $(x_i, y_i)$, and $\hat{Y}_{x_i y_i c}$ represents the probability that the current point is a center point; a prediction box is then generated using these coordinates:

$(\hat{x}_i + \delta\hat{x}_i - \hat{w}_i/2, \; \hat{y}_i + \delta\hat{y}_i - \hat{h}_i/2, \; \hat{x}_i + \delta\hat{x}_i + \hat{w}_i/2, \; \hat{y}_i + \delta\hat{y}_i + \hat{h}_i/2)$

where $(\delta\hat{x}_i, \delta\hat{y}_i)$ is the offset of the current point relative to the original image, and $(\hat{w}_i, \hat{h}_i)$ represents the width and height of the target corresponding to the predicted current point; prediction targets with confidence less than the threshold 0.3 are deleted, the positions of prediction boxes with confidence greater than or equal to the threshold 0.3 are retained as the final position information, and the classification of the heat map is used as the final classification information.
2. The CenterNet-based lightweight underwater target detection method of claim 1, wherein:
the step 2 comprises the following steps:
dividing the images into a training set and a test set according to a proportion between 7:3 and 9:1; performing data annotation on the collected training-set images using the software labelimage, the annotation information being the position information and category information of the target to be detected in the image, thereby obtaining an underwater optical image data set in which the number of training-set images is greater than 500.
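The split described in claim 2 can be sketched as follows (a minimal illustration; the file names are hypothetical, and the 8:2 ratio is just one value inside the claimed 7:3 to 9:1 range):

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Split images into training and test sets; claim 2 allows 7:3 to 9:1."""
    assert 0.7 <= train_ratio <= 0.9
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for a fixed seed
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# hypothetical file names; claim 2 requires more than 500 training images
images = [f"underwater_{i:03d}.jpg" for i in range(600)]
train, test = split_dataset(images, train_ratio=0.8)
print(len(train), len(test))
```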
3. The CenterNet-based lightweight underwater target detection method according to claim 1, characterized in that:
the step 4 comprises the following steps:
storing the annotation files and the images of the data set in two separate folders, moving both into the data folder of the algorithm, and running the algorithm's main.py file from the terminal to train the network; during training, the feature extraction network, including the multi-scale fusion module, is first called to extract features from the underwater optical images in the training set, then the files of the detection head are called and the feature map output by the feature extraction network is passed to the loss function in the detection head to compute its value, completing one forward pass; the convolutional neural network then adjusts the parameters of the model according to the change of the LOSS value until the LOSS function reaches its minimum and no longer decreases; the model at this point is the optimal underwater optical image target detection model, and the training part of the CenterNet algorithm is complete.
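The stopping criterion in claim 3 (train until the LOSS reaches its minimum and stops decreasing) can be sketched as an early-stopping check (an illustrative sketch only; the loss sequence, patience and min_delta values are hypothetical, and the actual training loop is not reproduced here):

```python
def train_until_converged(loss_per_epoch, patience=3, min_delta=1e-3):
    """Return (stop_epoch, best_loss): training stops once the LOSS has not
    improved by more than min_delta for `patience` consecutive epochs."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(loss_per_epoch):
        if best - loss > min_delta:
            best, stale = loss, 0  # LOSS still decreasing
        else:
            stale += 1
            if stale >= patience:  # minimum reached, no further decrease
                return epoch, best
    return len(loss_per_epoch) - 1, best

# hypothetical per-epoch LOSS values that flatten out at 0.55
losses = [2.0, 1.2, 0.8, 0.6, 0.55, 0.55, 0.55, 0.55]
print(train_until_converged(losses))
```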
Application CN202110723096.4A, filed 2021-06-25, priority 2021-06-25: Lightweight underwater target detection method based on CenterNet. Status: Active. Granted as CN113420819B (en).


Publications (2)

Publication Number Publication Date
CN113420819A (en) 2021-09-21
CN113420819B (en) 2022-12-06




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant