CN113076871A - Fish shoal automatic detection method based on target shielding compensation - Google Patents

Fish shoal automatic detection method based on target shielding compensation

Info

Publication number
CN113076871A
Authority
CN
China
Prior art keywords
feature
feature map
fish
image
candidate
Prior art date
Legal status
Granted
Application number
CN202110354428.6A
Other languages
Chinese (zh)
Other versions
CN113076871B (en)
Inventor
丁泉龙
杨伟健
曹燕
王一歌
韦岗
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110354428.6A
Publication of CN113076871A
Application granted
Publication of CN113076871B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 - Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80 - Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81 - Aquaculture, e.g. of fish

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fish school automatic detection method based on target occlusion compensation, which comprises the following steps: a camera carried on a multi-rotor unmanned aerial vehicle collects fish school images, which are then labeled and expanded; feature extraction is performed, in which a dual-branch feature extraction network performs multi-level, shallow-to-deep feature extraction on an input fish school image to obtain five feature maps; feature fusion is performed, in which an improved semantic embedding branch fuses the semantic information of each deep feature map into the shallower feature map one level above it, and the detail information of the four-fold downsampled feature map is fused into the eight-fold downsampled feature map; fish targets are then predicted from three feature maps to obtain candidate boxes, overlapping candidate boxes are processed with an improved DIoU_NMS non-maximum suppression algorithm, and the fish school detection result is output. The invention improves the recall rate of fish school detection when fish gather and occlude one another, and thereby improves the average precision of fish school detection.

Description

Fish shoal automatic detection method based on target shielding compensation
Technical Field
The invention relates to the technical field of image target detection, in particular to a fish school automatic detection method based on target shielding compensation.
Background
Modern fish farming depends on systematic management, and fish school detection is of great practical significance for industrialized aquaculture: it can determine whether fish are present and how large they are, and hence whether stocking and feeding are appropriate.
Fish school detection can use either sonar images or optical images. The sonar method applies the ultrasonic principle: a sonar system acquires underwater sonar images of the fish school, from which fish targets are then detected; in a real underwater scene, however, sonar is easily disturbed by other objects. With the development and improvement of underwater photography, optical methods have become practical. An optical method first acquires an optical image of the fish school and then detects and marks the fish with a target detection method. Target detection, a branch of image processing, finds all objects of specified categories in a picture and marks their positions with rectangular boxes. Manually marking fish schools is expensive and inefficient, so for the automation and informatization of the fish farming industry it is important to study automatic fish school detection under the actual underwater conditions of a farm.
With the continuous development of computer technology, using deep learning to automatically detect fish in underwater optical images reduces the time spent finding and marking fish, saving workers' time and improving working efficiency.
The YOLOv4 target detection algorithm is a deep learning algorithm that balances detection speed and detection precision and is widely applied in the field of image target detection. YOLOv4 first feeds a data set into the network for training and saves the trained model weight file; with the saved weights, a test image can then be processed to generate prediction boxes where targets may exist, together with a confidence score for each box. Because it performs well in both speed and precision, the algorithm is suitable for automatic fish school detection and can produce a detection result quickly after a fish school image is captured.
However, real underwater scenes are complex, and captured fish images often contain fish that gather and occlude one another. If the YOLOv4 algorithm is applied directly, it detects occluded targets poorly, misses detections, and yields a relatively low recall rate for fish targets. An underwater fish detection method with a high recall rate is therefore desirable.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides an automatic fish school detection method based on target occlusion compensation.
The purpose of the invention can be achieved by adopting the following technical scheme:
a fish school automatic detection method based on target shielding compensation comprises the following steps:
S1, collecting fish school images in a pond environment with a camera carried on a multi-rotor unmanned aerial vehicle, and labeling and expanding the collected fish school images;
The underwater fish school images can be acquired by flying the multi-rotor unmanned aerial vehicle to a water area of interest, landing it on the water surface, and then capturing optical image data of the cultured fish school with the onboard camera.
S2, inputting the fish image into a double branch feature extraction network to perform multilevel feature extraction from shallow to deep, wherein the double branch feature extraction network is called as a double branch feature extraction network because a lightweight original information feature extraction network parallel to CSPDarknet53 is added on the basis of a trunk feature extraction network CSPDarknet53 of a YOLOv4 algorithm; after multi-stage feature extraction is carried out by a double-branch feature extraction network, five feature maps are obtained, and the five feature maps are respectively a two-time down-sampling feature map FA1Fourfold down-sampling feature map FA2Eight-fold down-sampling feature map FA3Sixteen-fold down-sampling feature map FA4Thirty-two times downsampling feature map FA5Resolutions of 1/2, 1/4, 1/8, 1/16, 1/32 of the input fish school image, respectively;
S3, using the improved semantic embedding branch (MSEB) to fuse the semantic information of the feature map F_A5 obtained in step S2 into the feature map F_A4, obtaining the feature map F_AM4 at 1/16 of the resolution of the input fish school image; and fusing the semantic information of the feature map F_A4 obtained in step S2 into the feature map F_A3, obtaining the feature map F_AM3 at 1/8 of the resolution of the input fish school image;
S4, fusing, by convolutional downsampling, the detail information of the four-fold downsampled feature map F_A2 obtained in step S2 into the eight-fold downsampled feature map F_AM3 obtained in step S3, obtaining the feature map F_AMC3 at 1/8 of the resolution of the input fish school image;
S5, feeding the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 into the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5; then, after convolution processing, using F_B3, F_B4 and F_B5 to predict fish targets, obtaining overlapping candidate boxes and their prediction confidence scores;
S6, processing the overlapping candidate boxes with the improved DIoU_NMS non-maximum suppression algorithm to obtain prediction boxes with prediction confidence scores, and drawing the prediction boxes on the corresponding picture as the fish school detection result.
Further, in step S1, the fish targets in each collected fish school image are labeled one by one with the labelImg image annotation software; after labeling, each image has an xml tag file containing the annotation information, and the collected images together with their tag files form the original data set. The original data set is then expanded with data enhancement, including vertical flipping, horizontal flipping, brightness changes, added random Gaussian white noise, filtering and affine transformation, to form the final data set and improve the robustness of the network model to environmental changes.
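As an illustrative sketch only (the patent names the transforms but not an implementation), the data-expansion step could be realized with OpenCV; all parameter values below (brightness offset, noise strength, rotation angle) are assumptions, and the xml box annotations would have to be transformed alongside the geometric augmentations:

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Return augmented copies of one fish-school image (boxes handled separately)."""
    out = []
    out.append(cv2.flip(image, 0))                              # vertical flip
    out.append(cv2.flip(image, 1))                              # horizontal flip
    out.append(cv2.convertScaleAbs(image, alpha=1.0, beta=30))  # brightness change
    noise = np.random.normal(0, 10, image.shape).astype(np.float32)
    out.append(np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8))
    out.append(cv2.GaussianBlur(image, (5, 5), 0))              # filtering
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 10, 1.0)        # small affine transform
    out.append(cv2.warpAffine(image, m, (w, h)))
    return out
```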
Further, in step S2, the fish school image is input into the dual-branch feature extraction network for multi-level feature extraction from shallow to deep, so that the original features of the input image are extracted and retained more fully, compensating for the loss of fish features when fish occlude one another. The dual-branch feature extraction proceeds as follows:
The backbone feature extraction network CSPDarknet53 comprises one CBM unit and five cross-stage partial network CSPx units. A CBM unit consists of a Convolution layer with stride 1 and a 3 x 3 kernel, a Batch Normalization layer and a Mish activation layer. A CSPx unit fuses several CBM units with x Res unit residual units, each Res unit consisting of a CBM unit with a 1 x 1 kernel, a CBM unit with a 3 x 3 kernel and a residual structure; the two feature maps are spliced along the channel dimension by a Concatenate fusion operation, which expands the channel dimension of the spliced feature map. The Convolution channel counts of the five CSPx units are 64, 128, 256, 512 and 1024 in turn, and each CSPx unit downsamples by a factor of two. The feature maps produced by the five CSPx units are F_C1, F_C2, F_C3, F_C4 and F_C5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
The lightweight original-information feature extraction network comprises five CM units, each consisting of a Convolution layer with stride 2 and a 3 x 3 kernel and a MaxPool layer with pooling stride 1 and a 3 x 3 pooling kernel; the stride-2 convolution performs one two-fold downsampling, and the number of convolution channels of each CM unit equals that of the corresponding CSPx unit in the backbone network CSPDarknet53. The feature maps produced by the five CM units are F_L1, F_L2, F_L3, F_L4 and F_L5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
Then, during the shallow-to-deep multi-level feature extraction, the feature map F_Li extracted by the lightweight original-information network and the feature map F_Ci extracted by the CSPDarknet53 network (i = 1, 2, 3, 4, 5) undergo an Add fusion operation, i.e., the corresponding pixel values of the two feature maps are added, yielding the finally extracted feature maps F_A1, F_A2, F_A3, F_A4 and F_A5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively.
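A minimal PyTorch sketch of one stage of this dual-branch extraction follows; the CSPx block is stubbed by a single stride-2 convolution and the channel sizes are taken from the text, so this illustrates the CM unit and the Add fusion rather than the full patented network:

```python
import torch
import torch.nn as nn

class CMUnit(nn.Module):
    """CM unit: 3 x 3 convolution with stride 2, then 3 x 3 max pooling with stride 1."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)

    def forward(self, x):
        return self.pool(self.conv(x))

class DualBranchStage(nn.Module):
    """One of five stages: backbone output F_Ci plus lightweight output F_Li."""
    def __init__(self, backbone_stage, c_in, c_out):
        super().__init__()
        self.backbone = backbone_stage        # stand-in for a CSPx block
        self.cm = CMUnit(c_in, c_out)

    def forward(self, x_backbone, x_light):
        f_c = self.backbone(x_backbone)       # F_Ci
        f_l = self.cm(x_light)                # F_Li
        f_a = f_c + f_l                       # Add fusion: pixel-wise sum -> F_Ai
        return f_c, f_l, f_a                  # the two branches continue unfused

# stand-in CSPx: a single stride-2 convolution with the stage's channel width
stage1 = DualBranchStage(nn.Conv2d(3, 64, 3, stride=2, padding=1), 3, 64)
x = torch.randn(1, 3, 416, 416)
f_c1, f_l1, f_a1 = stage1(x, x)               # f_a1: (1, 64, 208, 208)
```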
Furthermore, in the shallow-to-deep multi-level extraction, shallow feature maps carry rich detail information but lack semantic information, so targets cannot be identified and detected well from them; conversely, deep layers extract semantic information well but lose much detail and cannot predict position effectively. Therefore, in step S3, the improved semantic embedding branch fuses the semantic information of a deep feature map into the shallower feature map one level above it, compensating for the shallow map's lack of semantic information and thereby improving the recall rate of fish targets during detection. The fusion with the improved semantic embedding branch proceeds as follows:
First, the deep feature map F_A5 obtained in step S2 is processed by a convolution layer with a 1 x 1 kernel and a convolution layer with a 3 x 3 kernel to fuse features of different scales, passed through a Sigmoid function, upsampled two-fold by nearest-neighbor interpolation, and multiplied pixel-wise with the shallow feature map F_A4 obtained in step S2, yielding the feature map F_AM4 at 1/16 of the resolution of the input fish school image; the semantic information of the deep feature map F_A5 is thus fused into the shallow feature map F_A4, compensating for the insufficient semantic information of F_A4;
Then the improved semantic embedding branch is likewise used to fuse the semantic information of the feature map F_A4 obtained in step S2 into the shallow feature map F_A3, yielding the feature map F_AM3 at 1/8 of the resolution of the input fish school image and compensating for the insufficient semantic information of F_A3;
The Sigmoid function used in the improved semantic embedding branch has the form

Sigmoid(i) = 1 / (1 + e^(-i))    formula (1)

where i is the input and e is the natural constant.
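The MSEB wiring described above can be sketched in PyTorch as follows, assuming the 1 x 1 and 3 x 3 convolutions act in sequence on the deep map (fig. 3 fixes the exact wiring, which this sketch only approximates):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSEB(nn.Module):
    """Semantic gate from a deep map, applied multiplicatively to a shallow map."""
    def __init__(self, c_deep, c_shallow):
        super().__init__()
        self.conv1 = nn.Conv2d(c_deep, c_shallow, 1)                # 1 x 1 conv
        self.conv3 = nn.Conv2d(c_shallow, c_shallow, 3, padding=1)  # 3 x 3 conv

    def forward(self, f_deep, f_shallow):
        g = torch.sigmoid(self.conv3(self.conv1(f_deep)))           # formula (1)
        g = F.interpolate(g, scale_factor=2.0, mode="nearest")      # two-fold upsampling
        return f_shallow * g                                        # pixel-wise product

mseb = MSEB(c_deep=1024, c_shallow=512)
f_a5 = torch.randn(1, 1024, 13, 13)   # 1/32-resolution map for a 416 x 416 input
f_a4 = torch.randn(1, 512, 26, 26)    # 1/16-resolution map
f_am4 = mseb(f_a5, f_a4)              # semantics of F_A5 embedded into F_A4
```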
Further, in step S4, the detail information of the four-fold downsampled feature map is fused into the eight-fold downsampled feature map by convolutional downsampling, making full use of that detail information to improve the localization of fish edge contours when fish occlude one another. The fusion proceeds as follows:
First, the four-fold downsampled feature map F_A2 obtained in step S2 is processed by a CBL unit, where a CBL unit consists of a Convolution layer with stride 1 and a 3 x 3 kernel, a Batch Normalization layer and a LeakyReLU activation layer; it is then downsampled two-fold by a Convolution layer with stride 2 and a 3 x 3 kernel, and fused by a Concatenate operation with the feature map F_AM3 obtained in step S3 after CBL processing, obtaining the feature map F_AMC3 at 1/8 of the resolution of the input fish school image, thereby exploiting the detail information of the four-fold downsampled feature map F_A2.
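A hedged PyTorch sketch of this detail-compensation fusion follows; the channel counts (128 for F_A2, 256 for F_AM3) are assumptions consistent with the backbone widths given earlier:

```python
import torch
import torch.nn as nn

def cbl(c_in, c_out):
    """CBL unit: 3 x 3 convolution (stride 1) + Batch Normalization + LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class DetailFusion(nn.Module):
    def __init__(self, c_a2=128, c_am3=256):
        super().__init__()
        self.pre = cbl(c_a2, c_a2)                                  # CBL on F_A2
        self.down = nn.Conv2d(c_a2, c_a2, 3, stride=2, padding=1)   # two-fold downsampling
        self.post = cbl(c_am3, c_am3)                               # CBL on F_AM3

    def forward(self, f_a2, f_am3):
        d = self.down(self.pre(f_a2))               # 1/4 resolution -> 1/8
        return torch.cat([d, self.post(f_am3)], 1)  # Concatenate fusion -> F_AMC3

fuse = DetailFusion()
f_amc3 = fuse(torch.randn(1, 128, 104, 104), torch.randn(1, 256, 52, 52))
```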
Further, the step S5 process is as follows:
First, the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 are fed into the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5. The feature pyramid structure of YOLOv4 comprises a spatial pyramid pooling layer (SPP) and a path aggregation network (PANet): in the SPP structure, the feature map F_A5, after three CBL units, passes through four maximum pooling layers with 1 x 1, 5 x 5, 9 x 9 and 13 x 13 pooling kernels whose outputs undergo a Concatenate fusion operation; the PANet structure repeatedly fuses features along bottom-up and top-down paths. The three feature maps F_B3, F_B4 and F_B5 are then each processed by a CBL unit and a convolution layer with a 1 x 1 kernel, obtaining three prediction feature maps of different sizes, Prediction1, Prediction2 and Prediction3, with resolutions of 1/8, 1/16 and 1/32 of the input fish school image; these three prediction feature maps are used to predict fish targets, obtaining overlapping candidate boxes and their prediction confidence scores.
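The SPP block described here admits a direct sketch; pooling strides of 1 with same-padding keep the spatial size so the four branches can be concatenated:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Four parallel max-pool branches (1, 5, 9, 13 kernels), concatenated on channels."""
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (1, 5, 9, 13)
        )

    def forward(self, x):  # x: F_A5 after three CBL units
        return torch.cat([p(x) for p in self.pools], dim=1)  # channels x4

spp = SPP()
y = spp(torch.randn(1, 512, 13, 13))   # -> (1, 2048, 13, 13)
```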
Further, in step S6, the overlapping candidate boxes are processed with the improved DIoU_NMS non-maximum suppression algorithm, which compensates for missed detections of occluded targets and further improves the recall rate of occluded fish. The specific procedure is as follows:
S601, traversing all candidate boxes in an image, judging the prediction confidence score of each in turn, keeping the candidate boxes whose scores exceed the confidence threshold together with their scores, and deleting the candidate boxes whose scores fall below it;
S602, selecting the candidate box M with the highest prediction confidence score among the remaining candidate boxes, then traversing the other candidate boxes B_i in turn and computing their distance intersection-over-union Distance-IoU (DIoU for short) with M. If the DIoU between some candidate box B_i and M is not lower than a given threshold ε, the two boxes are considered highly overlapped; directly deleting B_i, as the original DIoU_NMS algorithm does, easily causes missed detections when gathered fish occlude one another, so the improved DIoU_NMS algorithm instead reduces the prediction confidence score of B_i, and then moves the candidate box M into the final prediction box set G. The score-reduction criterion is:
S'_i = S_i, if DIoU(M, B_i) < ε; S'_i = S_i * (1 - DIoU(M, B_i)), if DIoU(M, B_i) >= ε    formula (2)

where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, DIoU(M, B_i) is the distance intersection-over-union of M and B_i, ε is the given DIoU threshold, S_i is the prediction confidence score of candidate box B_i, and S'_i is its reduced prediction confidence score;
and S603, repeatedly executing the step S602 until all the candidate frames are processed, and drawing the final prediction frame set G on the corresponding picture as an output result to obtain a fish school detection result.
Further, in step S602, DIoU adds to the intersection-over-union IoU a penalty factor that accounts for the distance between the center points of the two candidate boxes; DIoU(M, B_i) is calculated as follows:
DIoU(M, B_i) = IoU(M, B_i) - ρ²(M, B_i) / c²    formula (3)

where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, and IoU(M, B_i) is the ratio of the intersection to the union of M and B_i.
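Steps S601-S603 and formulas (2)-(3) can be sketched in NumPy as below; the (x1, y1, x2, y2) box format, the threshold values and the linear score decay of the reconstructed formula (2) are all assumptions:

```python
import numpy as np

def diou(m, boxes):
    """DIoU between box m and an (N, 4) array of boxes, all in (x1, y1, x2, y2) form."""
    x1 = np.maximum(m[0], boxes[:, 0]); y1 = np.maximum(m[1], boxes[:, 1])
    x2 = np.minimum(m[2], boxes[:, 2]); y2 = np.minimum(m[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_m = (m[2] - m[0]) * (m[3] - m[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area_m + areas - inter)
    # rho^2: squared centre distance; c^2: squared diagonal of the enclosing rectangle
    rho2 = (((m[0] + m[2]) - (boxes[:, 0] + boxes[:, 2])) ** 2
            + ((m[1] + m[3]) - (boxes[:, 1] + boxes[:, 3])) ** 2) / 4.0
    cx1 = np.minimum(m[0], boxes[:, 0]); cy1 = np.minimum(m[1], boxes[:, 1])
    cx2 = np.maximum(m[2], boxes[:, 2]); cy2 = np.maximum(m[3], boxes[:, 3])
    c2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    return iou - rho2 / c2                          # formula (3)

def improved_diou_nms(boxes, scores, conf_thr=0.25, eps=0.45):
    keep = scores > conf_thr                        # S601: confidence filtering
    boxes, scores = boxes[keep], scores[keep].astype(float)
    final_boxes, final_scores = [], []
    while scores.size:
        i = int(np.argmax(scores))                  # S602: box M with the top score
        m, s = boxes[i], scores[i]
        final_boxes.append(m); final_scores.append(s)
        boxes = np.delete(boxes, i, axis=0)
        scores = np.delete(scores, i)
        if scores.size:
            d = diou(m, boxes)
            decay = d >= eps
            scores[decay] *= 1.0 - d[decay]         # formula (2): reduce, don't delete
    # S603: boxes whose decayed score fell below the threshold are dropped at the end
    return [(b, s) for b, s in zip(final_boxes, final_scores) if s > conf_thr]
```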
Compared with the prior art, the invention has the following advantages and effects:
(1) In image feature extraction, the dual-branch feature extraction network extracts features of the input fish school image, compensating for the loss of fish features under occlusion and extracting the original features of the fish more fully.
(2) The invention uses the improved semantic embedding branch MSEB to fuse the semantic information of deep feature maps into the feature maps one level above, compensating for the insufficient semantic information of shallow feature maps and further improving the recall rate of fish targets.
(3) The invention fuses the detail information of the four-fold downsampled feature map into the eight-fold downsampled feature map, fully exploiting that detail to capture fish edge-contour information, so the edge contours of occluded fish are localized more accurately.
(4) The invention processes overlapping candidate boxes with the improved DIoU_NMS non-maximum suppression algorithm, compensating for missed detections of occluded targets; deleting duplicate candidate boxes is balanced against missing true boxes, further improving the recall rate of occluded fish.
Drawings
FIG. 1 is a flow chart of a fish shoal automatic detection method based on target occlusion compensation disclosed by the invention;
fig. 2 is a network structure diagram of a fish school automatic detection method based on target occlusion compensation in an embodiment of the present invention, where Concat represents Concatenate fusion operation;
fig. 3 is a block diagram of an improved semantic embedding branch MSEB in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
This embodiment, following the flow chart of fig. 1 and the network structure of fig. 2, provides a fish school automatic detection method based on target occlusion compensation for automatically detecting underwater fish school targets. The specific flow is as follows:
S1, flying the multi-rotor unmanned aerial vehicle to a water area of interest and landing it on the water surface, then capturing image data of the cultured fish school with the onboard camera facing forward, with an interval of 5 seconds between shots and an original image resolution of 1920 x 1080; the collected fish school images are then labeled and expanded to obtain the training data set;
S2, inputting the fish school image into a dual-branch feature extraction network for multi-level feature extraction from shallow to deep; the network is called a dual-branch feature extraction network because a lightweight original-information feature extraction network parallel to the backbone feature extraction network CSPDarknet53 is added on the basis of the backbone feature extraction network CSPDarknet53 of the YOLOv4 algorithm. After multi-level feature extraction, five feature maps are obtained: the two-fold downsampled feature map F_A1, four-fold downsampled feature map F_A2, eight-fold downsampled feature map F_A3, sixteen-fold downsampled feature map F_A4 and thirty-two-fold downsampled feature map F_A5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
S3, using the improved semantic embedding branch (MSEB) to fuse the semantic information of the feature map F_A5 obtained in step S2 into the feature map F_A4, obtaining the feature map F_AM4 at 1/16 of the resolution of the input fish school image; and fusing the semantic information of the feature map F_A4 obtained in step S2 into the feature map F_A3, obtaining the feature map F_AM3 at 1/8 of the resolution of the input fish school image;
S4, fusing, by convolutional downsampling, the detail information of the four-fold downsampled feature map F_A2 obtained in step S2 into the eight-fold downsampled feature map F_AM3 obtained in step S3, obtaining the feature map F_AMC3 at 1/8 of the resolution of the input fish school image;
S5, feeding the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 into the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5; then, after convolution processing, using F_B3, F_B4 and F_B5 to predict fish targets, obtaining overlapping candidate boxes and their prediction confidence scores;
S6, processing the overlapping candidate boxes with the improved DIoU_NMS non-maximum suppression algorithm to obtain prediction boxes with prediction confidence scores, and drawing the prediction boxes on the corresponding picture as the fish school detection result.
In this embodiment, step S1 manually labels the fish bodies in the collected fish images one by one with rectangular boxes using the labelImg labeling software, producing the corresponding xml tag files that record the coordinates and category of each target; the collected fish images and their tag files are then expanded with data enhancement, including vertical flipping, horizontal flipping, brightness changes, added random Gaussian white noise, filtering and affine transformation, to form the final data set and improve the robustness of the network model to environmental changes.
In this embodiment, in step S2, the fish school image is input into the dual-branch feature extraction network for multi-level feature extraction from shallow to deep; 208 in fig. 2 shows the specific structure of the dual-branch network, in which a lightweight original-information feature extraction network parallel to CSPDarknet53 is added to the backbone feature extraction network CSPDarknet53 of the YOLOv4 algorithm. The structure is described as follows:
The backbone feature extraction network CSPDarknet53 comprises one CBM unit and five cross-stage partial network CSPx units. A CBM unit consists of a Convolution layer with stride 1 and a 3 x 3 kernel, a Batch Normalization layer and a Mish activation layer; 201 in fig. 2 shows the structure of a CBM unit. A CSPx unit fuses several CBM units with x Res unit residual units; 204 in fig. 2 shows the structure of a CSPx unit. Each Res unit in a CSPx unit consists of a CBM unit with a 1 x 1 kernel, a CBM unit with a 3 x 3 kernel and a residual structure; 203 in fig. 2 shows the structure of a Res unit. The two feature maps are spliced along the channel dimension by a Concatenate fusion operation, which expands the channel dimension. The convolution channel counts of the five CSPx units are 64, 128, 256, 512 and 1024 in turn, and each CSPx unit downsamples by a factor of two. The feature maps produced by the five CSPx units are F_C1, F_C2, F_C3, F_C4 and F_C5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
The lightweight original-information feature extraction network comprises five CM units, each consisting of a Convolution layer with stride 2 and a 3 x 3 kernel and a MaxPool layer with pooling stride 1 and a 3 x 3 pooling kernel; 205 in fig. 2 shows the structure of a CM unit. The stride-2 convolution performs one two-fold downsampling, and the number of convolution channels of each CM unit equals that of the corresponding CSPx unit in the backbone network CSPDarknet53. The feature maps produced by the five CM units are F_L1, F_L2, F_L3, F_L4 and F_L5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
Then, during the shallow-to-deep multi-level feature extraction, the feature map F_Li (i = 1, 2, 3, 4, 5) extracted by the lightweight original-information network and the corresponding feature map F_Ci (i = 1, 2, 3, 4, 5) extracted by the CSPDarknet53 network undergo an Add fusion operation, i.e., the corresponding pixel values of the two feature maps are added, yielding the finally extracted feature maps F_A1, F_A2, F_A3, F_A4 and F_A5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively.
In this embodiment, in step S3, the improved semantic embedding branch MSEB fuses the semantic information of the deep feature maps into the shallower feature maps one level above; fig. 3 shows the specific structure of the MSEB. The fusion with the MSEB proceeds as follows: first, the deep feature map F_A5 obtained in step S2 is processed by a convolution layer with a 1 x 1 kernel and a convolution layer with a 3 x 3 kernel to fuse features of different scales, passed through a Sigmoid function, upsampled two-fold by nearest-neighbor interpolation, and multiplied pixel-wise with the shallow feature map F_A4 obtained in step S2, yielding the feature map F_AM4 at 1/16 of the resolution of the input fish school image; the semantic information of the deep feature map F_A5 is thus fused into the shallow feature map F_A4, compensating for the insufficient semantic information of F_A4;
Then the MSEB is likewise used to fuse the semantic information of the feature map F_A4 obtained in step S2 into the shallow feature map F_A3, yielding the feature map F_AM3 at 1/8 of the resolution of the input fish school image and compensating for the insufficient semantic information of F_A3;
The Sigmoid function used in the improved semantic embedding branch MSEB has the form

Sigmoid(i) = 1 / (1 + e^(-i))    formula (1)

where i is the input and e is the natural constant.
In this embodiment, the implementation process of step S4 is as follows:
First, the four-fold downsampled feature map F_A2 obtained in step S2 is processed by a CBL unit, where a CBL unit consists of a Convolution layer with stride 1 and a 3 x 3 kernel, a Batch Normalization layer and a LeakyReLU activation layer; 202 in fig. 2 shows the structure of a CBL unit. The result is then downsampled two-fold by a Convolution layer with stride 2 and a 3 x 3 kernel, and fused by a Concatenate operation with the feature map F_AM3 obtained in step S3 after CBL processing, obtaining the feature map F_AMC3 at 1/8 of the resolution of the input fish school image. The detail information of the four-fold downsampled feature map F_A2 is thus used to fully capture fish edge-contour information and compensate the localization of fish edge contours under fish school occlusion.
In this embodiment, the implementation process of step S5 is as follows:
First, the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 are fed into the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5. The feature pyramid structure of YOLOv4 comprises a spatial pyramid pooling layer (SPP) and a path aggregation network (PANet). In the SPP structure, the feature map F_A5, after three CBL units, passes through four maximum pooling layers with 1 x 1, 5 x 5, 9 x 9 and 13 x 13 pooling kernels whose outputs undergo a Concatenate fusion operation; 206 in fig. 2 shows the structure of the SPP. The PANet structure repeatedly fuses features along bottom-up and top-down paths; 207 in fig. 2 shows the structure of the PANet. The three feature maps F_B3, F_B4 and F_B5 are then each processed by a CBL unit and a convolution layer with a 1 x 1 kernel, obtaining three prediction feature maps of different sizes, Prediction1, Prediction2 and Prediction3, with resolutions of 1/8, 1/16 and 1/32 of the input fish school image; these three prediction feature maps are used to predict fish targets, obtaining overlapping candidate boxes and their prediction confidence scores.
In this embodiment, the implementation process of step S6 is as follows:
S601, traversing all candidate boxes in an image, judging the prediction confidence score of each in turn, keeping the candidate boxes whose scores exceed the confidence threshold together with their scores, and deleting the candidate boxes whose scores fall below it;
S602, selecting the candidate box M with the highest prediction confidence score among the remaining candidate boxes, then traversing the other candidate boxes B_i in turn and computing their distance intersection-over-union Distance-IoU (DIoU for short) with M. If the DIoU between some candidate box B_i and M is not lower than a given threshold ε, the two boxes are considered highly overlapped; candidate box B_i is not deleted directly, but its prediction confidence score is reduced, and candidate box M is then moved into the final prediction box set G. The score-reduction criterion is:
S'_i = S_i, if DIoU(M, B_i) < ε; S'_i = S_i * (1 - DIoU(M, B_i)), if DIoU(M, B_i) >= ε    formula (2)

where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, DIoU(M, B_i) is the distance intersection-over-union of M and B_i, ε is the given DIoU threshold, S_i is the prediction confidence score of candidate box B_i, and S'_i is its reduced prediction confidence score;
and S603, repeatedly executing the step S602 until all the candidate frames are processed, and drawing the final prediction frame set G on the corresponding picture as an output result to obtain a fish school detection result.
The DIoU in step S602 adds to the intersection-over-union IoU a penalty factor that accounts for the distance between the center points of the two candidate boxes; it is calculated as follows:
DIoU(M, B_i) = IoU(M, B_i) - ρ²(M, B_i) / c²    formula (3)

where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, and IoU(M, B_i) is the ratio of the intersection to the union of M and B_i.
In this embodiment, the prediction boxes must be adjusted continuously during training to approach the real boxes of the targets to be detected, so before training a K-means clustering algorithm is run on the fish school image data set to obtain 9 prior boxes of different sizes suited to the collected data; the three prediction feature maps Prediction1, Prediction2 and Prediction3 are each assigned 3 prior boxes of different sizes. The K-means clustering algorithm measures how close two boxes are using the intersection-over-union IoU as the index; the distance between two boxes is computed as:
distance(box, center) = 1 - IoU(box, center)    formula (4)
where box is the candidate box under consideration, center is the cluster-center box, and IoU(box, center) is the intersection-over-union of the two.
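A sketch of this anchor clustering, using the 1 - IoU distance of formula (4) over (width, height) pairs with the stated 9 clusters, could read as follows; the iteration count and initialization are assumptions:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) boxes and (w, h) centroids, both anchored at the origin."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # formula (4): distance(box, center) = 1 - IoU(box, center)
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]  # sorted by area
```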
In this embodiment, training uses an initial learning rate of 0.0002 and 45 training epochs; 8 images are randomly selected for each batch, and an Adam optimizer accelerates convergence of the network model. To reduce GPU memory overhead, each training image is resized to 416 x 416.
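These settings correspond roughly to the following PyTorch training skeleton; the model, dataset and loss are stand-ins (the real ones are the modified YOLOv4 network, the labeled fish school data set and the loss of formula (5) below), so only the hyperparameters are taken from the text:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the detection network
dataset = TensorDataset(torch.randn(32, 3, 416, 416),      # images resized to 416 x 416
                        torch.randn(32, 16, 416, 416))     # stand-in targets
loader = DataLoader(dataset, batch_size=8, shuffle=True)   # batch size 8
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial lr 0.0002

for epoch in range(45):                                    # 45 epochs
    for images, targets in loader:
        loss = torch.nn.functional.mse_loss(model(images), targets)  # stand-in loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```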
In this embodiment, the loss function loss consists of three parts: the regression-box prediction error L_loc, the confidence error L_conf and the classification error L_cls:

loss = L_loc + L_conf + L_cls    formula (5)

with

L_loc = Σ_{i=0..S²} Σ_{j=0..M} 1_ij^obj · [1 - IoU(P, T) + ρ²(P_ctr, T_ctr)/d² + α·v]

v = (4/π²) · (arctan(w_gt/h_gt) - arctan(w/h))²    formula (6)

L_conf = - Σ_{i=0..S²} Σ_{j=0..M} 1_ij^obj · [C*_i·log(C_i) + (1 - C*_i)·log(1 - C_i)] - λ_noobj · Σ_{i=0..S²} Σ_{j=0..M} 1_ij^noobj · [C*_i·log(C_i) + (1 - C*_i)·log(1 - C_i)]

L_cls = - Σ_{i=0..S²} 1_i^obj · Σ_{c∈classes} [P*_i(c)·log(P_i(c)) + (1 - P*_i(c))·log(1 - P_i(c))]

The specific calculation of v in formula (5) is given by formula (6). IoU(P, T) is the intersection-over-union of the prediction box P and the real box T; ρ(P_ctr, T_ctr) is the distance between their center points; d is the diagonal length of the minimum bounding rectangle containing the prediction box and the real box; w_gt and h_gt are the width and height of the real box, and w and h the width and height of the prediction box; α = v / (1 - IoU(P, T) + v). The image is divided into S x S grids and M is the number of prior boxes (anchors) generated per grid; 1_ij^obj indicates that the prediction box contains an object to be detected, and 1_ij^noobj that it does not; C_i is the prediction confidence of the corresponding prior box and C*_i the actual confidence; λ_noobj is a set weight coefficient; c is the category of the object to be detected; P*_i(c) is the actual probability that the target in the corresponding grid belongs to category c, and P_i(c) the predicted probability.
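The localization term of the reconstructed formula (5), the CIoU loss, can be sketched for a single box pair as follows; this is a standard CIoU implementation under the definitions above, not code from the patent:

```python
import math

def ciou_loss(p, t):
    """CIoU localization loss for prediction box p and real box t, (x1, y1, x2, y2)."""
    ix1, iy1 = max(p[0], t[0]), max(p[1], t[1])
    ix2, iy2 = min(p[2], t[2]), min(p[3], t[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (p[2] - p[0]) * (p[3] - p[1])
    area_t = (t[2] - t[0]) * (t[3] - t[1])
    iou = inter / (area_p + area_t - inter)
    # centre-point distance over the enclosing-box diagonal
    rho2 = ((p[0] + p[2] - t[0] - t[2]) ** 2 + (p[1] + p[3] - t[1] - t[3]) ** 2) / 4
    cx1, cy1 = min(p[0], t[0]), min(p[1], t[1])
    cx2, cy2 = max(p[2], t[2]), max(p[3], t[3])
    d2 = (cx2 - cx1) ** 2 + (cy2 - cy1) ** 2
    w, h = p[2] - p[0], p[3] - p[1]
    wgt, hgt = t[2] - t[0], t[3] - t[1]
    v = 4 / math.pi ** 2 * (math.atan(wgt / hgt) - math.atan(w / h)) ** 2  # formula (6)
    alpha = v / (1 - iou + v)
    return 1 - iou + rho2 / d2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (1, 1, 9, 9)))   # small loss for well-aligned boxes
```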
In this embodiment, after the relevant parameters are set, the network is trained on the fish school data set; the loss curve obtained during training falls quickly at first and finally converges. The trained fish school target detection model weights are saved; with the saved weight file, a test fish school image can then be input and fish targets detected: prediction boxes are generated where targets may exist, a prediction confidence score is given for each, and the image is output with its prediction boxes and scores.
The above embodiment is a preferred embodiment of the present invention, but the present invention is not limited to it; any change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (8)

1. A fish school automatic detection method based on target shielding compensation is characterized by comprising the following steps:
S1, collecting fish school images in a pond environment with a camera carried on a multi-rotor unmanned aerial vehicle, and labeling and expanding the collected fish school images;
S2, inputting the fish school image into a dual-branch feature extraction network for multi-level feature extraction from shallow to deep, the dual-branch feature extraction network being formed by adding, to the backbone feature extraction network CSPDarknet53 of the YOLOv4 algorithm, a lightweight original-information feature extraction network parallel to CSPDarknet53; after multi-level feature extraction, five feature maps are obtained: the two-fold downsampled feature map F_A1, four-fold downsampled feature map F_A2, eight-fold downsampled feature map F_A3, sixteen-fold downsampled feature map F_A4 and thirty-two-fold downsampled feature map F_A5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
S3, using the improved semantic embedding branch to fuse the semantic information of the feature map F_A5 obtained in step S2 into the feature map F_A4, obtaining the feature map F_AM4 at 1/16 of the resolution of the input fish school image; and fusing the semantic information of the feature map F_A4 obtained in step S2 into the feature map F_A3, obtaining the feature map F_AM3 at 1/8 of the resolution of the input fish school image;
S4, fusing, by convolutional downsampling, the detail information of the four-fold downsampled feature map F_A2 obtained in step S2 into the eight-fold downsampled feature map F_AM3 obtained in step S3, obtaining the feature map F_AMC3 at 1/8 of the resolution of the input fish school image;
S5, feeding the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 into the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5; then, after convolution processing, using F_B3, F_B4 and F_B5 to predict fish targets, obtaining overlapping candidate boxes and their prediction confidence scores;
S6, processing the overlapping candidate boxes with the improved DIoU_NMS non-maximum suppression algorithm to obtain prediction boxes with prediction confidence scores, and drawing the prediction boxes on the corresponding picture as the fish school detection result.
2. The method according to claim 1, wherein in step S1, labelImg image labeling software is used to label the fish targets in each collected fish image one by one; after labeling, each image has an xml tag file containing the labeling information, and the collected fish images and their corresponding tag files construct an original data set; the original data set is then expanded with data enhancement, including vertical flipping, horizontal flipping, brightness change, random Gaussian white noise addition, filtering and affine transformation, to form a final data set.
3. The method of claim 1, wherein the backbone feature extraction network CSPDarknet53 comprises one CBM unit and five cross-stage partial network CSPx units; a CBM unit consists of a Convolution layer with stride 1 and a 3 x 3 kernel, a Batch Normalization layer and a Mish activation layer; a CSPx unit fuses several CBM units with x Res unit residual units, each Res unit consisting of a CBM unit with a 1 x 1 kernel, a CBM unit with a 3 x 3 kernel and a residual structure; the two feature maps are spliced along the channel dimension by a Concatenate fusion operation, which expands the channel dimension of the spliced feature map; the Convolution channel counts of the five CSPx units are 64, 128, 256, 512 and 1024 in turn, and each CSPx unit downsamples by a factor of two; the feature maps produced by the five CSPx units are F_C1, F_C2, F_C3, F_C4 and F_C5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
the lightweight original-information feature extraction network comprises five CM units, each consisting of a Convolution layer with stride 2 and a 3 x 3 kernel and a MaxPool layer with pooling stride 1 and a 3 x 3 pooling kernel; the stride-2 convolution performs one two-fold downsampling, and the number of convolution channels of each CM unit equals that of the corresponding CSPx unit in the backbone network CSPDarknet53; the feature maps produced by the five CM units are F_L1, F_L2, F_L3, F_L4 and F_L5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively;
then, during the shallow-to-deep multi-level feature extraction, the feature map F_Li extracted by the lightweight original-information network and the feature map F_Ci extracted by the CSPDarknet53 network (i = 1, 2, 3, 4, 5) undergo an Add fusion operation, i.e., the corresponding pixel values of the two feature maps are added, yielding the finally extracted feature maps F_A1, F_A2, F_A3, F_A4 and F_A5, with resolutions of 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image, respectively.
4. The method for automatically detecting fish school based on target occlusion compensation as claimed in claim 3, wherein the fusion process using the improved semantic embedding branch in step S3 is as follows:
first, the deep feature map F_A5 obtained in step S2 is processed by a convolution layer with a 1 x 1 kernel and a convolution layer with a 3 x 3 kernel to fuse features of different scales, passed through a Sigmoid function, upsampled two-fold by nearest-neighbor interpolation, and multiplied pixel-wise with the shallow feature map F_A4 obtained in step S2, yielding the feature map F_AM4 at 1/16 of the resolution of the input fish school image, so that the semantic information of the deep feature map F_A5 is fused into the shallow feature map F_A4;
then the improved semantic embedding branch is likewise used to fuse the semantic information of the feature map F_A4 obtained in step S2 into the shallow feature map F_A3, obtaining the feature map F_AM3 at 1/8 of the resolution of the input fish school image;
the Sigmoid function used in the improved semantic embedding branch has the form

Sigmoid(i) = 1 / (1 + e^(-i))

where i is the input and e is the natural constant.
5. The method for automatically detecting fish school based on target occlusion compensation as claimed in claim 3, wherein said step S4 is as follows:
first, the four-fold downsampled feature map F_A2 obtained in step S2 is processed by a CBL unit, where a CBL unit consists of a Convolution layer with stride 1 and a 3 x 3 kernel, a Batch Normalization layer and a LeakyReLU activation layer; it is then downsampled two-fold by a Convolution layer with stride 2 and a 3 x 3 kernel, and fused by a Concatenate operation with the feature map F_AM3 obtained in step S3 after CBL processing, obtaining the feature map F_AMC3 at 1/8 of the resolution of the input fish school image.
6. The method for automatically detecting fish school based on target occlusion compensation as claimed in claim 5, wherein said step S5 is as follows:
first, the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 are fed into the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5, wherein the feature pyramid structure of the YOLOv4 algorithm comprises a spatial pyramid pooling layer and a path aggregation network: in the spatial pyramid pooling structure, the feature map F_A5, after three CBL units, passes through four maximum pooling layers with 1 x 1, 5 x 5, 9 x 9 and 13 x 13 pooling kernels whose outputs undergo a Concatenate fusion operation, and the path aggregation network repeatedly fuses features along bottom-up and top-down paths; the three feature maps F_B3, F_B4 and F_B5 are then each processed by a CBL unit and a convolution layer with a 1 x 1 kernel, obtaining three prediction feature maps of different sizes, Prediction1, Prediction2 and Prediction3, with resolutions of 1/8, 1/16 and 1/32 of the input fish school image; these three prediction feature maps are used to predict fish targets, obtaining overlapping candidate boxes and their prediction confidence scores.
7. The method for automatically detecting fish school based on target occlusion compensation as claimed in claim 1, wherein the process of step S6 is as follows:
S601, traversing all candidate boxes in an image, judging the prediction confidence score of each in turn, keeping the candidate boxes whose scores exceed the confidence threshold together with their scores, and deleting the candidate boxes whose scores fall below it;
S602, selecting the candidate box M with the highest prediction confidence score among the remaining candidate boxes, then traversing the other candidate boxes B_i in turn and computing their distance intersection-over-union Distance-IoU (DIoU for short) with M; if the DIoU between some candidate box B_i and M is not lower than a given threshold ε, the two boxes are considered highly overlapped; candidate box B_i is not deleted directly, but its prediction confidence score is reduced, and candidate box M is then moved into the final prediction box set G; the score-reduction criterion is:
S'_i = S_i, if DIoU(M, B_i) < ε; S'_i = S_i * (1 - DIoU(M, B_i)), if DIoU(M, B_i) >= ε

where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, DIoU(M, B_i) is the distance intersection-over-union of M and B_i, ε is the given DIoU threshold, S_i is the prediction confidence score of candidate box B_i, and S'_i is its reduced prediction confidence score;
and S603, repeatedly executing the step S602 until all the candidate frames are processed, and drawing the final prediction frame set G on the corresponding picture as an output result to obtain a fish school detection result.
8. The method as claimed in claim 7, wherein the DIoU in step S602 adds to the intersection-over-union IoU a penalty factor that accounts for the distance between the center points of the two candidate boxes; DIoU(M, B_i) is calculated as follows:
DIoU(M, B_i) = IoU(M, B_i) - ρ²(M, B_i) / c²

where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, and IoU(M, B_i) is the ratio of the intersection to the union of M and B_i.
CN202110354428.6A 2021-04-01 2021-04-01 Fish shoal automatic detection method based on target shielding compensation Active CN113076871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110354428.6A CN113076871B (en) 2021-04-01 2021-04-01 Fish shoal automatic detection method based on target shielding compensation

Publications (2)

Publication Number Publication Date
CN113076871A (en) 2021-07-06
CN113076871B (en) 2022-10-21

Family

ID=76614401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110354428.6A Active CN113076871B (en) 2021-04-01 2021-04-01 Fish shoal automatic detection method based on target shielding compensation

Country Status (1)

Country Link
CN (1) CN113076871B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101566691A (en) * 2009-05-11 2009-10-28 华南理工大学 Method and system for tracking and positioning underwater target
CN111310622A (en) * 2020-02-05 2020-06-19 西北工业大学 Fish swarm target identification method for intelligent operation of underwater robot
CN111652118A (en) * 2020-05-29 2020-09-11 大连海事大学 Marine product autonomous grabbing guiding method based on underwater target neighbor distribution
CN111738139A (en) * 2020-06-19 2020-10-02 中国水产科学研究院渔业机械仪器研究所 Cultured fish monitoring method and system based on image recognition
CN112084866A (en) * 2020-08-07 2020-12-15 浙江工业大学 Target detection method based on improved YOLO v4 algorithm
CN112001339A (en) * 2020-08-27 2020-11-27 杭州电子科技大学 Pedestrian social distance real-time monitoring method based on YOLO v4
CN112308040A (en) * 2020-11-26 2021-02-02 山东捷讯通信技术有限公司 River sewage outlet detection method and system based on high-definition images
CN112465803A (en) * 2020-12-11 2021-03-09 桂林慧谷人工智能产业技术研究院 Underwater sea cucumber detection method combining image enhancement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TAO LIU: "Multi-class fish stock statistics technology based on object classification and tracking algorithm", Ecological Informatics 63 (2021) 101240 *
LI QINGZHONG et al.: "Real-time detection of underwater fish targets based on improved YOLO and transfer learning", Pattern Recognition and Artificial Intelligence *
SHEN JUNYU: "Research on fish school detection methods based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435396B (en) * 2021-07-13 2022-05-20 大连海洋大学 Underwater fish school detection method based on image self-adaptive noise resistance
CN113435396A (en) * 2021-07-13 2021-09-24 大连海洋大学 Underwater fish school detection method based on image self-adaptive noise resistance
CN113887608A (en) * 2021-09-28 2022-01-04 北京三快在线科技有限公司 Model training method, image detection method and device
CN113610070A (en) * 2021-10-11 2021-11-05 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide disaster identification method based on multi-source data fusion
CN114387510A (en) * 2021-12-22 2022-04-22 广东工业大学 Bird identification method and device for power transmission line and storage medium
CN114419364A (en) * 2021-12-24 2022-04-29 华南农业大学 Intelligent fish sorting method and system based on deep feature fusion
CN114419568A (en) * 2022-01-18 2022-04-29 东北大学 Multi-view pedestrian detection method based on feature fusion
CN114898105A (en) * 2022-03-04 2022-08-12 武汉理工大学 Infrared target detection method under complex scene
CN114898105B (en) * 2022-03-04 2024-04-19 武汉理工大学 Infrared target detection method under complex scene
US11790640B1 (en) 2022-06-22 2023-10-17 Ludong University Method for detecting densely occluded fish based on YOLOv5 network
CN114863263A (en) * 2022-07-07 2022-08-05 鲁东大学 Snakehead detection method for intra-class shielding based on cross-scale hierarchical feature fusion
CN114863263B (en) * 2022-07-07 2022-09-13 鲁东大学 Snakehead fish detection method for blocking in class based on cross-scale hierarchical feature fusion
US11694428B1 (en) 2023-07-04 Method for detecting Ophiocephalus argus cantor under intra-class occlusion based on cross-scale layered feature fusion
CN117409368A (en) * 2023-10-31 2024-01-16 大连海洋大学 Real-time analysis method for shoal gathering behavior and shoal starvation behavior based on density distribution

Also Published As

Publication number Publication date
CN113076871B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN112418330A (en) Improved SSD (solid State drive) -based high-precision detection method for small target object
CN111311611B (en) Real-time three-dimensional large-scene multi-object instance segmentation method
CN115331087A (en) Remote sensing image change detection method and system fusing regional semantics and pixel characteristics
WO2021077947A1 (en) Image processing method, apparatus and device, and storage medium
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN113537085A (en) Ship target detection method based on two-time transfer learning and data augmentation
CN114565675A (en) Method for removing dynamic feature points at front end of visual SLAM
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN116645592A (en) Crack detection method based on image processing and storage medium
CN115527050A (en) Image feature matching method, computer device and readable storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114463624A (en) Method and device for detecting illegal buildings applied to city management supervision
CN113887649A (en) Target detection method based on fusion of deep-layer features and shallow-layer features
CN113177956A (en) Semantic segmentation method for unmanned aerial vehicle remote sensing image
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
Xie et al. Pedestrian detection and location algorithm based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant