CN113076871B - Fish shoal automatic detection method based on target shielding compensation - Google Patents


Info

Publication number
CN113076871B
Authority
CN
China
Prior art keywords
feature
fish
feature map
image
characteristic diagram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110354428.6A
Other languages
Chinese (zh)
Other versions
CN113076871A (en)
Inventor
丁泉龙
杨伟健
曹燕
王一歌
韦岗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology (SCUT)
Priority to CN202110354428.6A
Publication of CN113076871A
Application granted
Publication of CN113076871B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 - Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80 - Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81 - Aquaculture, e.g. of fish

Abstract

The invention discloses an automatic fish school detection method based on target occlusion compensation, which comprises the following steps: fish school images are collected with a camera carried on a multi-rotor unmanned aerial vehicle, then labeled and expanded; feature extraction is performed with a dual-branch feature extraction network that applies multistage, shallow-to-deep extraction to the input fish school image, obtaining five feature maps; feature fusion is performed with an improved semantic embedding branch, which fuses the semantic information of each deep feature map into the shallow feature map one level above it, and the detail information of the four-fold down-sampled feature map is fused into the eight-fold down-sampled feature map; fish targets are predicted from three feature maps to obtain candidate boxes, the repeated candidate boxes are processed with an improved DIoU_NMS non-maximum suppression algorithm, and the fish school detection result is output. The invention improves the recall rate of fish school detection when fish aggregate and occlude one another, and thereby improves the average precision of fish school detection.

Description

Fish shoal automatic detection method based on target shielding compensation
Technical Field
The invention relates to the technical field of image target detection, and in particular to an automatic fish school detection method based on target occlusion compensation.
Background
Modern fish farming depends on systematic management, and fish school detection is of great practical significance for the industrialization of aquaculture: it can determine whether fish are present and how large they are, and thus help evaluate whether stocking and feeding are appropriate.
Fish school detection can use a sonar imaging method or an optical imaging method. The sonar imaging method exploits the ultrasonic principle: sonar images of the fish school are collected with a sonar system, and fish targets are then detected from the sonar images; in an actual underwater scene, however, the sonar imaging method is easily disturbed by other objects. With the development and improvement of underwater photography, optical imaging methods are now available. With an optical imaging method, an optical image of the fish school is acquired first, and the fish are then detected and marked by a target detection method. Target detection is a branch of image processing whose task is to find all objects of a specified category in a picture and mark their exact positions in the image with rectangular boxes. Manually marking fish schools is expensive and inefficient, so, to promote the automation and informatization of the fish farming industry, it is important to study automatic fish school detection methods aimed at the actual underwater environment of a farm.
With the continuous development of computer technology, using deep learning to automatically detect fish in underwater optical images can reduce the time spent searching for and marking fish, saving the time workers spend on this task and improving working efficiency.
The YOLOv4 target detection algorithm is a deep learning algorithm that balances detection speed and detection accuracy and is widely applied in the field of image target detection. The YOLOv4 algorithm first feeds a data set into the YOLOv4 network for training and saves the trained network model weight file; a test image is then input with the saved weight file, prediction boxes in which targets may exist are generated in the test image, and a confidence score is given for the target in each prediction box. The algorithm performs well in both detection speed and detection accuracy, is suitable for automatic fish school detection, and can produce a detection result quickly once a fish school image has been captured.
However, when fish school image data are actually captured underwater, the underwater scene is complex and the collected fish school images contain mutual occlusion caused by fish aggregation. If the YOLOv4 algorithm is used directly to detect fish targets, occluded targets are detected poorly, missed detections occur, and the recall rate of fish targets is relatively low. An underwater fish detection method with a high recall rate is therefore desirable.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and provides an automatic fish school detection method based on target occlusion compensation.
The purpose of the invention can be achieved by adopting the following technical scheme:
a fish school automatic detection method based on target shielding compensation comprises the following steps:
s1, collecting a fish school image in a pond environment through a multi-rotor unmanned spacecraft carrying a camera, and marking and data expansion are carried out on the collected fish school image;
the underwater fish shoal image can be acquired by flying the multi-rotor unmanned airship to the sky of an interested water area and landing the unmanned airship to the water surface, and then acquiring optical image data of the cultured fish shoal by using a camera carried on the unmanned airship.
S2, inputting the fish image into a double-branch feature extraction network to perform multistage feature extraction from shallow to deep, wherein the double-branch feature extraction network is a light-weight original information feature extraction network parallel to CSPDarknet53 on the basis of a main feature extraction network CSPDarknet53 of a YOLOv4 algorithm, and is called as the double-branch feature extraction network; after multi-stage feature extraction is carried out by a double-branch feature extraction network, five feature maps are obtained, and the five feature maps are respectively a two-time down-sampling feature map F A1 Fourfold down-sampling feature map F A2 Eight-time down-sampling feature map F A3 Sixteen-fold down-sampling feature map F A4 Thirty-two times downsampling feature map F A5 The resolution is 1/2, 1/4, and the like of the input fish image,1/8、1/16、1/32;
S3, using an improved Semantic Embedding Branch (MSEB) to obtain the characteristic diagram F obtained in the step S2 A5 Fusing semantic information of to the feature map F A4 In (1), obtaining a characteristic diagram F AM4 Feature map F AM4 The resolution of (1/16) of the input fish shoal image; the characteristic diagram F obtained in the step S2 A4 Fusing semantic information of to the feature map F A3 In (1), obtaining a characteristic diagram F AM3 Feature map F AM3 The resolution of (2) is 1/8 of the input fish school image;
s4, carrying out convolution downsampling on the feature map F obtained in the step S2 in a quadruple downsampling mode A2 The detail information of (2) is fused with the eight-fold down-sampling feature map F obtained in the step S3 AM3 In (1), obtaining a characteristic diagram F AMC3 Feature map F AMC3 The resolution of (1/8) of the input fish shoal image;
s5, obtaining the characteristic diagram F obtained in the step S2 A5 And S3, obtaining a characteristic diagram F AM4 And the characteristic diagram F obtained in step S4 AMC3 After feature fusion is carried out on the feature pyramid structure of the YOLOv4 algorithm, three feature graphs are obtained, wherein the three feature graphs are F B3 、F B4 And F B5 Then using the feature map F B3 、F B4 And F B5 Predicting the fish target after convolution processing to obtain repeated candidate frames and corresponding prediction confidence scores;
and S6, processing the repeated candidate frames by adopting a non-maximum suppression algorithm of the improved DIoU _ NMS to obtain a prediction frame result containing the prediction confidence score, and drawing the prediction frame result on a corresponding picture to serve as a fish shoal detection result.
Further, in step S1, the fish targets in each collected fish school image are labeled one by one with the labelImg image annotation software; each labeled image generates an xml label file containing the annotation information, and the collected fish school images and their corresponding label files constitute the original data set; the original data set is then expanded by data augmentation, including vertical flipping, horizontal flipping, brightness change, addition of random Gaussian white noise, filtering and affine transformation, to form the final data set and improve the robustness of the network model to environmental changes.
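The following is a minimal sketch of this data-expansion step; the patent names only the transform types, so the library (OpenCV/NumPy) and the concrete parameters (brightness offset, noise level, rotation angle) are assumptions, and the remapping of the xml bounding-box labels for the geometric transforms is omitted.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> list:
    """Return augmented copies of one fish school image (box labels must be remapped separately)."""
    out = []
    out.append(cv2.flip(image, 0))                                   # vertical flip
    out.append(cv2.flip(image, 1))                                   # horizontal flip
    out.append(cv2.convertScaleAbs(image, alpha=1.0, beta=30))       # brightness change (assumed offset)
    noise = np.random.normal(0, 10, image.shape).astype(np.float32)  # random Gaussian white noise
    out.append(np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8))
    out.append(cv2.GaussianBlur(image, (3, 3), 0))                   # filtering
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), 5, 1.0)              # small affine transform (assumed angle)
    out.append(cv2.warpAffine(image, m, (w, h)))
    return out
```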
Further, in step S2, the fish school image is input into the dual-branch feature extraction network for multistage, shallow-to-deep feature extraction, so as to extract and retain the original features of the input image more fully and to compensate for the lack of fish features when the fish school is occluded. The specific process of feature extraction by the dual-branch feature extraction network is as follows:
the trunk feature extraction network CSPDarknet53 comprises a CBM unit and five cross-phase local network CSPx units; the CBM unit consists of a Convolution layer conversion with the step length of 1 and a Convolution kernel of 3*3, a Batch Normalization layer Batch Normalization and a Mish activation function layer; the CSPx unit is formed by fusing a plurality of CBM units and x Res unit residual error units (Consatenate), each Res unit residual error unit is composed of a CBM unit with a convolution kernel of 1*1, a CBM unit with a convolution kernel of 3*3 and a residual error structure, the two feature maps are spliced on a channel through the Consatenate fusion operation, and the dimension of the feature map obtained after splicing is expanded; the number of channels of the convolutional layer conversion of the five CSPx units is 64, 128, 256, 512 and 1024 in sequence, and each CSPx unit is subjected to twice downsampling; the characteristic graphs obtained by five CSPx units are respectively F C1 、F C2 、F C3 、F C4 、F C5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively;
the lightweight original information feature extraction network comprises five CM units, wherein each CM unit consists of a convolutional layer Convolume with the step length of 2 and a convolutional kernel of 3*3 and a maximum pooling layer MaxPool with the pooling step length of 1 and a pooling kernel of 3*3, each convolutional layer with the step length of 2 is subjected to twice downsampling, and the number of convolutional layer channels of each CM unit is the same as that of corresponding cross-stage local network CSPx units in a main feature extraction network CSPDarknet 53; the characteristic diagram obtained by five CM units is F L1 、F L2 、F L3 、F L4 、F L5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively;
then, in the multistage feature extraction process from shallow to deep, a feature graph F extracted by a lightweight original information feature extraction network is used Li Feature map F extracted from corresponding CSPDarknet53 network Ci Performing Add fusion operation, i =1,2,3,4,5, add fusion operation adding corresponding pixel values of the two feature maps to obtain a final extracted feature map F A1 、F A2 、F A3 、F A4 、F A5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively.
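A minimal PyTorch sketch of one CM unit and the Add fusion of the two branches is given below; the padding values and the random tensor standing in for the CSPx output F_C1 are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn as nn

class CMUnit(nn.Module):
    """Lightweight original-information unit: stride-2 3x3 convolution followed by a stride-1 3x3 max-pool."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(x))          # one two-fold down-sampling per CM unit

def add_fuse(f_c: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
    """Add fusion: element-wise sum of the backbone map F_Ci and the lightweight map F_Li."""
    return f_c + f_l

# Usage sketch at the first stage (64 channels, 416x416 input):
cm1 = CMUnit(3, 64)
f_l1 = cm1(torch.randn(1, 3, 416, 416))         # F_L1: 1x64x208x208
f_c1 = torch.randn(1, 64, 208, 208)             # stand-in for F_C1 from the first CSPx stage
f_a1 = add_fuse(f_c1, f_l1)                     # F_A1, at 1/2 of the input resolution
```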
Furthermore, in the multistage, shallow-to-deep feature extraction process, the shallow feature maps carry rich detail information but lack semantic information, so targets cannot be recognized and detected well from them; conversely, semantic information is extracted well in the deep layers, but much detail information is lost there, so position information cannot be predicted effectively. Therefore, in step S3, the improved semantic embedding branch is used to fuse the semantic information of a deep feature map into the shallow feature map one level above it, compensating for the lack of semantic information in the shallow feature map and thereby improving the recall rate of fish targets during detection. The fusion with the improved semantic embedding branch proceeds as follows:
first, the deep feature map F_A5 obtained in step S2 is passed through a convolution layer with a 1×1 kernel and a convolution layer with a 3×3 kernel, and the two resulting feature maps of different scales are combined by a Concatenate fusion operation; after a Sigmoid function, two-fold up-sampling is performed by nearest-neighbour interpolation, and the result is multiplied pixel-wise with the shallow feature map F_A4 obtained in step S2 to obtain the feature map F_AM4, whose resolution is 1/16 of the input fish school image; in this way the semantic information of the deep feature map F_A5 is fused into the shallow feature map F_A4, compensating for the lack of semantic information in the shallow feature map F_A4;
then, the semantic information of the deep feature map F_A4 obtained in step S2 is likewise fused, using the improved semantic embedding branch, into the shallow feature map F_A3 to obtain the feature map F_AM3, whose resolution is 1/8 of the input fish school image, compensating for the lack of semantic information in the shallow feature map F_A3;
the Sigmoid functional form used in the improved semantic embedding branch is as follows:
Figure BDA0003003154500000051
where i is the input and e is the natural constant.
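A minimal PyTorch sketch of the improved semantic embedding branch is given below; the output widths of the 1×1 and 3×3 convolutions are assumed to be chosen so that the concatenated map matches the channel count of the shallow feature map it modulates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSEB(nn.Module):
    """Improved semantic embedding branch: deep semantics modulate the shallow map one level above."""
    def __init__(self, deep_ch: int, shallow_ch: int):
        super().__init__()
        half = shallow_ch // 2
        self.conv1 = nn.Conv2d(deep_ch, half, kernel_size=1)
        self.conv3 = nn.Conv2d(deep_ch, half, kernel_size=3, padding=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # Concatenate-fuse the two scales of the deep map and squash to (0, 1) with the Sigmoid
        w = torch.sigmoid(torch.cat([self.conv1(deep), self.conv3(deep)], dim=1))
        # Two-fold nearest-neighbour up-sampling, then pixel-wise multiplication with the shallow map
        w = F.interpolate(w, scale_factor=2, mode="nearest")
        return shallow * w

# Usage sketch: F_AM4 = MSEB(1024, 512)(F_A5, F_A4); F_AM3 = MSEB(512, 256)(F_A4, F_A3)
```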
Further, in step S4, the detail information of the four-fold down-sampled feature map is fused, by means of convolution down-sampling, into the eight-fold down-sampled feature map, so that the detail information of the four-fold down-sampled feature map is fully exploited and the localization of fish edge contours under occlusion is compensated. The fusion process is as follows:
first, the four-fold down-sampled feature map F_A2 obtained in step S2 is processed by a CBL unit, where the CBL unit consists of a convolution layer with stride 1 and a 3×3 kernel, a batch normalization layer and a LeakyReLU activation layer; two-fold down-sampling is then performed with a convolution layer with stride 2 and a 3×3 kernel, and the result is combined, by a Concatenate fusion operation, with the feature map F_AM3 obtained in step S3 after it has been processed by a CBL unit, giving the feature map F_AMC3, whose resolution is 1/8 of the input fish school image; in this way the detail information of the four-fold down-sampled feature map F_A2 is fully utilized.
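A minimal PyTorch sketch of this detail-fusion step is given below; the channel widths of F_A2 and F_AM3 are assumptions consistent with the backbone widths stated above.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Convolution (3x3, stride 1) + batch normalization + LeakyReLU."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

class DetailFusion(nn.Module):
    """Bring the 1/4-scale detail map F_A2 to 1/8 scale and concatenate it with F_AM3."""
    def __init__(self, ch_a2: int = 128, ch_am3: int = 256):
        super().__init__()
        self.cbl_a2 = CBL(ch_a2, ch_a2)
        self.down = nn.Conv2d(ch_a2, ch_a2, kernel_size=3, stride=2, padding=1)   # two-fold down-sampling
        self.cbl_am3 = CBL(ch_am3, ch_am3)

    def forward(self, f_a2: torch.Tensor, f_am3: torch.Tensor) -> torch.Tensor:
        detail = self.down(self.cbl_a2(f_a2))
        return torch.cat([detail, self.cbl_am3(f_am3)], dim=1)   # Concatenate fusion -> F_AMC3
```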
Further, the process of step S5 is as follows:
first, the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 are passed through the feature pyramid structure of the YOLOv4 algorithm for feature fusion, giving three feature maps F_B3, F_B4 and F_B5. The feature pyramid structure of the YOLOv4 algorithm comprises a spatial pyramid pooling layer (SPP) and a path aggregation network (PANet): in the SPP structure, the feature map F_A5 is processed by three CBL units and then by four max-pooling layers with pooling kernels of 1×1, 5×5, 9×9 and 13×13, whose outputs are combined by a Concatenate fusion operation; the PANet structure repeatedly fuses the features along bottom-up and top-down paths. Then, the three feature maps F_B3, F_B4 and F_B5 are each processed by a CBL unit and a convolution layer with a 1×1 kernel to obtain three prediction feature maps of different sizes, Prediction1, Prediction2 and Prediction3, whose resolutions are 1/8, 1/16 and 1/32 of the input fish school image respectively; fish targets are then predicted with the three prediction feature maps, giving repeated candidate boxes and the corresponding prediction confidence scores.
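The SPP part of the feature pyramid can be sketched as follows; the pooling kernel sizes are taken from the description, while the paddings are assumptions made so that the four pooled maps keep the same spatial size and can be concatenated, and the PANet paths follow the standard YOLOv4 layout and are not repeated here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling: parallel stride-1 max-pools of different kernel sizes, Concatenate-fused."""
    def __init__(self, pool_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Stride-1 pooling keeps the spatial size, so the four branches can be concatenated on channels
        return torch.cat([pool(x) for pool in self.pools], dim=1)
```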
Further, in step S6, the improved DIoU_NMS non-maximum suppression algorithm is used to process the repeated candidate boxes, compensating for the missed detection of occluded targets and further improving the recall rate of occluded fish. The specific processing procedure is as follows:
S601, traverse all candidate boxes in the image, judge the prediction confidence score of each candidate box in turn, keep the candidate boxes whose scores are greater than the confidence threshold together with their scores, and delete the candidate boxes whose scores are below the confidence threshold;
S602, select the candidate box M with the highest prediction confidence score among the remaining candidate boxes, traverse the other candidate boxes B_i in turn and compute their distance intersection-over-union Distance-IoU (abbreviated DIoU) with the candidate box M; if the DIoU between a candidate box B_i and the candidate box M is not lower than a given threshold ε, the two boxes are considered to overlap heavily, and directly deleting B_i, as the original DIoU_NMS algorithm does, easily causes missed detections when fish aggregate and occlude one another; the improved DIoU_NMS algorithm therefore reduces the prediction confidence score of the candidate box B_i instead, and then moves the candidate box M into the final prediction box set G; the prediction confidence score is reduced according to the following criterion:
S'_i = S_i, if DIoU(M, B_i) < ε
S'_i = S_i · (1 − DIoU(M, B_i)), if DIoU(M, B_i) ≥ ε
where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the centre points of M and B_i, c is the diagonal length of the smallest enclosing rectangle containing M and B_i, DIoU(M, B_i) is the distance intersection-over-union of M and B_i, ε is the given DIoU threshold, S_i is the prediction confidence score of the candidate box B_i, and S'_i is the reduced prediction confidence score of the candidate box B_i;
and S603, repeat step S602 until all candidate boxes have been processed, and draw the final prediction box set G on the corresponding image as the output, giving the fish school detection result.
Further, in step S602, DIoU adds a penalty term to the intersection-over-union IoU; the penalty term accounts for the distance between the centre points of the two candidate boxes, and DIoU(M, B_i) is calculated as follows:
DIoU(M, B_i) = IoU(M, B_i) − ρ²(M, B_i) / c²
where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the centre points of M and B_i, c is the diagonal length of the smallest enclosing rectangle containing M and B_i, and IoU(M, B_i) is the ratio of the intersection to the union of M and B_i.
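A minimal NumPy sketch of steps S601 to S603 is given below; the box format [x1, y1, x2, y2], the default thresholds and the final score floor keep_thr are assumptions, and the score decay follows the piecewise rule given above.

```python
import numpy as np

def diou(m: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Distance-IoU between one box m and an array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(m[0], boxes[:, 0]); y1 = np.maximum(m[1], boxes[:, 1])
    x2 = np.minimum(m[2], boxes[:, 2]); y2 = np.minimum(m[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_m = (m[2] - m[0]) * (m[3] - m[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area_m + area_b - inter + 1e-9)
    centre_m = (m[:2] + m[2:]) / 2.0                      # centre of M
    centre_b = (boxes[:, :2] + boxes[:, 2:]) / 2.0        # centres of the B_i
    rho2 = ((centre_m - centre_b) ** 2).sum(axis=1)       # squared centre distance rho^2
    ex1 = np.minimum(m[0], boxes[:, 0]); ey1 = np.minimum(m[1], boxes[:, 1])
    ex2 = np.maximum(m[2], boxes[:, 2]); ey2 = np.maximum(m[3], boxes[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9       # squared diagonal c^2 of the enclosing box
    return iou - rho2 / c2

def improved_diou_nms(boxes, scores, conf_thr=0.25, eps=0.5, keep_thr=0.05):
    """Steps S601-S603: confidence filtering, then score decay instead of deletion for high-DIoU boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores > conf_thr                               # S601
    boxes, scores = boxes[keep], scores[keep]
    final_boxes, final_scores = [], []
    while scores.size > 0:
        i = int(np.argmax(scores))                         # S602: candidate M with the highest score
        final_boxes.append(boxes[i]); final_scores.append(scores[i])
        m = boxes[i]
        boxes = np.delete(boxes, i, axis=0); scores = np.delete(scores, i)
        if scores.size == 0:
            break
        d = diou(m, boxes)
        scores = np.where(d >= eps, scores * (1.0 - d), scores)   # reduce, do not delete
        live = scores > keep_thr                           # assumed floor: drop fully decayed boxes
        boxes, scores = boxes[live], scores[live]
    return np.array(final_boxes), np.array(final_scores)   # S603: the final prediction box set G
```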
Compared with the prior art, the invention has the following advantages and effects:
(1) In the image feature extraction process, the dual-branch feature extraction network is used to extract features of the input fish school image, compensating for the lack of fish features under occlusion so that the original features of the fish are extracted more fully.
(2) The invention adopts the improved semantic embedding branch MSEB to fuse the semantic information of a deep feature map into the feature map one level above it, making up for the lack of semantic information in the upper shallow feature map and thereby improving the recall rate of fish targets.
(3) The detail information of the four-fold down-sampled feature map is fused into the eight-fold down-sampled feature map, so that the edge contour information of the fish is fully captured and the fish edge contours can be located more accurately when the fish school is occluded.
(4) The improved DIoU_NMS non-maximum suppression algorithm is adopted to process the repeated candidate boxes, compensating for the missed detection of occluded targets; deletion of repeated candidate boxes is balanced against missed detection of true boxes, further improving the recall rate of occluded fish.
Drawings
FIG. 1 is a flow chart of a fish shoal automatic detection method based on target occlusion compensation disclosed by the invention;
fig. 2 is a network structure diagram of a fish school automatic detection method based on target occlusion compensation in an embodiment of the present invention, where Concat represents Concatenate fusion operation;
fig. 3 is a block diagram of an improved semantic embedding branch MSEB in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Examples
This embodiment provides an automatic fish school detection method based on target occlusion compensation, following the flow chart of fig. 1 and the network structure of fig. 2, to realize automatic detection of underwater fish school targets. The specific flow is as follows:
s1, flying the unmanned airship to the sky of a water area of interest by using a multi-rotor wing and landing the unmanned airship to the water surface, then shooting image data of cultured fish schools by using a camera carried on the unmanned airship, enabling a camera to face the front, setting the interval time of shooting the images to be 5 seconds, enabling the original resolution of the shot images to be 1920 x 1080, and then marking and expanding the collected fish school images to obtain a data set for training;
s2, inputting the fish image into a double-branch feature extraction network for multistage feature extraction from shallow to deep, wherein the double-branch feature extraction network is specifically characterized in that a lightweight original information feature extraction network parallel to a trunk feature extraction network CSPDarknet53 is added on the basis of the trunk feature extraction network CSPDarknet53 of a YOLOv4 algorithm, so that the double-branch feature extraction network is called; after multi-stage feature extraction is carried out by a double-branch feature extraction network, five feature maps are obtained, and the five feature maps are respectively a two-time down-sampling feature map F A1 Fourfold downsampling feature map F A2 Eight-fold down-sampling feature map F A3 Sixteen-fold down-sampling feature map F A4 Thirty-two times downsampling feature map F A5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively;
s3, using an improved Semantic Embedding Branch (MSEB) to obtain the characteristic diagram F obtained in the step S2 A5 Fusing semantic information to feature maps F A4 In (1), obtaining a characteristic diagram F AM4 The resolution of the input fish school image is 1/16 of the input fish school image; the characteristic diagram F obtained in the step S2 A4 Fusing semantic information of to the feature map F A3 In (1), obtaining a characteristic diagram F AM3 The resolution is 1/8 of the input fish school image;
s4, carrying out convolution downsampling on the feature map F obtained in the step S2 in a quadruple downsampling mode A2 The detail information of (2) is fused with the eight-fold down-sampling feature map F obtained in the step S3 AM3 In (1), a characteristic diagram F is obtained AMC3 The resolution is 1/8 of the input fish school image;
s5, obtaining the characteristic diagram F obtained in the step S2 A5 And step S3, obtaining a characteristic diagram F AM4 And the characteristic diagram F obtained in step S4 AMC3 After feature fusion is carried out on the feature pyramid structure of the YOLOv4 algorithm, three feature graphs are obtained, wherein the three feature graphs are F B3 、F B4 And F B5 Then makeUsing characteristic diagrams F B3 、F B4 And F B5 Predicting the fish target after convolution processing to obtain repeated candidate frames and corresponding prediction confidence scores;
and S6, processing the repeated candidate frames by adopting a non-maximum suppression algorithm of the improved DIoU _ NMS to obtain a prediction frame result containing the prediction confidence score, and drawing the prediction frame result on a corresponding picture to serve as a fish shoal detection result.
In this embodiment, step S1 uses the labelImg annotation software to manually label the fish bodies in the collected fish school images one by one with rectangular boxes, obtaining the corresponding xml label files that record the coordinates and category of each target in the images; the collected fish school images and their corresponding label files are then expanded by data augmentation, including vertical flipping, horizontal flipping, brightness change, addition of random Gaussian white noise, filtering and affine transformation, to form the final data set and improve the robustness of the network model to environmental changes.
In this embodiment, in step S2 the fish school image is input into the dual-branch feature extraction network for multistage, shallow-to-deep feature extraction. The specific structure of the dual-branch feature extraction network is given at 208 in fig. 2: a lightweight original-information feature extraction network is added in parallel with CSPDarknet53 on the basis of the backbone feature extraction network CSPDarknet53 of the YOLOv4 algorithm. The structure of the dual-branch feature extraction network is described as follows:
the trunk feature extraction network CSPDarknet53 comprises a CBM unit and five cross-phase local network CSPx units; the CBM unit is composed of a Convolution layer convention, a Batch Normalization layer Batch Normalization and a mesh activation function layer, wherein the step length is 1, the Convolution kernel is 3*3, and a structure of one CBM unit is given by 201 in fig. 2; the CSPx unit is formed by fusing a plurality of CBM units and x Res unit residual error units, wherein 204 in FIG. 2 shows the structure of a CSPx unit; a Res unit residual error unit in the CSPx unit is composed of a CBM unit with a convolution kernel of 1*1, a CBM unit with a convolution kernel of 3*3, and a residual error structure, and 203 in fig. 2 gives a structure of a Res unit residual error unit; the Concatenate fusion operation is performed on twoThe feature graphs are spliced on the channels, and the dimensions can be expanded; the number of convolution layer channels of the five CSPx units is 64, 128, 256, 512 and 1024 in sequence, and each CSPx unit is subjected to twice down sampling; the characteristic diagram obtained by five CSPx units is F C1 、F C2 、F C3 、F C4 、F C5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively;
the lightweight original information feature extraction network comprises five CM units, wherein each CM unit consists of a Convolution layer constraint with the step size of 2 and the Convolution kernel of 3*3 and a maximum pooling layer MaxPool with the pooling step size of 1 and the pooling kernel of 3*3, and the structure of one CM unit is given by 205 in FIG. 2; the convolution layer with the step length of 2 can be subjected to one-time double down sampling, and the number of convolution layer channels of each CM unit is the same as that of corresponding cross-phase local network CSPx units in the main feature extraction network CSPDarknet 53; the characteristic diagram obtained by five CM units is F L1 、F L2 、F L3 、F L4 、F L5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively;
then, in the multi-stage feature extraction process from shallow to deep, a feature graph F extracted by a light-weight original information feature extraction network is obtained Li (i =1,2,3,4,5) and feature map F extracted by corresponding CSPDarknet53 network Ci (i =1,2,3,4,5) performing Add fusion operation, which is to Add corresponding pixel values of the two feature maps to obtain a final extracted feature map F A1 、F A2 、F A3 、F A4 、F A5 The resolution is 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish image respectively.
In this embodiment, in step S3 the improved semantic embedding branch MSEB is used to fuse the semantic information of a deep feature map into the shallow feature map one level above it; fig. 3 shows the specific structure of the improved semantic embedding branch MSEB. The fusion with MSEB proceeds as follows: first, the deep feature map F_A5 obtained in step S2 is passed through a convolution layer with a 1×1 kernel and a convolution layer with a 3×3 kernel, and the two resulting feature maps of different scales are combined by a Concatenate fusion operation; after a Sigmoid function, two-fold up-sampling is performed by nearest-neighbour interpolation, and the result is multiplied pixel-wise with the shallow feature map F_A4 obtained in step S2 to obtain the feature map F_AM4, whose resolution is 1/16 of the input fish school image; in this way the semantic information of the deep feature map F_A5 is fused into the shallow feature map F_A4, compensating for the lack of semantic information in the shallow feature map F_A4;
then, the semantic information of the deep feature map F_A4 obtained in step S2 is likewise fused, using MSEB, into the shallow feature map F_A3 to obtain the feature map F_AM3, whose resolution is 1/8 of the input fish school image, compensating for the lack of semantic information in the shallow feature map F_A3;
the form of Sigmoid function used in the improved semantic embedding branch MSEB is as follows:
Sigmoid(i) = 1 / (1 + e^(-i))
where i is the input and e is the natural constant.
In this embodiment, the implementation process of step S4 is as follows:
First, the four-fold down-sampled feature map F_A2 obtained in step S2 is processed by a CBL unit, where the CBL unit consists of a convolution layer with stride 1 and a 3×3 kernel, a batch normalization layer and a LeakyReLU activation layer; 202 in fig. 2 shows the structure of a CBL unit. Two-fold down-sampling is then performed with a convolution layer with stride 2 and a 3×3 kernel, and the result is combined, by a Concatenate fusion operation, with the feature map F_AM3 obtained in step S3 after it has been processed by a CBL unit, giving the feature map F_AMC3, whose resolution is 1/8 of the input fish school image; in this way the detail information of the four-fold down-sampled feature map F_A2 is fully exploited, the edge contour information of the fish is fully obtained, and the localization of fish edge contours under occlusion is compensated.
In this embodiment, the implementation process of step S5 is as follows:
First, the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 are passed through the feature pyramid structure of the YOLOv4 algorithm for feature fusion, giving three feature maps F_B3, F_B4 and F_B5. The feature pyramid structure of the YOLOv4 algorithm comprises a spatial pyramid pooling layer (SPP) and a path aggregation network (PANet): in the SPP structure, the feature map F_A5 is processed by three CBL units and then by four max-pooling layers with pooling kernels of 1×1, 5×5, 9×9 and 13×13, whose outputs are combined by a Concatenate fusion operation, as shown at 206 in fig. 2; the PANet structure repeatedly fuses the features along bottom-up and top-down paths, as shown at 207 in fig. 2. Then, the three feature maps F_B3, F_B4 and F_B5 are each processed by a CBL unit and a convolution layer with a 1×1 kernel to obtain three prediction feature maps of different sizes, Prediction1, Prediction2 and Prediction3, whose resolutions are 1/8, 1/16 and 1/32 of the input fish school image respectively; fish targets are then predicted with the three prediction feature maps, giving repeated candidate boxes and the corresponding prediction confidence scores.
In this embodiment, the implementation process of step S6 is as follows:
S601, traverse all candidate boxes in the image, judge the prediction confidence score of each candidate box in turn, keep the candidate boxes whose scores are greater than the confidence threshold together with their scores, and delete the candidate boxes whose scores are below the confidence threshold;
S602, select the candidate box M with the highest prediction confidence score among the remaining candidate boxes, traverse the other candidate boxes B_i in turn and compute their distance intersection-over-union Distance-IoU (abbreviated DIoU) with the candidate box M; if the DIoU between a candidate box B_i and the candidate box M is not lower than the given threshold ε, the two boxes are considered to overlap heavily, and instead of directly deleting the candidate box B_i, its prediction confidence score is reduced; the candidate box M is then moved into the final prediction box set G; the prediction confidence score is reduced according to the following criterion:
S'_i = S_i, if DIoU(M, B_i) < ε
S'_i = S_i · (1 − DIoU(M, B_i)), if DIoU(M, B_i) ≥ ε
where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the centre points of M and B_i, c is the diagonal length of the smallest enclosing rectangle containing M and B_i, DIoU(M, B_i) is the distance intersection-over-union of M and B_i, ε is the given DIoU threshold, S_i is the prediction confidence score of the candidate box B_i, and S'_i is the reduced prediction confidence score of the candidate box B_i;
and S603, repeat step S602 until all candidate boxes have been processed, and draw the final prediction box set G on the corresponding image as the output, giving the fish school detection result.
In step S602, DIoU adds a penalty term to the intersection-over-union IoU; the penalty term accounts for the distance between the centre points of the two candidate boxes, and the specific calculation method is as follows:
DIoU(M, B_i) = IoU(M, B_i) − ρ²(M, B_i) / c²
where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the centre points of M and B_i, c is the diagonal length of the smallest enclosing rectangle containing M and B_i, and IoU(M, B_i) is the ratio of the intersection to the union of M and B_i.
In this embodiment, the prediction boxes need to be adjusted continuously during training so that they approach the real boxes of the targets to be detected; therefore, before training, 9 prior boxes of different sizes are obtained on the fish school image data set with a K-means clustering algorithm, so that the prior boxes fit the collected fish school image data set, and 3 prior boxes are assigned to each of the three prediction feature maps Prediction1, Prediction2 and Prediction3. The K-means clustering algorithm measures how close two boxes are using the intersection-over-union IoU, and the distance between two boxes is computed as:
distance(box, center) = 1 − IoU(box, center)    formula (4)
where box denotes the candidate box to be computed, center denotes the candidate box at the cluster centre, and IoU(box, center) is the intersection-over-union of the candidate box to be computed and the candidate box at the cluster centre.
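A minimal NumPy sketch of this IoU-based K-means clustering is given below; the boxes are assumed to have been reduced to (width, height) pairs extracted from the xml label files, and the iteration count and seed are assumptions.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between every (w, h) box and every (w, h) cluster centre, both anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centers[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centers[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + centers[None, :, 0] * centers[None, :, 1] - inter
    return inter / (union + 1e-9)

def kmeans_anchors(boxes: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster labeled box sizes into k prior boxes with distance = 1 - IoU (formula (4))."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centers)
        assign = dist.argmin(axis=1)
        new_centers = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]   # sorted: 3 priors per prediction scale
```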
In this embodiment, during training the initial learning rate is set to 0.0002, the number of training epochs is set to 45, 8 images are randomly selected for each training step, an Adam optimizer is used to accelerate convergence of the network model, and, to reduce GPU memory overhead, the resolution of each training image is adjusted to 416 × 416.
In this embodiment, the loss function loss consists of three parts: the regression box prediction error L_loc, the confidence error L_conf and the classification error L_cls. The specific calculation formula is as follows:

loss = L_loc + L_conf + L_cls, with
L_loc = Σ_i Σ_j I_ij^obj · [1 − IoU(P, T) + ρ²(P_ctr, T_ctr)/d² + α·v]
L_conf = − Σ_i Σ_j I_ij^obj · [Ĉ_i^j·log(C_i^j) + (1 − Ĉ_i^j)·log(1 − C_i^j)] − λ_noobj · Σ_i Σ_j I_ij^noobj · [Ĉ_i^j·log(C_i^j) + (1 − Ĉ_i^j)·log(1 − C_i^j)]
L_cls = − Σ_i I_i^obj · Σ_{c∈classes} [p̂_i(c)·log(p_i(c)) + (1 − p̂_i(c))·log(1 − p_i(c))]    formula (5)

v = (4/π²) · (arctan(w_gt/h_gt) − arctan(w/h))²    formula (6)

The value of v in the above formula (5) is computed by formula (6), and α = v / ((1 − IoU(P, T)) + v); IoU(P, T) is the intersection-over-union of the prediction box and the real box, ρ(P_ctr, T_ctr) is the distance between the centre points of the prediction box and the real box, d is the diagonal length of the smallest enclosing rectangle containing the prediction box and the real box, w_gt and h_gt are respectively the width and height of the real box, w and h are respectively the width and height of the prediction box, the image is divided into S × S grid cells, M is the number of prior boxes (anchors) generated by each grid cell, and the sums over i and j run over the S × S grid cells and the M prior boxes of each cell; I_ij^obj indicates that the prediction box contains an object to be detected, I_ij^noobj indicates that the prediction box does not contain an object to be detected, C_i^j is the prediction confidence of the corresponding prior box, Ĉ_i^j is the actual confidence, λ_noobj is a set weight coefficient, c is the category to which the object to be detected belongs, p̂_i(c) is the actual probability that the target in the corresponding grid cell belongs to category c, and p_i(c) is the predicted probability that the target in the corresponding grid cell belongs to category c.
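A minimal PyTorch sketch of the CIoU regression term of formulas (5) and (6) for one prediction/real box pair is given below; the [x1, y1, x2, y2] box format and the small epsilon constants are assumptions.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """CIoU regression term: 1 - IoU(P, T) + rho^2(P_ctr, T_ctr)/d^2 + alpha * v."""
    # intersection-over-union IoU(P, T)
    x1 = torch.max(pred[0], target[0]); y1 = torch.max(pred[1], target[1])
    x2 = torch.min(pred[2], target[2]); y2 = torch.min(pred[3], target[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    iou = inter / (area_p + area_t - inter + 1e-9)
    # squared centre distance rho^2(P_ctr, T_ctr) and squared enclosing diagonal d^2
    pc = (pred[:2] + pred[2:]) / 2; tc = (target[:2] + target[2:]) / 2
    rho2 = ((pc - tc) ** 2).sum()
    ex1 = torch.min(pred[0], target[0]); ey1 = torch.min(pred[1], target[1])
    ex2 = torch.max(pred[2], target[2]); ey2 = torch.max(pred[3], target[3])
    d2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    # aspect-ratio consistency term v of formula (6) and its trade-off weight alpha
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_t, h_t = target[2] - target[0], target[3] - target[1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / h_t) - torch.atan(w_p / h_p)) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / d2 + alpha * v
```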
In this embodiment, after the relevant parameters are set, the fish school data set is trained; after training, the loss curve can be obtained, with the loss function decreasing quickly at first and finally converging, and the trained fish school target detection model weight file is saved. Then, with the saved model weight file, a test fish school image file is input, fish target detection is performed on the fish school image, prediction boxes in which targets may exist are generated in the image, a prediction confidence score is given for each prediction box, and the image with the prediction boxes and their prediction confidence scores is output.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention should be regarded as an equivalent replacement and falls within the protection scope of the present invention.

Claims (7)

1. An automatic fish school detection method based on target occlusion compensation, characterized by comprising the following steps:
S1, collecting fish school images in a pond environment with a camera carried on a multi-rotor unmanned aerial vehicle, and labeling and expanding the collected fish school images;
S2, inputting the fish school image into a dual-branch feature extraction network for multistage, shallow-to-deep feature extraction, wherein the dual-branch feature extraction network adds a lightweight original-information feature extraction network in parallel with CSPDarknet53 on the basis of the backbone feature extraction network CSPDarknet53 of the YOLOv4 algorithm; after multistage feature extraction by the dual-branch network, five feature maps are obtained: a two-fold down-sampled feature map F_A1, a four-fold down-sampled feature map F_A2, an eight-fold down-sampled feature map F_A3, a sixteen-fold down-sampled feature map F_A4 and a thirty-two-fold down-sampled feature map F_A5, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image respectively;
the backbone feature extraction network CSPDarknet53 comprises a CBM unit and five cross-stage partial network CSPx units; the CBM unit consists of a convolution layer with stride 1 and a 3×3 kernel, a batch normalization layer and a Mish activation layer; a CSPx unit is formed by Concatenate fusion of several CBM units and x Res unit residual units, each Res unit consisting of a CBM unit with a 1×1 kernel, a CBM unit with a 3×3 kernel and a residual connection; the Concatenate fusion operation splices two feature maps along the channel dimension, so the channel dimension of the spliced feature map is expanded; the numbers of convolution channels of the five CSPx units are 64, 128, 256, 512 and 1024 in turn, and each CSPx unit performs one two-fold down-sampling; the feature maps obtained from the five CSPx units are F_C1, F_C2, F_C3, F_C4 and F_C5, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image respectively;
the lightweight original-information feature extraction network comprises five CM units; each CM unit consists of a convolution layer with stride 2 and a 3×3 kernel and a max-pooling layer MaxPool with pooling stride 1 and a 3×3 pooling kernel; each stride-2 convolution performs one two-fold down-sampling, and the number of convolution channels of each CM unit is the same as that of the corresponding CSPx unit in the backbone network CSPDarknet53; the feature maps obtained from the five CM units are F_L1, F_L2, F_L3, F_L4 and F_L5, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image respectively;
then, in the multistage, shallow-to-deep feature extraction process, the feature map F_Li extracted by the lightweight original-information network and the feature map F_Ci extracted by the corresponding CSPDarknet53 stage are combined by an Add fusion operation, i = 1, 2, 3, 4, 5; the Add fusion operation adds the corresponding pixel values of the two feature maps, giving the final extracted feature maps F_A1, F_A2, F_A3, F_A4 and F_A5, whose resolutions are 1/2, 1/4, 1/8, 1/16 and 1/32 of the input fish school image respectively;
S3, using an improved semantic embedding branch to fuse the semantic information of the feature map F_A5 obtained in step S2 into the feature map F_A4, obtaining a feature map F_AM4 whose resolution is 1/16 of the input fish school image, and to fuse the semantic information of the feature map F_A4 obtained in step S2 into the feature map F_A3, obtaining a feature map F_AM3 whose resolution is 1/8 of the input fish school image;
S4, fusing, by means of convolution down-sampling, the detail information of the four-fold down-sampled feature map F_A2 obtained in step S2 into the eight-fold down-sampled feature map F_AM3 obtained in step S3, obtaining a feature map F_AMC3 whose resolution is 1/8 of the input fish school image;
S5, passing the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 through the feature pyramid structure of the YOLOv4 algorithm for feature fusion, obtaining three feature maps F_B3, F_B4 and F_B5, and then, after convolution processing, using the feature maps F_B3, F_B4 and F_B5 to predict fish targets, obtaining repeated candidate boxes and the corresponding prediction confidence scores;
and S6, processing the repeated candidate boxes with an improved DIoU_NMS non-maximum suppression algorithm to obtain prediction boxes with their prediction confidence scores, and drawing the prediction boxes on the corresponding image as the fish school detection result.
2. The method according to claim 1, wherein in step S1 the fish targets in each collected fish school image are labeled one by one with the labelImg image annotation software, each labeled image generating an xml label file containing the annotation information, and the collected fish school images and their corresponding label files constitute an original data set; the original data set is then expanded by data augmentation, including vertical flipping, horizontal flipping, brightness change, addition of random Gaussian white noise, filtering and affine transformation, to form the final data set.
3. The method for automatic fish school detection based on target occlusion compensation as claimed in claim 1, wherein the fusion process using the improved semantic embedding branch in step S3 is as follows:
first, the deep feature map F_A5 obtained in step S2 is passed through a convolution layer with a 1×1 kernel and a convolution layer with a 3×3 kernel, and the two resulting feature maps of different scales are combined by a Concatenate fusion operation; after a Sigmoid function, two-fold up-sampling is performed by nearest-neighbour interpolation, and the result is multiplied pixel-wise with the shallow feature map F_A4 obtained in step S2 to obtain the feature map F_AM4, whose resolution is 1/16 of the input fish school image, so that the semantic information of the deep feature map F_A5 is fused into the shallow feature map F_A4;
then, the semantic information of the deep feature map F_A4 obtained in step S2 is likewise fused, using the improved semantic embedding branch, into the shallow feature map F_A3 to obtain the feature map F_AM3, whose resolution is 1/8 of the input fish school image;
the Sigmoid functional form used in the improved semantic embedding branch is as follows:
Figure FDA0003754725310000041
where i is the input and e is the natural constant.
4. The method for automatic fish school detection based on target occlusion compensation as claimed in claim 1, wherein the process of step S4 is as follows:
first, the four-fold down-sampled feature map F_A2 obtained in step S2 is processed by a CBL unit, where the CBL unit consists of a convolution layer with stride 1 and a 3×3 kernel, a batch normalization layer and a LeakyReLU activation layer; two-fold down-sampling is then performed with a convolution layer with stride 2 and a 3×3 kernel, and the result is combined, by a Concatenate fusion operation, with the feature map F_AM3 obtained in step S3 after it has been processed by a CBL unit, giving the feature map F_AMC3, whose resolution is 1/8 of the input fish school image.
5. The method for automatic fish school detection based on target occlusion compensation as claimed in claim 4, wherein the process of step S5 is as follows:
first, the feature map F_A5 obtained in step S2, the feature map F_AM4 obtained in step S3 and the feature map F_AMC3 obtained in step S4 are passed through the feature pyramid structure of the YOLOv4 algorithm for feature fusion, giving three feature maps F_B3, F_B4 and F_B5, wherein the feature pyramid structure of the YOLOv4 algorithm comprises a spatial pyramid pooling layer and a path aggregation network: in the spatial pyramid pooling structure, the feature map F_A5 is processed by three CBL units and then by four max-pooling layers with pooling kernels of 1×1, 5×5, 9×9 and 13×13, whose outputs are combined by a Concatenate fusion operation, and the path aggregation network structure repeatedly fuses the features along bottom-up and top-down paths; then, the three feature maps F_B3, F_B4 and F_B5 are each processed by a CBL unit and a convolution layer with a 1×1 kernel to obtain three prediction feature maps of different sizes, Prediction1, Prediction2 and Prediction3, whose resolutions are 1/8, 1/16 and 1/32 of the input fish school image respectively; fish targets are then predicted with the three prediction feature maps, giving repeated candidate boxes and the corresponding prediction confidence scores.
6. The method for automatic fish school detection based on target occlusion compensation as claimed in claim 1, wherein the process of step S6 is as follows:
S601, traverse all candidate boxes in the image, judge the prediction confidence score of each candidate box in turn, keep the candidate boxes whose scores are greater than the confidence threshold together with their scores, and delete the candidate boxes whose scores are below the confidence threshold;
S602, select the candidate box M with the highest prediction confidence score among the remaining candidate boxes, traverse the other candidate boxes B_i in turn and compute their distance intersection-over-union Distance-IoU (abbreviated DIoU) with the candidate box M; if the DIoU between a candidate box B_i and the candidate box M is not lower than the given threshold ε, the two boxes are considered to overlap heavily, and instead of directly deleting the candidate box B_i, the prediction confidence score of the candidate box B_i is reduced, after which the candidate box M is moved into the final prediction box set G; the prediction confidence score is reduced according to the following criterion:
S'_i = S_i, if DIoU(M, B_i) < ε
S'_i = S_i · (1 − DIoU(M, B_i)), if DIoU(M, B_i) ≥ ε
where M is the candidate box with the highest current prediction confidence score, B_i is another candidate box being traversed, ρ(M, B_i) is the distance between the centre points of M and B_i, c is the diagonal length of the smallest enclosing rectangle containing M and B_i, DIoU(M, B_i) is the distance intersection-over-union of M and B_i, ε is the given DIoU threshold, S_i is the prediction confidence score of the candidate box B_i, and S'_i is the reduced prediction confidence score of the candidate box B_i;
and S603, repeat step S602 until all candidate boxes have been processed, and draw the final prediction box set G on the corresponding image as the output, giving the fish school detection result.
7. The method as claimed in claim 6, wherein the DIoU in step S602 adds a penalty term to the intersection-over-union IoU, the penalty term accounting for the distance between the centre points of the two candidate boxes, and DIoU(M, B_i) is calculated as follows:
$$\mathrm{DIoU}(M, B_i) = \mathrm{IoU}(M, B_i) - \frac{\rho^{2}(M, B_i)}{c^{2}}$$
wherein M is the candidate box with the highest current prediction confidence score, B_i is another candidate box to be traversed, ρ(M, B_i) is the distance between the center points of M and B_i, c is the diagonal length of the minimum bounding rectangle containing M and B_i, and IoU(M, B_i) is the ratio of the intersection to the union of M and B_i.
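As a quick numeric check of this formula, using two hypothetical boxes chosen only for illustration, M = (0, 0, 4, 4) and B_i = (2, 0, 6, 4) in (x_min, y_min, x_max, y_max) form:

```latex
% M = (0,0,4,4), B_i = (2,0,6,4): intersection area 8, areas 16 and 16,
% centres (2,2) and (4,2), enclosing box (0,0,6,4)
\mathrm{IoU}(M,B_i) = \frac{8}{16+16-8} = \frac{1}{3}, \qquad
\rho^{2}(M,B_i) = (4-2)^{2} + (2-2)^{2} = 4, \qquad
c^{2} = 6^{2} + 4^{2} = 52
\;\Rightarrow\;
\mathrm{DIoU}(M,B_i) = \frac{1}{3} - \frac{4}{52} \approx 0.256
```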
CN202110354428.6A 2021-04-01 2021-04-01 Fish shoal automatic detection method based on target shielding compensation Active CN113076871B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110354428.6A CN113076871B (en) 2021-04-01 2021-04-01 Fish shoal automatic detection method based on target shielding compensation

Publications (2)

Publication Number Publication Date
CN113076871A CN113076871A (en) 2021-07-06
CN113076871B (en) 2022-10-21

Family

ID=76614401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110354428.6A Active CN113076871B (en) 2021-04-01 2021-04-01 Fish shoal automatic detection method based on target shielding compensation

Country Status (1)

Country Link
CN (1) CN113076871B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435396B (en) * 2021-07-13 2022-05-20 大连海洋大学 Underwater fish school detection method based on image self-adaptive noise resistance
CN113887608B (en) * 2021-09-28 2023-03-24 北京三快在线科技有限公司 Model training method, image detection method and device
CN113610070A (en) * 2021-10-11 2021-11-05 中国地质环境监测院(自然资源部地质灾害技术指导中心) Landslide disaster identification method based on multi-source data fusion
CN114387510A (en) * 2021-12-22 2022-04-22 广东工业大学 Bird identification method and device for power transmission line and storage medium
CN114419364A (en) * 2021-12-24 2022-04-29 华南农业大学 Intelligent fish sorting method and system based on deep feature fusion
CN114419568A (en) * 2022-01-18 2022-04-29 东北大学 Multi-view pedestrian detection method based on feature fusion
CN114898105B (en) * 2022-03-04 2024-04-19 武汉理工大学 Infrared target detection method under complex scene
CN114782759B (en) 2022-06-22 2022-09-13 鲁东大学 Method for detecting densely-occluded fish based on YOLOv5 network
CN114863263B (en) 2022-07-07 2022-09-13 鲁东大学 Snakehead fish detection method for blocking in class based on cross-scale hierarchical feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084866B (en) * 2020-08-07 2022-11-04 浙江工业大学 Target detection method based on improved YOLO v4 algorithm

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101566691A (en) * 2009-05-11 2009-10-28 华南理工大学 Method and system for tracking and positioning underwater target
CN111310622A (en) * 2020-02-05 2020-06-19 西北工业大学 Fish swarm target identification method for intelligent operation of underwater robot
CN111652118A (en) * 2020-05-29 2020-09-11 大连海事大学 Marine product autonomous grabbing guiding method based on underwater target neighbor distribution
CN111738139A (en) * 2020-06-19 2020-10-02 中国水产科学研究院渔业机械仪器研究所 Cultured fish monitoring method and system based on image recognition
CN112001339A (en) * 2020-08-27 2020-11-27 杭州电子科技大学 Pedestrian social distance real-time monitoring method based on YOLO v4
CN112308040A (en) * 2020-11-26 2021-02-02 山东捷讯通信技术有限公司 River sewage outlet detection method and system based on high-definition images
CN112465803A (en) * 2020-12-11 2021-03-09 桂林慧谷人工智能产业技术研究院 Underwater sea cucumber detection method combining image enhancement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-class fish stock statistics technology based on object classification and tracking algorithm; Tao Liu; Ecological Informatics 63 (2021) 101240; 2021-02-06; full text *
Real-time detection of underwater fish targets based on improved YOLO and transfer learning; Li Qingzhong et al.; Pattern Recognition and Artificial Intelligence; 2019-03-15 (No. 03); full text *
Research on fish school detection methods based on deep learning; Shen Junyu; China Master's Theses Full-text Database, Information Science and Technology; 2020-01-15; full text *

Similar Documents

Publication Publication Date Title
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN111640125B (en) Aerial photography graph building detection and segmentation method and device based on Mask R-CNN
CN111222396B (en) All-weather multispectral pedestrian detection method
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN110163213B (en) Remote sensing image segmentation method based on disparity map and multi-scale depth network model
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN108022244B (en) Hypergraph optimization method for significant target detection based on foreground and background seeds
CN111091101B (en) High-precision pedestrian detection method, system and device based on one-step method
CN110310305B (en) Target tracking method and device based on BSSD detection and Kalman filtering
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN113516664A (en) Visual SLAM method based on semantic segmentation dynamic points
CN116486288A (en) Aerial target counting and detecting method based on lightweight density estimation network
CN113537085A (en) Ship target detection method based on two-time transfer learning and data augmentation
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN114565675A (en) Method for removing dynamic feature points at front end of visual SLAM
CN113887649B (en) Target detection method based on fusion of deep layer features and shallow layer features
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN114943888A (en) Sea surface small target detection method based on multi-scale information fusion, electronic equipment and computer readable medium
CN113160117A (en) Three-dimensional point cloud target detection method under automatic driving scene
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant