CN114842215A - A fish visual recognition method based on multi-task fusion - Google Patents
A fish visual recognition method based on multi-task fusion
- Publication number
- CN114842215A (Application No. CN202210415517.1A)
- Authority
- CN
- China
- Prior art keywords
- target
- predicted
- frame
- vertex
- angle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/80—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
- Y02A40/81—Aquaculture, e.g. of fish
Abstract
Description
Technical Field
The invention belongs to the field of object detection, and in particular relates to a fish visual recognition method based on multi-task fusion.
Background Art
In aquaculture, detecting the physiological state and movement behavior of fish enables more precise farming. Physiological state and movement behavior can be derived from the body length, body weight, and motion posture of the fish. Object detection, keypoint detection, and instance segmentation in computer vision can provide accurate predicted bounding boxes, masks, and keypoint skeleton information. The predicted bounding box from object detection accurately localizes a fish, the mask from instance segmentation gives its contour, and the keypoint information allows its motion posture to be judged.
With the development of deep learning, many excellent algorithms exist for these tasks. For object detection, the YOLO series [Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).] and the SSD series [Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.] produce predicted bounding boxes. For instance segmentation, pixel-classification methods such as Mask R-CNN [He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017.] and SOLO [Wang, Xinlong, et al. "SOLO: Segmenting objects by locations." European Conference on Computer Vision. Springer, Cham, 2020.] can in theory segment every instance pixel and therefore achieve good results, but inference produces a large number of parameters, sacrificing real-time performance. For pose estimation, DeepPose [Toshev, Alexander, and Christian Szegedy. "DeepPose: Human pose estimation via deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.] uses a deep neural network to regress keypoint coordinates directly and refines the predictions with multi-scale information. Because keypoints are sparsely distributed, the current mainstream framework HRNet [Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.] localizes keypoints with heatmaps; however, heatmaps require upsampling back to the original image size, which again introduces a large number of parameters. Although these frameworks are excellent, running the three functions serially is inefficient and wastes computing resources. Aquaculture scenes contain dense instances, so the system has high requirements on real-time performance and resource usage.
A multi-task network has the following properties: (a) it completes several prediction tasks in one pass, saving inference time and computing power; (b) sharing one encoder across multiple task branches deepens feature extraction and thus improves the performance of each task. A multi-task framework is therefore well suited to aquaculture.
At present, almost no research addresses multi-task networks in the fishery domain, but in autonomous driving there are many applications of multi-task networks for panoramic perception [Teichmann, Marvin, et al. "MultiNet: Real-time joint semantic reasoning for autonomous driving." 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018.]. MultiNet uses one encoder and three decoder branches to perform scene classification, object detection, and semantic segmentation of the drivable area in real time, but the architecture is specific to autonomous driving. LSNet [Duan, Kaiwen, et al. "Location-sensitive visual recognition with cross-IOU loss." arXiv preprint arXiv:2104.04899 (2021).] unifies the encoding of location-sensitive tasks and performs object detection, instance segmentation, and pose estimation in a single framework, but it cannot run the three tasks in parallel.
Summary of the Invention
To solve the problems in the prior art, the present invention provides a fish visual recognition method based on multi-task fusion. Using an efficient and simple multi-task network, object detection, instance segmentation, and pose estimation are performed simultaneously. The network uses only one encoder and one decoder branch for detection at each scale, and keeps detection time and computational cost within a reasonable range.
To achieve high accuracy and speed, the technical solution of the present invention is as follows:
A fish visual recognition method based on multi-task fusion constructs a fully convolutional network. Darknet53 is used as the encoder to encode the image; a feature pyramid is used at the neck of the network to fuse contextual features of targets at different scales; and scale-specific detection heads serve as the decoder for targets of different scales, so that the output layers can fuse feature information from more levels. For each output tensor, the channels are partitioned among three tasks, used respectively for object detection, pose estimation, and instance segmentation.
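To make the channel partitioning concrete, the following minimal sketch (an illustration only; the variable names and exact channel ordering are assumptions, with the per-task channel counts taken from the descriptions below: 6 detection values, 2·k pose values, and 3 values per angle block) shows how one anchor's prediction vector could be split among the three tasks:

```python
# Minimal sketch of the per-anchor output-channel layout assumed from the
# description: 6 detection channels, 2*k pose channels, 3*(360/stride)
# segmentation channels. Names and the exact ordering are assumptions.
K_KEYPOINTS = 6              # keypoints per fish (see the embodiment)
STRIDE_DEG = 15              # polar step size in degrees
N_BLOCKS = 360 // STRIDE_DEG # 24 angle blocks

N_DET = 6                    # {class, confidence, x, y, w, h}
N_POSE = 2 * K_KEYPOINTS     # (dx, dy) vector per keypoint
N_SEG = 3 * N_BLOCKS         # (distance, angle offset, confidence) per block

def split_channels(per_anchor_vector):
    """Split one anchor's prediction vector into the three task heads."""
    det = per_anchor_vector[:N_DET]
    pose = per_anchor_vector[N_DET:N_DET + N_POSE]
    seg = per_anchor_vector[N_DET + N_POSE:N_DET + N_POSE + N_SEG]
    return det, pose, seg
```

With k = 6 keypoints and a 15° step size as used in the embodiment, one anchor would occupy 6 + 12 + 72 = 90 channels under this assumed layout.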
For object detection, an anchor-based one-stage detection method is adopted; feature maps of different sizes containing predicted-box information are output for targets of different scales, and the predicted box is a rectangle.
For pose estimation, a pose state is expressed by several keypoints; the center point of the single predicted box obtained by object detection is used, and the pose is estimated by predicting the vectors pointing from the center point to each keypoint.
For instance segmentation, the center point of the predicted box serves as the origin of a polar coordinate system; the preliminary position of each contour point is determined by predicting the angular interval in which the corresponding vertex of the target contour polygon lies, and its exact position is obtained by predicting the offset relative to the interval's reference angle, thereby determining the mask of a single instance.
Further, the anchor-based one-stage object detection method is as follows: feature maps of different sizes are output for targets of different scales, where C is the number of channels occupied by the detection part of the output, corresponding to {class, confidence, x, y, w, h}; class is the category of the predicted object, confidence is the probability the network assigns to the current prediction, x and y are the offsets of the predicted object center relative to the grid cell, and w and h are the size offsets of the predicted box relative to the anchor box. The decoder divides the output feature map into S×S grid cells; each cell is responsible for predicting an object whose center falls into that cell, and the shape and position of the anchor box are offset so that the rectangular box better fits the target. All detected predicted boxes are then filtered with a non-maximum suppression algorithm, keeping the highest-confidence predicted box for each object.
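As an illustration of the YOLOv3-style box decoding that this anchor-based detection builds on, the sketch below shows how one grid cell's raw outputs might be turned into an absolute box; the sigmoid/exponential offset functions and the argument names are assumptions, not taken from the patent text:

```python
import numpy as np

def decode_box(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, cell_size):
    """Decode one cell's raw outputs into an absolute box.

    Sketch assuming a YOLOv3-style parameterisation: sigmoid offsets for the
    center inside the cell, exponential size offsets relative to the anchor.
    """
    cx = (grid_x + 1.0 / (1.0 + np.exp(-tx))) * cell_size  # center stays inside its cell
    cy = (grid_y + 1.0 / (1.0 + np.exp(-ty))) * cell_size
    w = anchor_w * np.exp(tw)                               # size relative to the anchor box
    h = anchor_h * np.exp(th)
    return cx, cy, w, h
```

The decoded boxes would then be ranked by confidence and pruned with non-maximum suppression, as described above.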
Further, for pose estimation, the coordinates of k keypoints are encoded, where i ∈ {1, ..., k} and y_i denotes the absolute position (x, y) of the i-th keypoint; keypoint detection occupies N_pose = 2×k channels. Using the predicted box obtained by object detection, the center point of the predicted box is first computed, then the diagonal length diag of the predicted box is computed, and each keypoint is encoded as a pose vector N(y_i; b) normalized with respect to the box center and the diagonal length diag.
After normalization, the keypoint outputs are limited to the range [0, 1], and the pose is estimated from the predicted vectors pointing from the box center to each keypoint. Keypoint detection is thus directly coupled to the object detection task, so the two tasks improve each other's performance.
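A minimal sketch of how the normalized pose vectors could be decoded back into absolute keypoint positions, assuming each predicted pair encodes the center-to-keypoint offset divided by the box diagonal diag (the exact normalization formula is not reproduced here, so this is an assumption):

```python
import numpy as np

def decode_keypoints(center, diag, pose_vector):
    """Recover absolute keypoint positions from normalised pose vectors.

    Sketch under the assumption that each (dx, dy) pair is the
    center-to-keypoint offset divided by the box diagonal 'diag'.
    """
    cx, cy = center
    pairs = np.asarray(pose_vector).reshape(-1, 2)  # k rows of (dx, dy)
    keypoints = []
    for dx, dy in pairs:
        keypoints.append((cx + dx * diag, cy + dy * diag))
    return keypoints
```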
Further, for instance segmentation, the center point of the predicted box is taken as the origin of a polar coordinate system, and each vertex of the mask polygon is expressed by an angle and an intercept. The angle and intercept are computed as follows: with a fixed step size Stride ∈ [0, 360], the polar plane is divided into 360/Stride angle blocks, and the starting angle of block N is defined as a = N×Stride. If a vertex of the polygon falls into a given angle block, the channels of that block are responsible for predicting the distance from the vertex to the origin, the offset of the vertex angle relative to the block's starting angle, and the confidence of the vertex. For each block, each vertex is therefore represented in the output channels by three parameters: distance, angle offset, and confidence. For each instance, with step size S the network can predict at most 360/S vertices.
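The following sketch illustrates how per-block (distance, angle offset, confidence) triples could be decoded into polygon vertices around the box center; the confidence threshold and the assumption that the angle offset is expressed in degrees are illustrative choices, not specified by the text:

```python
import math

def decode_contour(center, block_preds, stride_deg=15.0, conf_thresh=0.5):
    """Turn per-angle-block (distance, angle_offset, confidence) triples into
    polygon vertices around the box center. Sketch only; the offset is assumed
    to be in degrees relative to the block's starting angle N*stride."""
    cx, cy = center
    vertices = []
    for n, (dist, angle_offset, conf) in enumerate(block_preds):
        if conf < conf_thresh:
            continue                      # this block contributes no vertex
        angle = math.radians(n * stride_deg + angle_offset)
        vertices.append((cx + dist * math.cos(angle),
                         cy + dist * math.sin(angle)))
    return vertices
```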
Further, a loss function is included to measure the gap between predicted values and ground-truth values. The total loss is accumulated over all G_w×G_h grid positions and over the n_a anchor boxes at each position, summing the object detection loss l_obj(i,j), the pose estimation loss l_pose(i,j), and the instance segmentation loss l_seg(i,j), where q_{i,j} is a constant indicating whether the current anchor box contains a target. The individual task losses are defined as follows:

l_obj(i,j) = l_1(i,j) + l_2(i,j) + l_3(i,j) + l_4(i,j)

where l_1(i,j) is the loss for predicting the box center, l_2(i,j) the loss for predicting the box size, l_3(i,j) the confidence loss, and l_4(i,j) the classification loss. The center loss l_1(i,j) uses the ground-truth center position of the target box and a binary cross-entropy function. The size loss l_2(i,j) uses w_{i,j} and h_{i,j}, the width and height of the target box, together with the width and height of the current j-th anchor box. The confidence loss l_3(i,j) uses the box confidence predicted by the network. The classification loss l_4(i,j) uses c, the total number of categories, the predicted category C_{i,j,k}, and a cross-entropy loss function ψ(·,·).

For the pose loss l_pose(i,j), n_p is the number of keypoints, P_{i,j,k} are the keypoint coordinates, and φ(·,·) is the mean-squared-error loss function.

For the segmentation loss l_seg(i,j), v is the number of angle blocks, diag is the diagonal length of the predicted box, α_{i,j,k} is the distance from the vertex to the origin, β_{i,j,k} is the offset of the vertex angle relative to the start of its angle interval, and γ_{i,j,k} is the confidence of the current vertex.
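As a rough illustration of the aggregation implied by this description (an assumption: the exact role of q_{i,j} and the closed form of each term are not reproduced here, so the per-term functions are left as placeholders), the total loss could be assembled as follows:

```python
def total_loss(preds, targets, grid_h, grid_w, num_anchors,
               l_obj, l_pose, l_seg, anchor_has_target):
    """Sum the three task losses over all grid cells and anchors.

    Sketch of the assumed aggregation: q_{i,j} gates each anchor so that only
    anchors responsible for a target contribute. l_obj / l_pose / l_seg are
    placeholder callables standing in for the per-task loss terms.
    """
    loss = 0.0
    for i in range(grid_h * grid_w):            # every grid position
        for j in range(num_anchors):            # every anchor at that position
            q = anchor_has_target(i, j, targets)  # 1 if this anchor owns a target
            if q:
                loss += l_obj(preds, targets, i, j)
                loss += l_pose(preds, targets, i, j)
                loss += l_seg(preds, targets, i, j)
    return loss
```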
The beneficial effects of the present invention are as follows: an efficient multi-task network is proposed that performs object detection, instance segmentation, and pose estimation in parallel, with inference speed raised to a real-time level. Compared with the baseline, the network adds only a negligible number of parameters while exploiting the network's capacity for parallel multi-task processing, so it can easily be deployed in practical aquaculture scenarios. The invention proposes the idea of predicting multiple tasks with a single decoder branch, offering inspiration for fused encoding across multiple tasks.
Brief Description of the Drawings
Fig. 1 shows the visual tasks for fish monitoring in the present invention: (a) target image, (b) object detection, (c) pose estimation, (d) instance segmentation.
Fig. 2 is a diagram of the network structure of the present invention.
Fig. 3 is a schematic diagram of instance segmentation in an embodiment of the present invention.
Fig. 4 is a histogram of the number of sides of the mask polygons in an embodiment of the present invention.
Fig. 5 shows detection results for several fish in an embodiment of the present invention: (a) object detection results, (b) pose estimation results, (c) instance segmentation results, (d) multi-task results.
Detailed Description of Embodiments
The technical solution of the present invention is described in further detail below with reference to the accompanying drawings.
To verify the performance of the network, this embodiment builds a fish multi-task dataset with high-quality labels, including manually annotated ground-truth boxes, keypoints, and mask labels; all subsequent operations are carried out on this dataset. This embodiment tries not only an end-to-end training strategy but also the effect of an alternating training paradigm on the detection accuracy of the multiple tasks. Experiments show that the approach of this embodiment is effective and efficient. After verification, this embodiment finally achieves 95.3% average precision for object detection and 53.9% average precision for instance segmentation; for pose estimation it reaches 95.1% average precision in object keypoint similarity, and the inference speed reaches 66.3 fps (on an NVIDIA Tesla V100). All of these tasks increase the parameter count by only 0.69% compared with the baseline model.
Embodiment:
In this embodiment, 2.6k fish images were collected and manually annotated. The annotations follow the MS-COCO format and include ground-truth boxes, instance segmentation mask polygons, and pose estimation keypoint coordinates.
This embodiment trains the framework on an NVIDIA Tesla V100, trying different combinations of hyperparameters and optimizers to find the most suitable training recipe. Input images are uniformly resized to 416×416 for inference, and the Darknet53 model pre-trained on ImageNet is used as the initialization weights. Average precision (AP) is used to measure model performance, with the computation consistent with MS-COCO: for object detection, AP is computed over intersection-over-union (IoU) thresholds between predicted boxes and ground truth from 0.5 to 0.95; for instance segmentation, mask IoU is used; and for pose estimation, OKS is used to compute AP.
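For reference, the box IoU used when sweeping the detection AP thresholds is the standard definition; a minimal sketch (not specific to this patent):

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2). Standard definition used when sweeping the AP
    thresholds from 0.5 to 0.95."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```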
a) Object detection
Similar to YOLOv3, the object detection in this embodiment is anchor-based, so choosing the most suitable anchor sizes greatly helps network convergence. Feature maps of different sizes are output for targets of different scales, with S ∈ {13, 26, 52}; each output feature map is divided into S×S grid cells, each cell is responsible for predicting the class and confidence of an object whose center falls into it, and the shape and position of the anchor box are offset so that the rectangular box better fits the target. Detected boxes are filtered with a non-maximum suppression algorithm, keeping the highest-confidence box. For each anchor box, object detection occupies the first six dimensions of the output tensor, corresponding to {class, confidence, x, y, w, h}. The K-means algorithm is used to cluster all ground-truth boxes in the dataset, finally yielding nine anchor sizes across the scales: {[333,151], [363,173], [353,216], [261,127], [148,245], [340,119], [129,56], [175,82], [231,99]}. Different object detection algorithms were run on this dataset; the specific results are shown in Table 1.
Table 1
For MultiNet, only its object detection branch is used here and the other tasks are ignored. The network inference speed of FishNet in this embodiment is higher than MultiNet, CenterNet, and Faster R-CNN, but lower than YOLOv5s and YOLOv3; the reason is that YOLOv5s uses a lightweight network design, while this embodiment adds some parameters on top of YOLOv3. The AP of this framework is second only to CenterNet, which is quite a good result. Although the architecture of this embodiment is rebuilt with YOLOv3 as the baseline, its score is higher than YOLOv3; this is attributed to the mutual influence between the multiple tasks.
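Returning to the anchor selection described above, a simplified sketch of K-means clustering over ground-truth box sizes is shown below; it uses a plain Euclidean distance on (w, h), whereas YOLO-style implementations often use a 1 − IoU distance, so this is an illustrative assumption rather than the exact procedure used:

```python
import numpy as np

def kmeans_anchors(box_wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box (width, height) pairs into k anchor sizes.

    Simplified sketch: plain Euclidean k-means on (w, h). YOLO-style
    implementations often use a 1 - IoU distance instead.
    """
    rng = np.random.default_rng(seed)
    boxes = np.asarray(box_wh, dtype=float)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                 # nearest anchor for each box
        for c in range(k):
            if np.any(assign == c):
                centers[c] = boxes[assign == c].mean(axis=0)
    return centers
```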
b) Pose estimation
There is little prior work that uses fish keypoints to study fish motion, so this embodiment defines six keypoints on the fish body: the mouth, the upper and lower edges of the gill, the center of the tail, and the upper and lower edges of the tail, which together define the posture of a fish. If there are special requirements, the positions and number of keypoints can be changed accordingly.
Table 2
There are few keypoint recognition schemes for animals, so this embodiment adapts several frameworks with some modifications to fit this dataset; Table 2 shows part of the experimental results. Unlike heatmap-based keypoint detection methods, this embodiment localizes keypoints by vector regression. It finally reaches 81.3% AP on the dataset, but the method still cannot match heatmap-based methods. The main reason identified is that the diagonal length of the predicted box is used as the reference when normalizing the keypoint vectors, so the quality of the predicted detection box directly affects the keypoint vector regression results.
c) Instance segmentation
This embodiment performs instance segmentation by placing each instance in a polar coordinate system and predicting the vertices of the instance's contour polygon. Since the polar plane is divided into angle blocks with a fixed step size, the number of blocks is the upper bound on the number of polygon sides, so choosing an appropriate step size improves the fineness of the segmentation boundary. To address this, the number of sides of the mask polygons over the whole dataset was counted; the statistics are shown in Fig. 4. Most polygons have around 20 sides, so this embodiment considers that a step size of 15°, i.e., 24 angle blocks, meets the requirements of this work. Instance segmentation therefore occupies 3×24 = 72 channels in this embodiment.
Table 3
This embodiment selects the most representative pixel-classification method, Mask R-CNN, and three typical contour-based methods for comparison. As shown in Table 3, on this dataset the method of this embodiment achieves 46.7% AP. Compared with Mask R-CNN, it lags behind on large targets, possibly because the contours of large targets are more complex and the limited number of polygon sides prevents a finer contour. Compared with PolarMask, the current best contour-based method, this embodiment reaches almost the same level, which is unexpected.
Table 4 compares the parameter counts and GFLOPs of the model in this embodiment for the different tasks with YOLOv3. Compared with YOLOv3, this embodiment adds only 5.1% more parameters; compared with the single-task detection version of FishNet, multi-task learning is achieved by adding only 0.69% more parameters, which is almost negligible. The model of this embodiment can therefore perform real-time inference at high speed. Fig. 5 shows the visualization of the final detection results of the model.
Table 4
To compare the advantage of the multi-task network over a serial structure, this embodiment combines well-performing algorithms for each task. As Table 5 shows, after serially combining the YOLOv5 object detection algorithm, the HRNet pose estimation algorithm, and the PolarMask instance segmentation algorithm, the model's parameter count increases greatly and the inference time rises sharply, lowering the inference frame rate. The network proposed in this embodiment relies on the advantages of the multi-task network to maintain a high inference frame rate while achieving good detection accuracy.
Table 5
In summary, this embodiment proposes a multi-task network architecture that can be trained end to end and performs object detection, pose estimation, and instance segmentation efficiently and at high speed. A fish multi-task dataset was built and the framework was tested on it. Compared with other algorithms, the method of this embodiment achieves excellent detection results while maintaining high real-time performance, reaching 63.3 FPS on an NVIDIA Tesla V100. This work shows that a multi-task network can achieve good detection results using only a single prediction branch, and it is hoped that subsequent research can extend this method to achieve stronger performance in more fields.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210415517.1A CN114842215A (en) | 2022-04-20 | 2022-04-20 | A fish visual recognition method based on multi-task fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210415517.1A CN114842215A (en) | 2022-04-20 | 2022-04-20 | A fish visual recognition method based on multi-task fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114842215A true CN114842215A (en) | 2022-08-02 |
Family
ID=82565544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210415517.1A Pending CN114842215A (en) | 2022-04-20 | 2022-04-20 | A fish visual recognition method based on multi-task fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114842215A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024087574A1 (en) * | 2022-10-27 | 2024-05-02 | 中国科学院空天信息创新研究院 | Panoptic segmentation-based optical remote-sensing image raft mariculture area classification method |
CN117132870A (en) * | 2023-10-25 | 2023-11-28 | 西南石油大学 | Wing icing detection method combining CenterNet and mixed attention |
CN117132870B (en) * | 2023-10-25 | 2024-01-26 | 西南石油大学 | Wing icing detection method combining CenterNet and mixed attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |