CN114842215A - Fish visual identification method based on multi-task fusion - Google Patents

Fish visual identification method based on multi-task fusion

Info

Publication number
CN114842215A
CN114842215A
Authority
CN
China
Prior art keywords
target
predicted
frame
vertex
angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210415517.1A
Other languages
Chinese (zh)
Inventor
曹立杰
陈子文
王其华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Ocean University
Original Assignee
Dalian Ocean University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Ocean University filed Critical Dalian Ocean University
Priority to CN202210415517.1A
Publication of CN114842215A
Pending legal-status Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81 Aquaculture, e.g. of fish

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fish visual identification method based on multi-task fusion, which uses an efficient multi-task network to perform target detection, instance segmentation and pose estimation in parallel, raising the inference speed to a real-time level. Compared with the baseline, the network adds only a negligible number of parameters while exploiting the network's capacity for parallel multi-task processing, so it can be easily deployed in practical aquaculture application scenarios. The invention proposes the idea of predicting with a multi-task network that uses a single decoder branch, and offers a heuristic for fused encoding across multiple tasks.

Description

Fish visual identification method based on multi-task fusion
Technical Field
The invention belongs to the field of target detection, and particularly relates to a fish visual identification method based on multi-task fusion.
Background
In the aquaculture process, detecting the physiological state and movement behavior of fish helps to achieve precision aquaculture. The physiological state and movement behavior can be obtained by estimating the body length, body weight and movement posture of the fish. Target detection, keypoint detection and instance segmentation in computer vision can provide accurate predicted target frames, masks and keypoint skeleton information: the predicted target frame obtained by target detection accurately locates the fish, the mask obtained by instance segmentation yields the fish contour, and the movement posture of the fish can be judged from the keypoint information.
With the development of deep learning, many excellent algorithms can accomplish these tasks. For target detection there are the YOLO series [Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018)] and the SSD series [Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016]. For instance segmentation, Mask R-CNN [He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017] and SOLO [Wang, Xinlong, et al. "SOLO: Segmenting objects by locations." European Conference on Computer Vision. Springer, Cham, 2020] follow a pixel-classification approach that can in theory segment every pixel of an instance and therefore achieves good results, but the inference process generates a large number of parameters and thus loses real-time performance. For pose estimation, deep neural networks are used for human pose estimation; localization is performed by predicting the coordinate positions of keypoints, and the prediction is refined with multi-scale information. Since keypoints are sparsely distributed, the current mainstream framework HRNet [Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019] locates keypoints with heatmaps, but the heatmaps must be upsampled to the original image size, which also produces a large number of parameters. While these frameworks perform well, running the three functions serially is inefficient and wastes computing resources. Dense instances occur in the aquaculture process, which places high demands on the real-time performance and resource occupation of the system. A multi-task network has the following characteristics: (a) it can complete multiple prediction tasks at once, saving inference time and computing power; (b) multiple task branches share one encoder, which helps to deepen feature extraction and thereby improve the performance of each task. A multi-task framework is therefore well suited to the aquaculture process.
Few researchers have worked on multi-task networks in fisheries, but in the field of autonomous driving there are many applications of multi-task networks for panoramic perception [Teichmann, Marvin, et al. "MultiNet: Real-time joint semantic reasoning for autonomous driving." 2018 IEEE Intelligent Vehicles Symposium (IV)]. MultiNet realizes scene classification, target detection and semantic segmentation of the drivable region with one encoder and three decoder branches and achieves real-time detection, but the framework is only suited to the autonomous-driving domain. LSNet [Duan, Kaiwen, et al. "Location-sensitive visual recognition with cross-IOU loss." arXiv preprint arXiv:2104.04899 (2021)] unifies the encoding of position-sensitive tasks and realizes object detection, instance segmentation and pose estimation within a single framework, but the three tasks cannot be executed in parallel.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a fish visual identification method based on multi-task fusion, which can simultaneously realize target detection, instance segmentation and pose estimation with an efficient and simple multi-task network; the network uses only one encoder and a single decoder branch for detection at each scale, keeping detection time and computational occupation within a reasonable range.
In order to realize higher accuracy and speed, the technical scheme of the invention is as follows:
A fish visual identification method based on multi-task fusion constructs a fully convolutional network: Darknet53 is adopted as the encoder to encode images, a feature pyramid at the neck of the network performs context feature fusion for targets of different scales, and a detection head of the corresponding scale serves as the decoder for targets of each scale, so that the output layer can fuse feature information from more levels. For each output tensor, the channels are divided among three tasks and used for target detection, pose estimation and instance segmentation respectively;
the target detection adopts an anchor-based one-stage target detection method; feature maps of different sizes containing predicted target frame information are output for targets of different scales, and the predicted target frame is rectangular;
the pose estimation expresses a pose state with a plurality of keypoints; the center point of a single predicted target frame obtained by target detection is used, and the pose is estimated through the vectors pointing from the predicted center point to each keypoint;
and the instance segmentation determines the mask of a single instance by taking the center point of the predicted target frame as the origin of polar coordinates: the initial position of a contour point is determined by predicting the angle interval in which a vertex of the target contour polygon lies, and the specific position is obtained by predicting, as a correction, the offset between the predicted position and the reference angle.
Further, the anchor-based one-stage target detection method is specifically as follows: feature maps of different sizes are output for targets of different scales,
F_out ∈ R^(S×S×C)
where C is the number of channels occupied by the output feature map, corresponding respectively to {category, confidence, x, y, w, h}: the category denotes the kind of predicted object, the confidence is the prediction probability given by the network for the current predicted object, x and y are the offsets of the center of the predicted object relative to the center point of the grid cell, and w and h are the size offsets of the predicted target frame relative to the anchor frame. The decoder divides the output feature map into S × S grid cells, each cell being responsible for predicting an object whose center point falls into that cell; by offsetting the shape and position of the anchor frame, the rectangular frame is made to fit the target better. All detected predicted target frames are then filtered with a non-maximum suppression algorithm, leaving the predicted target frame with the highest confidence for each object.
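As a concrete illustration of this decoding and filtering step, the sketch below shows how a grid-cell prediction could be turned into an image-space box and how greedy non-maximum suppression keeps the highest-confidence box per object. The tensor conventions, function names and the sigmoid/exponential parameterization are assumptions for illustration (borrowed from common anchor-based detectors), not code taken from the patent.

```python
# Illustrative sketch, not the patented implementation.
import numpy as np

def decode_box(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, cell_size):
    """Turn raw offsets at grid cell (grid_x, grid_y) into an image-space box."""
    cx = (grid_x + 1.0 / (1.0 + np.exp(-tx))) * cell_size   # sigmoid-bounded center offset (assumed)
    cy = (grid_y + 1.0 / (1.0 + np.exp(-ty))) * cell_size
    w = anchor_w * np.exp(tw)                                # anchor scaled by exp(size offset) (assumed)
    h = anchor_h * np.exp(th)
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; boxes are (N, 4) as x1, y1, x2, y2."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_thresh]      # drop boxes overlapping the kept one
    return keep
```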
Further, the pose estimation encodes the k keypoint coordinates as
Y = {y_1, y_2, ..., y_k}
where i ∈ {1, ..., k} and y_i denotes the absolute position coordinates (x, y) of the i-th keypoint; keypoint detection occupies N_pose = 2 × k channels. First, the center point b of the predicted target frame is calculated from the predicted target frame obtained by target detection, b = ((x_min + x_max) / 2, (y_min + y_max) / 2), where (x_min, y_min) and (x_max, y_max) are the corners of the predicted target frame; the diagonal length diag of the predicted target frame is then calculated, and each pose vector N(y_i; b) is:
N(y_i; b) = (y_i − b) / diag
After normalization, the keypoint detection limits the output range to [0, 1], and pose estimation is performed by predicting the vectors pointing from the center point of the target frame to each keypoint. The keypoint detection and target detection tasks are thereby directly coupled, so the performance of the two tasks improves each other.
Further, for the instance segmentation, the center point of the predicted target frame is used as the origin of polar coordinates, and a vertex of the mask polygon is expressed by an angle and an intercept. The angle and intercept are calculated as follows: with a fixed step size Stride ∈ (0, 360], the coordinate plane is divided into 360 / Stride angle blocks, and each angle block defines an interval start angle a = N × Stride. If a vertex of the polygon falls into a certain angle block, the channels of that block are responsible for predicting the distance from the vertex to the origin, the offset of the vertex angle relative to the interval start angle, and the confidence of the vertex; for each block, each vertex is represented in the output channels by three parameters: distance, angle offset and confidence. For each instance, when the step size is S the network can predict at most 360 / S vertices.
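To make this polar-coordinate coding concrete, the following sketch decodes per-block predictions back into contour vertices. It assumes each angle block outputs a distance, an angular offset from the block's start angle, and a confidence, as described above; the function name and the confidence threshold are illustrative assumptions, not taken from the patent.

```python
# Illustrative decoding of the angle-block contour representation.
import numpy as np

def decode_polygon(center, distances, angle_offsets, confidences,
                   stride_deg=15.0, conf_thresh=0.5):
    """Each input array has one entry per angle block (360 / stride_deg blocks)."""
    vertices = []
    for n, (d, off, conf) in enumerate(zip(distances, angle_offsets, confidences)):
        if conf < conf_thresh:
            continue                                  # this block predicts no contour vertex
        angle = np.deg2rad(n * stride_deg + off)      # interval start angle + predicted offset
        vertices.append([center[0] + d * np.cos(angle),
                         center[1] + d * np.sin(angle)])
    return np.array(vertices)

# Example: 24 blocks for a 15-degree step, as chosen in the embodiment below.
num_blocks = int(360 / 15)
poly = decode_polygon(center=(128.0, 96.0),
                      distances=np.full(num_blocks, 40.0),
                      angle_offsets=np.full(num_blocks, 7.5),
                      confidences=np.ones(num_blocks))
```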
Further, the method also includes a loss function for calculating the difference between the predicted values and the true values, as follows:
L = Σ_{i=1}^{G_w × G_h} Σ_{j=1}^{n_a} q_{i,j} [ l_obj(i,j) + l_pose(i,j) + l_seg(i,j) ]
where l_obj(i,j) is the loss of the target detection task, l_pose(i,j) is the loss of the pose estimation task, l_seg(i,j) is the loss of the instance segmentation, q_{i,j} is a constant indicating whether the current anchor frame contains a target, G_w × G_h denotes the current grid size, and n_a is the number of anchor frames for the current target. The loss of the target detection task is defined as:
l_obj(i,j) = l_1(i,j) + l_2(i,j) + l_3(i,j) + l_4(i,j)
where l_1(i,j) is the loss of the predicted target frame center, computed with a binary cross entropy over the center position (x_{i,j}, y_{i,j}) of the target frame; l_2(i,j) is the loss of the predicted target frame size, computed from the width and height w_{i,j}, h_{i,j} of the target frame and the width and height of the current j-th anchor frame; l_3(i,j) is the confidence loss, computed from the target frame confidence predicted by the network; and l_4(i,j) is the classification loss, where C is the total number of classes, C_{i,j,k} is the predicted target class, and ψ(·,·) is a cross entropy loss function. The pose loss l_pose(i,j) is computed with a mean square error loss function φ(·,·) over the coordinate positions P_{i,j,k} of the n_p keypoints. The segmentation loss l_seg(i,j) is computed over the v divided angle blocks, where diag is the diagonal length of the predicted target frame, α_{i,j,k} is the distance from a vertex to the origin, β_{i,j,k} is the offset of the vertex angle relative to the start of its angle interval, and γ_{i,j,k} is the confidence of the current vertex. [The detailed formulas for l_1 to l_4, l_pose and l_seg are given only as images in the original document.]
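As a rough, assumption-laden sketch of how such per-anchor losses could be combined, the code below sums detection, pose and segmentation terms only for anchors responsible for a target (q_{i,j} = 1). The individual terms (binary cross entropy for centers, confidence and class; mean square error for sizes, keypoints and polar parameters) follow common practice and the symbol definitions above; they are not a verified reproduction of the patented loss, whose exact formulas are given only as images.

```python
# Hedged sketch of the multi-task loss aggregation; names and term choices are assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(pred, target, responsible_mask):
    """pred/target: dicts of per-anchor tensors (values in [0, 1] where BCE is used);
    responsible_mask: the q_ij indicator, 1 where an anchor is matched to a target."""
    q = responsible_mask.float()
    l_center = F.binary_cross_entropy(pred["xy"], target["xy"], reduction="none").sum(-1)
    l_size = F.mse_loss(pred["wh"], target["wh"], reduction="none").sum(-1)
    l_conf = F.binary_cross_entropy(pred["conf"], target["conf"], reduction="none")
    l_cls = F.binary_cross_entropy(pred["cls"], target["cls"], reduction="none").sum(-1)
    l_obj = l_center + l_size + l_conf + l_cls            # plays the role of l_1 + l_2 + l_3 + l_4
    l_pose = F.mse_loss(pred["kpt"], target["kpt"], reduction="none").sum(-1)
    l_seg = F.mse_loss(pred["polar"], target["polar"], reduction="none").sum(-1)
    return (q * (l_obj + l_pose + l_seg)).sum()           # sum over grid cells and anchors

# Toy usage: 4 anchors, 1 class, 6 keypoints (12 values), 24 angle blocks x 3 values.
n = 4
pred = {k: torch.rand(n, d) for k, d in
        [("xy", 2), ("wh", 2), ("cls", 1), ("kpt", 12), ("polar", 72)]}
pred["conf"] = torch.rand(n)
target = {k: torch.rand_like(v) for k, v in pred.items()}
loss = multitask_loss(pred, target, torch.tensor([1.0, 0.0, 1.0, 0.0]))
```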
The invention has the advantage of providing an efficient multi-task network that can perform target detection, instance segmentation and pose estimation in parallel, raising the inference speed to a real-time level. Compared with the baseline, the network adds only a negligible number of parameters while exploiting its capacity for parallel multi-task processing, so it can be easily deployed in practical aquaculture application scenarios. The invention proposes the idea of predicting with a multi-task network that uses a single decoder branch, and offers a heuristic for fused encoding across multiple tasks.
Drawings
FIG. 1 is a diagram of the visual tasks of the present invention for fish monitoring: (a) target image, (b) target detection, (c) pose estimation, and (d) instance segmentation.
Fig. 2 is a network architecture framework diagram of the present invention.
FIG. 3 is a schematic diagram of instance segmentation according to an embodiment of the present invention.
FIG. 4 is a statistical graph of the edge counts of mask polygons according to an embodiment of the present invention.
Fig. 5 shows the monitoring results for various fishes according to the embodiment of the present invention: (a) target detection results, (b) pose estimation results, (c) instance segmentation results, and (d) multi-task results.
Detailed Description
The technical solution of the present invention will be further described in detail with reference to the accompanying drawings.
In order to verify the performance of the network, this embodiment constructs a fish multi-task data set with high-quality labels, containing manually annotated ground-truth target frames, keypoints and mask labels; all subsequent operations are performed on this data set. This embodiment tries not only an end-to-end training strategy but also the influence of an alternating training paradigm on the detection accuracy of the multiple tasks. Experiments show that the idea of this embodiment is effective and efficient. Through verification, this embodiment finally obtains a target detection average precision of 95.3% and an instance segmentation average precision of 53.9%; for pose estimation, it reaches an object keypoint similarity (OKS) average precision of 95.1%, with an inference speed of 66.3 fps (using an NVIDIA Tesla V100). All of these tasks add only 0.69% more parameters compared to the baseline model.
Example:
In this embodiment, 2.6k fish images are collected and manually annotated. The annotation follows the MS-COCO format and includes the ground-truth target frames, the instance segmentation mask polygons, and the coordinates of the pose estimation keypoints.
The present embodiment trains the framework on an NVIDIA Tesla V100 and tries combinations of different hyper-parameters and optimizers to find the most suitable training method. The input images are uniformly resized to 416 × 416 for inference, and a Darknet53 model pre-trained on ImageNet is used for the initialization weights. Average precision (AP) is used to measure the model performance, with the calculation standard consistent with MS-COCO: for target detection, the AP is computed over intersection-over-union (IoU) thresholds between the predicted box and the ground truth from 0.5 to 0.95; for instance segmentation, the mask IoU is used to compute the AP; and for pose estimation, OKS is used to compute the AP.
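For reference, the box IoU swept over the 0.5 to 0.95 thresholds mentioned above can be computed as follows; this is the standard definition of IoU between two axis-aligned boxes, not code taken from the embodiment.

```python
# Standard box IoU (boxes as x1, y1, x2, y2).
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333..., i.e. a true positive at the 0.5 threshold only if higher
```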
a) Target detection
Similar to YOLOv3, the target detection method of this embodiment is based on anchor frames, so selecting the most appropriate anchor sizes greatly helps the network converge. Feature maps of different sizes are output for targets of different scales,
F_out ∈ R^(S×S×C)
where S ∈ {13, 26, 52}. The output feature map is divided into S × S grid cells, each cell being responsible for predicting the category and confidence of an object whose center falls into that cell; by offsetting the shape and position of the anchor frame, the rectangular frame is made to fit the target better. The detected target frames are filtered with a non-maximum suppression algorithm, leaving the target frame with the highest confidence. For each anchor frame, target detection occupies the first 6 dimensions of the output tensor, corresponding respectively to {category, confidence, x, y, w, h}. In this embodiment, a K-means algorithm is used to cluster all ground-truth target frames in the data set, finally obtaining 9 anchor sizes for the different scales (a clustering sketch is given after Table 1): {[333,151], [363,173], [353,216], [261,127], [148,245], [340,119], [129,56], [175,82], [231,99]}. Different target detection algorithms are run on the data set of this embodiment, and the specific results are shown in Table 1.
TABLE 1
[Table 1 is provided only as an image in the original document.]
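As a sketch of the anchor-size selection described just before Table 1, the ground-truth box sizes can be clustered into 9 anchors with k-means. Whether the embodiment uses plain Euclidean distance on (width, height) or an IoU-based distance is not stated, so this sketch assumes standard k-means via scikit-learn; the function name and sorting by area are illustrative choices.

```python
# Illustrative anchor clustering on (width, height) pairs.
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(box_sizes, n_anchors=9, seed=0):
    """box_sizes: (N, 2) array of ground-truth box widths and heights in pixels."""
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed).fit(box_sizes)
    centers = km.cluster_centers_
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sort anchors by area

sizes = np.random.default_rng(0).uniform(30, 400, size=(500, 2))  # stand-in for real labels
anchors = cluster_anchors(sizes)   # 9 (w, h) pairs, smallest to largest
```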
For MultiNet, this embodiment uses only its target detection branch and ignores the other tasks. It can be seen that the network inference speed of fishernet in this embodiment is higher than MultiNet, CenterNet and Faster R-CNN, but lower than YOLOv5s and YOLOv3, because YOLOv5s uses a lightweight network design and this embodiment adds some parameters on top of YOLOv3. The AP of the framework of this embodiment is second only to CenterNet, which is a reasonably good result. Although the architecture of this embodiment is rebuilt with YOLOv3 as the baseline, its score is higher than that of YOLOv3, which is believed to be due to the mutual influence among the multiple tasks.
b) Pose estimation
There is little existing research on fish motion states based on fish keypoints, so this embodiment defines 6 keypoints on the fish body by itself: the mouth, the upper gill, the lower gill, the tail center, the upper tail and the lower tail are used to define the pose of a fish. The location and number of keypoints can be changed as needed.
TABLE 2
[Table 2 is provided only as an image in the original document.]
There are few schemes for identifying keypoints on animals, so this embodiment adapts several frameworks with some modifications to fit its data set; Table 2 shows some experimental results. Unlike methods that perform keypoint detection based on heatmaps, this embodiment localizes keypoints by vector regression. This embodiment ultimately achieves 81.3% AP on the data set, which is still comparable to the heatmap-based methods. The main reason, as analyzed in this embodiment, is the following: the normalization used when computing the keypoint vectors takes the diagonal length of the predicted target frame as the reference, so the quality of the predicted frame from target detection directly affects the regression result of the keypoint vectors.
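As a concrete illustration of this vector-regression coding, the following sketch encodes and decodes keypoints relative to the predicted box center and diagonal, matching the normalization described in the method section above; the function names and exact formula are assumptions for illustration rather than code from the embodiment.

```python
# Illustrative keypoint vector encoding/decoding around the predicted box.
import numpy as np

def encode_keypoints(keypoints, box):
    """keypoints: (k, 2) absolute (x, y); box: (x1, y1, x2, y2)."""
    center = np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])
    diag = np.hypot(box[2] - box[0], box[3] - box[1])
    return (keypoints - center) / diag            # normalized pose vectors

def decode_keypoints(vectors, box):
    """Invert the encoding: absolute keypoints from predicted vectors and the box."""
    center = np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])
    diag = np.hypot(box[2] - box[0], box[3] - box[1])
    return center + vectors * diag

box = np.array([50.0, 40.0, 250.0, 120.0])
kps = np.array([[60.0, 80.0], [240.0, 85.0]])
assert np.allclose(decode_keypoints(encode_keypoints(kps, box), box), kps)
```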
c) Instance segmentation
This embodiment accomplishes instance segmentation by placing an instance in a polar coordinate system and predicting the vertices of its contour polygon. Because the polar coordinate system is divided into angle blocks with a fixed step size, the upper limit on the number of polygon edges is the number of blocks, so choosing a suitable step size helps improve the fineness of the segmented boundary. To address this, this embodiment collects statistics on the number of edges of the mask polygons in the whole data set; the statistical result is shown in Fig. 4. It can be seen that most polygons have around 20 edges, so this embodiment considers that setting the step size to 15°, i.e. generating 24 angle blocks, satisfies the requirements of this work. Instance segmentation in this embodiment therefore occupies 3 × (360 / 15) = 72 channels.
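For concreteness, the sketch below shows how one per-scale, per-anchor output tensor could be split along the channel axis under the channel budget just described: 6 detection channels, 2 × 6 = 12 pose channels and 3 × 24 = 72 segmentation channels. The function name and tensor layout are assumptions for illustration, not taken from the patent.

```python
# Illustrative per-anchor channel layout for the three fused tasks.
import numpy as np

def split_multitask_channels(feature_map, num_keypoints=6, num_angle_blocks=24):
    """feature_map: array of shape (S, S, C) for one anchor at one scale."""
    det_c = 6                       # {category, confidence, x, y, w, h}
    pose_c = 2 * num_keypoints      # one (dx, dy) vector per keypoint
    seg_c = 3 * num_angle_blocks    # {distance, angle offset, confidence} per block
    assert feature_map.shape[-1] == det_c + pose_c + seg_c
    detection = feature_map[..., :det_c]
    pose = feature_map[..., det_c:det_c + pose_c]
    segmentation = feature_map[..., det_c + pose_c:]
    return detection, pose, segmentation

# Example: S = 13 grid, 6 keypoints, 24 angle blocks -> C = 6 + 12 + 72 = 90
det, pose, seg = split_multitask_channels(np.zeros((13, 13, 90)))
```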
TABLE 3
[Table 3 is provided only as an image in the original document.]
The most representative pixel-classification-based method, Mask R-CNN, and three typical contour-based methods are selected for comparison in this embodiment. As can be seen from Table 3, on the data set of this embodiment the method achieves 46.7% AP; compared to Mask R-CNN, this embodiment lags behind on large targets, which may be because the contours of large targets are complicated and have too many edges, so the polygon contour cannot be refined. This embodiment reaches almost the same level as PolarMask, currently the best contour-based approach, which was unexpected.
Table 4 shows the parameter counts and GFLOPs of the model of this embodiment when executing different tasks, compared with YOLOv3. It can be seen that, compared to YOLOv3, this embodiment increases the parameter count by only 5.1%, and compared to the single target-detection version of fishernet, it increases the parameter count by only 0.69% to realize multi-task learning; these additional parameters are almost negligible. The model of this embodiment can therefore perform real-time inference at high speed. Fig. 5 shows the visualized results of the final detection by the model of this embodiment.
TABLE 4
[Table 4 is provided only as an image in the original document.]
In order to compare the advantages of the multi-task network over a serial structure, this embodiment combines algorithms with good performance on each task. As can be seen from Table 5, after the YOLOv5 target detection algorithm, the HRNet pose estimation algorithm and the PolarMask instance segmentation algorithm are combined in series, the parameter count of the model increases greatly, the inference time increases greatly, and the inference frame rate drops.
TABLE 5
[Table 5 is provided only as an image in the original document.]
This embodiment provides a multi-task network architecture that can be trained end to end and can perform target detection, pose estimation and instance segmentation efficiently and at high speed. This embodiment produces a multi-task data set of fish and tests the framework on this data set. Compared with other algorithms, the method of this embodiment achieves an excellent detection effect while maintaining high real-time performance, reaching a speed of 63.3 FPS on an NVIDIA Tesla V100. This work shows that a multi-task network can achieve a good detection effect using only one prediction branch, and subsequent research is expected to extend the method to achieve stronger performance in more fields.

Claims (5)

1. A fish visual identification method based on multi-task fusion, characterized in that:
a fully convolutional network is constructed; Darknet53 is used as the encoder to encode images, a feature pyramid at the neck of the network performs context feature fusion for targets of different scales, and a detection head of the corresponding scale serves as the decoder for targets of each scale; for each output tensor, the channels are divided among three tasks and used for target detection, pose estimation and instance segmentation respectively;
the target detection adopts an anchor-based one-stage target detection method; feature maps of different sizes containing predicted target frame information are output for targets of different scales, and the predicted target frame is rectangular;
the pose estimation expresses a pose state with a plurality of keypoints; the center point of a single rectangular target frame obtained by target detection is used, and the pose is estimated by predicting the vectors pointing from the center point to each keypoint;
and the instance segmentation uses the center point of the predicted target frame as the origin of polar coordinates, and obtains the specific position by predicting the angle of a vertex of the target contour polygon and the distance between the vertex and the origin, thereby determining the mask of a single instance.
2. The fish visual identification method based on multi-task fusion according to claim 1, wherein the anchor-based one-stage target detection method is specifically as follows: feature maps of different sizes are output for targets of different scales,
F_out ∈ R^(S×S×C)
where C is the number of channels occupied by the output feature map, corresponding respectively to {category, confidence, x, y, w, h}: the category denotes the kind of predicted object, the confidence is the prediction probability given by the network for the current predicted object, x and y are the offsets of the center of the predicted object relative to the center point of the grid cell, and w and h are the size offsets of the predicted target frame relative to the anchor frame; the decoder divides the output feature map into S × S grid cells, each cell being responsible for predicting an object whose center point falls into that cell, and by offsetting the shape and position of the anchor frame the rectangular frame is made to fit the target better; all detected predicted target frames are then filtered with a non-maximum suppression algorithm, leaving the predicted target frame with the highest confidence for each object.
3. The fish visual identification method based on multi-task fusion according to claim 1, wherein the pose estimation encodes the k keypoint coordinates as
Y = {y_1, y_2, ..., y_k}
where i ∈ {1, ..., k} and y_i denotes the absolute position coordinates (x, y) of the i-th keypoint; keypoint detection occupies N_pose = 2 × k channels; first, the center point b of the predicted target frame is calculated from the predicted target frame obtained by target detection, b = ((x_min + x_max) / 2, (y_min + y_max) / 2), where (x_min, y_min) and (x_max, y_max) are the corners of the predicted target frame; the diagonal length diag of the predicted target frame is then calculated, and each pose vector N(y_i; b) is:
N(y_i; b) = (y_i − b) / diag
after normalization, the keypoint detection limits the output range to [0, 1], and pose estimation is performed by predicting the vectors pointing from the center point of the target frame to each keypoint.
4. The fish visual identification method based on multi-task fusion according to claim 1, characterized in that the instance segmentation uses the center point of the predicted target frame as the origin of polar coordinates and expresses a vertex of the mask polygon by an angle and an intercept; the angle and intercept are calculated as follows: with a fixed step size Stride ∈ (0, 360], the coordinate plane is divided into 360 / Stride angle blocks, each angle block defining an interval start angle a = N × Stride; if a vertex of the polygon falls into a certain angle block, the channels of that block are responsible for predicting the distance from the vertex to the origin, the offset of the vertex angle relative to the interval start angle, and the confidence of the vertex; for each block, each vertex is represented in the output channels by three parameters: distance, angle offset and confidence; for each instance, when the step size is S the network can predict at most 360 / S vertices.
5. The fish visual identification method based on multi-task fusion according to claim 1, further comprising a loss function for calculating the difference between the predicted values and the true values, as follows:
L = Σ_{i=1}^{G_w × G_h} Σ_{j=1}^{n_a} q_{i,j} [ l_obj(i,j) + l_pose(i,j) + l_seg(i,j) ]
where l_obj(i,j) is the loss of the target detection task, l_pose(i,j) is the loss of the pose estimation task, l_seg(i,j) is the loss of the instance segmentation, q_{i,j} is a constant indicating whether the current anchor frame contains a target, G_w × G_h denotes the current grid size, and n_a is the number of anchor frames for the current target; the loss of the target detection task is defined as:
l_obj(i,j) = l_1(i,j) + l_2(i,j) + l_3(i,j) + l_4(i,j)
where l_1(i,j) is the loss of the predicted target frame center, computed with a binary cross entropy over the center position (x_{i,j}, y_{i,j}) of the target frame; l_2(i,j) is the loss of the predicted target frame size, computed from the width and height w_{i,j}, h_{i,j} of the target frame and the width and height of the current j-th anchor frame; l_3(i,j) is the confidence loss, computed from the target frame confidence predicted by the network; l_4(i,j) is the classification loss, where C is the total number of classes, C_{i,j,k} is the predicted target class, and ψ(·,·) is a cross entropy loss function; l_pose(i,j) is computed with a mean square error loss function φ(·,·) over the coordinate positions P_{i,j,k} of the n_p keypoints; and l_seg(i,j) is computed over the v divided angle blocks, where diag is the diagonal length of the predicted target frame, α_{i,j,k} is the distance from a vertex to the origin, β_{i,j,k} is the offset of the vertex angle relative to the start of its angle interval, and γ_{i,j,k} is the confidence of the current vertex. [The detailed formulas for l_1 to l_4, l_pose and l_seg are given only as images in the original document.]
CN202210415517.1A 2022-04-20 2022-04-20 Fish visual identification method based on multi-task fusion Pending CN114842215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210415517.1A CN114842215A (en) 2022-04-20 2022-04-20 Fish visual identification method based on multi-task fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210415517.1A CN114842215A (en) 2022-04-20 2022-04-20 Fish visual identification method based on multi-task fusion

Publications (1)

Publication Number Publication Date
CN114842215A true CN114842215A (en) 2022-08-02

Family

ID=82565544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210415517.1A Pending CN114842215A (en) 2022-04-20 2022-04-20 Fish visual identification method based on multi-task fusion

Country Status (1)

Country Link
CN (1) CN114842215A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024087574A1 (en) * 2022-10-27 2024-05-02 中国科学院空天信息创新研究院 Panoptic segmentation-based optical remote-sensing image raft mariculture area classification method
CN117132870A (en) * 2023-10-25 2023-11-28 西南石油大学 Wing icing detection method combining CenterNet and mixed attention
CN117132870B (en) * 2023-10-25 2024-01-26 西南石油大学 Wing icing detection method combining CenterNet and mixed attention

Similar Documents

Publication Publication Date Title
Adarsh et al. YOLO v3-Tiny: Object Detection and Recognition using one stage improved model
Liu et al. Affinity derivation and graph merge for instance segmentation
US10262218B2 (en) Simultaneous object detection and rigid transform estimation using neural network
CN114842215A (en) Fish visual identification method based on multi-task fusion
CN108960059A (en) A kind of video actions recognition methods and device
CN113011568B (en) Model training method, data processing method and equipment
CN107146219B (en) Image significance detection method based on manifold regularization support vector machine
Cheng et al. Cascaded non-local neural network for point cloud semantic segmentation
CN111899278B (en) Unmanned aerial vehicle image rapid target tracking method based on mobile terminal
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
Milioto et al. Fast instance and semantic segmentation exploiting local connectivity, metric learning, and one-shot detection for robotics
Li et al. MVF-CNN: Fusion of multilevel features for large-scale point cloud classification
CN114049621A (en) Cotton center identification and detection method based on Mask R-CNN
Károly et al. Optical flow-based segmentation of moving objects for mobile robot navigation using pre-trained deep learning models
Alsanad et al. Real-time fuel truck detection algorithm based on deep convolutional neural network
Ahmed et al. Robust Object Recognition with Genetic Algorithm and Composite Saliency Map
CN114529832A (en) Method and device for training preset remote sensing image overlapping shadow segmentation model
KR20110037184A (en) Pipelining computer system combining neuro-fuzzy system and parallel processor, method and apparatus for recognizing objects using the computer system in images
CN118212572A (en) Road damage detection method based on improvement YOLOv7
Sinha et al. Human activity recognition from UAV videos using a novel DMLC-CNN model
CN104778683A (en) Multi-modal image segmenting method based on functional mapping
CN116452599A (en) Contour-based image instance segmentation method and system
Ito et al. Point proposal based instance segmentation with rectangular masks for robot picking task
Cui et al. PVF-NET: Point & voxel fusion 3D object detection framework for point cloud
Dao et al. Attention-based proposals refinement for 3D object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination