CN114842215A - Fish visual identification method based on multi-task fusion - Google Patents
Fish visual identification method based on multi-task fusion
- Publication number
- CN114842215A (application CN202210415517.1A)
- Authority
- CN
- China
- Prior art keywords
- target
- predicted
- frame
- vertex
- angle
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/80—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
- Y02A40/81—Aquaculture, e.g. of fish
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a fish visual identification method based on multi-task fusion, which uses an efficient multi-task network to perform target detection, instance segmentation and pose estimation in parallel, raising the inference speed to a real-time level. Compared with the baseline, the network adds only a negligible number of parameters while exploiting the network's capacity for parallel multi-task processing, so it can easily be deployed in practical aquaculture application scenarios. The invention provides the idea of predicting with a single decoder branch in a multi-task network, and offers a heuristic for fused encoding among multiple tasks.
Description
Technical Field
The invention belongs to the field of target detection, and particularly relates to a fish visual identification method based on multi-task fusion.
Background
In the aquaculture process, detecting the physiological state and movement behavior of fish helps realize precise aquaculture. The physiological state and movement behavior can be obtained from the body length, body weight and movement posture of the fish. Target detection, key point detection and instance segmentation in the field of computer vision can provide accurate predicted target frames, masks and key point skeleton information: the predicted target frame obtained by target detection accurately locates the fish, the mask obtained by instance segmentation yields the fish contour, and the movement posture of the fish can be judged from the key point information.
With the development of deep learning, many excellent algorithms can accomplish these tasks. For target detection there are the YOLO series [Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018)] and the SSD series [Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016]. For instance segmentation, Mask R-CNN [He, Kaiming, et al. "Mask R-CNN." Proceedings of the IEEE International Conference on Computer Vision. 2017] and SOLO [Wang, Xinlong, et al. "SOLO: Segmenting objects by locations." European Conference on Computer Vision. Springer, Cham, 2020] follow a pixel-classification paradigm that can in theory segment all instance pixels and therefore achieves good results, but the inference process generates a large number of parameters and loses real-time performance. For pose estimation, deep neural networks are used for human pose estimation: localization is performed by regressing the coordinate positions of key points, and the predictions are refined with multi-scale information. Since key points are sparsely distributed, the current mainstream framework HRNet [Sun, Ke, et al. "Deep high-resolution representation learning for human pose estimation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019] locates key points with heatmaps, but the heatmaps must be up-sampled to the original image size, which again generates a large number of parameters. While these frameworks are excellent individually, running the three functions serially is inefficient and wastes computing resources. Dense instances occur in the aquaculture process, which places high demands on the real-time performance and resource occupation of the system, and a multi-task network has the following characteristics: (a) it completes multiple prediction tasks at once, saving inference time and computational power, and (b) multiple task branches share one encoder, which helps deepen feature extraction and thus improves the performance of each task. A multi-task framework is therefore well suited to the aquaculture process.
Few researchers have worked on multi-task networks in fisheries, but in the field of autonomous driving multi-task networks have been widely applied to panoramic perception [Teichmann, Marvin, et al. "MultiNet: Real-time joint semantic reasoning for autonomous driving." 2018 IEEE Intelligent Vehicles Symposium (IV)]. MultiNet realizes scene classification, target detection and semantic segmentation of the drivable region with one encoder and three decoder branches and achieves real-time detection, but the framework is only suitable for autonomous driving. LSNet [Duan, Kaiwen, et al. "Location-sensitive visual recognition with cross-IOU loss." arXiv preprint arXiv:2104.04899 (2021)] unifies the encoding of position-sensitive tasks so that object detection, instance segmentation and pose estimation share one framework, but the three tasks cannot be performed in parallel.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a fish visual identification method based on multi-task fusion, which can simultaneously realize target detection, instance segmentation and pose estimation with an efficient and simple multi-task network; the network uses only one encoder and one decoder branch per detection scale, and keeps the detection time and computational cost within a reasonable range.
In order to realize higher accuracy and speed, the technical scheme of the invention is as follows:
a fish visual identification method based on multi-task fusion constructs a full convolution network, adopts Darknet53 as an encoder to encode images, uses a characteristic pyramid at the neck of the network to perform context characteristic fusion of targets with different scales, and uses a detection head with a specified scale as a decoder aiming at the targets with different scales; so that the output layer can fuse more levels of feature information. For each output tensor, dividing channels according to three tasks, and respectively using the channels for target detection, attitude estimation and instance segmentation;
the target detection adopts a one-stage target detection method based on an anchor frame, characteristic graphs containing information of a predicted target frame in different sizes are output according to targets in different scales, and the predicted target frame is rectangular;
the attitude estimation is to use a plurality of key points to express an attitude state, adopt a single predicted target frame center point obtained by target detection and carry out attitude estimation through vectors of the predicted center point pointing to each key point;
and the example segmentation is to determine the mask of a single example by taking the central point of the predicted target frame as the origin of polar coordinates, determining the initial position of the contour point through predicting the angle interval where the vertex of the target contour polygon is located, and predicting the offset between the predicted position and the reference angle to correct to obtain a specific position.
Further, the anchor-based one-stage target detection method is specifically as follows: feature maps of size S × S × C are output for targets of different scales, where C is the number of channels occupied by the output feature map and corresponds to {category, confidence, x, y, w, h}; the category represents the kind of predicted object, the confidence is the prediction probability given by the network for the current predicted object, x and y are the offsets of the predicted object center relative to the grid cell center, and w and h are the size offsets of the object's predicted target frame relative to the anchor frame. The decoder divides the output feature map into S × S grid cells, each of which is responsible for predicting an object whose center point falls into it; by offsetting the shape and position of the anchor frame, the rectangular frame is made to fit the target better. All detected predicted target frames are then filtered with a non-maximum suppression algorithm, leaving the predicted target frame with the highest confidence for each object.
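As a sketch of how a grid prediction might be decoded into a rectangular frame and then filtered, assuming YOLO-style sigmoid/exponential offset transforms (the text states the method is similar to YOLOv3 but does not reproduce the exact transform):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(t, grid_x, grid_y, anchor_w, anchor_h, S, img_size=416):
    """Decode one predicted box (tx, ty, tw, th) from grid offsets, YOLO-style."""
    tx, ty, tw, th = t
    cell = img_size / S
    cx = (grid_x + sigmoid(tx)) * cell            # center offset within the grid cell
    cy = (grid_y + sigmoid(ty)) * cell
    w = anchor_w * np.exp(tw)                     # anchor scaled by the size offset
    h = anchor_h * np.exp(th)
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thr=0.5):
    """Keep only the highest-confidence box per object."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        ious = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[ious < iou_thr]
    return keep
```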
Further, the pose estimation encodes the k key point coordinates as {y_i}, where i ∈ {1, ..., k} and y_i denotes the absolute position coordinates (x, y) of the i-th key point; the key point detection occupies N_pose = 2 × k channels. First, the center point c of the predicted target frame obtained by target detection is calculated, and then the diagonal length diag of the predicted target frame is calculated; each pose vector N(y_i; b) is:

N(y_i; b) = (y_i − c) / diag
after normalization, the key point detection limits the output result range to [0,1], and attitude estimation is carried out by predicting the vector of the center point of the target frame pointing to each key point. And the key point detection and the target detection task are directly coupled, so that the performance of the two tasks is mutually improved.
Further, for the instance segmentation, the center point of the predicted target frame is used as the origin of polar coordinates, and each vertex of the mask polygon is expressed as an angle and an intercept. The angle and intercept are calculated as follows: with a fixed step size Stride ∈ (0, 360], the polar angle axis is divided into 360/Stride angle blocks, and angle block N defines an interval start angle a = N × Stride. If a vertex of the polygon falls into an angle block, the channels of that block are responsible for predicting the distance from the vertex to the origin, the offset of the vertex angle relative to the interval start angle, and the confidence of the vertex; for each block, each vertex is therefore represented in the output channels by three parameters: distance, angle offset and confidence. For each instance, when the step size is S, the network can predict at most 360/S vertices.
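The following sketch decodes such per-angle-block predictions back into polygon vertices; the confidence threshold and the normalization of the distance by the box diagonal are assumptions made for illustration (the diagonal appears in the segmentation loss, but the exact storage format of the distance channel is not reproduced in the text).

```python
import numpy as np

def decode_polygon(seg_pred, box, stride_deg=15, conf_thr=0.5):
    """Recover mask polygon vertices from per-angle-block predictions.

    seg_pred: (v, 3) array of (distance, angle_offset, confidence) per block,
    with v = 360 / stride_deg blocks; distance assumed normalized by the box diagonal.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0            # polar origin = box center
    diag = np.hypot(x2 - x1, y2 - y1)
    vertices = []
    for n, (dist, offset, conf) in enumerate(np.asarray(seg_pred, dtype=float)):
        if conf < conf_thr:                              # block contains no vertex
            continue
        angle = np.deg2rad(n * stride_deg + offset)      # interval start + predicted offset
        r = dist * diag
        vertices.append((cx + r * np.cos(angle), cy + r * np.sin(angle)))
    return vertices
```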
Further, a loss function for calculating the difference between the predicted values and the true values is also included, as follows:

L = Σ_{i=1..G_w×G_h} Σ_{j=1..n_a} q_{i,j} [ l_obj(i,j) + l_pose(i,j) + l_seg(i,j) ]

where l_obj(i,j) is the loss of the target detection task, l_pose(i,j) is the loss of the pose estimation task, l_seg(i,j) is the loss of the instance segmentation, q_{i,j} is a constant indicating whether the current anchor frame contains a target, G_w, G_h indicate the current grid position, and n_a denotes the anchor frame of the current target; the task-specific losses are defined as follows:
l_obj(i,j) = l_1(i,j) + l_2(i,j) + l_3(i,j) + l_4(i,j)
where l_1(i,j) is the loss of the predicted target frame center, l_2(i,j) is the loss of the predicted target frame size, l_3(i,j) is the prediction confidence loss, and l_4(i,j) is the prediction class loss:
where w_{i,j}, h_{i,j} are the width and height of the target frame, and w^a_j, h^a_j are the width and height of the current j-th anchor frame;
where C is the total number of classes, C_{i,j,k} is the predicted target class, and ψ(·,·) is the cross-entropy loss function;
where n_p is the number of key points, P_{i,j,k} is the coordinate position of a key point, and φ(·,·) is the mean squared error loss function;
where v is the number of angle blocks, diag is the diagonal length of the predicted target frame, α_{i,j,k} is the distance from the vertex to the origin, β_{i,j,k} is the offset of the vertex angle relative to the start of its angular interval, and γ_{i,j,k} is the confidence of the current vertex.
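A structural sketch of such a multi-task loss is given below. Because the per-term formulas are only described symbolically in the text, the concrete choices here (mean squared error for the box center, size, key point vectors and contour terms, binary cross-entropy for the confidence, cross-entropy for the class) are assumptions, and the tensor names are hypothetical.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

def bce(p, t, eps=1e-9):
    p = np.clip(p, eps, 1 - eps)
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)))

def ce(probs, label, eps=1e-9):
    return float(-np.log(np.clip(probs[int(label)], eps, 1.0)))

def total_loss(pred, target, q):
    """Aggregate multi-task loss over grid cells i and anchors j.

    pred/target: dicts of arrays indexed by (i, j), e.g. pred["xy"][i, j] -> (2,),
    pred["cls"][i, j] -> (C,) class probabilities; q[i, j] = 1 if the anchor
    holds a target, else 0.
    """
    loss = 0.0
    for (i, j), has_obj in np.ndenumerate(q):
        if not has_obj:
            continue
        l_obj = (mse(pred["xy"][i, j], target["xy"][i, j])        # l1: box center
                 + mse(pred["wh"][i, j], target["wh"][i, j])      # l2: box size
                 + bce(pred["conf"][i, j], target["conf"][i, j])  # l3: confidence
                 + ce(pred["cls"][i, j], target["cls"][i, j]))    # l4: class
        l_pose = mse(pred["kpt"][i, j], target["kpt"][i, j])      # key point vectors
        l_seg = mse(pred["poly"][i, j], target["poly"][i, j])     # contour parameters
        loss += l_obj + l_pose + l_seg
    return loss
```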
The invention has the advantage of providing an efficient multi-task network that can perform target detection, instance segmentation and pose estimation in parallel, with the inference speed raised to a real-time level. Compared with the baseline, the network adds only a negligible number of parameters while exploiting the network's capacity for parallel multi-task processing, so it can easily be deployed in practical aquaculture application scenarios. The invention provides the idea of predicting with a single decoder branch in a multi-task network, and offers a heuristic for fused encoding among multiple tasks.
Drawings
FIG. 1 is a diagram of the visual tasks of the present invention for fish monitoring: (a) target image, (b) target detection, (c) pose estimation, and (d) instance segmentation.
Fig. 2 is a network architecture framework diagram of the present invention.
FIG. 3 is a schematic diagram of instance segmentation according to an embodiment of the present invention.
FIG. 4 is a statistical graph of edge counts of mask polygons in accordance with an embodiment of the present invention.
FIG. 5 shows the detection results for various fishes according to the embodiment of the present invention: (a) target detection result, (b) pose estimation result, (c) instance segmentation result, and (d) multi-task result.
Detailed Description
The technical solution of the present invention will be further described in detail with reference to the accompanying drawings.
In order to verify the performance of the network, this embodiment constructs a fish multi-task data set with high-quality annotations, containing manually annotated ground-truth target frames, key points and mask labels; the subsequent operations are performed on this data set. This embodiment tries not only an end-to-end training strategy but also the influence of an alternating training paradigm on the detection accuracy of the multiple tasks. Experiments show that the idea of this embodiment is effective and efficient. Through verification, this embodiment finally obtains an average precision of 95.3% for target detection and 53.9% for instance segmentation; for pose estimation it reaches an average precision of 95.1% in terms of object keypoint similarity, with an inference speed of 66.3 fps (on an NVIDIA Tesla V100). All of these tasks add only 0.69% more parameters compared with the baseline model.
Embodiment:
in the embodiment, 2.6k fish images are collected and are manually labeled, the labeling result follows the format of MS-COCO, the labeling content comprises a real target frame, an instance segmentation mask polygon and the coordinates of the attitude estimation key points.
This embodiment trains the framework on an NVIDIA Tesla V100 and tries combinations of different hyper-parameters and optimizers to find the most suitable training method. The input images are uniformly resized to 416 × 416 for inference, and a Darknet53 model pre-trained on ImageNet is used as the initialization weights. Average precision (AP) is used to measure model performance, with the calculation standard consistent with MS-COCO: for object detection the AP is computed over intersection-over-union (IoU) thresholds between the predicted box and the ground truth from 0.5 to 0.95, for instance segmentation the mask IoU is used, and for pose estimation the object keypoint similarity (OKS) is used.
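For reference, a minimal sketch of the OKS metric in the COCO style is given below; the per-keypoint falloff constants for fish key points are not given in the text, so the value used here is a placeholder assumption.

```python
import numpy as np

def oks(pred_kpts, gt_kpts, area, kappa=None):
    """Object keypoint similarity (OKS) between predicted and true key points.

    pred_kpts, gt_kpts: (k, 2) arrays of (x, y); area: object area (the s^2 scale
    in the COCO definition). kappa: per-keypoint falloff constants (placeholder).
    """
    pred_kpts = np.asarray(pred_kpts, dtype=float)
    gt_kpts = np.asarray(gt_kpts, dtype=float)
    k = gt_kpts.shape[0]
    kappa = np.full(k, 0.1) if kappa is None else np.asarray(kappa, dtype=float)
    d2 = np.sum((pred_kpts - gt_kpts) ** 2, axis=1)          # squared distances
    return float(np.mean(np.exp(-d2 / (2.0 * max(area, 1e-9) * kappa ** 2))))
```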
a) Target detection
Similar to YOLOv3, the target detection method of this embodiment is anchor-based, so selecting the most appropriate anchor sizes greatly helps network convergence. Feature maps of size S × S × C are output for targets of different scales, where S ∈ {13, 26, 52}; the output feature map is divided into S × S grid cells, each of which is responsible for predicting the category and confidence of an object whose center falls into it, and the shape and position of the anchor frame are offset so that the rectangular frame better fits the target. The detected target frames are filtered with a non-maximum suppression algorithm, leaving the frame with the highest confidence; for each anchor frame, the target detection occupies the first 6 dimensions of the output tensor, corresponding to {category, confidence, x, y, w, h}. In this embodiment, the K-means algorithm is used to cluster all ground-truth target frames in the data set, finally obtaining 9 anchor sizes for the different scales: {[333,151], [363,173], [353,216], [261,127], [148,245], [340,119], [129,56], [175,82], [231,99]}. Different target detection algorithms were run on the data set of this embodiment; the specific results are shown in Table 1.
TABLE 1
The MultiNet implementation here uses only the target detection branch and ignores the other tasks. It can be seen that the network inference speed of FisherNet in this embodiment is higher than MultiNet, CenterNet and Faster R-CNN, but lower than YOLOv5s and YOLOv3, because YOLOv5s uses a lightweight network design and this embodiment adds some parameters on top of YOLOv3. The AP of the framework of this embodiment is second only to CenterNet, which is a reasonably good result. Although the architecture of this embodiment is reconstructed with YOLOv3 as the baseline, its score is higher than that of YOLOv3, which is considered to be caused by the mutual influence among the multiple tasks.
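As a minimal illustration of the anchor clustering described before Table 1, the following sketch clusters ground-truth (w, h) pairs with plain Euclidean K-means; a YOLOv3-style IoU-based distance is a common alternative, but the embodiment only states that K-means is used.

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster ground-truth box sizes (w, h) into k anchor sizes."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # assign every box to its nearest anchor center
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([wh[assign == c].mean(axis=0) if np.any(assign == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return np.round(centers).astype(int)
```

The 9 cluster centers can then be sorted by area and assigned three per output scale, matching the three detection heads.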
b) Pose estimation
There is little research on the motion state of fish using fish key points, so this embodiment defines 6 key points on the fish body by itself: the mouth, upper gill, lower gill, tail center, upper tail and lower tail, which together define the pose of a fish. The location and number of key points may be varied if desired.
TABLE 2
There are few frameworks for identifying key points of animals, so this embodiment adapts several frameworks with some modifications to fit its data set; Table 2 shows some experimental results. Unlike methods that perform key point detection based on heatmaps, this embodiment performs key point localization by vector regression. This embodiment ultimately achieves 81.3% AP on the data set, but the method is still comparable to the heatmap-based methods. The main reason, as analyzed in this embodiment, is as follows: the diagonal length of the predicted target frame is used as the reference in the normalization when calculating the key point vectors, so the quality of the predicted frame from target detection directly affects the regression result of the key point vectors.
c) Instance segmentation
This embodiment accomplishes instance segmentation by predicting the contour polygon vertices of an instance placed in a polar coordinate system. Because the polar coordinate system is divided into angle blocks with a fixed step size, the upper limit on the number of polygon sides is the number of blocks, so choosing a suitable step size helps improve the fineness of the segmented boundary. To address this, this embodiment counts the number of edges of the mask polygons in the whole data set; the statistics are shown in FIG. 4. Most of the polygons have around 20 sides, so this embodiment considers that setting the step size to 15°, i.e. generating 24 angle blocks, satisfies the requirements of its operation. Instance segmentation in this embodiment therefore occupies 3 × 24 = 72 channels.
TABLE 3
For comparison, this embodiment selects the most representative pixel-classification-based method, Mask R-CNN, and three typical contour-based methods. As can be seen from Table 3, on the data set of this embodiment the method achieves 46.7% AP; compared with Mask R-CNN it lags behind on large targets, which may be caused by the complicated contours and excessive numbers of edges of large targets, preventing the polygon contour from being refined. This embodiment reaches almost the same level as the currently best contour-based approach, PolarMask, which was unexpected.
Table 4 shows the parameter counts and GFLOPs of the model of this embodiment when executing the different tasks, compared with YOLOv3. Compared with YOLOv3, this embodiment increases the parameter count by only 5.1%, and compared with the single-task target detection version of FisherNet it adds only 0.69% more parameters to realize multi-task learning; these increases are almost negligible. Therefore, the model of this embodiment can perform real-time inference at high speed. FIG. 5 shows the visualization of the model's final detection results.
TABLE 4
In order to compare the advantages of the multi-task network over a serial structure, this embodiment selects and combines algorithms that perform well on each task. As can be seen from Table 5, after the YOLOv5 target detection algorithm, the HRNet pose estimation algorithm and the PolarMask instance segmentation algorithm are combined in series, the parameter count of the model increases greatly, the inference time increases greatly, and the inference frame rate drops.
TABLE 5
This embodiment provides a multi-task network architecture that can be trained end to end and performs target detection, pose estimation and instance segmentation efficiently and at high speed. This embodiment produces a multi-task data set of fish and tests the framework on it. Compared with other algorithms, the method of this embodiment achieves excellent detection results while keeping high real-time performance, reaching 63.3 FPS on an NVIDIA Tesla V100. This work shows that a multi-task network can achieve good detection results using only one prediction branch, and subsequent research is expected to extend the method of this embodiment to achieve stronger performance in more fields.
Claims (5)
1. A fish visual identification method based on multi-task fusion is characterized in that,
a fully convolutional network is constructed: Darknet53 is used as the encoder to encode the image, a feature pyramid is used at the neck of the network to fuse contextual features of targets at different scales, and a detection head of a specified scale is used as the decoder for targets of that scale; for each output tensor, the channels are partitioned according to three tasks and used for target detection, pose estimation and instance segmentation respectively;
the target detection adopts an anchor-based one-stage detection method; feature maps of different sizes, containing the predicted target frame information, are output for targets of different scales, and the predicted target frame is rectangular;
the pose estimation expresses the pose state with a plurality of key points: using the center point of the single rectangular target frame obtained by target detection, the pose is estimated by predicting the vectors pointing from the center point to each key point;
and the instance segmentation uses the center point of the predicted target frame as the origin of polar coordinates, and obtains the specific position of each vertex by predicting the angle of the target contour polygon vertex and its distance from the origin, thereby determining the mask of a single instance.
2. The fish visual identification method based on multi-task fusion according to claim 1, wherein the anchor-based one-stage target detection method is specifically as follows: feature maps of size S × S × C are output for targets of different scales, where C is the number of channels occupied by the output feature map and corresponds to {category, confidence, x, y, w, h}; the category represents the kind of predicted object, the confidence is the prediction probability given by the network for the current predicted object, x and y are the offsets of the predicted object center relative to the grid cell center, and w and h are the size offsets of the object's predicted target frame relative to the anchor frame; the decoder divides the output feature map into S × S grid cells, each of which is responsible for predicting an object whose center point falls into it, and by offsetting the shape and position of the anchor frame, the rectangular frame is made to fit the target better; all detected predicted target frames are then filtered with a non-maximum suppression algorithm, leaving the predicted target frame with the highest confidence for each object.
3. The fish visual identification method based on multi-task fusion according to claim 1, wherein the pose estimation encodes the k key point coordinates as {y_i}, where i ∈ {1, ..., k} and y_i denotes the absolute position coordinates (x, y) of the i-th key point; the key point detection occupies N_pose = 2 × k channels; first, the center point c of the predicted target frame is calculated from the predicted target frame obtained by target detection, and then the diagonal length diag of the predicted target frame is calculated; each pose vector N(y_i; b) is:

N(y_i; b) = (y_i − c) / diag
after normalization, the key point detection limits the output result range to [0,1], and attitude estimation is carried out by predicting the vector of the center point of the target frame pointing to each key point.
4. The fish visual identification method based on multi-task fusion according to claim 1, wherein the instance segmentation uses the center point of the predicted target frame as the origin of polar coordinates and expresses each vertex of the mask polygon as an angle and an intercept; the angle and intercept are calculated as follows: with a fixed step size Stride ∈ (0, 360], the polar angle axis is divided into 360/Stride angle blocks, and angle block N defines an interval start angle a = N × Stride; if a vertex of the polygon falls into an angle block, the channels of that block are responsible for predicting the distance from the vertex to the origin, the offset of the vertex angle relative to the interval start angle, and the confidence of the vertex; for each block, each vertex is represented in the output channels by three parameters: distance, angle offset and confidence; for each instance, when the step size is S, the network can predict at most 360/S vertices.
5. The fish visual identification method based on multi-task fusion according to claim 1, further comprising a loss function for calculating the difference between the predicted values and the true values, as follows:

L = Σ_{i=1..G_w×G_h} Σ_{j=1..n_a} q_{i,j} [ l_obj(i,j) + l_pose(i,j) + l_seg(i,j) ]

where l_obj(i,j) is the loss of the target detection task, l_pose(i,j) is the loss of the pose estimation task, l_seg(i,j) is the loss of the instance segmentation, q_{i,j} is a constant indicating whether the current anchor frame contains a target, G_w, G_h indicate the current grid position, and n_a denotes the anchor frame of the current target; the task-specific losses are defined as follows:
l_obj(i,j) = l_1(i,j) + l_2(i,j) + l_3(i,j) + l_4(i,j)
where l_1(i,j) is the loss of the predicted target frame center, l_2(i,j) is the loss of the predicted target frame size, l_3(i,j) is the prediction confidence loss, and l_4(i,j) is the prediction class loss:
where w_{i,j}, h_{i,j} are the width and height of the target frame, and w^a_j, h^a_j are the width and height of the current j-th anchor frame;
where C is the total number of classes, C_{i,j,k} is the predicted target class, and ψ(·,·) is the cross-entropy loss function;
where n_p is the number of key points, P_{i,j,k} is the coordinate position of a key point, and φ(·,·) is the mean squared error loss function;
where v is the number of angle blocks, diag is the diagonal length of the predicted target frame, α_{i,j,k} is the distance from the vertex to the origin, β_{i,j,k} is the offset of the vertex angle relative to the start of its angular interval, and γ_{i,j,k} is the confidence of the current vertex.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210415517.1A CN114842215A (en) | 2022-04-20 | 2022-04-20 | Fish visual identification method based on multi-task fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210415517.1A CN114842215A (en) | 2022-04-20 | 2022-04-20 | Fish visual identification method based on multi-task fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114842215A (en) | 2022-08-02 |
Family
ID=82565544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210415517.1A Pending CN114842215A (en) | 2022-04-20 | 2022-04-20 | Fish visual identification method based on multi-task fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114842215A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024087574A1 (en) * | 2022-10-27 | 2024-05-02 | 中国科学院空天信息创新研究院 | Panoptic segmentation-based optical remote-sensing image raft mariculture area classification method |
CN117132870A (en) * | 2023-10-25 | 2023-11-28 | 西南石油大学 | Wing icing detection method combining CenterNet and mixed attention |
CN117132870B (en) * | 2023-10-25 | 2024-01-26 | 西南石油大学 | Wing icing detection method combining CenterNet and mixed attention |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |