CN117975572B - Fish behavior detection method based on machine vision - Google Patents

Fish behavior detection method based on machine vision

Info

Publication number
CN117975572B
CN117975572B (application CN202410369755.2A; earlier publication CN117975572A)
Authority
CN
China
Prior art keywords
module
global
video
fish
optimization module
Prior art date
Legal status
Active
Application number
CN202410369755.2A
Other languages
Chinese (zh)
Other versions
CN117975572A (en)
Inventor
董俊
闫家仁
刘晓晨
曹振杰
Current Assignee
Shandong Freshwater Fisheries Research Institute
Original Assignee
Shandong Freshwater Fisheries Research Institute
Priority date
Filing date
Publication date
Application filed by Shandong Freshwater Fisheries Research Institute filed Critical Shandong Freshwater Fisheries Research Institute
Priority to CN202410369755.2A
Publication of CN117975572A
Application granted
Publication of CN117975572B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81Aquaculture, e.g. of fish

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fish behavior detection method based on machine vision, and relates to the field of computer vision. The method comprises fish behavior video acquisition, fish behavior video annotation, construction of a local optimization module, a global optimization module, a multi-stage fusion module and an improved UniFormer module, construction of a fish behavior machine vision detection model, training of the model on the fish behavior data set, and real-time detection using the trained model. In the improved UniFormer module, the local optimization module first reduces local temporal redundancy effectively by inserting a local multi-head time module before the ViT blocks, the global optimization module then captures the complete spatio-temporal dependencies, and finally the multi-stage fusion module fuses the multi-stage, multi-level global semantic tokens into the final video representation.

Description

Fish behavior detection method based on machine vision
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a fish behavior detection method based on machine vision.
Background
The vision transformer (ViT) performs well on image tasks, which has motivated various studies to apply image ViTs to video tasks; however, the large gap between images and videos hinders the spatio-temporal learning of these image-pretrained models. For video object detection tasks, video-specific detection models such as UniFormer transfer more seamlessly to the video domain, but their unique architecture requires long image pretraining, which limits scalability. With the advent of powerful open-source image ViT detection architectures, their potential for video understanding can be unlocked by improving the UniFormer architecture.
Fish behavior detection is an important research area that involves identifying and analyzing fish behaviors from videos taken above or under water. Video quality directly affects the accuracy and reliability of detection: high-resolution, sharp and stable video provides more accurate data and better captures the various behaviors of fish. The capability of the detection algorithm is equally important; it must identify and classify different types of fish behavior and filter out interference and noise in complex environments. Video quality and algorithm performance therefore jointly determine the effectiveness of fish behavior detection.
Disclosure of Invention
The invention provides a fish behavior detection method based on machine vision. It improves on UniFormer by introducing an efficient structural design with local optimization, global optimization and multi-stage fusion, unlocking the potential of UniFormer for video understanding and achieving advanced performance in fish behavior video detection.
To this end, the fish behavior detection method based on machine vision comprises the following steps:
S1, fish behavior video acquisition: acquiring videos of 6 fish behaviors, namely foraging, evasion, social, tour, rest and reproduction behaviors, wherein the acquisition comprises underwater shooting and above-water shooting;
S2, fish behavior video annotation: annotating all fish behavior videos containing the 6 fish behaviors in a continuous-frame manner, so as to obtain a fish behavior data set;
S3, constructing a local optimization module, which comprises a local multi-head time module, a global multi-head space module and an FFN;
S4, constructing a global optimization module, which comprises a DPE, a global multi-head space-time module and an FFN;
S5, constructing a multi-stage fusion module, which fuses the outputs of the local optimization modules and the global optimization modules;
S6, constructing an improved UniFormer module, which comprises a 3D convolution module, a downsampling module, the local optimization module, the global optimization module and the multi-stage fusion module;
S7, constructing a fish behavior machine vision detection model, which comprises, in order, an input module, the improved UniFormer module, a detection head and an output module;
S8, training the fish behavior machine vision detection model on the fish behavior data set: training multiple times with an increasing number of epochs, using accuracy and recall as evaluation indicators, and selecting the model with the highest accuracy and recall as the fish behavior machine vision detection model;
S9, performing real-time detection with the fish behavior machine vision detection model: filming the area where the fish are located with a camera and feeding the fish behavior video into the model in real time to obtain real-time detection results.
Preferably, in step S1, a professional underwater camera is used for underwater shooting; some of the video shooting sites are waters with sufficient light, clear water and high transparency, and some are waters with turbid water where the fish outlines and motion trajectories are still clearly visible. For above-water shooting, some of the video shooting sites are waters with sufficient light, clear water and high transparency, some are clear, highly transparent waters shot at night with supplementary lighting equipment, and some are turbid waters where the fish outlines and motion trajectories are still clearly visible.
Preferably, in step S1, foraging behavior includes searching for food, predation and eating activities; evasion behavior includes rapidly moving away from other fish, abrupt turning and hiding; social behavior includes courtship, territorial competition, territorial defense and cooperative predation; tour behavior means regularly patrolling a specific area; rest behavior includes floating at rest, hiding beside shelter and resting on the bottom; and reproductive behavior includes mating, spawning and brooding.
Preferably, in step S3, the input fish behavior video is projected into 16L spatio-temporal tokens X_in using 3D convolution, where L is the product of the temporal, height and width dimensions of the input video and C is the number of video channels; X_in then undergoes 8x temporal downsampling, 2x spatial downsampling and position embedding. The local optimization module takes X_in as input and produces X_out through
X_T = LMT(Norm(X_in)) + X_in,
X_S = GMS(Norm(X_T)) + X_T,
X_out = FFN(Norm(X_S)) + X_S,
where + denotes element-wise addition, LMT denotes the local multi-head time module, GMS denotes the global multi-head space module, and the FFN consists of two linear projections separated by a GeLU. The Norm in LMT is Batch Norm, while the Norm in GMS and in the FFN is Layer Norm; GMS and the FFN come from the image-pretrained ViT blocks. LMT has a learnable parameter matrix of size T x 1 x 1 in the time dimension, representing the learnable relation, along the time dimension, between a token X_i and the other tokens X_j. GMS mainly focuses on the 1 x H x W tokens within a single frame and is computed per head as
GMS_n(X_i) = sum_j [ exp(Q_n(X_i)^T K_n(X_j)) / sum_j' exp(Q_n(X_i)^T K_n(X_j')) ] V_n(X_j),
where Q_n, K_n and V_n are different linear projections in the n-th head, exp denotes the exponential function and ^T denotes the matrix transpose.
Preferably, in step S4, the global optimization module takes the token sequence X from the local optimization module and a learnable query q as input and computes
X_C = DPE(X) + X,
X_ST = GMST(Norm(q), Norm(X_C)),
X_out = FFN(Norm(X_ST)) + X_ST,
where + denotes element-wise addition, DPE comes from UniFormer and denotes dynamic position embedding, the FFN consists of two linear projections separated by a GeLU, GMST denotes the global multi-head space-time module, and the Norm in GMST and in the FFN is Layer Norm. GMST is computed per head as
GMST_n(q) = sum_j A_n(q, X_j) V_n(X_j),
where the linear projection V_n converts X into a spatio-temporal context and the cross-affinity matrix A_n(q, X) models the dependency between q and all spatio-temporal tokens X, thereby converting the query q into a video representation. A_n is calculated as
A_n(q, X_j) = exp(Q_n(q)^T K_n(X_j)) / sum_j' exp(Q_n(q)^T K_n(X_j')),
where exp denotes the exponential function, ^T denotes the matrix transpose, and Q_n and K_n are linear projections implemented by linear layers.
Preferably, in step S5, for the multi-stage fusion module, the i-th global optimization module is written as G_i; from the tokens X_i of its local optimization module, the global optimization module converts the query q into a video token F_i. The video tokens F_i of all global optimization modules are fused sequentially, using the video token of the previous global optimization module as the query of the current one, i.e. F_i = G_i(F_{i-1}, X_i), so that for N global optimization modules a global video token F = F_N is finally obtained. A class token C is then extracted from the final local optimization module, and the class token and the global video token are added in the form of a weighted sum to obtain and output the final video representation Z = a*C + (1 - a)*F, where + denotes element-wise addition and a is a learnable parameter processed by a Sigmoid function.
Preferably, in step S6, for the improved UniFormer module, the input fish behavior video is projected into a number of spatio-temporal tokens by a 3D convolution, followed by 8x spatial downsampling, 2x temporal downsampling and position embedding. A local optimization module and a global optimization module are then constructed: the local optimization module consists, in order, of a local multi-head time module, a global multi-head space module and an FFN, and the global optimization module consists, in order, of a DPE, a global multi-head space-time module and an FFN. Several local optimization modules are used and connected in sequence, and a global optimization module is introduced on top of each local optimization module. Finally, a multi-stage fusion module is constructed: a class token is extracted from the last local optimization module, global video tokens are obtained from all global optimization modules, and the class token and the global video tokens are added as a weighted sum to obtain and output the final fish behavior video representation.
Compared with the prior art, the invention has the following technical effects:
The technical scheme provided by the invention proposes an improved UniFormer module that optimizes the performance of UniFormer on video object detection tasks and applies it to fish behavior detection. The module comprises a local optimization module, a global optimization module and a multi-stage fusion module: the local optimization module exploits the spatial representation of the ViT and effectively reduces local temporal redundancy by inserting a local multi-head time module before the ViT blocks; the global optimization module, introduced on top of the local optimization module, captures the complete spatio-temporal dependencies; and finally the multi-stage fusion module fuses all the multi-stage, multi-level global semantic tokens into the final video representation.
Drawings
Fig. 1 is a flow chart of fish behavior detection provided by the invention.
Fig. 2 is a block diagram of the improved UniFormer module provided by the invention.
FIG. 3 is a block diagram of a local optimization module provided by the present invention.
FIG. 4 is a block diagram of a global optimization module provided by the present invention.
Detailed Description
The invention aims to provide a fish behavior detection method based on machine vision. It proposes an improved UniFormer module to optimize the performance of UniFormer on the fish behavior detection task. The module comprises a local optimization module, a global optimization module and a multi-stage fusion module: the local optimization module exploits the spatial representation of the ViT and effectively reduces local temporal redundancy by inserting a local multi-head time module before the ViT blocks; the global optimization module, introduced on top of the local optimization module, captures the complete spatio-temporal dependencies; and finally the multi-stage fusion module fuses all the multi-stage, multi-level global semantic tokens into the final video representation.
Referring to fig. 1, in an embodiment of the present application, a method for detecting fish behavior based on machine vision is provided:
S1, fish behavior video acquisition: the collected fish behavior videos are taken in clear waters of an aquarium, and all videos are shot with professional underwater equipment. They cover 6 fish behaviors, namely foraging, evasion, social, tour, rest and reproduction behaviors; each video is between one and five minutes long, and there are 150 video clips in total;
S2, fish behavior video annotation: all fish behavior videos containing the 6 fish behaviors are annotated in a continuous-frame manner, so as to obtain a fish behavior data set containing behavior labels and timestamp information;
S3, constructing a local optimization module, which comprises a local multi-head time module, a global multi-head space module and an FFN;
S4, constructing a global optimization module, which comprises a DPE, a global multi-head space-time module and an FFN;
S5, constructing a multi-stage fusion module, which fuses the outputs of the local optimization modules and the global optimization modules;
S6, constructing an improved UniFormer module, which comprises a 3D convolution module, a downsampling module, the local optimization module, the global optimization module and the multi-stage fusion module;
S7, constructing a fish behavior machine vision detection model, which comprises, in order, an input module, the improved UniFormer module, a UniFormer detection head and an output module;
S8, training the fish behavior machine vision detection model on the fish behavior data set: training multiple times with an increasing number of epochs, using accuracy and recall as evaluation indicators, and selecting the model with the highest accuracy and recall as the fish behavior machine vision detection model (a minimal evaluation sketch is given after step S9 below);
S9, performing real-time detection with the fish behavior machine vision detection model: filming the area where the fish are located with a camera and feeding the fish behavior video into the model in real time to obtain real-time detection results.
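By way of illustration for step S8, the following is a minimal Python sketch, under assumed interfaces, of how accuracy and recall could be computed on an annotated validation split and used to select the best model among trainings with increasing epoch counts; run_training, model.predict and val_set are hypothetical placeholders, not part of the original disclosure.

def accuracy_and_recall(preds, labels, num_classes=6):
    # accuracy over all samples and macro-averaged recall over the 6 behavior classes
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / max(len(labels), 1)
    recalls = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if y == c and p == c)
        fn = sum(1 for p, y in zip(preds, labels) if y == c and p != c)
        if tp + fn > 0:
            recalls.append(tp / (tp + fn))
    recall = sum(recalls) / max(len(recalls), 1)
    return accuracy, recall

best_model, best_score = None, -1.0
for epochs in (50, 100, 150, 200):                    # each training uses more epochs than the last
    model = run_training(epochs)                      # hypothetical training routine
    preds = [model.predict(v) for v, _ in val_set]    # val_set: hypothetical list of (video, label) pairs
    labels = [y for _, y in val_set]
    acc, rec = accuracy_and_recall(preds, labels)
    if acc + rec > best_score:                        # keep the model with the best accuracy and recall
        best_model, best_score = model, acc + rec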
Further, in step S1, foraging behavior includes searching for food, predation and eating activities; evasion behavior includes rapidly moving away from other fish, abrupt turning and hiding; social behavior includes courtship, territorial competition, territorial defense and cooperative predation; tour behavior means regularly patrolling a specific area; rest behavior includes floating at rest, hiding beside shelter and resting on the bottom; and reproductive behavior includes mating, spawning and brooding.
Further, in step S3, the input fish behavior video is projected into 16L spatio-temporal tokens X_in using 3D convolution, where L is the product of the temporal, height and width dimensions of the input video and C is the number of video channels; X_in then undergoes 8x temporal downsampling, 2x spatial downsampling and position embedding. The structure of the local optimization module is shown in fig. 3; it takes X_in as input and produces X_out through
X_T = LMT(Norm(X_in)) + X_in,
X_S = GMS(Norm(X_T)) + X_T,
X_out = FFN(Norm(X_S)) + X_S,
where + denotes the element-wise addition shown in fig. 3, LMT denotes the local multi-head time module, GMS denotes the global multi-head space module, and the FFN consists of two linear projections separated by a GeLU. The Norm in LMT is Batch Norm, while the Norm in GMS and in the FFN is Layer Norm; GMS and the FFN come from the image-pretrained ViT blocks. LMT has a learnable parameter matrix of size T x 1 x 1 in the time dimension, representing the learnable relation, along the time dimension, between a token X_i and the other tokens X_j. GMS mainly focuses on the 1 x H x W tokens within a single frame and is computed per head as
GMS_n(X_i) = sum_j [ exp(Q_n(X_i)^T K_n(X_j)) / sum_j' exp(Q_n(X_i)^T K_n(X_j')) ] V_n(X_j),
where Q_n, K_n and V_n are different linear projections in the n-th head, exp denotes the exponential function and ^T denotes the matrix transpose.
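As an illustration of the local optimization module described above, the following is a minimal PyTorch-style sketch assuming a standard residual layout: a Batch Norm plus depth-wise temporal convolution standing in for the learnable T x 1 x 1 temporal relation, per-frame multi-head self-attention for GMS, and a two-layer FFN. The class name, tensor shapes and hyper-parameters are illustrative assumptions, not the patented implementation.

import torch.nn as nn

class LocalOptimizationBlock(nn.Module):
    # Sketch: local multi-head time module (LMT) + global multi-head space
    # module (GMS) + FFN, each with a residual connection.
    def __init__(self, dim, heads=8):            # dim must be divisible by heads
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)             # Batch Norm before the LMT
        # depth-wise temporal convolution standing in for the learnable T x 1 x 1 relation
        self.lmt = nn.Conv3d(dim, dim, kernel_size=(3, 1, 1), padding=(1, 0, 0), groups=dim)
        self.ln1 = nn.LayerNorm(dim)              # Layer Norm before GMS
        self.gms = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)              # Layer Norm before the FFN
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                         # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.bn(x.flatten(2)).view(b, c, t, h, w)
        x = x + self.lmt(y)                       # LMT: mixing along the time dimension only
        tok = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        y = self.ln1(tok)
        tok = tok + self.gms(y, y, y)[0]          # GMS: self-attention within each 1 x H x W frame
        tok = tok + self.ffn(self.ln2(tok))
        return tok.view(b, t, h, w, c).permute(0, 4, 1, 2, 3)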
Further, in step S4, the structure of the global optimization module is shown in fig. 4. It takes the token sequence X from the local optimization module and a learnable query q as input and computes
X_C = DPE(X) + X,
X_ST = GMST(Norm(q), Norm(X_C)),
X_out = FFN(Norm(X_ST)) + X_ST,
where + denotes the element-wise addition shown in fig. 4, DPE comes from UniFormer and denotes dynamic position embedding, the FFN consists of two linear projections separated by a GeLU, GMST denotes the global multi-head space-time module, and the Norm in GMST and in the FFN is Layer Norm. GMST is computed per head as
GMST_n(q) = sum_j A_n(q, X_j) V_n(X_j),
where the linear projection V_n (V in fig. 4) converts X into a spatio-temporal context and the cross-affinity matrix A_n(q, X) models the dependency between q and all spatio-temporal tokens X, thereby converting the query q into a video representation. A_n is calculated as
A_n(q, X_j) = exp(Q_n(q)^T K_n(X_j)) / sum_j' exp(Q_n(q)^T K_n(X_j')),
where Q_n (Q in fig. 4) and K_n (K in fig. 4) are linear projections implemented by linear layers, exp denotes the exponential function and ^T denotes the matrix transpose.
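Similarly, the following is a minimal sketch of the global optimization module, assuming a depth-wise 3D convolution as a stand-in for the dynamic position embedding and standard multi-head cross-attention between a learnable query and all spatio-temporal tokens; names, shapes and hyper-parameters are illustrative assumptions.

import torch.nn as nn

class GlobalOptimizationBlock(nn.Module):
    # Sketch: DPE + global multi-head space-time cross-attention (GMST) + FFN.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.dpe = nn.Conv3d(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv standing in for dynamic position embedding
        self.ln_q = nn.LayerNorm(dim)
        self.ln_x = nn.LayerNorm(dim)
        self.gmst = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, q):                      # x: (B, C, T, H, W) tokens, q: (B, 1, C) query
        x = x + self.dpe(x)                       # add positional information
        tok = self.ln_x(x.flatten(2).transpose(1, 2))   # (B, T*H*W, C) spatio-temporal tokens
        # cross-affinity between the query and every spatio-temporal token
        q = q + self.gmst(self.ln_q(q), tok, tok)[0]
        return q + self.ffn(self.ln2(q))          # (B, 1, C) video token of this block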
Further, in step S5, for the multi-stage fusion module, the i-th global optimization module is written as G_i; from the tokens X_i of its local optimization module, the global optimization module converts the query q into a video token F_i. The video tokens F_i of all global optimization modules are fused sequentially, using the video token of the previous global optimization module as the query of the current one, i.e. F_i = G_i(F_{i-1}, X_i), so that for the N global optimization modules a global video token F = F_N is finally obtained. A class token C is then extracted from the final local optimization module, and the class token and the global video token are added in the form of a weighted sum to obtain and output the final video representation Z = a*C + (1 - a)*F, where + denotes element-wise addition and a is a learnable parameter processed by a Sigmoid function.
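A minimal sketch of the multi-stage fusion described above, reusing the GlobalOptimizationBlock interface sketched earlier: the sequential fusion of video tokens and the sigmoid-gated weighted sum of the class token and the global video token follow the description, while the class name, shapes and initializations are illustrative assumptions.

import torch
import torch.nn as nn

class MultiStageFusion(nn.Module):
    # Sketch: fuse the video tokens of the N global optimization blocks sequentially,
    # then combine the result with the class token of the last local block.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))   # initial learnable query q
        self.alpha = nn.Parameter(torch.zeros(1))            # learnable weight, passed through a Sigmoid

    def forward(self, global_blocks, local_feats, cls_token):
        # global_blocks: iterable of GlobalOptimizationBlock; local_feats: list of (B, C, T, H, W);
        # cls_token: (B, C) class token taken from the last local optimization block
        f = self.query.expand(cls_token.shape[0], -1, -1)    # F_0 = q
        for g, x in zip(global_blocks, local_feats):
            f = g(x, f)                                       # F_i = G_i(F_{i-1}, X_i)
        a = torch.sigmoid(self.alpha)
        return a * cls_token + (1.0 - a) * f.squeeze(1)       # weighted sum -> final video representation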
Further, in step S6, the structure of the improved UniFormer module is shown in fig. 2. The resolution of a single frame of the fish behavior video is 224 x 224 x 3. A 3D convolution projects the input fish behavior video into a number of spatio-temporal tokens, followed by 8x spatial downsampling, 2x temporal downsampling and position embedding. A local optimization module and a global optimization module are then constructed: the local optimization module consists, in order, of a local multi-head time module, a global multi-head space module and an FFN, and the global optimization module consists, in order, of a DPE, a global multi-head space-time module and an FFN. Several local optimization modules are used and connected in sequence, and a global optimization module is introduced on top of each local optimization module. Finally, a multi-stage fusion module is constructed: a class token is extracted from the last local optimization module, global video tokens are obtained from all global optimization modules, and the class token and the global video tokens are added as a weighted sum to obtain and output the final fish behavior video representation. In fig. 2 there are N local optimization modules, with N = 4.
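Putting the pieces together, the following is a minimal sketch of how the improved UniFormer module could be wired, reusing the LocalOptimizationBlock, GlobalOptimizationBlock and MultiStageFusion classes sketched above. The stem kernel, channel width, pooled stand-in for the class token and classification head are illustrative assumptions rather than the patented configuration.

import torch.nn as nn

class ImprovedUniFormer(nn.Module):
    # Sketch of the overall wiring: 3D-conv stem, N local blocks, one global block
    # per local block, multi-stage fusion and a classification head (N = 4 here).
    def __init__(self, dim=256, n=4, num_classes=6):
        super().__init__()
        # stem: project the video into spatio-temporal tokens with 8x spatial / 2x temporal downsampling
        self.stem = nn.Conv3d(3, dim, kernel_size=(2, 8, 8), stride=(2, 8, 8))
        self.local = nn.ModuleList([LocalOptimizationBlock(dim) for _ in range(n)])
        self.glob = nn.ModuleList([GlobalOptimizationBlock(dim) for _ in range(n)])
        self.fuse = MultiStageFusion(dim)
        self.head = nn.Linear(dim, num_classes)              # one score per fish behavior class

    def forward(self, video):                                 # video: (B, 3, T, 224, 224)
        x = self.stem(video)
        feats = []
        for blk in self.local:
            x = blk(x)
            feats.append(x)                                   # a global block sits on top of each local block
        cls = x.mean(dim=(2, 3, 4))                           # pooled features stand in for the class token
        z = self.fuse(self.glob, feats, cls)                  # final fish behavior video representation
        return self.head(z)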
Further, in step S4, the DPE refers to the position coding used to add position information at every position of the input sequence; it is generated by combining sine and cosine functions, with the calculation formulas
PE(pos, 2i) = sin(pos / 10000^(2i/d)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),
where pos is the position information, i is the index of the embedding dimension and d is the size of the embedding dimension, so that for each position pos and each embedding dimension index i the corresponding position code PE can be calculated.
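A minimal sketch of this sine/cosine position code follows; the function name is illustrative and the embedding dimension is assumed to be even.

import math
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    # Sketch of PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d));
    # d_model is assumed to be even.
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # position index "pos"
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)                                    # even embedding dimensions
    pe[:, 1::2] = torch.cos(pos * div)                                    # odd embedding dimensions
    return pe                                                             # (seq_len, d_model), added to the token sequence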
Further, in step S4, the FFN is used to perform a nonlinear transformation and feature extraction on the feature vector at each position. It consists of two fully connected layers joined by an activation function: the first fully connected layer maps the input feature vector to a hidden layer of intermediate dimension, a ReLU activation is applied after the first layer, and the second fully connected layer maps its output back to the original feature dimension. The FFN can be calculated as
FFN(x) = W_2 * sigma(W_1 * x + b_1) + b_2,
where x is the input feature vector, W_1 and b_1 are the weight and bias of the first fully connected layer, W_2 and b_2 are the weight and bias of the second fully connected layer, and sigma denotes the activation function.
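A minimal sketch of this two-layer FFN; the class name and the choice of hidden dimension are illustrative.

import torch.nn as nn

class FFN(nn.Module):
    # Sketch of FFN(x) = W2 * ReLU(W1 * x + b1) + b2 as two fully connected layers.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)   # first layer: input dimension -> hidden dimension
        self.act = nn.ReLU()                    # activation between the two layers
        self.fc2 = nn.Linear(hidden_dim, dim)   # second layer: back to the original dimension

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))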
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and improvements could be made by those skilled in the art without departing from the inventive concept, which fall within the scope of the present invention.

Claims (2)

1. A fish behavior detection method based on machine vision, characterized by comprising the following steps:
S1, fish behavior video acquisition: acquiring videos of 6 fish behaviors, namely foraging, evasion, social, tour, rest and reproduction behaviors, wherein the acquisition comprises underwater shooting and above-water shooting; foraging behavior includes searching for food, predation and eating activities; evasion behavior includes rapidly moving away from other fish, abrupt turning and hiding; social behavior includes courtship, territorial competition, territorial defense and cooperative predation; tour behavior means regularly patrolling a specific area; rest behavior includes floating at rest, hiding beside shelter and resting on the bottom; and reproductive behavior includes mating, spawning and brooding;
S2, fish behavior video annotation: annotating all fish behavior videos containing the 6 fish behaviors in a continuous-frame manner, so as to obtain a fish behavior data set;
S3, constructing a local optimization module, which comprises a local multi-head time module, a global multi-head space module and an FFN, by the following specific method: the input fish behavior video is projected into 16L spatio-temporal tokens X_in using 3D convolution, where L is the product of the temporal, height and width dimensions of the input video and C is the number of video channels; X_in then undergoes 8x temporal downsampling, 2x spatial downsampling and position embedding; the local optimization module takes X_in as input and produces X_out through
X_T = LMT(Norm(X_in)) + X_in,
X_S = GMS(Norm(X_T)) + X_T,
X_out = FFN(Norm(X_S)) + X_S,
where + denotes element-wise addition, LMT denotes the local multi-head time module, GMS denotes the global multi-head space module, and the FFN consists of two linear projections separated by a GeLU; the Norm in LMT is Batch Norm, while the Norm in GMS and in the FFN is Layer Norm; GMS and the FFN come from the image-pretrained ViT blocks; LMT has a learnable parameter matrix of size T x 1 x 1 in the time dimension, representing the learnable relation, along the time dimension, between a token X_i and the other tokens X_j; GMS mainly focuses on the 1 x H x W tokens within a single frame and is computed per head as
GMS_n(X_i) = sum_j [ exp(Q_n(X_i)^T K_n(X_j)) / sum_j' exp(Q_n(X_i)^T K_n(X_j')) ] V_n(X_j),
where Q_n, K_n and V_n are different linear projections in the n-th head, exp denotes the exponential function and ^T denotes the matrix transpose;
S4, constructing a global optimization module, which comprises a DPE, a global multi-head space-time module and an FFN; the global optimization module takes the token sequence X from the local optimization module and a learnable query q as input and computes
X_C = DPE(X) + X,
X_ST = GMST(Norm(q), Norm(X_C)),
X_out = FFN(Norm(X_ST)) + X_ST,
where + denotes element-wise addition, DPE comes from UniFormer and denotes dynamic position embedding, the FFN consists of two linear projections separated by a GeLU, GMST denotes the global multi-head space-time module, and the Norm in GMST and in the FFN is Layer Norm; GMST is computed per head as
GMST_n(q) = sum_j A_n(q, X_j) V_n(X_j),
where the linear projection V_n converts X into a spatio-temporal context and the cross-affinity matrix A_n(q, X) models the dependency between q and all spatio-temporal tokens X, thereby converting the query q into a video representation; A_n is calculated as
A_n(q, X_j) = exp(Q_n(q)^T K_n(X_j)) / sum_j' exp(Q_n(q)^T K_n(X_j')),
where exp denotes the exponential function, ^T denotes the matrix transpose, and Q_n and K_n are linear projections implemented by linear layers;
for the DPE, which refers to the position coding used to add position information at every position of the input sequence, the DPE is generated by combining sine and cosine functions, with the calculation formulas
PE(pos, 2i) = sin(pos / 10000^(2i/d)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),
where pos is the position information, i is the index of the embedding dimension and d is the size of the embedding dimension, so that for each position pos and each embedding dimension index i the corresponding position code PE can be calculated;
the FFN is used to perform a nonlinear transformation and feature extraction on the feature vector at each position; it consists of two fully connected layers joined by an activation function: the first fully connected layer maps the input feature vector to a hidden layer of intermediate dimension, a ReLU activation is applied after the first layer, and the second fully connected layer maps its output back to the original feature dimension; the FFN can be calculated as
FFN(x) = W_2 * sigma(W_1 * x + b_1) + b_2,
where x is the input feature vector, W_1 and b_1 are the weight and bias of the first fully connected layer, W_2 and b_2 are the weight and bias of the second fully connected layer, and sigma denotes the activation function;
S5, constructing a multi-stage fusion module that fuses the outputs of the local optimization modules and the global optimization modules; for the multi-stage fusion module, the i-th global optimization module is written as G_i; from the tokens X_i of its local optimization module, the global optimization module converts the query q into a video token F_i; the video tokens F_i of all global optimization modules are fused sequentially, using the video token of the previous global optimization module as the query of the current one, i.e. F_i = G_i(F_{i-1}, X_i), so that for N global optimization modules a global video token F = F_N is finally obtained; a class token C is then extracted from the final local optimization module, and the class token and the global video token are added in the form of a weighted sum to obtain and output the final video representation Z = a*C + (1 - a)*F, where + denotes element-wise addition and a is a learnable parameter processed by a Sigmoid function;
S6, constructing an improved UniFormer module, which comprises a 3D convolution module, a downsampling module, the local optimization module, the global optimization module and the multi-stage fusion module; for the improved UniFormer module, the input fish behavior video is projected into a number of spatio-temporal tokens by a 3D convolution, followed by 8x spatial downsampling, 2x temporal downsampling and position embedding; a local optimization module and a global optimization module are constructed, the local optimization module consisting, in order, of a local multi-head time module, a global multi-head space module and an FFN, and the global optimization module consisting, in order, of a DPE, a global multi-head space-time module and an FFN; several local optimization modules are used and connected in sequence, and a global optimization module is introduced on top of each local optimization module; a multi-stage fusion module is constructed, a class token is extracted from the last local optimization module, global video tokens are obtained from all global optimization modules, and the class token and the global video tokens are added as a weighted sum to obtain and output the final fish behavior video representation;
S7, constructing a fish behavior machine vision detection model, which comprises, in order, an input module, the improved UniFormer module, a detection head and an output module;
S8, training the fish behavior machine vision detection model on the fish behavior data set: training multiple times with an increasing number of epochs, using accuracy and recall as evaluation indicators, and selecting the model with the highest accuracy and recall as the fish behavior machine vision detection model;
S9, performing real-time detection with the fish behavior machine vision detection model: filming the area where the fish are located with a camera and feeding the fish behavior video into the model in real time to obtain real-time detection results.
2. The machine-vision-based fish behavior detection method according to claim 1, wherein in step S1, professional underwater photographing equipment is used for underwater shooting; some of the video shooting sites are waters with sufficient light, clear water and high transparency, and some are waters with turbid water where the fish outlines and motion trajectories are still clearly visible; for above-water shooting, some of the video shooting sites are waters with sufficient light, clear water and high transparency, some are clear, highly transparent waters shot at night with supplementary lighting equipment, and some are turbid waters where the fish outlines and motion trajectories are still clearly visible.
CN202410369755.2A 2024-03-29 2024-03-29 Fish behavior detection method based on machine vision Active CN117975572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410369755.2A CN117975572B (en) 2024-03-29 2024-03-29 Fish behavior detection method based on machine vision


Publications (2)

Publication Number Publication Date
CN117975572A CN117975572A (en) 2024-05-03
CN117975572B (en) 2024-06-04

Family

ID=90861573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410369755.2A Active CN117975572B (en) 2024-03-29 2024-03-29 Fish behavior detection method based on machine vision

Country Status (1)

Country Link
CN (1) CN117975572B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049952B (en) * 2022-04-24 2023-04-07 南京农业大学 Juvenile fish limb identification method based on multi-scale cascade perception deep learning network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104969898A (en) * 2015-07-24 2015-10-14 中国水产科学研究院黄海水产研究所 Experiment device used for researching effects on fish behaviors imposed by light and application thereof
WO2020207092A1 (en) * 2019-04-11 2020-10-15 浙江大学 Feedback-type pond recirculating water intelligent feeding system fusing machine vision and infrared detection technology
US11810366B1 (en) * 2022-09-22 2023-11-07 Zhejiang Lab Joint modeling method and apparatus for enhancing local features of pedestrians
CN116824454A (en) * 2023-07-03 2023-09-29 山东大学 Fish behavior identification method and system based on spatial pyramid attention
CN117197727A (en) * 2023-11-07 2023-12-08 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117576616A (en) * 2024-01-08 2024-02-20 中国农业大学 Deep learning-based fish swimming behavior early warning method, system and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kunchang Li et al.; "UniFormer: Unifying Convolution and Self-Attention for Visual Recognition"; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2023-07-05; full text *

Also Published As

Publication number Publication date
CN117975572A (en) 2024-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant