CN117975572B - Fish behavior detection method based on machine vision - Google Patents

Fish behavior detection method based on machine vision

Info

Publication number
CN117975572B
CN117975572B (application CN202410369755.2A; earlier publication CN117975572A)
Authority
CN
China
Prior art keywords
module
global
video
fish
optimization module
Prior art date
Legal status
Active
Application number
CN202410369755.2A
Other languages
Chinese (zh)
Other versions
CN117975572A (en)
Inventor
董俊
闫家仁
刘晓晨
曹振杰
Current Assignee
Shandong Freshwater Fisheries Research Institute
Original Assignee
Shandong Freshwater Fisheries Research Institute
Priority date
Filing date
Publication date
Application filed by Shandong Freshwater Fisheries Research Institute filed Critical Shandong Freshwater Fisheries Research Institute
Priority to CN202410369755.2A
Publication of CN117975572A
Application granted
Publication of CN117975572B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/80Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in fisheries management
    • Y02A40/81Aquaculture, e.g. of fish

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fish behavior detection method based on machine vision, and relates to the field of computer vision. The method comprises fish behavior video acquisition, fish behavior video annotation, construction of a local optimization module, a global optimization module, a multi-stage fusion module and an improved UniFormer module, construction of a fish behavior machine vision detection model, training of the model on the fish behavior data set, and real-time detection using the trained model. In the improved UniFormer module, the local optimization module first reduces local temporal redundancy effectively by inserting a local multi-head time module before the ViT blocks, the global optimization module then captures the complete spatio-temporal dependencies, and finally the multi-stage fusion module fuses the multi-stage, multi-level global semantic tokens into the final video representation.

Description

Fish behavior detection method based on machine vision
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a fish behavior detection method based on machine vision.
Background
The vision transformer (ViT) performs well on image tasks, which has motivated various studies to apply image ViTs to video tasks; however, the large gap between images and videos hinders the spatio-temporal learning of these image-pretrained models. For video object detection tasks, video-specific detection models such as UniFormer transfer more seamlessly to the video domain, but their unique architecture requires long image pretraining, which limits scalability. With the advent of powerful open-source image ViT detection architectures, their potential for video understanding can be unlocked by improving the UniFormer architecture.
Fish behavior detection is an important research area that involves identifying and analyzing fish behaviors from videos taken above or under water. Video quality directly affects the accuracy and reliability of detection: high-resolution, sharp and stable video provides more accurate data and better captures the various behaviors of fish. The capability of the detection algorithm is equally important; it must identify and classify different types of fish behavior and filter out interference and noise in complex environments. Video quality and algorithm performance therefore jointly determine the effectiveness of fish behavior detection.
Disclosure of Invention
The invention provides a fish behavior detection method based on machine vision. It improves on UniFormer by introducing an efficient structural design with local optimization, global optimization and multi-stage fusion, unlocking the potential of UniFormer for video understanding and achieving advanced performance in fish behavior video detection.
To this end, the fish behavior detection method based on machine vision comprises the following steps:
S1, fish behavior video acquisition: acquiring videos of 6 fish behaviors, namely foraging, evasion, social, tour, rest and reproduction behaviors, wherein the acquisition comprises underwater shooting and above-water shooting;
S2, fish behavior video annotation: annotating all fish behavior videos containing the 6 fish behaviors in a continuous-frame manner, so as to obtain a fish behavior data set;
S3, constructing a local optimization module, which comprises a local multi-head time module, a global multi-head space module and an FFN;
S4, constructing a global optimization module, which comprises a DPE, a global multi-head space-time module and an FFN;
S5, constructing a multi-stage fusion module, which fuses the outputs of the local optimization modules and the global optimization modules;
S6, constructing an improved UniFormer module, which comprises a 3D convolution module, a downsampling module, the local optimization module, the global optimization module and the multi-stage fusion module;
S7, constructing a fish behavior machine vision detection model, which comprises, in order, an input module, the improved UniFormer module, a detection head and an output module;
S8, training the fish behavior machine vision detection model on the fish behavior data set: training multiple times with an increasing number of epochs, using accuracy and recall as evaluation indicators, and selecting the model with the highest accuracy and recall as the fish behavior machine vision detection model;
S9, performing real-time detection with the fish behavior machine vision detection model: filming the area where the fish are located with a camera and feeding the fish behavior video into the model in real time to obtain real-time detection results.
Preferably, in step S1, a professional underwater camera is used for underwater shooting; some of the video shooting sites are waters with sufficient light, clear water and high transparency, and some are waters with turbid water where the fish outlines and motion trajectories are still clearly visible. For above-water shooting, some of the video shooting sites are waters with sufficient light, clear water and high transparency, some are clear, highly transparent waters shot at night with supplementary lighting equipment, and some are turbid waters where the fish outlines and motion trajectories are still clearly visible.
Preferably, in step S1, foraging behavior includes searching for food, predation and eating activities; evasion behavior includes rapidly moving away from other fish, abrupt turning and hiding; social behavior includes courtship, territorial competition, territorial defense and cooperative predation; tour behavior means regularly patrolling a specific area; rest behavior includes floating at rest, hiding beside shelter and resting on the bottom; and reproductive behavior includes mating, spawning and brooding.
Preferably, in step S3, the input fish behavior video is projected into 16L spatio-temporal tokens X_in using 3D convolution, where L is the product of the temporal, height and width dimensions of the input video and C is the number of video channels; X_in then undergoes 8x temporal downsampling, 2x spatial downsampling and position embedding. The local optimization module takes X_in as input and produces X_out through
X_T = LMT(Norm(X_in)) + X_in,
X_S = GMS(Norm(X_T)) + X_T,
X_out = FFN(Norm(X_S)) + X_S,
where + denotes element-wise addition, LMT denotes the local multi-head time module, GMS denotes the global multi-head space module, and the FFN consists of two linear projections separated by a GeLU. The Norm in LMT is Batch Norm, while the Norm in GMS and in the FFN is Layer Norm; GMS and the FFN come from the image-pretrained ViT blocks. LMT has a learnable parameter matrix of size T x 1 x 1 in the time dimension, representing the learnable relation, along the time dimension, between a token X_i and the other tokens X_j. GMS mainly focuses on the 1 x H x W tokens within a single frame and is computed per head as
GMS_n(X_i) = sum_j [ exp(Q_n(X_i)^T K_n(X_j)) / sum_j' exp(Q_n(X_i)^T K_n(X_j')) ] V_n(X_j),
where Q_n, K_n and V_n are different linear projections in the n-th head, exp denotes the exponential function and ^T denotes the matrix transpose.
Preferably, in step S4, the global optimization module takes the token sequence X from the local optimization module and a learnable query q as input and computes
X_C = DPE(X) + X,
X_ST = GMST(Norm(q), Norm(X_C)),
X_out = FFN(Norm(X_ST)) + X_ST,
where + denotes element-wise addition, DPE comes from UniFormer and denotes dynamic position embedding, the FFN consists of two linear projections separated by a GeLU, GMST denotes the global multi-head space-time module, and the Norm in GMST and in the FFN is Layer Norm. GMST is computed per head as
GMST_n(q) = sum_j A_n(q, X_j) V_n(X_j),
where the linear projection V_n converts X into a spatio-temporal context and the cross-affinity matrix A_n(q, X) models the dependency between q and all spatio-temporal tokens X, thereby converting the query q into a video representation. A_n is calculated as
A_n(q, X_j) = exp(Q_n(q)^T K_n(X_j)) / sum_j' exp(Q_n(q)^T K_n(X_j')),
where exp denotes the exponential function, ^T denotes the matrix transpose, and Q_n and K_n are linear projections implemented by linear layers.
Preferably, in step S5, for the multi-stage fusion module, the i-th global optimization module is written as G_i; from the tokens X_i of its local optimization module, the global optimization module converts the query q into a video token F_i. The video tokens F_i of all global optimization modules are fused sequentially, using the video token of the previous global optimization module as the query of the current one, i.e. F_i = G_i(F_{i-1}, X_i), so that for N global optimization modules a global video token F = F_N is finally obtained. A class token C is then extracted from the final local optimization module, and the class token and the global video token are added in the form of a weighted sum to obtain and output the final video representation Z = a*C + (1 - a)*F, where + denotes element-wise addition and a is a learnable parameter processed by a Sigmoid function.
Preferably, in step S6, for the improved UniFormer module, the input fish behavior video is projected into a number of spatio-temporal tokens by a 3D convolution, followed by 8x spatial downsampling, 2x temporal downsampling and position embedding. A local optimization module and a global optimization module are then constructed: the local optimization module consists, in order, of a local multi-head time module, a global multi-head space module and an FFN, and the global optimization module consists, in order, of a DPE, a global multi-head space-time module and an FFN. Several local optimization modules are used and connected in sequence, and a global optimization module is introduced on top of each local optimization module. Finally, a multi-stage fusion module is constructed: a class token is extracted from the last local optimization module, global video tokens are obtained from all global optimization modules, and the class token and the global video tokens are added as a weighted sum to obtain and output the final fish behavior video representation.
Compared with the prior art, the invention has the following technical effects:
The technical scheme provided by the invention proposes an improved UniFormer module that optimizes the performance of UniFormer on video object detection tasks and applies it to fish behavior detection. The module comprises a local optimization module, a global optimization module and a multi-stage fusion module: the local optimization module exploits the spatial representation of the ViT and effectively reduces local temporal redundancy by inserting a local multi-head time module before the ViT blocks; the global optimization module, introduced on top of the local optimization module, captures the complete spatio-temporal dependencies; and finally the multi-stage fusion module fuses all the multi-stage, multi-level global semantic tokens into the final video representation.
Drawings
Fig. 1 is a flow chart of fish behavior detection provided by the invention.
Fig. 2 is a block diagram of the improved UniFormer module provided by the invention.
FIG. 3 is a block diagram of a local optimization module provided by the present invention.
FIG. 4 is a block diagram of a global optimization module provided by the present invention.
Detailed Description
The invention aims to provide a fish behavior detection method based on machine vision. It proposes an improved UniFormer module to optimize the performance of UniFormer on the fish behavior detection task. The module comprises a local optimization module, a global optimization module and a multi-stage fusion module: the local optimization module exploits the spatial representation of the ViT and effectively reduces local temporal redundancy by inserting a local multi-head time module before the ViT blocks; the global optimization module, introduced on top of the local optimization module, captures the complete spatio-temporal dependencies; and finally the multi-stage fusion module fuses all the multi-stage, multi-level global semantic tokens into the final video representation.
Referring to fig. 1, in an embodiment of the present application, a method for detecting fish behavior based on machine vision is provided:
S1, fish behavior video acquisition: the collected fish behavior videos are taken in clear waters of an aquarium, and all videos are shot with professional underwater equipment. They cover 6 fish behaviors, namely foraging, evasion, social, tour, rest and reproduction behaviors; each video is between one and five minutes long, and there are 150 video clips in total;
S2, fish behavior video annotation: all fish behavior videos containing the 6 fish behaviors are annotated in a continuous-frame manner, so as to obtain a fish behavior data set containing behavior labels and timestamp information;
S3, constructing a local optimization module, which comprises a local multi-head time module, a global multi-head space module and an FFN;
S4, constructing a global optimization module, which comprises a DPE, a global multi-head space-time module and an FFN;
S5, constructing a multi-stage fusion module, which fuses the outputs of the local optimization modules and the global optimization modules;
S6, constructing an improved UniFormer module, which comprises a 3D convolution module, a downsampling module, the local optimization module, the global optimization module and the multi-stage fusion module;
S7, constructing a fish behavior machine vision detection model, which comprises, in order, an input module, the improved UniFormer module, a UniFormer detection head and an output module;
S8, training the fish behavior machine vision detection model on the fish behavior data set: training multiple times with an increasing number of epochs, using accuracy and recall as evaluation indicators, and selecting the model with the highest accuracy and recall as the fish behavior machine vision detection model (a minimal evaluation sketch is given after step S9 below);
S9, performing real-time detection with the fish behavior machine vision detection model: filming the area where the fish are located with a camera and feeding the fish behavior video into the model in real time to obtain real-time detection results.
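By way of illustration for step S8, the following is a minimal Python sketch, under assumed interfaces, of how accuracy and recall could be computed on an annotated validation split and used to select the best model among trainings with increasing epoch counts; run_training, model.predict and val_set are hypothetical placeholders, not part of the original disclosure.

def accuracy_and_recall(preds, labels, num_classes=6):
    # accuracy over all samples and macro-averaged recall over the 6 behavior classes
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    accuracy = correct / max(len(labels), 1)
    recalls = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if y == c and p == c)
        fn = sum(1 for p, y in zip(preds, labels) if y == c and p != c)
        if tp + fn > 0:
            recalls.append(tp / (tp + fn))
    recall = sum(recalls) / max(len(recalls), 1)
    return accuracy, recall

best_model, best_score = None, -1.0
for epochs in (50, 100, 150, 200):                    # each training uses more epochs than the last
    model = run_training(epochs)                      # hypothetical training routine
    preds = [model.predict(v) for v, _ in val_set]    # val_set: hypothetical list of (video, label) pairs
    labels = [y for _, y in val_set]
    acc, rec = accuracy_and_recall(preds, labels)
    if acc + rec > best_score:                        # keep the model with the best accuracy and recall
        best_model, best_score = model, acc + rec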
Further, in step S1, foraging behavior includes searching for food, predation and eating activities; evasion behavior includes rapidly moving away from other fish, abrupt turning and hiding; social behavior includes courtship, territorial competition, territorial defense and cooperative predation; tour behavior means regularly patrolling a specific area; rest behavior includes floating at rest, hiding beside shelter and resting on the bottom; and reproductive behavior includes mating, spawning and brooding.
Further, in step S3, the input fish behavior video is projected into 16L spatio-temporal tokens X_in using 3D convolution, where L is the product of the temporal, height and width dimensions of the input video and C is the number of video channels; X_in then undergoes 8x temporal downsampling, 2x spatial downsampling and position embedding. The structure of the local optimization module is shown in fig. 3; it takes X_in as input and produces X_out through
X_T = LMT(Norm(X_in)) + X_in,
X_S = GMS(Norm(X_T)) + X_T,
X_out = FFN(Norm(X_S)) + X_S,
where + denotes the element-wise addition shown in fig. 3, LMT denotes the local multi-head time module, GMS denotes the global multi-head space module, and the FFN consists of two linear projections separated by a GeLU. The Norm in LMT is Batch Norm, while the Norm in GMS and in the FFN is Layer Norm; GMS and the FFN come from the image-pretrained ViT blocks. LMT has a learnable parameter matrix of size T x 1 x 1 in the time dimension, representing the learnable relation, along the time dimension, between a token X_i and the other tokens X_j. GMS mainly focuses on the 1 x H x W tokens within a single frame and is computed per head as
GMS_n(X_i) = sum_j [ exp(Q_n(X_i)^T K_n(X_j)) / sum_j' exp(Q_n(X_i)^T K_n(X_j')) ] V_n(X_j),
where Q_n, K_n and V_n are different linear projections in the n-th head, exp denotes the exponential function and ^T denotes the matrix transpose.
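As an illustration of the local optimization module described above, the following is a minimal PyTorch-style sketch assuming a standard residual layout: a Batch Norm plus depth-wise temporal convolution standing in for the learnable T x 1 x 1 temporal relation, per-frame multi-head self-attention for GMS, and a two-layer FFN. The class name, tensor shapes and hyper-parameters are illustrative assumptions, not the patented implementation.

import torch.nn as nn

class LocalOptimizationBlock(nn.Module):
    # Sketch: local multi-head time module (LMT) + global multi-head space
    # module (GMS) + FFN, each with a residual connection.
    def __init__(self, dim, heads=8):            # dim must be divisible by heads
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)             # Batch Norm before the LMT
        # depth-wise temporal convolution standing in for the learnable T x 1 x 1 relation
        self.lmt = nn.Conv3d(dim, dim, kernel_size=(3, 1, 1), padding=(1, 0, 0), groups=dim)
        self.ln1 = nn.LayerNorm(dim)              # Layer Norm before GMS
        self.gms = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)              # Layer Norm before the FFN
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                         # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.bn(x.flatten(2)).view(b, c, t, h, w)
        x = x + self.lmt(y)                       # LMT: mixing along the time dimension only
        tok = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        y = self.ln1(tok)
        tok = tok + self.gms(y, y, y)[0]          # GMS: self-attention within each 1 x H x W frame
        tok = tok + self.ffn(self.ln2(tok))
        return tok.view(b, t, h, w, c).permute(0, 4, 1, 2, 3)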
Further, in step S4, the structure of the global optimization module is shown in fig. 4. It takes the token sequence X from the local optimization module and a learnable query q as input and computes
X_C = DPE(X) + X,
X_ST = GMST(Norm(q), Norm(X_C)),
X_out = FFN(Norm(X_ST)) + X_ST,
where + denotes the element-wise addition shown in fig. 4, DPE comes from UniFormer and denotes dynamic position embedding, the FFN consists of two linear projections separated by a GeLU, GMST denotes the global multi-head space-time module, and the Norm in GMST and in the FFN is Layer Norm. GMST is computed per head as
GMST_n(q) = sum_j A_n(q, X_j) V_n(X_j),
where the linear projection V_n (V in fig. 4) converts X into a spatio-temporal context and the cross-affinity matrix A_n(q, X) models the dependency between q and all spatio-temporal tokens X, thereby converting the query q into a video representation. A_n is calculated as
A_n(q, X_j) = exp(Q_n(q)^T K_n(X_j)) / sum_j' exp(Q_n(q)^T K_n(X_j')),
where Q_n (Q in fig. 4) and K_n (K in fig. 4) are linear projections implemented by linear layers, exp denotes the exponential function and ^T denotes the matrix transpose.
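Similarly, the following is a minimal sketch of the global optimization module, assuming a depth-wise 3D convolution as a stand-in for the dynamic position embedding and standard multi-head cross-attention between a learnable query and all spatio-temporal tokens; names, shapes and hyper-parameters are illustrative assumptions.

import torch.nn as nn

class GlobalOptimizationBlock(nn.Module):
    # Sketch: DPE + global multi-head space-time cross-attention (GMST) + FFN.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.dpe = nn.Conv3d(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv standing in for dynamic position embedding
        self.ln_q = nn.LayerNorm(dim)
        self.ln_x = nn.LayerNorm(dim)
        self.gmst = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, q):                      # x: (B, C, T, H, W) tokens, q: (B, 1, C) query
        x = x + self.dpe(x)                       # add positional information
        tok = self.ln_x(x.flatten(2).transpose(1, 2))   # (B, T*H*W, C) spatio-temporal tokens
        # cross-affinity between the query and every spatio-temporal token
        q = q + self.gmst(self.ln_q(q), tok, tok)[0]
        return q + self.ffn(self.ln2(q))          # (B, 1, C) video token of this block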
Further, in step S5, for the multi-stage fusion module, the i-th global optimization module is written as G_i; from the tokens X_i of its local optimization module, the global optimization module converts the query q into a video token F_i. The video tokens F_i of all global optimization modules are fused sequentially, using the video token of the previous global optimization module as the query of the current one, i.e. F_i = G_i(F_{i-1}, X_i), so that for the N global optimization modules a global video token F = F_N is finally obtained. A class token C is then extracted from the final local optimization module, and the class token and the global video token are added in the form of a weighted sum to obtain and output the final video representation Z = a*C + (1 - a)*F, where + denotes element-wise addition and a is a learnable parameter processed by a Sigmoid function.
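A minimal sketch of the multi-stage fusion described above, reusing the GlobalOptimizationBlock interface sketched earlier: the sequential fusion of video tokens and the sigmoid-gated weighted sum of the class token and the global video token follow the description, while the class name, shapes and initializations are illustrative assumptions.

import torch
import torch.nn as nn

class MultiStageFusion(nn.Module):
    # Sketch: fuse the video tokens of the N global optimization blocks sequentially,
    # then combine the result with the class token of the last local block.
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))   # initial learnable query q
        self.alpha = nn.Parameter(torch.zeros(1))            # learnable weight, passed through a Sigmoid

    def forward(self, global_blocks, local_feats, cls_token):
        # global_blocks: iterable of GlobalOptimizationBlock; local_feats: list of (B, C, T, H, W);
        # cls_token: (B, C) class token taken from the last local optimization block
        f = self.query.expand(cls_token.shape[0], -1, -1)    # F_0 = q
        for g, x in zip(global_blocks, local_feats):
            f = g(x, f)                                       # F_i = G_i(F_{i-1}, X_i)
        a = torch.sigmoid(self.alpha)
        return a * cls_token + (1.0 - a) * f.squeeze(1)       # weighted sum -> final video representation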
Further, in step S6, the structure of the improved UniFormer module is shown in fig. 2. The resolution of a single frame of the fish behavior video is 224 x 224 x 3. A 3D convolution projects the input fish behavior video into a number of spatio-temporal tokens, followed by 8x spatial downsampling, 2x temporal downsampling and position embedding. A local optimization module and a global optimization module are then constructed: the local optimization module consists, in order, of a local multi-head time module, a global multi-head space module and an FFN, and the global optimization module consists, in order, of a DPE, a global multi-head space-time module and an FFN. Several local optimization modules are used and connected in sequence, and a global optimization module is introduced on top of each local optimization module. Finally, a multi-stage fusion module is constructed: a class token is extracted from the last local optimization module, global video tokens are obtained from all global optimization modules, and the class token and the global video tokens are added as a weighted sum to obtain and output the final fish behavior video representation. In fig. 2 there are N local optimization modules, with N = 4.
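Putting the pieces together, the following is a minimal sketch of how the improved UniFormer module could be wired, reusing the LocalOptimizationBlock, GlobalOptimizationBlock and MultiStageFusion classes sketched above. The stem kernel, channel width, pooled stand-in for the class token and classification head are illustrative assumptions rather than the patented configuration.

import torch.nn as nn

class ImprovedUniFormer(nn.Module):
    # Sketch of the overall wiring: 3D-conv stem, N local blocks, one global block
    # per local block, multi-stage fusion and a classification head (N = 4 here).
    def __init__(self, dim=256, n=4, num_classes=6):
        super().__init__()
        # stem: project the video into spatio-temporal tokens with 8x spatial / 2x temporal downsampling
        self.stem = nn.Conv3d(3, dim, kernel_size=(2, 8, 8), stride=(2, 8, 8))
        self.local = nn.ModuleList([LocalOptimizationBlock(dim) for _ in range(n)])
        self.glob = nn.ModuleList([GlobalOptimizationBlock(dim) for _ in range(n)])
        self.fuse = MultiStageFusion(dim)
        self.head = nn.Linear(dim, num_classes)              # one score per fish behavior class

    def forward(self, video):                                 # video: (B, 3, T, 224, 224)
        x = self.stem(video)
        feats = []
        for blk in self.local:
            x = blk(x)
            feats.append(x)                                   # a global block sits on top of each local block
        cls = x.mean(dim=(2, 3, 4))                           # pooled features stand in for the class token
        z = self.fuse(self.glob, feats, cls)                  # final fish behavior video representation
        return self.head(z)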
Further, in step S4, the DPE refers to the position coding used to add position information at every position of the input sequence; it is generated by combining sine and cosine functions, with the calculation formulas
PE(pos, 2i) = sin(pos / 10000^(2i/d)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),
where pos is the position information, i is the index of the embedding dimension and d is the size of the embedding dimension, so that for each position pos and each embedding dimension index i the corresponding position code PE can be calculated.
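A minimal sketch of this sine/cosine position code follows; the function name is illustrative and the embedding dimension is assumed to be even.

import math
import torch

def sinusoidal_position_encoding(seq_len, d_model):
    # Sketch of PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d));
    # d_model is assumed to be even.
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)        # position index "pos"
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)                                    # even embedding dimensions
    pe[:, 1::2] = torch.cos(pos * div)                                    # odd embedding dimensions
    return pe                                                             # (seq_len, d_model), added to the token sequence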
Further, in step S4, the FFN is used to perform a nonlinear transformation and feature extraction on the feature vector at each position. It consists of two fully connected layers joined by an activation function: the first fully connected layer maps the input feature vector to a hidden layer of intermediate dimension, a ReLU activation is applied after the first layer, and the second fully connected layer maps its output back to the original feature dimension. The FFN can be calculated as
FFN(x) = W_2 * sigma(W_1 * x + b_1) + b_2,
where x is the input feature vector, W_1 and b_1 are the weight and bias of the first fully connected layer, W_2 and b_2 are the weight and bias of the second fully connected layer, and sigma denotes the activation function.
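A minimal sketch of this two-layer FFN; the class name and the choice of hidden dimension are illustrative.

import torch.nn as nn

class FFN(nn.Module):
    # Sketch of FFN(x) = W2 * ReLU(W1 * x + b1) + b2 as two fully connected layers.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)   # first layer: input dimension -> hidden dimension
        self.act = nn.ReLU()                    # activation between the two layers
        self.fc2 = nn.Linear(hidden_dim, dim)   # second layer: back to the original dimension

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))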
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and improvements could be made by those skilled in the art without departing from the inventive concept, which fall within the scope of the present invention.

Claims (2)

1. A fish behavior detection method based on machine vision, characterized by comprising the following steps:
S1, fish behavior video acquisition: acquiring videos of 6 fish behaviors, namely foraging, evasion, social, tour, rest and reproduction behaviors, wherein the acquisition comprises underwater shooting and above-water shooting; foraging behavior includes searching for food, predation and eating activities; evasion behavior includes rapidly moving away from other fish, abrupt turning and hiding; social behavior includes courtship, territorial competition, territorial defense and cooperative predation; tour behavior means regularly patrolling a specific area; rest behavior includes floating at rest, hiding beside shelter and resting on the bottom; and reproductive behavior includes mating, spawning and brooding;
S2, fish behavior video annotation: annotating all fish behavior videos containing the 6 fish behaviors in a continuous-frame manner, so as to obtain a fish behavior data set;
S3, constructing a local optimization module, which comprises a local multi-head time module, a global multi-head space module and an FFN, by the following specific method: the input fish behavior video is projected into 16L spatio-temporal tokens X_in using 3D convolution, where L is the product of the temporal, height and width dimensions of the input video and C is the number of video channels; X_in then undergoes 8x temporal downsampling, 2x spatial downsampling and position embedding; the local optimization module takes X_in as input and produces X_out through
X_T = LMT(Norm(X_in)) + X_in,
X_S = GMS(Norm(X_T)) + X_T,
X_out = FFN(Norm(X_S)) + X_S,
where + denotes element-wise addition, LMT denotes the local multi-head time module, GMS denotes the global multi-head space module, and the FFN consists of two linear projections separated by a GeLU; the Norm in LMT is Batch Norm, while the Norm in GMS and in the FFN is Layer Norm; GMS and the FFN come from the image-pretrained ViT blocks; LMT has a learnable parameter matrix of size T x 1 x 1 in the time dimension, representing the learnable relation, along the time dimension, between a token X_i and the other tokens X_j; GMS mainly focuses on the 1 x H x W tokens within a single frame and is computed per head as
GMS_n(X_i) = sum_j [ exp(Q_n(X_i)^T K_n(X_j)) / sum_j' exp(Q_n(X_i)^T K_n(X_j')) ] V_n(X_j),
where Q_n, K_n and V_n are different linear projections in the n-th head, exp denotes the exponential function and ^T denotes the matrix transpose;
S4, constructing a global optimization module, which comprises a DPE, a global multi-head space-time module and an FFN; the global optimization module takes the token sequence X from the local optimization module and a learnable query q as input and computes
X_C = DPE(X) + X,
X_ST = GMST(Norm(q), Norm(X_C)),
X_out = FFN(Norm(X_ST)) + X_ST,
where + denotes element-wise addition, DPE comes from UniFormer and denotes dynamic position embedding, the FFN consists of two linear projections separated by a GeLU, GMST denotes the global multi-head space-time module, and the Norm in GMST and in the FFN is Layer Norm; GMST is computed per head as
GMST_n(q) = sum_j A_n(q, X_j) V_n(X_j),
where the linear projection V_n converts X into a spatio-temporal context and the cross-affinity matrix A_n(q, X) models the dependency between q and all spatio-temporal tokens X, thereby converting the query q into a video representation; A_n is calculated as
A_n(q, X_j) = exp(Q_n(q)^T K_n(X_j)) / sum_j' exp(Q_n(q)^T K_n(X_j')),
where exp denotes the exponential function, ^T denotes the matrix transpose, and Q_n and K_n are linear projections implemented by linear layers;
for the DPE, which refers to the position coding used to add position information at every position of the input sequence, the DPE is generated by combining sine and cosine functions, with the calculation formulas
PE(pos, 2i) = sin(pos / 10000^(2i/d)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),
where pos is the position information, i is the index of the embedding dimension and d is the size of the embedding dimension, so that for each position pos and each embedding dimension index i the corresponding position code PE can be calculated;
the FFN is used to perform a nonlinear transformation and feature extraction on the feature vector at each position; it consists of two fully connected layers joined by an activation function: the first fully connected layer maps the input feature vector to a hidden layer of intermediate dimension, a ReLU activation is applied after the first layer, and the second fully connected layer maps its output back to the original feature dimension; the FFN can be calculated as
FFN(x) = W_2 * sigma(W_1 * x + b_1) + b_2,
where x is the input feature vector, W_1 and b_1 are the weight and bias of the first fully connected layer, W_2 and b_2 are the weight and bias of the second fully connected layer, and sigma denotes the activation function;
S5, constructing a multi-stage fusion module that fuses the outputs of the local optimization modules and the global optimization modules; for the multi-stage fusion module, the i-th global optimization module is written as G_i; from the tokens X_i of its local optimization module, the global optimization module converts the query q into a video token F_i; the video tokens F_i of all global optimization modules are fused sequentially, using the video token of the previous global optimization module as the query of the current one, i.e. F_i = G_i(F_{i-1}, X_i), so that for N global optimization modules a global video token F = F_N is finally obtained; a class token C is then extracted from the final local optimization module, and the class token and the global video token are added in the form of a weighted sum to obtain and output the final video representation Z = a*C + (1 - a)*F, where + denotes element-wise addition and a is a learnable parameter processed by a Sigmoid function;
S6, constructing an improved UniFormer module, which comprises a 3D convolution module, a downsampling module, the local optimization module, the global optimization module and the multi-stage fusion module; for the improved UniFormer module, the input fish behavior video is projected into a number of spatio-temporal tokens by a 3D convolution, followed by 8x spatial downsampling, 2x temporal downsampling and position embedding; a local optimization module and a global optimization module are constructed, the local optimization module consisting, in order, of a local multi-head time module, a global multi-head space module and an FFN, and the global optimization module consisting, in order, of a DPE, a global multi-head space-time module and an FFN; several local optimization modules are used and connected in sequence, and a global optimization module is introduced on top of each local optimization module; a multi-stage fusion module is constructed, a class token is extracted from the last local optimization module, global video tokens are obtained from all global optimization modules, and the class token and the global video tokens are added as a weighted sum to obtain and output the final fish behavior video representation;
S7, constructing a fish behavior machine vision detection model, which comprises, in order, an input module, the improved UniFormer module, a detection head and an output module;
S8, training the fish behavior machine vision detection model on the fish behavior data set: training multiple times with an increasing number of epochs, using accuracy and recall as evaluation indicators, and selecting the model with the highest accuracy and recall as the fish behavior machine vision detection model;
S9, performing real-time detection with the fish behavior machine vision detection model: filming the area where the fish are located with a camera and feeding the fish behavior video into the model in real time to obtain real-time detection results.
2. The machine-vision-based fish behavior detection method according to claim 1, wherein in step S1, professional underwater photographing equipment is used for underwater shooting; some of the video shooting sites are waters with sufficient light, clear water and high transparency, and some are waters with turbid water where the fish outlines and motion trajectories are still clearly visible; for above-water shooting, some of the video shooting sites are waters with sufficient light, clear water and high transparency, some are clear, highly transparent waters shot at night with supplementary lighting equipment, and some are turbid waters where the fish outlines and motion trajectories are still clearly visible.
CN202410369755.2A 2024-03-29 2024-03-29 Fish behavior detection method based on machine vision Active CN117975572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410369755.2A CN117975572B (en) 2024-03-29 2024-03-29 Fish behavior detection method based on machine vision


Publications (2)

Publication Number Publication Date
CN117975572A CN117975572A (en) 2024-05-03
CN117975572B (en) 2024-06-04

Family

ID=90861573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410369755.2A Active CN117975572B (en) 2024-03-29 2024-03-29 Fish behavior detection method based on machine vision

Country Status (1)

Country Link
CN (1) CN117975572B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049952B (en) * 2022-04-24 2023-04-07 南京农业大学 Juvenile fish limb identification method based on multi-scale cascade perception deep learning network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104969898A (en) * 2015-07-24 2015-10-14 中国水产科学研究院黄海水产研究所 Experiment device used for researching effects on fish behaviors imposed by light and application thereof
WO2020207092A1 (en) * 2019-04-11 2020-10-15 浙江大学 Feedback-type pond recirculating water intelligent feeding system fusing machine vision and infrared detection technology
US11810366B1 (en) * 2022-09-22 2023-11-07 Zhejiang Lab Joint modeling method and apparatus for enhancing local features of pedestrians
CN116824454A (en) * 2023-07-03 2023-09-29 山东大学 Fish behavior identification method and system based on spatial pyramid attention
CN117197727A (en) * 2023-11-07 2023-12-08 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117576616A (en) * 2024-01-08 2024-02-20 中国农业大学 Deep learning-based fish swimming behavior early warning method, system and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kunchang Li et al.; "UniFormer: Unifying Convolution and Self-Attention for Visual Recognition"; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2023-07-05; full text *

Also Published As

Publication number Publication date
CN117975572A (en) 2024-05-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant