CN116824454A

CN116824454A - Fish behavior identification method and system based on spatial pyramid attention

Info

Publication number: CN116824454A
Application number: CN202310808836.3A
Authority: CN
Inventors: 马昕; 张昊; 姜文鑫; 于弋甯; 王历昂
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-07-03
Filing date: 2023-07-03
Publication date: 2023-09-29

Abstract

The invention belongs to the technical field of fish behavior recognition and provides a fish behavior recognition method and system based on spatial pyramid attention. In particular, the spatial pyramid attention module aggregates features from different levels of the baseline network to capture correlations between local and global information. The extracted temporal features are then combined with the spatial features to obtain a new fusion feature. During final classification, these three features (spatial, temporal, and fused) are weighted differently by a learnable multiplier M for behavior recognition.

Description

Fish behavior identification method and system based on spatial pyramid attention

Technical Field

The invention belongs to the technical field of fish behavior identification, and particularly relates to a fish behavior identification method and system based on spatial pyramid attention.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the increasing global demand for aquatic products, the aquaculture industry has developed vigorously, and the welfare of farmed fish is receiving increasing attention. Behavior is an important indicator of the state of fish in aquaculture, and fish behavior identification can provide real-time understanding and early warning of that state, which is very important for intensive aquaculture management. For example, if feeding and hunger behaviors are identified in time, a reasonable and effective feeding strategy can be adopted, improving farming efficiency. By recognizing abnormal behaviors such as fear, an aquaculture manager can understand the influence of external conditions such as temperature, light, and sound on the fish, adjust the culture environment in time, and protect fish welfare.

The traditional method of identifying fish behavior is for a dedicated person to observe the fish over long periods, which is labor-intensive and inefficient. Recent developments in computer vision have provided a low-cost and efficient way to monitor fish behavior.

Conventional machine learning methods for fish behavior recognition rely on hand-crafted features. Beyan et al. extracted several behavioral features of fish, including curvature-scale-space-based features, speed- and acceleration-based features, turn-based features, centroid distance functions, neighborhood features, and loop features. Principal component analysis was used to reduce the dimension of the features, and abnormal behaviors were finally identified by hierarchical classification and hidden Markov model clustering. Liu et al. defined a computer-vision-based feeding activity index, generated by subtracting two consecutive frames and summing all pixel intensities in the difference frame. This index was compared with a manually observed feeding activity index, and the correlation coefficient reached 0.9195. Spontaneous collective behavior can also be used to evaluate the behavioral state of fish effectively. Researchers have analyzed spontaneous collective behaviors of fish by calculating the variation amplitude, interaction force, and dispersion of the water flow field; these behaviors were then quantified and integrated to evaluate the appetite level of the fish.

However, designing hand-crafted features is often time-consuming and labor-intensive. In recent years, deep learning has become prevalent because features can be extracted automatically by a deep neural network, and it has achieved remarkable results in image classification, object detection, object tracking, behavior recognition, and other fields. Deep learning has therefore begun to be applied to aquaculture. Zhou et al. used convolutional neural networks (CNNs) to assess the appetite level of fish. Compared with other machine learning methods, the CNN performed best, with an accuracy of 90%. Yang et al. designed a dual-attention network for image-based analysis of fish behavior. The network captures the spatial relationships between regions of interest in fish feeding images through a position attention module and a channel attention module; several optimization strategies are then used to train the network, and the accuracy on the test set reaches 89.56%.

Although much work has been done on behavior recognition for fish schools, most studies define and recognize fish behavior only from a global perspective. However, some fish behaviors occur among a few individuals in a local area. For example, hunger is typically a global behavior shared by most fish, while fear is typically a local behavior that occurs among a few individuals within a local area. If only global behavior is considered, local behavior is difficult to identify, resulting in poor recognition, and vice versa. Meanwhile, in an actual culture environment, mutual occlusion and small-area aggregation among fish often occur because of excessive stocking density, which makes fish behavior recognition more difficult.

Disclosure of Invention

To solve at least one technical problem in the background art, the invention provides a fish behavior identification method and system based on spatial pyramid attention. It proposes a dual-stream network based on spatial pyramid attention (SPA-TSN) that identifies fish behaviors from both global and local perspectives. The network comprises a spatial stream and a motion stream, each of which uses a spatial pyramid attention module, to acquire the spatial and temporal features of fish behavior, respectively. The extracted temporal and spatial features are fused to obtain a fusion feature, and the three features are assigned different weights in the classification process through a learnable multiplier and then used for behavior recognition.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the first aspect of the invention provides a fish behavior identification method based on spatial pyramid attention, comprising the following steps:

acquiring fish school behavior video data;

inputting the fish school behavior video data into a fish behavior recognition model to obtain a behavior recognition result, wherein the fish behavior recognition model is constructed as follows:

processing fish school behavior video data into RGB images and optical flow images;

adopting a dual-stream network in which each stream uses a spatial pyramid attention module to acquire its features: the spatial stream takes RGB images as input to extract the spatial features of fish behavior, and the motion stream takes optical flow images as input to extract the temporal features of fish behavior; the spatial pyramid attention module aggregates features from different levels of the baseline network and captures the behavior semantics of fish from global and local perspectives; the extracted temporal features are fused with the spatial features to obtain fusion features; and the three features are assigned different weights in the classification process through a learnable multiplier for behavior recognition.

A second aspect of the present invention provides a fish behavior recognition system based on spatial pyramid attention, comprising:

the data acquisition module is used for acquiring fish school behavior videos;

the fish behavior recognition module is used for inputting the fish school behavior video into a fish behavior recognition model to obtain a behavior recognition result, wherein the fish behavior recognition model is constructed as follows:

processing the fish school behavior video into RGB images and optical flow images;

A third aspect of the present invention provides a computer-readable storage medium.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps in the method for identifying fish behavior based on spatial pyramid attention as described in the first aspect.

A fourth aspect of the invention provides a computer device.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the fish behavior identification method based on spatial pyramid attention as described in the first aspect when the program is executed.

Compared with the prior art, the invention has the beneficial effects that:

1. The SPA-TSN of the invention aggregates features from different levels of ResNet-50 through a spatial pyramid module and captures fish behavior information from global and local perspectives. Furthermore, an attention module enables the model to focus on the region of interest and disregard mutual occlusion and small-area aggregation among fish.

2. The dual-stream fusion method provided by the invention combines spatial and temporal features, and also assigns different weights to the features according to their importance to the current class. This enables the model to focus on the most relevant information for each class and achieve a higher accuracy.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

FIG. 1 is a diagram of an experimental system provided by an embodiment of the present invention;

FIG. 2 is a fish behavior data set provided by an embodiment of the present invention; wherein (a) - (d) in fig. 2 are feeding behavior, fear behavior, hunger behavior and normal behavior, respectively;

FIG. 3 is a dual-flow network architecture based on spatial pyramid attention provided by an embodiment of the present invention;

FIG. 4 is a residual block provided by an embodiment of the present invention;

FIG. 5 is an RGB image and its corresponding optical flow image provided by an embodiment of the present invention; fig. 5 (a) - (d) are different RGB images, and fig. 5 (e) - (h) are optical flow images corresponding to the RGB images;

FIG. 6 is an attention module provided by an embodiment of the present invention;

FIG. 7 is a spatial pyramid module provided by an embodiment of the present invention;

FIG. 8 illustrates two general dual stream fusion methods provided by embodiments of the present invention; feature level fusion in fig. 8 (a), decision level fusion in fig. 8 (b);

FIG. 9 is a confusion matrix for different models provided by embodiments of the present invention; fig. 9 (a) is a spatial stream, fig. 9 (b) is a motion stream, fig. 9 (c) is a decision level fusion, fig. 9 (d) is a feature level fusion, and fig. 9 (e) is a method of the present invention;

FIG. 10 is a graph of accuracy of a comparative experiment provided by an embodiment of the present invention;

FIG. 11 is a view of the SPA-TSN region of interest provided by an embodiment of the present invention, where (a) - (d) in FIG. 11 show overall fish behavior and (e) - (f) in FIG. 11 show local fish behavior.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Accurate monitoring of fish behavior is key to improving the efficiency of intensive farming. Past research on fish behavior recognition has mostly been limited to recognizing fish school behavior from a global perspective and has neglected the importance of recognizing behavior in local regions. The invention provides a dual-stream network based on spatial pyramid attention, SPA-TSN, which aggregates features from different levels of ResNet-50 through a spatial pyramid module and captures fish behavior information from global and local perspectives. Furthermore, an attention module enables the model to focus on the region of interest and disregard mutual occlusion and small-area aggregation among fish.

To verify the method of the invention, a fish behavior data set consisting of four typical fish behaviors was collected in a real farming environment, and experiments were performed on it. The experimental results show that the SPA-TSN model reaches an accuracy of 95.904% on the test set, outperforming other advanced methods. In addition, class activation maps visualized with Grad-CAM show that the proposed model can identify fish behavior at both the global and local levels, achieving the intended goal.

Example 1

The embodiment provides a fish behavior identification method based on spatial pyramid attention, which comprises the following steps:

step 1: acquiring a fish swarm behavior data set;

Twenty fish school behavior videos were collected continuously over a farming period of fifteen days. During acquisition, the shooting angle was changed frequently to increase the diversity of the data set.

With the help of expert experience, behaviors in the videos are classified into four types: feeding behavior, fear behavior, hunger behavior, and normal behavior, as shown in fig. 2. A total of 645 video clips were annotated, each containing on average 4 seconds of fish school behavior. The characteristics of each behavior are shown in table 1.

TABLE 1 characterization of four fish behaviors

Step 2: data preprocessing

To address the insufficient sample size, this embodiment uses data augmentation to expand the data set.

Specifically, the number of samples is increased by flipping and rotating, bringing the data set to 1990 samples. The samples are divided into a training set and a test set at a ratio of 7:3. Samples from the same video appear in only one of the two subsets, to avoid the model overfitting to video-specific conditions (e.g. lighting conditions, camera angles, etc.).
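The video-level split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the embodiment: the function name `split_by_video`, the `(video_id, clip_id)` sample layout, and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict

def split_by_video(samples, train_ratio=0.7, seed=0):
    """Split (video_id, clip_id) samples so that no video spans both subsets."""
    by_video = defaultdict(list)
    for video_id, clip_id in samples:
        by_video[video_id].append((video_id, clip_id))

    # Shuffle whole videos, not individual clips.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)

    target = train_ratio * len(samples)
    train, test = [], []
    for video_id in video_ids:
        # Fill the training set video by video until it reaches the target size.
        bucket = train if len(train) < target else test
        bucket.extend(by_video[video_id])
    return train, test

# Hypothetical data: 20 videos with 10 clips each.
samples = [(v, c) for v in range(20) for c in range(10)]
train, test = split_by_video(samples)
```

Because whole videos are assigned to one side, the achieved ratio only approximates 7:3 when videos contribute different numbers of clips.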

The data set distribution is shown in table 2.

TABLE 2 data set distribution

Step 3: fish behavior identification

The invention designs a fish behavior recognition algorithm (SPA-TSN) based on spatial pyramid attention, and realizes accurate recognition of fish behaviors. Fig. 3 shows the overall architecture of the proposed recognition algorithm.

Specifically, the fish behavior recognition model includes two parts: spatial stream CNN and motion stream CNN. For each stream, a spatial pyramid attention structure is designed for capturing the behavioral semantics of the fish from a global and local perspective.

First, the spatial stream and the motion stream take RGB images and optical flow images as input, respectively, and extract the spatial and temporal features of fish behavior. The spatial and temporal features are then fused to obtain the final classification result.

The dual-stream network comprises two parts, the spatial stream CNN and the motion stream CNN, which capture the spatial and temporal features of fish behavior, respectively. Because residual networks perform well across many fields of deep learning, this embodiment selects ResNet-50 as the base network for each stream. Table 3 shows the overall architecture of ResNet-50.

Table 3. Overall architecture of ResNet-50

Specifically, conv1 is a single convolutional layer, and conv2_x to conv5_x are formed by stacking different numbers of residual blocks. Fig. 4 (a) and (b) show the two types of residual block, which differ in whether a 1 × 1 convolutional layer is used in the residual connection; together they make up conv2_x to conv5_x.

Spatial stream: a single video frame is randomly cropped, normalized, and fed into the spatial stream network. The input dimension is 3 × 224 × 224, and the output is a fixed-length feature vector containing the spatial information of fish behavior.

Optical flow: the optical flow method uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion of objects between adjacent frames. Conventional approaches generally use the Lucas-Kanade method to calculate optical flow; with the advent of deep learning, however, researchers began to solve optical flow estimation with end-to-end network models. FlowNet is a deep learning model that aims to overcome the limitations of traditional methods in computation speed and accuracy, and it is now widely adopted for optical flow estimation.

Therefore, the invention adopts the FlowNet model to calculate the optical flow of fish school motion so as to obtain the temporal information of fish school behavior, and visualizes the calculated optical flow as RGB images. Fig. 5 (a) - (h) show the original RGB images of four fish behaviors and the optical flow images generated from adjacent frames.

Motion stream: the motion stream network stacks 10 consecutive optical flow images as its input, with the same preprocessing as the spatial stream. The input dimension is 30 × 224 × 224, and the output is likewise a fixed-length feature vector containing the temporal information of fish behavior.
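The input dimension of the motion stream follows from stacking the optical flow images along the channel axis: 10 images rendered as 3-channel RGB give 30 channels. The sketch below only checks this shape arithmetic with nested lists; the helper names `stack_flow_frames` and `shape` are hypothetical, and a real implementation would use tensors.

```python
def stack_flow_frames(flow_images):
    """Concatenate images along the channel axis.

    flow_images: list of images, each a list of channels (each channel is H x W).
    """
    stacked = []
    for image in flow_images:
        stacked.extend(image)
    return stacked

def shape(tensor):
    """Return (channels, height, width) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))

H = W = 224
zero_channel = [[0.0] * W for _ in range(H)]
# One optical flow image visualized as RGB: 3 channels (placeholder values).
flow_image = [zero_channel, zero_channel, zero_channel]

# 10 consecutive optical flow images form one motion-stream input.
motion_input = stack_flow_frames([flow_image] * 10)
```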

Attention mechanisms were originally applied in natural language processing to capture contextual relationships. In recent years, they have been widely used in computer vision to capture global correlations between local features. In this embodiment, an attention module is introduced to focus the network on the region of interest, addressing mutual occlusion and small-area aggregation among fish. The attention module is plug-and-play and can accept input of any dimension. Its structure is shown in fig. 6.

Given an input $x \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map, respectively, the attention operation can be represented by formulas (1)-(2):

$y = x + \mathrm{conv}(z_i)$ (2)

where $x$ is the input feature, $y$ is the output feature, and $Q$, $K$, and $V$ correspond to the query, key, and value, respectively. $Q$, $K$, and $V$ are all generated from the input feature $x$ by convolution layers with a kernel size of 1 × 1, and their feature dimension is $d_k = C/n$. Here $n$ is an adjustable variable whose purpose is to reduce the computational complexity of the self-attention operation. In conv3_x and conv4_x, $n = 2$ in this embodiment; in conv5_x, $n = 4$.

The feature $z_i$ calculated from formula (1) still has dimension $d_k$. To restore the dimension of the input feature $x$, $z_i$ is passed through a convolution layer with a kernel size of 1 × 1 and then added to $x$ through a residual connection. In this way, the attention module can be inserted anywhere in the network without destroying the original information.
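A minimal sketch of the attention computation is given below. It assumes that formula (1) takes the standard scaled dot-product form $z = \mathrm{softmax}(QK^{T}/\sqrt{d_k})\,V$, which the text does not spell out, and it treats each 1 × 1 convolution as a per-position linear map over the flattened feature map (H·W positions, C channels each). The function names and the toy weights are illustrative assumptions.

```python
import math

def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix (nested lists)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_block(x, w_q, w_k, w_v, w_out):
    """x: (H*W) x C positions; w_q, w_k, w_v: C x d_k; w_out: d_k x C."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Assumed form of formula (1): scaled dot-product attention.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    z = matmul(weights, v)
    # Formula (2): project z back to C channels and add the residual.
    out = matmul(z, w_out)
    y = [[xi + oi for xi, oi in zip(x_row, o_row)] for x_row, o_row in zip(x, out)]
    return y, weights

# Toy example: 3 positions, C = 4 channels, n = 2, so d_k = C / n = 2.
x = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0], [2.0, 1.0, 0.0, 1.0]]
w_q = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]
w_k = [[0.2, 0.0], [0.0, 0.2], [0.0, 0.1], [0.1, 0.0]]
w_v = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [0.0, 0.0]]
w_out = [[0.1, 0.0, 0.0, 0.1], [0.0, 0.1, 0.1, 0.0]]
y, weights = attention_block(x, w_q, w_k, w_v, w_out)
```

If the output projection is all zeros, `y` equals `x`, which is the sense in which the residual connection keeps the original information intact.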

In general, the deep layers of a convolutional neural network have larger receptive fields and richer semantic information, but while they capture high-level visual semantics they often ignore important local information. In contrast, the shallow layers have high resolution but lack global semantics. To capture the global and local behaviors of fish simultaneously, the invention designs a spatial pyramid module. This module obtains global and local fish behavior semantics simultaneously by aggregating features from different levels of ResNet-50, as shown by the black dashed box in fig. 7.

Considering that the shallow layers may introduce too much noise and higher computational complexity, this embodiment aggregates only the features of the three levels conv3_x, conv4_x, and conv5_x, denoted $F_3$, $F_4$, $F_5$. Here $F_n \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, $H$ and $W$ are the height and width of the feature map, and $n \in \{3, 4, 5\}$ indexes the level of ResNet-50. The values of $H$, $W$, and $C$ vary with $n$. Specifically, $F_3 \in \mathbb{R}^{512 \times 28 \times 28}$, $F_4 \in \mathbb{R}^{1024 \times 14 \times 14}$, and $F_5 \in \mathbb{R}^{2048 \times 7 \times 7}$.

$F_3$, $F_4$, and $F_5$ are processed by the attention module and then globally pooled. Notably, for $F_3$ and $F_4$, a 1 × 1 convolutional layer is added for dimension adjustment before global pooling. After global pooling, the output feature vectors $F_{o3}$, $F_{o4}$, $F_{o5}$ corresponding to the three levels conv3_x, conv4_x, and conv5_x can be expressed as $F_{on} = [f_{n,1}, f_{n,2}, \ldots, f_{n,N}]$, where $n \in \{3, 4, 5\}$ and $N = 2048$.

$F_{o3}$, $F_{o4}$, and $F_{o5}$ are concatenated to obtain the final output $F_o \in \mathbb{R}^{3 \times 2048}$.
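The aggregation step can be sketched with toy sizes as follows. The attention step is omitted, global pooling is assumed to be average pooling, and the channel counts (2, 4, 8) stand in for (512, 1024, 2048); the function names and weights are hypothetical.

```python
def global_avg_pool(feature):
    """feature: C x H x W nested list -> length-C vector of channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

def conv1x1(feature, weight):
    """Apply a C_out x C_in linear map at every spatial position."""
    height, width = len(feature[0]), len(feature[0][0])
    return [[[sum(wc * feature[c][i][j] for c, wc in enumerate(w_row))
              for j in range(width)] for i in range(height)] for w_row in weight]

def pyramid_aggregate(f3, f4, f5, w3, w4):
    """Bring F3 and F4 to the channel count of F5, pool each level, and stack."""
    fo3 = global_avg_pool(conv1x1(f3, w3))
    fo4 = global_avg_pool(conv1x1(f4, w4))
    fo5 = global_avg_pool(f5)
    return [fo3, fo4, fo5]

def constant(channels, height, width, value):
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

# Toy stand-ins: shallower levels have fewer channels and higher resolution.
f3 = constant(2, 4, 4, 1.0)
f4 = constant(4, 2, 2, 1.0)
f5 = constant(8, 1, 1, 1.0)
w3 = [[0.5] * 2 for _ in range(8)]   # 2 -> 8 channels
w4 = [[0.25] * 4 for _ in range(8)]  # 4 -> 8 channels
f_o = pyramid_aggregate(f3, f4, f5, w3, w4)
```

With the real sizes, the result would have shape 3 × 2048, one pooled vector per level.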

Fish behavior identification requires both spatial and temporal information, so the fusion of spatiotemporal features is critical to the final classification. The two common dual-stream fusion approaches are feature-level fusion and decision-level fusion, as shown in fig. 8. Feature-level fusion concatenates or adds the features extracted by the spatial stream and the motion stream to obtain a fusion feature, from which the classification result is obtained. Decision-level fusion generates a classification score for the spatial stream and the motion stream separately and then fuses the two scores to obtain the final classification result.

The invention designs a dual-stream fusion method for integrating spatiotemporal information. The fusion method combines the spatial and temporal features with a learned weighting scheme, which is learned during training using a cross-entropy loss function and the backpropagation algorithm.
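The two common fusion approaches can be contrasted with a small sketch. The classifier, the equal score weights, and the toy vectors are assumptions chosen only to make the difference visible; feature-level fusion is shown in its concatenation variant.

```python
def feature_level_fusion(f_rgb, f_flow, classifier):
    """Concatenate the two feature vectors, then classify the result once."""
    return classifier(f_rgb + f_flow)  # list concatenation

def decision_level_fusion(scores_rgb, scores_flow, w_rgb=0.5, w_flow=0.5):
    """Classify each stream separately, then combine the per-class scores."""
    return [w_rgb * a + w_flow * b for a, b in zip(scores_rgb, scores_flow)]

def toy_classifier(features):
    """Hypothetical two-class scorer over a 4-dimensional fused feature."""
    return [sum(features[:2]), sum(features[2:])]

feature_scores = feature_level_fusion([1.0, 2.0], [3.0, 4.0], toy_classifier)
decision_scores = decision_level_fusion([1.0, 0.0], [0.0, 1.0])
```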

For a given input video clip, the spatial and temporal information is extracted by the spatial stream and the motion stream, denoted $F_{o,\mathrm{rgb}}$ and $F_{o,\mathrm{flow}}$, respectively. $F_{o,\mathrm{rgb}}$ and $F_{o,\mathrm{flow}}$ are then added element-wise to generate a new fusion feature $F_{o,\mathrm{fusion}}$, as shown in formula (3):

$F_{o,\mathrm{fusion}} = F_{o,\mathrm{rgb}} + F_{o,\mathrm{flow}}$ (3)

where $F_{o,\mathrm{fusion}}$ has the same dimension as $F_{o,\mathrm{rgb}}$ and $F_{o,\mathrm{flow}}$, namely 3 × 2048.

$F_{o,\mathrm{fusion}}$, $F_{o,\mathrm{rgb}}$, and $F_{o,\mathrm{flow}}$ are each passed through a fully connected layer to generate their respective classification scores $C_{\mathrm{fusion}}$, $C_{\mathrm{rgb}}$, $C_{\mathrm{flow}}$. Trainable multipliers $M_{\mathrm{fusion}}$, $M_{\mathrm{rgb}}$, $M_{\mathrm{flow}}$ are then used to automatically adjust the weights of the three features in the final classification result, as shown in formula (4):

where $\hat{y}$ is the predicted classification result and $\odot$ denotes element-wise multiplication.
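A minimal sketch of the weighted fusion is given below. It assumes that formula (4), which is not reproduced in this text, has the form $\hat{y} = \mathrm{softmax}(M_{\mathrm{fusion}} \odot C_{\mathrm{fusion}} + M_{\mathrm{rgb}} \odot C_{\mathrm{rgb}} + M_{\mathrm{flow}} \odot C_{\mathrm{flow}})$ with one multiplier entry per class. The multiplier values here are fixed by hand for illustration; in training they would be updated by backpropagation.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_fusion(c_fusion, c_rgb, c_flow, m_fusion, m_rgb, m_flow):
    """Scale each score vector element-wise by its multiplier, sum, and normalize."""
    combined = [
        mf * cf + mr * cr + mo * co
        for cf, cr, co, mf, mr, mo in zip(c_fusion, c_rgb, c_flow, m_fusion, m_rgb, m_flow)
    ]
    return softmax(combined)

# Hypothetical scores for the four classes: feeding, fear, hunger, normal.
c_fusion = [2.0, 0.5, 0.1, 0.3]
c_rgb = [1.5, 0.2, 0.4, 0.1]
c_flow = [0.3, 1.8, 0.2, 0.1]
# Hypothetical per-class multipliers: trust RGB for feeding, flow for fear.
m_fusion = [1.0, 1.0, 1.0, 1.0]
m_rgb = [1.2, 0.3, 1.0, 1.0]
m_flow = [0.3, 1.2, 1.0, 1.0]
y_hat = weighted_fusion(c_fusion, c_rgb, c_flow, m_fusion, m_rgb, m_flow)
```

Setting the RGB and flow multipliers to zero reduces the prediction to the fused score alone, which shows how the multipliers control the contribution of each feature.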

The cross-entropy function is selected as the loss function, as shown in formula (5).

where y is the true class label, n is the batch size, and x is the class number.
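As an illustration, the usual batch-averaged cross-entropy can be computed as below. This assumes formula (5) is the standard form, the negative log of the probability assigned to the true class averaged over the batch; the function name and inputs are hypothetical.

```python
import math

def cross_entropy(probs, labels):
    """probs: one predicted class-probability list per sample; labels: true class indices."""
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n

# A uniform prediction over four classes costs log(4) per sample.
uniform_loss = cross_entropy([[0.25, 0.25, 0.25, 0.25]], [2])
# A confident correct prediction costs nothing.
perfect_loss = cross_entropy([[1.0, 0.0, 0.0, 0.0]], [0])
```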

The experimental subjects are plaque carving cultured in pool D1 of the first recirculating aquaculture workshop of an aquatic products company in Shandong Province. These fish had been cultured in the recirculating water system for about 6 months before the experiment. The fish are fed by an automatic feeding machine five times a day.

The fish behavior recognition experiment was performed in a real industrial recirculating aquaculture environment, as shown in fig. 1. The culture pond is square, with a side length of 6.85 m, a height of 1 m, and a water depth of about 0.8 m. The water temperature is about 25 °C, and the pH is about 8. An RGB camera positioned above the culture pond, about 2 m above the ground, continuously collects fish behavior video at a resolution of 1920 × 1080 and 30 frames per second. The deep learning model is deployed on two Nvidia GTX 1080Ti servers and implemented with the PyTorch framework in a Linux Ubuntu 20.04.3 LTS environment.

Four evaluation metrics (accuracy, precision, recall, and specificity) are used to evaluate the performance of the proposed fish behavior recognition method. Accuracy is the ratio of correctly classified samples to the total number of samples. Precision is the ratio of the number of correctly predicted samples to the total number of samples predicted as a given class. Recall is the ratio of the number of correctly predicted samples to the number of actual samples of a class, and specificity is the proportion of actual negative samples that are correctly identified as negative. Accuracy, precision, recall, and specificity are calculated as shown in formulas (6) - (9):

TP, FN, FP, and TN represent true positives, false negatives, false positives, and true negatives, respectively.
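The four metrics can be computed from the confusion counts as sketched below, following the verbal definitions above; formulas (6) - (9) are assumed to be the standard expressions. For a multi-class problem the counts are taken one class against the rest. The confusion matrix layout (rows are true classes, columns are predicted classes) and the numbers are assumptions for the example.

```python
def one_vs_rest_counts(confusion, k):
    """Return (TP, FN, FP, TN) for class k; rows are true classes, columns predicted."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = total - tp - fn - fp
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical two-class confusion matrix.
confusion = [[8, 2],
             [1, 9]]
counts = one_vs_rest_counts(confusion, 0)
result = metrics(*counts)
```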

In this section, a number of experiments were performed on fish behavior datasets to evaluate the performance of the algorithms presented in this invention. First, an ablation experiment was performed to verify the effectiveness of each of the modules presented in section 2. Next, the model of the present invention is compared with other advanced methods. Finally, experimental results of the model are discussed.

To improve the performance of the model, the invention adopts transfer learning. The models were pre-trained on the UCF101 dataset. UCF101 is a dataset of realistic human action videos collected from YouTube, providing 13320 videos from 101 action categories. After pre-training on UCF101, the model is further trained on the fish behavior data set of the invention. The main hyperparameters used in training are shown in table 4.

TABLE 4 main hyperparameters

In order to verify the effectiveness of the three operations proposed by the present invention (spatial pyramid module, attention module and dual-flow fusion method), the following ablation experiments were designed. The base network of each stream is first tested, then the spatial pyramid and the attention module are added in sequence. Finally, the process is repeated with dual stream inputs.

The experimental results are shown in table 5. First, experiments were run on the spatial stream and motion stream baselines (named RGB only and Flow only), which reached test-set accuracies of 86.348% and 91.809%, respectively. After adding the spatial pyramid module (SP), the test-set accuracies rose to 90.444% and 94.198%, respectively. After further adding the attention module (Attention), they reached 91.468% and 95.222%. With dual-stream input, the test-set accuracy in the three cases was 92.491%, 94.888%, and 95.904%, respectively.

Table 5 ablation experimental results

Next, experiments were performed to evaluate the effectiveness of the proposed fusion method. Three fusion methods were tested: feature-level fusion, decision-level fusion, and the dual-stream fusion method proposed by the invention. The results are shown in table 6; the accuracies of the three methods are 93.857%, 94.368%, and 95.904%, respectively. Surprisingly, feature-level fusion and decision-level fusion are not even as accurate as the motion stream alone after the spatial pyramid module and the attention module are added (95.222%). To explore the reasons, the confusion matrices of the spatial stream, the motion stream, and the three dual-stream fusion methods were plotted and analyzed, as shown in fig. 9. The diagonal shows the number of correctly predicted behaviors; darker colors indicate higher confidence and lighter colors indicate lower confidence.

Analysis of the confusion matrices shows that feeding behavior depends more on spatial distribution information and is recognized more accurately by the spatial stream, whereas fear behavior depends more on temporal motion information and is recognized better by the motion stream. For decision-level fusion, the result is very similar to that of the motion stream alone, which indicates that the motion stream receives a higher weight during fusion and the spatial distribution information provided by the spatial stream is neglected. Feature-level fusion, on the other hand, does fuse temporal and spatial features but does not achieve the expected result. Presumably, feature-level fusion loses some of the dependence of a behavior on a particular stream (for example, the stronger dependence of fear behavior on the motion stream), resulting in poorer recognition accuracy.

In contrast, the dual stream fusion method proposed by the present invention not only combines spatial and temporal features, but also assigns different weights to these features according to their importance to the current class. This enables the model to focus on the most relevant information for each class and achieve a higher accuracy. Experimental results show that the fusion method has the highest accuracy in all fusion methods, and the accuracy on a fish behavior data set reaches 95.904%. The results demonstrate the effectiveness of the proposed dual stream fusion method in improving performance of the behavior recognition model.

To verify the advancement of the proposed SPA-TSN model, comparison experiments were performed with other advanced fish behavior recognition deep learning algorithms, including AlexNet, resNet-50, resNet-101, resNet152, DAN-EfficientNet-B2, mobileNet_v2, C3D, DSC D, all of which were pre-trained on UCF-101, and the experimental results are shown in Table 7. The results showed that the accuracy of AlexNet, resNet-50, resNet-101, resNet-152, DAN-EfficientNet-B2 and MobileNet V2-SENet were all lower than the motion flow baseline (91.809%). This is probably because these models use only a single frame RGB image as input, ignoring time information critical to fish behavior. It is also notable that the ResNet-101 model performs better on the test set than ResNet-50 and ResNet-152. To explore the cause, precision curves of different models on the test set were generated as shown in fig. 10. It can be seen that AlexNet is less accurate due to the shallower network structure. However, while ResNet-152 has a deeper network structure, it exhibits poor performance and is difficult to converge even after 40 iterative training. This suggests that increasing network depth may increase model efficiency, but too deep a network structure may result in over fitting, making the network difficult to train. Compared with other models (C3D and DSC 3D) considering time information, the accuracy of the method provided by the invention is respectively improved by 3.071% and 2.028%.

The comparison experiments show that the method provided by the invention has the best performance, which is mainly attributable to the following two factors. First, a dual-stream network structure is adopted to capture the spatio-temporal information of fish behaviors. Second, the spatial pyramid attention module can extract the behavioral characteristics of fish from both global and local perspectives.

Table 7 Comparison of experimental results

The SPA-TSN model is a deep learning algorithm that identifies fish behavior from both global and local perspectives, and it shows superior recognition accuracy compared with the most advanced methods. To further evaluate the effectiveness of the model, class activation maps of the model are visualized, as shown in Fig. 11. Images of typical group behaviors and local behaviors are selected from the data set for visualization. Group fish behavior is shown in Fig. 11 (a)-(d), and local fish behavior is shown in Fig. 11 (e)-(f).

Notably, in Fig. 11 (f), the model of the present invention accurately detects that a fish at the bottom of the image shows fear behavior, indicating that the model is able to detect local behavior. At the same time, the model also recognizes the fear behavior of a group of fish at the upper right of the image, which demonstrates its ability to capture global behavior.
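The description does not state which class activation mapping variant is used for the visualization. As an illustration only, the following sketch assumes the classic formulation, in which the map for a target class is the weighted sum of the last convolutional feature maps, using the classifier weights of that class. The function name and the toy values are hypothetical; clipping negative values and scaling to [0, 1] are display choices made here, not part of the description.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) for one target class.

    feature_maps:  list of K two-dimensional lists.
    class_weights: list of K classifier weights for the target class.
    Negative values are clipped and the map is scaled to [0, 1] for display.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Toy example: two 2 x 2 feature maps and hypothetical weights for one class.
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [0.5, 1.0]
cam = class_activation_map(feature_maps, weights)
print(cam)
```

The highest value in the resulting map marks the image region that contributes most to the score of the chosen class, which is how a single frightened fish or a frightened group would be highlighted.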

Example II

The present embodiment provides a fish behavior recognition system based on spatial pyramid attention, including:

the data acquisition module is used for acquiring fish school behavior videos;

a dual-stream network is adopted, and for each stream a spatial pyramid attention module is used to obtain the corresponding features: the spatial stream takes RGB images as input to extract the spatial features of fish behavior, and the motion stream takes optical flow images as input to extract the temporal features of fish behavior; the spatial pyramid attention module aggregates features from different levels of the baseline and captures the behavior semantics of fish from global and local perspectives; the extracted temporal features are fused with the spatial features to obtain fused features; and, in the classification process, different weights are assigned to the three kinds of features through a learnable multiplier for behavior recognition.
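The aggregation of features from different levels of the baseline can be sketched as follows. The sketch follows the order given in claim 5 (attention on the third, fourth and fifth levels, then global pooling, then concatenation of the pooled vectors). The attention step is replaced by a simple stand-in channel gate, because the internal structure of the spatial pyramid attention module is not reproduced here; all function names and feature sizes are hypothetical.

```python
import math

def global_average_pool(feature):
    """Reduce a feature of C channels (each H x W) to a C-length vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]

def channel_gate(feature):
    """Stand-in for the attention module: scale each channel by the
    sigmoid of its mean activation (an assumption, not the patented module)."""
    means = global_average_pool(feature)
    gates = [1.0 / (1.0 + math.exp(-m)) for m in means]
    return [
        [[gate * v for v in row] for row in channel]
        for channel, gate in zip(feature, gates)
    ]

def aggregate_levels(level_features):
    """Apply attention and global pooling per level, then concatenate."""
    descriptor = []
    for feature in level_features:
        descriptor.extend(global_average_pool(channel_gate(feature)))
    return descriptor

# Toy features for the third, fourth and fifth levels (channels x H x W).
level3 = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]  # 2 channels, 4 x 4
level4 = [[[0.0] * 2 for _ in range(2)] for _ in range(3)]  # 3 channels, 2 x 2
level5 = [[[2.0]] for _ in range(4)]                        # 4 channels, 1 x 1

descriptor = aggregate_levels([level3, level4, level5])
print(len(descriptor))
```

Because each level is pooled to one value per channel before concatenation, levels with different spatial resolutions can be combined into a single vector whose length is the sum of the channel counts.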

Example III

The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the fish behavior identification method based on spatial pyramid attention as described in embodiment one.

Example IV

The present embodiment provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps in the fish behavior identification method based on spatial pyramid attention as described in embodiment one.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored on a computer-readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A fish behavior identification method based on spatial pyramid attention, characterized by comprising the following steps:

acquiring fish school behavior video data;

2. The fish behavior recognition method based on spatial pyramid attention of claim 1, wherein the data set is expanded by a data enhancement technique after the acquisition of the fish behavior video data.

3. The method for identifying fish behavior based on spatial pyramid attention of claim 1, wherein processing the fish school behavior video data into RGB images and optical flow images comprises: calculating the optical flow of the fish school motion by adopting a FlowNet model to obtain time information of the fish school behavior, and visualizing the calculated optical flow in the format of an RGB picture.

4. The fish behavior recognition method based on spatial pyramid attention as in claim 1, wherein said learnable multiplier is learned during training using a cross-entropy loss function and a back-propagation algorithm.

5. The fish behavior recognition method based on spatial pyramid attention as set forth in claim 1, wherein when the spatial pyramid attention module aggregates features from different levels of the baseline, only the features of the third level, the fourth level and the fifth level are aggregated, global pooling is performed after the processing of the attention module, and after global pooling, the feature vectors output by the three levels are concatenated to obtain the final output.

6. The fish behavior recognition method based on spatial pyramid attention as recited in claim 1, wherein a network of lines for each stream is used as a network of lines for each stream.

7. The method for identifying fish behavior based on spatial pyramid attention as recited in claim 1, wherein the behavior identification result includes feeding behavior, fear behavior, hunger behavior, and normal behavior.

8. A fish behavior recognition system based on spatial pyramid attention, characterized by comprising:

the data acquisition module is used for acquiring fish school behavior videos;

9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the fish behavior identification method based on spatial pyramid attention as claimed in any one of claims 1-7.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps in the fish behavior identification method based on spatial pyramid attention as claimed in any one of claims 1-7 when the program is executed.
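By way of illustration of the training recited in claim 4, the following minimal sketch performs one gradient-descent update of a learnable multiplier under a cross-entropy loss. It assumes the multiplier holds one weight per feature and per class and that the class logits are the multiplier-weighted sums of the three features' class scores; the function names, score values, learning rate and initial multiplier are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(stream_scores, multiplier):
    """Class probabilities from multiplier-weighted sums of the feature scores."""
    num_classes = len(stream_scores[0])
    logits = [
        sum(multiplier[k][c] * stream_scores[k][c] for k in range(len(stream_scores)))
        for c in range(num_classes)
    ]
    return softmax(logits)

def cross_entropy(probs, label):
    """Cross-entropy loss for a single sample with an integer class label."""
    return -math.log(probs[label])

def sgd_step(stream_scores, multiplier, label, lr=0.1):
    """One back-propagation update of the multiplier.

    For softmax with cross-entropy, d(loss)/d(logit_c) = p_c - y_c, and
    d(logit_c)/d(M[k][c]) = score of feature k for class c.
    """
    probs = forward(stream_scores, multiplier)
    updated = []
    for scores, weights in zip(stream_scores, multiplier):
        row = []
        for c, (score, weight) in enumerate(zip(scores, weights)):
            target = 1.0 if c == label else 0.0
            row.append(weight - lr * (probs[c] - target) * score)
        updated.append(row)
    return updated

# Hypothetical class scores (spatial, temporal, fused) for one labeled sample.
scores = [
    [2.0, 0.5, 0.1, 0.3],
    [0.4, 2.5, 0.2, 0.1],
    [1.0, 1.8, 0.3, 0.2],
]
label = 1  # index of the true class in this toy example
multiplier = [[1.0] * 4 for _ in range(3)]

loss_before = cross_entropy(forward(scores, multiplier), label)
updated = sgd_step(scores, multiplier, label)
loss_after = cross_entropy(forward(scores, updated), label)
print(loss_before, loss_after)
```

In this toy case the update raises the weights attached to the true class and lowers those of the competing classes, so the loss on the sample decreases after the step.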