CN115797841A - Quadruped animal behavior identification method based on adaptive space-time diagram attention Transformer network - Google Patents
Quadruped animal behavior identification method based on adaptive space-time diagram attention Transformer network
- Publication number
- CN115797841A (application CN202211588021.0A)
- Authority
- CN
- China
- Prior art keywords
- topology
- animal
- channel
- attention
- skeleton
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/70—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry
Abstract
A quadruped animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network comprises the following steps: 1) Collecting video images of quadruped animals, and marking joint points with DeepLabCut to generate a skeleton topology graph; 2) Combining the physical topology of the joint connections with the latent topology using a Channel topology optimization module to generate a Channel-wise skeleton topology, then assigning different connection weights to the Channel-wise skeleton topology through a Spatial-Transformer module; 3) Modeling the inter-frame topological correlation with a multi-scale temporal convolution module; 4) Identifying, through a fully connected layer and a Softmax classification layer, 5 daily behaviors of the animals: chasing, resting, eating, alertness and walking. In the invention, the spatial feature extraction module captures all possible connection patterns among the joint points of animal behaviors and assigns different connection weights; the temporal feature extraction module captures the change of animal posture over the consecutive frames of a behavior, thereby improving the accuracy of animal behavior recognition; the method therefore has good application prospects.
Description
Technical Field
The invention relates to a behavior recognition method, and in particular to a spatio-temporal graph convolution network model constructed from an attention Transformer and an adaptive skeleton graph topology, and to a method of applying it to recognize skeleton-based behaviors of quadruped animals.
Background
In the prior art, behavior recognition means automatically identifying the behavior of a research object from static images or video sequences using key technologies such as moving-object detection, feature extraction and posture analysis.
Medium and large ungulates are an important component of the ecosystem, and identifying their behavior is an important part of animal research, protection and management. Automated imaging systems have made the acquisition of animal images increasingly convenient in the field of animal monitoring, but they also produce a large amount of invalid information. The limited capacity for manual screening severely limits the effective use of these data in animal research, protection and management. Vision-based animal behavior recognition algorithms therefore provide an automated way to obtain useful animal information for behavior recognition studies.
In recent years, research on animal behavior recognition has focused on laboratory animals (e.g., mice, fruit flies) or farmed animals (e.g., pigs, cattle), mainly classifying behaviors from RGB images with machine learning or deep learning algorithms. Norouzzadeh applied 9 different deep neural network models to identify simple behaviors of wild animals in the Snapshot Serengeti (SS) dataset [1]. However, animal behavior is a dynamic process, and the expressive power of a single image is limited. Frank Schinder [2] trained three ResNet variants on 8-frame RGB video sequences to distinguish feeding, moving and gazing behaviors, but with poor recognition accuracy.
In prior-art research on human behavior recognition, recognition has been carried out on the basis of skeleton features, for example:
researchers have proposed the spatio-temporal graph convolutional network ST-GCN, which uses a fixed graph that conforms to the natural connectivity of the skeleton but has no flexibility: the topology is the same for all layers, all channels and all actions.
To better accommodate different action expressions, other researchers have proposed a dynamic graph convolutional neural network that learns the correlation between each joint pair from the contextual features of the joints.
The introduction of attention mechanisms lets model resources focus on the factors that matter most for different actions, and researchers have since used the Transformer self-attention mechanism to reconstruct the spatio-temporal dependencies of the human skeleton. The expression of animal behavior is likewise closely related to skeleton topology. However, the image backgrounds of existing human behavior recognition datasets are relatively simple, usually shot with the camera facing the person, whereas the background of animal monitoring video is very complex: sometimes the animal's coat color is so similar to the environment that the two are hard to distinguish, which can interfere with animal behavior recognition.
The prior art, as disclosed in publication No. CN114596632A, "Method for identifying behaviors of medium and large quadrupeds based on architecture-search graph convolution network" (202210204633.9), includes the following steps. First, based on the extraction of animal skeleton behavior features, the positions of the joint points of the animal's body parts are quickly tracked in the animal video with DeepLabCut to form a spatio-temporal skeleton graph, capturing the spatio-temporal features of different animal behaviors. Then, several spatio-temporal graph convolution operation modules based on animal skeletons are designed to form a graph-based search space, into which residual connections, bottleneck structures and various attention mechanisms are fused. Finally, based on a differentiable architecture-search strategy, the search space is made continuous, so that a low-cost spatio-temporal graph convolution model for identifying the behaviors of medium and large quadrupeds is searched automatically, achieving the goal of distinguishing the daily behaviors of animals. That technology mainly aims to search automatically for the optimal spatio-temporal graph convolution network structure for different animal behavior identification tasks by introducing and designing several effective attention graph-structure modules. In practice, however, one behavior of an animal may take different postures, such as eating while standing or eating while lying, and the same posture may correspond to different behaviors, such as eating while lying or resting while lying, which greatly increases the difficulty of behavior recognition.
To solve these problems, the invention mines the latent associations that may exist between joint points of the animal that are not physically connected, such as the coordinated movement of the limbs in chasing behavior, and dynamically models the latent connection topology of the animal's joints so as to aggregate it with the physical topology, thereby improving the accuracy of animal behavior identification.
Disclosure of Invention
Research shows that the background of animal monitoring video is very complex, and sometimes the animal's coat color is so similar to the environment that the two are hard to distinguish, which interferes with animal behavior identification. Meanwhile, one behavior of an animal can take different postures, such as eating while standing or eating while lying, and the same posture can correspond to different behaviors, such as eating while lying or resting while lying, which greatly increases the difficulty of behavior recognition.
In order to solve these problems and further improve the accuracy of behavior identification, the method dynamically models the latent connection topology of the animal joints, aggregates it with the physical topology to regenerate a new animal skeleton topology, and models the correlation of the skeleton topology between consecutive frames of a video sequence with a multi-scale temporal convolution module.
Based on this idea, the invention provides an animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network. For multi-species, multi-class animal behavior identification data, a spatial feature extraction module captures the latent connection patterns among the joint points of animal behaviors and assigns different connection weights; a temporal feature extraction module captures the change of animal posture over the consecutive frames of a behavior; a 9-layer spatio-temporal graph attention Transformer convolution network model is then constructed to identify the different behaviors of the animal.
The network input of the invention is a spatio-temporal graph sequence based on the animal skeleton; compared with image recognition, only the most essential behavior features of the animal are kept, greatly reducing the interference of redundant information on the model. The channel topology optimization and Spatial-Transformer modules make it easier to capture the similar features of the latent connection topology between joint points of the same animal behavior under different postures, such as the head movement during eating; the multi-scale temporal convolution improves the expressiveness of behavior features over longer time sequences.
The invention discloses a quadruped animal behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network, which specifically comprises the following steps:
1) Collecting a quadruped video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a spatiotemporal skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
the method for constructing the adaptive space-time diagram attention Transformer network comprises the following steps:
designing the space dimension of the spatio-temporal graph convolution network: first, a Channel topology optimization module adaptively learns the skeleton topologies of different animal behaviors, combining the physical topology of the joint connections with the latent topology to generate a Channel-wise skeleton topology; then, a Spatial-Transformer module assigns different connection weights to the Channel-wise skeleton topology, in the following steps:
3.1) Utilizing a Channel topology optimization module to combine the physical topology and the latent topology of the joint connections to generate the Channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.2) Adopting the attention mechanism of a Spatial-Transformer module to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of each joint connection and modeling an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer network performs linear transformation and normalization on the Channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, calculating the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention of that pair of joint points;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points obtained in step 3.2.2) are weighted and summed to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer network performs H different learnable mappings, i.e., steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, thereby mapping the Channel-wise animal skeleton topology into multiple subspaces;
3.2.5) Splicing the multi-head attention outputs, and multiplying the spliced matrix by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
4) Designing the time dimension of the spatio-temporal graph convolutional network: to capture the timing information of long video sequences, a multi-scale temporal convolution module models the inter-frame topological correlation in the time domain, the steps comprising:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are sampled with d-1 frames skipped between taps (d denotes the dilation rate) for the convolution; the convolution kernel size is 5 x 1, and a 1 x 1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Borrowing the core idea of the ResNet model, residual connections (1 x 1 convolutions) are used in the time domain, allowing the original information of a lower layer to be passed directly to subsequent higher layers.
5) Constructing a 9-layer spatio-temporal graph attention Transformer network model from the space dimension and time dimension network modules designed in steps 3) and 4), building a classifier through a fully connected layer and Softmax, and identifying 5 behaviors of the quadruped: chasing, resting, eating, alertness and walking.
The beneficial effects of the invention mainly comprise:
(1) A brand-new skeleton-based multi-species behavior recognition model is provided. The two-dimensional skeleton coordinates are used as the input of the (spatio-temporal graph convolution) network, eliminating the interference of redundant information such as the external environment and greatly reducing the data volume. The model adopts a self-attention Transformer mechanism over the multi-channel spatial topology, and multi-scale convolution over the temporal topology.
(2) The spatio-temporal graph convolutional network model (i.e., the behavior recognition model) is not limited to the topology of the animal's physical skeleton; it dynamically models the latent connection topology of the animal joints, aggregates it with the physical topology, and establishes a corresponding channel topology. This makes it easier to capture the similar features of the latent connection topology between joint points of the same animal behavior in different postures, such as the similar head movement when an animal eats while standing versus while lying, two different postures of the same behavior.
(3) The behavior recognition model adopts multi-scale temporal convolution to model the inter-frame topological correlation of the time domain at different granularities, capturing the change of animal posture over the consecutive frames of a behavior. This helps distinguish the posture-change differences of different behaviors along the temporal dimension, such as the difference in head movement between resting while lying and eating while lying.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram of the adaptive spatio-temporal graph attention Transformer network structure applied to animal behavior recognition according to the present invention;
FIG. 3 is a process of channel level animal potential topology acquisition;
FIG. 4 is a schematic diagram of a Spatial-Transformer's self-attention computation;
FIG. 5 is a schematic diagram of a multi-scale time convolution;
FIG. 6 is a multi-scale convolution acquisition process on a time scale;
FIG. 7 is an exemplary graph of a data set;
FIG. 8 is a graph of loss for different combinations of layers;
fig. 9 is a confusion matrix for a comparison experiment of convolution with different time series dimensions, wherein fig. 9 (a) is a confusion matrix using multi-scale time convolution, and fig. 9 (b) is a confusion matrix using one-dimensional scale time convolution.
Detailed Description
An animal may take different postures for one behavior, such as eating while standing or eating while lying, and the same posture may correspond to different behaviors, which makes the prior art less effective at these complex animal behavior recognition tasks. To solve these problems, the invention provides a quadruped animal behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network: a Channel topology optimization module combines the physical topology of the joint connections with the latent topology to generate a Channel-wise skeleton topology, and a Spatial-Transformer module assigns different connection weights to the Channel-wise skeleton topology; a multi-scale temporal convolution module then models the inter-frame topological correlation, and 5 daily behaviors of the animal are identified: chasing, resting, eating, alertness and walking.
As shown in fig. 1 and fig. 2, the invention designs a space dimension network module and a time dimension network module, based on which a 9-layer spatio-temporal graph attention Transformer network model is constructed and applied to the behavior recognition of medium and large ungulates; the method specifically comprises the following steps:
1) Collecting a quadruped video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a space-time skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
2.1 For an animal video sequence, tracking the positions of body joint points of the quadruped to construct an animal skeleton diagram;
2.1.1 Manually marking the joint positions of a few frames of images in a video sequence through a graphical user interface;
2.1.2 Using animal images as input, training and iterating through a deep neural network, generating a group of confidence maps describing the position of each joint point in the input animal images as output, and predicting the joint positions of all frames; the output is subjected to manual fine tuning, and the deep neural network is trained again;
2.1.3 Applying the deep neural network trained in the previous step to a new animal video sequence, and rapidly predicting joint position coordinates and confidence values of the animals in the video;
2.2 A probability threshold is defined, and only when the confidence values of the predicted positions of all the joint points of the animal in one frame image are larger than or equal to the threshold, the frame image is used for a subsequent animal behavior recognition algorithm.
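The confidence filtering of step 2.2) can be sketched as follows. This is a minimal NumPy illustration, not code from the patent; the threshold value 0.6 and the array shapes are hypothetical, since the patent leaves the exact probability threshold unspecified.

```python
import numpy as np

def filter_frames(confidences, threshold=0.6):
    """Keep only frames whose joint-point confidences all reach the threshold.

    confidences: (T, N) array -- T frames, N joint points, values in [0, 1].
    Returns the indices of frames usable for behavior recognition.
    """
    confidences = np.asarray(confidences)
    keep = np.all(confidences >= threshold, axis=1)  # a frame passes only if every joint does
    return np.flatnonzero(keep)

# Example: 4 frames, 3 joints; frames 1 and 3 each have one low-confidence joint.
conf = np.array([[0.9, 0.8, 0.7],
                 [0.9, 0.3, 0.8],
                 [0.7, 0.7, 0.7],
                 [0.5, 0.9, 0.9]])
print(filter_frames(conf, 0.6))  # frames 0 and 2 survive
```

A stricter threshold discards more frames, trading data volume for pose reliability.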
The method for constructing the adaptive space-time diagram attention Transformer network comprises the following steps:
3) Designing the space dimension of the spatio-temporal graph convolution network: as shown in fig. 2, a Channel topology optimization module first adaptively learns the skeleton topologies of different animal behaviors, combining the physical topology of the joint connections with the latent topology to generate a Channel-wise skeleton topology; a Spatial-Transformer module then assigns different connection weights to the Channel-wise skeleton topology;
3.1) Utilizing a Channel topology optimization module to combine the physical topology and the latent topology of the joint connections to generate a Channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.1.1) Converting the animal skeleton feature X obtained in step 2) into a higher-order feature X' through a linear transformation, as in formula (1), and using it as the input of the whole model:
X' = W X (1)
where X ∈ R^{N×C×T}, N is the number of animal skeleton joint points, C is the number of input channels, W is a weight matrix, and T is the length of the video frame sequence used for animal behavior identification;
3.1.2) Using the bone connections of the two-dimensional animal skeleton obtained in step 2) (i.e., the bones connecting pairs of joint points of the animal body), a corresponding adjacency matrix A ∈ R^{N×N} is generated for each animal skeleton; the adjacency matrix A expresses the graph by recording the adjacency relations between the animal skeleton nodes. The adjacency matrix A is applied as a Shared topology to all channels of the channel topology optimization network;
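The construction of the Shared adjacency matrix in step 3.1.2) can be illustrated as follows. This is a hedged NumPy sketch: the 5-joint skeleton fragment and its bone list are invented for the example and are not the patent's joint scheme.

```python
import numpy as np

def skeleton_adjacency(bones, N):
    """Build the shared adjacency matrix A from a list of bones
    (pairs of connected joint indices). The graph is undirected,
    so both A[i, j] and A[j, i] are set."""
    A = np.zeros((N, N))
    for i, j in bones:
        A[i, j] = A[j, i] = 1.0
    return A

# Hypothetical 5-joint fragment: head-neck-spine, with two legs off the spine.
bones = [(0, 1), (1, 2), (2, 3), (2, 4)]
A = skeleton_adjacency(bones, N=5)
print(A[0, 1], A[1, 0], A[0, 3])  # 1.0 1.0 0.0
```

The same matrix A is then shared across all channels before the channel-specific refinement of step 3.1.3).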
3.1.3) Modeling the latent correlation between the joint points of the Shared topology on different channels with a correlation modeling function, and obtaining the Channel-specific topology through a linear transformation, the steps comprising:
First, the latent correlation between the joint points of the Shared topology on different channels is modeled by the correlation function of formula (2):
q_ij = σ(ψ(x_i) − φ(x_j)) (2)
where x_i, x_j ∈ X are the features of the node pair (v_i, v_j) of the animal skeleton physical topology obtained in step 2) (i.e., joint points of the animal body); ψ and φ are linear transformations that reduce the feature dimension before correlation modeling to lower the computational complexity; and σ(·) is a neural network activation function;
Then, the latent joint correlation features obtained above are linearly transformed by formula (3) to lift the channel dimension and obtain the Channel-specific skeleton topology Q:
Q = ξ(q) (3)
where ξ is a linear transformation and each vector q_ij of the Channel-specific skeleton topology Q reflects the Channel-specific topological relation of the node pair (v_i, v_j).
3.1.4) As shown in fig. 3, aggregating the Shared topology feature A and the Channel-specific topology feature Q to generate the Channel-wise skeleton topology feature R using formula (4):
R = A + α·Q (4)
where α is a trainable parameter that adjusts the strength of the Channel-specific topology Q relative to the Shared topology A; through formula (4), the Shared topology feature A is added to each channel of α·Q.
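Formula (4) is a simple broadcast sum over channels. The following NumPy sketch illustrates it with toy shapes; α is shown as a plain scalar rather than a trained parameter, and the 3-joint, 2-channel sizes are invented for the example.

```python
import numpy as np

def channelwise_topology(A, Q, alpha):
    """Aggregate the shared physical topology A with the channel-specific
    topology Q, per formula (4): R = A + alpha * Q.

    A: (N, N) adjacency matrix shared by all channels.
    Q: (C, N, N) channel-specific topology (one refinement per channel).
    alpha: scalar controlling the refinement strength.
    Returns R with shape (C, N, N): one skeleton topology per channel.
    """
    return A[None, :, :] + alpha * Q  # broadcast A onto every channel

# Toy skeleton with 3 joints and 2 channels.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Q = np.ones((2, 3, 3)) * 0.5
R = channelwise_topology(A, Q, alpha=0.1)
print(R.shape)     # (2, 3, 3)
print(R[0, 0, 1])  # 1.0 + 0.1 * 0.5 = 1.05
```

Each channel thus receives its own refined copy of the skeleton topology while still anchored to the physical adjacency A.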
3.2) Using the attention mechanism of the Spatial-Transformer module to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of each joint connection and modeling an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer network performs linear transformation and normalization on the Channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, calculating the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention of that pair of joint points;
As shown in FIG. 4, given the animal skeleton topology of the t-th frame of the animal video behavior sequence, let the query vector of the i-th query joint point be q_i^t, and let the key and value vectors of the other joint points of that frame's skeleton topology be k_j^t and v_j^t. The scaled dot product of q_i^t and k_j^t gives the correlation-strength weight u_ij^t = q_i^t (k_j^t)^T / √d_k of the pair, taken as the attention of that pair of joint points; the weights of the i-th query joint point are then normalized and used in a weighted sum over the value vectors to obtain its new attention z_i^t for the t-th frame, as in formula (5):
z_i^t = Σ_j softmax(u_ij^t) · v_j^t (5)
where T denotes the transpose and d_k is the dimension of the key vector k. Repeating this over the video frame sequence yields an attention feature vector for every frame;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points obtained in step 3.2.2) are weighted and summed to obtain the new attention z_i of the i-th joint point of the skeleton topology;
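The per-joint computation of steps 3.2.2)-3.2.3) amounts to scaled dot-product attention. A minimal NumPy sketch follows; the vectors are random toys and the dimensions (5 joints, d_k = 4) are illustrative only, not the patent's configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(q, K, V):
    """Scaled dot-product attention for one query joint, per formula (5).

    q: (d_k,) query vector of the i-th joint point.
    K: (N, d_k) key vectors of the joint points in the frame.
    V: (N, d_v) value vectors.
    Returns z_i, the new attention feature of the i-th joint.
    """
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)  # dot products q·k_j, scaled by sqrt(d_k)
    weights = softmax(scores)      # normalized connection weights
    return weights @ V             # weighted sum over the value vectors

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
z = joint_attention(q, K, V)
print(z.shape)  # (4,)
```

Because the weights form a convex combination, z always lies within the range spanned by the value vectors.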
3.2.4) The multi-head attention of the Spatial-Transformer network performs H different learnable mappings, i.e., steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, thereby mapping the Channel-wise animal skeleton topology into multiple subspaces;
3.2.5) Splicing the multi-head attention outputs, and multiplying the spliced matrix by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads; the calculation is shown in formula (6):
MultiHead(Q, K, V) = Concat(head_1, …, head_H) W_O (6)
where Q, K, V are the query, key and value vector matrices respectively, head_h is the output of the h-th attention head, and W_O is a weight matrix.
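Formula (6) can be illustrated end to end as follows. This is a generic multi-head self-attention sketch in NumPy with invented toy dimensions and random weights, not the patent's trained Spatial-Transformer network.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, H):
    """Multi-head self-attention over the N joint points, per formula (6):
    H heads are computed independently, spliced (concatenated), and
    projected by the output weight matrix Wo.

    X:  (N, d) joint features; Wq/Wk/Wv: (H, d, d_h) per-head projections;
    Wo: (H*d_h, d) output weight matrix.
    """
    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N, N) attention map
        heads.append(A @ V)                          # (N, d_h) head output
    return np.concatenate(heads, axis=-1) @ Wo       # splice heads, project

rng = np.random.default_rng(1)
N, d, d_h, H = 6, 8, 4, 2
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((H, d, d_h)) for _ in range(3))
Wo = rng.standard_normal((H * d_h, d))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, H)
print(out.shape)  # (6, 8)
```

Each head attends to the joints in a different learned subspace; the final projection mixes the heads back into one feature per joint.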
4) Designing the time dimension of the spatio-temporal graph convolution network: to capture the timing information of long video sequences, a multi-scale temporal convolution module models the inter-frame topological correlation in the time domain, as shown in figs. 5 and 6, the steps comprising:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are sampled at intervals of d−1 frames for the convolution, where d denotes the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Borrowing the core idea of the ResNet model, residual connections (1×1 convolutions) are used in the time domain, which allows the original information of a lower layer to be passed directly to subsequent higher layers.
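The effect of the dilation rate d in steps 4.1)-4.2) can be sketched with a one-dimensional dilated convolution over a toy per-frame scalar feature (the patented module operates on full skeleton feature maps; this sketch only shows how d changes the temporal window): a kernel of size k with dilation d spans (k−1)·d+1 frames.

```python
def dilated_conv1d(seq, kernel, d):
    """1-D dilated convolution: kernel taps are spaced d frames apart ('valid' padding)."""
    k = len(kernel)
    span = (k - 1) * d + 1                     # temporal receptive field
    return [sum(kernel[j] * seq[t + j * d] for j in range(k))
            for t in range(len(seq) - span + 1)]

seq = [float(t) for t in range(12)]            # toy per-frame feature
identity = [0.0, 1.0, 0.0]                     # passes the centre tap through
out_d1 = dilated_conv1d(seq, identity, d=1)    # kernel covers 3 consecutive frames
out_d2 = dilated_conv1d(seq, identity, d=2)    # same kernel now covers a 5-frame window
print(len(out_d1), len(out_d2))  # 10 8
```

Keeping the kernel small while increasing d enlarges the temporal receptive field without adding parameters, which is what lets the module aggregate information at several time scales.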
5) Constructing a 9-layer space-time graph convolution network model from the space dimension and time dimension network modules designed in steps 3) and 4), constructing a classifier through a fully connected layer and Softmax, and recognizing the 5 behaviors of chasing, resting, eating, alerting and walking of the quadruped; the network structure is as follows:
5.1) The first 4 layers of the space-time graph convolutional network model are the channel topology optimization convolutional networks of the space dimension network module in step 3.1), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each channel topology optimization layer;
5.2) The last 5 layers of the spatio-temporal graph convolutional network model are the Spatial-Transformer networks of the space dimension network module in step 3.2), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each Spatial-Transformer layer;
5.3) A classifier is constructed through the fully connected layer FCN and Softmax to recognize the 5 behaviors of chasing, resting, eating, alerting and walking of medium and large quadrupeds.
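The layer plan of steps 5.1)-5.3) can be summarised as a simple configuration list (labels are illustrative, not API names):

```python
# Each layer applies its spatial module plus MS-TCN in the time dimension.
layers = (["CTG + MS-TCN"] * 4      # first 4 layers: channel topology optimization
          + ["TFR + MS-TCN"] * 5)   # last 5 layers: Spatial-Transformer
classifier = ["FCN", "Softmax"]     # classification head
behaviors = ["chasing", "resting", "eating", "alerting", "walking"]

print(len(layers), len(behaviors))  # 9 5
```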
Examples
1. Data set
In this example, a data set of wild terrestrial mammal skeletons was constructed. FIG. 7 shows example images of the five behaviors: chasing, eating, resting, walking and watching (alerting). Dozens of wild animal species are covered, including tigers, lions, leopards, polar bears, black bears, antelopes, alpacas, horses, etc.
To collect the data, clips of the five behaviors were edited from a large number of animal videos. The details of the data set are shown in Table 1; it consists of 67606 images from 2058 video sequences.
TABLE 1
As can be seen from Table 1, for each class of samples, 60% were randomly selected as the training set and 40% as the validation set. Comparison models were trained on the training set and their accuracy was measured on the validation set.
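The per-class 60/40 split described above can be sketched as follows (hypothetical toy labels; the real data set has 2058 sequences over the 5 behavior classes):

```python
import random

def stratified_split(samples, train_frac=0.6, seed=0):
    """Randomly split each class into train/validation with the same ratio."""
    rng = random.Random(seed)
    train, val = [], []
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(len(xs) * train_frac))
        train += [(x, y) for x in xs[:cut]]
        val += [(x, y) for x in xs[cut:]]
    return train, val

# toy stand-in: 10 sequences per behavior class
data = [(f"seq{i}", c) for c in ["chase", "rest", "eat", "alert", "walk"]
        for i in range(10)]
train, val = stratified_split(data)
print(len(train), len(val))  # 30 20
```

Splitting within each class (rather than over the pooled data) keeps the behavior distribution of the validation set the same as that of the training set.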
2. Comparative experiment
In this section, ablation experiments verify the contribution of each key part of the model to accurate recognition of animal behavior. This example then compares the proposed model (Channel-Transformer) with other state-of-the-art GCN-based methods.
2.1 network framework design comparison
The Channel-Transformer consists of 9 layers of spatio-temporal GCN. To determine the numbers of Channel topology optimization (CTG) layers and Spatial-Transformer (TFR) layers in the Channel-Transformer network structure, this experiment set different layer counts for the CTG and TFR modules and compared the different layer combinations. To ensure fair comparison, the multi-scale temporal convolution module (TCN) was uniformly adopted in the time dimension. As shown in Table 2, the combined design of the two different sub-modules, CTG and TFR, achieves higher classification accuracy than simple stacking of identical convolution modules, and the 4-layer CTG + 5-layer TFR structure performs best; the Channel-Transformer model of the invention therefore adopts a 9-layer (4-layer CTG + 5-layer TFR) structure. The loss curves of the combinations are shown in FIG. 8, from which it can also be seen that the 4-layer CTG + 5-layer TFR structure converges fastest and best.
TABLE 2
2.2 effectiveness of the fusion of CTG and TFR
In this experiment, the Spatial-Transformer sub-module was introduced into two conventional graph convolution structures (fixed-GCN and adaptive-GCN). For a fair comparison, the three methods use the same layer split: the first four layers are graph convolution structures with different node connection modes (fixed-GCN, adaptive-GCN and CTG), and the last five layers use the Spatial-Transformer sub-module to assign weights to different nodes. The animal behavior classification accuracies are shown in Table 3. The recognition performance of the proposed CTG + TFR fusion structure clearly surpasses the combinations of the other two GCN types with TFR, because the proposed Channel-Transformer fuses the adjacency matrix with channel-level topology feature maps, helping to exploit the potential correlations between animal joints under different channels. Introducing the TFR sub-module improves the performance of every network, as shown in rows 4 to 6 of Table 3; this further verifies that the TFR sub-module can exploit the self-attention mechanism to explore the importance of different joint connections.
TABLE 3
2.3 validity of MS-TCN sub-Module
To evaluate the multi-scale temporal convolution module, the experiment also replaced it with a one-dimensional temporal convolution module; as shown in FIG. 9, comparing the confusion matrices of the two schemes shows that the multi-scale temporal convolution module outperforms the one-dimensional one.
2.4 comparison with the prior art
Since state-of-the-art skeleton-based animal behavior recognition methods are lacking, the proposed Channel-Transformer model is compared with existing skeleton-based human action models. Table 4 shows that, under the same evaluation settings, the Channel-Transformer model of the invention outperforms the state-of-the-art human action recognition methods.
TABLE 4
From Table 4 it can be seen that although the accuracy of MS-G3D, currently among the best-performing methods, reaches 91.11%, its parameter count is as high as 3.01M. The accuracy of the proposed method reaches 93.21% with only 1.33M model parameters, demonstrating the superiority of the method.
2.5 The above experimental results show the effectiveness of the network structure and the technical contributions of the model components, and prove that the Channel-Transformer model designed for animal behaviors outperforms the state-of-the-art methods.
3. Conclusion
The invention discloses a novel graph convolution network (Channel-Transformer) based on a channel-level topology and a Transformer attention mechanism for skeleton-based animal behavior recognition, which forms an adaptively learned topology through channel correlation modeling. To further focus on important joint connections, the invention employs a Spatial-Transformer to assign different weights to the topological connections of different behaviors, which helps allocate resources efficiently. To improve the expression of inter-frame topological dependencies, multi-scale temporal convolution is applied to improve the performance of the model. The experimental results show that the Channel-Transformer has stronger representation capability than other graph convolution models.
Claims (6)
1. A quadruped animal behavior identification method based on an adaptive space-time graph attention Transformer network, comprising the following steps:
1) Collecting a quadruped animal video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a space-time skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
the method is characterized in that an adaptive space-time graph attention Transformer network is constructed as the animal behavior recognition model to recognize different behaviors of quadruped animals; the method for constructing the adaptive space-time graph attention Transformer network comprises the following steps:
3) Designing a space dimension module of the space-time graph convolutional network: firstly, a Channel topology optimization module is utilized to adaptively learn the skeleton topologies of different behaviors of animals, the physical topology of joint connection is combined with the potential topology to generate Channel-wise skeleton topology, and then different connection weights are distributed to the Channel-wise skeleton topology through a Spatial-Transformer module;
4) Designing a time dimension module of the space-time graph convolutional network: in order to obtain the time sequence information of the long video sequence, a multi-scale time convolution module is used for modeling the interframe topological correlation on a time domain;
5) Constructing a 9-layer spatiotemporal graph attention Transformer convolution network model from the space dimension and time dimension network modules designed in steps 3) and 4), constructing a classifier through a fully connected layer and Softmax, and recognizing the 5 behaviors of chasing, resting, eating, alerting and walking of the quadruped;
in step 3):
3.1 Utilizing a Channel topology optimization module to combine the physical topology and the potential topology of the joint connection to generate a Channel-wise framework topology, and capturing all possible connection modes between joint points of animal behaviors;
3.2) An attention mechanism based on the Spatial-Transformer module is used to calculate a connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of the topology's joint connections and simulating an adaptive topology for each animal behavior sample; the steps comprise:
3.2.1) The Spatial-Transformer module performs linear transformation and normalization on the Channel-wise skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention; the generated vectors comprise a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point is calculated as the attention of the pair of joint points;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points of the skeleton topology obtained in step 3.2.2) are used in a weighted sum to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer module performs H different learnable mappings, i.e., steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention head_i of the i-th joint point, thereby mapping the Channel-wise skeleton topology into multiple subspaces;
3.2.5) The multi-head attentions head_1, …, head_H are concatenated, and the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
in the step 4):
the method is characterized in that a multi-scale time convolution MS-TCN is used for modeling interframe framework topological correlation on a time domain and extracting animal time dimension behavior characteristics, and the method comprises the following steps:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are sampled at intervals of d−1 frames for the convolution, where d denotes the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Borrowing the core idea of the ResNet model, residual connections (1×1 convolutions) are used in the time domain, which allows the original information of a lower layer to be passed directly to subsequent higher layers.
2. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 2), the extraction of quadruped skeleton data is realized by the DeepLabCut algorithm based on pose estimation, and the steps comprise:
2.1 For an animal video sequence, tracking the positions of body joint points of the quadruped to construct an animal skeleton diagram;
2.1.1 Manually marking the joint positions of a few frame images in a video sequence through a graphical user interface;
2.1.2 Using animal images as input, training and iterating through a deep neural network, generating a group of confidence maps describing the position of each joint point in the input animal images as output, and predicting the joint positions of all frames; the output is subjected to manual fine tuning, and the deep neural network is trained again;
2.1.3 Applying the deep neural network trained in the previous step to a new animal video sequence, and rapidly predicting joint position coordinates and confidence values of the animals in the video;
2.2 A probability threshold is defined, and only when the confidence values of the predicted positions of all the joint points of the animal in one frame image are larger than or equal to the threshold, the frame image is used for a subsequent animal behavior recognition algorithm.
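The frame-filtering rule of step 2.2) keeps a frame only when every predicted joint confidence reaches the threshold. A minimal sketch, assuming a hypothetical data layout of one list of per-joint confidences per frame (not DeepLabCut's actual output format):

```python
def keep_frames(frames, threshold=0.6):
    """Keep a frame only if all of its joint confidences are >= threshold."""
    return [f for f in frames if all(c >= threshold for c in f["conf"])]

frames = [
    {"id": 0, "conf": [0.9, 0.8, 0.95]},   # all joints confident -> kept
    {"id": 1, "conf": [0.9, 0.4, 0.95]},   # one weak joint -> dropped
    {"id": 2, "conf": [0.7, 0.6, 0.85]},   # meets the threshold exactly -> kept
]
kept = keep_frames(frames)
print([f["id"] for f in kept])  # [0, 2]
```

Requiring all joints (rather than a majority) to pass the threshold trades some data volume for clean skeletons, which matters because a single mislocated joint distorts the whole topology fed to the recognition network.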
3. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 3.1), the Channel topology optimization module is used to combine the physical topology of the joint connections with the potential topology to generate the Channel-wise skeleton topology, capturing all possible connection modes between the joint points of animal behaviors; the steps comprise:
3.1.1) The animal skeleton feature X obtained in step 2) is converted into a high-order feature X′ through the linear transformation of formula (1) and used as the input of the entire model:
X′ = X · W (1)
where X ∈ R^{N×C×T}, N is the number of animal skeleton joint points, C is the number of input channels, W is a weight matrix, and T is the length of the video frame sequence used for animal behavior identification;
3.1.2) Using the joint connections of the animals' two-dimensional skeletons obtained in step 2), a corresponding adjacency matrix A ∈ R^{N×N} is generated for each animal skeleton; the adjacency matrix A expresses the adjacency relations between animal skeleton nodes, realizing the representation of the graph. The adjacency matrix A is applied as a Shared topology to all channels of the channel topology optimization network. A joint connection of an animal's two-dimensional skeleton is a bone, which connects two joint points of the animal body;
3.1.3 Modeling potential correlation between joint points of Shared topologies of different channels by using a modeling function, and obtaining Channel-specific topologies through linear transformation;
3.1.4 Aggregating the Shared topology characteristic A and the Channel-specific topology characteristic Q, and generating a Channel-wise framework topological characteristic by using a formula (2);
R=A+α·Q (2)
where α is a trainable parameter used to adjust the strength between the Shared topology A and the Channel-specific topology Q; through formula (2), the Shared topology A is added to each channel of α·Q.
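The channel-wise refinement of formula (2) — broadcasting the shared adjacency A onto every channel of the scaled channel-specific topology α·Q — can be sketched with numpy (toy sizes; α is a trainable parameter in the patent, fixed to a constant here):

```python
import numpy as np

N, C = 5, 3                          # 5 joints, 3 channels (toy sizes)
A = np.eye(N)                        # shared physical topology (placeholder adjacency)
Q = np.full((C, N, N), 0.1)          # channel-specific topology from correlation modeling
alpha = 0.5                          # trainable strength parameter (fixed for the sketch)

R = A[None, :, :] + alpha * Q        # formula (2): A is broadcast to each channel
print(R.shape)  # (3, 5, 5)
```

Every channel thus shares the physical skeleton connectivity A while keeping its own learned correction α·Q[c], which is what makes the resulting topology "Channel-wise".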
4. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 3.2), an attention mechanism based on the Spatial-Transformer module is adopted to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology; the steps comprise:
in step 3.2.2), given the animal skeleton topology of the t-th frame of the animal video behavior sequence, let the query vector of the i-th query joint point be q_i^t, and let the key and value vectors of the remaining queried joint points of that frame's skeleton topology be k_j^t and v_j^t; the dot product of the query vector q_i^t and the key vector k_j^t yields the correlation-strength weight between the pair of joint points, which serves as the attention of that pair; the weights obtained for the i-th query joint point are then normalized and used in a weighted sum over the value vectors to obtain the new attention z_i^t of the i-th query joint point of the t-th frame, computed as formula (3):
z_i^t = Σ_j softmax( q_i^t · (k_j^t)^T / √d_k ) · v_j^t (3)
where T denotes the transpose and d_k is the dimension of the key vector k; for a sequence of video frames, an attention feature vector is thus obtained for each frame;
in step 3.2.5), the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer module, calculated as formula (4):
MultiHead(Q, K, V) = Concat(head_1, …, head_H) · W_O (4)
where Q, K, V are the query, key and value vector matrices respectively, and W_O is a weight matrix.
5. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 5), the 9-layer space-time graph convolution network model is constructed by the following steps:
applying the space dimension network module and the time dimension network module designed in the steps 3) and 4) to construct a 9-layer space-time graph convolution network model, constructing a classifier through a full connection layer and Softmax, and identifying the behavior of the quadruped animal, wherein the network structure is as follows:
5.1) The first 4 layers of the space-time graph convolutional network model are the channel topology optimization convolutional networks of the space dimension network module in step 3.1), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each channel topology optimization layer;
5.2) The last 5 layers of the spatio-temporal graph convolutional network model are the Spatial-Transformer modules of the space dimension network module in step 3.2), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each Spatial-Transformer layer;
5.3) A classifier is constructed through the fully connected layer FCN and Softmax to recognize the 5 behaviors of chasing, resting, eating, alerting and walking of medium and large quadrupeds.
6. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1 or 3, wherein in step 3.1.3), a modeling function is used to model the potential correlations between the joint points of the Shared topology for different channels, and the Channel-specific skeleton topology is obtained through linear transformation; the steps comprise:
3.1.3.1) The potential correlations between the joint points of the Shared topology are modeled for different channels using a modeling function, given by formula (5):
u_ij = σ( φ(x_i) − ψ(x_j) ) (5)
where x_i, x_j ∈ X and (v_i, v_j) are node pairs of the physical topology of the animal skeleton obtained in step 2) (i.e., joint points of the animal body); φ and ψ are linear transformations used to reduce the feature dimension before the correlation modeling function and lower the computational complexity; and σ(·) is a neural-network activation function;
3.1.3.2) The joint-point potential correlation features obtained in step 3.1.3.1) are linearly transformed by formula (6) to raise the channel dimension and obtain the Channel-specific skeleton topology Q:
Q = ξ(u) (6)
where ξ denotes the linear transformation that raises the channel dimension.
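The correlation-modeling steps 3.1.3.1)-3.1.3.2) can be sketched as follows. This is a loose stand-in with toy dimensions: φ, ψ and the channel-raising transformation are reduced to plain matrix multiplications, and σ is taken as tanh; the patent's learned transformations may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, r, Cp = 5, 8, 2, 3            # joints, in-channels, reduced dim, out-channels
X = rng.standard_normal((N, C))      # per-joint features
W_phi = rng.standard_normal((C, r))  # phi: dimension-reducing linear map
W_psi = rng.standard_normal((C, r))  # psi: dimension-reducing linear map
W_out = rng.standard_normal((r, Cp)) # raises the channel dimension (formula (6))

# formula (5): u_ij = sigma(phi(x_i) - psi(x_j)), with sigma = tanh here
u = np.tanh((X @ W_phi)[:, None, :] - (X @ W_psi)[None, :, :])  # (N, N, r)
Q = u @ W_out                        # (N, N, Cp): channel-specific topology
print(Q.shape)  # (5, 5, 3)
```

Reducing the dimension before the pairwise difference keeps the N×N correlation computation cheap, and the final projection gives one learned topology per output channel.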
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211588021.0A CN115797841B (en) | 2022-12-12 | 2022-12-12 | Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115797841A true CN115797841A (en) | 2023-03-14 |
CN115797841B CN115797841B (en) | 2023-08-18 |
Family
ID=85418596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211588021.0A Active CN115797841B (en) | 2022-12-12 | 2022-12-12 | Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115797841B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304795A (en) * | 2018-01-29 | 2018-07-20 | 清华大学 | Human skeleton Activity recognition method and device based on deeply study |
CN110796110A (en) * | 2019-11-05 | 2020-02-14 | 西安电子科技大学 | Human behavior identification method and system based on graph convolution network |
WO2022000420A1 (en) * | 2020-07-02 | 2022-01-06 | 浙江大学 | Human body action recognition method, human body action recognition system, and device |
CN114596632A (en) * | 2022-03-02 | 2022-06-07 | 南京林业大学 | Medium-large quadruped animal behavior identification method based on architecture search graph convolution network |
CN114821799A (en) * | 2022-05-10 | 2022-07-29 | 清华大学 | Motion recognition method, device and equipment based on space-time graph convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||