CN115797841B - Quadruped behavior recognition method based on adaptive spatio-temporal graph attention Transformer network


Info

Publication number: CN115797841B (granted); application publication CN115797841A
Application number: CN202211588021.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 赵亚琴, 冯丽琦
Assignee (original and current): Nanjing Forestry University
Legal status: Active (granted)

Classifications

    • Y02A40/70 — Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry

Landscapes

  • Image Analysis (AREA)
Abstract

A quadruped behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network comprises the following steps: 1) collecting quadruped video images and labeling joint points with DeepLabCut to generate a skeleton topology graph; 2) first combining the physical topology of joint connections with the latent topology via a channel topology optimization module to generate a channel-wise skeleton topology, then assigning different connection weights to the channel-wise skeleton topology via a Spatial-Transformer module; 3) modeling inter-frame topological correlation with a multi-scale temporal convolution module; 4) identifying, through a fully connected layer and a Softmax classification layer, 5 daily behaviors of the animal: chasing, resting, feeding, alertness, and walking. In the invention, the spatial-dimension feature extraction module can capture all possible connection patterns between the joint points of animal behaviors and assign them different connection weights; the temporal-dimension feature extraction module captures the change in animal posture across the consecutive frames of a behavior, thereby improving the accuracy of animal behavior recognition; the method therefore has good application prospects.

Description

Quadruped behavior recognition method based on adaptive spatio-temporal graph attention Transformer network
Technical Field
The invention relates to a behavior recognition method, in particular to a spatio-temporal graph convolutional network model built on an attention Transformer and an adaptive skeleton graph topology, and to a method of applying this model to skeleton-based behavior recognition of quadrupeds.
Background
In the prior art, behavior recognition refers to automatically identifying the behavior of a subject from static images or video sequences using key technologies such as moving-object detection, feature extraction, and pose analysis.
Medium and large ungulates are an important component of the ecosystem, and identifying their behavior is an important part of animal research, protection, and management. Automated imaging systems have made the acquisition of animal images in the field increasingly convenient, but the footage also contains a large amount of useless information, and the limited capacity of manual screening severely restricts the effective use of these data in animal research, protection, and management. Vision-based animal behavior recognition algorithms therefore provide an automated and more convenient way to obtain useful information for animal behavior studies.
In recent years, research on animal behavior recognition has focused on laboratory animals (e.g., mice, Drosophila) or livestock (e.g., pigs, cattle). Such work mainly classifies animal behaviors from RGB images using machine learning or deep learning algorithms. For example, Norouzzadeh applied 9 different deep neural network models to identify simple behaviors of wild animals in the Snapshot Serengeti (SS) dataset [1]. However, animal behavior is a dynamic process, and the expressive capacity of a single image is limited. Frank Schindler [2] trained three different ResNet variants on 8-frame RGB video sequences to distinguish three behaviors of animals (feeding, moving, and gazing), but with poor recognition accuracy.
In the prior art, human behavior recognition research based on skeleton features includes, for example:
Researchers have proposed the spatio-temporal graph convolutional network ST-GCN, which uses a fixed graph that conforms to the natural connectivity of bones but lacks flexibility: the topology is the same for all layers, all channels, and all actions.
To better accommodate different actions, researchers have also proposed dynamic graph convolutional neural networks that learn the correlation between each joint pair from the contextual features of the joints.
The introduction of attention mechanisms allows model capacity to focus on the factors most important to each action, and some researchers have used the Transformer self-attention mechanism to reconstruct the spatio-temporal dependencies of the human skeleton. The expression of animal behavior is likewise closely related to skeleton topology. However, the image backgrounds in existing human behavior recognition datasets are relatively simple, usually with the camera facing the human body, whereas the backgrounds of animal monitoring videos are very complex, and the animals' coat colors are sometimes very similar to the environment, making them hard to distinguish and interfering with animal behavior recognition.
The prior-art "Architecture-search graph convolutional network-based medium and large quadruped behavior recognition method" (application 202210204633.9, publication CN114596632A) comprises the following steps: first, based on behavior feature extraction from the animal skeleton, the joint positions of the animal's body parts are rapidly tracked in video with DeepLabCut to form a spatio-temporal skeleton graph capturing the spatio-temporal features of different animal behaviors. Several animal-skeleton-based spatio-temporal graph convolution operation modules are then designed to form a graph-based search space fusing residual connections, bottleneck structures, and various attention mechanisms. A differentiable architecture-search strategy then makes the search space continuous, so that a low-cost spatio-temporal graph convolution model for recognizing medium and large quadruped behaviors can be searched automatically, finally distinguishing the animals' daily behaviors. The main aim of that technology is to automatically search for the optimal spatio-temporal graph convolutional network structure for different animal behavior recognition tasks by introducing and designing several effective attention graph structure modules. In practice, however, one behavior may be performed in different postures (e.g., standing to eat or lying to eat), and the same posture may correspond to different behaviors (e.g., lying to eat or lying to rest), which greatly increases the difficulty of behavior recognition.
To solve these problems, the invention mines the latent associations that may exist between physically unconnected joint points of an animal (such as the coordinated movement of the limbs during chasing), dynamically models the latent connection topology of the animal's joints, aggregates it with the physical topology, and thereby improves the accuracy of animal behavior recognition.
Disclosure of Invention
Research shows that animal monitoring videos have very complex backgrounds, and the animals' coat colors are sometimes very similar to the environment and hard to distinguish, which interferes with behavior recognition. Moreover, one behavior may be performed in different postures (e.g., standing to eat or lying to eat), and the same posture may correspond to different behaviors (e.g., lying to eat or lying to rest), which greatly increases the difficulty of behavior recognition.
To solve these problems and further improve recognition accuracy, the latent connection topology of the animal's joints is dynamically modeled and aggregated with the physical topology to regenerate a new animal skeleton topology, and a multi-scale temporal convolution module is used to model the correlation of skeleton topologies between consecutive frames of a video sequence.
Based on this idea, the invention provides an animal behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network. For multi-species, multi-category animal behavior recognition data, a spatial-dimension feature extraction module captures the latent connection patterns between the joint points of animal behaviors and assigns them different connection weights; a temporal-dimension feature extraction module captures the change in animal posture across the consecutive frames of a behavior. A 9-layer spatio-temporal graph attention Transformer convolutional network model is then constructed to recognize the animal's different behaviors.
Compared with image-based recognition, the network input is a spatio-temporal graph sequence of the animal skeleton, which retains only the most essential behavioral features of the animal and greatly reduces the interference of redundant information; channel topology optimization and the Spatial-Transformer module make it easier to capture the similar features of the latent connection topologies between joint points of the same behavior under different postures, such as head movement while eating; and multi-scale temporal convolution improves the expressiveness of behavioral features over longer time sequences.
The quadruped behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network disclosed by the invention comprises the following specific steps:
1) Collecting quadruped videos in a field environment;
2) Extract the animal joint-point information from the animal video sequence to construct a spatio-temporal skeleton graph, and on this basis build an animal behavior recognition model to recognize the different behaviors of the quadruped;
The steps of constructing the adaptive spatio-temporal graph attention Transformer network comprise:
3) Design of the spatial dimension of the spatio-temporal graph convolutional network: first, a channel topology optimization module adaptively learns the skeleton topologies of different animal behaviors, combining the physical topology of joint connections with the latent topology to generate a channel-wise skeleton topology; then a Spatial-Transformer module assigns different connection weights to the channel-wise skeleton topology, as follows:
3.1) The channel topology optimization module combines the physical topology of joint connections with the latent topology to generate a channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.2) An attention mechanism based on the Spatial-Transformer module computes the connection weight between each pair of joint points of the channel-wise skeleton topology, thereby evaluating the importance of each joint connection of the topology and modeling an adaptive topology for each animal behavior sample, as follows:
3.2.1) The Spatial-Transformer network applies linear transformation and normalization to the channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k, and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point is computed as the attention between that pair of joint points;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point (obtained in step 3.2.2)) and the remaining joint points of the skeleton topology are used in a weighted sum, yielding the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer network performs the mapping H times, i.e., steps 3.2.1) to 3.2.3) are repeated H times and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, mapping the channel-wise animal skeleton topology into multiple subspaces;
3.2.5) The multi-head attentions are concatenated, and the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
4) Design of the time dimension of the spatio-temporal graph convolutional network: to capture the timing information of long video sequences, a multi-scale temporal convolution module models the inter-frame topological correlation in the time domain, as follows:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is changed by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are selected at intervals of d-1 frames for the convolution, where d is the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation (a bottleneck design);
4.3) Following the core idea of the ResNet model, a residual connection (1×1 convolution) is used in the time domain, which allows the raw information of lower layers to be passed directly to subsequent higher layers.
5) Using the spatial-dimension and temporal-dimension network modules designed in steps 3) and 4), a 9-layer spatio-temporal graph attention Transformer network model is constructed, a classifier is built from a fully connected layer and Softmax, and 5 quadruped behaviors are identified: chasing, resting, feeding, alertness, and walking.
The beneficial effects of the invention mainly include:
(1) A novel skeleton-based multi-species behavior recognition model is provided. Two-dimensional skeleton coordinates are used as the input of the (spatio-temporal graph convolutional) network, eliminating the interference of redundant information such as the external environment and greatly reducing the data volume. The model applies a Transformer self-attention mechanism to the multi-channel spatial topology, and multi-scale convolution to the temporal topology.
(2) The spatio-temporal graph convolutional network model (i.e., the behavior recognition model) is no longer limited to the topology of the animal's physical skeleton; it dynamically models the latent connection topology of the animal's joints, aggregates it with the physical topology, and establishes a corresponding channel topology. This makes it easier to capture the similar features of the latent connection topologies between joint points of the same behavior under different postures, such as the similar head movements when eating while standing versus eating while lying.
(3) The behavior recognition model uses multi-scale temporal convolution to model inter-frame topological correlation at different granularities in the time domain, capturing the change in animal posture across the consecutive frames of a behavior. This helps distinguish the posture-change differences of different behaviors along the temporal dimension, such as the different head movements during lying rest versus lying feeding.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a diagram of the adaptive spatio-temporal graph attention Transformer network architecture for animal behavior recognition according to the present invention;
FIG. 3 shows the acquisition process of the channel-level latent animal topology;
FIG. 4 is a schematic diagram of the self-attention calculation of the Spatial-Transformer;
FIG. 5 is a schematic diagram of the multi-scale temporal convolution;
FIG. 6 shows the multi-scale convolution acquisition process on the time scale;
FIG. 7 shows example images from the dataset;
FIG. 8 shows the loss curves of the different layering combinations;
FIG. 9 shows the confusion matrices of the comparison experiment between different temporal convolutions, where FIG. 9(a) uses multi-scale temporal convolution and FIG. 9(b) uses single-scale one-dimensional temporal convolution.
Detailed Description
One behavior of an animal may be performed in different postures (e.g., standing to eat or lying to eat), and the same posture may correspond to different behaviors, which makes the prior art less effective at these complex animal behavior recognition tasks. To solve these problems, the invention provides a quadruped behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network: a channel topology optimization module combines the physical topology of joint connections with the latent topology to generate a channel-wise skeleton topology, and a Spatial-Transformer module assigns it different connection weights; a multi-scale temporal convolution module models inter-frame topological correlation; and 5 daily behaviors (chasing, resting, feeding, alertness, and walking) are identified.
As shown in FIGS. 1 and 2, the invention designs a spatial-dimension network module and a temporal-dimension network module, on the basis of which a 9-layer spatio-temporal graph attention Transformer network model is constructed and applied to the behavior recognition of medium and large ungulates. The specific steps are as follows:
1) Collecting quadruped videos in a field environment;
2) Extract the animal joint-point information from the animal video sequence to construct a spatio-temporal skeleton graph, and on this basis build an animal behavior recognition model to recognize the different behaviors of the quadruped;
2.1) For an animal video sequence, track the body joint positions of the quadruped with DeepLabCut and construct an animal skeleton graph;
2.1.1) Manually label the joint points of a small number of frames of the video sequence through a graphical user interface;
2.1.2) Taking animal images as input, train a deep neural network iteratively to output a set of confidence maps describing the position of each joint point in the input image, and predict the joint positions of all frames; manually fine-tune the output and train the deep neural network again;
2.1.3) Apply the deep neural network trained in the previous step to new animal video sequences to rapidly predict the joint position coordinates and confidence values of the animals in the video;
2.2) Define a probability threshold; a frame image is used by the subsequent animal behavior recognition algorithm only if the confidence values of the predicted positions of all the animal's joint points in that frame are greater than or equal to the threshold.
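The confidence-gating rule of step 2.2) can be sketched as follows; the threshold value 0.6 and the array shapes are illustrative assumptions, since the patent does not fix a specific threshold:

```python
import numpy as np

def filter_frames(confidences, threshold=0.6):
    """Keep only frames in which every predicted joint point meets the
    confidence threshold (hypothetical value 0.6).

    confidences: (T, N) array of per-joint confidence values per frame
    returns: boolean mask of length T marking usable frames
    """
    confidences = np.asarray(confidences)
    return (confidences >= threshold).all(axis=1)

# Example: 3 frames, 4 joints; the middle frame has one low-confidence joint.
conf = np.array([[0.90, 0.80, 0.70, 0.95],
                 [0.90, 0.40, 0.70, 0.95],
                 [0.70, 0.80, 0.90, 0.65]])
mask = filter_frames(conf, threshold=0.6)  # frame 1 is discarded
```

Only the frames selected by `mask` would be passed to the skeleton-graph construction.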
The steps of constructing the adaptive space-time diagram attention transducer network comprise:
3) Design of the spatial dimension of the spatio-temporal graph convolutional network: as shown in FIG. 2, a channel topology optimization module first adaptively learns the skeleton topologies of different animal behaviors, combining the physical topology of joint connections with the latent topology to generate a channel-wise skeleton topology; a Spatial-Transformer module then assigns different connection weights to the channel-wise skeleton topology;
3.1) The channel topology optimization module combines the physical topology of joint connections with the latent topology to generate a channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.1.1) Convert the animal skeleton feature X obtained in step 2) into a higher-order feature X̃ by linear transformation, as in equation (1), and take it as the input of the overall model:

X̃ = XW  (1)

where N is the number of joints of the animal skeleton, C is the number of input channels, W is a weight matrix, and T is the length of the video frame sequence used for animal behavior recognition;
3.1.2) Using the joint connections of the animal's two-dimensional skeleton obtained in step 2) (i.e., the bones connecting pairs of joint points of the animal body), generate a corresponding adjacency matrix A for each animal skeleton; the adjacency matrix A expresses the adjacency relations between the skeleton's joint points, realizing the graph representation. Apply the adjacency matrix A as a shared topology to all channels of the channel topology optimization network;
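As a minimal sketch of step 3.1.2), the shared adjacency matrix A can be built from a bone list; the 8-joint skeleton and bone pairs below are hypothetical examples for illustration, not the patent's actual joint definition:

```python
import numpy as np

# Hypothetical 8-joint quadruped skeleton: the bone list pairs the joint
# indices that are physically connected (an illustrative assumption).
N = 8
bones = [(0, 1), (1, 2), (1, 3), (3, 4), (1, 5), (5, 6), (5, 7)]

A = np.zeros((N, N))
for i, j in bones:
    A[i, j] = A[j, i] = 1.0  # undirected physical connection
```

A is symmetric, with one pair of nonzero entries per bone, and is shared across all channels as the physical topology.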
3.1.3) Model the latent correlations between the joint points of the shared topology on different channels with a correlation modeling function, and obtain a channel-specific topology by linear transformation, as follows:
First, model the latent correlation between the joint points of the shared topology on different channels with the correlation modeling function of equation (2):

m_ij = σ(ψ(x_i) − φ(x_j))  (2)

where x_i, x_j ∈ X are the features of the joint pair (v_i, v_j) (i.e., joints of the animal body) of the animal skeleton physical topology obtained in step 2); ψ and φ are linear transformations that reduce the feature dimension before the correlation modeling function, lowering computational complexity; and σ(·) is the activation function of the neural network.
Then, apply equation (3) to linearly transform the latent-correlation features of the joint points obtained above, raising the channel dimension to obtain the channel-specific skeleton topology Q:

q_ij = m_ij · W_Q  (3)

where q_ij is a vector of the channel-specific skeleton topology Q, reflecting the channel-specific topological relationship between the joint pair (v_i, v_j).
3.1.4) As shown in FIG. 3, aggregate the shared topology A and the channel-specific topology Q to generate the channel-wise skeleton topology R using equation (4):

R = A + α·Q  (4)

where α is a trainable parameter that adjusts the strength between the shared topology A and the channel topology Q; by equation (4), the shared topology A is added to every channel of α·Q.
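Equation (4)'s aggregation can be sketched in NumPy; the sizes and the value of α are illustrative (α is a trained parameter in the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Cp = 5, 4                  # joints, output channels (illustrative sizes)
A = rng.integers(0, 2, size=(N, N)).astype(float)  # shared physical topology
A = np.maximum(A, A.T)                             # keep it symmetric
Q = rng.standard_normal((Cp, N, N))                # channel-specific topology
alpha = 0.5                                        # stand-in for the trainable parameter

# Equation (4): broadcast the shared topology A into every channel of alpha*Q
R = A[None, :, :] + alpha * Q                      # channel-wise topology, shape (Cp, N, N)
```

Each of the Cp channels thus gets its own refined skeleton topology while sharing the same physical adjacency.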
3.2) An attention mechanism based on the Spatial-Transformer module computes the connection weight between each pair of joint points of the channel-wise skeleton topology, thereby evaluating the importance of each joint connection of the topology and modeling an adaptive topology for each animal behavior sample, as follows:
3.2.1) The Spatial-Transformer network applies linear transformation and normalization to the channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k, and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, compute the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention between that pair of joint points.
As shown in FIG. 4, given the animal skeleton topology of the t-th frame of the behavior video sequence, let the query vector of its i-th query joint point be q_i^t, the key vectors of the remaining queried joint points of the t-th frame's skeleton topology be k_j^t, and their value vectors be v_j^t. The dot product of q_i^t and k_j^t gives the correlation-strength weight between the pair of joint points, which serves as their attention; all weights obtained for the i-th query joint point are then used in a weighted sum to obtain its new attention z_i^t for the t-th frame, computed as in equation (5):

z_i^t = Σ_j softmax( q_i^t (k_j^t)^T / √d_k ) v_j^t  (5)

where T denotes the transpose and d_k is the dimension of the key vector k.
For the whole video frame sequence, an attention feature vector is thus obtained for every frame;
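A minimal NumPy sketch of the scaled dot-product attention of equation (5), for a single query joint of one frame (sizes are illustrative; in the real model q, k, v come from the learned linear transforms of step 3.2.1)):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(q, K, V):
    """Equation (5): one query joint attends over all N joints of the
    same frame's skeleton topology; returns its new attention vector."""
    d_k = K.shape[-1]
    weights = softmax(q @ K.T / np.sqrt(d_k))  # (N,) weights over joints
    return weights @ V                         # weighted sum of value vectors

rng = np.random.default_rng(1)
N, d = 6, 8                        # joints, embedding dimension (illustrative)
K = rng.standard_normal((N, d))    # key vectors of all joints
V = rng.standard_normal((N, d))    # value vectors of all joints
z_i = joint_attention(K[0], K, V)  # new attention of joint 0
```

The softmax guarantees that the connection weights of one query joint over the skeleton sum to 1.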
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point (obtained in step 3.2.2)) and the remaining joint points of the skeleton topology are used in a weighted sum, yielding the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer network performs the mapping H times, i.e., steps 3.2.1) to 3.2.3) are repeated H times and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, mapping the channel-wise animal skeleton topology into multiple subspaces;
3.2.5) Concatenate the multi-head attentions, and multiply the concatenated matrix by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads. The calculation is given by equation (6):

MultiHead(Q, K, V) = Concat(head_1, …, head_H) W_o  (6)

where Q, K, and V are the query, key, and value vector matrices respectively, and W_o is a weight matrix.
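The multi-head aggregation of steps 3.2.4)–3.2.5) can be sketched as follows; H = 4 heads and the matrix sizes are assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, Wq, Wk, Wv, Wo):
    """Equation (6): run H attention heads over the N joint features X,
    concatenate the head outputs, and merge them with Wo."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q, K, V = X @ wq, X @ wk, X @ wv
        d_k = K.shape[-1]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    return np.concatenate(heads, axis=-1) @ Wo   # (N, d_model)

rng = np.random.default_rng(2)
N, d_model, H = 6, 16, 4          # joints, feature dim, heads (illustrative)
d_h = d_model // H                 # per-head dimension
X = rng.standard_normal((N, d_model))
Wq = rng.standard_normal((H, d_model, d_h))
Wk = rng.standard_normal((H, d_model, d_h))
Wv = rng.standard_normal((H, d_model, d_h))
Wo = rng.standard_normal((H * d_h, d_model))
out = multi_head(X, Wq, Wk, Wv, Wo)
```

Each head attends over the joints in its own subspace; concatenation followed by W_o mixes the information of all heads into one output matrix.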
4) Design of the time dimension of the spatio-temporal graph convolutional network: to capture the timing information of long video sequences, a multi-scale temporal convolution module models the inter-frame topological correlation in the time domain, as shown in FIGS. 5 and 6, as follows:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is changed by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are selected at intervals of d-1 frames for the convolution, where d is the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation (a bottleneck design);
4.3) Following the core idea of the ResNet model, a residual connection (1×1 convolution) is used in the time domain, which allows the raw information of lower layers to be passed directly to subsequent higher layers.
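The effect of the dilation rate d on the sampling window can be illustrated with a toy 1-D dilated convolution along time; a scalar per-joint feature and an all-ones 5-tap kernel are assumed for clarity (the real module is a 5×1 convolution over feature maps):

```python
import numpy as np

def dilated_temporal_conv(x, kernel, d):
    """1-D dilated convolution along the time axis: the kernel taps are
    spaced d frames apart, so the same 5-tap kernel covers a
    (k-1)*d + 1 frame window."""
    k = len(kernel)
    span = (k - 1) * d + 1
    T = len(x)
    return np.array([
        sum(kernel[m] * x[t + m * d] for m in range(k))
        for t in range(T - span + 1)
    ])

x = np.arange(10, dtype=float)                        # toy feature over 10 frames
y1 = dilated_temporal_conv(x, [1, 1, 1, 1, 1], d=1)   # 5-frame window
y2 = dilated_temporal_conv(x, [1, 1, 1, 1, 1], d=2)   # same kernel, 9-frame span
```

Increasing d widens the temporal receptive field without adding parameters, which is what lets the branches of the module sample at multiple scales.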
5) Using the spatial-dimension and temporal-dimension network modules designed in steps 3) and 4), construct a 9-layer spatio-temporal graph convolutional network model, build a classifier from a fully connected layer and Softmax, and identify 5 quadruped behaviors: chasing, resting, feeding, alertness, and walking. The network structure is as follows:
5.1) The first 4 layers of the spatio-temporal graph convolutional network model are the channel topology optimization convolutional network of the spatial-dimension module of step 3.1), with the multi-scale temporal convolution (MS-TCN) of step 4) applied to the time dimension of each layer;
5.2) The last 5 layers of the model are the Spatial-Transformer network of the spatial-dimension module of step 3.2), with the MS-TCN of step 4) applied to the time dimension of each layer;
5.3) A classifier is built from a fully connected layer (FCN) and Softmax to identify the 5 behaviors of medium and large quadrupeds: chasing, resting, feeding, alertness, and walking.
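Step 5.3)'s classification head reduces to a fully connected layer plus Softmax over the 5 behavior classes; the pooled feature size and the random weights below are stand-ins for the trained model's parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

BEHAVIORS = ["chasing", "resting", "feeding", "alertness", "walking"]

def classify(features, W, b):
    """Fully connected layer followed by Softmax over the 5 behavior
    classes; returns the predicted label and class probabilities."""
    probs = softmax(features @ W + b)
    return BEHAVIORS[int(np.argmax(probs))], probs

rng = np.random.default_rng(3)
D = 32                                     # pooled feature size (illustrative)
W = rng.standard_normal((D, len(BEHAVIORS)))  # stand-in for trained weights
b = np.zeros(len(BEHAVIORS))
label, probs = classify(rng.standard_normal(D), W, b)
```

The Softmax output is a proper probability distribution over the 5 behaviors, and the argmax gives the predicted behavior.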
Examples
1. Data set
In this example, a dataset of wild terrestrial mammal skeletons was constructed. FIG. 7 shows example images of the five behaviors: chasing, feeding, resting, walking, and watching (alertness). The dataset covers tens of wild animal species, including tiger, lion, leopard, polar bear, black bear, antelope, alpaca, and horse.
To collect the data, clips of the five behaviors were cut from a large number of animal videos. The details of the dataset are shown in Table 1: it consists of 67,106 images from 2,058 video sequences.
TABLE 1
As shown in Table 1, 60% of each class of samples was randomly selected as the training set and 40% as the validation set. The compared models are trained on the training set and their accuracy is evaluated on the validation set.
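The 60/40 per-class split can be sketched as follows (the per-class clip counts are toy values; the real counts are those of Table 1):

```python
import random

def stratified_split(samples_by_class, train_frac=0.6, seed=0):
    """Randomly take 60% of each behavior class for training and the
    remaining 40% for validation, class by class."""
    rng = random.Random(seed)
    train, val = [], []
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_frac)
        train += [(label, s) for s in shuffled[:cut]]
        val += [(label, s) for s in shuffled[cut:]]
    return train, val

# Toy example: 10 clips per behavior (real counts are in Table 1).
data = {c: list(range(10)) for c in ["chase", "rest", "feed", "watch", "walk"]}
train, val = stratified_split(data)
```

Splitting within each class keeps the behavior distribution of the validation set the same as that of the training set.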
2. Comparative experiments
In this section, ablation experiments verify the contribution of each key part of the model to accurate recognition of animal behavior. The proposed model (Channel-Transformer) is then compared with other state-of-the-art GCN-based methods.
2.1 Network framework design comparison
The Channel-Transformer consists of 9 layers of spatio-temporal GCN. To determine the number of layers of the channel topology optimization module (CTG) and the Spatial-Transformer module (TFR) in the Channel-Transformer network structure, this experiment sets different layer counts for CTG and TFR and compares the different layering combinations. To ensure a fair comparison, the multi-scale temporal convolution module (TCN) is used uniformly in the time dimension. As shown in Table 2, combining the two different sub-modules (CTG and TFR) achieves higher classification accuracy than simply stacking the same convolution module, and the 4-layer CTG + 5-layer TFR structure performs best; the Channel-Transformer model of the invention therefore adopts a 9-layer (4-layer CTG + 5-layer TFR) structure. The loss curves of each combination are shown in FIG. 8, from which it can be seen that the 4-layer CTG + 5-layer TFR structure converges fastest and best.
TABLE 2
2.2 Effectiveness of the CTG and TFR fusion
In this experiment, the Spatial-Transformer sub-module is also added to two existing graph convolution structures (fixed-GCN and adaptive-GCN). For a fair comparison, all three methods use the same layer division: the first four layers are graph convolution structures with different node-connection schemes (fixed-GCN, adaptive-GCN, or CTG), and the last five layers use the Spatial-Transformer sub-module to assign weights to different nodes. The animal behavior classification accuracies are shown in Table 3. The proposed CTG and TFR fusion clearly outperforms the combinations of the other two GCN types with TFR. This is because the proposed Channel-Transformer fuses the adjacency matrix with channel-level topology feature maps, helping to explore the latent correlations between animal joints across different channels. As shown in rows 4 to 6 of Table 3, introducing the TFR sub-module improves the performance of every network, which further verifies that the TFR sub-module can exploit the self-attention mechanism to discover correlations between different joint connections.
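The per-sample weight assignment performed by the TFR sub-module can be illustrated with a minimal, self-contained NumPy sketch (joint count, feature dimensions, and random weights are illustrative assumptions, not the patented implementation): single-head scaled dot-product attention over skeleton joints, whose attention map acts as learned joint-connection weights.

```python
import numpy as np

def joint_self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over skeleton joints.
    X: (N, C) joint features; returns (N, d) attended features and the
    (N, N) attention map, read here as per-sample joint-connection weights."""
    q, k, v = X @ Wq, X @ Wk, X @ Wv
    scores = q @ k.T / np.sqrt(k.shape[1])             # pairwise joint affinities
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # softmax over each row
    return attn @ v, attn

rng = np.random.default_rng(0)
N, C, d = 17, 8, 4                                     # joints, channels, head dim
X = rng.standard_normal((N, C))
Z, A = joint_self_attention(X, *(rng.standard_normal((C, d)) for _ in range(3)))
```

Because the attention map is recomputed from each input, every behavior sample receives its own topology weighting, which is the property the ablation above exercises.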
TABLE 3
2.3 Effectiveness of the MS-TCN sub-module
To evaluate the multi-scale convolution module on the time scale, the experiment also replaces it with a one-dimensional temporal convolution module. As Fig. 9 shows by comparing the confusion matrices of the two schemes, the multi-scale temporal convolution module outperforms the one-dimensional one.
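The difference between the two schemes can be sketched as follows (a hedged NumPy illustration; the 5×1 kernel follows the description in the claims, while the dilation rates, shapes, and single output channel are simplifying assumptions): each dilated branch widens the temporal receptive field, and summing the branches gives the multi-scale response that a plain one-dimensional convolution (a single branch with rate 1) lacks.

```python
import numpy as np

def dilated_temporal_conv(x, w, d):
    """1-D convolution along time with dilation rate d and 'same' padding.
    x: (T, C) feature sequence of one joint, w: (K, C) kernel -> (T,) output."""
    K, _ = w.shape
    pad = (K - 1) * d // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.array([sum(np.dot(xp[t + k * d], w[k]) for k in range(K))
                     for t in range(x.shape[0])])

def multi_scale_tcn(x, kernels, rates=(1, 2, 3)):
    """Sum of dilated branches: larger rates see a wider temporal window."""
    return sum(dilated_temporal_conv(x, w, d) for w, d in zip(kernels, rates))

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 8))                       # 16 frames, 8 channels
kernels = [rng.standard_normal((5, 8)) for _ in range(3)]  # 5x1 kernels
y = multi_scale_tcn(x, kernels)
```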
2.4 Comparison with the prior art
Because state-of-the-art skeleton-based animal behavior recognition methods are scarce, the present invention compares the proposed Channel-Transformer model with existing skeleton-based human action models. Table 4 shows that, under the same evaluation settings, the Channel-Transformer model of the present invention outperforms the state-of-the-art human action methods.
TABLE 4
As Table 4 shows, although the accuracy of MS-G3D, the best-performing current graph network model, reaches 91.11%, its parameter count is as high as 3.01M. The proposed method reaches 93.21% accuracy with only 1.33M parameters, demonstrating its advancement.
2.5 In summary, the experimental results demonstrate the effectiveness of the network structure and the technical contributions of the model components, and show that the Channel-Transformer model designed for animal behaviors outperforms the state-of-the-art methods.
3. Conclusion
The invention discloses a novel graph convolutional network (Channel-Transformer) based on a channel-level topology and a Transformer attention mechanism for skeleton-based animal behavior recognition, in which an adaptively learned topology is formed through channel-wise correlation modeling. To focus further on important joint connections, the invention employs the Spatial-Transformer to assign different weights to different topological connections for different behaviors, which helps allocate resources efficiently. To improve the expression of inter-frame topological correlations, multi-scale temporal convolution is applied to improve the performance of the model. Experimental results show that the Channel-Transformer has stronger representation capability than other graph convolution models.
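The channel-wise correlation modeling mentioned above can be sketched as follows (a NumPy illustration of one common formulation of channel-wise topology refinement: a shared physical adjacency plus learned per-channel corrections; all shapes, values, and the blending parameter alpha are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def channel_wise_graph_conv(X, A, Q, alpha=1.0):
    """Channel-wise topology refinement sketch.
    X: (C, N) per-channel joint features, A: (N, N) shared physical topology,
    Q: (C, N, N) learned channel-specific corrections.
    Each channel aggregates over its own refined topology A + alpha * Q[c]."""
    out = np.empty_like(X)
    for c in range(X.shape[0]):
        out[c] = (A + alpha * Q[c]) @ X[c]             # per-channel aggregation
    return out

rng = np.random.default_rng(2)
C, N = 8, 17                                           # channels, joints
X = rng.standard_normal((C, N))
A = (rng.random((N, N)) < 0.2).astype(float)           # sparse physical adjacency
Q = 0.1 * rng.standard_normal((C, N, N))               # small learned corrections
Y = channel_wise_graph_conv(X, A, Q)
```

Letting each channel carry its own correction term is what allows different feature channels to model different latent joint correlations on top of the fixed skeleton.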

Claims (1)

1. A quadruped behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network, comprising the following steps:
firstly, collecting a quadruped video sequence in a field environment;
secondly, extracting animal joint-point information from the quadruped video sequence to construct a spatio-temporal skeleton graph, and then constructing an animal behavior recognition model to recognize different behaviors of the quadruped;
the method is characterized in that a self-adaptive space-time diagram attention transducer network is constructed as an animal behavior recognition model to recognize different behaviors of the quadruped; the different behaviors of the quadruped are the 5 behaviors of chasing, resting, feeding, vigilance and walking of the quadruped; the step of constructing an adaptive space-time diagram attention transducer network comprises the following steps:
1) Designing the spatial dimension module of the adaptive spatio-temporal graph attention Transformer network:
1.1) Combining the physical topology and the latent topology of joint connections using a channel topology optimization convolution network to generate a channel-wise skeleton topology, capturing all possible connection patterns between the joint points of quadruped behaviors;
1.2) Adopting an attention mechanism based on the Spatial-Transformer module to calculate the connection weight between each pair of joint points of the channel-wise skeleton topology, thereby evaluating the importance of joint connections in the topology and modeling an adaptive topology for each animal behavior sample; the specific steps are:
1.2.1) The Spatial-Transformer module applies a linear transformation and normalization to the channel-wise skeleton topology obtained in step 1.1), converting the information of each animal joint point into the vector representations required by Transformer attention: a query vector q, a key vector k, and a value vector v;
1.2.2) For each pair of joint points i and j of the channel-wise skeleton topology, computing the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention of the pair;
1.2.3) For each joint point i of the channel-wise skeleton topology, taking the attentions between joint point i and the remaining joint points obtained in step 1.2.2) as weights, and computing the weighted sum of the value vectors of all joint points with their corresponding attention weights to obtain the new attention z_i of the i-th joint point;
1.2.4) The multi-head attention of the Spatial-Transformer module performs H mappings, i.e., steps 1.2.1) to 1.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, mapping the channel-wise skeleton topology into multiple subspaces;
1.2.5) The multi-head attentions are concatenated, and the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
2) Designing the time dimension module of the adaptive spatio-temporal graph attention Transformer network: modeling inter-frame topological correlation in the time domain using the multi-scale temporal convolution module MS-TCN, with the following steps:
2.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is changed by varying the dilation rate d of the dilated convolution;
2.2) For a skeleton sequence, skeleton topologies are selected at intervals of d-1 frames for convolution, where d is the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation (bottleneck design);
2.3) Following the core idea of the ResNet model, a residual connection (a 1×1 convolution) is used in the time domain, allowing the original lower-layer information to be passed directly to subsequent higher layers;
3) Constructing the 9-layer adaptive spatio-temporal graph attention Transformer network from the spatial dimension module and the time dimension module designed in steps 1) and 2); the network structure is as follows:
the first 4 layers of the adaptive spatio-temporal graph attention Transformer network are the channel topology optimization convolution network of the spatial dimension module in step 1.1), with the multi-scale temporal convolution module MS-TCN of step 2) applied in the time dimension of each layer;
the last 5 layers of the adaptive spatio-temporal graph attention Transformer network are the Spatial-Transformer module of the spatial dimension module in step 1.2), with the multi-scale temporal convolution module MS-TCN of step 2) applied in the time dimension of each layer;
the classifier of the adaptive spatio-temporal graph attention Transformer network is constructed via a fully connected layer (FCN) and Softmax.
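As a reading aid only, the 9-layer structure of step 3) can be summarized as a hypothetical layer plan (the names are shorthand for the modules described above, not executable modules from the patented implementation):

```python
def build_layer_plan():
    """Layer plan of the 9-layer network: 4 CTG layers, then 5 TFR layers,
    each paired with MS-TCN in the time dimension, then the classifier head."""
    plan = [("CTG", "MS-TCN")] * 4 + [("TFR", "MS-TCN")] * 5
    plan.append(("FCN", "Softmax"))                    # classifier of step 3)
    return plan

plan = build_layer_plan()
```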
CN202211588021.0A 2022-12-12 2022-12-12 Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network Active CN115797841B (en)


Publications (2)

Publication Number Publication Date
CN115797841A CN115797841A (en) 2023-03-14
CN115797841B true CN115797841B (en) 2023-08-18

Family

ID=85418596


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
WO2022000420A1 (en) * 2020-07-02 2022-01-06 浙江大学 Human body action recognition method, human body action recognition system, and device
CN114596632A (en) * 2022-03-02 2022-06-07 南京林业大学 Medium-large quadruped animal behavior identification method based on architecture search graph convolution network
CN114821799A (en) * 2022-05-10 2022-07-29 清华大学 Motion recognition method, device and equipment based on space-time graph convolutional network




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant