CN115797841A - Quadruped animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network - Google Patents

Info

Publication number
CN115797841A (application CN202211588021.0A; granted as CN115797841B)
Authority
CN
China
Prior art keywords: topology, animal, channel, attention, skeleton
Granted
Application number
CN202211588021.0A
Other languages
Chinese (zh)
Other versions
CN115797841B (en)
Inventor
赵亚琴 (Zhao Yaqin)
冯丽琦 (Feng Liqi)
Current Assignee
Nanjing Forestry University
Original Assignee
Nanjing Forestry University
Priority date
Filing date
Publication date
Application filed by Nanjing Forestry University filed Critical Nanjing Forestry University
Priority to CN202211588021.0A priority Critical patent/CN115797841B/en
Publication of CN115797841A publication Critical patent/CN115797841A/en
Application granted granted Critical
Publication of CN115797841B publication Critical patent/CN115797841B/en
Legal status: Active

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/70Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry

Landscapes

  • Image Analysis (AREA)

Abstract

A quadruped animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network comprises the following steps: 1) collecting video images of quadrupeds and marking the joint points with DeepLabCut to generate a skeleton topology graph; 2) first combining the physical topology of the joint connections with the latent topology via a channel topology optimization module to generate a channel-wise skeleton topology, then assigning different connection weights to the channel-wise skeleton topology through a Spatial-Transformer module; 3) modeling the inter-frame topological correlation with a multi-scale temporal convolution module; 4) identifying, through a fully connected layer and a Softmax classification layer, 5 daily behaviors of the animals: chasing, resting, eating, alertness and walking. In the invention, the spatial feature extraction module captures all possible connection patterns among the joint points of animal behaviors and assigns them different connection weights, while the temporal feature extraction module captures the change of animal posture over the consecutive frames of a behavior, improving the accuracy of animal behavior recognition; the method therefore has good application prospects.

Description

Quadruped animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network
Technical Field
The invention relates to a behavior recognition method, and in particular to a spatio-temporal graph convolutional network model built on attention Transformers and an adaptive skeleton graph topology, applied to recognizing the skeleton-based behavior of quadrupeds.
Background
In the prior art, behavior recognition refers to automatically identifying the behavior of a subject from static images or video sequences using key technologies such as moving-object detection, feature extraction and posture analysis.
Medium and large ungulates are an important component of the ecosystem, and identifying their behavior is an important part of animal research, protection and management. Automated imaging systems have made the acquisition of animal images increasingly convenient, but the resulting data also contain a large amount of irrelevant information, and the limited capacity of manual screening severely limits the effective use of these data. Vision-based animal behavior recognition algorithms therefore provide an automated and more convenient way to obtain useful animal information for behavior recognition studies.
In recent years, research on animal behavior recognition has focused on laboratory animals (e.g., mice, fruit flies) or livestock (e.g., pigs, cattle), mainly classifying animal behaviors from RGB images with machine learning or deep learning algorithms. Norouzzadeh et al. applied 9 different deep neural network models to identify simple behaviors of wild animals in the Snapshot Serengeti (SS) dataset [1]. However, animal behavior is a dynamic process, and the expressive power of a single image is limited. Frank Schindler [2] trained three different ResNet variants on 8-frame RGB video sequences to distinguish feeding, moving and gazing behaviors of animals, but with poor recognition performance.
In the human behavior recognition research in the prior art, the behavior recognition research is carried out based on skeleton characteristics, for example:
researchers have proposed the spatio-temporal graph convolutional network ST-GCN, which uses a fixed graph that conforms to the natural connectivity of the human skeleton but has no flexibility: the topology is the same for all layers, all channels and all actions.
To better accommodate different motion expressions, other researchers have proposed a dynamic convolutional neural network that learns the correlation between each joint pair from the contextual features of the joints.
The introduction of attention mechanisms lets model resources focus on the factors that matter more for different actions, and researchers have since used the Transformer self-attention mechanism to reconstruct the spatio-temporal dependencies of the human skeleton. Likewise, the expression of animal behavior is closely related to the skeleton topology. However, the image backgrounds in existing data sets for human behavior recognition are relatively simple, usually shot by a camera facing the human body, whereas the background of an animal monitoring video is very complex; sometimes the animal's coat color is very similar to the color of the environment and hard to distinguish, which may interfere with animal behavior recognition.
The prior art, as disclosed in publication No. CN114596632A, "Method for identifying behaviors of large and medium-sized quadrupeds based on an architecture-search graph convolutional network" (202210204633.9), comprises the following steps. First, for animal behavior feature extraction based on the animal skeleton, the positions of the joint points of the animal's body parts are rapidly tracked in the animal video with DeepLabCut to form a spatio-temporal skeleton graph, capturing the spatio-temporal features of the animal's different behaviors. Then, several spatio-temporal graph convolution operation modules based on animal skeletons are designed to form a graph-based search space that fuses residual connections, bottleneck structures and various attention mechanisms. Finally, a differentiable architecture-search strategy makes the search space continuous, so that a low-cost spatio-temporal graph convolution model for identifying the behaviors of medium and large quadrupeds is searched automatically, achieving the goal of distinguishing the daily behaviors of the animals. This technology mainly aims to search automatically for an optimal spatio-temporal graph convolutional network structure for different animal behavior identification tasks by introducing and designing several effective attention graph-structure modules. In practice, however, an animal may adopt different postures for one behavior, such as standing to eat or lying to eat, and the same posture may correspond to different behaviors, such as lying to eat or lying to rest, which greatly increases the difficulty of behavior recognition.
To solve these problems, the invention mines the latent associations that may exist between joint points of the animal that are not physically connected, for example the coordinated movement of the limbs during chasing, and dynamically models the latent connection topology of the animal joints so as to aggregate it with the physical topology, thereby improving the accuracy of animal behavior identification.
Disclosure of Invention
Research shows that the background of animal monitoring videos is very complex, and sometimes the animal's coat color is so similar to the color of the environment that the two are hard to distinguish, which interferes with animal behavior identification. Meanwhile, one behavior of an animal may involve different postures, such as standing to eat or lying to eat, and the same posture may correspond to different behaviors, such as lying to eat or lying to rest, which greatly increases the difficulty of behavior recognition.
To solve these problems and further improve the accuracy of behavior identification, the method dynamically models the latent connection topology of the animal joints, aggregates it with the physical topology to regenerate a new animal skeleton topology, and models the correlation of the skeleton topologies of consecutive frames in the video sequence with a multi-scale temporal convolution module.
Based on this idea, the invention provides an animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network. For multi-species, multi-class animal behavior recognition data, a spatial feature extraction module captures the latent connection patterns among the joint points of animal behaviors and assigns different connection weights, while a temporal feature extraction module captures the change of animal posture over the consecutive frames of a behavior; a 9-layer spatio-temporal graph attention Transformer convolutional network model is then constructed to identify the different behaviors of the animal.
The network input of the invention is a spatio-temporal graph sequence based on the animal skeleton; compared with image recognition, only the most essential behavioral features of the animal are retained, greatly reducing the interference of redundant information on the model. The channel topology optimization and Spatial-Transformer modules make it easier to capture the similar features of the latent connection topology between joint points of the same animal behavior under different postures, such as the head movement during eating; and the multi-scale temporal convolution improves the expressiveness of behavioral features over longer time sequences.
The invention discloses a quadruped animal behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network, which specifically comprises the following steps:
1) Collecting quadruped videos in a field environment;
2) Extracting animal joint point information from the animal video sequence to construct a spatio-temporal skeleton graph, and further constructing an animal behavior identification model to identify the different behaviors of the quadruped;
the method for constructing the adaptive space-time diagram attention Transformer network comprises the following steps:
designing the spatial dimension of the spatio-temporal graph convolutional network: first, a channel topology optimization module is used to adaptively learn the skeleton topologies of different animal behaviors, combining the physical topology of the joint connections with the latent topology to generate a channel-wise skeleton topology; then, different connection weights are assigned to the channel-wise skeleton topology through a Spatial-Transformer module, the method comprising the following steps:
3.1) Using the channel topology optimization module to combine the physical topology and the latent topology of the joint connections to generate the channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.2) Adopting an attention mechanism based on the Spatial-Transformer module to calculate the connection weight between each pair of joint points of the channel-wise skeleton topology, thereby evaluating the importance of each joint connection and simulating an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer network performs linear transformation and normalization on the channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, computing the dot product of the query vector q_i of the i-th joint point with the key vector k_j of the j-th joint point as the attention of that pair of joint points;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points obtained in step 3.2.2) are weighted and summed to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer network performs H different learnable mappings, i.e. steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention z̃_i of the i-th joint point, thereby mapping the channel-wise animal skeleton topology into multiple subspaces;
3.2.5) The multi-head attentions z̃_i are concatenated, and the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
4) Designing the time dimension of the spatio-temporal graph convolutional network: in order to obtain the timing information of long video sequences, a multi-scale temporal convolution module is used to model the inter-frame topological correlation in the time domain, the steps comprising:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are selected at intervals of d−1 frames for the convolution, where d denotes the dilation rate; the convolution kernel size is 5 × 1, and a 1 × 1 convolution is introduced before each convolution operation (a bottleneck design);
4.3) Borrowing the core idea of the ResNet model, residual connections (1 × 1 convolutions) are used in the time domain, allowing the original information of a lower layer to be passed directly to subsequent higher layers.
5) Using the space dimension network module and the time dimension network module designed in steps 3) and 4), a 9-layer spatio-temporal graph attention Transformer network model is constructed, a classifier is built through a fully connected layer and Softmax, and 5 behaviors of the quadruped are identified: chasing, resting, eating, alertness and walking.
The beneficial effects of the invention mainly comprise:
(1) A brand-new skeleton-based multi-species behavior recognition model is provided. The two-dimensional skeleton coordinates are used as the input of the (spatio-temporal graph convolutional) network, eliminating the interference of redundant information such as the external environment and greatly reducing the data volume. The model adopts a self-attention Transformer mechanism on the multi-channel spatial topology and multi-scale convolution on the temporal topology.
(2) The spatio-temporal graph convolutional network model (i.e. the behavior recognition model) is not limited to the topology of the animal's physical skeleton; instead it dynamically models the latent connection topology of the animal joints, aggregates it with the physical topology, and establishes a corresponding channel topology. This makes it easier to capture the similar features of the latent connection topology between joint points of the same animal behavior in different postures, such as the similar head movement when an animal eats standing up or lying down, two different postures.
(3) The behavior recognition model uses multi-scale temporal convolution to model the inter-frame topological correlation of the time domain at different granularities and captures the change of animal posture over the consecutive frames of a behavior, which helps to distinguish, in the time dimension, the posture-change differences of different behaviors, such as the different head movements during lying rest and lying eating.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram of the adaptive spatio-temporal graph attention Transformer network structure applied to animal behavior recognition according to the present invention;
FIG. 3 is a process of channel level animal potential topology acquisition;
FIG. 4 is a schematic diagram of a Spatial-Transformer's self-attention computation;
FIG. 5 is a schematic diagram of a multi-scale time convolution;
FIG. 6 is a multi-scale convolution acquisition process on a time scale;
FIG. 7 is an exemplary graph of a data set;
FIG. 8 is a graph of loss for different combinations of layers;
fig. 9 is a confusion matrix for a comparison experiment of convolution with different time series dimensions, wherein fig. 9 (a) is a confusion matrix using multi-scale time convolution, and fig. 9 (b) is a confusion matrix using one-dimensional scale time convolution.
Detailed Description
An animal may adopt different postures for one behavior, such as standing to eat or lying to eat, and the same posture may correspond to different behaviors, which makes the prior art less effective at these complex animal behavior recognition tasks. To solve these problems, the invention provides a quadruped behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network: a channel topology optimization module combines the physical topology of the joint connections with the latent topology to generate a channel-wise skeleton topology, a Spatial-Transformer module assigns different connection weights to the channel-wise skeleton topology, and a multi-scale temporal convolution module models the inter-frame topological correlation, identifying 5 daily behaviors of the animal: chasing, resting, eating, alertness and walking.
As shown in fig. 1 and fig. 2, the invention designs a space dimension network module and a time dimension network module, and based on the space dimension network module and the time dimension network module, a 9-layer spatiotemporal graph attention Transformer network model is constructed, and is applied to behavior recognition of medium and large-sized ungulates, and the method specifically comprises the following steps:
1) Collecting quadruped videos in a field environment;
2) Extracting animal joint point information from the animal video sequence to construct a spatio-temporal skeleton graph, and further constructing an animal behavior identification model to identify the different behaviors of the quadruped;
2.1 For an animal video sequence, tracking the positions of body joint points of the quadruped to construct an animal skeleton diagram;
2.1.1 Manually marking the joint positions of a few frames of images in a video sequence through a graphical user interface;
2.1.2 Using animal images as input, training and iterating through a deep neural network, generating a group of confidence maps describing the position of each joint point in the input animal images as output, and predicting the joint positions of all frames; the output is subjected to manual fine tuning, and the deep neural network is trained again;
2.1.3 Applying the deep neural network trained in the previous step to a new animal video sequence, and rapidly predicting joint position coordinates and confidence values of the animals in the video;
2.2) A probability threshold is defined; a frame image is used for the subsequent animal behavior recognition algorithm only when the confidence values of the predicted positions of all the animal's joint points in that frame are greater than or equal to the threshold.
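The joint-tracking pipeline of steps 2.1)-2.2) ends by discarding frames whose predicted joints are not all confident enough. A minimal numpy sketch of that confidence-threshold filter; the function name, array shapes and the threshold value are assumptions for illustration (the patent does not state a value):

```python
import numpy as np

def filter_frames(keypoints, confidences, threshold=0.6):
    """Step 2.2): keep only frames in which every predicted joint position
    has a confidence value greater than or equal to the threshold.

    keypoints:   (T, N, 2) joint coordinates for T frames and N joints
    confidences: (T, N) per-joint confidence values
    Returns the retained frames and their original frame indices."""
    keep = np.all(confidences >= threshold, axis=1)  # per-frame test over all joints
    return keypoints[keep], np.flatnonzero(keep)

# Toy example: 3 frames, 2 joints; frame 1 has one low-confidence joint.
kps = np.zeros((3, 2, 2))
conf = np.array([[0.90, 0.80],
                 [0.90, 0.30],
                 [0.70, 0.95]])
kept, idx = filter_frames(kps, conf, threshold=0.6)  # frame 1 is dropped
```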
The method for constructing the adaptive spatio-temporal graph attention Transformer network comprises the following steps:
3) Designing the spatial dimension of the spatio-temporal graph convolutional network: as shown in fig. 2, first, the channel topology optimization module adaptively learns the skeleton topologies of the different behaviors of the animal, combining the physical topology of the joint connections with the latent topology to generate a channel-wise skeleton topology; then, different connection weights are assigned to the channel-wise skeleton topology through the Spatial-Transformer module;
3.1) Using the channel topology optimization module to combine the physical topology and the latent topology of the joint connections to generate a channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.1.1) The animal skeleton feature X obtained in step 2) is converted into a high-order feature X̃ by a linear transformation, as in formula (1), and used as the input of the whole model:

X̃ = W X  (1)

where X ∈ R^(C×T×N), N is the number of joint points of the animal skeleton, C is the number of input channels, W ∈ R^(C′×C) is a weight matrix so that X̃ ∈ R^(C′×T×N), and T is the length of the video frame sequence used to identify the animal behavior;
3.1.2) The joint connections of the two-dimensional animal skeleton obtained in step 2) (i.e. the bones connecting pairs of joint points of the animal body) are used to generate for each animal skeleton a corresponding adjacency matrix A ∈ R^(N×N), which expresses the graph and represents the adjacency relations between the animal skeleton nodes. The adjacency matrix A is applied as a Shared topology to all channels of the channel topology optimization network;
3.1.3) Latent correlations between the joint points of the Shared topology in different channels are modeled with a modeling function, and a Channel-specific topology is obtained by linear transformation, the steps comprising:
First, the latent correlation between each pair of joint points of the Shared topology is modeled by the modeling function of formula (2):

f(x_i, x_j) = σ(ψ(x_i) − φ(x_j))  (2)

where x_i, x_j ∈ X are the features of a node pair (v_i, v_j) of the physical topology of the animal skeleton obtained in step 2) (i.e. of the joint points of the animal body), ψ and φ are linear transformations used to reduce the feature dimension before the correlation modeling function and so lower the computational complexity, and σ(·) is a neural-network activation function;
Then, the latent joint-correlation features obtained above are linearly transformed with formula (3) to lift the channel dimension, giving the Channel-specific skeleton topology Q:

q_ij = ξ(f(x_i, x_j)), q_ij ∈ R^(C′)  (3)

where q_ij, a vector of the Channel-specific skeleton topology Q, reflects the Channel-specific topological relation between the node pair (v_i, v_j), and ξ is a channel-lifting linear transformation.
3.1.4) As shown in fig. 3, the Shared topology feature A and the Channel-specific topology feature Q are aggregated, and the channel-wise skeleton topology feature is generated using formula (4):

R = A + α · Q  (4)

where α is a trainable parameter used to adjust the strength between the Shared topology A and the channel topology Q; through formula (4), the Shared topology feature A is added to each channel of α · Q.
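Steps 3.1.1)-3.1.4) can be sketched as follows. This is a hedged numpy illustration of channel-wise topology refinement for one sample: the function name `channel_topology`, the choice of tanh for the activation σ, the time-pooling step and all dimensions are assumptions for illustration, not the patent's exact implementation:

```python
import numpy as np

def channel_topology(x, A, W_psi, W_phi, W_xi, alpha):
    """Sketch of steps 3.1.1)-3.1.4) for one sample.

    x:     (C, T, N) high-order skeleton feature
    A:     (N, N) shared physical adjacency (step 3.1.2)
    W_psi, W_phi: (r, C) dimension-reducing transforms psi, phi (formula (2))
    W_xi:  (C', r) channel-lifting transform xi (formula (3))
    alpha: trainable fusion strength (formula (4))
    Returns R of shape (C', N, N): one refined topology per channel."""
    xt = x.mean(axis=1)                            # (C, N): pool the time axis
    a, b = W_psi @ xt, W_phi @ xt                  # (r, N) each
    # Formula (2): sigma(psi(x_i) - phi(x_j)) for every joint pair (i, j),
    # with tanh standing in for the unspecified activation sigma.
    corr = np.tanh(a[:, :, None] - b[:, None, :])  # (r, N, N)
    Q = np.einsum('cr,rij->cij', W_xi, corr)       # formula (3): (C', N, N)
    return A[None, :, :] + alpha * Q               # formula (4): R = A + alpha*Q
```

With alpha = 0 the refined topology R collapses to the shared adjacency A in every channel, which matches formula (4).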
3.2) An attention mechanism based on the Spatial-Transformer module is used to calculate a connection weight between each pair of joint points of the channel-wise skeleton topology, thereby evaluating the importance of each joint connection and simulating an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer network performs linear transformation and normalization on the channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, the dot product of the query vector q_i of the i-th joint point with the key vector k_j of the j-th joint point is computed as the attention of that pair of joint points;
as shown in FIG. 4, given the topology of the animal skeleton of the t-th frame of the animal video behavior sequence, the query vector of the i-th query joint point is set as
Figure BDA0003992727140000066
The key vectors of the rest inquired joint points of the animal skeleton topology of the t frame are
Figure BDA0003992727140000071
The value vector is
Figure BDA0003992727140000072
Computing query vectors
Figure BDA0003992727140000073
Sum key vector
Figure BDA0003992727140000074
Dot product to obtain the weight of the correlation strength between the pair of joint points
Figure BDA0003992727140000075
As the attention of the pair of joint points, then performing weighted summation on all the obtained weights of the ith query joint point to obtain the new attention of the ith query joint point of the tth frame
Figure BDA0003992727140000076
The calculation formula is shown as formula (5),
Figure BDA0003992727140000077
wherein, T isMeaning transposed, d k The dimension of the key vector k.
For a video frame sequence, obtaining an attention feature vector;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points obtained in step 3.2.2) are weighted and summed to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer network performs H different learnable mappings, i.e. steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention z̃_i of the i-th joint point, thereby mapping the channel-wise animal skeleton topology into multiple subspaces;
3.2.5) The multi-head attentions z̃_i are concatenated, and the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads; the calculation is shown in formula (6):

MultiHead(Q, K, V) = Concat(head_1, …, head_H) W_o  (6)

where Q, K and V are the query, key and value vector matrices respectively, and W_o is a weight matrix.
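The Spatial-Transformer attention of steps 3.2.1)-3.2.5) (formulas (5) and (6)) can be sketched for a single frame as plain scaled dot-product multi-head attention over the N joints. Function and parameter names are hypothetical:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(X, Wq, Wk, Wv, Wo, H):
    """Multi-head self-attention over the N joints of one frame
    (steps 3.2.1)-3.2.5), formulas (5) and (6)).

    X: (N, C) joint features; Wq, Wk, Wv: (C, H*dk) projections producing
    the query/key/value vectors; Wo: (H*dk, C) output weights; H heads."""
    N, _ = X.shape
    dk = Wq.shape[1] // H
    q = (X @ Wq).reshape(N, H, dk)
    k = (X @ Wk).reshape(N, H, dk)
    v = (X @ Wv).reshape(N, H, dk)
    heads = []
    for h in range(H):
        # Formula (5): scaled dot-product attention between every joint pair.
        attn = softmax(q[:, h] @ k[:, h].T / np.sqrt(dk))  # (N, N)
        heads.append(attn @ v[:, h])                       # (N, dk)
    Z = np.concatenate(heads, axis=1)  # step 3.2.5): splice the H heads
    return Z @ Wo                      # formula (6): multiply by W_o
```

Each row of `attn` is a distribution over the other joints, so the output mixes joint features according to the learned connection weights.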
4) Designing the time dimension of the spatio-temporal graph convolutional network: to obtain the timing information of long video sequences, a multi-scale temporal convolution module is used to model the inter-frame topological correlation in the time domain, as shown in figs. 5 and 6, the steps comprising:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are selected at intervals of d−1 frames for the convolution, where d denotes the dilation rate; the convolution kernel size is 5 × 1, and a 1 × 1 convolution is introduced before each convolution operation (a bottleneck design);
4.3) Borrowing the core idea of the ResNet model, residual connections (1 × 1 convolutions) are used in the time domain, allowing the original information of a lower layer to be passed directly to subsequent higher layers.
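A minimal numpy sketch of the dilated sampling in steps 4.1)-4.2). The 5 × 1 kernel acts along time only, so a single joint-channel time series suffices to illustrate it; the branch-merging rule (summing trimmed outputs) and the dilation rates are assumptions, and the padding, bottleneck and residual details are omitted:

```python
import numpy as np

def dilated_tconv(x, w, d):
    """Valid 1-D dilated convolution along time for one joint channel:
    samples are taken d frames apart (i.e. at intervals of d-1 frames).
    x: (T,) time series; w: (K,) kernel; d: dilation rate."""
    T, K = len(x), len(w)
    span = (K - 1) * d
    return np.array([np.dot(w, x[t:t + span + 1:d]) for t in range(T - span)])

def multi_scale_tconv(x, w, rates=(1, 2, 3)):
    """Steps 4.1)-4.2): apply the same kernel at several dilation rates and
    merge the branches (here by summing after trimming to a common length)."""
    outs = [dilated_tconv(x, w, d) for d in rates]
    L = min(len(o) for o in outs)
    return sum(o[:L] for o in outs)
```

A larger dilation rate widens the temporal receptive field without adding kernel weights, which is what gives the module its multi-scale view of a behavior sequence.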
5) Using the space dimension network module and the time dimension network module designed in steps 3) and 4), a 9-layer spatio-temporal graph convolutional network model is constructed, a classifier is built through a fully connected layer and Softmax, and 5 behaviors of the quadruped are identified: chasing, resting, eating, alertness and walking. The network structure is as follows:
5.1) The first 4 layers of the model are the channel topology optimization convolutional network of the spatial module in step 3.1), with the multi-scale temporal convolution MS-TCN of step 4) applied to the time dimension of each layer;
5.2) The last 5 layers of the model are the Spatial-Transformer network of the spatial module in step 3.2), with the multi-scale temporal convolution MS-TCN of step 4) applied to the time dimension of each layer;
5.3) A classifier is constructed through the fully connected layer FCN and Softmax to identify the 5 behaviors of chasing, resting, eating, alertness and walking of the medium and large quadruped.
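The 9-layer assembly of steps 5.1)-5.3) can be sketched schematically. The layer callables below are identity stand-ins for the CTG, TFR and TCN modules, so only the wiring is illustrated (4 CTG layers, then 5 TFR layers, each followed by a TCN, then pooling, a fully connected layer and Softmax); `build_model` and all shapes are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def build_model(ctg_layer, tfr_layer, tcn_layer, n_ctg=4, n_tfr=5):
    """Steps 5.1)-5.3): n_ctg CTG layers then n_tfr TFR layers, each followed
    by a TCN in the time dimension, closed by global pooling, a fully
    connected layer and Softmax over the behavior classes."""
    layers = [lambda x, f=ctg_layer, g=tcn_layer: g(f(x)) for _ in range(n_ctg)]
    layers += [lambda x, f=tfr_layer, g=tcn_layer: g(f(x)) for _ in range(n_tfr)]

    def forward(x, W_fc):
        for layer in layers:
            x = layer(x)
        pooled = x.mean(axis=tuple(range(1, x.ndim)))  # global average pooling
        return softmax(W_fc @ pooled)                  # class probabilities
    return forward

# Identity stand-ins for CTG/TFR/TCN keep the sketch runnable end to end.
model = build_model(lambda x: x, lambda x: x, lambda x: x)
probs = model(np.ones((8, 4, 5)), np.eye(5, 8))  # (channels, frames, joints) input
```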
Examples
1. Data set
In this example, a data set containing skeletons of wild terrestrial mammals was constructed. Fig. 7 shows example images of the five behaviors: chasing, eating, resting, walking and alertness. Dozens of wild animal species are included, among them tiger, lion, leopard, polar bear, black bear, antelope, alpaca and horse.
To collect data, videos of five behaviors were edited from a large number of animal videos. The details of this data set are shown in table 1. The data set consisted of 67606 images from 2058 video sequences.
TABLE 1
[Table 1 appears as an image in the original document: sample statistics of the data set by behavior class.]
As can be seen from Table 1, for each class of samples 60% were randomly selected as the training set and 40% as the validation set. The comparison models were trained on the training set, and accuracy was obtained on the validation set.
2. Comparative experiment
In this section, ablation experiments verify the contribution of each key part of the model to accurately identifying animal behavior. This example then compares the model proposed by the present invention (Channel-Transformer) with other state-of-the-art GCN-based methods.
2.1 network framework design comparison
The Channel-Transformer consists of 9 layers of spatio-temporal GCN. To determine the number of layers of the Channel topology optimization module (CTG) and the Spatial-Transformer module (TFR) in the Channel-Transformer network structure, this experiment set different numbers of layers for the CTG and TFR modules and compared the different layer combinations. To ensure fairness of comparison, the multi-scale temporal convolution module (TCN for short) was uniformly adopted in the time dimension. As shown in Table 2, the combined design of the two different sub-modules, CTG and TFR, achieves higher classification accuracy than simple stacking of the same convolution module, and the combination of a 4-layer CTG structure and a 5-layer TFR structure performs best; therefore, the model Channel-Transformer of the present invention adopts a 9-layer (4-layer CTG + 5-layer TFR) structure. The loss function curve of each combination is shown in Fig. 8, from which it can also be seen that the 4-layer CTG + 5-layer TFR structure converges fastest and has the best convergence performance.
TABLE 2
[Table 2 appears as an image in the original document: classification accuracy of the different CTG/TFR layer combinations.]
2.2 effectiveness of the fusion of CTG and TFR
In this experiment, the Spatial-Transformer sub-module is introduced into two conventional graph convolution structures (namely fixed-GCN and adaptive-GCN). For a fair performance comparison, the three methods are divided with the same layer sequence: the first four layers are graph convolution structures with different node connection modes (fixed-GCN, adaptive-GCN and CTG), and the last five layers use the Spatial-Transformer sub-module to distribute the weights of different nodes. The classification accuracy results for animal behavior are shown in Table 3. The animal behavior recognition performance of the proposed fusion of CTG and TFR is far better than the combinations of the other two types of GCN with TFR. This is because the proposed model Channel-Transformer fuses the adjacency matrix with the feature map of the channel-level topology, helping to exploit the potential correlation between animal joints under different channels. As shown in lines 4 to 6 of Table 3, introducing the TFR sub-module improves the performance of every network. Thus, this experiment further verifies that the TFR sub-module can exploit the self-attention mechanism to explore the importance of different joint connections.
TABLE 3
[Table 3 appears as an image in the original document: animal behavior classification accuracy of fixed-GCN, adaptive-GCN and CTG, each combined with the TFR sub-module.]
2.3 validity of MS-TCN sub-Module
To evaluate the multi-scale convolution module on the time scale, the experiment also replaced it with a one-dimensional temporal convolution module. As shown in Fig. 9, a comparison of the confusion matrices of the two schemes shows that the multi-scale temporal convolution module outperforms the one-dimensional temporal convolution module.
2.4 comparison with the prior art
Due to the lack of state-of-the-art skeleton-based animal behavior recognition methods, the present invention compares the proposed Channel-Transformer model with existing skeleton-based human action models. From Table 4 we can see that the Channel-Transformer model of the present invention outperforms the state-of-the-art human action recognition methods under the same evaluation settings.
TABLE 4
[Table 4 appears as an image in the original document: accuracy and parameter counts of the Channel-Transformer and existing skeleton-based action recognition models.]
From Table 4 it can be seen that although MS-G3D, currently the best-performing method, reaches an accuracy of 91.11%, its parameter count is as high as 3.01M. The accuracy of the proposed method reaches 93.21% with only 1.33M model parameters, which demonstrates the advancement of the method.
2.5 The experimental results show the effectiveness of the network structure and the technical contribution of the model components, and prove that the performance of the model Channel-Transformer designed for animal behaviors is superior to that of the state-of-the-art methods.
3. Conclusion
The invention discloses a novel graph convolution network (Channel-Transformer) based on a channel-level topology and a Transformer attention mechanism, used for skeleton-based animal behavior identification, which forms an adaptively learned topology through channel correlation modeling. To further focus on important joint connections, the invention employs the Spatial-Transformer to assign different weights to the different topological connections of different behaviors, which helps to allocate resources efficiently. To improve the expression of inter-frame topological dependence, multi-scale temporal convolution is applied to improve the performance of the model. The experimental results show that the Channel-Transformer has stronger representation capability than other graph convolution models.

Claims (6)

1. A quadruped animal behavior identification method based on an adaptive space-time graph attention Transformer network, comprising the following steps:
1) Collecting a quadruped animal video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a space-time skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
the method is characterized in that an adaptive space-time graph attention Transformer network is constructed to serve as the animal behavior recognition model to recognize different behaviors of quadruped animals; the method for constructing the adaptive space-time graph attention Transformer network comprises the following steps:
3) Designing a space dimension module of the space-time graph convolutional network: firstly, a Channel topology optimization module is utilized to adaptively learn the skeleton topologies of different behaviors of animals, the physical topology of joint connection is combined with the potential topology to generate Channel-wise skeleton topology, and then different connection weights are distributed to the Channel-wise skeleton topology through a Spatial-Transformer module;
4) Designing a time dimension module of the space-time graph convolutional network: in order to obtain the time sequence information of the long video sequence, a multi-scale time convolution module is used for modeling the interframe topological correlation on a time domain;
5) Constructing a 9-layer spatio-temporal graph attention Transformer convolution network model using the space dimension network module and the time dimension network module designed in the steps 3) and 4), constructing a classifier through a fully connected layer and Softmax, and identifying 5 behaviors of the quadruped: chasing, resting, eating, alert and walking;
in step 3):
3.1 Utilizing a Channel topology optimization module to combine the physical topology and the potential topology of the joint connection to generate a Channel-wise framework topology, and capturing all possible connection modes between joint points of animal behaviors;
3.2) Using an attention mechanism based on the Spatial-Transformer module to calculate a connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of the joint connections of the topology and simulating an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer module performs linear transformation and normalization processing on the Channel-wise skeleton topology obtained in the step 3.1), converting the information of each joint point into the vector representation required by the Transformer attention; the generated vectors comprise a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, calculating the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention of the pair of joint points;
3.2.3) For each joint point of a skeleton topology, carrying out weighted summation with the attention weights, obtained in the step 3.2.2), between that joint point and the other joint points of the skeleton topology, to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer module performs H different learnable mappings, i.e., the steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention z_i^(1), ..., z_i^(H) of the i-th joint point, thereby mapping the Channel-wise skeleton topology into a plurality of subspaces;
3.2.5) The multi-head attentions z_i^(1), ..., z_i^(H) are spliced, and the spliced matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, wherein the output matrix contains the information of all the attention heads;
in the step 4):
the method is characterized in that a multi-scale time convolution MS-TCN is used for modeling interframe framework topological correlation on a time domain and extracting animal time dimension behavior characteristics, and the method comprises the following steps:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the sampling window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are selected at intervals of d−1 frames to calculate the convolution, wherein d denotes the dilation rate and the convolution kernel size is 5 × 1; a 1 × 1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Using the core idea of the ResNet model, a residual connection (1 × 1 convolution) is used in the time domain, which allows the original information of a lower layer to be passed directly to subsequent higher layers.
2. The quadruped animal behavior recognition method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in the step 2), the extraction of the quadruped animal skeleton data is realized by the DeepLabCut algorithm based on pose estimation, the steps comprising:
2.1 For an animal video sequence, tracking the positions of body joint points of the quadruped to construct an animal skeleton diagram;
2.1.1 Manually marking the joint positions of a few frame images in a video sequence through a graphical user interface;
2.1.2) Using animal images as input, training and iterating a deep neural network to generate as output a group of confidence maps describing the position of each joint point in the input animal images, and predicting the joint positions of all frames; the output is manually fine-tuned, and the deep neural network is trained again;
2.1.3 Applying the deep neural network trained in the previous step to a new animal video sequence, and rapidly predicting joint position coordinates and confidence values of the animals in the video;
2.2 A probability threshold is defined, and only when the confidence values of the predicted positions of all the joint points of the animal in one frame image are larger than or equal to the threshold, the frame image is used for a subsequent animal behavior recognition algorithm.
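The confidence filtering of step 2.2) can be sketched as follows. This is illustrative Python; the threshold value and function names are assumptions, not taken from the patent.

```python
def keep_frame(confidences, threshold=0.6):
    """Step 2.2 sketch: a frame enters the recognition pipeline only if the
    confidence of every predicted joint position meets the threshold.
    (The value 0.6 is illustrative; the patent does not fix a number.)"""
    return all(c >= threshold for c in confidences)

def filter_sequence(frames_conf, threshold=0.6):
    """Return the indices of frames whose joints are all confidently predicted."""
    return [i for i, conf in enumerate(frames_conf) if keep_frame(conf, threshold)]
```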
3. The quadruped animal behavior recognition method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in the step 3.1), the physical topology of the joint connections and the potential topology are combined by using the Channel topology optimization module to generate the Channel-wise skeleton topology, and all possible connection modes between joint points of animal behaviors are captured, the steps comprising:
3.1.1) Converting the animal skeleton characteristic X obtained in the step 2) into a high-order characteristic X̃ through the linear transformation of formula (1), which serves as the input to the entire model:
X̃ = XW (1)
wherein X ∈ R^(N×T×C) and X̃ ∈ R^(N×T×C′); N is the number of joint points of the animal skeleton, C is the number of input channels, W ∈ R^(C×C′) is a weight matrix, and T is the length of the video frame sequence used for identifying the animal behaviors;
3.1.2) Using the joint connections of the two-dimensional animal skeletons obtained in the step 2), a corresponding adjacency matrix A ∈ R^(N×N) is generated for each animal skeleton; the adjacency relation between animal skeleton nodes is expressed by the adjacency matrix A to realize the representation of the graph; the adjacency matrix A is applied as a Shared topology to all channels of the channel topology optimization network; a joint connection of the two-dimensional animal skeleton is a bone representing the connection of two joint points of the animal body;
3.1.3 Modeling potential correlation between joint points of Shared topologies of different channels by using a modeling function, and obtaining Channel-specific topologies through linear transformation;
3.1.4 Aggregating the Shared topology characteristic A and the Channel-specific topology characteristic Q, and generating a Channel-wise framework topological characteristic by using a formula (2);
R=A+α·Q (2)
wherein α is a trainable parameter used to adjust the strength between the Shared topology A and the Channel-specific topology Q; through formula (2), the Shared topology feature A is added to each channel of α·Q.
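Formula (2) above can be illustrated as follows. This is a NumPy sketch; the array shapes are assumed, with Q carrying one N × N topology per channel.

```python
import numpy as np

def channel_wise_topology(A, Q, alpha):
    """Formula (2) sketch: R = A + alpha * Q, broadcasting the shared physical
    topology A (N x N) onto every channel-specific topology in Q (C x N x N)."""
    return A[None, :, :] + alpha * Q   # A is added to each channel of alpha*Q
```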
4. The quadruped animal behavior recognition method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in the step 3.2), the attention mechanism based on the Spatial-Transformer module is adopted to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology, the steps comprising:
in step 3.2.2), given the animal skeleton topology of the t-th frame of the animal video behavior sequence, let the query vector of the i-th query joint point be q_i^t, the key vectors of the remaining queried joint points of the animal skeleton topology of the t-th frame be k_j^t, and the corresponding value vectors be v_j^t; the dot product of the query vector q_i^t and the key vector k_j^t is calculated to obtain the correlation strength weight between the pair of joint points as the attention of the pair of joint points; all the weights obtained for the i-th query joint point are then used for a weighted summation, obtaining the new attention z_i^t of the i-th query joint point of the t-th frame, the calculation formula being shown in formula (3):
z_i^t = Σ_j softmax( q_i^t (k_j^t)^T / √d_k ) v_j^t (3)
wherein the superscript T denotes the transpose, and d_k is the dimension of the key vector k;
for a video frame sequence, an attention feature vector is obtained;
in step 3.2.5), the spliced matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer module, the calculation formula being shown in formula (4):
MultiHead(Q, K, V) = Concat(head_1, ..., head_H) W_o (4)
wherein Q, K, V are respectively the query vector matrix, the key vector matrix and the value vector matrix, and W_o is a weight matrix.
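Formulas (3) and (4) above can be illustrated as follows. This is a NumPy sketch of scaled dot-product attention over the joints of one frame; the shapes, and using per-head outputs directly as the heads to be concatenated, are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_attention(q, K, V):
    """Formula (3) sketch: new feature z_i of one query joint as the softmax-
    weighted sum of value vectors, weights from scaled dot products q . k_j.
    q: (d_k,); K: (N, d_k) keys of all joints; V: (N, d_v) values."""
    d_k = K.shape[1]
    w = softmax(q @ K.T / np.sqrt(d_k))   # (N,) attention over all joints
    return w @ V                          # (d_v,)

def multi_head(head_outputs, Wo):
    """Formula (4) sketch: concatenate the H head outputs, project by W_o."""
    return np.concatenate(head_outputs, axis=-1) @ Wo
```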
5. The quadruped animal behavior recognition method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in the step 5), the 9-layer space-time graph convolution network model is constructed, the steps comprising:
applying the space dimension network module and the time dimension network module designed in the steps 3) and 4) to construct a 9-layer space-time graph convolution network model, constructing a classifier through a full connection layer and Softmax, and identifying the behavior of the quadruped animal, wherein the network structure is as follows:
5.1) The first 4 layers of the space-time graph convolutional network model are the channel topology optimization convolutional networks of the space dimension network module in the step 3.1), and the multi-scale temporal convolution MS-TCN of the step 4) is applied in the time dimension of each channel topology optimization convolution layer;
5.2) The last 5 layers of the space-time graph convolutional network model are the Spatial-Transformer modules of the space dimension network module in the step 3.2), and the multi-scale temporal convolution MS-TCN of the step 4) is applied in the time dimension of each Spatial-Transformer module layer;
5.3) Through the fully connected layer FCN and Softmax, a classifier is constructed to recognize the 5 behaviors of chasing, resting, eating, alert and walking of medium and large quadrupeds.
6. The quadruped animal behavior recognition method based on the adaptive space-time graph attention Transformer network as claimed in claim 1 or 3, wherein in the step 3.1.3), a modeling function is used to model the potential correlation between the joint points of the Shared topology of different channels, and the Channel-specific skeleton topology is obtained through linear transformation, the steps comprising:
3.1.3.1) Modeling the potential correlations between the joint points of the Shared topology for different channels by using a modeling function, the formula of which is as follows (5):
q_ij = σ( ψ(x_i) − φ(x_j) ) (5)
wherein x_i, x_j ∈ X, and (v_i, v_j) are the node pairs of the physical topology of the animal skeleton obtained in the step 2) (i.e., the joint points of the animal body); ψ and φ are linear transformations used to reduce the feature dimension before the correlation modeling function and reduce the computational complexity; σ(·) is an activation function of the neural network;
3.1.3.2) Applying formula (6) to perform a linear transformation on the joint point potential correlation characteristics obtained in the step 3.1.3.1), so as to raise the channel dimension and obtain the Channel-specific skeleton topology Q:
Q = ξ(q) (6)
wherein ξ denotes the linear transformation raising the channel dimension, and q_ij ∈ Q (1 ≤ i ≤ N, 1 ≤ j ≤ N) is a vector of the Channel-specific skeleton topology Q, reflecting the Channel-specific topological relation between the node pair (v_i, v_j).
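Formulas (5) and (6) above can be illustrated as follows. This is a NumPy sketch: tanh is an assumed choice for the activation σ, and the reduced dimension r and the array shapes are illustrative, not taken from the patent.

```python
import numpy as np

def channel_correlation(X, psi_W, phi_W):
    """Formula (5) sketch: q_ij = sigma(psi(x_i) - phi(x_j)) for every joint
    pair, with tanh standing in for the unspecified activation sigma.
    X: (N, C) joint features; psi_W, phi_W: (C, r) dimension-reducing maps."""
    P = X @ psi_W                      # psi: reduced features of queries
    F = X @ phi_W                      # phi: reduced features of keys
    return np.tanh(P[:, None, :] - F[None, :, :])    # (N, N, r)

def channel_specific_topology(q, xi_W):
    """Formula (6) sketch: a linear map raises the reduced dimension r back
    to the channel dimension, yielding Q of shape (N, N, C)."""
    return q @ xi_W
```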
CN202211588021.0A 2022-12-12 2022-12-12 Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network Active CN115797841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211588021.0A CN115797841B (en) 2022-12-12 2022-12-12 Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network


Publications (2)

Publication Number Publication Date
CN115797841A true CN115797841A (en) 2023-03-14
CN115797841B CN115797841B (en) 2023-08-18

Family

ID=85418596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211588021.0A Active CN115797841B (en) 2022-12-12 2022-12-12 Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network

Country Status (1)

Country Link
CN (1) CN115797841B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
WO2022000420A1 (en) * 2020-07-02 2022-01-06 浙江大学 Human body action recognition method, human body action recognition system, and device
CN114596632A (en) * 2022-03-02 2022-06-07 南京林业大学 Medium-large quadruped animal behavior identification method based on architecture search graph convolution network
CN114821799A (en) * 2022-05-10 2022-07-29 清华大学 Motion recognition method, device and equipment based on space-time graph convolutional network


Also Published As

Publication number Publication date
CN115797841B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
Yang et al. Unik: A unified framework for real-world skeleton-based action recognition
Obinata et al. Temporal extension module for skeleton-based action recognition
KR20200068545A (en) System and method for training a convolutional neural network and classifying an action performed by a subject in a video using the trained convolutional neural network
Yang et al. SenseFi: A library and benchmark on deep-learning-empowered WiFi human sensing
CN113033276B (en) Behavior recognition method based on conversion module
CN113191230A (en) Gait recognition method based on gait space-time characteristic decomposition
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
Purwanto et al. Extreme low resolution action recognition with spatial-temporal multi-head self-attention and knowledge distillation
CN114973097A (en) Method, device, equipment and storage medium for recognizing abnormal behaviors in electric power machine room
CN114724185A (en) Light-weight multi-person posture tracking method
CN116386141A (en) Multi-stage human motion capturing method, device and medium based on monocular video
Shrivastava et al. Diverse video generation using a Gaussian process trigger
Degardin et al. REGINA—Reasoning graph convolutional networks in human action recognition
Nasir et al. ENGA: Elastic net-based genetic algorithm for human action recognition
Yang et al. Via: View-invariant skeleton action representation learning via motion retargeting
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
Tang et al. Pose guided global and local gan for appearance preserving human video prediction
Ahmed et al. Two person interaction recognition based on effective hybrid learning
CN115797841B (en) Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network
El-Assal et al. 2D versus 3D convolutional spiking neural networks trained with unsupervised STDP for human action recognition
Zhao et al. Research on human behavior recognition in video based on 3DCCA
CN112613405B (en) Method for recognizing actions at any visual angle
KR20230017126A (en) Action recognition system based on deep learning and the method thereof
Wu et al. Hierarchical learning approach for one-shot action imitation in humanoid robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant