CN115797841A - Quadruped animal behavior identification method based on adaptive space-time diagram attention Transformer network - Google Patents
Quadruped animal behavior identification method based on adaptive space-time diagram attention Transformer network
- Publication number
- CN115797841A (application CN202211588021.0A)
- Authority
- CN
- China
- Prior art keywords
- topology
- animal
- channel
- attention
- skeleton
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A40/00—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
- Y02A40/70—Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in livestock or poultry
Abstract
A quadruped animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network comprises the following steps: 1) Collecting video images of quadruped animals, and marking joint points with DeepLabCut to generate a skeleton topology graph; 2) Combining the physical topology of the joint connections with the latent topology using a Channel topology optimization module to generate a Channel-wise skeleton topology, then assigning different connection weights to the Channel-wise skeleton topology through a Spatial-Transformer module; 3) Modeling the inter-frame topological correlation with a multi-scale temporal convolution module; 4) Identifying, through a fully connected layer and a Softmax classification layer, 5 daily behaviors of the animals: chasing, resting, eating, alertness and walking. In the invention, the spatial feature extraction module captures all possible connection patterns among the joint points of animal behaviors and assigns different connection weights; the temporal feature extraction module captures the change of animal posture over the consecutive frames of a behavior, thereby improving the accuracy of animal behavior recognition; the method therefore has good application prospects.
Description
Technical Field
The invention relates to a behavior recognition method, and in particular to a spatio-temporal graph convolution network model constructed from an attention Transformer and an adaptive skeleton graph topology, and to a method of applying it to recognize skeleton-based behaviors of quadruped animals.
Background
In the prior art, behavior recognition means automatically identifying the behavior of a research object from static images or video sequences using key technologies such as moving-object detection, feature extraction and posture analysis.
Medium and large ungulates are an important component of the ecosystem, and identifying their behavior is an important part of animal research, protection and management. Automated imaging systems have made the acquisition of animal images increasingly convenient in the field of animal monitoring, but they also produce a large amount of invalid information. The limited capacity for manual screening severely limits the effective use of these data in animal research, protection and management. Vision-based animal behavior recognition algorithms therefore provide an automated way to obtain useful animal information for behavior recognition studies.
In recent years, research on animal behavior recognition has focused on laboratory animals (e.g., mice, fruit flies) or farmed animals (e.g., pigs, cattle), mainly classifying behaviors from RGB images with machine learning or deep learning algorithms. Norouzzadeh applied 9 different deep neural network models to identify simple behaviors of wild animals in the Snapshot Serengeti (SS) dataset [1]. However, animal behavior is a dynamic process, and the expressive power of a single image is limited. Frank Schinder [2] trained three ResNet variants on 8-frame RGB video sequences to distinguish feeding, moving and gazing behaviors, but with poor recognition accuracy.
In prior-art research on human behavior recognition, recognition has been carried out on the basis of skeleton features, for example:
researchers have proposed the spatio-temporal graph convolutional network ST-GCN, which uses a fixed graph that conforms to the natural connectivity of the skeleton but has no flexibility: the topology is the same for all layers, all channels and all actions.
To better accommodate different action expressions, other researchers have proposed a dynamic graph convolutional neural network that learns the correlation between each joint pair from the contextual features of the joints.
The introduction of attention mechanisms lets model resources focus on the factors that matter most for different actions, and researchers have since used the Transformer self-attention mechanism to reconstruct the spatio-temporal dependencies of the human skeleton. The expression of animal behavior is likewise closely related to skeleton topology. However, the image backgrounds of existing human behavior recognition datasets are relatively simple, usually shot with the camera facing the person, whereas the background of animal monitoring video is very complex: sometimes the animal's coat color is so similar to the environment that the two are hard to distinguish, which can interfere with animal behavior recognition.
The prior art, as disclosed in publication No. CN114596632A, "Method for identifying behaviors of medium and large quadrupeds based on architecture-search graph convolution network" (202210204633.9), includes the following steps. First, based on the extraction of animal skeleton behavior features, the positions of the joint points of the animal's body parts are quickly tracked in the animal video with DeepLabCut to form a spatio-temporal skeleton graph, capturing the spatio-temporal features of different animal behaviors. Then, several spatio-temporal graph convolution operation modules based on animal skeletons are designed to form a graph-based search space, into which residual connections, bottleneck structures and various attention mechanisms are fused. Finally, based on a differentiable architecture-search strategy, the search space is made continuous, so that a low-cost spatio-temporal graph convolution model for identifying the behaviors of medium and large quadrupeds is searched automatically, achieving the goal of distinguishing the daily behaviors of animals. That technology mainly aims to search automatically for the optimal spatio-temporal graph convolution network structure for different animal behavior identification tasks by introducing and designing several effective attention graph-structure modules. In practice, however, one behavior of an animal may take different postures, such as eating while standing or eating while lying, and the same posture may correspond to different behaviors, such as eating while lying or resting while lying, which greatly increases the difficulty of behavior recognition.
To solve these problems, the invention mines the latent associations that may exist between joint points of the animal that are not physically connected, such as the coordinated movement of the limbs in chasing behavior, and dynamically models the latent connection topology of the animal's joints so as to aggregate it with the physical topology, thereby improving the accuracy of animal behavior identification.
Disclosure of Invention
Research shows that the background of animal monitoring video is very complex, and sometimes the animal's coat color is so similar to the environment that the two are hard to distinguish, which interferes with animal behavior identification. Meanwhile, one behavior of an animal can take different postures, such as eating while standing or eating while lying, and the same posture can correspond to different behaviors, such as eating while lying or resting while lying, which greatly increases the difficulty of behavior recognition.
In order to solve these problems and further improve the accuracy of behavior identification, the method dynamically models the latent connection topology of the animal joints, aggregates it with the physical topology to regenerate a new animal skeleton topology, and models the correlation of the skeleton topology between consecutive frames of a video sequence with a multi-scale temporal convolution module.
Based on this idea, the invention provides an animal behavior identification method based on an adaptive spatio-temporal graph attention Transformer network. For multi-species, multi-class animal behavior identification data, a spatial feature extraction module captures the latent connection patterns among the joint points of animal behaviors and assigns different connection weights; a temporal feature extraction module captures the change of animal posture over the consecutive frames of a behavior; a 9-layer spatio-temporal graph attention Transformer convolution network model is then constructed to identify the different behaviors of the animal.
The network input of the invention is a spatio-temporal graph sequence based on the animal skeleton; compared with image recognition, only the most essential behavior features of the animal are kept, greatly reducing the interference of redundant information on the model. The channel topology optimization and Spatial-Transformer modules make it easier to capture the similar features of the latent connection topology between joint points of the same animal behavior under different postures, such as the head movement during eating; the multi-scale temporal convolution improves the expressiveness of behavior features over longer time sequences.
The invention discloses a quadruped animal behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network, which specifically comprises the following steps:
1) Collecting a quadruped video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a spatiotemporal skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
the method for constructing the adaptive space-time diagram attention Transformer network comprises the following steps:
designing the space dimension of the spatio-temporal graph convolution network: first, a Channel topology optimization module adaptively learns the skeleton topologies of different animal behaviors, combining the physical topology of the joint connections with the latent topology to generate a Channel-wise skeleton topology; then, a Spatial-Transformer module assigns different connection weights to the Channel-wise skeleton topology, in the following steps:
3.1) Utilizing a Channel topology optimization module to combine the physical topology and the latent topology of the joint connections to generate the Channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.2) Adopting the attention mechanism of a Spatial-Transformer module to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of each joint connection and modeling an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer network performs linear transformation and normalization on the Channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, calculating the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention of that pair of joint points;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points obtained in step 3.2.2) are weighted and summed to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer network performs H different learnable mappings, i.e., steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, thereby mapping the Channel-wise animal skeleton topology into multiple subspaces;
3.2.5) Splicing the multi-head attention outputs, and multiplying the spliced matrix by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
4) Designing the time dimension of the spatio-temporal graph convolutional network: to capture the timing information of long video sequences, a multi-scale temporal convolution module models the inter-frame topological correlation in the time domain, the steps comprising:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are sampled with d-1 frames skipped between taps (d denotes the dilation rate) for the convolution; the convolution kernel size is 5 x 1, and a 1 x 1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Borrowing the core idea of the ResNet model, residual connections (1 x 1 convolutions) are used in the time domain, allowing the original information of a lower layer to be passed directly to subsequent higher layers.
5) Constructing a 9-layer spatio-temporal graph attention Transformer network model from the space dimension and time dimension network modules designed in steps 3) and 4), building a classifier through a fully connected layer and Softmax, and identifying 5 behaviors of the quadruped: chasing, resting, eating, alertness and walking.
The beneficial effects of the invention mainly comprise:
(1) A brand-new skeleton-based multi-species behavior recognition model is provided. The two-dimensional skeleton coordinates are used as the input of the (spatio-temporal graph convolution) network, eliminating the interference of redundant information such as the external environment and greatly reducing the data volume. The model adopts a self-attention Transformer mechanism over the multi-channel spatial topology, and multi-scale convolution over the temporal topology.
(2) The spatio-temporal graph convolutional network model (i.e., the behavior recognition model) is not limited to the topology of the animal's physical skeleton; it dynamically models the latent connection topology of the animal joints, aggregates it with the physical topology, and establishes a corresponding channel topology. This makes it easier to capture the similar features of the latent connection topology between joint points of the same animal behavior in different postures, such as the similar head movement when an animal eats while standing versus while lying, two different postures of the same behavior.
(3) The behavior recognition model adopts multi-scale temporal convolution to model the inter-frame topological correlation of the time domain at different granularities, capturing the change of animal posture over the consecutive frames of a behavior. This helps distinguish the posture-change differences of different behaviors along the temporal dimension, such as the difference in head movement between resting while lying and eating while lying.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram of the adaptive spatio-temporal graph attention Transformer network structure applied to animal behavior recognition according to the present invention;
FIG. 3 is a process of channel level animal potential topology acquisition;
FIG. 4 is a schematic diagram of a Spatial-Transformer's self-attention computation;
FIG. 5 is a schematic diagram of a multi-scale time convolution;
FIG. 6 is a multi-scale convolution acquisition process on a time scale;
FIG. 7 is an exemplary graph of a data set;
FIG. 8 is a graph of loss for different combinations of layers;
fig. 9 is a confusion matrix for a comparison experiment of convolution with different time series dimensions, wherein fig. 9 (a) is a confusion matrix using multi-scale time convolution, and fig. 9 (b) is a confusion matrix using one-dimensional scale time convolution.
Detailed Description
An animal may take different postures for one behavior, such as eating while standing or eating while lying, and the same posture may correspond to different behaviors, which makes the prior art less effective at these complex animal behavior recognition tasks. To solve these problems, the invention provides a quadruped animal behavior recognition method based on an adaptive spatio-temporal graph attention Transformer network: a Channel topology optimization module combines the physical topology of the joint connections with the latent topology to generate a Channel-wise skeleton topology, and a Spatial-Transformer module assigns different connection weights to the Channel-wise skeleton topology; a multi-scale temporal convolution module then models the inter-frame topological correlation, and 5 daily behaviors of the animal are identified: chasing, resting, eating, alertness and walking.
As shown in fig. 1 and fig. 2, the invention designs a space dimension network module and a time dimension network module, based on which a 9-layer spatio-temporal graph attention Transformer network model is constructed and applied to the behavior recognition of medium and large ungulates; the method specifically comprises the following steps:
1) Collecting a quadruped video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a space-time skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
2.1 For an animal video sequence, tracking the positions of body joint points of the quadruped to construct an animal skeleton diagram;
2.1.1 Manually marking the joint positions of a few frames of images in a video sequence through a graphical user interface;
2.1.2 Using animal images as input, training and iterating through a deep neural network, generating a group of confidence maps describing the position of each joint point in the input animal images as output, and predicting the joint positions of all frames; the output is subjected to manual fine tuning, and the deep neural network is trained again;
2.1.3 Applying the deep neural network trained in the previous step to a new animal video sequence, and rapidly predicting joint position coordinates and confidence values of the animals in the video;
2.2 A probability threshold is defined, and only when the confidence values of the predicted positions of all the joint points of the animal in one frame image are larger than or equal to the threshold, the frame image is used for a subsequent animal behavior recognition algorithm.
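The confidence filtering of step 2.2) can be sketched as follows. This is a minimal NumPy illustration, not code from the patent; the threshold value 0.6 and the array shapes are hypothetical, since the patent leaves the exact probability threshold unspecified.

```python
import numpy as np

def filter_frames(confidences, threshold=0.6):
    """Keep only frames whose joint-point confidences all reach the threshold.

    confidences: (T, N) array -- T frames, N joint points, values in [0, 1].
    Returns the indices of frames usable for behavior recognition.
    """
    confidences = np.asarray(confidences)
    keep = np.all(confidences >= threshold, axis=1)  # a frame passes only if every joint does
    return np.flatnonzero(keep)

# Example: 4 frames, 3 joints; frames 1 and 3 each have one low-confidence joint.
conf = np.array([[0.9, 0.8, 0.7],
                 [0.9, 0.3, 0.8],
                 [0.7, 0.7, 0.7],
                 [0.5, 0.9, 0.9]])
print(filter_frames(conf, 0.6))  # frames 0 and 2 survive
```

A stricter threshold discards more frames, trading data volume for pose reliability.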
The method for constructing the adaptive space-time diagram attention Transformer network comprises the following steps:
3) Designing the space dimension of the spatio-temporal graph convolution network: as shown in fig. 2, a Channel topology optimization module first adaptively learns the skeleton topologies of different animal behaviors, combining the physical topology of the joint connections with the latent topology to generate a Channel-wise skeleton topology; a Spatial-Transformer module then assigns different connection weights to the Channel-wise skeleton topology;
3.1) Utilizing a Channel topology optimization module to combine the physical topology and the latent topology of the joint connections to generate a Channel-wise skeleton topology, capturing all possible connection patterns between the joint points of animal behaviors;
3.1.1) Converting the animal skeleton feature X obtained in step 2) into a higher-order feature X' through a linear transformation, as in formula (1), and using it as the input of the whole model:
X' = W X (1)
where X ∈ R^{N×C×T}, N is the number of animal skeleton joint points, C is the number of input channels, W is a weight matrix, and T is the length of the video frame sequence used for animal behavior identification;
3.1.2) Using the bone connections of the two-dimensional animal skeleton obtained in step 2) (i.e., the bones connecting pairs of joint points of the animal body), a corresponding adjacency matrix A ∈ R^{N×N} is generated for each animal skeleton; the adjacency matrix A expresses the graph by recording the adjacency relations between the animal skeleton nodes. The adjacency matrix A is applied as a Shared topology to all channels of the channel topology optimization network;
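The construction of the Shared adjacency matrix in step 3.1.2) can be illustrated as follows. This is a hedged NumPy sketch: the 5-joint skeleton fragment and its bone list are invented for the example and are not the patent's joint scheme.

```python
import numpy as np

def skeleton_adjacency(bones, N):
    """Build the shared adjacency matrix A from a list of bones
    (pairs of connected joint indices). The graph is undirected,
    so both A[i, j] and A[j, i] are set."""
    A = np.zeros((N, N))
    for i, j in bones:
        A[i, j] = A[j, i] = 1.0
    return A

# Hypothetical 5-joint fragment: head-neck-spine, with two legs off the spine.
bones = [(0, 1), (1, 2), (2, 3), (2, 4)]
A = skeleton_adjacency(bones, N=5)
print(A[0, 1], A[1, 0], A[0, 3])  # 1.0 1.0 0.0
```

The same matrix A is then shared across all channels before the channel-specific refinement of step 3.1.3).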
3.1.3) Modeling the latent correlation between the joint points of the Shared topology on different channels with a correlation modeling function, and obtaining the Channel-specific topology through a linear transformation, the steps comprising:
First, the latent correlation between the joint points of the Shared topology on different channels is modeled by the correlation function of formula (2):
q_ij = σ(ψ(x_i) − φ(x_j)) (2)
where x_i, x_j ∈ X are the features of the node pair (v_i, v_j) of the animal skeleton physical topology obtained in step 2) (i.e., joint points of the animal body); ψ and φ are linear transformations that reduce the feature dimension before correlation modeling to lower the computational complexity; and σ(·) is a neural network activation function;
Then, the latent joint correlation features obtained above are linearly transformed by formula (3) to lift the channel dimension and obtain the Channel-specific skeleton topology Q:
Q = ξ(q) (3)
where ξ is a linear transformation and each vector q_ij of the Channel-specific skeleton topology Q reflects the Channel-specific topological relation of the node pair (v_i, v_j).
3.1.4) As shown in fig. 3, aggregating the Shared topology feature A and the Channel-specific topology feature Q to generate the Channel-wise skeleton topology feature R using formula (4):
R = A + α·Q (4)
where α is a trainable parameter that adjusts the strength of the Channel-specific topology Q relative to the Shared topology A; through formula (4), the Shared topology feature A is added to each channel of α·Q.
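Formula (4) is a simple broadcast sum over channels. The following NumPy sketch illustrates it with toy shapes; α is shown as a plain scalar rather than a trained parameter, and the 3-joint, 2-channel sizes are invented for the example.

```python
import numpy as np

def channelwise_topology(A, Q, alpha):
    """Aggregate the shared physical topology A with the channel-specific
    topology Q, per formula (4): R = A + alpha * Q.

    A: (N, N) adjacency matrix shared by all channels.
    Q: (C, N, N) channel-specific topology (one refinement per channel).
    alpha: scalar controlling the refinement strength.
    Returns R with shape (C, N, N): one skeleton topology per channel.
    """
    return A[None, :, :] + alpha * Q  # broadcast A onto every channel

# Toy skeleton with 3 joints and 2 channels.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Q = np.ones((2, 3, 3)) * 0.5
R = channelwise_topology(A, Q, alpha=0.1)
print(R.shape)     # (2, 3, 3)
print(R[0, 0, 1])  # 1.0 + 0.1 * 0.5 = 1.05
```

Each channel thus receives its own refined copy of the skeleton topology while still anchored to the physical adjacency A.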
3.2) Using the attention mechanism of the Spatial-Transformer module to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of each joint connection and modeling an adaptive topology for each animal behavior sample, the steps comprising:
3.2.1) The Spatial-Transformer network performs linear transformation and normalization on the Channel-wise animal skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention: a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, calculating the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point as the attention of that pair of joint points;
As shown in FIG. 4, given the animal skeleton topology of the t-th frame of the animal video behavior sequence, let the query vector of the i-th query joint point be q_i^t, and let the key and value vectors of the other joint points of that frame's skeleton topology be k_j^t and v_j^t. The scaled dot product of q_i^t and k_j^t gives the correlation-strength weight u_ij^t = q_i^t (k_j^t)^T / √d_k of the pair, taken as the attention of that pair of joint points; the weights of the i-th query joint point are then normalized and used in a weighted sum over the value vectors to obtain its new attention z_i^t for the t-th frame, as in formula (5):
z_i^t = Σ_j softmax(u_ij^t) · v_j^t (5)
where T denotes the transpose and d_k is the dimension of the key vector k. Repeating this over the video frame sequence yields an attention feature vector for every frame;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points obtained in step 3.2.2) are weighted and summed to obtain the new attention z_i of the i-th joint point of the skeleton topology;
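The per-joint computation of steps 3.2.2)-3.2.3) amounts to scaled dot-product attention. A minimal NumPy sketch follows; the vectors are random toys and the dimensions (5 joints, d_k = 4) are illustrative only, not the patent's configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(q, K, V):
    """Scaled dot-product attention for one query joint, per formula (5).

    q: (d_k,) query vector of the i-th joint point.
    K: (N, d_k) key vectors of the joint points in the frame.
    V: (N, d_v) value vectors.
    Returns z_i, the new attention feature of the i-th joint.
    """
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)  # dot products q·k_j, scaled by sqrt(d_k)
    weights = softmax(scores)      # normalized connection weights
    return weights @ V             # weighted sum over the value vectors

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
z = joint_attention(q, K, V)
print(z.shape)  # (4,)
```

Because the weights form a convex combination, z always lies within the range spanned by the value vectors.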
3.2.4) The multi-head attention of the Spatial-Transformer network performs H different learnable mappings, i.e., steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention of the i-th joint point, thereby mapping the Channel-wise animal skeleton topology into multiple subspaces;
3.2.5) Splicing the multi-head attention outputs, and multiplying the spliced matrix by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads; the calculation is shown in formula (6):
MultiHead(Q, K, V) = Concat(head_1, …, head_H) W_O (6)
where Q, K, V are the query, key and value vector matrices respectively, head_h is the output of the h-th attention head, and W_O is a weight matrix.
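Formula (6) can be illustrated end to end as follows. This is a generic multi-head self-attention sketch in NumPy with invented toy dimensions and random weights, not the patent's trained Spatial-Transformer network.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, H):
    """Multi-head self-attention over the N joint points, per formula (6):
    H heads are computed independently, spliced (concatenated), and
    projected by the output weight matrix Wo.

    X:  (N, d) joint features; Wq/Wk/Wv: (H, d, d_h) per-head projections;
    Wo: (H*d_h, d) output weight matrix.
    """
    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    heads = []
    for h in range(H):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (N, N) attention map
        heads.append(A @ V)                          # (N, d_h) head output
    return np.concatenate(heads, axis=-1) @ Wo       # splice heads, project

rng = np.random.default_rng(1)
N, d, d_h, H = 6, 8, 4, 2
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((H, d, d_h)) for _ in range(3))
Wo = rng.standard_normal((H * d_h, d))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, H)
print(out.shape)  # (6, 8)
```

Each head attends to the joints in a different learned subspace; the final projection mixes the heads back into one feature per joint.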
4) Designing the time dimension of the spatio-temporal graph convolution network: to capture the timing information of long video sequences, a multi-scale temporal convolution module models the inter-frame topological correlation in the time domain, as shown in figs. 5 and 6, the steps comprising:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are sampled at intervals of d−1 frames for the convolution, where d denotes the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Borrowing the core idea of the ResNet model, residual connections (1×1 convolutions) are used in the time domain, which allows the original information of a lower layer to be passed directly to subsequent higher layers.
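The effect of the dilation rate d in steps 4.1)-4.2) can be sketched with a one-dimensional dilated convolution over a toy per-frame scalar feature (the patented module operates on full skeleton feature maps; this sketch only shows how d changes the temporal window): a kernel of size k with dilation d spans (k−1)·d+1 frames.

```python
def dilated_conv1d(seq, kernel, d):
    """1-D dilated convolution: kernel taps are spaced d frames apart ('valid' padding)."""
    k = len(kernel)
    span = (k - 1) * d + 1                     # temporal receptive field
    return [sum(kernel[j] * seq[t + j * d] for j in range(k))
            for t in range(len(seq) - span + 1)]

seq = [float(t) for t in range(12)]            # toy per-frame feature
identity = [0.0, 1.0, 0.0]                     # passes the centre tap through
out_d1 = dilated_conv1d(seq, identity, d=1)    # kernel covers 3 consecutive frames
out_d2 = dilated_conv1d(seq, identity, d=2)    # same kernel now covers a 5-frame window
print(len(out_d1), len(out_d2))  # 10 8
```

Keeping the kernel small while increasing d enlarges the temporal receptive field without adding parameters, which is what lets the module aggregate information at several time scales.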
5) Constructing a 9-layer space-time graph convolution network model from the space dimension and time dimension network modules designed in steps 3) and 4), constructing a classifier through a fully connected layer and Softmax, and recognizing the 5 behaviors of chasing, resting, eating, alerting and walking of the quadruped; the network structure is as follows:
5.1) The first 4 layers of the space-time graph convolutional network model are the channel topology optimization convolutional networks of the space dimension network module in step 3.1), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each channel topology optimization layer;
5.2) The last 5 layers of the spatio-temporal graph convolutional network model are the Spatial-Transformer networks of the space dimension network module in step 3.2), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each Spatial-Transformer layer;
5.3) A classifier is constructed through the fully connected layer FCN and Softmax to recognize the 5 behaviors of chasing, resting, eating, alerting and walking of medium and large quadrupeds.
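The layer plan of steps 5.1)-5.3) can be summarised as a simple configuration list (labels are illustrative, not API names):

```python
# Each layer applies its spatial module plus MS-TCN in the time dimension.
layers = (["CTG + MS-TCN"] * 4      # first 4 layers: channel topology optimization
          + ["TFR + MS-TCN"] * 5)   # last 5 layers: Spatial-Transformer
classifier = ["FCN", "Softmax"]     # classification head
behaviors = ["chasing", "resting", "eating", "alerting", "walking"]

print(len(layers), len(behaviors))  # 9 5
```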
Examples
1. Data set
In this example, a data set of wild terrestrial mammal skeletons was constructed. FIG. 7 shows example images of the five behaviors: chasing, eating, resting, walking and watching (alerting). Dozens of wild animal species are covered, including tigers, lions, leopards, polar bears, black bears, antelopes, alpacas, horses, etc.
To collect the data, clips of the five behaviors were edited from a large number of animal videos. The details of the data set are shown in Table 1; it consists of 67606 images from 2058 video sequences.
TABLE 1
As can be seen from Table 1, for each class of samples, 60% were randomly selected as the training set and 40% as the validation set. Comparison models were trained on the training set and their accuracy was measured on the validation set.
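The per-class 60/40 split described above can be sketched as follows (hypothetical toy labels; the real data set has 2058 sequences over the 5 behavior classes):

```python
import random

def stratified_split(samples, train_frac=0.6, seed=0):
    """Randomly split each class into train/validation with the same ratio."""
    rng = random.Random(seed)
    train, val = [], []
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    for y, xs in by_class.items():
        rng.shuffle(xs)
        cut = int(round(len(xs) * train_frac))
        train += [(x, y) for x in xs[:cut]]
        val += [(x, y) for x in xs[cut:]]
    return train, val

# toy stand-in: 10 sequences per behavior class
data = [(f"seq{i}", c) for c in ["chase", "rest", "eat", "alert", "walk"]
        for i in range(10)]
train, val = stratified_split(data)
print(len(train), len(val))  # 30 20
```

Splitting within each class (rather than over the pooled data) keeps the behavior distribution of the validation set the same as that of the training set.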
2. Comparative experiment
In this section, ablation experiments verify the contribution of each key part of the model to accurate recognition of animal behavior. This example then compares the proposed model (Channel-Transformer) with other state-of-the-art GCN-based methods.
2.1 network framework design comparison
The Channel-Transformer consists of 9 layers of spatio-temporal GCN. To determine the numbers of Channel topology optimization (CTG) layers and Spatial-Transformer (TFR) layers in the Channel-Transformer network structure, this experiment set different layer counts for the CTG and TFR modules and compared the different layer combinations. To ensure fair comparison, the multi-scale temporal convolution module (TCN) was uniformly adopted in the time dimension. As shown in Table 2, the combined design of the two different sub-modules, CTG and TFR, achieves higher classification accuracy than simple stacking of identical convolution modules, and the 4-layer CTG + 5-layer TFR structure performs best; the Channel-Transformer model of the invention therefore adopts a 9-layer (4-layer CTG + 5-layer TFR) structure. The loss curves of the combinations are shown in FIG. 8, from which it can also be seen that the 4-layer CTG + 5-layer TFR structure converges fastest and best.
TABLE 2
2.2 effectiveness of the fusion of CTG and TFR
In this experiment, the Spatial-Transformer sub-module was introduced into two conventional graph convolution structures (fixed-GCN and adaptive-GCN). For a fair comparison, the three methods use the same layer split: the first four layers are graph convolution structures with different node connection modes (fixed-GCN, adaptive-GCN and CTG), and the last five layers use the Spatial-Transformer sub-module to assign weights to different nodes. The animal behavior classification accuracies are shown in Table 3. The recognition performance of the proposed CTG + TFR fusion structure clearly surpasses the combinations of the other two GCN types with TFR, because the proposed Channel-Transformer fuses the adjacency matrix with channel-level topology feature maps, helping to exploit the potential correlations between animal joints under different channels. Introducing the TFR sub-module improves the performance of every network, as shown in rows 4 to 6 of Table 3; this further verifies that the TFR sub-module can exploit the self-attention mechanism to explore the importance of different joint connections.
TABLE 3
2.3 validity of MS-TCN sub-Module
To evaluate the multi-scale temporal convolution module, the experiment also replaced it with a one-dimensional temporal convolution module; as shown in FIG. 9, comparing the confusion matrices of the two schemes shows that the multi-scale temporal convolution module outperforms the one-dimensional one.
2.4 comparison with the prior art
Since state-of-the-art skeleton-based animal behavior recognition methods are lacking, the proposed Channel-Transformer model is compared with existing skeleton-based human action models. Table 4 shows that, under the same evaluation settings, the Channel-Transformer model of the invention outperforms the state-of-the-art human action recognition methods.
TABLE 4
From Table 4 it can be seen that although the accuracy of MS-G3D, currently among the best-performing methods, reaches 91.11%, its parameter count is as high as 3.01M. The accuracy of the proposed method reaches 93.21% with only 1.33M model parameters, demonstrating the superiority of the method.
2.5 The above experimental results show the effectiveness of the network structure and the technical contributions of the model components, and prove that the Channel-Transformer model designed for animal behaviors outperforms the state-of-the-art methods.
3. Conclusion
The invention discloses a novel graph convolution network (Channel-Transformer) based on a channel-level topology and a Transformer attention mechanism for skeleton-based animal behavior recognition, which forms an adaptively learned topology through channel correlation modeling. To further focus on important joint connections, the invention employs a Spatial-Transformer to assign different weights to the topological connections of different behaviors, which helps allocate resources efficiently. To improve the expression of inter-frame topological dependencies, multi-scale temporal convolution is applied to improve the performance of the model. The experimental results show that the Channel-Transformer has stronger representation capability than other graph convolution models.
Claims (6)
1. A quadruped animal behavior identification method based on an adaptive space-time graph attention Transformer network, comprising the following steps:
1) Collecting a quadruped animal video under a field environment;
2) Extracting animal joint point information in an animal video sequence to construct a space-time skeleton diagram, and further constructing an animal behavior identification model to identify different behaviors of the quadruped;
the method is characterized in that an adaptive space-time graph attention Transformer network is constructed as the animal behavior recognition model to recognize different behaviors of quadruped animals; the method for constructing the adaptive space-time graph attention Transformer network comprises the following steps:
3) Designing a space dimension module of the space-time graph convolutional network: firstly, a Channel topology optimization module is utilized to adaptively learn the skeleton topologies of different behaviors of animals, the physical topology of joint connection is combined with the potential topology to generate Channel-wise skeleton topology, and then different connection weights are distributed to the Channel-wise skeleton topology through a Spatial-Transformer module;
4) Designing a time dimension module of the space-time graph convolutional network: in order to obtain the time sequence information of the long video sequence, a multi-scale time convolution module is used for modeling the interframe topological correlation on a time domain;
5) Constructing a 9-layer spatiotemporal graph attention Transformer convolution network model from the space dimension and time dimension network modules designed in steps 3) and 4), constructing a classifier through a fully connected layer and Softmax, and recognizing the 5 behaviors of chasing, resting, eating, alerting and walking of the quadruped;
in step 3):
3.1 Utilizing a Channel topology optimization module to combine the physical topology and the potential topology of the joint connection to generate a Channel-wise framework topology, and capturing all possible connection modes between joint points of animal behaviors;
3.2) An attention mechanism based on the Spatial-Transformer module is used to calculate a connection weight between each pair of joint points of the Channel-wise skeleton topology, thereby evaluating the importance of the topology's joint connections and simulating an adaptive topology for each animal behavior sample; the steps comprise:
3.2.1) The Spatial-Transformer module performs linear transformation and normalization on the Channel-wise skeleton topology obtained in step 3.1), converting the information of each joint point into the vector representations required by Transformer attention; the generated vectors comprise a query vector q, a key vector k and a value vector v;
3.2.2) For each pair of joint points of the same skeleton topology, the dot product of the query vector q_i of the i-th joint point and the key vector k_j of the j-th joint point is calculated as the attention of the pair of joint points;
3.2.3) For each joint point of a skeleton topology, the attention weights between that joint point and the other joint points of the skeleton topology obtained in step 3.2.2) are used in a weighted sum to obtain the new attention z_i of the i-th joint point of the skeleton topology;
3.2.4) The multi-head attention of the Spatial-Transformer module performs H different learnable mappings, i.e., steps 3.2.1)-3.2.3) are repeated H times, and the outputs of the H attention heads are aggregated to obtain the multi-head attention head_i of the i-th joint point, thereby mapping the Channel-wise skeleton topology into multiple subspaces;
3.2.5) The multi-head attentions head_1, …, head_H are concatenated, and the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer network, which contains the information of all attention heads;
in the step 4):
the method is characterized in that a multi-scale time convolution MS-TCN is used for modeling interframe framework topological correlation on a time domain and extracting animal time dimension behavior characteristics, and the method comprises the following steps:
4.1) To achieve multi-scale sampling, sampling windows of different lengths are used; the window size is varied by changing the dilation rate d of the dilated convolution;
4.2) For a skeleton sequence, skeleton topologies are sampled at intervals of d−1 frames for the convolution, where d denotes the dilation rate; the convolution kernel size is 5×1, and a 1×1 convolution is introduced before each convolution operation, i.e., a bottleneck design;
4.3) Borrowing the core idea of the ResNet model, residual connections (1×1 convolutions) are used in the time domain, which allows the original information of a lower layer to be passed directly to subsequent higher layers.
2. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 2), the extraction of quadruped skeleton data is realized by the DeepLabCut algorithm based on pose estimation, and the steps comprise:
2.1 For an animal video sequence, tracking the positions of body joint points of the quadruped to construct an animal skeleton diagram;
2.1.1 Manually marking the joint positions of a few frame images in a video sequence through a graphical user interface;
2.1.2 Using animal images as input, training and iterating through a deep neural network, generating a group of confidence maps describing the position of each joint point in the input animal images as output, and predicting the joint positions of all frames; the output is subjected to manual fine tuning, and the deep neural network is trained again;
2.1.3 Applying the deep neural network trained in the previous step to a new animal video sequence, and rapidly predicting joint position coordinates and confidence values of the animals in the video;
2.2 A probability threshold is defined, and only when the confidence values of the predicted positions of all the joint points of the animal in one frame image are larger than or equal to the threshold, the frame image is used for a subsequent animal behavior recognition algorithm.
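The frame-filtering rule of step 2.2) keeps a frame only when every predicted joint confidence reaches the threshold. A minimal sketch, assuming a hypothetical data layout of one list of per-joint confidences per frame (not DeepLabCut's actual output format):

```python
def keep_frames(frames, threshold=0.6):
    """Keep a frame only if all of its joint confidences are >= threshold."""
    return [f for f in frames if all(c >= threshold for c in f["conf"])]

frames = [
    {"id": 0, "conf": [0.9, 0.8, 0.95]},   # all joints confident -> kept
    {"id": 1, "conf": [0.9, 0.4, 0.95]},   # one weak joint -> dropped
    {"id": 2, "conf": [0.7, 0.6, 0.85]},   # meets the threshold exactly -> kept
]
kept = keep_frames(frames)
print([f["id"] for f in kept])  # [0, 2]
```

Requiring all joints (rather than a majority) to pass the threshold trades some data volume for clean skeletons, which matters because a single mislocated joint distorts the whole topology fed to the recognition network.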
3. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 3.1), the Channel topology optimization module is used to combine the physical topology of the joint connections with the potential topology to generate the Channel-wise skeleton topology, capturing all possible connection modes between the joint points of animal behaviors; the steps comprise:
3.1.1) The animal skeleton feature X obtained in step 2) is converted into a high-order feature X′ through the linear transformation of formula (1) and used as the input of the entire model:
X′ = X · W (1)
where X ∈ R^{N×C×T}, N is the number of animal skeleton joint points, C is the number of input channels, W is a weight matrix, and T is the length of the video frame sequence used for animal behavior identification;
3.1.2) Using the joint connections of the animals' two-dimensional skeletons obtained in step 2), a corresponding adjacency matrix A ∈ R^{N×N} is generated for each animal skeleton; the adjacency matrix A expresses the adjacency relations between animal skeleton nodes, realizing the representation of the graph. The adjacency matrix A is applied as a Shared topology to all channels of the channel topology optimization network. A joint connection of an animal's two-dimensional skeleton is a bone, which connects two joint points of the animal body;
3.1.3 Modeling potential correlation between joint points of Shared topologies of different channels by using a modeling function, and obtaining Channel-specific topologies through linear transformation;
3.1.4 Aggregating the Shared topology characteristic A and the Channel-specific topology characteristic Q, and generating a Channel-wise framework topological characteristic by using a formula (2);
R=A+α·Q (2)
where α is a trainable parameter used to adjust the strength between the Shared topology A and the Channel-specific topology Q; through formula (2), the Shared topology A is added to each channel of α·Q.
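The channel-wise refinement of formula (2) — broadcasting the shared adjacency A onto every channel of the scaled channel-specific topology α·Q — can be sketched with numpy (toy sizes; α is a trainable parameter in the patent, fixed to a constant here):

```python
import numpy as np

N, C = 5, 3                          # 5 joints, 3 channels (toy sizes)
A = np.eye(N)                        # shared physical topology (placeholder adjacency)
Q = np.full((C, N, N), 0.1)          # channel-specific topology from correlation modeling
alpha = 0.5                          # trainable strength parameter (fixed for the sketch)

R = A[None, :, :] + alpha * Q        # formula (2): A is broadcast to each channel
print(R.shape)  # (3, 5, 5)
```

Every channel thus shares the physical skeleton connectivity A while keeping its own learned correction α·Q[c], which is what makes the resulting topology "Channel-wise".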
4. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 3.2), an attention mechanism based on the Spatial-Transformer module is adopted to calculate the connection weight between each pair of joint points of the Channel-wise skeleton topology; the steps comprise:
in step 3.2.2), given the animal skeleton topology of the t-th frame of the animal video behavior sequence, let the query vector of the i-th query joint point be q_i^t, and let the key and value vectors of the remaining queried joint points of that frame's skeleton topology be k_j^t and v_j^t; the dot product of the query vector q_i^t and the key vector k_j^t yields the correlation-strength weight between the pair of joint points, which serves as the attention of that pair; the weights obtained for the i-th query joint point are then normalized and used in a weighted sum over the value vectors to obtain the new attention z_i^t of the i-th query joint point of the t-th frame, computed as formula (3):
z_i^t = Σ_j softmax( q_i^t · (k_j^t)^T / √d_k ) · v_j^t (3)
where T denotes the transpose and d_k is the dimension of the key vector k; for a sequence of video frames, an attention feature vector is thus obtained for each frame;
in step 3.2.5), the concatenated matrix is multiplied by a weight matrix to obtain the multi-head attention output matrix of the Spatial-Transformer module, calculated as formula (4):
MultiHead(Q, K, V) = Concat(head_1, …, head_H) · W_O (4)
where Q, K, V are the query, key and value vector matrices respectively, and W_O is a weight matrix.
5. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1, wherein in step 5), the 9-layer space-time graph convolution network model is constructed by the following steps:
applying the space dimension network module and the time dimension network module designed in the steps 3) and 4) to construct a 9-layer space-time graph convolution network model, constructing a classifier through a full connection layer and Softmax, and identifying the behavior of the quadruped animal, wherein the network structure is as follows:
5.1) The first 4 layers of the space-time graph convolutional network model are the channel topology optimization convolutional networks of the space dimension network module in step 3.1), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each channel topology optimization layer;
5.2) The last 5 layers of the spatio-temporal graph convolutional network model are the Spatial-Transformer modules of the space dimension network module in step 3.2), and the multi-scale temporal convolution MS-TCN in step 4) is applied to the time dimension of each Spatial-Transformer layer;
5.3) A classifier is constructed through the fully connected layer FCN and Softmax to recognize the 5 behaviors of chasing, resting, eating, alerting and walking of medium and large quadrupeds.
6. The quadruped animal behavior identification method based on the adaptive space-time graph attention Transformer network as claimed in claim 1 or 3, wherein in step 3.1.3), a modeling function is used to model the potential correlations between the joint points of the Shared topology for different channels, and the Channel-specific skeleton topology is obtained through linear transformation; the steps comprise:
3.1.3.1) The potential correlations between the joint points of the Shared topology are modeled for different channels using a modeling function, given by formula (5):
u_ij = σ( φ(x_i) − ψ(x_j) ) (5)
where x_i, x_j ∈ X and (v_i, v_j) are node pairs of the physical topology of the animal skeleton obtained in step 2) (i.e., joint points of the animal body); φ and ψ are linear transformations used to reduce the feature dimension before the correlation modeling function and lower the computational complexity; and σ(·) is a neural-network activation function;
3.1.3.2) The joint-point potential correlation features obtained in step 3.1.3.1) are linearly transformed by formula (6) to raise the channel dimension and obtain the Channel-specific skeleton topology Q:
Q = ξ(u) (6)
where ξ denotes the linear transformation that raises the channel dimension.
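The correlation-modeling steps 3.1.3.1)-3.1.3.2) can be sketched as follows. This is a loose stand-in with toy dimensions: φ, ψ and the channel-raising transformation are reduced to plain matrix multiplications, and σ is taken as tanh; the patent's learned transformations may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, r, Cp = 5, 8, 2, 3            # joints, in-channels, reduced dim, out-channels
X = rng.standard_normal((N, C))      # per-joint features
W_phi = rng.standard_normal((C, r))  # phi: dimension-reducing linear map
W_psi = rng.standard_normal((C, r))  # psi: dimension-reducing linear map
W_out = rng.standard_normal((r, Cp)) # raises the channel dimension (formula (6))

# formula (5): u_ij = sigma(phi(x_i) - psi(x_j)), with sigma = tanh here
u = np.tanh((X @ W_phi)[:, None, :] - (X @ W_psi)[None, :, :])  # (N, N, r)
Q = u @ W_out                        # (N, N, Cp): channel-specific topology
print(Q.shape)  # (5, 5, 3)
```

Reducing the dimension before the pairwise difference keeps the N×N correlation computation cheap, and the final projection gives one learned topology per output channel.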
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211588021.0A CN115797841B (en) | 2022-12-12 | 2022-12-12 | Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115797841A true CN115797841A (en) | 2023-03-14 |
CN115797841B CN115797841B (en) | 2023-08-18 |
Family
ID=85418596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211588021.0A Active CN115797841B (en) | 2022-12-12 | 2022-12-12 | Quadruped behavior recognition method based on self-adaptive space-time diagram attention transducer network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115797841B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304795A (en) * | 2018-01-29 | 2018-07-20 | 清华大学 | Human skeleton Activity recognition method and device based on deeply study |
CN110796110A (en) * | 2019-11-05 | 2020-02-14 | 西安电子科技大学 | Human behavior identification method and system based on graph convolution network |
WO2022000420A1 (en) * | 2020-07-02 | 2022-01-06 | 浙江大学 | Human body action recognition method, human body action recognition system, and device |
CN114596632A (en) * | 2022-03-02 | 2022-06-07 | 南京林业大学 | Medium-large quadruped animal behavior identification method based on architecture search graph convolution network |
CN114821799A (en) * | 2022-05-10 | 2022-07-29 | 清华大学 | Motion recognition method, device and equipment based on space-time graph convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||