CN115841697A - Motion recognition method based on skeleton and image data fusion - Google Patents

Motion recognition method based on skeleton and image data fusion

Info

Publication number
CN115841697A
Authority
CN
China
Prior art keywords
skeleton
motion information
module
joint
frame
Prior art date
Legal status
Pending
Application number
CN202211137852.6A
Other languages
Chinese (zh)
Inventor
孙妍 (Sun Yan)
沈亦馨 (Shen Yixin)
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202211137852.6A
Publication of CN115841697A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a motion recognition method based on the fusion of skeleton and image data. The skeleton-data branch of the recognition network comprises a coordinate-motion-information-guided sampling module, a multi-scale motion information fusion module and a multi-stream spatio-temporal relative Transformer model; the image-data branch comprises a joint-point-based picture cropping module and a key image block feature extraction model. The action category prediction probabilities produced by the skeleton-data branch and the image-data branch are fused to obtain the final classification prediction probability of the whole model, completing the action recognition process for public safety. The recognition network fully mines skeleton motion information, establishes dependencies between distant joint points and strengthens the recognition of fine-grained actions; fusing local image data with the skeleton data further supplements rich action detail information while avoiding high computational cost.

Description

Action recognition method based on skeleton and image data fusion
Technical Field
Behavior recognition is a technique that analyzes data such as video with specific algorithms to determine the type of action a person is performing. It underpins many applications such as public safety management, human-computer interaction, intelligent elderly care and intelligent healthcare, and has broad application prospects, so research on behavior recognition has both important theoretical significance and practical value. In real scenes, behavior recognition is a very challenging task: it is easily affected by external factors such as illumination, background and shooting angle, and the different ways in which different people perform the same action lead to large variation within each class. Because it is challenging and spans multiple disciplines, behavior recognition is also a research hotspot in the field of computer vision.
Background
Behavior recognition methods based on deep neural networks can be divided, according to the type of input data, into image-based and skeleton-based methods. Image-based behavior recognition methods recognize human actions in video by analyzing RGB image sequences, and fall mainly into the following three schools:
1) Two-stream network models, represented by the Temporal Segment Network (TSN);
2) 3D convolutional neural network models, represented by the three-dimensional convolutional network (Convolutional 3D, C3D);
3) 2D convolutional neural network models, represented by the Temporal Difference Network (TDN).
In recent years, research based on the above schools has been extensive and has achieved state-of-the-art performance. The input of these models is usually an image obtained by scaling and randomly cropping a video frame; although this reduces the image size to some extent, the following defects remain: 1) the reduced size lowers image precision, which affects the model's recognition of subtle actions; 2) although the image size is reduced, the training data are still large in scale, the demand on video memory is high, and the computation latency is long.
Skeleton-based behavior recognition methods recognize human actions by analyzing skeleton sequences. As early as the 1970s, Johansson et al. demonstrated that skeleton data can effectively describe human motion. With the development of human motion estimation technology, such as advanced human pose estimation algorithms and multi-modal sensors, the cost of acquiring skeleton data has fallen. On this basis, researchers have carried out a large number of studies on skeleton-based behavior recognition methods, which are mainly divided into three categories: network models based on the Recurrent Neural Network (RNN), on the Convolutional Neural Network (CNN), and on the Graph Convolutional Network (GCN). RNN-based and CNN-based network models treat the skeleton as a sequence or a pseudo-image, so the topology of the skeleton is destroyed. GCN-based network models extract skeleton features through graph convolution, which preserves the natural structure of the skeleton and has rapidly improved model performance; in recent years, GCN-based network models have become the mainstream method in skeleton-based behavior recognition. Although GCN-based network models achieve state-of-the-art performance, the following drawbacks remain: 1) motion information plays an important role in video classification tasks such as behavior recognition, but existing methods do not fully mine the motion information contained in the skeleton sequence; 2) the receptive field of a graph convolutional network is limited by the size of the convolution kernel, so long-distance connections cannot be established between joint points that are far apart in the skeleton.
Beyond the above deficiencies, image data and skeleton data each have their own limitations. Image data contain rich scene and detail information, but they are easily disturbed by environmental factors such as illumination, their scale is large, and the training time of the related models is long. Skeleton data describe human motion in a more compact way: the data volume is small, the hardware requirements are low, and compared with image data they are less easily disturbed by external factors (such as illumination or occlusion), so their robustness is high. However, skeleton data lack the scene and detail information specific to images, and both kinds of information play an important role in behavior recognition, especially when an action is subtle or depends on the scene. In summary, image data and skeleton data are highly complementary, so fusing an image-based behavior recognition network model with a skeleton-based one is of great research significance.
Disclosure of Invention
In order to solve the problems in the prior art, the present invention aims to overcome the shortcomings of the prior art and provides a method for recognizing actions for public safety. The recognition network model is based on skeleton and image data and, according to the data type, is divided into two branches: a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data. The former extracts skeleton features through a lightweight network, is good at recognizing large-amplitude actions, and plays the main role in the action recognition task; the latter reduces training cost by cropping images, extracts image features from key image blocks, is good at recognizing small-amplitude actions concentrated on the hands and feet, and supplements detail information in the action recognition task.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme:
A behavior recognition method for public safety, characterized in that a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data are respectively established to form a recognition network. The behavior recognition network model based on skeleton data extracts skeleton features through a lightweight network, is used to recognize large-amplitude actions and completes the main action recognition task; its input is a skeleton sequence, which passes in turn through a coordinate-motion-information-guided sampling module, a multi-scale motion information fusion module and a multi-stream spatio-temporal relative Transformer model to obtain an action category prediction probability. The behavior recognition network model based on image data extracts image features from image blocks by a picture cropping method, is used to recognize small-amplitude actions concentrated on the hands and feet, and supplements detail information in the action recognition task; its input is an image sequence, which passes in turn through a joint-point-based picture cropping module and a key image block feature extraction model (KBN) to obtain a supplementary action category prediction probability. The action category prediction probabilities obtained by the skeleton-data model and the image-data model are fused to obtain the final classification prediction probability of the whole model, completing the action recognition process for public safety.
Preferably, in the behavior recognition network model based on skeleton data, the frame sampling module guided by coordinate motion information screens representative skeletons out of the skeleton sequence according to a coordinate motion information measurement index; the multi-scale motion information fusion module fuses the static information of the skeleton with multi-scale motion information and, according to the characteristic that different human actions have different change speeds and durations, sets two different types of motion information, namely solidified motion information and adaptive motion information; the solidified motion information contains two different scales, so that the network adapts to actions with different change speeds, and the adaptive motion information gives the recognition network the ability to recognize actions of different durations; the multi-stream spatio-temporal relative Transformer model establishes long-distance connections for every joint point over the spatio-temporal domain, as follows: in the spatial domain, a skeleton-based spatial topology graph is set up and a spatial relative Transformer module is constructed to establish long-distance dependencies between joint points in the spatial domain; in the temporal domain, a skeleton-sequence-based temporal topology graph is constructed and a temporal relative Transformer module is built to establish long-distance dependencies between joint points in the temporal domain; the spatial and temporal relative modules are then combined into a spatio-temporal relative Transformer model to extract the spatio-temporal features of the skeleton sequence; finally, at least 4 spatio-temporal relative models with different input data are fused by a multi-time-scale framework to construct the multi-stream spatio-temporal relative Transformer model.
Further preferably, the coordinate motion information-guided frame sampling module includes:
1.1 designing indexes for measuring coordinate motion information:
in the skeleton data, joint points are represented by 3D coordinates; the displacement distance of a joint point between two adjacent frames is used as an index of the motion information it contains, the sum of the displacement distances of all joint points in the skeleton is used as an index of the motion information contained in the whole skeleton, and the representativeness of the frame is judged from it; let J_i^t denote the coordinates of the joint point labeled i in the t-th frame and J_i^{t-1} the coordinates of the joint point labeled i in the (t-1)-th frame; the coordinate motion information M_t contained in the t-th frame is given by equation (1):

M_t = Σ_{i=1}^{N} ||J_i^t - J_i^{t-1}||_2    (1)
wherein, N represents the number of joint points contained in a frame;
in order to eliminate the scale expansion effect caused by the difference of the video lengths, the coordinate motion information contained in each frame is normalized, as shown in formula (2):
m_t = M_t / Σ_{k=1}^{T} M_k    (2)
wherein T represents the number of frames contained in the video;
1.2, sampling a video by adopting a cumulative distribution function:
assuming that N frames need to be sampled from a video with a length of T, the specific operations are as follows:
firstly, the skeleton coordinate motion information is accumulated frame by frame to obtain the accumulated coordinate motion information C_t of the t-th frame, computed as in (3):

C_t = Σ_{k=1}^{t} m_k    (3)
according to the accumulated coordinate motion information C_t, the sequence is divided into N segments; one frame is randomly sampled from each of the N segments, and the sampled frames form a new sequence, so that the representative skeletons in the skeleton sequence are screened out by the measurement index.
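By way of illustration, the following is a minimal NumPy sketch of the coordinate-motion-guided sampling described by equations (1)-(3); the function name, array shapes and fallback behaviour for empty segments are assumptions made for this example, not specified in the patent.

```python
import numpy as np

def sample_frames(skeleton, n_samples, rng=None):
    """skeleton: (T, N, 3) array -- T frames, N joint points, 3D coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    # Eq. (1): per-frame motion = summed displacement of all joints w.r.t. the previous frame.
    disp = np.linalg.norm(skeleton[1:] - skeleton[:-1], axis=-1).sum(axis=1)
    motion = np.concatenate([[0.0], disp])          # frame 0 has no predecessor
    # Eq. (2): normalise to remove the effect of video length.
    motion = motion / (motion.sum() + 1e-8)
    # Eq. (3): cumulative coordinate motion information C_t.
    cum = np.cumsum(motion)
    picks = []
    for k in range(n_samples):                      # one random frame per cumulative-motion segment
        lo, hi = k / n_samples, (k + 1) / n_samples
        idx = np.where((cum > lo) & (cum <= hi))[0]
        if idx.size == 0:                           # empty segment: fall back to the nearest frame
            idx = np.array([min(np.searchsorted(cum, hi), len(cum) - 1)])
        picks.append(int(rng.choice(idx)))
    return skeleton[np.sort(picks)]
```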
Further preferably, the multi-scale motion information fusion module includes:
2.1 designing different scale motion information:
T frames are selected by sampling from the original skeleton sequence I_origin = [I_1, …, I_F] and combined, keeping their original order, into a new skeleton sequence I_new = [I_1, …, I_T]; F denotes the total number of frames of the original skeleton sequence, and I denotes the coordinates of all joint points in one frame; motion information is obtained by computing the coordinate displacement of the same joint point between two frames: J_i^t denotes the joint point labeled i in the t-th frame of the original sequence I_origin, and jn_i^t denotes the joint point labeled i in the t-th frame of the sampled sequence I_new;
adaptive motion information M a By the framework sequence I new The motion information of different scales is obtained from videos with different lengths by subtracting the coordinates of the joint points of two continuous frames, and the formula is as follows:
Figure SMS_10
Figure SMS_11
wherein the content of the first and second substances,
Figure SMS_12
representing a novel framework sequence I new Adaptive motion information of the ith frame; />
the solidified motion information is divided into two types: short-range motion information M_s and long-range motion information M_l; the short-range motion information M_s is obtained from the original skeleton sequence I_origin by subtracting the coordinates of skeleton joint points that are 2 frames apart, and is used to capture rapidly changing motion; the calculation formulas are:

M_s^t = I_origin^f - I_origin^{f-2}    (6)
M_s = [M_s^1, M_s^2, …, M_s^T]    (7)

where M_s^t denotes the short-range motion information of the t-th frame in the new skeleton sequence, f is the index in the original sequence I_origin of the t-th frame of the new sequence I_new, and I_origin^f denotes the joint point coordinates of the f-th frame of I_origin;
long distance motion information M i Through the proto-framework sequence I origin The coordinates of the skeletal joint points which are separated by 5 frames are subtracted, and the coordinates are used for capturing motion information of slowly changing motion, and the calculation formula is expressed as follows:
Figure SMS_17
Figure SMS_18
wherein, the first and the second end of the pipe are connected with each other,
Figure SMS_19
long-distance motion information of the ith frame in the new skeleton sequence is shown, f is shown as the new skeleton sequence I new In the original frame sequence I origin The number in (1);
2.2, high-dimensional mapping of different-scale motion information:
static information of skeleton I new Adaptive motion information M a Short-term exercise information M s And long-term exercise information M l Are all (T, N, C) 0 ) Where T represents the number of video frames, N represents the number of joints of a skeleton, C 0 A coordinate dimension representing a joint point; mapping the four kinds of information to a high-dimensional space through an Embedding module (Embedding block) to obtain a high-dimensional feature F, F ma 、F ms And F ml (ii) a The embedded module is composed of two convolutional layers and two active layers (ReLU):
the first convolution maps various information to a space with the dimension of C, and the second convolution maps various information to the space with the dimension of C respectively 1 、,C 2 、C 3 、C 4 A high dimensional space of (a); convolution kernels corresponding to different motion information are mutually independent, and parameters are not shared; with static information I new For example, the embedded module quadratic mapping formula is shown as (10):
F=σ(W 2 (σ(W 1 I new +b 1 ))+b 2 ) (10)
where σ denotes the activation function, W 1 、b 1 Representing a parameter in the first convolution function, W 2 、b 2 Representing the parameters of the second convolution function, the parameters of both convolution functions being learned, I new Representing static information;
2.3, multi-scale motion information fusion:
fusing various types of information through stacking operation (concat) to obtain a dynamic representation Z of the framework, as shown in a formula (11); the operation enables the dynamic representation Z of the skeleton to contain multi-scale motion information, and further improves the capability of the network to adapt to actions with different change speeds and different durations;
Z = concat(F, F_ma, F_ms, F_ml)    (11)

the four high-dimensional features are fused to obtain Z, which is the output of the multi-scale motion information fusion module.
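The following PyTorch sketch illustrates one possible implementation of the multi-scale motion information fusion described by equations (4)-(11); layer sizes, module names and the handling of boundary frames are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class Embed(nn.Module):
    """Embedding block: two 1x1 convolutions with ReLU (C0 -> C -> C_out)."""
    def __init__(self, c0, c, c_out):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(c0, c, 1), nn.ReLU(),
                                 nn.Conv2d(c, c_out, 1), nn.ReLU())

    def forward(self, x):              # x: (B, C0, T, N)
        return self.net(x)

def fuse_motion(orig, idx, embeds):
    """orig: (B, C0, F, N) original skeleton sequence; idx: LongTensor of T sampled frame indices;
    embeds: four Embed modules for static / adaptive / short-range / long-range information."""
    new = orig[:, :, idx]                                # static information I_new
    m_a = new - torch.roll(new, shifts=1, dims=2)        # adaptive: consecutive sampled frames
    m_a[:, :, 0] = 0                                     # first sampled frame has no predecessor
    m_s = new - orig[:, :, (idx - 2).clamp(min=0)]       # short-range: 2 original frames apart
    m_l = new - orig[:, :, (idx - 5).clamp(min=0)]       # long-range: 5 original frames apart
    feats = [emb(x) for emb, x in zip(embeds, (new, m_a, m_s, m_l))]
    return torch.cat(feats, dim=1)                       # Z = concat(F, F_ma, F_ms, F_ml)
```

Each Embed instance can be given its own output width (C_1 … C_4), matching the statement that the convolution kernels for the different kinds of information do not share parameters.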
Further preferably, the multi-stream spatiotemporal relative transform model comprises:
3.1, constructing a space topological graph based on a framework:
in addition to the original joint points of the skeleton, this step introduces a virtual node, which together with all joint points forms a new spatial topology graph used as the model input; the introduced virtual node not only collects and integrates information from every joint point but also distributes the integrated global information back to every joint point, and it is therefore named the spatial relay node;
meanwhile, two types of connection are established among nodes, namely space inherent connection and space virtual connection, so as to construct a space topological graph of the framework; the space diagram structure comprising n joint points has n-1 space inherent connections;
3.2, designing a space relative Transformer module:
the module comprises a spatial joint point update module (SJU) and a spatial relay node update module (SRU), and connections are established between distant joint points in the spatial domain by alternately updating the SJU and SRU modules; the model input is the joint point sequence of the t-th frame skeleton {J_i^t | i = 1, …, N}, where N denotes the number of joint points in this frame and N_i denotes the set of labels of all joint points neighbouring J_i^t; each node has a corresponding query vector q_i^t, key vector k_i^t and value vector v_i^t;
in the spatial joint point update module (Spatial Joint Update Block, SJU), for any joint point J_i^t, the query vector q_i^t of the joint point is first dot-multiplied with the key vectors k_j^t of its neighbour nodes to obtain the influence of each neighbour node on the joint point, as shown in formula (12):

α_{ij}^t = q_i^t · k_j^t    (12)
where α_{ij}^t represents the influence strength of node j on node i; the neighbour nodes include the adjacent joint points j ∈ N_i, the spatial relay node R^t and the joint point itself, where r denotes the label of the spatial relay node;
after the influence strengths α_{ij}^t are obtained, each is multiplied by the value vector v_j^t of the corresponding neighbour node and all products are summed; the result is the updated value of the joint point J_i^t, as shown in formula (13):

J'_i^t = Σ_{j ∈ N_i ∪ {i, r}} softmax_j(α_{ij}^t / √d_k) · v_j^t    (13)
where J'_i^t is the result of one update by the spatial joint point update submodule (SJU), which aggregates local and global information at the same time, d_k denotes the channel dimension of the key vector and serves as a normalization factor, and softmax_j normalizes the influence strengths over all neighbour nodes j;
so that the spatial relay node can reasonably and fully collect and integrate the information of every joint point, the spatial relay node update submodule (SRU) also uses a dot-product operation to compute the influence of each joint point on the relay node, and integrates the information of all joint points into global information according to these influence strengths; the influence strength β_i^t is obtained by multiplying the query vector q_r^t of the relay node with the key vector k_i^t of each joint point, as in formula (14):

β_i^t = q_r^t · k_i^t    (14)
the update of the spatial relay node is shown in equation (15), where β_j^t represents the influence score of joint point J_j^t on the spatial relay node R^t and v_j^t are the value vectors of all nodes:

R'^t = Σ_j softmax_j(β_j^t / √d_k) · v_j^t    (15)
the alternate updating of the joint points and the spatial relay nodes realizes the exchange of information among the joint points, and finally realizes the goal that each joint point simultaneously collects the information of the neighbor joint points and the remote joint points;
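The following simplified PyTorch sketch illustrates one alternation of the SJU and SRU updates (equations (12)-(15)) for a single frame; it assumes the query, key and value vectors have already been produced by learned projections, and all names are illustrative rather than taken from the patent.

```python
import torch
import torch.nn.functional as F

def spatial_update(q, k, v, q_r, k_r, v_r, neighbors):
    """q, k, v: (N, C) per-joint query/key/value vectors; q_r, k_r, v_r: (C,) relay-node vectors;
    neighbors: list of index lists, neighbors[i] = skeletal neighbours of joint i (including i)."""
    C = q.shape[-1]
    new_joints = []
    for i, nbr in enumerate(neighbors):
        keys = torch.cat([k[nbr], k_r.unsqueeze(0)], dim=0)     # neighbours + spatial relay node
        vals = torch.cat([v[nbr], v_r.unsqueeze(0)], dim=0)
        attn = F.softmax(q[i] @ keys.t() / C ** 0.5, dim=-1)    # eq. (12)-(13): SJU update
        new_joints.append(attn @ vals)
    attn_r = F.softmax(q_r @ k.t() / C ** 0.5, dim=-1)          # eq. (14)-(15): SRU update
    new_relay = attn_r @ v
    return torch.stack(new_joints), new_relay
```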
3.3, constructing a time topological graph based on the skeleton sequence:
a time relay node is introduced when a time topological graph is constructed, and all joints are connected with each other through time inherent connection and time virtual connection to jointly form a graph structure in a time domain;
along the time dimension, the same joint point in consecutive frames forms a new sequence; this step also builds a connection between the head and tail joint points, forming a ring structure; a sequence of n nodes therefore contains n temporal intrinsic connections;
3.4, designing a TRT module:
the Temporal Relative Transformer module (TRT) comprises a temporal joint point update module (TJU) and a temporal relay node update module (TRU) and is used to extract temporal features; the module treats each joint point in the skeleton as an independent node and extracts the temporal features of each joint point from the sequence formed by that joint point across the frame sequence; the input of the TRT module is {J_v^t | t = 1, …, T}, the sequence of the same joint point v over all frames; each joint point J_v^t has its corresponding query vector q_v^t, key vector k_v^t and value vector v_v^t, and the temporal relay node R_v has a corresponding query vector q_v^r, key vector k_v^r and value vector v_v^r;
In the TJU submodule, each joint point to be updated
Figure SMS_55
Collecting information of neighbor nodes through virtual connection to perform self-updating; the influence calculation formula of the neighbor node is shown as (16):
Figure SMS_56
where α_v^{ij} denotes the influence strength of the same joint point in the j-th frame (or of the temporal relay node R_v) on the joint point in the i-th frame, and (q_v^i)^T denotes the transpose of q_v^i; the update of the joint point J_v^i is shown in equation (17):

J'_v^i = Σ_j softmax_j(α_v^{ij} / √d_k) · v_v^j    (17)
all query vectors q_v^i are combined into a matrix Q_v ∈ R^{C×1×t}, all key vectors k_v^j into a matrix K_v ∈ R^{C×B×t}, and all value vectors v_v^j into a matrix V_v ∈ R^{C×B×t}; the matrix form of the influence strength is defined in formula (18):

A_v = Q_v ∘ K_v    (18)

where B denotes the number of neighbour nodes and ∘ denotes the Hadamard product;
in the TRU module, the temporal relay node R_v collects information from the other frames through virtual connections and thereby updates itself; the specific operations are:

β_v^j = (q_v^r)^T · k_v^j    (19)
R'_v = Σ_j softmax_j(β_v^j / √d_k) · v_v^j    (20)

where β_v^j denotes the influence strength of the joint point J_v^j in the j-th frame on the relay node R_v, and √d_k is a scaling factor;
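A corresponding simplified PyTorch sketch of the temporal update (equations (16)-(20)) for a single joint trajectory follows; the ring-structured neighbourhood of the two adjacent frames plus the frame itself and the relay node is an assumption based on the topology described in step 3.3, and the names are illustrative.

```python
import torch
import torch.nn.functional as F

def temporal_update(q, k, v, q_r, k_r, v_r):
    """q, k, v: (T, C) vectors of one joint over T frames; q_r, k_r, v_r: (C,) temporal relay node."""
    T, C = q.shape
    new_frames = []
    for t in range(T):
        nbr = [(t - 1) % T, t, (t + 1) % T]                     # ring-structured temporal neighbours
        keys = torch.cat([k[nbr], k_r.unsqueeze(0)], dim=0)
        vals = torch.cat([v[nbr], v_r.unsqueeze(0)], dim=0)
        attn = F.softmax(q[t] @ keys.t() / C ** 0.5, dim=-1)    # TJU, eq. (16)-(17)
        new_frames.append(attn @ vals)
    attn_r = F.softmax(q_r @ k.t() / C ** 0.5, dim=-1)          # TRU, eq. (19)-(20)
    new_relay = attn_r @ v
    return torch.stack(new_frames), new_relay
```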
3.5, packaging the ST-RT module:
the ST-RT module is obtained by connecting an SRT module and a TRT module, where the SRT module comprises the spatial joint point update module and the spatial relay node update module, and the TRT module comprises the temporal joint point update module and the temporal relay node update module; each update module is followed by a feed-forward network layer, which maps the features into a higher-dimensional space to enhance the expressive power of the model; L× denotes L repetitions;
3.6, encapsulating MSST-RT network:
four ST-RT models with different input data are fused and packaged through a multi-stream framework to obtain the MSST-RT model; different sampling frequencies also provide complementary information for the model, and the joint and bone sequences are sampled at n_1 frames and n_2 frames respectively; the skeleton data pass through the MSST-RT network to obtain the final classification prediction probability based on skeleton data.
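A possible sketch of the multi-stream score fusion is shown below; the equal stream weights and the softmax-then-sum fusion are assumptions made for the example, since the text only states that the four ST-RT streams are fused.

```python
import torch

def msst_rt_predict(streams, inputs, weights=None):
    """streams: list of four ST-RT models (joint/bone data at two sampling rates);
    inputs: matching list of input tensors; weights: optional per-stream weights."""
    weights = weights or [1.0] * len(streams)
    scores = [w * torch.softmax(m(x), dim=-1) for m, x, w in zip(streams, inputs, weights)]
    return torch.stack(scores).sum(dim=0)        # fused skeleton-branch class probabilities
```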
Preferably, in the behavior recognition network model based on image data, the joint-point-based picture cropping module crops around the joint points of the hands and feet of the human body; an end-to-end trained image block feature extraction model is adopted and, with a temporal segment network as the basic framework, packaged into the key image block feature extraction model.
Further preferably, the joint point-based picture cropping module comprises:
the picture of the t-th frame I_t is represented by a matrix P_t; let (x, y) be the image coordinates of the joint point N_j to be cropped and l × l the size of the cropped picture; then the set of image blocks S_t obtained by cropping the image I_t around the hand and foot joint points N_j is given by:

p_t^j = P_t[y - l/2 : y + l/2, x - l/2 : x + l/2]    (21)
S_t = {p_t^j | N_j is a hand or foot joint point}    (22)
besides cropping the picture with the joint point coordinates as centre, optical flow is extracted from the image blocks of two adjacent frames, as shown in (23):

(u_t^j, w_t^j) = TV-L1(p_t^j, p_{t+1}^j)    (23)

where TV-L1 is a classical optical flow computation method, u_t^j denotes the optical flow field in the x-axis direction and w_t^j denotes the optical flow field in the y-axis direction.
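The cropping step can be illustrated with the following NumPy sketch; the key-joint indices, the edge-padding behaviour and the function name are assumptions made for the example. Optical-flow blocks for the temporal stream would be computed analogously from the patches of two adjacent frames with a TV-L1 implementation.

```python
import numpy as np

def crop_patches(frame, joints_2d, key_joint_ids, size):
    """frame: (H, W, 3) image; joints_2d: (N, 2) pixel coordinates (x, y);
    key_joint_ids: indices of hand/foot joints; size: side length l of each patch."""
    half = size // 2
    padded = np.pad(frame, ((half, half), (half, half), (0, 0)), mode="edge")
    patches = []
    for j in key_joint_ids:
        x, y = joints_2d[j].astype(int)
        patches.append(padded[y: y + size, x: x + size])   # centred crop; padding absorbs borders
    return np.stack(patches)                               # (num_key_joints, size, size, 3)
```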
Further preferably, the behavior recognition network based on key image blocks comprises:
5.1, designing an IBCN model:
the image blocks cropped around the skeleton joint points are both independent of and correlated with one another; in the IBCN model, each cropped image block p_t^j is first input into a convolutional neural network to obtain the feature f_t^j of that image block, as shown in (24):

f_t^j = CNN_W(p_t^j)    (24)
where CNN_W(·) denotes feature extraction of an image block by a convolutional neural network with parameters W, and all of the convolutional neural networks share parameters; the features f_t^j of the individual image blocks are then concatenated into a new feature vector F_t, as shown in equation (25):

F_t = concat(f_t^1, f_t^2, …, f_t^J)    (25)
Finally, calculating a characteristic vector F by a point multiplication mode t At an arbitrary spatial position x i From other positions x j Similarity f (x) of i ,x j ) As shown in equation (26):
f(x i ,x j )=softmax(θ(x i ) T ·φ(x j )) (26)
wherein θ (-) and φ (-) are 1 × 1 convolution functions;
the obtained similarity f (x) i ,x j ) Will be used as the weight and g (x) j ) Weighted summation to achieve x i Obtaining information from other locations, y i Is x i The result of global information exchange is shown in equation (27):
Figure SMS_84
wherein g (-) is a mapping function, and a 1 × 1 convolution function is adopted for mapping; nl' 2 To select a feature map
Figure SMS_85
The size of (2) is used as a normalization coefficient to avoid scale expansion caused by different input sizes; when the input is the feature tensor, the formula is shown as (28):
Figure SMS_86
wherein θ (-), φ (-), and g (-) are all 1 × 1 convolution functions, nl' 2 Is a normalized coefficient;
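The non-local interaction of equations (26)-(28) can be sketched in PyTorch roughly as follows; the channel sizes, the residual connection and the class name are assumptions in the style of common non-local block implementations, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block over the spatial positions of a feature map."""
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv2d(channels, inner, 1)   # θ(·)
        self.phi = nn.Conv2d(channels, inner, 1)     # φ(·)
        self.g = nn.Conv2d(channels, inner, 1)       # g(·)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = self.theta(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        p = self.phi(x).flatten(2)                            # (B, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)              # (B, HW, C')
        attn = torch.softmax(t @ p, dim=-1)                   # f(x_i, x_j), eq. (26)
        y = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)   # weighted sum, eq. (27)-(28)
        return x + self.out(y)                                # residual connection (assumption)
```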
5.2, packaging a KBN network:
the IBCN model is packaged into the KBN network with the TSN network as framework; the network is divided into a spatial stream and a temporal stream, where the input of the spatial stream is image blocks and the input of the temporal stream is optical flow blocks; for the spatial stream, several frames are first sampled from the video by sparse sampling and each frame is processed by the joint-point-based picture cropping module; then the key image block set S_K of each sampled frame is input into an IBCN model to obtain the preliminary class prediction probability of that sampled frame, with all IBCN models sharing parameters; finally the prediction results of all sampled frames are fused by a consensus function to obtain a video-level classification prediction, as in (29):

KBN-S(T_1, T_2, …, T_K) = G(IBCN(S_1), IBCN(S_2), …, IBCN(S_K))    (29)

where KBN-S is the prediction result of the spatial stream of the KBN network, T_K denotes the K-th segment of the video, S_K denotes the set of image blocks corresponding to the K-th sampled frame, IBCN(S_K) denotes processing of the image block set S_K by the IBCN module, and G is the consensus function; the temporal-stream prediction is computed in the same way as the spatial stream.
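A minimal sketch of the spatial-stream consensus of equation (29) follows; using the mean as the consensus function and a final softmax is an assumption in the style of TSN, not something the text fixes.

```python
import torch

def kbn_spatial_stream(ibcn, segment_patches):
    """ibcn: shared patch-based classification model; segment_patches: list of K tensors,
    one set of key image blocks per sampled frame (segment)."""
    seg_scores = [ibcn(p) for p in segment_patches]    # per-segment class scores, shared parameters
    consensus = torch.stack(seg_scores).mean(dim=0)    # consensus function G (mean, assumed)
    return torch.softmax(consensus, dim=-1)            # video-level class probabilities
```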
Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable advantages:
1. the behavior recognition network model fusing the skeleton data and the image data is provided, skeleton motion information is fully mined, dependence is built for remote joint points, and the recognition capability of detail actions is enhanced; the local image data and the skeleton data are further fused, so that rich action detail information is supplemented, and high calculation cost is avoided;
2. the invention achieves a recognition accuracy of 98.65% on the NTU60 dataset; by providing behavior recognition network models based on skeleton data and on image data and fusing the two, the accuracy of the model is improved;
3. according to the invention, the information exchange channels among all spatial positions are established through the Non-Local module, so that the global information exchange among all image blocks is realized, the independence and the relevance among all image blocks are considered, and the human body Local fine action recognition capability is further improved; and finally, fusing the behavior recognition network model based on the skeleton data and the behavior recognition network model based on the image data, and fully exerting the complementarity of the skeleton data and the image data.
Drawings
FIG. 1 is a schematic diagram of the overall structure of a network model of the method of the present invention.
FIG. 2 is a graph of a skeleton motion information cumulative distribution function according to the method of the present invention.
FIG. 3 is a schematic diagram of various types of motion information calculation according to the method of the present invention.
FIG. 4 is a schematic diagram of a skeletal dynamics information representation module of the method of the present invention.
FIG. 5 is a skeleton-based spatial topology of the method of the present invention.
FIG. 6 is a schematic diagram of a space relative transform module according to the method of the present invention.
FIG. 7 is a time topology diagram based on a skeleton sequence of the method of the present invention.
FIG. 8 is a schematic diagram of a Temporal Relative Transform (TRT) module of the method of the present invention.
FIG. 9 is a schematic diagram of the overall architecture of the ST-RT model of the method of the present invention.
FIG. 10 is a schematic diagram of the overall architecture of the MSST-RT model of the method of the present invention.
FIG. 11 is a schematic diagram of image cropping and corresponding optical flow based on joint point location for the method of the present invention.
FIG. 12 is a schematic diagram of an image block feature extraction model (IBCN) according to the present invention.
Fig. 13 is a schematic diagram of a key image block-based behavior recognition network (KBN) according to the method of the present invention.
Detailed Description
The above-described scheme is further illustrated below with reference to specific embodiments, which are detailed below:
the first embodiment is as follows:
in this embodiment, as shown in fig. 1, a behavior recognition method for public security is implemented, where a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data are respectively established to form a recognition network, the behavior recognition network model based on skeleton data utilizes a lightweight network to extract skeleton features for recognizing actions with large amplitude, and completes a main action recognition task, model input data of the behavior recognition network model based on skeleton data is a skeleton sequence, and the input data sequentially passes through a sampling module guided by coordinate motion information, a multi-scale motion information fusion module, and a multi-stream spatiotemporal relative Transformer model, so as to obtain an action category prediction probability; extracting image characteristics from image blocks by a picture cutting method based on a behavior recognition network model of image data, and identifying small-amplitude actions concentrated on hands and feet, and supplementing detail information in an action recognition task; identifying model input data information of a network model into an image sequence based on the behavior of image data, wherein the input data sequentially passes through a picture cutting module based on joint points and a key image block feature extraction model (KBN) to obtain a supplementary action category prediction probability; and fusing the action type prediction probabilities obtained by the action recognition network model based on the skeleton data and the action recognition network model based on the image data to obtain the final classification prediction probability of the whole model, thereby completing the action recognition process for public safety.
Each of the modules will be described in turn in detail.
(1) Sampling module guided by coordinate motion information
The frame sampling module guided by the coordinate motion information has the innovation point that a representative framework in the framework sequence is screened out according to the coordinate motion information measurement index, and then the motion information contained in the sampling sequence is increased.
Step 1.1, designing indexes for measuring coordinate motion information
In skeleton data, joint points are typically represented by 3D coordinates. The displacement distance of a joint point between two adjacent frames is taken as an index of the motion information it contains, and the sum of the displacement distances of all joint points in the skeleton as an index of the motion information contained in the whole skeleton, from which the representativeness of the frame is judged. Let J_i^t denote the coordinates of the joint point labeled i in the t-th frame and J_i^{t-1} those of the same joint point in the (t-1)-th frame. The coordinate motion information M_t contained in the t-th frame is given by equation (1):

M_t = Σ_{i=1}^{N} ||J_i^t - J_i^{t-1}||_2    (1)
where N represents the number of joints contained in a frame.
In order to eliminate the scale expansion effect caused by the difference of the video lengths, the coordinate motion information contained in each frame is normalized, as shown in formula (2):
m_t = M_t / Σ_{k=1}^{T} M_k    (2)
where T represents the number of frames contained in the video.
Step 1.2, sampling the video by adopting the cumulative distribution function
Assuming that N frames need to be sampled from a video of length T, the specific operation is as follows: first, the skeleton coordinate motion information is accumulated frame by frame to obtain the accumulated coordinate motion information C_t of the t-th frame, computed as in (3):

C_t = Σ_{k=1}^{t} m_k    (3)
According to C_t, the sequence is divided into N segments, as shown by the dashed lines in Fig. 2 (a total of 10 frames are sampled in Fig. 2). Finally, one frame is randomly sampled from each of the N segments to form a new sequence.
In conclusion, the module provides a skeleton coordinate motion information measurement index, screens out representative skeletons in a skeleton sequence through the measurement index, and further increases motion information contained in a sampling sequence.
(2) Multi-scale motion information fusion module
The multi-scale motion information fusion module has the innovative point that the static information of the framework is fused with the multi-scale motion information, so that the effect of enriching the input information of the model is achieved. According to the characteristic that different actions of human beings have different change speeds and duration, two different types of motion information are designed in the module, namely solidification motion information and self-adaptive motion information. The solidification motion information comprises two different scales, so that the network adapts to actions with different change speeds; adaptive motion information enables the network the ability to recognize different duration actions. The generalization capability of the network can be improved by fusing the multi-scale motion information, and the specific steps are as follows.
Step 2.1, designing motion information of different scales
T frames are selected by sampling from the original skeleton sequence I_origin = [I_1, …, I_F] and combined, keeping their original order, into a new skeleton sequence I_new = [I_1, …, I_T]. As shown in Fig. 3, the pink frames are the sampled frames, and I denotes the coordinates of all joint points in one frame. Motion information is obtained by computing the coordinate displacement of the same joint point between two frames: J_i^t denotes the joint point labeled i in the t-th frame of the original sequence I_origin, and jn_i^t denotes the joint point labeled i in the t-th frame of the sampled sequence I_new.
The adaptive motion information M_a is obtained from the sampled sequence I_new by subtracting the joint point coordinates of two consecutive frames, so that videos of different lengths yield motion information of different scales:

M_a^t = I_new^t - I_new^{t-1}    (4)
M_a = [M_a^2, M_a^3, …, M_a^T]    (5)

where M_a^t denotes the adaptive motion information of the t-th frame of the new skeleton sequence I_new.
Although the adaptive motion information M_a is obtained as the difference between two adjacent frames of the new skeleton sequence I_new, the distance between those two frames depends on their positions in I_origin and is therefore closely related to the length of the original skeleton sequence, so each skeleton sequence obtains motion information matched to its own length.
The solidified motion information is divided into two types: short-range motion information M_s and long-range motion information M_l. The short-range motion information M_s is obtained from the original skeleton sequence I_origin by subtracting the coordinates of skeleton joint points that are 2 frames apart, and is used to capture rapidly changing motion:

M_s^t = I_origin^f - I_origin^{f-2}    (6)
M_s = [M_s^1, M_s^2, …, M_s^T]    (7)

where M_s^t denotes the short-range motion information of the t-th frame in the new skeleton sequence and f is the index in the original sequence I_origin of the t-th frame of the new sequence I_new.
The long-range motion information M_l is obtained from the original skeleton sequence I_origin by subtracting the coordinates of skeleton joint points that are 5 frames apart, and captures the motion information of slowly changing motion:

M_l^t = I_origin^f - I_origin^{f-5}    (8)
M_l = [M_l^1, M_l^2, …, M_l^T]    (9)

where M_l^t denotes the long-range motion information of the t-th frame in the new skeleton sequence and f is the index in the original sequence I_origin of the t-th frame of the new sequence I_new.
Step 2.2 high-dimensional mapping of different-scale motion information
The static information of the skeleton I_new, the adaptive motion information M_a, the short-range motion information M_s and the long-range motion information M_l all have shape (T, N, C_0), where T is the number of video frames, N is the number of joint points in a skeleton and C_0 is the coordinate dimension of a joint point. The four kinds of information are mapped into a high-dimensional space by an embedding module (Embedding block) to obtain the high-dimensional features F, F_ma, F_ms and F_ml. The embedding module consists of two convolutional layers and two activation layers (ReLU): the first convolution maps every kind of information to a space of dimension C, and the second convolution maps them to high-dimensional spaces of dimensions C_1, C_2, C_3 and C_4 respectively. The convolution kernels corresponding to different kinds of motion information are independent of each other and do not share parameters. Taking the static information I_new as an example, the two-stage mapping of the embedding module is shown in (10):

F = σ(W_2(σ(W_1 I_new + b_1)) + b_2)    (10)
step 2.3, multi-scale motion information fusion
And fusing various types of information through stacking operation (concat) to obtain a dynamic representation Z of the skeleton, as shown in formula (11). The operation enables the dynamic representation Z of the skeleton to contain multi-scale motion information, and therefore the capability of the network to adapt to actions with different change speeds and different durations is improved.
Z = concat(F, F_ma, F_ms, F_ml)    (11)
In summary, the module provides three types of motion information with different scales, namely adaptive motion information, short-term motion information and long-term motion information, and then an embedded module is adopted to map the motion information and static information to a high-dimensional space respectively, and finally four types of high-dimensional features are fused to be used as model input. The method provided in this section enables model input to contain rich motion information, and the multi-scale characteristics of the model input can improve the generalization of the behavior recognition network.
(3) Multi-stream spatiotemporal relative transform model
In the action recognition task, many human actions are completed through the cooperation of joint points that are far apart. For example, clapping requires the left and right hands to work together; the joint points of the two hands are far apart in the skeleton but strongly correlated in the action. The innovation of the multi-stream spatio-temporal relative Transformer model is to establish long-distance relations between joint points over the spatio-temporal domain, as follows: in the spatial domain, a skeleton-based spatial topology graph is designed and a spatial relative Transformer module is proposed to establish long-distance dependencies between joint points in the spatial domain; in the temporal domain, a skeleton-sequence-based temporal topology graph is designed and a temporal relative Transformer module is proposed to establish long-distance dependencies between joint points in the temporal domain. The spatial and temporal relative modules are then combined into a spatio-temporal relative Transformer model to extract the spatio-temporal features of the skeleton sequence. Finally, the spatio-temporal relative models of four different input data are fused with a multi-time-scale framework to obtain the multi-stream spatio-temporal relative Transformer model. The specific steps are as follows.
Step 3.1, constructing a space topological graph based on a framework
In addition to the original joint points in the skeleton, the step introduces a virtual node, and forms a new spatial topological graph together with all the joint points as the model input. As shown in fig. 5, the blue node is the original node, and the purple node is the introduced virtual node. The introduced virtual node not only needs to collect the integrated information from each joint point, but also plays a role of distributing the integrated global information to each joint point, and the virtual node is named as a spatial relay node.
Meanwhile, two types of connections are established between nodes (including joint nodes and spatial relay nodes) in the step, namely spatial inherent connection and spatial virtual connection, so as to construct a spatial topological graph of the skeleton. As shown in fig. 5, the purpose of maintaining the original map topology in the skeleton is achieved by establishing spatial inherent connections, i.e. blue line segments, for all the joint point pairs directly connected by the skeleton in the human skeleton. The spatial intrinsic connection contains a large amount of a priori knowledge and can serve to gather local information from neighboring joint points. At the same time, the existence of the connection enables the joint point to obtain more information from the neighbor joint point than the remote joint point. The spatial graph structure containing n joint points has n-1 spatial inherent connections.
Step 3.2, designing a space relative Transformer module
The spatial relative Transformer module is essentially a Transformer-based spatial feature extraction algorithm, as shown in Fig. 6. The module comprises a spatial joint point update module (SJU) and a spatial relay node update module (SRU); connections are established between distant joint points in the spatial domain by alternately updating the SJU and SRU modules. Since the module updates the joint points and the spatial relay node of each frame independently, the algorithm is described here with a single frame as an example. The model input is the joint point sequence of the t-th frame skeleton {J_i^t | i = 1, …, N}, where N denotes the number of joint points in this frame and N_i denotes the set of labels of all joint points neighbouring J_i^t. Every node (including each joint point J_i^t and the spatial relay node R^t) has a corresponding query vector q_i^t, key vector k_i^t and value vector v_i^t.
In the spatial joint point update module (SJU), for any joint point J_i^t, the query vector q_i^t of the joint point is first dot-multiplied with the key vectors k_j^t of its neighbour nodes to obtain the influence of each neighbour node on the joint point, as shown in formula (12):

α_{ij}^t = q_i^t · k_j^t    (12)
where α_{ij}^t represents the influence strength of node j on node i. The neighbour nodes include the adjacent joint points j ∈ N_i, the spatial relay node R^t and the joint point itself.
After the influence strengths α_{ij}^t are obtained, each is multiplied by the value vector v_j^t of the corresponding neighbour node and all products are summed; the result is the updated value of the joint point J_i^t, as shown in formula (13):

J'_i^t = Σ_{j ∈ N_i ∪ {i, r}} softmax_j(α_{ij}^t / √d_k) · v_j^t    (13)

where J'_i^t is the result of one update by the spatial joint point update submodule (SJU), which aggregates local and global information at the same time, and d_k, the channel dimension of the key vector, serves as a normalization factor. In the SJU block of Fig. 6, the red nodes are the nodes to be updated, which collect information from neighbouring nodes through the orange connections.
So that the spatial relay node can reasonably and fully collect and integrate the information of every joint point, the spatial relay node update submodule (SRU) also uses a dot-product operation to compute the influence of each joint point on the relay node. As shown in the SRU block of Fig. 6, the spatial relay node to be updated (red node) collects information from every node through the orange connections and integrates it into global information according to the influence strengths. The influence strength β_i^t is obtained by multiplying the query vector q_r^t of the relay node with the key vector k_i^t of each joint point, as in formula (14):

β_i^t = q_r^t · k_i^t    (14)
The update of the spatial relay node is shown in equation (15), where β_j^t represents the influence score of joint point J_j^t on the spatial relay node R^t and v_j^t are the value vectors of all nodes (all joint points in the skeleton and the spatial relay node):

R'^t = Σ_j softmax_j(β_j^t / √d_k) · v_j^t    (15)
The alternate updating of the joint points and the spatial relay nodes realizes the exchange of information among the joint points, and finally realizes the goal that each joint point simultaneously collects the information of the neighbor joint points and the remote joint points. The overall update algorithm of the SRT module is algorithm 1, as shown in table 1, where the first layer cycles through all frames and the second layer cycles through all the nodes (including spatial relay nodes) in the frame.
Table 1. Algorithm 1: SRT module update algorithm (the listing appears only as an image in the original document).
Step 3.3, constructing a time topological graph based on the skeleton sequence
In the step, a time relay node is introduced when a time topological graph is constructed, and all joints are connected with each other through time inherent connection and time virtual connection to jointly form a graph structure in a time domain.
Along the time dimension, the same joint point in successive frames forms a new sequence; this step also builds a connection between the head and tail joint points, forming a ring structure, as shown in Fig. 7. These connections are named temporal intrinsic connections (blue line segments) because they preserve the order of the frames, and they serve to exchange information directly with adjacent frames. A sequence of n nodes contains n temporal intrinsic connections.
Similar to the configuration in step 3.1, a temporal virtual connection (purple segment) connects the temporal relay node (purple node) and each node in the sequence (blue node), which completes the remote information exchange through such a connection. Thus, a graph containing n nodes has n virtual connections in time, as shown in FIG. 7.
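For illustration, the temporal topology described above can be encoded as a small adjacency matrix; the layout below (frame nodes first, relay node last) is an assumption made for the example.

```python
import numpy as np

def temporal_graph(num_frames):
    """Adjacency matrix over num_frames frame-nodes plus one temporal relay node."""
    n = num_frames
    adj = np.zeros((n + 1, n + 1), dtype=np.int8)
    for t in range(n):
        adj[t, (t + 1) % n] = adj[(t + 1) % n, t] = 1   # intrinsic connections: ring incl. head-tail
        adj[t, n] = adj[n, t] = 1                       # virtual connection to the temporal relay node
    return adj
```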
Step 3.4, design TRT Module
The Temporal Relative Transformer module (TRT) comprises a temporal joint point update module (TJU) and a temporal relay node update module (TRU) and is used to extract temporal features. The module treats each joint point in the skeleton as an independent node and extracts the temporal features of each joint point from the sequence formed by that joint point across the frame sequence. The algorithm is described here with a single joint point as an example. The input of the TRT module is {J_v^t | t = 1, …, T}, the sequence of the same joint point v over all frames. Each joint point J_v^t has its corresponding query vector q_v^t, key vector k_v^t and value vector v_v^t, and the temporal relay node R_v has a corresponding query vector q_v^r, key vector k_v^r and value vector v_v^r.
In the TJU submodule, each joint point J_v^i to be updated (red node) collects information from its neighbour nodes (the temporal relay node R_v, the same joint point in the adjacent frames, and the node itself) through virtual connections (orange segments) to update itself, as shown in the TJU block of Fig. 8. The influence of a neighbour node is computed as in (16):

α_v^{ij} = (q_v^i)^T · k_v^j    (16)
where α_v^{ij} denotes the influence strength of the same joint point in the j-th frame (or of the temporal relay node R_v) on the joint point in the i-th frame. The update of the joint point J_v^i is shown in equation (17):

J'_v^i = Σ_j softmax_j(α_v^{ij} / √d_k) · v_v^j    (17)
All query vectors q_v^i are combined into a matrix Q_v ∈ R^{C×1×t}, all key vectors k_v^j into a matrix K_v ∈ R^{C×B×t}, and all value vectors v_v^j into a matrix V_v ∈ R^{C×B×t}. The matrix form of the influence strength is defined in formula (18):

A_v = Q_v ∘ K_v    (18)

where B denotes the number of neighbour nodes and ∘ denotes the Hadamard product.
In the TRU module, as shown in fig. 8, the temporal relay node R_v (red node) collects information from the other frames through virtual connections (orange segments), thereby completing its own update. The specific operations are given in equations (19) and (20):

β_v^{j} = (q_v^r)^T · k_v^j / √d_k  (19)

R_v ← Σ_{j=1}^{T} softmax_j(β_v^{j}) · v_v^j  (20)

where β_v^{j} denotes the influence of the joint point x_v^j in the j-th frame on the relay node R_v, and √d_k is a scaling factor.
The temporal relay node and the same joint point in all frames are updated alternately, so that the TRT module finally captures both long-range and short-range dependencies between frames. The overall update procedure of the TRT module is Algorithm 2, shown in Table 2: the outer loop traverses all joint points in the skeleton, and the inner loop traverses the corresponding joint points in all frames (including the temporal relay node).
Table 2. Algorithm 2: TRT module update algorithm
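For clarity, here is a hedged PyTorch-style sketch of Algorithm 2 following the attention pattern reconstructed in equations (16)-(20); the helper qkv, which gathers a node's query and its neighbors' keys and values, and the tensor layout are assumptions, not the patent's exact implementation.

import torch
import torch.nn.functional as F

def attend(q, K, V):
    """Scaled dot-product attention of one query over a neighbor set
    (the pattern of eqs. (16)-(20)). q: (C,), K and V: (B, C)."""
    scores = K @ q / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ V

def trt_update(x, relay, qkv):
    """Sketch of Algorithm 2: outer loop over joints, inner loop over frames.
    x: (V, T, C) joint features, relay: (V, C) temporal relay node features."""
    n_joints, n_frames, _ = x.shape
    for v in range(n_joints):                       # traverse all joints in the skeleton
        for t in range(n_frames):                   # traverse the joint's nodes in all frames
            q, K, vals = qkv(x, relay, v, t)        # neighbors: adjacent frames, itself, relay node
            x[v, t] = attend(q, K, vals)            # TJU step, eqs. (16)-(17)
        q_r, K_all, V_all = qkv(x, relay, v, None)  # relay node attends to all frames
        relay[v] = attend(q_r, K_all, V_all)        # TRU step, eqs. (19)-(20)
    return x, relay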
Step 3.5, packaging the ST-RT module:
The ST-RT module is obtained by connecting an SRT module and a TRT module, as shown in fig. 9. The SRT module comprises a spatial joint point update module (SJU) and a spatial relay node update module (SRU); the TRT module comprises a temporal joint point update module (TJU) and a temporal relay node update module (TRU). Each update module is followed by a feed-forward network (FFN), which maps the features to a higher-dimensional space to enhance the expressive capability of the model. L× indicates that the block is repeated L times.
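As an illustration of this composition (spatial stage, temporal stage, each followed by an FFN, repeated L times), a schematic PyTorch skeleton is sketched below. The submodule objects srt and trt, the hidden dimension, and the FFN mapping back to the input dimension are assumptions, not the patent's code.

import torch.nn as nn

class STRTBlock(nn.Module):
    """One ST-RT block: an SRT stage and a TRT stage, each update module
    followed by a feed-forward network (FFN) with a larger hidden dimension."""
    def __init__(self, srt, trt, dim, hidden):
        super().__init__()
        self.srt, self.trt = srt, trt    # spatial / temporal relative Transformer modules (placeholders)
        self.ffn_s = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.ffn_t = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        x = self.ffn_s(self.srt(x))      # spatial update, then FFN
        x = self.ffn_t(self.trt(x))      # temporal update, then FFN
        return x

# Stacking L such blocks corresponds to the "L x" repetition shown in fig. 9.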
Step 3.6, encapsulating the MSST-RT network
To further improve accuracy, this step fuses and encapsulates four ST-RT models with different input data through a multi-stream framework to obtain the MSST-RT model (Multi-Stream ST-RT), as shown in FIG. 10. Besides extracting features from the first-order information of the skeleton (joint points), features can also be extracted from second-order information (bones). At the same time, different sampling frequencies can provide supplementary information for the model, for example by sampling n_1 frames and n_2 frames for the joint and bone sequences, respectively. The skeleton data passes through the MSST-RT network to obtain the final classification prediction probability based on the skeleton data.
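A minimal sketch of the multi-stream fusion follows: four ST-RT streams (joint/bone data at two sampling rates) each produce class scores, which are combined into the skeleton-based prediction. Equal-weight averaging is an assumption; the patent only states that the four streams are fused.

import numpy as np

def msst_rt_fuse(stream_scores, weights=None):
    """Fuse class-probability scores from the four ST-RT streams
    (joint and bone data sampled at n1 and n2 frames)."""
    scores = np.stack(stream_scores)                  # (4, num_classes)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)  # assumed equal weights
    fused = (weights[:, None] * scores).sum(axis=0)
    return fused / fused.sum()                        # skeleton-based prediction probability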
In conclusion, the MSST-RT model adapts the Transformer to the characteristics of the skeleton graph and of the sequence, establishes dependencies between remote joint points at a low computational cost, and preserves the integrity of the skeleton structure and the order of the sequence, thereby improving both computational efficiency and recognition accuracy.
(4) Image clipping module based on joint points
Since fine human motions are mostly concentrated on the hands or feet, the corresponding image blocks contain most of the detail information that the skeleton lacks. The module therefore crops the image around the hand and foot joint points of the human body, which greatly reduces the training cost, as shown in fig. 11.
Specifically, the image of the t-th frame I_t is represented by a matrix P_t; the joint point N_j to be cropped has image coordinates (x, y), and the crop size is l × l. The image block set B_t obtained by cropping I_t around the hand and foot joint points N_j is given by equations (21) and (22):

B_t^j = P_t[x − l/2 : x + l/2, y − l/2 : y + l/2]  (21)

B_t = {B_t^j | N_j is a hand or foot joint point}  (22)
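A hedged NumPy sketch of this joint-centered cropping is given below; clamping the window at the image border so every block stays l × l is an added assumption not stated in the text.

import numpy as np

def crop_joint_blocks(frame, joints_xy, l):
    """Crop an l x l image block around each selected hand/foot joint point.
    frame: (H, W, 3) image of frame t; joints_xy: list of (x, y) joint coordinates."""
    H, W = frame.shape[:2]
    blocks = []
    for x, y in joints_xy:
        # clamp the window so it stays inside the image (assumption, not in the patent text)
        x0 = int(np.clip(x - l // 2, 0, W - l))
        y0 = int(np.clip(y - l // 2, 0, H - l))
        blocks.append(frame[y0:y0 + l, x0:x0 + l])
    return blocks   # the image block set used in place of the full frame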
In addition to cropping the picture around the joint point coordinates, this section also extracts the optical flow from the image blocks corresponding to two adjacent frames, as shown in equation (23):

(u_t^x, u_t^y) = TV-L1(B_t^j, B_{t+1}^j)  (23)

where TV-L1 is a classical optical flow calculation method, u_t^x represents the optical flow field in the x-axis direction, and u_t^y represents the optical flow field in the y-axis direction.
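The TV-L1 flow of equation (23) can be computed, for example, with the implementation in OpenCV's optflow module (opencv-contrib-python); the patent mentions a CUDA build without specifying the API, so this CPU call is an assumption.

import cv2

def tvl1_flow(block_prev, block_next):
    """TV-L1 optical flow between image blocks of two adjacent frames (eq. (23)).
    Returns the x- and y-direction flow fields."""
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()      # requires opencv-contrib-python
    prev_gray = cv2.cvtColor(block_prev, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(block_next, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev_gray, next_gray, None)     # (l, l, 2) float32
    return flow[..., 0], flow[..., 1]                # u^x, u^y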
(5) Behavior recognition network (KBN) based on key image blocks
In order to extract features from each cropped key image block, this embodiment designs an end-to-end trained image block feature extraction model (IBCN) and encapsulates the IBCN model into the KBN network with the Temporal Segment Network (TSN) as the basic framework. The specific steps are as follows.
Step 5.1, designing IBCN model
There is both independence and correlation between the image blocks cropped based on the skeleton joint points. Therefore, as shown in FIG. 12, the IBCN model first inputs each cropped image block B_t^j into a convolutional neural network (CNN) to obtain the feature f_t^j of each image block, as shown in equation (24):

f_t^j = CNN_W(B_t^j)  (24)

where CNN_W(·) denotes feature extraction of the image block B_t^j by a convolutional neural network with parameters W; the convolutional neural network parameters are shared across blocks.
The features f_t^j of the image blocks are then concatenated to obtain a new feature vector F_t, as shown in equation (25):

F_t = concat(f_t^1, f_t^2, …, f_t^J)  (25)
Finally, the similarity f(x_i, x_j) between an arbitrary spatial position x_i of the feature vector F_t and the other positions x_j is computed by dot product, as shown in equation (26):

f(x_i, x_j) = softmax(θ(x_i)^T · φ(x_j))  (26)

where θ(·) and φ(·) are 1 × 1 convolution functions.
The obtained similarity f(x_i, x_j) is used as a weight for a weighted sum with g(x_j), so that position x_i obtains information from the other positions; y_i is the result of x_i after global information exchange, as shown in equation (27):

y_i = (1 / (N l'^2)) Σ_j f(x_i, x_j) · g(x_j)  (27)

where g(·) is a mapping function, and this section uses a 1 × 1 convolution for the mapping; N l'^2, the dimension of the selected feature map F_t, is used as a normalization coefficient to avoid scale expansion caused by different input dimensions. When the input is the feature tensor, the formula is shown in equation (28):

Y = (1 / (N l'^2)) · softmax(θ(F_t)^T · φ(F_t)) · g(F_t)  (28)

where θ(·), φ(·) and g(·) are all 1 × 1 convolution functions and N l'^2 is the normalization coefficient.
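The similarity-and-aggregation step of equations (26)-(28) follows the Non-Local block pattern; a compact PyTorch sketch under that reading is given below. The halved embedding channels, the output projection and the residual connection are assumptions, not details given in the patent.

import torch
import torch.nn as nn

class NonLocalExchange(nn.Module):
    """Global information exchange between image-block features (eqs. (26)-(28)):
    1x1-convolution embeddings theta/phi/g, softmax similarity, weighted aggregation."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)
        self.phi   = nn.Conv2d(channels, channels // 2, 1)
        self.g     = nn.Conv2d(channels, channels // 2, 1)
        self.out   = nn.Conv2d(channels // 2, channels, 1)   # map back to input channels (assumption)

    def forward(self, x):                                    # x: (B, C, H, W) concatenated block features
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)         # (B, HW, C/2)
        k = self.phi(x).flatten(2)                           # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)             # (B, HW, C/2)
        sim = torch.softmax(q @ k, dim=-1)                   # f(x_i, x_j), eq. (26)
        y = (sim @ v) / (H * W)                              # weighted sum + size normalization, eq. (27)
        y = y.transpose(1, 2).reshape(B, -1, H, W)
        return x + self.out(y)                               # residual connection (assumption)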
Step 5.2, encapsulating the KBN network
This step uses the TSN network as the framework and encapsulates the IBCN model into the KBN network. The network is divided into a spatial stream and a temporal stream: the input of the spatial stream is the image blocks, and the input of the temporal stream is the corresponding optical flow blocks. Taking the spatial stream as an example, several frames are first sampled from the video by sparse sampling, and each frame is processed by the joint-point-based image cropping module. The key image block set B_{T_K} corresponding to each sampled frame is then input into an IBCN model to obtain a preliminary class-probability prediction for that frame; the IBCN models share their parameters. The prediction results of all sampled frames are then fused by a consensus function (Consensus) to obtain the video-level classification prediction, as shown in equation (29):

KBN-S(T_1, T_2, …, T_K) = Consensus(IBCN(B_{T_1}; W), IBCN(B_{T_2}; W), …, IBCN(B_{T_K}; W))  (29)

where KBN-S is the prediction result of the spatial stream of the KBN network; the temporal stream prediction is computed in the same way as the spatial stream.
Finally, the spatial stream prediction and the temporal stream prediction are fused to obtain the final classification prediction probability based on the image data.
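A small sketch of the TSN-style consensus of equation (29) and the final two-stream fusion follows; average consensus and equal-weight stream fusion are assumptions, as the patent does not state the fusion weights.

import numpy as np

def kbn_predict(frame_scores_spatial, frame_scores_temporal):
    """Fuse per-frame IBCN predictions into video-level scores (eq. (29)),
    then fuse the spatial and temporal streams of the KBN network."""
    kbn_s = np.mean(frame_scores_spatial, axis=0)    # consensus over sampled image-block frames
    kbn_t = np.mean(frame_scores_temporal, axis=0)   # consensus over optical-flow-block frames
    return (kbn_s + kbn_t) / 2                       # image-data-based prediction probability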
Behavior recognition is a popular research direction in the field of computer vision, has broad application prospects in public safety, human-computer interaction and other areas, and is of significant research interest. Behavior recognition methods are mainly of two types, based on skeleton data and on image data. This embodiment provides a behavior recognition network model that fuses skeleton data and image data, fully mines the skeleton motion information, establishes dependencies between remote joint points, and enhances the recognition of detailed actions. Local image data are further fused with the skeleton data, supplementing rich action detail information while avoiding high computational cost. This embodiment achieves a recognition accuracy of 98.65% on the NTU60 data set. The embodiment provides behavior recognition network models based on skeleton data and on image data, and the two network models are fused into a complete system:
the embodiment provides a motion information guidance sampling module and a multi-scale motion information fusion module, aiming at the problem that the existing skeleton behavior identification method does not fully mine skeleton motion information. In the motion information guiding and sampling module, the sum of coordinate displacements of each joint point of two adjacent frames is provided as an index for measuring the motion information of the skeleton coordinate, and the sampling is guided by the measuring index, so that the skeleton obtained by sampling has richer motion information, and the identification accuracy is further improved. In the multi-scale motion information fusion module, solidified motion information and self-adaptive motion information are provided and fused with static information, so that the model input has rich motion information, the adaptability of the model to actions with different change speeds and different durations is further enhanced, and the accuracy of the model is improved.
Aiming at the problem that graph convolutional networks cannot establish long-distance dependencies between joint points that are far apart in the skeleton, this embodiment provides the Transformer-based skeleton behavior recognition network MSST-RT. The network introduces a virtual node in each of the spatial and temporal domains, establishes direct contact (virtual connections) between this node and every joint point, and lets the node collect and integrate joint point information so that the virtual node updates itself autonomously; each joint point acquires local information from its adjacent nodes through the bones (intrinsic connections) and global information from the virtual node through the virtual connections, thereby updating itself. Through these two updates, every joint point completes information exchange with any other joint point, long-distance dependencies are established, and spatio-temporal features are extracted.
Aiming at the problem that image data contain detail information lacking in skeleton data but the related models are costly to train, this embodiment provides a joint point based picture cropping module and a behavior recognition network KBN based on key image blocks. In the joint point based image cropping module, in order to reduce the image data size and the training cost, this embodiment crops the positions of the person's hands and feet in the image according to the joint point coordinates to obtain several image blocks, and this set of image blocks is used in place of the full image for feature extraction. In the KBN model, this embodiment establishes an information exchange channel between spatial positions through a Non-Local module, thereby realizing global information exchange between image blocks and taking both the independence and the correlation between image blocks into account, which improves the recognition of local fine human motions. Finally, the behavior recognition network model based on skeleton data and the behavior recognition network model based on image data are fused, giving full play to the complementarity of the two kinds of data.
Example two:
this embodiment is substantially the same as the first embodiment, and is characterized in that:
In this embodiment, the action recognition method for public safety is evaluated using the skeleton data and image data in the NTU60 data set; the training set and test set are divided according to the Cross-Subject (C-Subject) protocol, and model performance is measured by Top-1 accuracy.
(1) Behavior recognition network model based on skeleton data
Static information, adaptive motion information, short-term motion information and long-term motion information in the multi-scale motion information fusion module are mapped from a space with a dimension of 3 to a space with a dimension of 64 through a first 1 x 1 convolution, and then mapped from the space with the dimension of 64 to high-dimensional spaces with dimensions of 256, 128 and 128 through a second 1 x 1 convolution respectively.
The number of SRT modules and TRT modules in the MSST-RT model is set to 3, the number of heads of the multi-head attention mechanism is set to 8, and batch normalization is adopted for normalization. All experiments are implemented with the PyTorch framework; model training uses the Adam optimizer with parameters β = [0.9, 0.98] and ε = 10^-9. Training is divided into two stages: 1) in the first stage (the first 700 iterations), the learning rate is linearly increased from 4 × 10^-7 to 5 × 10^-4 by warm-up; 2) in the second stage, the learning rate is gradually decreased by a natural exponential decay strategy with decay weight 0.9996. This training schedule accelerates model convergence and makes training more stable. During training, the batch size is set to 64 and the number of training epochs to 30. All experiments also adopt a label smoothing strategy with ε_ls = 0.1.
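The two-stage schedule (700-iteration linear warm-up from 4e-7 to 5e-4, then exponential decay with factor 0.9996) could be written, for instance, as the helper below; hooking it to an optimizer via LambdaLR is an implementation assumption.

def lr_at(step, warmup=700, lr_start=4e-7, lr_peak=5e-4, decay=0.9996):
    """Learning rate at a given training step for the two-stage schedule."""
    if step < warmup:
        # stage 1: linear warm-up from lr_start to lr_peak
        return lr_start + (lr_peak - lr_start) * step / warmup
    # stage 2: natural exponential decay with weight 0.9996 per step
    return lr_peak * decay ** (step - warmup)

# Assumed usage with PyTorch:
# scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: lr_at(s) / 5e-4)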
In terms of data processing, the coordinate displacement between each joint point of a frame and the same joint point of the first frame is used, instead of the original coordinates, to describe the skeleton of each frame. Some actions in the training set are two-person interactions, i.e. two skeletons appear in the same frame, such as hugging or shaking hands. In this case, a frame containing two skeletons is split into two frames, each containing one skeleton. In addition, data augmentation is performed by randomly rotating the 3D skeleton to obtain more distinct samples, which enhances the generalization ability of the network to a certain extent.
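A hedged sketch of the random 3D skeleton rotation used for augmentation follows; rotating about the vertical axis with a uniformly drawn angle is an assumption, since the patent does not specify the axis or angle range.

import numpy as np

def random_rotate_skeleton(joints, max_angle=np.pi / 6):
    """Randomly rotate a 3D skeleton sequence to generate extra training samples.
    joints: (T, N, 3) array of joint coordinates."""
    theta = np.random.uniform(-max_angle, max_angle)
    rot = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                    [ 0.0,           1.0, 0.0          ],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    return joints @ rot.T    # same action class, different viewpoint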
(2) Behavior recognition network model based on image data
The experiments in this section are all implemented with the PyTorch framework, and the network parameters are learned with stochastic gradient descent with a momentum value of 0.9. For training the spatial stream of the KBN network, the batch size is set to 24, the number of training epochs to 80, and the initial learning rate to 0.001; the learning rate is updated at epochs 25, 45 and 70, each update halving it. The network parameters are initialized with a model pre-trained on the ImageNet data set. For training the temporal stream of the KBN network, the batch size is set to 24, the number of training epochs to 300, and the initial learning rate to 0.001; the learning rate is updated at epochs 50, 100, 150 and 200, each update halving it. When the gradient value exceeds 20 during training, gradient clipping is applied, which effectively avoids gradient explosion. To accelerate convergence, the KBN temporal stream network is initialized with the model parameters of the KBN spatial stream. The TV-L1 algorithm provided by the CUDA version of OpenCV is used to extract the optical flow of the image blocks.
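The optimizer, step-wise learning-rate halving and gradient clipping described above could be set up roughly as follows in PyTorch; the model variable is a placeholder, and clipping by gradient norm at 20 is one reading of "when the gradient value is larger than 20".

import torch

def make_kbn_spatial_optim(model):
    """SGD with momentum 0.9, LR 0.001 halved at epochs 25/45/70 (spatial stream)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[25, 45, 70], gamma=0.5)
    return opt, sched

def train_step(model, opt, loss):
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20)  # avoid gradient explosion
    opt.step()
    opt.zero_grad()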
Table 3. Performance of methods on the NTU60 data set
In the action recognition method for public safety of this embodiment, the recognition network is divided into two branches according to the type of data: a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data. The former extracts skeleton features through a lightweight network, is good at recognizing large-amplitude actions, and plays the main role in the action recognition task; the latter reduces training cost by cropping the images, extracts image features from the key image blocks, is good at recognizing small-amplitude actions concentrated on the hands and feet, and supplements detailed information in the action recognition task.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made according to the purpose of the invention, and all changes, modifications, substitutions, combinations or simplifications made according to the spirit and principle of the technical solution of the present invention shall be made in the form of equivalent substitution, so long as the invention is in accordance with the purpose of the invention, and the invention shall fall within the protection scope of the present invention as long as the technical principle and the inventive concept of the present invention are not departed from the present invention.

Claims (8)

1. A motion recognition method based on skeleton and image data fusion is characterized in that: respectively establishing a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data to form a recognition network, extracting skeleton characteristics by using a lightweight network for recognizing actions with larger amplitude and completing a main action recognition task by the behavior recognition network model based on the skeleton data, wherein model input data of the behavior recognition network model based on the skeleton data are skeleton sequences, and the input data sequentially pass through a sampling module guided by coordinate motion information, a multi-scale motion information fusion module and a multi-stream spatiotemporal relative transform model to obtain action category prediction probability; extracting image features from image blocks by a picture cutting method based on a behavior recognition network model of image data, and identifying small-amplitude actions concentrated on hands and feet, and supplementing detailed information in an action recognition task; identifying model input data information of a network model into an image sequence based on the behavior of image data, wherein the input data sequentially passes through a picture cutting module based on joint points and a key image block feature extraction model (KBN) to obtain a supplementary action category prediction probability; and fusing the action type prediction probabilities obtained by the action recognition network model based on the skeleton data and the action recognition network model based on the image data to obtain the final classification prediction probability of the whole model, thereby completing the action recognition process for public safety.
2. The method of claim 1, wherein the method comprises: in a behavior recognition network model based on skeleton data, a frame sampling module guided by coordinate motion information screens out a representative skeleton sequence from skeleton sequences according to coordinate motion information measurement indexes; the multi-scale motion information fusion module fuses the static information of the skeleton with the multi-scale motion information, and sets two different types of motion information according to the characteristics that different actions of a human body have different change speeds and duration, namely, solidified motion information and self-adaptive motion information; the solidification motion information comprises two different scales, so that the network adapts to actions with different change speeds; the self-adaptive motion information enables the identification network to have the capability of identifying actions with different durations; establishing long-distance connection for each joint point on a time-space domain by using the multi-stream space-time relative Transformer model, wherein the multi-stream space-time relative Transformer model is as follows: setting a space topological graph based on a framework on a space domain, and constructing a space relative Transformer module for establishing remote dependence of joint points in an airspace; on a time domain, constructing a time topological graph based on a skeleton sequence, and establishing a time domain relative Transformer module for establishing remote dependence of joint points in the time domain; then, combining the space and time domain relative modules to obtain a space-time relative transform model, and further extracting the space-time characteristics of the framework sequence; and (3) fusing at least 4 different space-time relative models of input data by adopting a multi-time scale frame to construct a multi-stream space-time relative transform model.
3. The method of claim 2, wherein the method comprises: the coordinate motion information directed frame sampling module comprises:
1.1 designing indexes for measuring coordinate motion information:
in the skeleton data, joint points are represented by 3D coordinates; the displacement distance of a joint point between two adjacent frames is taken as an index measuring the motion information contained in that joint point, and the sum of the displacement distances of all joint points of the skeleton is taken as an index measuring the motion information contained in the whole skeleton, which in turn determines whether the skeleton is representative; assume that the joint point labeled i in the t-th frame has coordinates p_t^i and the joint point labeled i in the (t−1)-th frame has coordinates p_{t−1}^i; the coordinate motion information M_t contained in the t-th frame is shown in equation (1):

M_t = Σ_{i=1}^{N} ‖ p_t^i − p_{t−1}^i ‖  (1)
wherein, N represents the number of joint points contained in a frame;
in order to eliminate the scale expansion effect caused by different video lengths, the coordinate motion information contained in each frame is normalized, as shown in formula (2):
M̄_t = M_t / Σ_{k=1}^{T} M_k  (2)
wherein, T represents the number of frames contained in the video;
1.2, sampling the video by adopting a cumulative distribution function:
assuming that N frames need to be sampled from a video with a length of T, the specific operation is as follows:
firstly, the skeleton coordinate motion information is accumulated frame by frame to obtain the cumulative coordinate motion information; the cumulative coordinate motion information C_t of the t-th frame is calculated as shown in equation (3):

C_t = Σ_{k=1}^{t} M̄_k  (3)

according to the cumulative coordinate motion information C_t, the sequence is divided into N segments, and one frame is randomly sampled from each of the N segments to form a new sequence, so that a representative skeleton series is screened out of the skeleton sequence through this measurement index.
4. The method of claim 2, wherein the method comprises: the multi-scale motion information fusion module comprises:
2.1 designing different scale motion information:
from the original skeleton sequence I_origin = [I_1, …, I_F], T frames are selected by sampling and combined in their original order into a new skeleton sequence I_new = [I_1, …, I_T], where F represents the total number of frames of the original skeleton sequence and I represents the coordinates of all joint points in each frame; motion information is obtained by computing the coordinate displacement of the same joint point between two frames, where I_t^{origin} denotes the joint point coordinates of the t-th frame of the original skeleton sequence I_origin and I_t^{new} denotes the joint point coordinates of the t-th frame of the sampled skeleton sequence I_new;
the adaptive motion information M_a is obtained by subtracting the joint point coordinates of two consecutive frames of the skeleton sequence I_new, so that motion information of different scales is obtained from videos of different lengths, as shown in equations (4) and (5):

M_a^t = I_{t+1}^{new} − I_t^{new}  (4)

M_a = [M_a^1, M_a^2, …, M_a^{T−1}]  (5)

where M_a^t represents the adaptive motion information of the t-th frame of the new skeleton sequence I_new;
the motion information is divided into two categories: short-distance motion information M_s and long-distance motion information M_l; the short-distance motion information M_s is obtained by subtracting the coordinates of skeleton joint points that are 2 frames apart in the original skeleton sequence I_origin, and is used to capture the motion information of rapidly changing motions; the calculation is shown in equations (6) and (7):

M_s^t = I_{f+2}^{origin} − I_f^{origin}  (6)

M_s = [M_s^1, M_s^2, …, M_s^T]  (7)

where M_s^t represents the short-distance motion information of the t-th frame of the new skeleton sequence, f is the index of the t-th frame of the new skeleton sequence I_new in the original skeleton sequence I_origin, and I_f^{origin} denotes the joint point coordinates of the f-th frame of the original skeleton sequence;
the long-distance motion information M_l is obtained by subtracting the coordinates of skeleton joint points that are 5 frames apart in the original skeleton sequence I_origin, and is used to capture the motion information of slowly changing motions; the calculation is shown in equations (8) and (9):

M_l^t = I_{f+5}^{origin} − I_f^{origin}  (8)

M_l = [M_l^1, M_l^2, …, M_l^T]  (9)

where M_l^t represents the long-distance motion information of the t-th frame of the new skeleton sequence, and f is the index of the t-th frame of the new skeleton sequence I_new in the original skeleton sequence I_origin;
2.2, high-dimensional mapping of different-scale motion information:
the static information of the skeleton I_new, the adaptive motion information M_a, the short-term motion information M_s and the long-term motion information M_l are all tensors of shape (T, N, C_0), where T represents the number of video frames, N represents the number of joint points of a skeleton, and C_0 represents the coordinate dimension of a joint point; the four kinds of information are mapped to a high-dimensional space through an embedding module (Embedding block) to obtain the high-dimensional features F, F_ma, F_ms and F_ml; the embedding module consists of two convolutional layers and two activation layers (ReLU):
the first convolution maps each kind of information to a space of dimension C, and the second convolution maps them respectively to high-dimensional spaces of dimensions C_1, C_2, C_3 and C_4; the convolution kernels corresponding to different motion information are independent of one another and do not share parameters; taking the static information I_new as an example, the two-stage mapping of the embedding module is shown in equation (10):

F = σ(W_2(σ(W_1 I_new + b_1)) + b_2)  (10)

where σ denotes the activation function, W_1, b_1 represent the parameters of the first convolution function, W_2, b_2 represent the parameters of the second convolution function, the parameters being obtained by learning, and I_new represents the static information;
2.3, multi-scale motion information fusion:
the various kinds of information are fused through a stacking operation (concat) to obtain the dynamic representation Z of the skeleton, as shown in equation (11); this operation makes the dynamic representation Z of the skeleton contain multi-scale motion information, further improving the ability of the network to adapt to actions with different change speeds and different durations;

Z = concat(F, F_ma, F_ms, F_ml)  (11)

the four high-dimensional features are fused to obtain Z, which is output by the multi-scale motion information fusion module.
5. The method of claim 2, wherein the method comprises: the multi-stream spatiotemporal relative transform model comprises:
3.1, constructing a space topological graph based on a framework:
except original joint points in a skeleton, a virtual node is introduced in the step, a new space topological graph is formed together with all the joint points and serves as model input, the introduced virtual node not only needs to collect integrated information from all the joint points, but also plays a role in distributing the integrated global information to all the joint points, and the virtual node is named as a space relay node;
meanwhile, two types of connection are established among nodes, namely space inherent connection and space virtual connection, so as to construct a space topological graph of the framework; the space diagram structure comprising n joint points has n-1 space inherent connections;
3.2, designing a space relative Transformer module:
the module comprises a spatial joint point update module (SJU) and a spatial relay node update module (SRU), and establishes connections for remote joint points in the spatial domain by alternately updating the SJU module and the SRU module; the model input is the joint point sequence in the t-th frame skeleton, X_t = {x_t^1, …, x_t^N}, where N represents the number of joint points in this frame and N_i denotes the set of labels of all joint points adjacent to x_t^i; each node has a corresponding query vector q_t^i, key vector k_t^i and value vector v_t^i;
in the spatial joint point update module (SJU), for any joint point x_t^i, the dot product of its corresponding query vector q_t^i and the key vector k_t^j of each of its neighbor nodes is first computed to obtain the influence of each neighbor node on the joint point, as shown in equation (12):

α_t^{i,j} = (q_t^i)^T · k_t^j  (12)
where α_t^{i,j} represents the influence strength of node j on node i; the neighbor nodes comprise the adjacent joint points {x_t^j, j ∈ N_i}, the spatial relay node R_t and the joint point itself x_t^i, where r represents the label of the spatial relay node;
after the influence strengths α_t^{i,j} are computed, they are multiplied by the value vectors v_t^j of the corresponding neighbor nodes, and all the products are summed to obtain the updated value of the joint point x_t^i, as shown in equation (13):

z_t^i = Σ_{j ∈ N_i ∪ {r, i}} softmax_j(α_t^{i,j} / √d_k) · v_t^j  (13)

where z_t^i is the result of one update of the joint point by the joint point update submodule (SJU), which aggregates both local and global information; d_k denotes the channel dimension of the key vector and serves as a normalization; softmax_j normalizes the influence strengths over all neighboring joint points;
in order for the spatial relay node to collect and integrate the information of each joint point reasonably and fully, the spatial relay node update submodule (SRU) also uses the dot-product operation to calculate the influence of each joint point on the relay node, and integrates the information of all joint points into global information according to these influence strengths; the influence strength α_t^{r,i} is obtained by multiplying the query vector q_t^r corresponding to the relay node with the key vector k_t^i corresponding to each joint point, as shown in equation (14):

α_t^{r,i} = (q_t^r)^T · k_t^i  (14)

the update of the spatial relay node is shown in equation (15), where α_t^{r,i} represents the influence score of the joint point x_t^i on the spatial relay node R_t and v_t^i is the value vector of each node:

R_t ← Σ_{i=1}^{N} softmax_i(α_t^{r,i} / √d_k) · v_t^i  (15)
the alternate updating of the joint points and the spatial relay nodes realizes the exchange of information among the joint points, and finally realizes the goal that each joint point simultaneously collects the information of the neighbor joint points and the remote joint points;
3.3, constructing a time topological graph based on the skeleton sequence:
a time relay node is introduced when a time topological graph is constructed, and all joints are connected with each other through time inherent connection and time virtual connection to jointly form a graph structure in a time domain;
along the time dimension, the same joint points in the continuous frames form a new sequence, and the step also constructs connection for the joint points at the head and the tail to form a ring structure; the sequence of n nodes contains n time-dependent connections;
3.4, designing a TRT module:
the Temporal Relative Transformer module (TRT) comprises a temporal joint point update module (TJU) and a temporal relay node update module (TRU) and is used for extracting time-domain features; the module treats each joint point in the skeleton as an independent node and, for each joint, takes the sequence formed by the same joint point across the frame sequence as the object from which its time-domain features are extracted; the input of the TRT module is the sequence X_v = {x_v^1, x_v^2, …, x_v^T} formed by the same joint over all frames; each joint point x_v^i has its corresponding query vector q_v^i, key vector k_v^i and value vector v_v^i; the temporal relay node R_v has its corresponding query vector q_v^r, key vector k_v^r and value vector v_v^r;
In the TJU submodule, each joint point to be updated
Figure FDA00038521182100000515
Collecting information of neighbor nodes through virtual connection to perform self-updating; the influence calculation formula of the neighbor node is shown as (16):
Figure FDA00038521182100000516
wherein the content of the first and second substances,
Figure FDA00038521182100000517
indicating the same node or time relay node R in the jth frame v The influence strength on a certain joint point in the ith frame,
Figure FDA00038521182100000518
pair of representations
Figure FDA00038521182100000519
Performing transposition processing; joint point
Figure FDA00038521182100000520
Is as shown in equation (17):
Figure FDA00038521182100000521
all query vectors q_v^i are combined into a matrix Q_v ∈ R^{C×1×t}, all key vectors k_v^j are combined into a matrix K_v ∈ R^{C×B×t}, and all value vectors v_v^j are combined into a matrix V_v ∈ R^{C×B×t}; the matrix form of the influence strength is defined as shown in equation (18):

A_v = Σ_{c=1}^{C} (Q_v ∘ K_v) / √d_k  (18)

where B represents the number of neighbor nodes, ∘ represents the Hadamard product (Q_v being broadcast along the neighbor dimension B), and the sum runs over the channel dimension C;
in the TRU module, the temporal relay node R_v collects information from the other frames through virtual connections, thereby completing its own update; the specific operations are shown in equations (19) and (20):

β_v^{j} = (q_v^r)^T · k_v^j / √d_k  (19)

R_v ← Σ_{j=1}^{T} softmax_j(β_v^{j}) · v_v^j  (20)

where β_v^{j} represents the influence of the joint point x_v^j in the j-th frame on the relay node R_v, and √d_k is a scaling factor;
3.5, packaging an ST-RT module:
the ST-RT module is obtained by connecting and combining an SRT module and a TRT module, wherein the SRT module comprises a spatial joint point update module and a spatial relay node update module, and the TRT module comprises a temporal joint point update module and a temporal relay node update module; each update module is followed by a feed-forward network layer, which maps the features to a space of larger dimension to enhance the expressive capability of the model; L× denotes repeating the block L times;
3.6, encapsulating the MSST-RT network:
the four ST-RT models with different input data are fused and encapsulated through a multi-stream framework to obtain the MSST-RT model; different sampling frequencies can also provide complementary information for the model, by sampling n_1 frames and n_2 frames for the joint and bone sequences, respectively; the skeleton data passes through the MSST-RT network to obtain the final classification prediction probability based on the skeleton data.
6. The method of claim 1, wherein the method comprises: in the behavior recognition network model based on the image data, a joint point-based picture cutting module selects joint points of hands and feet of a human body to be cut; and packaging the image block feature extraction model trained end to end into a key image block feature extraction model based on a time domain segmentation network as a basic framework by adopting the image block feature extraction model trained end to end.
7. The method of claim 6, wherein the method comprises: the joint point-based picture cropping module comprises:
the image of the t-th frame I_t is represented by a matrix P_t; the joint point N_j to be cropped has image coordinates (x, y), and the crop size is l × l; the image block set B_t obtained by cropping the image I_t around the hand and foot joint points N_j is given by equations (21) and (22):

B_t^j = P_t[x − l/2 : x + l/2, y − l/2 : y + l/2]  (21)

B_t = {B_t^j | N_j is a hand or foot joint point}  (22)
besides cropping the picture around the joint point coordinates, the optical flow is extracted from the image blocks corresponding to two adjacent frames, as shown in equation (23):

(u_t^x, u_t^y) = TV-L1(B_t^j, B_{t+1}^j)  (23)

where TV-L1 is a classical optical flow calculation method, u_t^x represents the optical flow field in the x-axis direction, and u_t^y represents the optical flow field in the y-axis direction.
8. The method of claim 6, wherein the method comprises: the behavior recognition network based on the key image blocks comprises:
5.1, designing an IBCN model:
there is both independence and correlation between the image blocks cropped based on the skeleton joint points; the IBCN model first inputs each cropped image block B_t^j into a convolutional neural network to obtain the feature f_t^j of each image block, as shown in equation (24):

f_t^j = CNN_W(B_t^j)  (24)

where CNN_W(·) represents the extraction of features from the image block B_t^j by a convolutional neural network with parameters W, and the convolutional neural network parameters are shared; the features f_t^j of the image blocks are then concatenated to obtain a new feature vector F_t, as shown in equation (25):

F_t = concat(f_t^1, f_t^2, …, f_t^J)  (25)
Finally, calculating a characteristic vector F by a point multiplication mode t At an arbitrary spatial position x i With other positions x j Similarity f (x) of i ,x j ) As shown in equation (26):
f(x i ,x j )=softmax(θ(x i ) T ·φ(x j )) (26)
wherein θ (-) and φ (-) are 1 × 1 convolution functions;
the obtained similarity f(x_i, x_j) is used as a weight for a weighted sum with g(x_j), so that position x_i obtains information from the other positions; y_i is the result of x_i after global information exchange, as shown in equation (27):

y_i = (1 / (N l'^2)) Σ_j f(x_i, x_j) · g(x_j)  (27)

where g(·) is a mapping function, implemented as a 1 × 1 convolution; N l'^2, the size of the selected feature map F_t, is used as a normalization coefficient to avoid scale expansion caused by different input sizes; when the input is the feature tensor, the formula is shown in equation (28):

Y = (1 / (N l'^2)) · softmax(θ(F_t)^T · φ(F_t)) · g(F_t)  (28)

where θ(·), φ(·) and g(·) are all 1 × 1 convolution functions and N l'^2 is the normalization coefficient;
5.2, packaging a KBN network:
the IBCN model is encapsulated into the KBN network with the TSN network as the framework; the network is divided into a spatial stream and a temporal stream, the input data of the spatial stream being the image blocks and the input data of the temporal stream being the corresponding optical flow blocks; taking the spatial stream as an example, several frames are first sampled from the video through sparse sampling, and each frame is processed by the joint-point-based image cropping module; the key image block set B_{T_K} corresponding to each sampled frame is then input into an IBCN model to obtain a preliminary class-probability prediction for that frame, the parameters of the IBCN models being shared; the prediction classification results of all sampled frames are then fused through a consensus function to obtain the video-level classification prediction, as shown in equation (29):

KBN-S(T_1, T_2, …, T_K) = Consensus(IBCN(B_{T_1}; W), IBCN(B_{T_2}; W), …, IBCN(B_{T_K}; W))  (29)

where KBN-S is the prediction result of the spatial stream of the KBN network, T_K represents the K-th segment after segmentation of the video, B_{T_K} represents the set of image blocks corresponding to the K-th sampled frame, and IBCN(B_{T_K}; W) represents the processing of the image block set B_{T_K} by the IBCN module; the temporal stream prediction result is calculated in the same way as the spatial stream.
CN202211137852.6A 2022-09-19 2022-09-19 Motion recognition method based on skeleton and image data fusion Pending CN115841697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211137852.6A CN115841697A (en) 2022-09-19 2022-09-19 Motion recognition method based on skeleton and image data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211137852.6A CN115841697A (en) 2022-09-19 2022-09-19 Motion recognition method based on skeleton and image data fusion

Publications (1)

Publication Number Publication Date
CN115841697A true CN115841697A (en) 2023-03-24

Family

ID=85575452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211137852.6A Pending CN115841697A (en) 2022-09-19 2022-09-19 Motion recognition method based on skeleton and image data fusion

Country Status (1)

Country Link
CN (1) CN115841697A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665308A (en) * 2023-06-21 2023-08-29 石家庄铁道大学 Double interaction space-time feature extraction method
CN116665308B (en) * 2023-06-21 2024-01-23 石家庄铁道大学 Double interaction space-time feature extraction method
CN117137435A (en) * 2023-07-21 2023-12-01 北京体育大学 Rehabilitation action recognition method and system based on multi-mode information fusion
CN116665312A (en) * 2023-08-02 2023-08-29 烟台大学 Man-machine cooperation method based on multi-scale graph convolution neural network
CN116665312B (en) * 2023-08-02 2023-10-31 烟台大学 Man-machine cooperation method based on multi-scale graph convolution neural network
CN117197727A (en) * 2023-11-07 2023-12-08 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117197727B (en) * 2023-11-07 2024-02-02 浙江大学 Global space-time feature learning-based behavior detection method and system
CN117612072A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video understanding method based on dynamic space-time diagram
CN117612072B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video understanding method based on dynamic space-time diagram


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination