CN115841697A - Motion recognition method based on skeleton and image data fusion - Google Patents
Motion recognition method based on skeleton and image data fusion
- Publication number
- CN115841697A (application CN202211137852.6A)
- Authority
- CN
- China
- Prior art keywords
- skeleton
- motion information
- module
- joint
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses an action recognition method based on the fusion of skeleton and image data, comprising the following components: a behavior recognition network model based on skeleton data, which comprises a sampling module guided by coordinate motion information, a multi-scale motion information fusion module and a multi-stream spatio-temporal relative Transformer model; and a behavior recognition network model based on image data, which comprises a joint-point-based picture cropping module and a key image block feature extraction model. The action category prediction probabilities produced by the two network models are fused to obtain the final classification prediction probability of the whole model, completing the action recognition process for public safety. The recognition network fully mines the skeleton motion information, establishes dependencies between distant joint points, and enhances the recognition of detailed actions; fusing local image data with skeleton data further supplements rich action detail information while avoiding high computing cost.
Description
Technical Field
Behavior recognition is a technology that analyzes data such as video with a specific algorithm to determine the type of action a person is performing. It underlies many applications, such as public safety management, human-computer interaction, intelligent elderly care and intelligent medical treatment, and has broad application prospects; research on behavior recognition therefore has important theoretical significance and practical value. In real scenes, behavior recognition is a very challenging task: it is easily affected by external factors such as illumination, background and shooting angle, and because different people perform the same action in different ways, intra-class variation is large. Because it is challenging and spans multiple disciplines, behavior recognition is also a research hotspot in the field of computer vision.
Background
Behavior recognition methods based on deep neural networks can be classified, according to the type of input data, into image-based and skeleton-based methods. Image-based behavior recognition methods recognize human actions in video by analyzing RGB image sequences, and mainly comprise the following three families of models:
1) Two-stream network models represented by the Temporal Segment Network (TSN);
2) 3D convolutional neural network models represented by the three-dimensional convolutional network C3D (Convolutional 3D);
3) 2D convolutional neural network models represented by the Temporal Difference Network (TDN).
In recent years, research along these lines has been widespread and has achieved state-of-the-art performance. The input of the above models is usually an image obtained by scaling and randomly cropping a video frame. Although this reduces the image size to some extent, the following defects remain: 1) the reduced size lowers image precision, which hurts the model's recognition of subtle actions; 2) even with reduced image size, the training data scale is still large, the video memory requirement is high, and the computation latency is long.
Skeleton-based behavior recognition methods recognize human actions by analyzing skeleton sequences. As early as the 1970s, Johansson et al. demonstrated that skeletal data can effectively describe human motion. With the development of human motion estimation technology, such as advanced human pose estimation algorithms and multi-modal sensors, the acquisition cost of skeleton data has fallen. On this basis, researchers have carried out a large number of studies on skeleton-based behavior recognition methods, which are mainly classified into three categories: network models based on the Recurrent Neural Network (RNN), on the Convolutional Neural Network (CNN), and on the Graph Convolutional Network (GCN). RNN-based and CNN-based models treat the skeleton as a sequence or a pseudo-image, which destroys the topology of the skeleton. GCN-based models instead extract skeleton features through graph convolution, preserving the natural structure of the skeleton, and their performance has improved rapidly; in recent years GCN-based models have become the mainstream method in skeleton-based behavior recognition. Although GCN-based models achieve state-of-the-art performance, the following drawbacks exist: 1) motion information plays an important role in video classification tasks such as behavior recognition, but existing methods do not fully mine the motion information contained in the skeleton sequence; 2) the receptive field of a graph convolutional network is limited by the size of the convolution kernel, so long-distance connections cannot be established between joint points that are far apart in the skeleton.
Beyond these deficiencies, image data and skeleton data each have their own limitations. Image data carries rich scene and detail information but is easily disturbed by environmental factors such as illumination; it is also large in scale, so the training time of the corresponding models is long. Skeleton data describes human movement more compactly: the data volume is small, the hardware requirement is low, and compared with image data it is less susceptible to external factors (such as illumination and occlusion), giving it strong robustness. Despite these advantages, skeleton data lacks the scene and detail information specific to images, both of which play an important role in behavior recognition and are particularly significant when a motion is subtle or depends on the scene. In conclusion, image data and skeleton data are highly complementary, so fusing an image-based behavior recognition network model with a skeleton-based one is of clear research significance.
Disclosure of Invention
In order to solve the problems in the prior art, the present invention aims to overcome the above disadvantages and provides a method for recognizing actions for public safety. The recognition network model, based on skeleton and image data, is divided into two branches according to data type: a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data. The former extracts skeleton features through a lightweight network, is good at recognizing large-amplitude actions, and plays the main role in the action recognition task; the latter reduces training cost by cropping images, extracts image features from key image blocks, is good at recognizing small-amplitude actions concentrated on the hands and feet, and supplements detail information in the action recognition task.
In order to achieve the purpose of the invention, the invention adopts the following technical scheme:
a behavior recognition method for public safety, characterized in that a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data are respectively established to form a recognition network. The skeleton-based model extracts skeleton features through a lightweight network, is used for recognizing large-amplitude actions and completes the main action recognition task; its input data is a skeleton sequence, which passes in turn through a sampling module guided by coordinate motion information, a multi-scale motion information fusion module and a multi-stream spatio-temporal relative Transformer model to obtain an action category prediction probability. The image-based model extracts image features from image blocks through a picture cropping method, recognizes small-amplitude actions concentrated on the hands and feet, and supplements detail information in the action recognition task; its input data is an image sequence, which passes in turn through a joint-point-based picture cropping module and a key image block feature extraction network (KBN) to obtain a supplementary action category prediction probability. The action category prediction probabilities obtained by the two models are fused to obtain the final classification prediction probability of the whole model, thereby completing the action recognition process for public safety.
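The fusion of the two branches' class probabilities can be sketched as a weighted late fusion; the weights below are illustrative, since the claim does not fix the fusion coefficients:

```python
import numpy as np

def fuse_predictions(p_skeleton, p_image, w_skeleton=0.7, w_image=0.3):
    """Weighted late fusion of the class probabilities produced by the
    skeleton branch and the image branch; the weights are illustrative."""
    p = w_skeleton * np.asarray(p_skeleton) + w_image * np.asarray(p_image)
    return p / p.sum()          # renormalize to a probability distribution

# Each branch emits a probability distribution over four action classes.
p_final = fuse_predictions([0.6, 0.2, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1])
predicted_class = int(np.argmax(p_final))
```

With the example weights the skeleton branch dominates, matching its role as the main recognizer while the image branch supplements detail.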
Preferably, in the behavior recognition network model based on skeleton data, the frame sampling module guided by coordinate motion information screens representative frames out of the skeleton sequence according to a coordinate motion information measurement index. The multi-scale motion information fusion module fuses the static information of the skeleton with multi-scale motion information; according to the characteristic that different human actions have different change speeds and durations, two different types of motion information are set, namely solidified motion information and adaptive motion information. The solidified motion information comprises two different scales, so that the network adapts to actions with different change speeds; the adaptive motion information gives the recognition network the ability to recognize actions of different durations. The multi-stream spatio-temporal relative Transformer model establishes long-distance connections for each joint point over the spatio-temporal domain, as follows: in the spatial domain, a skeleton-based spatial topology graph is set up and a spatial relative Transformer module is constructed to establish long-distance dependencies between joint points; in the time domain, a time topology graph based on the skeleton sequence is constructed and a temporal relative Transformer module is established to build long-distance dependencies of joint points in time. The spatial and temporal relative modules are then combined to obtain a spatio-temporal relative Transformer model that extracts the spatio-temporal features of the skeleton sequence; finally, a multi-time-scale framework fuses at least 4 spatio-temporal relative models with different input data to construct the multi-stream spatio-temporal relative Transformer model.
Further preferably, the coordinate motion information-guided frame sampling module includes:
1.1 designing indexes for measuring coordinate motion information:
in the skeleton data, joint points are represented by 3D coordinates. The displacement distance of a joint point across two adjacent frames is used as an index of the motion information it contains, and the sum of the displacement distances of all joint points in the skeleton measures the motion information contained in the whole frame, which in turn determines whether the frame is representative. Let the coordinate of the joint point labeled i in the t-th frame be $J_t^i$ and in the (t-1)-th frame be $J_{t-1}^i$; the coordinate motion information $M_t$ contained in the t-th frame is given by equation (1):

$$M_t = \sum_{i=1}^{N} \left\| J_t^i - J_{t-1}^i \right\|_2 \tag{1}$$
wherein, N represents the number of joint points contained in a frame;
in order to eliminate the scale effect caused by differing video lengths, the coordinate motion information contained in each frame is normalized, as shown in equation (2):

$$\hat{M}_t = \frac{M_t}{\sum_{\tau=1}^{T} M_\tau} \tag{2}$$
wherein T represents the number of frames contained in the video;
1.2, sampling a video by adopting a cumulative distribution function:
assuming that N frames need to be sampled from a video with a length of T, the specific operations are as follows:
firstly, the normalized skeleton coordinate motion information is accumulated frame by frame to obtain the cumulative coordinate motion information $C_t$ of the t-th frame, as shown in equation (3):

$$C_t = \sum_{\tau=1}^{t} \hat{M}_\tau \tag{3}$$
according toAnd dividing the sequence into N segments, and randomly sampling a frame from the N segments to form a new sequence, so that a representative skeleton series in the skeleton sequence is screened out through the measuring index.
Further preferably, the multi-scale motion information fusion module includes:
2.1 designing different scale motion information:
By sampling, T frames are selected from the original skeleton sequence $I_{origin} = [I_1, \ldots, I_F]$ and combined, in their original order, into a new skeleton sequence $I_{new} = [I_1, \ldots, I_T]$, where F represents the total number of frames of the original sequence and $I_t$ represents the coordinates of all joint points in frame t. Motion information is obtained by computing the coordinate displacement of the same joint point across two frames: $J_{origin,t}^i$ denotes the joint point labeled i in the t-th frame of the original sequence $I_{origin}$, and $J_{new,t}^i$ denotes the joint point labeled i in the t-th frame of the sampled sequence $I_{new}$.
adaptive motion information M a By the framework sequence I new The motion information of different scales is obtained from videos with different lengths by subtracting the coordinates of the joint points of two continuous frames, and the formula is as follows:
wherein the content of the first and second substances,representing a novel framework sequence I new Adaptive motion information of the ith frame; />
The solidified motion information is divided into two types: short-distance motion information $M_s$ and long-distance motion information $M_l$. The short-distance motion information $M_s$ is obtained by subtracting the coordinates of skeleton joint points 2 frames apart in the original sequence $I_{origin}$, and is used to capture rapidly changing motion; the calculation formula is as follows:

$$M_s^t = I_{origin}^{f+2} - I_{origin}^{f}$$
where $M_s^t$ represents the short-distance motion information of the t-th frame of the new skeleton sequence, and f is the index in the original sequence $I_{origin}$ of the t-th frame of the new sequence $I_{new}$;
long distance motion information M i Through the proto-framework sequence I origin The coordinates of the skeletal joint points which are separated by 5 frames are subtracted, and the coordinates are used for capturing motion information of slowly changing motion, and the calculation formula is expressed as follows:
wherein, the first and the second end of the pipe are connected with each other,long-distance motion information of the ith frame in the new skeleton sequence is shown, f is shown as the new skeleton sequence I new In the original frame sequence I origin The number in (1);
2.2, high-dimensional mapping of different-scale motion information:
static information of skeleton I new Adaptive motion information M a Short-term exercise information M s And long-term exercise information M l Are all (T, N, C) 0 ) Where T represents the number of video frames, N represents the number of joints of a skeleton, C 0 A coordinate dimension representing a joint point; mapping the four kinds of information to a high-dimensional space through an Embedding module (Embedding block) to obtain a high-dimensional feature F, F ma 、F ms And F ml (ii) a The embedded module is composed of two convolutional layers and two active layers (ReLU):
the first convolution maps various information to a space with the dimension of C, and the second convolution maps various information to the space with the dimension of C respectively 1 、,C 2 、C 3 、C 4 A high dimensional space of (a); convolution kernels corresponding to different motion information are mutually independent, and parameters are not shared; with static information I new For example, the embedded module quadratic mapping formula is shown as (10):
$$F = \sigma\!\left(W_2\left(\sigma\!\left(W_1 I_{new} + b_1\right)\right) + b_2\right) \tag{10}$$
where σ denotes the activation function, $W_1$, $b_1$ are the parameters of the first convolution and $W_2$, $b_2$ those of the second; the parameters of both convolutions are learned, and $I_{new}$ represents the static information;
2.3, multi-scale motion information fusion:
the various kinds of information are fused through a stacking operation (concat) to obtain the dynamic representation Z of the skeleton, as shown in equation (11). This operation makes Z contain multi-scale motion information, further improving the network's ability to adapt to actions with different change speeds and durations:
$$Z = \mathrm{concat}(F,\, F_{ma},\, F_{ms},\, F_{ml}) \tag{11}$$
The four high-dimensional features are fused to obtain Z, which is the output of the multi-scale motion information fusion module.
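A minimal numpy sketch of the motion-information computation and the concat fusion, with the high-dimensional embedding (the two convolutions of equation (10)) omitted; the clamping of out-of-range frame offsets is an assumed boundary policy:

```python
import numpy as np

def multiscale_motion(origin, sample_idx):
    """Compute adaptive (M_a), short-distance (M_s) and long-distance (M_l)
    motion information from a skeleton sequence of shape (F, N, C0) and the
    indices of the sampled frames, then concat them with the static
    information along the channel axis (eq. (11))."""
    origin = np.asarray(origin, dtype=float)
    F = origin.shape[0]
    idx = np.asarray(sample_idx)
    new = origin[idx]                                   # I_new: (T, N, C0)
    # adaptive: consecutive sampled frames (last frame padded with zeros)
    m_a = np.concatenate([new[1:] - new[:-1], np.zeros_like(new[:1])])
    m_s = origin[np.minimum(idx + 2, F - 1)] - new      # 2 frames apart
    m_l = origin[np.minimum(idx + 5, F - 1)] - new      # 5 frames apart
    z = np.concatenate([new, m_a, m_s, m_l], axis=-1)   # (T, N, 4*C0)
    return m_a, m_s, m_l, z

# Synthetic sequence where frame t holds the constant value t everywhere.
origin = np.arange(10)[:, None, None] * np.ones((10, 3, 3))
m_a, m_s, m_l, z = multiscale_motion(origin, [0, 3, 6])
```

On this synthetic input the short-distance differences are all 2 and the long-distance differences 5 (3 at the clamped tail), matching the frame offsets by construction.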
Further preferably, the multi-stream spatiotemporal relative transform model comprises:
3.1, constructing a space topological graph based on a framework:
in addition to the original joint points of the skeleton, this step introduces a virtual node that, together with all the joint points, forms a new spatial topology graph used as the model input. The introduced virtual node both collects and integrates information from all joint points and distributes the integrated global information back to them; it is named the spatial relay node;
meanwhile, two types of connection are established between nodes, namely spatial inherent connections and spatial virtual connections, to construct the spatial topology graph of the skeleton; a spatial graph structure comprising n joint points has n-1 spatial inherent connections;
3.2, designing a space relative Transformer module:
the module comprises a spatial joint point update module (SJU) and a spatial relay node update module (SRU); by alternately applying the SJU and SRU modules, connections are established between distant joint points in the spatial domain. The model input is the sequence of joint points in the t-th frame skeleton $\{J_t^1, \ldots, J_t^N\}$, where N denotes the number of joint points in the frame and $\mathcal{N}(i)$ denotes the set of labels of all joint points neighboring joint $J_t^i$. Each node has a corresponding query vector $q_t^i$, key vector $k_t^i$ and value vector $v_t^i$.
In the spatial joint point update module (Spatial Joint node Update Block, SJU), for any joint point $J_t^i$, the query vector $q_t^i$ of that joint point is first dot-multiplied with the key vector $k_t^j$ of each of its neighbor nodes to obtain the influence of each neighbor node on the joint point, as shown in equation (12):

$$\alpha_t^{i,j} = q_t^i \cdot \left(k_t^j\right)^{T} \tag{12}$$

where $\alpha_t^{i,j}$ represents the influence strength of node j on node i; the neighbor nodes comprise the adjacent joint points $j \in \mathcal{N}(i)$, the spatial relay node $R_t$ (labeled r) and the joint point itself. After the influence strengths are obtained, each is multiplied by the value vector $v_t^j$ of the corresponding neighbor node and all the products are summed; the result is the updated joint point, as shown in equation (13):

$$\hat{J}_t^i = \sum_{j \in \mathcal{N}(i)\cup\{i,\,r\}} \mathrm{softmax}_j\!\left(\frac{\alpha_t^{i,j}}{\sqrt{d_k}}\right) v_t^j \tag{13}$$

where $\hat{J}_t^i$ is the result of one update by the spatial joint point update submodule (SJU), which aggregates local and global information simultaneously; $d_k$, the channel dimension of the key vector, serves as a normalization factor, and $\mathrm{softmax}_j$ normalizes the influence strengths over all neighbor nodes;
in order for the spatial relay node to collect and integrate the information of each joint point reasonably and fully, the spatial relay node update submodule (SRU) also uses a dot-product operation to compute the influence of each joint point on the relay node, and integrates the information of all joint points into global information through these influence strengths. The influence strength is obtained by multiplying the query vector $q_t^r$ of the relay node with the key vector $k_t^i$ of each joint point, as shown in equation (14):

$$\beta_t^{i} = q_t^r \cdot \left(k_t^i\right)^{T} \tag{14}$$

The update of the spatial relay node is as shown in equation (15), where $\beta_t^i$ represents the influence strength of joint point $J_t^i$ on the spatial relay node $R_t$ and $v_t^i$ is the value vector of each node:

$$R_t = \sum_{i=1}^{N} \mathrm{softmax}_i\!\left(\frac{\beta_t^{i}}{\sqrt{d_k}}\right) v_t^i \tag{15}$$
alternately updating the joint points and the spatial relay node exchanges information among the joint points, so that each joint point ultimately collects information from both its neighboring and its distant joint points;
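The alternating SJU/SRU updates amount to scaled dot-product attention restricted to each joint's neighbours plus the relay node. A toy numpy sketch under that reading (single frame, random projections standing in for learned ones):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sju_update(q, k, v, neighbors, relay_idx):
    """One Spatial Joint Update pass (eqs. (12)-(13)): every joint attends
    over its skeletal neighbours, itself and the spatial relay node via
    scaled dot-product attention. q, k, v have shape (N+1, d) with the
    relay node stored at row relay_idx."""
    d_k = k.shape[-1]
    out = np.empty_like(v[:relay_idx])
    for i in range(relay_idx):
        nbrs = neighbors[i] + [i, relay_idx]       # neighbours, self, relay
        scores = q[i] @ k[nbrs].T / np.sqrt(d_k)   # eq. (12), scaled
        out[i] = softmax(scores) @ v[nbrs]         # eq. (13)
    return out

def sru_update(q, k, v, relay_idx):
    """Spatial Relay Update (eqs. (14)-(15)): the relay node attends over
    every joint, aggregating global information."""
    d_k = k.shape[-1]
    scores = q[relay_idx] @ k[:relay_idx].T / np.sqrt(d_k)
    return softmax(scores) @ v[:relay_idx]

# Toy 3-joint chain (0-1-2) plus the relay node stored at index 3.
rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
v = np.ones((4, 4))          # identical values: any convex combination is 1
joints = sju_update(q, k, v, [[1], [0, 2], [1]], relay_idx=3)
relay = sru_update(q, k, v, relay_idx=3)
```

Because the attention weights are a convex combination, feeding identical value vectors must return those same vectors, which makes the sketch easy to sanity-check.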
3.3, constructing a time topological graph based on the skeleton sequence:
a temporal relay node is introduced when constructing the time topology graph; all joint points are connected through temporal inherent connections and temporal virtual connections, together forming the graph structure in the time domain;
along the time dimension, the same joint point across consecutive frames forms a new sequence; this step also connects the head and tail joint points to form a ring structure, so a sequence of n nodes contains n temporal inherent connections;
3.4, designing a TRT module:
the temporal relative Transformer module (TRT) comprises a temporal joint point update module (TJU) and a temporal relay node update module (TRU) and is used to extract time-domain features. The module treats each joint point of the skeleton as an independent node and, for each joint, extracts time-domain features from the sequence formed by that same joint across the frame sequence. The input of the TRT module is the sequence of the same joint point v over all frames, $\{J_1^v, \ldots, J_T^v\}$. Each joint point $J_i^v$ has its corresponding query vector $q_i^v$, key vector $k_i^v$ and value vector $v_i^v$; the temporal relay node $R_v$ has corresponding query vector $q_r^v$, key vector $k_r^v$ and value vector $v_r^v$.
In the TJU submodule, each joint point to be updated collects information from its neighbor nodes through the virtual connections to update itself; the influence of a neighbor node is calculated as shown in equation (16):

$$\gamma_v^{i,j} = q_i^v \cdot \left(k_j^v\right)^{T} \tag{16}$$

where $\gamma_v^{i,j}$ indicates the influence strength of the same joint in the j-th frame (or of the temporal relay node $R_v$) on the joint point in the i-th frame, and $\left(k_j^v\right)^{T}$ denotes the transpose of $k_j^v$; the update of joint point $J_i^v$ is as shown in equation (17):

$$\hat{J}_i^v = \sum_{j} \mathrm{softmax}_j\!\left(\frac{\gamma_v^{i,j}}{\sqrt{d_k}}\right) v_j^v \tag{17}$$
all query vectors are combined into a matrix $Q_v \in \mathbb{R}^{C\times 1\times t}$, all key vectors into a matrix $K_v \in \mathbb{R}^{C\times B\times t}$ and all value vectors into a matrix $V_v \in \mathbb{R}^{C\times B\times t}$. The matrix form of the influence strength is defined in equation (18):

$$A_v = \frac{\sum_{c=1}^{C}\left(Q_v \circ K_v\right)_{c}}{\sqrt{d_k}} \tag{18}$$

where B represents the number of neighbor nodes and ∘ denotes the Hadamard product;
in the TRU module, a time relay node R v Collecting information from other frames through virtual connection, thereby completing self node updating; the specific operation is as follows:
wherein, the first and the second end of the pipe are connected with each other,indicates that the articulation point in the jth frame->For relay node R v In conjunction with a strength of influence of>Is a scaling factor;
3.5, packaging the ST-RT module:
the ST-RT module is obtained by connecting an SRT module and a TRT module: the SRT module comprises the spatial joint point update module and the spatial relay node update module; the TRT module comprises the temporal joint point update module and the temporal relay node update module. Each update module is followed by a feed-forward network layer that maps the features to a higher-dimensional space to enhance the expressive capacity of the model; ×L denotes that the block is repeated L times;
3.6, encapsulating MSST-RT network:
the four ST-RT models with different input data are fused and encapsulated through a multi-stream framework to obtain the MSST-RT model. Different sampling frequencies also provide complementary information for the model: $n_1$ frames and $n_2$ frames are sampled for the joint and bone sequences respectively. The skeleton data passes through the MSST-RT network to obtain the final classification prediction probability based on skeleton data.
Preferably, in the behavior recognition network model based on image data, the joint-point-based picture cropping module selects the joint points of the human hands and feet for cropping; an end-to-end trained image block feature extraction model is adopted and, with a temporal segment network as the basic framework, encapsulated into the key image block feature extraction model.
Further preferably, the joint point-based picture cropping module comprises:
picture I of t-th frame t By means of a matrix P t Indicating, by the joint point N, the desired cut j Coordinates in the image are (x, y), and the size of the cropping picture is l × l, then the image is I t Center around the joint point N of hand and foot j Image block set obtained by cuttingAs shown in the following equation:
besides cropping the picture centered on the joint point coordinates, optical flow is extracted from the image blocks corresponding to two adjacent frames, as shown in equation (23):

$$\left(u_t,\, v_t\right) = \mathrm{TV\text{-}L1}\!\left(B_{t-1}^j,\, B_t^j\right) \tag{23}$$

where TV-L1 is a classical optical flow computation method, $u_t$ represents the optical flow field in the x-axis direction and $v_t$ the optical flow field in the y-axis direction.
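The joint-centred crop can be sketched with plain array slicing; clipping the window at the image border is an assumed policy, as the patent does not specify border handling:

```python
import numpy as np

def crop_around_joint(frame, joint_xy, l):
    """Crop an l-by-l image block centred on joint coordinates (x, y) from
    a frame of shape (H, W, C), clipping the window at the image border."""
    h, w = frame.shape[:2]
    x, y = joint_xy
    half = l // 2
    x0 = int(np.clip(x - half, 0, max(w - l, 0)))
    y0 = int(np.clip(y - half, 0, max(h - l, 0)))
    return frame[y0:y0 + l, x0:x0 + l]

frame = np.arange(100).reshape(10, 10, 1)    # toy 10x10 single-channel image
block = crop_around_joint(frame, (5, 5), 4)  # centred crop
edge = crop_around_joint(frame, (0, 0), 4)   # window clipped at the corner
```

The optical flow of equation (23) would then be computed on pairs of such blocks from adjacent frames, e.g. with an off-the-shelf TV-L1 implementation.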
Further preferably, the key image block-based behavior recognition network comprises:
5.1, designing an IBCN model:
the image blocks cut based on the skeleton joint points have independence and correlation, and each image block obtained by cutting is firstly subjected to the IBCN modelRespectively input into a convolutional neural network to obtain the characteristics of each image block>The calculation formula is shown as (24):
wherein the content of the first and second substances,means for extracting an image block by a convolutional neural network with a parameter W>Sharing each convolution neural network parameter; then the characteristics f of each image square t j Splicing to obtain new characteristic vectorAs shown in equation (25)
Finally, the similarity $f(x_i, x_j)$ between any spatial position $x_i$ of the feature vector $F_t$ and the other positions $x_j$ is calculated by dot product, as shown in equation (26):
$$f(x_i, x_j) = \mathrm{softmax}\!\left(\theta(x_i)^{T} \cdot \phi(x_j)\right) \tag{26}$$
where θ(·) and φ(·) are 1 × 1 convolution functions;
the obtained similarity f (x) i ,x j ) Will be used as the weight and g (x) j ) Weighted summation to achieve x i Obtaining information from other locations, y i Is x i The result of global information exchange is shown in equation (27):
wherein g (-) is a mapping function, and a 1 × 1 convolution function is adopted for mapping; nl' 2 To select a feature mapThe size of (2) is used as a normalization coefficient to avoid scale expansion caused by different input sizes; when the input is the feature tensor, the formula is shown as (28):
wherein θ (-), φ (-), and g (-) are all 1 × 1 convolution functions, nl' 2 Is a normalized coefficient;
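A minimal numpy sketch of the non-local exchange of equations (26)-(27) over a set of P spatial positions, with the 1 × 1 convolutions reduced to per-position linear maps; the weight matrices are illustrative stand-ins for learned parameters:

```python
import numpy as np

def non_local(F, W_theta, W_phi, W_g):
    """Non-local exchange over P positions (eqs. (26)-(27)): pairwise
    similarities softmax(theta(x_i)^T phi(x_j)) weight the aggregation of
    g(x_j). F has shape (P, C); the W_* matrices stand in for the learned
    1x1 convolutions theta, phi and g."""
    theta, phi, g = F @ W_theta, F @ W_phi, F @ W_g
    s = theta @ phi.T                       # raw pairwise scores
    s = np.exp(s - s.max(axis=1, keepdims=True))
    sim = s / s.sum(axis=1, keepdims=True)  # eq. (26): row-wise softmax
    return sim @ g                          # eq. (27): weighted aggregation

# With identical positions, every output equals the common g-vector.
out = non_local(np.ones((4, 3)), np.eye(3), np.eye(3), np.eye(3))
```

Because the similarity rows sum to one, the output of each position is a convex combination of the g-features of all positions, which is the global exchange the text describes.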
5.2, packaging a KBN network:
the method comprises the steps of packaging an IBCN model into a KBN network by taking a TSN network as a framework, dividing the network into a spatial stream and a time stream, wherein input data are image blocksCorresponding to spatial streams, the input data are optical stream blocks corresponding to temporal streams; adopting spatial stream, firstly sampling a plurality of frames from a video through sparse sampling, and processing each frame through an image cutting module based on a joint point; then, corresponding key image block set of each frameRespectively inputting IBCN models, and sharing parameters of each IBCN model according to the preliminary prediction class probability of the sampling frame; and then fusing the prediction classification results of all the sampling frames through a consensus function to obtain a video-level classification prediction, wherein the calculation formula is shown as (29):
wherein, KBN-S is the prediction result of the spatial stream of the KBN network, T K Representing the K-th segment after segmentation from the video,represents the set of image blocks corresponding to the Kth sample frame, in a manner which is characteristic of the fact that the image block corresponding to the Kth sample frame is taken in conjunction with a reference picture>Indicating that the image block set is/is asserted by the IBCN module>And processing, wherein the calculation method of the time flow prediction result is consistent with that of the spatial flow.
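The consensus function of equation (29) is commonly realized as an average over per-segment scores in TSN-style networks; a minimal sketch under that assumption:

```python
import numpy as np

def segment_consensus(segment_scores):
    """Average consensus over per-segment class scores: each sampled
    frame's key-image-block prediction is averaged into a single
    video-level prediction."""
    return np.mean(np.asarray(segment_scores, dtype=float), axis=0)

# Three sampled frames, two action classes.
video_pred = segment_consensus([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])
```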
Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable advantages:
1. A behavior recognition network model fusing skeleton data and image data is provided; it fully mines the skeleton motion information, builds dependencies between distant joint points and enhances the recognition of detailed actions. Further fusing local image data with skeleton data supplements rich action detail information while avoiding high computation cost;
2. The invention achieves a recognition accuracy of 98.65% on the NTU RGB+D 60 dataset; by proposing a behavior recognition network model based on skeleton data and one based on image data and fusing the two, the accuracy of the model is improved;
3. The invention establishes information exchange channels between all spatial positions through the Non-Local module, realizing global information exchange among the image blocks; both the independence and the correlation of the image blocks are taken into account, further improving the ability to recognize fine local human actions. Finally, the behavior recognition network model based on skeleton data and the one based on image data are fused, fully exploiting the complementarity of the two kinds of data.
Drawings
FIG. 1 is a schematic diagram of the overall structure of a network model of the method of the present invention.
FIG. 2 is a graph of a skeleton motion information cumulative distribution function according to the method of the present invention.
FIG. 3 is a schematic diagram of various types of motion information calculation according to the method of the present invention.
FIG. 4 is a schematic diagram of a skeletal dynamics information representation module of the method of the present invention.
FIG. 5 is a skeleton-based spatial topology of the method of the present invention.
FIG. 6 is a schematic diagram of a space relative transform module according to the method of the present invention.
FIG. 7 is a time topology diagram based on a skeleton sequence of the method of the present invention.
FIG. 8 is a schematic diagram of a Temporal Relative Transform (TRT) module of the method of the present invention.
FIG. 9 is a schematic diagram of the overall architecture of the ST-RT model of the method of the present invention.
FIG. 10 is a schematic diagram of the overall architecture of the MSST-RT model of the method of the present invention.
FIG. 11 is a schematic diagram of image cropping and corresponding optical flow based on joint point location for the method of the present invention.
FIG. 12 is a schematic diagram of an image block feature extraction model (IBCN) according to the present invention.
Fig. 13 is a schematic diagram of a key image block-based behavior recognition network (KBN) according to the method of the present invention.
Detailed Description
The above-described scheme is further illustrated below with reference to specific embodiments:
the first embodiment is as follows:
In this embodiment, as shown in Fig. 1, a behavior recognition method for public safety is implemented by establishing a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data, which together form the recognition network. The skeleton-based model uses a lightweight network to extract skeleton features, recognizes large-amplitude actions, and completes the main recognition task. Its input is a skeleton sequence, which passes in turn through the coordinate-motion-information-guided sampling module, the multi-scale motion information fusion module, and the multi-stream spatiotemporal relative Transformer model to produce an action-category prediction probability. The image-based model extracts image features from image blocks obtained by a picture-cropping method; it recognizes small-amplitude actions concentrated on the hands and feet and supplements detail information for the recognition task. Its input is an image sequence, which passes in turn through the joint-point-based picture cropping module and the key-image-block behavior recognition network (KBN) to produce a supplementary action-category prediction probability. Finally, the prediction probabilities obtained by the two models are fused to obtain the final classification prediction probability of the whole model, completing the action recognition process for public safety.
Each of the modules will be described in turn in detail.
(1) Sampling module guided by coordinate motion information
The innovation of the coordinate-motion-information-guided frame sampling module is that representative skeletons in the skeleton sequence are screened out according to a coordinate motion information metric, which increases the motion information contained in the sampled sequence.
Step 1.1, designing indexes for measuring coordinate motion information
In skeletal data, joint points are typically represented by 3D coordinates. The displacement of a joint point between two adjacent frames is taken as a measure of the motion information it contains, and the sum of the displacements of all joint points in a skeleton is taken as a measure of the motion information of the whole skeleton, which is then used to judge whether the skeleton is representative. Let p_i^t denote the coordinate of the joint point labeled i in the t-th frame, and p_i^{t-1} the coordinate of the same joint in the (t-1)-th frame. The coordinate motion information M_t contained in the t-th frame is shown in equation (1):

M_t = Σ_{i=1}^{N} ||p_i^t − p_i^{t-1}||_2 #(1)
where N represents the number of joints contained in a frame.
In order to eliminate the scale effect caused by differing video lengths, the coordinate motion information contained in each frame is normalized, as shown in formula (2):

M̂_t = M_t / Σ_{t=1}^{T} M_t #(2)
where T represents the number of frames contained in the video.
Step 1.2, sampling the video by adopting the cumulative distribution function
Assuming that N frames need to be sampled from a video of length T, the specific operation is as follows. First, the normalized skeleton coordinate motion information is accumulated frame by frame to obtain the accumulated coordinate motion information C_t of the t-th frame, calculated as shown in (3):

C_t = Σ_{k=1}^{t} M̂_k #(3)
According to C_t, the sequence is divided into N segments, as shown by the dashed lines in Fig. 2 (a total of 10 frames are sampled in Fig. 2). Finally, one frame is randomly sampled from each of the N segments to form the new sequence.
In conclusion, this module proposes a skeleton coordinate motion information metric, screens out the representative skeletons in the skeleton sequence by means of this metric, and thereby increases the motion information contained in the sampled sequence.
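The sampling procedure above can be sketched as follows (a minimal NumPy illustration; the function name, the equal-motion segment boundaries, and the fallback for empty segments are assumptions, not taken from the patent):

```python
import numpy as np

def motion_guided_sample(skeleton, n_segments, rng=None):
    """Sample n_segments frames from a (T, N, 3) skeleton sequence,
    guided by per-frame coordinate motion information (Eqs. 1-3)."""
    rng = rng or np.random.default_rng(0)
    T = skeleton.shape[0]
    # Eq. (1): per-frame motion = sum of joint displacements vs. previous frame
    disp = np.linalg.norm(skeleton[1:] - skeleton[:-1], axis=-1).sum(axis=1)
    motion = np.concatenate([[0.0], disp])      # frame 0 carries no motion
    # Eq. (2): normalize so values are comparable across video lengths
    motion = motion / motion.sum()
    # Eq. (3): cumulative coordinate motion information C_t
    cum = np.cumsum(motion)
    # Cut the sequence into n_segments pieces of equal *motion*, then
    # randomly pick one frame inside each piece.
    edges = np.linspace(0.0, 1.0, n_segments + 1)
    picks = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((cum > lo) & (cum <= hi))[0]
        if idx.size == 0:                        # degenerate segment: fall back
            idx = np.array([int(lo * (T - 1))])
        picks.append(int(rng.choice(idx)))
    return sorted(picks)
```

Frames with large displacement occupy wider spans of the cumulative function, so motion-rich parts of the video are sampled more densely than static parts.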
(2) Multi-scale motion information fusion module
The innovation of the multi-scale motion information fusion module is that the static information of the skeleton is fused with multi-scale motion information, enriching the model input. Based on the observation that different human actions have different change speeds and durations, two types of motion information are designed in this module: solidified motion information and adaptive motion information. The solidified motion information comprises two different scales, so that the network adapts to actions of different change speeds; the adaptive motion information gives the network the ability to recognize actions of different durations. Fusing the multi-scale motion information improves the generalization ability of the network. The specific steps are as follows.
Step 2.1, designing motion information of different scales
From the original skeleton sequence I_origin = [I_1, …, I_F], T frames are selected by sampling and combined in their original order into a new skeleton sequence I_new = [I_1, …, I_T]. As shown in Fig. 3, the pink frames are the sampled frames, and I denotes the coordinates of all joint points in a frame. Motion information is obtained by computing the coordinate displacement of the same joint point between two frames: JN_i^t denotes the joint point labeled i in the t-th frame of the original sequence I_origin, and jn_i^t denotes the joint point labeled i in the t-th frame of the sampled sequence I_new.
The adaptive motion information M_a is obtained from the skeleton sequence I_new by subtracting the joint point coordinates of two consecutive frames, so that motion information of different scales is obtained from videos of different lengths. The formula is as follows:

M_a^t = I_t − I_{t−1}, with I_t, I_{t−1} ∈ I_new
where M_a^t represents the adaptive motion information of the t-th frame of the new skeleton sequence I_new.
Although the adaptive motion information M_a is obtained as the difference between two adjacent frames of the new skeleton sequence I_new, the actual interval between those two frames depends on their positions in I_origin and is therefore closely related to the length of the original skeleton sequence; each skeleton sequence thus obtains motion information matched to its own length.
The solidified motion information is of two types: short-distance motion information M_s and long-distance motion information M_l. The short-distance motion information M_s is obtained from the original skeleton sequence I_origin by subtracting the coordinates of skeleton joint points separated by 2 frames, and captures rapidly changing motion. The calculation formula is as follows:

M_s^t = I_{f+2} − I_f, with I_f, I_{f+2} ∈ I_origin
wherein the content of the first and second substances,short-distance motion information of the t frame in the new skeleton sequence is shown, f is the new skeleton sequence I new The t-th frame in the original skeleton sequence I origin The numbering in (1).
The long-distance motion information M_l is obtained from the original skeleton sequence I_origin by subtracting the coordinates of skeleton joint points separated by 5 frames, and captures the motion information of more slowly changing actions. The calculation formula is as follows:

M_l^t = I_{f+5} − I_f, with I_f, I_{f+5} ∈ I_origin
wherein, the first and the second end of the pipe are connected with each other,long-distance motion information of t frame in new skeleton sequence is shown, f shows new skeleton sequence I new In the original frame sequence I origin The numbering in (1).
Step 2.2 high-dimensional mapping of different-scale motion information
The static information of the skeleton I_new, the adaptive motion information M_a, the short-distance motion information M_s and the long-distance motion information M_l all have shape (T, N, C_0), where T represents the number of video frames, N the number of joints in a skeleton, and C_0 the coordinate dimension of a joint point. The four kinds of information are mapped to a high-dimensional space through an embedding module (Embedding block) to obtain the high-dimensional features F, F_ma, F_ms and F_ml. The embedding module consists of two convolutional layers and two activation layers (ReLU): the first convolution maps each kind of information to a space of dimension C, and the second convolution maps them to high-dimensional spaces of dimensions C_1, C_2, C_3 and C_4 respectively. The convolution kernels corresponding to the different kinds of motion information are independent of each other and do not share parameters. Taking the static information I_new as an example, the two-stage mapping of the embedding module is shown in (10):
F = σ(W_2(σ(W_1 I_new + b_1)) + b_2) #(10)
step 2.3, multi-scale motion information fusion
And fusing various types of information through stacking operation (concat) to obtain a dynamic representation Z of the skeleton, as shown in formula (11). The operation enables the dynamic representation Z of the skeleton to contain multi-scale motion information, and therefore the capability of the network to adapt to actions with different change speeds and different durations is improved.
Z = concat(F, F_ma, F_ms, F_ml) #(11)
In summary, this module proposes three types of motion information of different scales, namely adaptive, short-distance and long-distance motion information; an embedding module then maps the motion information and the static information to a high-dimensional space respectively, and finally the four kinds of high-dimensional features are fused as the model input. The method of this section gives the model input rich motion information, and its multi-scale characteristics improve the generalization of the behavior recognition network.
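The motion-information construction of step 2.1 can be sketched as follows (NumPy; zero-padding the first adaptive difference and clamping the fixed-gap indices at the sequence end are assumptions — the patent does not specify boundary handling, and the embedding/fusion stage is omitted):

```python
import numpy as np

def multi_scale_motion(orig, sample_idx, short_gap=2, long_gap=5):
    """Compute the static, adaptive, short- and long-distance motion
    streams for a sampled skeleton sequence.
    orig: (F, N, 3) original sequence; sample_idx: indices in orig of
    the T sampled frames (i.e. the mapping t -> f of the text)."""
    new = orig[sample_idx]                        # static information I_new
    # adaptive motion: difference of consecutive *sampled* frames, so the
    # effective time gap adapts to the original video length
    m_a = np.concatenate([np.zeros_like(new[:1]), new[1:] - new[:-1]])
    # solidified motion: fixed gaps of 2 / 5 frames in the *original* sequence
    f = np.asarray(sample_idx)
    m_s = orig[np.minimum(f + short_gap, len(orig) - 1)] - new
    m_l = orig[np.minimum(f + long_gap, len(orig) - 1)] - new
    return new, m_a, m_s, m_l
```

All four outputs share the shape (T, N, C_0), so each can be fed through its own embedding branch and the results concatenated as in Eq. (11).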
(3) Multi-stream spatiotemporal relative transform model
In the action recognition task, many human actions are completed by the cooperation of joint points that are far apart. For example, clapping requires the left and right hands to act together; the joints of the two hands are far apart in the skeleton but strongly correlated in the action. The innovation of the multi-stream spatiotemporal relative Transformer model is to establish long-distance relations between joint points over the space-time domain. The work is as follows: in the spatial domain, a skeleton-based spatial topology graph is designed and a spatial relative Transformer module is proposed to establish remote dependencies between joint points; in the time domain, a skeleton-sequence-based temporal topology graph is designed and a temporal relative Transformer module is proposed to establish long-distance dependencies between joint points over time. The spatial and temporal relative modules are then combined into the spatiotemporal relative Transformer (ST-RT) model, which extracts the spatiotemporal features of the skeleton sequence. Finally, a multi-time-scale framework fuses the ST-RT models of four input data streams to obtain the multi-stream relative spatiotemporal model. The specific steps are as follows.
Step 3.1, constructing a space topological graph based on a framework
In addition to the original joint points of the skeleton, this step introduces a virtual node, which together with all the joint points forms a new spatial topology graph used as the model input. As shown in Fig. 5, the blue nodes are the original joint points and the purple node is the introduced virtual node. The virtual node both collects and integrates information from each joint point and distributes the integrated global information back to each joint point; it is named the spatial relay node.
Meanwhile, this step establishes two types of connections between the nodes (joint points and the spatial relay node), namely spatial inherent connections and spatial virtual connections, to construct the spatial topology graph of the skeleton. As shown in Fig. 5, spatial inherent connections (blue line segments) are established for all joint-point pairs directly connected by bones in the human skeleton, preserving the original graph topology. The spatial inherent connections carry a large amount of prior knowledge and serve to gather local information from neighboring joint points; their existence lets a joint point obtain more information from its neighbors than from remote joints. A spatial graph structure containing n joint points has n−1 spatial inherent connections.
Step 3.2, designing a space relative Transformer module
The spatial relative Transformer (SRT) module is essentially a Transformer-based spatial feature extraction algorithm, as shown in Fig. 6. The module comprises a spatial joint point update module (SJU) and a spatial relay node update module (SRU); by alternately running the SJU and SRU modules it establishes contact between remote joint points in the spatial domain. Since the module updates the joint points and the spatial relay node independently in each frame, this step describes the algorithm using a single frame as an example. The model input is the joint point sequence of the t-th frame skeleton, {jn_1^t, …, jn_N^t}, where N denotes the number of joint points in the frame and N(i) represents the set of labels of all neighbor nodes of joint point jn_i^t. Each node (the joint points jn_i^t and the spatial relay node R_t) has a corresponding query vector q_i^t, key vector k_i^t and value vector v_i^t.
In the spatial joint point update module (SJU), for any joint point jn_i^t, the query vector q_i^t of the joint point is first dot-multiplied with the key vectors k_j^t of its neighbor nodes to obtain the influence of each neighbor node on the joint point, as shown in formula (12):

α_{i,j}^t = softmax_j((q_i^t · k_j^t) / √d_k) #(12)
wherein, the first and the second end of the pipe are connected with each other,representing the strength of the influence of node j on node i. The neighbor node includes its neighboring knuckle point->Spatial relay node R t And itself>
After the influence strengths α_{i,j}^t are obtained, each is multiplied by the value vector v_j^t of the corresponding neighbor node, and the products are summed; the result is the updated joint point, as shown in formula (13):

ĵn_i^t = Σ_{j∈N(i)} α_{i,j}^t v_j^t #(13)
wherein, the first and the second end of the pipe are connected with each other,is the result obtained after one-time updating by a joint point updating Submodule (SJU), the result simultaneously aggregates the local information and the global information, d k And the channel dimension of the key vector is expressed, and the normalization function is realized. As shown in the block SJU in fig. 6, the red nodes are the nodes to be updated, which collect information from neighboring nodes through orange connections. />
To allow the spatial relay node to collect and integrate the information of each joint point reasonably and fully, the spatial relay node update submodule (SRU) also uses a dot-product operation to calculate the influence of each joint point on the relay node. As shown in the SRU block in Fig. 6, the spatial relay node to be updated (red node) collects information from every node through the orange connections and integrates it into global information according to the influence strengths. The influence strength β_i^t is obtained by multiplying the query vector of the relay node with the key vector of each joint point, as shown in (14):

β_i^t = softmax_i((q_R^t · k_i^t) / √d_k) #(14)
the update of the spatial relay node is as shown in equation (15),represents a joint point->For space relay node R t Influence score of (a), (b), and (c)>Value vectors for all nodes (including all nodes and spatial relay nodes in the skeleton).
The alternate updating of the joint points and the spatial relay node realizes the exchange of information among the joint points, so that each joint point ultimately collects information from both its neighboring and its remote joint points. The overall update procedure of the SRT module is Algorithm 1, shown in Table 1, where the first loop traverses all frames and the second loop traverses all nodes (including the spatial relay node) within a frame.
Table 1. Algorithm 1: SRT module update algorithm description
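One SJU/SRU alternation for a single frame can be sketched as follows (a NumPy illustration assuming scaled dot-product attention with softmax normalization, consistent with the d_k scaling described above; function and variable names are illustrative, not from the patent):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def srt_frame_update(q, k, v, neighbors):
    """One SJU + SRU alternation for one frame (cf. Eqs. 12-15).
    q, k, v: (N+1, d) query/key/value vectors, where row N is the
    spatial relay node; neighbors[i] lists the neighbor labels of joint i
    (including the relay node and i itself)."""
    N, d = q.shape[0] - 1, q.shape[1]
    relay = N
    out = np.empty_like(v)
    # SJU: every joint attends over its neighbor set (Eqs. 12-13)
    for i in range(N):
        nbr = list(neighbors[i])
        att = softmax(q[i] @ k[nbr].T / np.sqrt(d))
        out[i] = att @ v[nbr]
    # SRU: the relay node attends over *all* nodes (Eqs. 14-15)
    att = softmax(q[relay] @ k.T / np.sqrt(d))
    out[relay] = att @ v
    return out
```

Stacking this alternation L times lets information from any joint reach any other joint through the relay node, which is the stated purpose of the module.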
Step 3.3, constructing a time topological graph based on the skeleton sequence
In this step, a temporal relay node is introduced when constructing the temporal topology graph; all joints are connected through temporal inherent connections and temporal virtual connections, jointly forming the graph structure in the time domain.
Along the time dimension, the same joint in successive frames forms a new sequence; a connection is also constructed between the head and tail joints, producing a ring structure, as shown in Fig. 7. These connections are named temporal inherent connections (blue line segments) because they preserve the order of the frames, and they serve to exchange information directly with adjacent frames. A ring of n nodes contains n temporal inherent connections.
Similar to the construction in step 3.1, temporal virtual connections (purple segments) connect the temporal relay node (purple node) with each node in the sequence (blue nodes); the nodes complete remote information exchange through these connections. A graph containing n nodes therefore has n temporal virtual connections, as shown in Fig. 7.
Step 3.4, design TRT Module
The temporal relative Transformer module (TRT) comprises a temporal joint point update module (TJU) and a temporal relay node update module (TRU) and is used to extract time-domain features. The module treats each joint point of the skeleton as an independent node and extracts the time-domain features of each joint from the sequence formed by that joint across the frame sequence. This step describes the algorithm using a single joint as an example. The input of the TRT module is {jn_v^1, …, jn_v^T}, the sequence of the same joint v over all frames. Each joint point jn_v^t has a corresponding query vector q_v^t, key vector k_v^t and value vector v_v^t; the temporal relay node R_v has a corresponding query vector q_R^v, key vector k_R^v and value vector v_R^v.
In the TJU submodule, each joint point to be updated, jn_v^i (red node), collects information from its neighbor nodes (the temporal relay node R_v, the same joint point in adjacent frames, and the node itself) through virtual connections (orange segments) and is thereby updated, as shown in the TJU block of Fig. 8. The influence of a neighbor node is calculated as shown in (16):

α_{i,j}^v = softmax_j((q_v^i · k_v^j) / √d_k) #(16)
wherein the content of the first and second substances,indicating the same node or time relay node R in the jth frame v The influence on a certain joint point in the ith frame. Articulation point->Is as shown in equation (17):
All query vectors q_v^t are combined into a matrix Q_v ∈ R^{C×1×t}, all key vectors k_v^t into a matrix K_v ∈ R^{C×B×t}, and all value vectors v_v^t into a matrix V_v ∈ R^{C×B×t}. The matrix-form definition of the influence strength is shown in formula (18):
where B represents the number of neighbor nodes, and ∘ denotes the Hadamard product.
In the TRU module, as shown in Fig. 8, the temporal relay node R_v (red node) collects information from the other frames through virtual connections (orange segments), thereby completing its own update. The specific operation is as follows:

R̂_v = Σ_j β_j^v v_v^j, with β_j^v = softmax_j((q_R^v · k_v^j) / √d_k)
wherein the content of the first and second substances,indicates that the articulation point in the jth frame->For relay node R v Is greater or less than>Is a scaling factor.
The temporal relay node and the same joint in all frames are updated alternately, so that the TRT module finally captures both long- and short-distance dependencies between frames. The overall TRT update procedure is Algorithm 2, shown in Table 2, where the first loop traverses all joint points in the skeleton and the second loop traverses the corresponding joint points (including the temporal relay node) across all frames.
Table 2. Algorithm 2: TRT Module update Algorithm Specification
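The matrix form of the temporal influence strength can be sketched as follows (NumPy; the broadcasting of the single query against B neighbor keys follows the shapes stated for Eq. (18), while the softmax placement over the neighbor axis is an assumption):

```python
import numpy as np

def trt_influence(Q, K, d_k):
    """Matrix-form temporal influence strength (cf. Eq. 18).
    Q: (C, 1, T) query matrix of the joint; K: (C, B, T) key matrix of
    its B neighbors. The Hadamard product Q ∘ K is summed over the
    channel axis, scaled by sqrt(d_k), and softmax-normalized over the
    B neighbors, independently at every time step."""
    scores = (Q * K).sum(axis=0) / np.sqrt(d_k)     # (B, T)
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)         # influence per neighbor
```

Broadcasting the (C, 1, T) query against the (C, B, T) keys computes all per-time-step dot products in one vectorized operation, which is the point of the matrix formulation.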
Step 3.5, packaging the ST-RT module:
The ST-RT module is obtained by connecting an SRT module and a TRT module, as shown in Fig. 9. The SRT module includes the spatial joint point update module (SJU) and the spatial relay node update module (SRU); the TRT module includes the temporal joint point update module (TJU) and the temporal relay node update module (TRU). Each update module is followed by a feed-forward network (FFN) that maps the features to a higher-dimensional space to enhance the expressive power of the model. L× denotes L repetitions.
Step 3.6, encapsulate MSST-RT network
To further improve the model accuracy, this step fuses and encapsulates four ST-RT models with different input data through a multi-stream framework to obtain the MSST-RT (multi-stream ST-RT) model, as shown in Fig. 10. Besides extracting features from the first-order information of the skeleton (joint points), features can also be extracted from the second-order information (bones). Meanwhile, different sampling frequencies can also provide complementary information for the model, e.g. sampling n_1 frames and n_2 frames for the joint and bone sequences respectively. Passing the skeleton data through the MSST-RT network yields the final classification prediction probability based on skeleton data.
In conclusion, the MSST-RT model improves the Transformer according to the characteristics of the skeleton graph and sequence, establishes dependencies between remote joint points at low computational cost, and preserves the integrity of the skeleton structure and the order of the sequence, thereby improving both computational efficiency and recognition accuracy.
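The multi-stream fusion of step 3.6 can be sketched as follows (weighted score averaging is an assumption — the patent states only that the four streams are fused; function name and the optional weights are illustrative):

```python
import numpy as np

def msst_rt_fuse(stream_probs, weights=None):
    """Fuse the class-prediction probabilities of the four ST-RT streams
    (joint/bone sequences at two sampling frequencies) into the final
    skeleton-based prediction by weighted averaging."""
    p = np.stack(stream_probs)                       # (S, num_classes)
    w = np.ones(len(p)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()                                  # normalize stream weights
    return (w[:, None] * p).sum(axis=0)
```

Because each input row is a probability distribution and the weights sum to one, the fused output remains a valid distribution over action classes.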
(4) Image cropping module based on joint points
Since fine human motions are mostly concentrated at the hands or feet, the corresponding image blocks contain most of the detail information missing from the skeleton. The innovation of this module is therefore to crop around the hand and foot joints of the human body, which greatly reduces the training cost, as shown in Fig. 11.
Specifically, the image of the t-th frame I_t is represented by a matrix P_t. Let the joint point N_j to be cropped have coordinates (x, y) in the image, and let the size of the cropped picture be l×l. The image block set obtained by cropping the image I_t around the hand and foot joint points N_j is then given by the following equation:
in addition to cropping the picture by taking the joint point coordinates as the center, this section also extracts the optical flow through the picture blocks corresponding to two adjacent frames, and the formula is shown as (23):
where TV-L1 is a classical optical flow calculation method, and the two outputs are the optical flow fields in the x-axis and y-axis directions respectively.
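The joint-centered cropping rule can be sketched as follows (NumPy; clamping the crop window at the image border is an assumption — the patent does not specify boundary handling — and the TV-L1 optical flow step is omitted here):

```python
import numpy as np

def crop_joint_blocks(image, joints, size):
    """Cut an l×l block around each hand/foot joint coordinate.
    image: (H, W, C) array; joints: list of (x, y) pixel coordinates;
    size: block side length l. Returns the stacked block set."""
    H, W = image.shape[:2]
    half = size // 2
    blocks = []
    for x, y in joints:
        # center the window on the joint, clamped so it stays inside the image
        x0 = int(np.clip(x - half, 0, W - size))
        y0 = int(np.clip(y - half, 0, H - size))
        blocks.append(image[y0:y0 + size, x0:x0 + size])
    return np.stack(blocks)                          # (J, l, l, C)
```

Applying the same crop windows to two adjacent frames yields the paired blocks from which the TV-L1 optical flow of Eq. (23) would be computed.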
(5) Behavior recognition network (KBN) based on key image blocks
In order to extract features from each cropped key image block, this embodiment designs an end-to-end trained image block feature extraction model (IBCN) and encapsulates the IBCN model into the KBN network, with the temporal segment network (TSN) as the basic framework. The specific steps are as follows.
Step 5.1, designing IBCN model
Image blocks cropped on the basis of skeleton joint points exhibit both independence and correlation. Therefore, as shown in Fig. 12, the IBCN model first feeds each cropped image block separately into a convolutional neural network (CNN) to obtain the feature f_t^j of each block, as calculated in (24):
wherein, the first and the second end of the pipe are connected with each other,means for extracting an image block by a convolutional neural network with a parameter W>The convolutional neural network parameters are shared.
The features f_t^j of the image blocks are then concatenated to obtain a new feature vector F_t, as shown in equation (25):
Finally, the similarity f(x_i, x_j) between any spatial position x_i of the feature vector F_t and the other positions x_j is calculated by dot product, as shown in equation (26):
f(x_i, x_j) = softmax(θ(x_i)^T · φ(x_j)) #(26)
where θ (-) and φ (-) are 1 × 1 convolution functions.
The obtained similarity f(x_i, x_j) is used as a weight for a weighted sum with g(x_j), so that position x_i obtains information from the other positions; y_i is the result for x_i after global information exchange, as shown in formula (27):
where g(·) is a mapping function; this section uses a 1×1 convolution for the mapping. The dimension l'^2 of the selected feature map is used as the normalization coefficient, which avoids the scale expansion caused by different input dimensions. When the input is a feature tensor, the formula is shown as (28):
where θ(·), φ(·) and g(·) are all 1×1 convolution functions, and l'^2 is the normalization coefficient.
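The Non-Local operation of Eqs. (26)–(28) can be sketched as follows (NumPy, on a flattened feature matrix; the 1×1 convolutions are realised as plain matrix multiplications, and the weight shapes are illustrative):

```python
import numpy as np

def non_local(F, W_theta, W_phi, W_g):
    """Embedded-Gaussian Non-Local operation on a flattened feature
    matrix F (L, C): pairwise similarities softmax(theta(x_i)^T phi(x_j))
    weight the mapped values g(x_j), so every position exchanges
    information with every other position. W_*: (C, C') projections."""
    theta, phi, g = F @ W_theta, F @ W_phi, F @ W_g   # (L, C') embeddings
    logits = theta @ phi.T                            # f(x_i, x_j) pre-softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    sim = e / e.sum(axis=1, keepdims=True)            # softmax over positions j
    return sim @ g                                    # y_i after global exchange
```

Because the similarity matrix spans all position pairs, each image block's feature can draw on every other block, which is how the module combines the independence and the relevance of the blocks.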
Step 5.2, encapsulating the KBN network
This step uses the TSN network as the framework and encapsulates the IBCN model into the KBN network, which is divided into a spatial stream and a temporal stream: the input data of the spatial stream are the cropped image blocks, and the input data of the temporal stream are the corresponding optical flow blocks. Taking the spatial stream as an example, a number of frames are first sampled from the video by sparse sampling, and each frame is processed by the joint-point-based image cropping module. Then the key image block set of each frame is input into the IBCN model, whose parameters are shared, to obtain the preliminary prediction class probability of each sampled frame. The prediction results of all sampled frames are then fused by a consensus function (Consensus) to obtain the video-level classification prediction, calculated as shown in (29).
where KBN-S is the prediction result of the spatial stream of the KBN network; the temporal-stream prediction result is calculated in the same way as the spatial-stream result.
And finally, fusing the spatial stream prediction result and the time stream prediction result to obtain the final classification prediction probability based on the image data.
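The consensus and two-stream fusion described above can be sketched as follows (averaging as the consensus function, as in TSN, and a 50/50 stream weighting are assumptions; the patent does not state the exact functions):

```python
import numpy as np

def kbn_predict(spatial_segment_probs, temporal_segment_probs, alpha=0.5):
    """Video-level KBN prediction (cf. Eq. 29): per-segment class
    probabilities from the IBCN are fused by a consensus function
    (average assumed), then the spatial- and temporal-stream results
    are combined with weight alpha."""
    s = np.mean(spatial_segment_probs, axis=0)       # consensus over K segments
    t = np.mean(temporal_segment_probs, axis=0)
    return alpha * s + (1 - alpha) * t               # fused class probabilities
```

The same fusion pattern is then applied one level up, combining this image-based prediction with the MSST-RT skeleton-based prediction for the final output.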
Behavior recognition, as a popular research direction in computer vision, has broad application prospects in public safety, human-computer interaction and other areas, and is of significant research interest. Behavior recognition methods mainly fall into two classes, based on skeleton data and on image data. This embodiment provides a behavior recognition network model fusing skeleton data and image data: it fully mines the skeleton motion information, establishes dependencies between remote joint points, and enhances the recognition of detailed actions. Local image data and skeleton data are further fused, supplementing rich action detail information while avoiding high computational cost. This embodiment achieves a recognition accuracy of 98.65% on the NTU60 dataset. The behavior recognition network models based on skeleton data and on image data are fused to form a complete system:
the embodiment provides a motion information guidance sampling module and a multi-scale motion information fusion module, aiming at the problem that the existing skeleton behavior identification method does not fully mine skeleton motion information. In the motion information guiding and sampling module, the sum of coordinate displacements of each joint point of two adjacent frames is provided as an index for measuring the motion information of the skeleton coordinate, and the sampling is guided by the measuring index, so that the skeleton obtained by sampling has richer motion information, and the identification accuracy is further improved. In the multi-scale motion information fusion module, solidified motion information and self-adaptive motion information are provided and fused with static information, so that the model input has rich motion information, the adaptability of the model to actions with different change speeds and different durations is further enhanced, and the accuracy of the model is improved.
Aiming at the problem that graph convolutional networks cannot establish long-distance dependencies between distant joint points in the skeleton, this embodiment provides the Transformer-based skeleton behavior recognition network MSST-RT. The network introduces a virtual node in each of the spatial and temporal domains; the node establishes direct contact (virtual connections) with every joint point and collects and integrates joint-point information to update itself autonomously. Each joint point obtains local information from adjacent nodes through the bones (inherent connections) and global information from the virtual node through the virtual connections, and is thereby updated. Through these two updates, each joint point completes information exchange with every other joint point, long-distance dependencies are established, and spatiotemporal features are extracted.
Aiming at the problem that image data contains the detail information missing from skeleton data but the related models are costly to train, this embodiment provides the joint-point-based picture cropping module and the key-image-block behavior recognition network KBN. In the joint-point-based image cropping module, to reduce the image data size and the training cost, the positions of the hands and feet in the image are cropped according to the joint coordinates to obtain several image blocks, and this block set replaces the full image for feature extraction. In the KBN model, an information exchange channel between spatial positions is established through the Non-Local module, realizing global information exchange between image blocks while taking into account both their independence and their relevance, thereby improving the recognition of fine local human motions. Finally, the behavior recognition network models based on skeleton data and on image data are fused, fully exploiting the complementarity of the two data types.
Example two:
This embodiment is substantially the same as the first embodiment, and is characterized in that:
in this embodiment, the action recognition method for public safety is evaluated using the skeleton data and image data of the NTU60 data set; the training and test sets are divided according to the Cross-Subject (C-Subject) protocol, and model performance is measured by Top-1 accuracy.
(1) Behavior recognition network model based on skeleton data
The static information, adaptive motion information, short-term motion information and long-term motion information in the multi-scale motion information fusion module are mapped from a 3-dimensional space to a 64-dimensional space through a first 1×1 convolution, and then from the 64-dimensional space to high-dimensional spaces of dimension 256, 128 and 128 respectively through a second 1×1 convolution.
The number of SRT modules and TRT modules in the MSST-RT model is set to 3, the number of heads of the multi-head attention mechanism is set to 8, and batch normalization is adopted for normalization. All experiments are implemented with the PyTorch framework; the Adam optimizer is used for model training, with parameters β = [0.9, 0.98] and ε = 10^-9. Training is divided into two stages: 1) in the first stage (the first 700 iterations), the learning rate is increased linearly by warm-up from 4×10^-7 to 5×10^-4; 2) in the second stage, the learning rate is gradually reduced by a natural exponential decay strategy with decay weight 0.9996. This training scheme both accelerates model convergence and makes training more stable. During training, the batch size is set to 64 and the number of training epochs to 30. All experiments adopt a label smoothing strategy with ε_ls = 0.1.
In terms of data processing, the coordinate displacement of each joint point relative to the same joint point in the first frame replaces the original coordinates to describe the skeleton of each frame. Some actions in the training set are two-person interactive actions, i.e. two skeletons appear in the same frame, such as hugging and shaking hands. In this case, a frame containing two skeletons is split into two frames, each containing one skeleton. In addition, data enhancement is performed by randomly rotating the 3D skeleton to obtain more distinct samples, which enhances the generalization capability of the network to a certain extent.
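The two preprocessing steps described above can be sketched in numpy as follows (an illustrative sketch only; the rotation axis and angle range are assumptions, since the embodiment does not specify them):

```python
import numpy as np

def to_displacement(seq):
    """Replace raw coordinates by displacement w.r.t. frame 0.
    seq: (T, N, 3) skeleton sequence of T frames, N joints."""
    return seq - seq[0:1]

def random_rotate(seq, max_deg=17.0, rng=None):
    """Data enhancement: random 3D rotation of the whole skeleton.
    Rotating about the vertical (y) axis with a bounded angle is an
    illustrative assumption."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(a), np.sin(a)
    R = np.array([[c, 0, s],
                  [0, 1, 0],
                  [-s, 0, c]])
    return seq @ R.T
```

A rotation leaves joint-to-origin distances unchanged, so the augmented samples remain geometrically valid skeletons.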
(2) Behavior recognition network model based on image data
The experiments in this section are implemented with the PyTorch framework, and network parameters are learned by stochastic gradient descent with a momentum value of 0.9. For the spatial stream of the KBN network, the batch size is set to 24, the number of training epochs to 80, and the initial learning rate to 0.001; the learning rate is halved at epochs 25, 45 and 70. Network parameters are initialized from a model pre-trained on the ImageNet data set. For the temporal stream of the KBN network, the batch size is set to 24, the number of training epochs to 300, and the initial learning rate to 0.001; the learning rate is halved at epochs 50, 100, 150 and 200. Gradient clipping is applied whenever the gradient value exceeds 20 during training, which effectively avoids gradient explosion. To accelerate convergence, the KBN temporal stream is initialized with the parameters of the KBN spatial stream. The CUDA version of the TV-L1 algorithm provided by OpenCV is used to extract optical flow for the image blocks.
Table 3. Performance of the methods on the NTU60 data set
In the action recognition method for public safety of this embodiment, the recognition network is divided into two branches according to data type: a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data. The former extracts skeleton features through a lightweight network, is good at recognizing large-amplitude actions, and plays the main role in the action recognition task; the latter reduces training cost by cropping images, extracts image features from key image blocks, is good at recognizing small-amplitude actions concentrated on the hands and feet, and supplements detail information in the action recognition task.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above embodiments. Various changes, modifications, substitutions, combinations or simplifications made according to the spirit and principle of the technical solution of the present invention shall be regarded as equivalent substitutions and shall fall within the protection scope of the present invention, as long as they conform to the purpose of the invention and do not depart from its technical principle and inventive concept.
Claims (8)
1. A motion recognition method based on skeleton and image data fusion, characterized in that: a behavior recognition network model based on skeleton data and a behavior recognition network model based on image data are respectively established to form a recognition network; the behavior recognition network model based on skeleton data extracts skeleton features with a lightweight network, recognizes actions with larger amplitude, and completes the main action recognition task; its input data are skeleton sequences, which pass sequentially through a coordinate-motion-information-guided sampling module, a multi-scale motion information fusion module and a multi-stream spatio-temporal relative Transformer model to obtain an action category prediction probability; the behavior recognition network model based on image data extracts image features from image blocks by a joint-point-based picture cropping method, recognizes small-amplitude actions concentrated on the hands and feet, and supplements detail information in the action recognition task; its input data are image sequences, which pass sequentially through a joint-point-based picture cropping module and a key image block feature extraction model (KBN) to obtain a supplementary action category prediction probability; and the action category prediction probabilities obtained by the two models are fused to obtain the final classification prediction probability of the whole model, thereby completing the action recognition process for public safety.
2. The method of claim 1, wherein: in the behavior recognition network model based on skeleton data, the coordinate-motion-information-guided frame sampling module screens a representative skeleton sequence out of the skeleton sequences according to a coordinate motion information measurement index; the multi-scale motion information fusion module fuses the static information of the skeleton with multi-scale motion information, and sets two different types of motion information according to the fact that different human actions have different change speeds and durations, namely solidified motion information and adaptive motion information; the solidified motion information comprises two different scales, so that the network adapts to actions with different change speeds, and the adaptive motion information gives the recognition network the ability to recognize actions of different durations; the multi-stream spatio-temporal relative Transformer model establishes long-range connections for every joint point in the spatio-temporal domain, as follows: in the spatial domain, a skeleton-based spatial topological graph is set up and a spatial relative Transformer module is constructed to establish remote dependence of joint points in the spatial domain; in the temporal domain, a temporal topological graph based on the skeleton sequence is constructed and a temporal relative Transformer module is established to provide remote dependence of joint points in the temporal domain; then the spatial and temporal relative modules are combined to obtain the spatio-temporal relative Transformer model, which extracts the spatio-temporal features of the skeleton sequence; and at least 4 spatio-temporal relative models with different input data are fused through a multi-time-scale framework to construct the multi-stream spatio-temporal relative Transformer model.
3. The method of claim 2, wherein the method comprises: the coordinate motion information directed frame sampling module comprises:
1.1 designing indexes for measuring coordinate motion information:
in the skeleton data, joint points are represented by 3D coordinates; the displacement distance of a joint point between two adjacent frames is taken as an index of the motion information it contains, and the sum of the displacement distances of all joint points in a skeleton as an index of the motion information contained in the whole skeleton, which is used to judge whether the skeleton is representative; let the coordinates of the joint point labeled i in the t-th frame be p_t^i and in the (t-1)-th frame be p_{t-1}^i; the coordinate motion information M_t contained in the t-th frame is shown in equation (1):

M_t = Σ_{i=1}^{N} ‖ p_t^i − p_{t−1}^i ‖_2   (1)
wherein, N represents the number of joint points contained in a frame;
in order to eliminate the scale effect caused by different video lengths, the coordinate motion information contained in each frame is normalized, as shown in equation (2):

M̄_t = M_t / Σ_{k=1}^{T} M_k   (2)
wherein, T represents the number of frames contained in the video;
1.2, sampling the video by adopting a cumulative distribution function:
assuming that N frames need to be sampled from a video with a length of T, the specific operation is as follows:
firstly, the skeleton coordinate motion information is accumulated frame by frame to obtain the cumulative coordinate motion information; the cumulative coordinate motion information C_t of the t-th frame is calculated as shown in equation (3):

C_t = Σ_{k=1}^{t} M̄_k   (3)
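The sampling procedure of this claim can be sketched as follows (an illustrative sketch; treating the cumulative motion as a distribution function and inverting it at evenly spaced targets is the interpretation assumed here, and defining the first frame's motion as zero is also an assumption):

```python
import numpy as np

def coord_motion(seq):
    """M_t: total joint displacement between consecutive frames (eq. 1).
    seq: (T, N, 3); M_0 is defined as 0 here."""
    d = np.linalg.norm(seq[1:] - seq[:-1], axis=-1).sum(axis=-1)
    return np.concatenate([[0.0], d])

def sample_frames(seq, n):
    """Sample n frames so that cumulative normalized motion is evenly
    covered: motion-rich segments contribute more frames."""
    m = coord_motion(seq)
    m = m / (m.sum() + 1e-8)            # normalization (eq. 2)
    cdf = np.cumsum(m)                   # cumulative motion (eq. 3)
    targets = np.linspace(0, 1, n, endpoint=False) + 0.5 / n
    idx = np.clip(np.searchsorted(cdf, targets), 0, len(seq) - 1)
    return seq[idx], idx
```

Frames where little happens are compressed, while fast-moving passages are sampled densely, which is the stated goal of the coordinate-motion-information-guided sampling module.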
4. The method of claim 2, wherein the method comprises: the multi-scale motion information fusion module comprises:
2.1 designing different scale motion information:
T frames are selected by sampling from the original skeleton sequence I_origin = [I_1, …, I_F] and combined, in their original order, into a new skeleton sequence I_new = [I_1, …, I_T], where F is the total number of frames of the original sequence and each I contains the coordinates of all joint points of one frame; motion information is obtained by computing the coordinate displacement of the same joint point between two frames: I_origin^{t,i} denotes the joint point labeled i in the t-th frame of the original sequence I_origin, and I_new^{t,i} denotes the joint point labeled i in the t-th frame of the sampled sequence I_new;
the adaptive motion information M_a is obtained by subtracting the joint point coordinates of consecutive frames of the sequence I_new; because the sampling stride depends on the video length, this captures motion at different scales from videos of different lengths:

M_a^t = I_new^{t+1} − I_new^t

where M_a^t denotes the adaptive motion information of the t-th frame of the new sequence I_new;
the solidified motion information is divided into two categories: short-range motion information M_s and long-range motion information M_l; the short-range motion information M_s is obtained by subtracting the coordinates of skeleton joint points 2 frames apart in the original sequence I_origin, and captures rapidly changing motion:

M_s^t = I_origin^{f+2} − I_origin^f

where M_s^t denotes the short-range motion information of the t-th frame of the new sequence, and f is the index in the original sequence I_origin of the t-th frame of the new sequence I_new;
the long-range motion information M_l is obtained by subtracting the coordinates of skeleton joint points 5 frames apart in the original sequence I_origin, and captures slowly changing motion:

M_l^t = I_origin^{f+5} − I_origin^f

where M_l^t denotes the long-range motion information of the t-th frame of the new sequence, and f is the index in the original sequence I_origin of the t-th frame of the new sequence;
2.2, high-dimensional mapping of different-scale motion information:
the static information of the skeleton I_new, the adaptive motion information M_a, the short-range motion information M_s and the long-range motion information M_l are all tensors of shape (T, N, C_0), where T is the number of video frames, N the number of joints of a skeleton, and C_0 the coordinate dimension of a joint point; the four kinds of information are mapped to high-dimensional spaces through an embedding module (Embedding block) to obtain the high-dimensional features F, F_ma, F_ms and F_ml; the embedding module consists of two convolutional layers and two activation layers (ReLU):
the first convolution maps each kind of information to a space of dimension C, and the second convolution maps them to high-dimensional spaces of dimensions C_1, C_2, C_3 and C_4 respectively; the convolution kernels corresponding to different motion information are independent of each other and do not share parameters; taking the static information I_new as an example, the two-stage mapping of the embedding module is shown in equation (10):
F = σ(W_2(σ(W_1·I_new + b_1)) + b_2)   (10)
where σ denotes the activation function, W_1, b_1 are the parameters of the first convolution, and W_2, b_2 are the parameters of the second convolution; all parameters are obtained by learning, and I_new denotes the static information;
2.3, multi-scale motion information fusion:
the various kinds of information are fused by a stacking operation (concat) to obtain the dynamic representation Z of the skeleton, as shown in equation (11); this makes Z contain multi-scale motion information and further improves the network's ability to adapt to actions with different change speeds and durations;
Z = concat(F, F_ma, F_ms, F_ml)   (11)
the four high-dimensional features are fused to obtain Z, which is the output of the multi-scale motion information fusion module.
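Sections 2.2 and 2.3 can be sketched as follows (an illustrative sketch with random weights; a 1×1 convolution over joints is written as a per-joint linear map, and the output dimensions 256/128/128/128 follow the embodiment, with the fourth stream's dimension assumed since only three values are listed there):

```python
import numpy as np

def embed(x, W1, b1, W2, b2):
    """Embedding block: two 1x1 convolutions (per-joint linear maps) + ReLU."""
    relu = lambda z: np.maximum(z, 0)
    return relu(relu(x @ W1 + b1) @ W2 + b2)   # eq. (10)

rng = np.random.default_rng(0)
T, N = 8, 25
features = []
for c_out in (256, 128, 128, 128):   # static, adaptive, short, long streams
    # each stream has its own, unshared parameters
    W1, b1 = rng.standard_normal((3, 64)), np.zeros(64)
    W2, b2 = rng.standard_normal((64, c_out)), np.zeros(c_out)
    x = rng.standard_normal((T, N, 3))           # stand-in for one stream
    features.append(embed(x, W1, b1, W2, b2))

# stacking along channels gives the dynamic representation Z (eq. 11)
Z = np.concatenate(features, axis=-1)            # (T, N, 640)
```

Because the streams are concatenated rather than summed, each scale of motion information keeps its own channels inside Z.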
5. The method of claim 2, wherein the method comprises: the multi-stream spatiotemporal relative transform model comprises:
3.1, constructing a space topological graph based on a framework:
in addition to the original joint points of the skeleton, this step introduces a virtual node which, together with all joint points, forms a new spatial topological graph used as model input; the introduced virtual node both collects and integrates information from all joint points and distributes the integrated global information back to every joint point, and is named the spatial relay node;
meanwhile, two types of connection are established among the nodes, namely spatial inherent connections and spatial virtual connections, to construct the spatial topological graph of the skeleton; a spatial graph structure containing n joint points has n−1 spatial inherent connections;
3.2, designing a space relative Transformer module:
this module comprises a spatial joint point update submodule (SJU) and a spatial relay node update submodule (SRU); connections are established for remote joint points in the spatial domain by alternately updating the SJU and SRU modules; the model input is the joint point sequence {J_t^1, …, J_t^N} of the t-th frame skeleton, where N is the number of joint points in the frame and N(i) denotes the set of labels of the joint points adjacent to joint point J_t^i; each node i has a corresponding query vector q_t^i, key vector k_t^i and value vector v_t^i;
In the spatial joint point update submodule (SJU), for any joint point J_t^i, the query vector q_t^i of the joint point is first dot-multiplied with the key vector k_t^j of each of its neighbor nodes to obtain the influence of each neighbor node on the joint point, as shown in equation (12):

α_t^{ij} = q_t^i · (k_t^j)^T   (12)

where α_t^{ij} represents the influence strength of node j on node i; the neighbor nodes include the adjacent joint points N(i), the spatial relay node R_t and the joint point itself, and r denotes the label of the spatial relay node;

after the influence strengths α_t^{ij} are obtained, each is multiplied by the value vector v_t^j of the corresponding neighbor node, and the products are summed to give the updated value of the joint point, as shown in equation (13):

J̃_t^i = Σ_{j ∈ N(i) ∪ {r, i}} softmax_j( α_t^{ij} / √d_k ) · v_t^j   (13)

where J̃_t^i is the result of one update of joint point J_t^i by the SJU submodule, which aggregates both local and global information; d_k is the channel dimension of the key vectors and serves as a normalization factor, and softmax_j normalizes the influence strengths over all neighbor nodes j;
so that the spatial relay node reasonably and fully collects and integrates the information of every joint point, the spatial relay node update submodule (SRU) also uses a dot-product operation to compute the influence of each joint point on the relay node, and integrates the information of all joint points into global information according to these influence strengths; the influence strength β_t^i is obtained by multiplying the query vector q_t^r of the relay node with the key vector k_t^i of each joint point, as shown in equation (14):

β_t^i = q_t^r · (k_t^i)^T   (14)

the update of the spatial relay node is shown in equation (15), where β_t^i represents the influence strength of joint point J_t^i on the spatial relay node R_t, and v_t^i are the value vectors of all nodes:

R̃_t = Σ_{i=1}^{N} softmax_i( β_t^i / √d_k ) · v_t^i   (15)
the alternate updating of the joint points and the spatial relay nodes realizes the exchange of information among the joint points, and finally realizes the goal that each joint point simultaneously collects the information of the neighbor joint points and the remote joint points;
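One SJU/SRU alternation can be sketched in numpy as follows (an illustrative single-head sketch with hypothetical names; the claimed module uses multi-head attention and learned projections per layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def srt_step(J, R, adj, Wq, Wk, Wv):
    """One alternation of spatial joint update (SJU) and relay update (SRU).
    J: (N, C) joints; R: (C,) relay; adj: (N, N) 0/1 adjacency with self loops."""
    d_k = Wk.shape[1]
    q, k, v = J @ Wq, J @ Wk, J @ Wv
    qr, kr, vr = R @ Wq, R @ Wk, R @ Wv
    # SJU: each joint attends over its physical neighbors plus the relay
    scores = q @ k.T / np.sqrt(d_k)                  # eq. (12)
    scores = np.where(adj > 0, scores, -np.inf)      # mask non-neighbors
    rel = (q @ kr) / np.sqrt(d_k)                    # relay column
    attn = softmax(np.concatenate([scores, rel[:, None]], axis=1), axis=1)
    J_new = attn[:, :-1] @ v + attn[:, -1:] * vr     # eq. (13)
    # SRU: the relay attends over every joint         eqs. (14)-(15)
    w = softmax(qr @ k.T / np.sqrt(d_k))
    R_new = w @ v
    return J_new, R_new
```

Masking with −∞ before the softmax restricts each joint's local term to its inherent connections, while the extra relay column supplies the global term.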
3.3, constructing a time topological graph based on the skeleton sequence:
a time relay node is introduced when a time topological graph is constructed, and all joints are connected with each other through time inherent connection and time virtual connection to jointly form a graph structure in a time domain;
along the time dimension, the same joint point in consecutive frames forms a new sequence; this step also connects the head and tail joint points to form a ring structure, so a sequence of n nodes contains n temporal inherent connections;
3.4, designing a TRT module:
the temporal relative Transformer module (TRT) comprises a temporal joint point update submodule (TJU) and a temporal relay node update submodule (TRU) and is used to extract temporal features; the module treats each joint point of the skeleton as an independent node and extracts its temporal features from the sequence formed by the same joint point across the frame sequence; the input of the TRT module is the sequence of the same joint point over all frames; the joint point in the i-th frame has a corresponding query vector q_i, key vector k_i and value vector u_i, and the temporal relay node R_v has a corresponding query vector q_r, key vector k_r and value vector u_r;
In the TJU submodule, each joint point to be updatedCollecting information of neighbor nodes through virtual connection to perform self-updating; the influence calculation formula of the neighbor node is shown as (16):
wherein the content of the first and second substances,indicating the same node or time relay node R in the jth frame v The influence strength on a certain joint point in the ith frame,pair of representationsPerforming transposition processing; joint pointIs as shown in equation (17):
all query vectorsAre combined into a matrix Q v ∈R C×1×t All key vectorsCombined into a matrix K v ∈R C×B×t All value vectorsAre combined into a matrix V v ∈R C×B×t (ii) a The matrix form definition of the influence is shown in formula (18):
in the TRU submodule, the temporal relay node R_v collects information from the other frames through the virtual connections, thereby completing its own update:

R̃_v = Σ_j softmax_j( β_v^j / √d_k ) · u_j   (19)

where β_v^j = q_r · (k_j)^T represents the influence strength of the joint point in the j-th frame on the relay node R_v, and 1/√d_k is the scaling factor;
3.5, packaging an ST-RT module:
the ST-RT module is obtained by connecting an SRT module and a TRT module, where the SRT module comprises the spatial joint point update submodule and the spatial relay node update submodule, and the TRT module comprises the temporal joint point update submodule and the temporal relay node update submodule; each update module is followed by a feed-forward network layer, which maps the features to a space of larger dimension to enhance the expressive capacity of the model; L× denotes that the block is repeated L times;
3.6, encapsulating the MSST-RT network:
the four ST-RT models with different input data are fused and encapsulated through a multi-stream framework to obtain the MSST-RT model; different sampling frequencies also provide complementary information for the model, with the joint and bone sequences sampled at n_1 frames and n_2 frames respectively; the skeleton data passes through the MSST-RT network to obtain the final classification prediction probability based on skeleton data.
6. The method of claim 1, wherein: in the behavior recognition network model based on image data, the joint-point-based picture cropping module selects the joint points of the hands and feet of the human body for cropping; and an end-to-end trained image block feature extraction model is encapsulated, using a temporal segment network as the basic framework, into the key image block feature extraction model.
7. The method of claim 6, wherein the method comprises: the joint point-based picture cropping module comprises:
the picture I_t of the t-th frame is represented by a matrix P_t; let the joint point N_j to be cropped have coordinates (x, y) in the image, and let the cropped picture size be l×l; then the image block obtained by cropping around the hand/foot joint point N_j in image I_t is:

B_t^j = P_t[ y − l/2 : y + l/2, x − l/2 : x + l/2 ]
besides cropping the picture centered on the joint point coordinates, optical flow is extracted from the image blocks corresponding to two adjacent frames, as shown in equation (23):
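The joint-centered cropping of this claim can be sketched as follows (an illustrative sketch; the block size of 32 pixels and the clipping of crops that would fall outside the image are assumptions, since the claim leaves both unspecified):

```python
import numpy as np

def crop_blocks(frame, joints, size=32):
    """Crop a size x size block around each hand/foot joint point.
    frame: (H, W, 3) image; joints: iterable of (x, y) pixel coordinates."""
    h, w = frame.shape[:2]
    half = size // 2
    blocks = []
    for x, y in joints:
        # keep the whole crop window inside the image bounds
        cx = int(np.clip(x, half, w - half))
        cy = int(np.clip(y, half, h - half))
        blocks.append(frame[cy - half:cy + half, cx - half:cx + half])
    return np.stack(blocks)
```

Replacing the full frame with this small stack of blocks is what reduces the image data size and hence the training cost of the image branch.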
8. The method of claim 6, wherein: the behavior recognition network based on key image blocks comprises:
5.1, designing an IBCN model:
the image blocks cropped according to the skeleton joint points have both independence and correlation; in the IBCN model, each cropped image block B_t^j is first input separately into a convolutional neural network to obtain the feature of each image block, as shown in equation (24):

F_t^j = CNN(B_t^j; W)   (24)

where CNN(·; W) denotes feature extraction of an image block by a convolutional neural network with parameters W, and the parameters are shared among the convolutional neural networks; then the features of the image blocks are spliced to obtain a new feature vector F_t, as shown in equation (25):

F_t = concat(F_t^1, …, F_t^J)   (25)
Finally, the similarity f(x_i, x_j) between an arbitrary spatial position x_i of the feature vector F_t and another position x_j is computed by dot product, as shown in equation (26):
f(x i ,x j )=softmax(θ(x i ) T ·φ(x j )) (26)
wherein θ (-) and φ (-) are 1 × 1 convolution functions;
the obtained similarity f(x_i, x_j) is used as a weight in a weighted sum with g(x_j), so that position x_i obtains information from the other positions; y_i is the result of the global information exchange of x_i, as shown in equation (27):

y_i = Σ_j f(x_i, x_j) · g(x_j)   (27)
where g(·) is a mapping function implemented by a 1×1 convolution; N·l'^2, the size of the selected feature map, serves as the normalization coefficient to avoid scale effects caused by different input sizes; when the input is a feature tensor, the formula is shown in equation (28):
where θ(·), φ(·) and g(·) are all 1×1 convolution functions and N·l'^2 is the normalization coefficient;
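The non-local information exchange of equations (26)-(27) can be sketched as follows (an illustrative sketch over flattened block positions; the 1×1 convolutions θ, φ and g are written as linear maps, and the residual connection is an assumption taken from the standard non-local block design):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(X, Wt, Wp, Wg):
    """Embedded-Gaussian non-local block.
    X: (L, C) features of all image-block positions;
    Wt, Wp: (C, d) embeddings for theta and phi; Wg: (C, C) for g."""
    theta, phi, g = X @ Wt, X @ Wp, X @ Wg
    f = softmax(theta @ phi.T, axis=-1)   # pairwise similarity, eq. (26)
    Y = f @ g                             # weighted aggregation, eq. (27)
    return X + Y                          # residual connection (assumed)
```

Every output position is a mixture of all input positions, which is the global information exchange between image blocks that the claim describes.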
5.2, packaging a KBN network:
the IBCN network is encapsulated into the KBN network using the temporal segment network (TSN) as the framework; the network is divided into a spatial stream and a temporal stream, the input of the spatial stream being the image blocks and the input of the temporal stream being the optical-flow blocks; for the spatial stream, several frames are first sampled from the video by sparse sampling, and each frame is processed by the joint-point-based picture cropping module; then the key image block set corresponding to each sampled frame is input to the IBCN model to obtain a preliminary class probability prediction per frame, the IBCN models sharing parameters; finally the prediction results of all sampled frames are fused through a consensus function to obtain a video-level classification prediction, as shown in equation (29):
where KBN-S is the prediction result of the spatial stream of the KBN network, T_K represents the K-th segment sampled from the video, S_K represents the set of image blocks corresponding to the K-th sampled frame, and IBCN(S_K) represents processing of the image block set by the IBCN module; the temporal-stream prediction result is computed in the same way as the spatial stream.
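The segment consensus and two-stream fusion described in this claim can be sketched as follows (an illustrative sketch; averaging as the consensus function G, softmax as the prediction function H, and equal stream weights are all assumptions, since the claims only state that a consensus function and a fusion step are used):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kbn_consensus(segment_logits):
    """Fuse per-segment class predictions into a video-level prediction.
    segment_logits: (K, n_classes) outputs of the shared IBCN for the
    K sparsely sampled segments."""
    g = segment_logits.mean(axis=0)   # consensus function G (assumed: mean)
    return softmax(g)                 # prediction function H

def fuse_streams(p_spatial, p_temporal, w=0.5):
    """Late fusion of spatial-stream and temporal-stream probabilities."""
    return w * p_spatial + (1 - w) * p_temporal
```

The same late-fusion step can also combine the skeleton-branch and image-branch probabilities to produce the final classification prediction of the whole model.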
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211137852.6A CN115841697A (en) | 2022-09-19 | 2022-09-19 | Motion recognition method based on skeleton and image data fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115841697A true CN115841697A (en) | 2023-03-24 |
Family
ID=85575452
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115841697A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665312A (en) * | 2023-08-02 | 2023-08-29 | 烟台大学 | Man-machine cooperation method based on multi-scale graph convolution neural network |
CN116665308A (en) * | 2023-06-21 | 2023-08-29 | 石家庄铁道大学 | Double interaction space-time feature extraction method |
CN117137435A (en) * | 2023-07-21 | 2023-12-01 | 北京体育大学 | Rehabilitation action recognition method and system based on multi-mode information fusion |
CN117197727A (en) * | 2023-11-07 | 2023-12-08 | 浙江大学 | Global space-time feature learning-based behavior detection method and system |
CN117612072A (en) * | 2024-01-23 | 2024-02-27 | 中国科学技术大学 | Video understanding method based on dynamic space-time diagram |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||