CN116092189A - Bimodal human behavior recognition method based on RGB data and bone data - Google Patents
- Publication number
- CN116092189A (application CN202310010763.3A)
- Authority
- CN
- China
- Prior art keywords
- rgb
- heat map
- attention
- frame
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
- G06V10/764 — Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
- G06V10/765 — Image or video recognition using classification rules for partitioning the feature space
- G06V10/82 — Image or video recognition using neural networks
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
- G06V40/70 — Multimodal biometrics, e.g. combining information from different biometric modalities
- G06T2207/10016 — Video; image sequence
- G06T2207/10024 — Color image
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20132 — Image cropping
- Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a bimodal human behavior recognition method based on RGB data and skeleton data. By using different network structures for the two modalities, the method accurately captures motion information in the skeleton modality and spatial information in the RGB modality, addressing the difficulty of fusing skeleton and RGB information to full effect within a Transformer framework. First, a pseudo heat map is generated from the skeleton data, which avoids the instability of representing skeletons as graphs and the inability of graph representations to handle multi-person scenes. Then, a dual-stream Transformer architecture with different numbers of attention layers and different window sizes is designed, and the pseudo heat maps and RGB frames are input into the two streams at different temporal and spatial resolutions. Finally, experiments verify that the proposed behavior recognition method achieves higher accuracy and can handle behavior recognition in multi-person scenes. The dual-stream structure and skeleton heat map generation scheme of the method are suitable for behavior recognition in a variety of public surveillance settings.
Description
Technical Field
The invention relates to the field of human behavior recognition, in particular to video-based human behavior recognition, and specifically to a human behavior recognition method based on RGB data and skeletal heat map data.
Background
Human behavior recognition refers to retrieving and recognizing human behaviors from data acquired through surveillance video, motion capture, and similar sources. It has broad application prospects in public safety, intelligent transportation, medical monitoring, and production safety. Widely used human behavior recognition techniques can be divided into two categories according to their input data: recognition based on RGB video and recognition based on skeleton data. Compared with skeleton data, RGB video is very convenient to collect and provides detailed, rich appearance information such as shape, color, and texture, but it is highly sensitive to lighting conditions and shooting angles; under weak illumination, the accuracy of RGB-based behavior recognition algorithms drops sharply. In contrast, skeleton data cannot provide specific appearance information but is robust to illumination conditions and shooting angles, and good recognition accuracy can be obtained through methods such as graph convolutional networks (GCN). However, skeleton data is generally difficult to collect: accurate skeletal keypoints can be obtained with a motion capture system, but in many application scenarios such systems are difficult to deploy and popularize. The development of pose estimation algorithms provides a more convenient way to extract skeleton data: feeding an RGB video into a pose estimation algorithm yields the corresponding skeleton data, and combining RGB data with skeleton data as input to a behavior recognition algorithm can achieve higher accuracy than algorithms relying on either modality alone.
There is existing work on combining RGB data with skeleton data to improve the accuracy of human behavior recognition. Luvizon et al. couple the pose estimation and behavior recognition problems, achieving efficient pose extraction and behavior recognition through a mechanism shared between the two tasks. Das et al. propose a pose-driven spatiotemporal attention mechanism and apply it to a 3D CNN for human behavior recognition; in subsequent work, they compute spatiotemporal features by adding attention mechanisms over the topology of the skeleton. Li et al. propose a dual-stream network with three main modules: an ST-GCN module to extract skeleton features, an R(2+1)D network to extract RGB features, and a module that uses both features to enhance motion-related information in the RGB video, finally using score fusion to obtain the classification result. Cai et al. also employ a dual-stream architecture, but the input to the second stream, besides the skeleton data stream, is optical flow data with aligned keypoints extracted from the RGB video. All of the above work treats skeleton data as a topological graph; compared with RGB data, however, the graph is not a robust representation, and the loss of certain skeletal keypoints severely affects the entire skeleton. Jing and Wang propose a two-path network based on ViT whose inputs are RGB frames at different frame rates and resolutions, and fuse skeleton data with RGB data through three different fusion schemes to obtain the final classification result; the skeleton data is fused by encoding the skeleton into a token as an embedding alongside the RGB data. This fusion scheme still does not avoid the non-robustness of skeleton data, and it increases the feature dimension and therefore the algorithmic complexity.
Disclosure of Invention
The invention aims to provide a Transformer-based skeleton and RGB bimodal behavior recognition method. A dual-stream Transformer architecture is presented with RGB frames and skeletal heat maps as inputs. The RGB frames and skeletal heat maps are input into the Transformer at different temporal and spatial resolutions, and the network uses different numbers of attention layers and different window sizes for the two inputs. In addition, to avoid the instability caused by representing skeletons as graphs, a heat map is generated for the skeleton, which improves stability and makes behavior recognition in multi-person scenes tractable. A fusion scheme based on lateral connections between the two streams is proposed, which reduces the influence of noisy backgrounds on the behavior recognition algorithm and addresses the loss of key appearance information in person-interaction behaviors.
The aim of the invention is achieved as follows:
a dual stream transducer architecture with RGB and skeletal modalities as inputs is presented. Through the use of the posture estimation algorithm and the generation of the bone heat map, the problem of reduced stability of graphically expressed bones can be avoided, and the fusion with RGB information can be more conveniently carried out. Based on different spatial resolutions of input and different structures in the network, the RGB stream can capture spatial information and the skeleton stream can accurately capture motion information. Finally, the fusion method provided by the invention is used in the network to fully fuse the appearance information in the RGB mode and the action information in the skeleton mode, thereby solving the problem of insufficient single-mode information and reducing the influence of noisy background.
The specific procedure is as follows:
a dual-flow RGB and skeleton dual-mode behavior recognition method based on a transducer comprises the following steps:
and step 1, acquiring skeleton information in the RGB video by using a posture estimation algorithm.
And 2, generating a bone heat map.
And 3, sampling the RGB video and the bone heat map.
And 4, inputting a double-flow transducer structure.
And 5, acquiring classification token information to carry out transverse fusion.
And 6, acquiring network output, and mapping the network output into a classification result in the linear classifier.
Further, in step 1, HRNet, pre-trained on the COCO-keypoint dataset, is used to perform pose estimation on the RGB video.
Further, in step 2, after the 2D pose is obtained from the result of step 1, a skeletal heat map is generated as follows. For each skeletal joint $k$, the heat map value at pixel $(i, j)$ is

$$K_k(i,j) = \exp\!\left(-\frac{(i - x_k)^2 + (j - y_k)^2}{2\sigma^2}\right) \cdot c_k$$

where $(x_k, y_k)$ are the coordinates of the $k$-th joint, $c_k$ is its confidence, and $\sigma$ controls the spread of the Gaussian.
For each limb, i.e. the segment between joints $a_k$ and $b_k$:

$$L_k(i,j) = \exp\!\left(-\frac{D\big((i,j),\, \mathrm{seg}[a_k, b_k]\big)^2}{2\sigma^2}\right) \cdot \min(c_{a_k}, c_{b_k})$$

where $D((i,j), \mathrm{seg}[a_k, b_k])$ is the distance from the pixel to the line segment $\mathrm{seg}[a_k, b_k]$ between the two joints. All heat maps are then stacked along the time dimension into a heat map video.
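As a sketch of the two formulas above, the following numpy snippet generates a joint heat map and a limb heat map. The Gaussian spread σ = 6 is an assumed value (the text does not specify it); the min-confidence rule for limbs follows the formula above.

```python
import numpy as np

def joint_heatmap(h, w, xk, yk, ck, sigma=6.0):
    """Gaussian heat map for one skeletal joint at (xk, yk) with confidence ck."""
    i, j = np.mgrid[0:h, 0:w].astype(float)  # i: row index, j: column index
    return np.exp(-((i - yk) ** 2 + (j - xk) ** 2) / (2 * sigma ** 2)) * ck

def limb_heatmap(h, w, a, b, ca, cb, sigma=6.0):
    """Heat map for a limb: Gaussian of the distance from each pixel to segment a-b."""
    i, j = np.mgrid[0:h, 0:w]
    p = np.stack([i, j], axis=-1).astype(float)               # pixel coordinates
    a, b = np.asarray(a, float), np.asarray(b, float)
    ab = b - a
    t = np.clip(((p - a) @ ab) / (ab @ ab + 1e-8), 0.0, 1.0)  # projection onto segment
    d2 = np.sum((p - (a + t[..., None] * ab)) ** 2, axis=-1)  # squared distance to segment
    return np.exp(-d2 / (2 * sigma ** 2)) * min(ca, cb)

hm = joint_heatmap(224, 224, xk=100, yk=80, ck=0.9)
print(hm.shape, hm.max())  # peak value equals the joint confidence, 0.9
```

The peak of each joint map sits exactly at the joint coordinates and is scaled by the joint's confidence, so low-confidence detections contribute weak responses rather than hard failures.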
Further, in step 3, the heat map video and the RGB video obtained in step 2 are uniformly sampled: for the heat map video, 32 heat map frames are sampled uniformly along the time dimension; for the RGB video, 8 RGB frames are sampled uniformly along the time dimension.
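The uniform sampling in step 3 can be sketched as follows; the original video length of 128 frames is only an illustrative assumption.

```python
import numpy as np

def uniform_sample(num_total, num_out):
    """Return indices of num_out frames sampled uniformly from num_total frames."""
    return np.linspace(0, num_total - 1, num_out).round().astype(int)

heat_idx = uniform_sample(128, 32)  # 32 heat map frames along the time dimension
rgb_idx = uniform_sample(128, 8)    # 8 RGB frames along the time dimension
print(len(heat_idx), len(rgb_idx))  # 32 8
```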
Further, before input to the Transformer, each RGB video is resized so that its short side is 320 pixels and then cropped to 224×224: random cropping is used in the training stage, center cropping in the validation stage, and in the test stage three 224×224 crops are taken at the upper-left, center, and lower-right spatial positions, whose softmax outputs are averaged as the final result. When the heat map video is generated, the smallest detection box covering all target persons is obtained, zero padding is applied inside the box, and the background outside the box, which is irrelevant to human behavior recognition, is cropped away. The generated heat maps are already 224×224, so no further cropping or resizing is required for the heat map stream.
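The RGB cropping policy above can be sketched as follows; the frame size is illustrative (short side resized to 320), and the three-crop test-time variant simply takes the fixed corner and center windows described in the text.

```python
import numpy as np

def random_crop(frame, size=224):
    """Training-stage crop at a random spatial position."""
    h, w = frame.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return frame[top:top + size, left:left + size]

def center_crop(frame, size=224):
    """Validation-stage crop at the spatial center."""
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

def three_crops(frame, size=224):
    """Test-stage crops: upper-left, center, lower-right, for softmax averaging."""
    h, w = frame.shape[:2]
    return [frame[:size, :size], center_crop(frame, size), frame[h - size:, w - size:]]

frame = np.zeros((320, 480, 3))  # short side resized to 320 pixels
print(center_crop(frame).shape)  # (224, 224, 3)
```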
For the RGB video and the heat map video, the same decomposition operation is used. Taking the heat map video as an example, it is decomposed into $N$ non-overlapping spatiotemporal "tubes" $x_1, x_2, \ldots, x_N \in \mathbb{R}^{t \times h \times w \times 3}$, where $N = \frac{T}{t} \cdot \frac{H}{h} \cdot \frac{W}{w}$. Next, each tube $x_i$ is linearly mapped to a token (embedding) $z_i = E x_i$. Finally, all embeddings $z_i$ are concatenated to form the sequence $z_0$. A special learnable vector $z_{cls} \in \mathbb{R}^d$, representing the classification-token embedding, is added at the first position, and a positional embedding $p_{pos} \in \mathbb{R}^{(N+1) \times d}$ is also added to this sequence.
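The tube decomposition and token embedding can be sketched with numpy as follows. The tube size (t, h, w) = (2, 16, 16) and embedding dimension d = 768 are assumed ViViT-style values, not figures stated in the text, and random matrices stand in for the learned projection and embeddings.

```python
import numpy as np

T, H, W, C = 32, 224, 224, 3       # heat map video dimensions
t, h, w, d = 2, 16, 16, 768        # tube size and embedding dim (assumed values)

video = np.random.rand(T, H, W, C)
E = np.random.rand(t * h * w * C, d)             # shared linear projection
z_cls = np.random.rand(1, d)                     # learnable classification token
N = (T // t) * (H // h) * (W // w)               # number of non-overlapping tubes

tubes = (video.reshape(T // t, t, H // h, h, W // w, w, C)
              .transpose(0, 2, 4, 1, 3, 5, 6)
              .reshape(N, -1))                   # each row: one flattened tube x_i
tokens = tubes @ E                               # z_i = E x_i
z0 = np.concatenate([z_cls, tokens])             # prepend classification token
z0 = z0 + np.random.rand(N + 1, d)               # add positional embedding p_pos
print(z0.shape)  # (N + 1, d)
```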
The RGB stream and the skeleton stream use the same attention mechanism. The mechanism first computes frame-level spatial attention among tokens at the same time index. The skeleton stream uses $L' = 10$ spatial attention layers and the RGB stream uses $L = 12$. For the spatial attention module of layer $l$, the query/key/value are first computed:

$$q^{(l,a)}_{(p,t)} = W_Q^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}_{(p,t)}\big), \quad k^{(l,a)}_{(p,t)} = W_K^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}_{(p,t)}\big), \quad v^{(l,a)}_{(p,t)} = W_V^{(l,a)}\,\mathrm{LN}\big(z^{(l-1)}_{(p,t)}\big)$$

where $a = 1, \ldots, A$ indexes the attention head, $p = 1, \ldots, N$ indexes the spatial position, and $t = 1, \ldots, F$ indexes the time position; $z^{(l-1)}$ denotes the output of the previous layer. Spatial attention is then computed as

$$\alpha^{(l,a)\,\mathrm{space}}_{(p,t)} = \mathrm{softmax}\!\left(\frac{q^{(l,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \Big[\, k^{(l,a)}_{(0,0)},\ \big\{ k^{(l,a)}_{(p',t)} \big\}_{p'=1,\ldots,N} \Big]\right)$$

so that each token attends to the classification token and all tokens of its own frame.
The output of layer $l$ is obtained by

$$s^{(l,a)}_{(p,t)} = \alpha^{(l,a)}_{(p,t),(0,0)}\, v^{(l,a)}_{(0,0)} + \sum_{p'=1}^{N} \alpha^{(l,a)}_{(p,t),(p',t)}\, v^{(l,a)}_{(p',t)}, \qquad z'^{(l)}_{(p,t)} = W_O \begin{bmatrix} s^{(l,1)}_{(p,t)} \\ \vdots \\ s^{(l,A)}_{(p,t)} \end{bmatrix} + z^{(l-1)}_{(p,t)}$$

where $s$ denotes the output vectors of all attention heads, concatenated and projected by $W_O$.
After the $L'$ layers are computed, the output is passed to the MLP block, which contains two linear layers separated by a GELU activation:

$$\mathrm{MLP}(x) = W_2\, \mathrm{GELU}(W_1 x + b_1) + b_2$$
at this time, the data owner obtains the frame-level spatial attention expressionWhich can be regarded as classification featuresSo we can express it as a frame level expression h i ∈R d And the frame-level expressions are all merged into:
after the spatial attention block output is obtained, double-flow fusion can be performed.
Further, in step 5, the frame-level expressions obtained from the bone stream in step 4 are averaged in groups of four consecutive frames, so that they can be combined with the frame-level expressions obtained from the RGB stream at the same time positions.
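The lateral alignment in step 5 can be sketched as follows: the 32 bone-stream frame expressions are averaged in groups of four to match the 8 RGB-stream time positions. How the aligned expressions are then combined is not fully specified in the text; concatenation is shown here as one plausible choice.

```python
import numpy as np

d = 768
bone_frames = np.random.rand(32, d)   # frame-level expressions from the bone stream
rgb_frames = np.random.rand(8, d)     # frame-level expressions from the RGB stream

# Average every group of 4 bone frames so both streams share 8 time positions,
# then fuse the aligned expressions (concatenation is an assumed fusion choice).
bone_avg = bone_frames.reshape(8, 4, d).mean(axis=1)
fused = np.concatenate([rgb_frames, bone_avg], axis=-1)
print(bone_avg.shape, fused.shape)  # (8, 768) (8, 1536)
```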
Further, the output of the spatial attention module is sent to the temporal attention module, whose attention is computed as

$$\alpha^{(l,a)\,\mathrm{time}}_{(p,t)} = \mathrm{softmax}\!\left(\frac{q^{(l,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \Big[\, k^{(l,a)}_{(0,0)},\ \big\{ k^{(l,a)}_{(p,t')} \big\}_{t'=1,\ldots,F} \Big]\right)$$
thereafter, the frame-level representation is obtained in line with spatial attention.
In the process of L t After 4 temporal attention layers, the obtained classification token will be input into the MLP header.
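The $L_t = 4$ temporal attention layers over the fused frame-level expressions can be sketched compactly as follows: a single head, random matrices standing in for the learned projections, residual connections kept, and the MLP head omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

F, d = 8, 64
h = np.random.rand(1 + F, d)               # classification token + frame expressions
for _ in range(4):                          # L_t = 4 temporal attention layers
    Wq, Wk, Wv = (np.random.rand(d, d) / d for _ in range(3))
    att = softmax((h @ Wq) @ (h @ Wk).T / np.sqrt(d))
    h = h + att @ (h @ Wv)                  # residual connection
cls_token = h[0]                            # passed on to the MLP head
print(cls_token.shape)  # (64,)
```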
Further, in step 6, the output of step 5 is input into the linear classification layer to obtain class scores. The scores of the two streams are averaged to obtain the final classification result.
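The final score fusion of step 6 can be sketched as follows; the class scores are illustrative numbers, not values from the text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rgb_scores = np.array([1.2, 0.3, 2.5])   # class scores from the RGB stream
bone_scores = np.array([0.8, 0.1, 3.0])  # class scores from the bone stream

# Average the per-stream softmax scores to obtain the final classification result.
final = (softmax(rgb_scores) + softmax(bone_scores)) / 2
print(int(final.argmax()))  # predicted class index: 2
```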
The invention has the following positive effects:
(1) A dual-stream Transformer framework for the RGB and skeleton modalities is provided, which integrates the advantages of skeleton data and RGB data and thereby improves the accuracy of behavior recognition.
(2) The framework generates skeletal heat maps from the skeleton data, which avoids the sensitivity to different pose extractors that graph-based skeleton representations exhibit, as well as the rapidly growing computation such representations incur in multi-person scenes. The skeletal heat maps and RGB frames are then input into the dual-stream network at different temporal and spatial resolutions. Within the network, the skeletal heat map stream uses fewer attention layers, so that the bone stream can extract motion information more accurately.
(3) A new fusion scheme based on lateral connections between the two streams is provided, which fuses the information from the skeleton stream with the information from the RGB stream.
Drawings
Fig. 1 is a structural diagram of the dual-stream framework.
Fig. 2 is a diagram of the skeletal stream attention mechanism.
Fig. 3 is a detail view of the dual-stream lateral connection.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
Step 1: as shown in fig. 1, skeleton data is acquired from the original RGB video using a pose extractor, and a skeletal heat map is generated in step 2.
Step 3: as shown in fig. 1, the skeletal heat maps and RGB frames are input into the dual-stream Transformer at different temporal resolutions; the frame rate of the heat map video is 4 times that of the RGB frames. In step 4, the same attention mechanism is used within the Transformer: spatial attention is computed first to obtain the frame-level expressions, the lateral connection is performed in step 5, temporal attention is then computed, and the result is finally input into the MLP head. The class scores of the two streams are obtained and averaged to produce the final classification result.
The detailed structure of step 4 is shown in fig. 2. The input data is first decomposed into non-overlapping tubes. After mapping to tokens and adding the classification token, the sequence is input into the attention mechanism; after $L'$ layers of spatial attention, the result is output to temporal attention and finally input into the MLP head.
The detailed structure of step 5 is shown in fig. 3. After the spatial attention layers, both streams obtain frame-level expressions; the frame-level expressions of the bone stream are averaged, the expressions of the two streams are combined at the same time positions, and the result is input into the temporal attention layers.
Claims (7)
1. A bimodal human behavior recognition method based on RGB data and skeleton data, characterized by comprising the following steps:
step 1, acquiring skeleton information from the RGB video using a pose estimation algorithm;
step 2, generating skeletal heat maps;
step 3, sampling the RGB video and the heat map video;
step 4, inputting both into the dual-stream Transformer structure;
step 5, acquiring classification token information for lateral fusion;
step 6, acquiring the network output and mapping it to a classification result in the linear classifier.
2. The method of claim 1, wherein in step 1 HRNet, pre-trained on the COCO-keypoint dataset, is used to perform pose estimation on the RGB video.
3. The method of claim 1, wherein in step 2, after the 2D pose is obtained from the result of step 1, a skeletal heat map is generated as follows; for each skeletal joint $k$, the heat map value at pixel $(i,j)$ is

$$K_k(i,j) = \exp\!\left(-\frac{(i - x_k)^2 + (j - y_k)^2}{2\sigma^2}\right) \cdot c_k$$

where $(x_k, y_k)$ are the coordinates of the $k$-th joint and $c_k$ is its confidence;

for each limb between joints $a_k$ and $b_k$:

$$L_k(i,j) = \exp\!\left(-\frac{D\big((i,j),\, \mathrm{seg}[a_k, b_k]\big)^2}{2\sigma^2}\right) \cdot \min(c_{a_k}, c_{b_k})$$

where $D((i,j), \mathrm{seg}[a_k, b_k])$ is the distance from the pixel to the line segment $\mathrm{seg}[a_k, b_k]$ between the two joints; all heat maps are then stacked along the time dimension into a heat map video.
4. The method of claim 1, wherein in step 3 the heat map video and the RGB video obtained in step 2 are uniformly sampled: for the heat map video, 32 heat map frames are sampled uniformly along the time dimension; for the RGB video, 8 RGB frames are sampled uniformly along the time dimension.
5. The method of claim 1, wherein before input to the Transformer, each RGB video is resized so that its short side is 320 pixels and then cropped to 224×224: random cropping in the training stage, center cropping in the validation stage, and in the test stage three 224×224 crops taken at the upper-left, center, and lower-right spatial positions, whose softmax outputs are averaged as the final result; when the heat map video is generated, the smallest detection box covering all target persons is obtained, zero padding is applied inside the box, and the background outside the box, which is irrelevant to human behavior recognition, is cropped away; the generated heat maps are 224×224, so no further cropping or resizing is required;

for the RGB video and the heat map video, the same decomposition operation is used; taking the heat map video as an example, it is decomposed into $N$ non-overlapping spatiotemporal "tubes" $x_1, x_2, \ldots, x_N \in \mathbb{R}^{t \times h \times w \times 3}$, where $N = \frac{T}{t} \cdot \frac{H}{h} \cdot \frac{W}{w}$; each tube $x_i$ is linearly mapped to a token (embedding) $z_i = E x_i$; all embeddings $z_i$ are concatenated to form the sequence $z_0$; a special learnable vector $z_{cls} \in \mathbb{R}^d$, representing the classification-token embedding, is added at the first position; a positional embedding $p_{pos} \in \mathbb{R}^{(N+1) \times d}$ is also added to this sequence;
the RGB streams and the skeletal streams use the same attention mechanism; the mechanism first calculates the frame-level spatial attention at the same time pointer; the spatial attention layer number of the bone stream is L' =10, and the spatial attention layer number of the rgb stream is l=12; for the spatial attention module of layer l, first calculate query/key/value:
where a=1,..a represents the attention head, p=1..n represents a spatial position,representing a time position;an output representing the previous layer; spatial attention is then calculated:
the output of layer $l$ is obtained by

$$s^{(l,a)}_{(p,t)} = \alpha^{(l,a)}_{(p,t),(0,0)}\, v^{(l,a)}_{(0,0)} + \sum_{p'=1}^{N} \alpha^{(l,a)}_{(p,t),(p',t)}\, v^{(l,a)}_{(p',t)}, \qquad z'^{(l)}_{(p,t)} = W_O \begin{bmatrix} s^{(l,1)}_{(p,t)} \\ \vdots \\ s^{(l,A)}_{(p,t)} \end{bmatrix} + z^{(l-1)}_{(p,t)}$$

where $s$ denotes the output vectors of all attention heads, concatenated and projected by $W_O$;
after calculating the L' layer, the output is passed to the MLP layer, which contains one gel function and two linear layers separated by the gel function:
at this time, the data owner obtains the frame-level spatial attention expressionWhich can be regarded as classification featuresSo we can express it as a frame level expression h i ∈R d And the frame-level expressions are all merged into:
after the spatial attention block output is obtained, double-flow fusion can be performed.
6. The method of claim 1, wherein in step 5 the frame-level expressions obtained from the bone stream in step 4 are averaged in groups of four consecutive frames, so that they can be combined with the frame-level expressions obtained from the RGB stream at the same time positions;

further, the output of the spatial attention module is sent to the temporal attention module, whose attention is computed as

$$\alpha^{(l,a)\,\mathrm{time}}_{(p,t)} = \mathrm{softmax}\!\left(\frac{q^{(l,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \Big[\, k^{(l,a)}_{(0,0)},\ \big\{ k^{(l,a)}_{(p,t')} \big\}_{t'=1,\ldots,F} \Big]\right)$$

thereafter, the frame-level representation is obtained in the same way as for spatial attention; after $L_t = 4$ temporal attention layers, the obtained classification token is input into the MLP head.
7. The method of claim 1, wherein in step 6 the output of step 5 is input into the linear classification layer to obtain class scores; the scores of the two streams are averaged to obtain the final classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310010763.3A CN116092189A (en) | 2023-01-05 | 2023-01-05 | Bimodal human behavior recognition method based on RGB data and bone data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310010763.3A CN116092189A (en) | 2023-01-05 | 2023-01-05 | Bimodal human behavior recognition method based on RGB data and bone data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116092189A true CN116092189A (en) | 2023-05-09 |
Family
ID=86187874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310010763.3A Pending CN116092189A (en) | 2023-01-05 | 2023-01-05 | Bimodal human behavior recognition method based on RGB data and bone data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116092189A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117114083A (en) * | 2023-10-25 | 2023-11-24 | 阿米华晟数据科技(江苏)有限公司 | Method and device for constructing attitude estimation model and attitude estimation method |
CN117114083B (en) * | 2023-10-25 | 2024-02-23 | 阿米华晟数据科技(江苏)有限公司 | Method and device for constructing attitude estimation model and attitude estimation method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |