CN117894065A - Multi-person scene behavior recognition method based on skeleton key points - Google Patents


Publication number
CN117894065A
CN117894065A (application CN202311711306.3A)
Authority
CN
China
Prior art keywords
human body
heat map
network
frame image
boundary frame
Prior art date
Legal status
Pending
Application number
CN202311711306.3A
Other languages
Chinese (zh)
Inventor
黎科宏
贺龙泽
王艺凡
陈雪峰
齐宏拓
冯亮
刘界鹏
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202311711306.3A
Publication of CN117894065A


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-person scene behavior recognition method based on skeleton key points, which comprises the following steps: 1) acquiring a source video, and detecting the human bodies in the source video by using a human body detector to acquire human body boundary frame images; 2) aligning the human body in the human body boundary frame image to the central position by using a spatial transformation network; 3) obtaining a heat map of the spatially transformed human body boundary frame image by using a FastPose network; 4) processing the heat map of the human body boundary frame image by using a two-step heat map normalization method; 5) restoring the human body to the original position by using a spatial inverse transformation network; 6) eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type. According to the invention, the video is input into the behavior recognition model to obtain heat maps, the coordinates of the human skeleton points are extracted from the heat maps, and the specific behavior type is then inferred from the relations and changes among the skeleton points.

Description

Multi-person scene behavior recognition method based on skeleton key points
Technical Field
The invention relates to the field of behavior recognition in computer vision, in particular to a multi-person scene behavior recognition method based on skeleton key points.
Background
With the rapid development of artificial intelligence technology, applications of multi-person behavior recognition have become increasingly mature, with broad prospects in fields such as human-computer interaction and intelligent construction sites.
Most existing methods are behavior recognition methods based on optical flow, which feed the video and optical flow information together into a behavior recognition network model. However, optical flow computation is generally time-consuming and complex, requiring efficient algorithms and hardware support. In addition, optical flow handles large displacements, occlusion, background clutter and the like poorly, which may cause false or missed detections.
Disclosure of Invention
The invention aims to provide a multi-person scene behavior recognition method based on skeleton key points, which comprises the following steps:
1) Acquiring a source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Obtaining a heat map of the human body boundary frame image after the space transformation by using a FastPose network;
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
Further, the human detector includes a YOLO V3 network.
Further, the aligning the human body in the human body boundary frame image to the center position by the space transformation network means that 2D affine transformation is performed on the human body boundary frame image.
Further, a transformation formula of the 2D affine transformation is as follows:
$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation; $\mathbb{R}$ denotes the set of real numbers.
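As an illustrative sketch only (not part of the patent), the 2×3 affine matrix $[\theta_1\ \theta_2\ \theta_3]$ can be applied to a coordinate as follows; the matrix values are arbitrary example numbers:

```python
import numpy as np

# 2x3 affine matrix [theta1 theta2 theta3]: each theta is a 2-vector column.
# Example values: identity linear part plus a translation of (10, 5).
theta = np.array([[1.0, 0.0, 10.0],
                  [0.0, 1.0,  5.0]])

def affine_transform(theta, xy):
    """Map a source coordinate (x_s, y_s) to (x_t, y_t) via theta @ [x_s, y_s, 1]^T."""
    x_s, y_s = xy
    return theta @ np.array([x_s, y_s, 1.0])

print(affine_transform(theta, (2.0, 3.0)))  # -> [12.  8.]
```

Here the linear part is the identity, so the transform reduces to a translation by (10, 5).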
Further, the FastPose network comprises a ResNet backbone network, a plurality of dense up-sampling convolution modules and a convolution layer;
the ResNet backbone network is used for extracting the characteristics of the human body boundary frame image after the space transformation;
the dense up-sampling convolution module is used for up-sampling the extracted features;
and outputting a heat map of the human body boundary frame image after spatial transformation by the convolution layer.
Further, the step of processing the heat map of the human body boundary box image by using the two-step heat map normalization method comprises the following steps:
4.1) Normalizing the heat map element by element to generate a confidence heat map C; the bone confidence is Conf = max(C), where max(C) is the maximum value of the heat map;
4.2 Performing global normalization to generate a probability heat map P to predict skeletal point location coordinates;
wherein $p_x$ is as follows:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, characterizing the probability that the key point appears at each pixel position; $p_x$, the pixel probability at position x in the probability heat map, is obtained by globally normalizing the confidence heat map.
Further, the spatial inverse transformation network restores the human body to the original position by using an inverse transformation equation;
the inverse transformation equation is shown below:
$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$; $(x_t, y_t)$ and $(x_s, y_s)$ respectively represent the coordinates before and after the inverse transformation.
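A minimal sketch of the round trip implied by the forward and inverse transformations, with an arbitrary example θ: γ is computed from θ as described above, and applying θ and then γ recovers the original point:

```python
import numpy as np

theta = np.array([[2.0, 0.0, 10.0],
                  [0.0, 2.0,  5.0]])   # example [theta1 theta2 theta3]

# [gamma1 gamma2] = [theta1 theta2]^{-1};  gamma3 = -[gamma1 gamma2] @ theta3
lin = theta[:, :2]
g12 = np.linalg.inv(lin)
g3 = -g12 @ theta[:, 2]
gamma = np.hstack([g12, g3[:, None]])   # the 2x3 inverse-transform matrix

def apply(mat, xy):
    """Apply a 2x3 affine matrix to a 2D point in homogeneous form."""
    return mat @ np.array([xy[0], xy[1], 1.0])

pt = np.array([3.0, 4.0])
fwd = apply(theta, pt)    # forward spatial transform
back = apply(gamma, fwd)  # inverse transform recovers the original point
print(back)
```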
Further, the step of eliminating redundant coordinates using the parameterized pose non-maximal suppression method includes:
6.1) Calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \quad (4)$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, and each skeleton point having a coordinate and a confidence; Λ is a hyperparameter set comprising the three pose distance metric parameters σ1, σ2 and λ; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
6.2) Determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold η; if so, deleting one of the poses $P_i$ and $P_j$.
Further, the bone point matching degree function and the spatial distance function are respectively as follows:
$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

wherein $k_i^n$ and $k_j^n$ are the nth skeleton key points of poses $P_i$ and $P_j$; $B(k_i^n)$ is the box region centered at $k_i^n$; $c_i^n$ and $c_j^n$ are the corresponding confidence heat map values.
Further, the LSTM neural network comprises a full connection layer, a relu layer, a dropout layer and an output layer.
The technical effects of the invention are as follows: the video is input into the behavior recognition model to obtain heat maps, the coordinates of the human skeleton points are extracted from the heat maps, and the specific behavior type is then inferred from the relations and changes among the skeleton points.
The invention is not disturbed by factors such as illumination changes, background complexity and differences in human appearance, and can effectively eliminate noise.
Drawings
FIG. 1 is a flow diagram of an overall behavior recognition network;
FIG. 2 is a schematic diagram of a backbone network during a gesture detection phase;
fig. 3 is a schematic diagram of a behavior recognition structure.
Detailed Description
The present invention is further described below with reference to examples, but it should not be construed that the scope of the above subject matter of the present invention is limited to the following examples. Various substitutions and alterations are made according to the ordinary skill and familiar means of the art without departing from the technical spirit of the invention, and all such substitutions and alterations are intended to be included in the scope of the invention.
Example 1:
referring to fig. 1 to 3, a multi-person scene behavior recognition method based on skeletal keypoints includes the following steps:
1) Acquiring a source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Obtaining a heat map of the human body boundary frame image after the space transformation by using a FastPose network;
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
The human detector includes a YOLO V3 network.
The space transformation network aligns the human body in the human body boundary frame image to the central position, namely, performing 2D affine transformation on the human body boundary frame image.
The transformation formula of the 2D affine transformation is as follows:
$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation; $\mathbb{R}$ denotes the set of real numbers.
The FastPose network comprises a ResNet backbone network, a plurality of dense up-sampling convolution modules and a convolution layer;
the ResNet backbone network is used for extracting the characteristics of the human body boundary frame image after the space transformation;
the dense up-sampling convolution module is used for up-sampling the extracted features;
and outputting a heat map of the human body boundary frame image after spatial transformation by the convolution layer.
The step of processing the heat map of the human body boundary frame image by using the two-step heat map normalization method comprises the following steps:
4.1) Normalizing the heat map element by element to generate a confidence heat map C; the bone confidence is Conf = max(C), where max(C) is the maximum value of the heat map;
4.2 Performing global normalization to generate a probability heat map P to predict skeletal point location coordinates;
wherein $p_x$ is as follows:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, characterizing the probability that the key point appears at each pixel position; $p_x$, the pixel probability at position x in the probability heat map, is obtained by globally normalizing the confidence heat map.
The space inverse transformation network restores the human body to the original position by utilizing an inverse transformation equation;
the inverse transformation equation is shown below:
$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$; $(x_t, y_t)$ and $(x_s, y_s)$ respectively represent the coordinates before and after the inverse transformation.
The step of eliminating redundant coordinates by using the parameterized pose non-maximum suppression method comprises the following steps:
6.1) Calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \quad (4)$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, and each skeleton point having a coordinate and a confidence; Λ is a hyperparameter set comprising the three pose distance metric parameters σ1, σ2 and λ; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
6.2) Determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold η; if so, deleting one of the poses $P_i$ and $P_j$.
The bone point matching degree function and the spatial distance function are respectively as follows:
$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

wherein $k_i^n$ and $k_j^n$ are the nth skeleton key points of poses $P_i$ and $P_j$; $B(k_i^n)$ is the box region centered at $k_i^n$; $c_i^n$ and $c_j^n$ are the corresponding confidence heat map values.
The LSTM neural network comprises a full connection layer, a relu layer, a dropout layer and an output layer.
Example 2:
a multi-person scene behavior recognition method based on skeleton key points comprises the following steps:
1) Acquiring a source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Acquiring a heat map of the human body boundary frame image after the space transformation by using a FastPose network (human body key point detection network);
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
Example 3:
the technical content of the multi-person scene behavior recognition method based on the skeleton key points is the same as that of the embodiment 2, and further, the human body detector comprises a YOLO V3 network.
Example 4:
the multi-person scene behavior recognition method based on skeleton key points includes the technical content as in any one of embodiments 2-3, and further, the space transformation network aligns a human body in a human body boundary frame image to a central position, namely, performing 2D affine transformation on the human body boundary frame image.
Example 5:
the technical content of the multi-person scene behavior recognition method based on skeleton key points is as in any one of embodiments 2-4, and further, a transformation formula of the 2D affine transformation is as follows:
$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation.
Example 6:
the multi-person scene behavior recognition method based on skeleton key points has the technical content as in any one of embodiments 2-5, and further, the FastPose network comprises a ResNet backbone network, a plurality of dense up-sampling convolution modules and a convolution layer;
the ResNet backbone network is used for extracting the characteristics of the human body boundary frame image after the space transformation;
the dense up-sampling convolution module is used for up-sampling the extracted features;
and outputting a heat map of the human body boundary frame image after spatial transformation by the convolution layer.
Example 7:
the multi-person scene behavior recognition method based on skeleton key points has the technical content as in any one of embodiments 2-6, and further, the step of processing the heat map of the human body boundary frame image by using the two-step heat map normalization method comprises the following steps:
4.1) Normalizing the heat map element by element to generate a confidence heat map C; the bone confidence is Conf = max(C), where max(C) is the maximum value of the heat map;
4.2 Performing global normalization to generate a probability heat map P to predict skeletal point location coordinates;
wherein $p_x$ is as follows:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, characterizing the probability that the key point appears at each pixel position; $p_x$, the pixel probability at position x in the probability heat map, is obtained by globally normalizing the confidence heat map.
Example 8:
The technical content of the multi-person scene behavior recognition method based on skeleton key points is the same as in any one of embodiments 2-7; further, the spatial inverse transformation network restores the human body to the original position by using an inverse transformation equation;
the inverse transformation equation is shown below:
$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$; $(x_t, y_t)$ and $(x_s, y_s)$ respectively represent the coordinates before and after the inverse transformation.
Example 9:
The technical content of the multi-person scene behavior recognition method based on skeleton key points is the same as in any one of embodiments 2-8; further, the step of eliminating redundant coordinates by using the parameterized pose non-maximum suppression method comprises:
6.1) Calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \quad (4)$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, and each skeleton point having a coordinate and a confidence; Λ is a hyperparameter set comprising the three parameters σ1, σ2 and λ; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
6.2) Determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold η; if so, deleting one of the poses $P_i$ and $P_j$.
Example 10:
the technical content of the multi-person scene behavior recognition method based on skeleton key points is as in any one of embodiments 2-9, and further, a skeleton point matching degree function and a space distance function are respectively as follows:
$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

wherein $k_i^n$ and $k_j^n$ are the nth skeleton key points of poses $P_i$ and $P_j$.
Example 11:
The technical content of the multi-person scene behavior recognition method based on skeleton key points is the same as in any one of embodiments 2-10; further, the LSTM neural network comprises a fully connected layer, a ReLU layer, a dropout layer and an output layer.
Example 12:
A behavior recognition method suitable for complex multi-person scenes is characterized in that the multi-person scene behavior recognition method based on skeleton key points first obtains the video source to be recognized; a human body detector such as YOLOV3 then detects the persons in each video frame; the detection results of the human body detector are corrected if needed; a pose estimator then generates skeleton points for the persons in the frame and labels their identities; post-processing follows to obtain more accurate human skeleton points; and finally, a trained LSTM neural network performs the final behavior recognition.
The specific content comprises the following aspects:
in a first aspect, embodiments of the present application entail first identifying a human body part in a video, the method comprising:
inputting a video to be identified, wherein the video comprises a plurality of human body targets to be detected;
detecting all human bodies in the video by using a human body detector such as YOLOV3 or EfficientDet to obtain a plurality of bounding boxes containing a single person;
the human body boundary box obtained in the previous step may have a certain error, so that affine transformation is performed on the input image by using a spatial transformation network, thereby aligning the human body to the central position.
In a second aspect, embodiments of the present application obtain skeletal points of a human body in a human body bounding box, the method comprising:
inputting the obtained human body region into a single person posture estimator, including a new network FastPose, and then outputting human skeleton points of the region, wherein the specific process comprises the following steps:
using ResNet as a backbone network of FastPose to extract features of the input cropped image;
upsampling the extracted features using three dense upsampling convolution (Dense Upsampling Convolution, DUC) modules;
the DUC module first performs a 2D convolution on the feature map of size h × w × c, and then transforms the feature map into size 2h × 2w × c′ through a PixelShuffle operation;
generating a heat map through the 1X 1 convolution layer;
according to the obtained heat map, predicting the position coordinates of the bone points by using a two-step heat map normalization method, which specifically comprises the following steps:
in a first step, element-by-element normalization is performed to generate a confidence heat map C: $c_x = \mathrm{sigmoid}(z_x)$, wherein $z_x$ represents the unnormalized logit value at position x, and $c_x$ represents the confidence heat map value at position x;
the bone confidence is then represented by the maximum of the heat map: conf=max (C);
in a second step, global normalization is performed to generate a probability heat map P: $p_x = c_x / \sum_{x'} c_{x'}$; the skeleton point position coordinates are then predicted according to $p_x$;
inputting the obtained result of the bone points in the human body boundary box into a spatial inverse transformation network, and recovering the image to the original position;
the single person bounding box in the result may contain redundant gestures, and the redundant gestures need to be eliminated by using parameterized gesture non-maximum suppression, which specifically includes:
first, the pose with the highest confidence is selected as the reference, and the elimination criterion, namely the pose distance metric $d(P_i, P_j \mid \Lambda)$, is applied to eliminate poses close to it; a threshold η is defined as the elimination criterion;
$P_i$ and $P_j$ are two poses; each pose consists of m skeleton points, and each skeleton point has a coordinate and a confidence;
Λ is a hyperparameter set comprising the three parameters σ1, σ2 and λ, used to adjust how the pose distance is computed;
two auxiliary functions $K_{Sim}(P_i, P_j \mid \sigma_1)$ and $H_{Sim}(P_i, P_j \mid \sigma_2)$ are defined, respectively representing the skeleton point matching degree and the spatial distance of the two poses, wherein

$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

Thus, $d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2)$;
The elimination process is repeated on the set of remaining poses until redundant poses are eliminated and only unique poses are reported.
In a third aspect, in the embodiment of the present application, skeleton points in each video frame are spliced into a vector, and the vector is used as an input of a trained LSTM neural network, the output of the LSTM neural network is the probability of each action, and the maximum probability class is selected as the output, so as to obtain the final action class.
Example 13:
a behavior recognition method suitable for complex multi-person scenes comprises the following steps:
in a first aspect, embodiments of the present application entail first identifying a human body part in a video, the method comprising:
inputting a video to be identified, wherein the video comprises a plurality of human body targets to be detected;
detecting all human bodies in the video by using an off-the-shelf human body detector such as YOLOV3 or EfficientDet to obtain a plurality of bounding boxes containing a single person;
the human body bounding box obtained in the previous step may have a certain error, so affine transformation is performed on the input image by using a custom spatial transformation network, thereby aligning the human body to the central position. The spatial transformation network performs a 2D affine transformation, expressed mathematically as

$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation.
In a second aspect, embodiments of the present application obtain skeletal points of a human body in a human body bounding box, the method comprising:
inputting the obtained human body region into a single person posture estimator, including a new network FastPose, as shown in FIG. 2, and then outputting the human skeleton points of the region, wherein the specific process comprises:
using ResNet as a backbone network of FastPose to extract features of the input cropped image;
upsampling the extracted features using three dense upsampling convolution (Dense Upsampling Convolution, DUC) modules;
the DUC module first performs a 2D convolution on the feature map of size h × w × c to obtain a feature map of size h × w × 4c′, and then transforms it into a feature map of size 2h × 2w × c′ through a PixelShuffle operation;
continuously performing the DUC operation for 3 times, and generating a heat map through the 1X 1 convolution layer;
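A rough numpy sketch of the size bookkeeping of one DUC step (the 2D convolution is elided, leaving only the PixelShuffle rearrangement; the (r, r, c′) channel layout used here is an assumption for illustration, as deep learning frameworks order these channels differently):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange an (H, W, C*r*r) feature map into (H*r, W*r, C).

    Each group of r*r channels is scattered into an r x r spatial block,
    doubling height and width when r = 2."""
    h, w, crr = x.shape
    c = crr // (r * r)
    x = x.reshape(h, w, r, r, c)         # split channels into (r, r, c) blocks
    x = x.transpose(0, 2, 1, 3, 4)       # interleave the r-blocks spatially
    return x.reshape(h * r, w * r, c)

# One DUC step on an 8x6 map with 4c' = 64 channels: (8, 6, 64) -> (16, 12, 16)
feat = np.random.rand(8, 6, 64)
out = pixel_shuffle(feat)
print(out.shape)  # (16, 12, 16)
```

Three such steps upsample the backbone features by 8× before the final 1 × 1 convolution produces the heat map.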
according to the obtained heat map, predicting the position coordinates of the bone points by using a two-step heat map normalization method:
in a first step, element-by-element normalization is performed to generate a confidence heat map C: $c_x = \mathrm{sigmoid}(z_x)$, wherein $z_x$ represents the unnormalized logit value at position x, and $c_x$ represents the confidence heat map value at position x;
the bone confidence may then be represented by the maximum of the heat map: conf=max (C);
in a second step, global normalization is performed to generate a probability heat map P: $p_x = c_x / \sum_{x'} c_{x'}$; the skeleton point position coordinates are then predicted according to $p_x$;
the joint confidence is obtained from the first step, and the joint position is obtained from the heat map generated in the second step;
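The two normalization steps above can be sketched as follows (a toy heat map rather than the network's real output; the (x, y) coordinate convention is an assumption):

```python
import numpy as np

def two_step_normalize(z):
    """z: raw heat map logits of shape (H, W) for one key point."""
    # Step 1: element-wise normalization -> confidence heat map C
    c = 1.0 / (1.0 + np.exp(-z))           # c_x = sigmoid(z_x)
    conf = c.max()                          # bone confidence Conf = max(C)
    # Step 2: global normalization -> probability heat map P
    p = c / c.sum()                         # p_x = c_x / sum_x' c_x'
    y, x = np.unravel_index(np.argmax(p), p.shape)
    return conf, p, (x, y)                  # predicted key point position

z = np.zeros((4, 4))
z[1, 2] = 5.0                               # toy logits with one strong peak
conf, p, xy = two_step_normalize(z)
print(xy)  # (2, 1)
```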
inputting the obtained skeleton point results in the human body bounding box into a spatial inverse transformation network, and restoring the image to the original position; the spatial inverse transformation is expressed mathematically as

$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$.
The single person bounding box in the result may contain redundant gestures, and the redundant gestures need to be eliminated by using parameterized gesture non-maximum suppression, which specifically includes:
first, the pose with the highest confidence is selected as the reference, and the elimination criterion, namely the pose distance metric $d(P_i, P_j \mid \Lambda)$, is applied to eliminate poses close to it; a threshold η is defined as the elimination criterion;
$P_i$ and $P_j$ are two poses; each pose consists of m skeleton points, and each skeleton point has a coordinate and a confidence;
Λ is a hyperparameter set comprising the three parameters σ1, σ2 and λ; σ1 and σ2 are used to calculate the key point similarity ($K_{Sim}$) and the spatial similarity ($H_{Sim}$), reflecting the weights and distance sensitivities of different key points; λ is the weight parameter used to balance the key point similarity and the spatial similarity, reflecting the relative importance of the two;
defining two auxiliary functions K Sim (P i ,P j |σ1) and H Sim (P i ,P j σ2), respectively representing the matching degree and the space distance of the skeleton points of the two gestures;
wherein

$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

where $B(k_i^n)$ is the box centered at position $k_i^n$, with 1/10 the size of the original image. The tanh filters out poses with low confidence; when the confidences of both poses are high, the output of the function approaches 1.
The spatial distance function is expressed as

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

The final distance function is then defined as $d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2)$; if $d(P_i, P_j \mid \Lambda) < \eta$, the pose $P_i$ should be eliminated, because $P_i$ is redundant with respect to the reference pose $P_j$;
the elimination process is repeated on the set of remaining poses until redundant poses are eliminated and only unique poses are reported.
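A runnable sketch of this elimination loop under stated assumptions: the box test in K_Sim uses a fixed box size rather than 1/10 of the image, all parameter values are arbitrary, and, since K_Sim + λH_Sim grows with pose similarity, this sketch drops a pose as redundant when the metric relative to the kept reference exceeds the threshold (an interpretive reading of the criterion):

```python
import numpy as np

def k_sim(p1, c1, p2, c2, sigma1=0.3, box=20.0):
    """Skeleton point matching degree: joints of pose 2 falling inside a box
    around the corresponding pose-1 joints contribute tanh(c/s1)*tanh(c/s1)."""
    inside = np.all(np.abs(p1 - p2) <= box / 2, axis=1)
    return float(np.sum(np.tanh(c1 / sigma1) * np.tanh(c2 / sigma1) * inside))

def h_sim(p1, p2, sigma2=10.0):
    """Spatial term: sum of exp(-||k1 - k2||^2 / sigma2) over matching joints."""
    return float(np.sum(np.exp(-np.sum((p1 - p2) ** 2, axis=1) / sigma2)))

def pose_nms(poses, confs, scores, lam=1.0, eta=5.0):
    """poses: (N, m, 2) joint coords; confs: (N, m); scores: (N,) pose confidences.
    Repeatedly keep the highest-confidence pose and drop poses too similar to it."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        ref = order.pop(0)
        keep.append(int(ref))
        order = [j for j in order
                 if k_sim(poses[ref], confs[ref], poses[j], confs[j])
                    + lam * h_sim(poses[ref], poses[j]) <= eta]
    return keep

# Toy example: pose 1 duplicates pose 0 with tiny jitter; pose 2 is far away.
m = 5
base = np.arange(m * 2, dtype=float).reshape(m, 2) * 10
poses = np.stack([base, base + 0.5, base + 100.0])
confs = np.full((3, m), 0.9)
scores = np.array([0.95, 0.90, 0.80])
print(pose_nms(poses, confs, scores))  # the jittered duplicate is eliminated
```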
In a third aspect, embodiments of the present application stitch skeletal points in each video frame into a vector as input to a trained LSTM neural network, where the computation logic includes:
initializing h0 and c0 to be all zero tensors, which represent the initial hidden state and the initial cell state of the LSTM layer, respectively;
invoking the lstm object, taking vectors x, h0 and c0 as inputs, resulting in an output out, which is an output sequence containing all time steps;
taking the hidden state of the last time step in out as the input of fc1 (the first fully connected layer), and then passing sequentially through ReLU, dropout, fc2, ReLU, dropout, fc3, ReLU, dropout, fc4, ReLU, dropout and the output layer to obtain the final output vector out, which contains the probabilities corresponding to all actions;
finally, the action class with the maximum probability is output using PyTorch's max function, which is the final action class.
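A minimal numpy sketch of the computation logic above: zero initial hidden and cell states, the last time step's hidden state fed to a classifier head. The dimensions (17 joints × 2 coordinates per frame) and the single fully connected layer standing in for the fc/ReLU/dropout stack are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, W, U, b, hidden):
    """Run a single-layer LSTM over x_seq of shape (T, input_dim).
    h0 and c0 start as all-zero vectors, as in the description above."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in x_seq:
        z = W @ x + U @ h + b                 # stacked gate pre-activations: i, f, g, o
        i = sigmoid(z[:hidden])               # input gate
        f = sigmoid(z[hidden:2 * hidden])     # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden]) # candidate cell state
        o = sigmoid(z[3 * hidden:])           # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h                                  # hidden state of the last time step

rng = np.random.default_rng(0)
T, d, hidden, actions = 8, 34, 16, 4          # 34 = 17 joints x 2 coords (assumed)
x_seq = rng.normal(size=(T, d))               # stand-in for stacked skeleton vectors
W = rng.normal(size=(4 * hidden, d)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
h_last = lstm_forward(x_seq, W, U, b, hidden)

fc = rng.normal(size=(actions, hidden)) * 0.1 # stand-in for the fc/ReLU/dropout head
logits = fc @ h_last
probs = np.exp(logits) / np.exp(logits).sum() # action probabilities
print(int(np.argmax(probs)))                  # class with the maximum probability
```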

Claims (10)

1. The multi-person scene behavior recognition method based on the skeleton key points is characterized by comprising the following steps of:
1) Acquiring the source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Obtaining a heat map of the human body boundary frame image after the space transformation by using a FastPose network;
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
2. The method for identifying multi-person scene behavior based on skeletal keypoints of claim 1, wherein the human body detector comprises a YOLO V3 network.
3. The method for identifying the behavior of the multi-person scene based on the skeletal key points according to claim 1, wherein the spatial transformation network aligns the human body in the human body boundary frame image to the central position, namely, performing 2D affine transformation on the human body boundary frame image.
4. The multi-person scene behavior recognition method based on skeleton key points according to claim 3, wherein the transformation formula of the 2D affine transformation is as follows:

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix}$$

wherein $\theta_1$, $\theta_2$, $\theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation; $\mathbb{R}$ denotes the set of real numbers.
5. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the FastPose network comprises a ResNet backbone network, a plurality of dense upsampling convolution modules and a convolution layer;
the ResNet backbone network extracts features from the spatially transformed human body bounding box image;
the dense upsampling convolution modules upsample the extracted features;
the convolution layer outputs the heat map of the spatially transformed human body bounding box image.
6. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the step of processing the heat map of the human body bounding box image using the two-step heat map normalization method comprises:
1) normalizing the heat map element by element to generate a confidence heat map C, the skeleton-point confidence being conf = max(C), where max(C) is the maximum value of the heat map;
2) performing global normalization to generate a probability heat map P for predicting the skeleton point coordinates, wherein the pixel probability $p_x$ for position x in the probability heat map is:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, the sum runs over all positions of the heat map, and $p_x$ represents the pixel probability at position x in the probability heat map.
7. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the spatial inverse transformation network restores the human body to the original position using an inverse transformation equation;
the inverse transformation equation is shown below:

$$\begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}$$

wherein the vector $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$ and the vector $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$.
8. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the step of eliminating redundant coordinates using the parametric pose non-maximum suppression method comprises:
1) calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \tag{4}$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, each skeleton point having a coordinate and a confidence; $\Lambda$ is the hyper-parameter set comprising the three pose distance metric parameters $\sigma_1$, $\sigma_2$ and $\lambda$; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton-point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
2) determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold $\eta$, and if so, deleting one of the poses $P_i$ and $P_j$.
9. The multi-person scene behavior recognition method based on skeleton key points according to claim 8, wherein the skeleton-point matching degree function and the spatial distance function are respectively as follows:

$$K_{Sim}(P_i,P_j\mid\sigma_1)=\begin{cases}\sum_{n}\tanh\dfrac{c_i^{\,n}}{\sigma_1}\cdot\tanh\dfrac{c_j^{\,n}}{\sigma_1}, & \text{if } k_j^{\,n} \text{ lies within } \mathcal{B}(k_i^{\,n})\\[4pt] 0, & \text{otherwise}\end{cases}$$

$$H_{Sim}(P_i,P_j\mid\sigma_2)=\sum_{n}\exp\!\left[-\frac{(k_i^{\,n}-k_j^{\,n})^2}{\sigma_2}\right]$$

wherein $k_i^{\,n}$ and $k_j^{\,n}$ are the nth skeleton points of poses $P_i$ and $P_j$; $\mathcal{B}(k_i^{\,n})$ is the box region centred at the skeleton point $k_i^{\,n}$; $c_i^{\,n}$ and $c_j^{\,n}$ are the corresponding confidence heat map values.
10. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the LSTM-based behavior recognition network comprises fully connected layers, ReLU layers, dropout layers and an output layer.
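The two-step heat map normalization (claim 6) and the parametric pose non-maximum suppression (claims 8 and 9) can be sketched together as follows. This is a minimal NumPy illustration under stated assumptions: the element-wise normalization is taken to be min-max scaling, the box test $\mathcal{B}(k_i^n)$ is approximated by a fixed radius, and `sigma1`, `sigma2`, `lam`, `eta` are placeholder values; the deletion rule follows the threshold direction stated in claim 8.

```python
import numpy as np

def two_step_normalize(heatmap):
    """Two-step heat map normalization (sketch): (1) element-wise min-max
    normalization to a confidence map C (assumed form), (2) global
    normalization to a probability map P whose entries sum to 1."""
    c = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    conf = c.max()        # skeleton-point confidence conf = max(C)
    p = c / c.sum()       # probability heat map P
    return c, conf, p

def pose_distance(ki, ci, kj, cj, sigma1=0.3, sigma2=1.0, lam=1.0, radius=1.0):
    """Pose distance d = K_sim + lambda * H_sim. ki, kj: (m, 2) keypoint
    coordinates; ci, cj: (m,) confidences. The box test is approximated
    by a fixed radius; all parameter values are assumptions."""
    diff = np.linalg.norm(ki - kj, axis=1)          # per-joint distance
    within = diff < radius                          # k_j^n inside box around k_i^n
    k_sim = np.sum(np.tanh(ci / sigma1) * np.tanh(cj / sigma1) * within)
    h_sim = np.sum(np.exp(-diff ** 2 / sigma2))
    return k_sim + lam * h_sim

def pose_nms(poses, confs, eta=2.0):
    """Greedy suppression following claim 8 as written: iterate poses in
    descending total confidence and drop a pose whose distance to an
    already-kept pose is smaller than the threshold eta."""
    order = np.argsort([-c.sum() for c in confs])
    kept = []
    for i in order:
        if all(pose_distance(poses[i], confs[i], poses[j], confs[j]) >= eta
               for j in kept):
            kept.append(i)
    return kept
```

A per-keypoint confidence map would be normalized independently in practice; the sketch treats a single map for brevity.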
CN202311711306.3A 2023-12-13 2023-12-13 Multi-person scene behavior recognition method based on skeleton key points Pending CN117894065A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311711306.3A CN117894065A (en) 2023-12-13 2023-12-13 Multi-person scene behavior recognition method based on skeleton key points

Publications (1)

Publication Number Publication Date
CN117894065A true CN117894065A (en) 2024-04-16

Family

ID=90649660




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination