CN117894065A - Multi-person scene behavior recognition method based on skeleton key points - Google Patents


Publication number
CN117894065A
CN117894065A (application CN202311711306.3A)
Authority
CN
China
Prior art keywords
human body
heat map
network
frame image
boundary frame
Prior art date
Legal status
Pending
Application number
CN202311711306.3A
Other languages
Chinese (zh)
Inventor
黎科宏
贺龙泽
王艺凡
陈雪峰
齐宏拓
冯亮
刘界鹏
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202311711306.3A
Publication of CN117894065A


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-person scene behavior recognition method based on skeleton key points, which comprises the following steps: 1) acquiring a source video, and detecting the human bodies in the source video by using a human body detector to acquire human body boundary frame images; 2) aligning the human body in the human body boundary frame image to the central position by using a spatial transformation network; 3) obtaining a heat map of the spatially transformed human body boundary frame image by using a FastPose network; 4) processing the heat map of the human body boundary frame image by using a two-step heat map normalization method; 5) restoring the human body to the original position by using a spatial inverse transformation network; 6) eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type. According to the invention, the video is input into the behavior recognition model to obtain heat maps, the coordinates of the human skeleton points are extracted from the heat maps, and the specific behavior type is then inferred from the relations and changes among the skeleton points.

Description

Multi-person scene behavior recognition method based on skeleton key points
Technical Field
The invention relates to the field of behavior recognition in computer vision, in particular to a multi-person scene behavior recognition method based on skeleton key points.
Background
With the rapid development of artificial intelligence technology, applications of multi-person behavior recognition have become increasingly mature, with broad prospects in fields such as human-computer interaction and intelligent construction sites.
Most existing methods are behavior recognition methods based on optical flow, which feed the video and optical flow information together into a behavior recognition network model. However, optical flow computation is generally time-consuming and complex, requiring efficient algorithms and hardware support. In addition, optical flow handles large displacements, occlusion, background clutter and the like poorly, which may cause false or missed detections.
Disclosure of Invention
The invention aims to provide a multi-person scene behavior recognition method based on skeleton key points, which comprises the following steps:
1) Acquiring a source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Obtaining a heat map of the human body boundary frame image after the space transformation by using a FastPose network;
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
Further, the human detector includes a YOLO V3 network.
Further, the aligning the human body in the human body boundary frame image to the center position by the space transformation network means that 2D affine transformation is performed on the human body boundary frame image.
Further, a transformation formula of the 2D affine transformation is as follows:
$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation; $\mathbb{R}$ denotes the set of real numbers.
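As an illustrative sketch only (not part of the patent), the 2×3 affine matrix $[\theta_1\ \theta_2\ \theta_3]$ can be applied to a coordinate as follows; the matrix values are arbitrary example numbers:

```python
import numpy as np

# 2x3 affine matrix [theta1 theta2 theta3]: each theta is a 2-vector column.
# Example values: identity linear part plus a translation of (10, 5).
theta = np.array([[1.0, 0.0, 10.0],
                  [0.0, 1.0,  5.0]])

def affine_transform(theta, xy):
    """Map a source coordinate (x_s, y_s) to (x_t, y_t) via theta @ [x_s, y_s, 1]^T."""
    x_s, y_s = xy
    return theta @ np.array([x_s, y_s, 1.0])

print(affine_transform(theta, (2.0, 3.0)))  # -> [12.  8.]
```

Here the linear part is the identity, so the transform reduces to a translation by (10, 5).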
Further, the FastPose network comprises a ResNet backbone network, a plurality of dense up-sampling convolution modules and a convolution layer;
the ResNet backbone network is used for extracting the characteristics of the human body boundary frame image after the space transformation;
the dense up-sampling convolution module is used for up-sampling the extracted features;
and outputting a heat map of the human body boundary frame image after spatial transformation by the convolution layer.
Further, the step of processing the heat map of the human body boundary box image by using the two-step heat map normalization method comprises the following steps:
4.1) Normalizing the heat map element by element to generate a confidence heat map C; the bone confidence is Conf = max(C), where max(C) is the maximum value of the heat map;
4.2 Performing global normalization to generate a probability heat map P to predict skeletal point location coordinates;
wherein $p_x$ is as follows:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, characterizing the probability that the key point appears at each pixel position; $p_x$, the pixel probability at position x in the probability heat map, is obtained by globally normalizing the confidence heat map.
Further, the spatial inverse transformation network restores the human body to the original position by using an inverse transformation equation;
the inverse transformation equation is shown below:
$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$; $(x_t, y_t)$ and $(x_s, y_s)$ respectively represent the coordinates before and after the inverse transformation.
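A minimal sketch of the round trip implied by the forward and inverse transformations, with an arbitrary example θ: γ is computed from θ as described above, and applying θ and then γ recovers the original point:

```python
import numpy as np

theta = np.array([[2.0, 0.0, 10.0],
                  [0.0, 2.0,  5.0]])   # example [theta1 theta2 theta3]

# [gamma1 gamma2] = [theta1 theta2]^{-1};  gamma3 = -[gamma1 gamma2] @ theta3
lin = theta[:, :2]
g12 = np.linalg.inv(lin)
g3 = -g12 @ theta[:, 2]
gamma = np.hstack([g12, g3[:, None]])   # the 2x3 inverse-transform matrix

def apply(mat, xy):
    """Apply a 2x3 affine matrix to a 2D point in homogeneous form."""
    return mat @ np.array([xy[0], xy[1], 1.0])

pt = np.array([3.0, 4.0])
fwd = apply(theta, pt)    # forward spatial transform
back = apply(gamma, fwd)  # inverse transform recovers the original point
print(back)
```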
Further, the step of eliminating redundant coordinates using the parameterized pose non-maximal suppression method includes:
6.1) Calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \quad (4)$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, and each skeleton point having a coordinate and a confidence; Λ is a hyperparameter set comprising the three pose distance metric parameters σ1, σ2 and λ; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
6.2) Determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold η; if so, deleting one of the poses $P_i$ and $P_j$.
Further, the bone point matching degree function and the spatial distance function are respectively as follows:
$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

wherein $k_i^n$ and $k_j^n$ are the nth skeleton key points of poses $P_i$ and $P_j$; $B(k_i^n)$ is the box region centered at $k_i^n$; $c_i^n$ and $c_j^n$ are the corresponding confidence heat map values.
Further, the LSTM neural network comprises a full connection layer, a relu layer, a dropout layer and an output layer.
The technical effects of the invention are as follows: the video is input into the behavior recognition model to obtain heat maps, the coordinates of the human skeleton points are extracted from the heat maps, and the specific behavior type is then inferred from the relations and changes among the skeleton points.
The invention is not disturbed by factors such as illumination changes, background complexity and differences in human appearance, and can effectively eliminate noise.
Drawings
FIG. 1 is a flow diagram of an overall behavior recognition network;
FIG. 2 is a schematic diagram of a backbone network during a gesture detection phase;
fig. 3 is a schematic diagram of a behavior recognition structure.
Detailed Description
The present invention is further described below with reference to examples, but it should not be construed that the scope of the above subject matter of the present invention is limited to the following examples. Various substitutions and alterations are made according to the ordinary skill and familiar means of the art without departing from the technical spirit of the invention, and all such substitutions and alterations are intended to be included in the scope of the invention.
Example 1:
referring to fig. 1 to 3, a multi-person scene behavior recognition method based on skeletal keypoints includes the following steps:
1) Acquiring a source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Obtaining a heat map of the human body boundary frame image after the space transformation by using a FastPose network;
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
The human detector includes a YOLO V3 network.
The space transformation network aligns the human body in the human body boundary frame image to the central position, namely, performing 2D affine transformation on the human body boundary frame image.
The transformation formula of the 2D affine transformation is as follows:
$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation; $\mathbb{R}$ denotes the set of real numbers.
The FastPose network comprises a ResNet backbone network, a plurality of dense up-sampling convolution modules and a convolution layer;
the ResNet backbone network is used for extracting the characteristics of the human body boundary frame image after the space transformation;
the dense up-sampling convolution module is used for up-sampling the extracted features;
and outputting a heat map of the human body boundary frame image after spatial transformation by the convolution layer.
The step of processing the heat map of the human body boundary frame image by using the two-step heat map normalization method comprises the following steps:
4.1) Normalizing the heat map element by element to generate a confidence heat map C; the bone confidence is Conf = max(C), where max(C) is the maximum value of the heat map;
4.2 Performing global normalization to generate a probability heat map P to predict skeletal point location coordinates;
wherein $p_x$ is as follows:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, characterizing the probability that the key point appears at each pixel position; $p_x$, the pixel probability at position x in the probability heat map, is obtained by globally normalizing the confidence heat map.
The space inverse transformation network restores the human body to the original position by utilizing an inverse transformation equation;
the inverse transformation equation is shown below:
$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$; $(x_t, y_t)$ and $(x_s, y_s)$ respectively represent the coordinates before and after the inverse transformation.
The step of eliminating redundant coordinates by using the parameterized pose non-maximum suppression method comprises the following steps:
6.1) Calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \quad (4)$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, and each skeleton point having a coordinate and a confidence; Λ is a hyperparameter set comprising the three pose distance metric parameters σ1, σ2 and λ; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
6.2) Determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold η; if so, deleting one of the poses $P_i$ and $P_j$.
The bone point matching degree function and the spatial distance function are respectively as follows:
$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

wherein $k_i^n$ and $k_j^n$ are the nth skeleton key points of poses $P_i$ and $P_j$; $B(k_i^n)$ is the box region centered at $k_i^n$; $c_i^n$ and $c_j^n$ are the corresponding confidence heat map values.
The LSTM neural network comprises a full connection layer, a relu layer, a dropout layer and an output layer.
Example 2:
a multi-person scene behavior recognition method based on skeleton key points comprises the following steps:
1) Acquiring a source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Acquiring a heat map of the human body boundary frame image after the space transformation by using a FastPose network (human body key point detection network);
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
Example 3:
the technical content of the multi-person scene behavior recognition method based on the skeleton key points is the same as that of the embodiment 2, and further, the human body detector comprises a YOLO V3 network.
Example 4:
the multi-person scene behavior recognition method based on skeleton key points includes the technical content as in any one of embodiments 2-3, and further, the space transformation network aligns a human body in a human body boundary frame image to a central position, namely, performing 2D affine transformation on the human body boundary frame image.
Example 5:
the technical content of the multi-person scene behavior recognition method based on skeleton key points is as in any one of embodiments 2-4, and further, a transformation formula of the 2D affine transformation is as follows:
$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation.
Example 6:
the multi-person scene behavior recognition method based on skeleton key points has the technical content as in any one of embodiments 2-5, and further, the FastPose network comprises a ResNet backbone network, a plurality of dense up-sampling convolution modules and a convolution layer;
the ResNet backbone network is used for extracting the characteristics of the human body boundary frame image after the space transformation;
the dense up-sampling convolution module is used for up-sampling the extracted features;
and outputting a heat map of the human body boundary frame image after spatial transformation by the convolution layer.
Example 7:
the multi-person scene behavior recognition method based on skeleton key points has the technical content as in any one of embodiments 2-6, and further, the step of processing the heat map of the human body boundary frame image by using the two-step heat map normalization method comprises the following steps:
4.1) Normalizing the heat map element by element to generate a confidence heat map C; the bone confidence is Conf = max(C), where max(C) is the maximum value of the heat map;
4.2 Performing global normalization to generate a probability heat map P to predict skeletal point location coordinates;
wherein $p_x$ is as follows:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, characterizing the probability that the key point appears at each pixel position; $p_x$, the pixel probability at position x in the probability heat map, is obtained by globally normalizing the confidence heat map.
Example 8:
The technical content of the multi-person scene behavior recognition method based on skeleton key points is the same as in any one of embodiments 2-7; further, the spatial inverse transformation network restores the human body to the original position by using an inverse transformation equation;
the inverse transformation equation is shown below:
$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$; $(x_t, y_t)$ and $(x_s, y_s)$ respectively represent the coordinates before and after the inverse transformation.
Example 9:
The technical content of the multi-person scene behavior recognition method based on skeleton key points is the same as in any one of embodiments 2-8; further, the step of eliminating redundant coordinates by using the parameterized pose non-maximum suppression method comprises:
6.1) Calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \quad (4)$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, and each skeleton point having a coordinate and a confidence; Λ is a hyperparameter set comprising the three parameters σ1, σ2 and λ; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
6.2) Determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold η; if so, deleting one of the poses $P_i$ and $P_j$.
Example 10:
the technical content of the multi-person scene behavior recognition method based on skeleton key points is as in any one of embodiments 2-9, and further, a skeleton point matching degree function and a space distance function are respectively as follows:
$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

wherein $k_i^n$ and $k_j^n$ are the nth skeleton key points of poses $P_i$ and $P_j$.
Example 11:
The technical content of the multi-person scene behavior recognition method based on skeleton key points is the same as in any one of embodiments 2-10; further, the LSTM neural network comprises a fully connected layer, a ReLU layer, a dropout layer and an output layer.
Example 12:
A behavior recognition method suitable for complex multi-person scenes is characterized in that the multi-person scene behavior recognition method based on skeleton key points first obtains the video source to be recognized; a human body detector such as YOLOV3 then detects the persons in each video frame; the detection results of the human body detector are corrected if needed; a pose estimator then generates skeleton points for the persons in the frame and labels their identities; post-processing follows to obtain more accurate human skeleton points; and finally, a trained LSTM neural network performs the final behavior recognition.
The specific content comprises the following aspects:
in a first aspect, embodiments of the present application entail first identifying a human body part in a video, the method comprising:
inputting a video to be identified, wherein the video comprises a plurality of human body targets to be detected;
detecting all human bodies in the video by using a human body detector such as YOLOV3 or EfficientDet to obtain a plurality of bounding boxes containing a single person;
the human body boundary box obtained in the previous step may have a certain error, so that affine transformation is performed on the input image by using a spatial transformation network, thereby aligning the human body to the central position.
In a second aspect, embodiments of the present application obtain skeletal points of a human body in a human body bounding box, the method comprising:
inputting the obtained human body region into a single person posture estimator, including a new network FastPose, and then outputting human skeleton points of the region, wherein the specific process comprises the following steps:
using ResNet as a backbone network of FastPose to extract features of the input cropped image;
upsampling the extracted features using three dense upsampling convolution (Dense Upsampling Convolution, DUC) modules;
the DUC module first performs a 2D convolution on the feature map of size h × w × c, and then transforms the feature map into size 2h × 2w × c′ through a PixelShuffle operation;
generating a heat map through the 1X 1 convolution layer;
according to the obtained heat map, predicting the position coordinates of the bone points by using a two-step heat map normalization method, which specifically comprises the following steps:
in a first step, element-by-element normalization is performed to generate a confidence heat map C: $c_x = \mathrm{sigmoid}(z_x)$, wherein $z_x$ represents the unnormalized logit value at position x, and $c_x$ represents the confidence heat map value at position x;
the bone confidence is then represented by the maximum of the heat map: conf=max (C);
in a second step, global normalization is performed to generate a probability heat map P: $p_x = c_x / \sum_{x'} c_{x'}$; the skeleton point position coordinates are then predicted according to $p_x$;
inputting the obtained result of the bone points in the human body boundary box into a spatial inverse transformation network, and recovering the image to the original position;
the single person bounding box in the result may contain redundant gestures, and the redundant gestures need to be eliminated by using parameterized gesture non-maximum suppression, which specifically includes:
first, the pose with the highest confidence is selected as the reference, and the elimination criterion, namely the pose distance metric $d(P_i, P_j \mid \Lambda)$, is applied to eliminate poses close to it; a threshold η is defined as the elimination criterion;
$P_i$ and $P_j$ are two poses; each pose consists of m skeleton points, and each skeleton point has a coordinate and a confidence;
Λ is a hyperparameter set comprising the three parameters σ1, σ2 and λ, used to adjust how the pose distance is computed;
two auxiliary functions $K_{Sim}(P_i, P_j \mid \sigma_1)$ and $H_{Sim}(P_i, P_j \mid \sigma_2)$ are defined, respectively representing the skeleton point matching degree and the spatial distance of the two poses, wherein

$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

Thus, $d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2)$;
The elimination process is repeated on the set of remaining poses until redundant poses are eliminated and only unique poses are reported.
In a third aspect, in the embodiment of the present application, skeleton points in each video frame are spliced into a vector, and the vector is used as an input of a trained LSTM neural network, the output of the LSTM neural network is the probability of each action, and the maximum probability class is selected as the output, so as to obtain the final action class.
Example 13:
a behavior recognition method suitable for complex multi-person scenes comprises the following steps:
in a first aspect, embodiments of the present application entail first identifying a human body part in a video, the method comprising:
inputting a video to be identified, wherein the video comprises a plurality of human body targets to be detected;
detecting all human bodies in the video by using an off-the-shelf human body detector such as YOLOV3 or EfficientDet to obtain a plurality of bounding boxes containing a single person;
the human body bounding box obtained in the previous step may have a certain error, so affine transformation is performed on the input image by using a custom spatial transformation network, thereby aligning the human body to the central position. The spatial transformation network performs a 2D affine transformation, expressed mathematically as

$$\begin{bmatrix} x_t \\ y_t \end{bmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{bmatrix} x_s \\ y_s \\ 1 \end{bmatrix}$$

wherein $\theta_1, \theta_2, \theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation.
In a second aspect, embodiments of the present application obtain skeletal points of a human body in a human body bounding box, the method comprising:
inputting the obtained human body region into a single person posture estimator, including a new network FastPose, as shown in FIG. 2, and then outputting the human skeleton points of the region, wherein the specific process comprises:
using ResNet as a backbone network of FastPose to extract features of the input cropped image;
upsampling the extracted features using three dense upsampling convolution (Dense Upsampling Convolution, DUC) modules;
the DUC module first performs a 2D convolution on the feature map of size h × w × c to obtain a feature map of size h × w × 4c′, and then transforms it into a feature map of size 2h × 2w × c′ through a PixelShuffle operation;
continuously performing the DUC operation for 3 times, and generating a heat map through the 1X 1 convolution layer;
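A rough numpy sketch of the size bookkeeping of one DUC step (the 2D convolution is elided, leaving only the PixelShuffle rearrangement; the (r, r, c′) channel layout used here is an assumption for illustration, as deep learning frameworks order these channels differently):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange an (H, W, C*r*r) feature map into (H*r, W*r, C).

    Each group of r*r channels is scattered into an r x r spatial block,
    doubling height and width when r = 2."""
    h, w, crr = x.shape
    c = crr // (r * r)
    x = x.reshape(h, w, r, r, c)         # split channels into (r, r, c) blocks
    x = x.transpose(0, 2, 1, 3, 4)       # interleave the r-blocks spatially
    return x.reshape(h * r, w * r, c)

# One DUC step on an 8x6 map with 4c' = 64 channels: (8, 6, 64) -> (16, 12, 16)
feat = np.random.rand(8, 6, 64)
out = pixel_shuffle(feat)
print(out.shape)  # (16, 12, 16)
```

Three such steps upsample the backbone features by 8× before the final 1 × 1 convolution produces the heat map.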
according to the obtained heat map, predicting the position coordinates of the bone points by using a two-step heat map normalization method:
in a first step, element-by-element normalization is performed to generate a confidence heat map C: $c_x = \mathrm{sigmoid}(z_x)$, wherein $z_x$ represents the unnormalized logit value at position x, and $c_x$ represents the confidence heat map value at position x;
the bone confidence may then be represented by the maximum of the heat map: conf=max (C);
in a second step, global normalization is performed to generate a probability heat map P: $p_x = c_x / \sum_{x'} c_{x'}$; the skeleton point position coordinates are then predicted according to $p_x$;
the joint confidence is obtained from the first step, and the joint position is obtained from the heat map generated in the second step;
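The two normalization steps above can be sketched as follows (a toy heat map rather than the network's real output; the (x, y) coordinate convention is an assumption):

```python
import numpy as np

def two_step_normalize(z):
    """z: raw heat map logits of shape (H, W) for one key point."""
    # Step 1: element-wise normalization -> confidence heat map C
    c = 1.0 / (1.0 + np.exp(-z))           # c_x = sigmoid(z_x)
    conf = c.max()                          # bone confidence Conf = max(C)
    # Step 2: global normalization -> probability heat map P
    p = c / c.sum()                         # p_x = c_x / sum_x' c_x'
    y, x = np.unravel_index(np.argmax(p), p.shape)
    return conf, p, (x, y)                  # predicted key point position

z = np.zeros((4, 4))
z[1, 2] = 5.0                               # toy logits with one strong peak
conf, p, xy = two_step_normalize(z)
print(xy)  # (2, 1)
```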
inputting the obtained skeleton point results in the human body bounding box into a spatial inverse transformation network, and restoring the image to the original position; the spatial inverse transformation is expressed mathematically as

$$\begin{bmatrix} x_s \\ y_s \end{bmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{bmatrix} x_t \\ y_t \\ 1 \end{bmatrix}$$

wherein $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$, $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$.
The single person bounding box in the result may contain redundant gestures, and the redundant gestures need to be eliminated by using parameterized gesture non-maximum suppression, which specifically includes:
first, the pose with the highest confidence is selected as the reference, and the elimination criterion, namely the pose distance metric $d(P_i, P_j \mid \Lambda)$, is applied to eliminate poses close to it; a threshold η is defined as the elimination criterion;
$P_i$ and $P_j$ are two poses; each pose consists of m skeleton points, and each skeleton point has a coordinate and a confidence;
Λ is a hyperparameter set comprising the three parameters σ1, σ2 and λ; σ1 and σ2 are used to calculate the key point similarity ($K_{Sim}$) and the spatial similarity ($H_{Sim}$), reflecting the weights and distance sensitivities of different key points; λ is the weight parameter used to balance the key point similarity and the spatial similarity, reflecting the relative importance of the two;
defining two auxiliary functions K Sim (P i ,P j |σ1) and H Sim (P i ,P j σ2), respectively representing the matching degree and the space distance of the skeleton points of the two gestures;
wherein

$$K_{Sim}(P_i, P_j \mid \sigma_1) = \begin{cases} \sum_n \tanh\dfrac{c_i^n}{\sigma_1} \cdot \tanh\dfrac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ lies within } B(k_i^n) \\ 0, & \text{otherwise} \end{cases}$$

where $B(k_i^n)$ is the box centered at position $k_i^n$, with 1/10 the size of the original image. The tanh filters out poses with low confidence; when the confidences of both poses are high, the output of the function approaches 1.
The spatial distance function is expressed as

$$H_{Sim}(P_i, P_j \mid \sigma_2) = \sum_n \exp\left[ -\frac{\lVert k_i^n - k_j^n \rVert^2}{\sigma_2} \right]$$

The final distance function is then defined as $d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2)$; if $d(P_i, P_j \mid \Lambda) < \eta$, the pose $P_i$ should be eliminated, because $P_i$ is redundant with respect to the reference pose $P_j$;
the elimination process is repeated on the set of remaining poses until redundant poses are eliminated and only unique poses are reported.
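A runnable sketch of this elimination loop under stated assumptions: the box test in K_Sim uses a fixed box size rather than 1/10 of the image, all parameter values are arbitrary, and, since K_Sim + λH_Sim grows with pose similarity, this sketch drops a pose as redundant when the metric relative to the kept reference exceeds the threshold (an interpretive reading of the criterion):

```python
import numpy as np

def k_sim(p1, c1, p2, c2, sigma1=0.3, box=20.0):
    """Skeleton point matching degree: joints of pose 2 falling inside a box
    around the corresponding pose-1 joints contribute tanh(c/s1)*tanh(c/s1)."""
    inside = np.all(np.abs(p1 - p2) <= box / 2, axis=1)
    return float(np.sum(np.tanh(c1 / sigma1) * np.tanh(c2 / sigma1) * inside))

def h_sim(p1, p2, sigma2=10.0):
    """Spatial term: sum of exp(-||k1 - k2||^2 / sigma2) over matching joints."""
    return float(np.sum(np.exp(-np.sum((p1 - p2) ** 2, axis=1) / sigma2)))

def pose_nms(poses, confs, scores, lam=1.0, eta=5.0):
    """poses: (N, m, 2) joint coords; confs: (N, m); scores: (N,) pose confidences.
    Repeatedly keep the highest-confidence pose and drop poses too similar to it."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        ref = order.pop(0)
        keep.append(int(ref))
        order = [j for j in order
                 if k_sim(poses[ref], confs[ref], poses[j], confs[j])
                    + lam * h_sim(poses[ref], poses[j]) <= eta]
    return keep

# Toy example: pose 1 duplicates pose 0 with tiny jitter; pose 2 is far away.
m = 5
base = np.arange(m * 2, dtype=float).reshape(m, 2) * 10
poses = np.stack([base, base + 0.5, base + 100.0])
confs = np.full((3, m), 0.9)
scores = np.array([0.95, 0.90, 0.80])
print(pose_nms(poses, confs, scores))  # the jittered duplicate is eliminated
```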
In a third aspect, embodiments of the present application stitch skeletal points in each video frame into a vector as input to a trained LSTM neural network, where the computation logic includes:
initializing h0 and c0 to be all zero tensors, which represent the initial hidden state and the initial cell state of the LSTM layer, respectively;
invoking the lstm object, taking vectors x, h0 and c0 as inputs, resulting in an output out, which is an output sequence containing all time steps;
taking the hidden state of the last time step in out as the input of fc1 (the first fully connected layer), and then passing sequentially through ReLU, dropout, fc2, ReLU, dropout, fc3, ReLU, dropout, fc4, ReLU, dropout and the output layer to obtain the final output vector out, which contains the probabilities corresponding to all actions;
finally, the action class with the maximum probability is output using PyTorch's max function, which is the final action class.
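A minimal numpy sketch of the computation logic above: zero initial hidden and cell states, the last time step's hidden state fed to a classifier head. The dimensions (17 joints × 2 coordinates per frame) and the single fully connected layer standing in for the fc/ReLU/dropout stack are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x_seq, W, U, b, hidden):
    """Run a single-layer LSTM over x_seq of shape (T, input_dim).
    h0 and c0 start as all-zero vectors, as in the description above."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in x_seq:
        z = W @ x + U @ h + b                 # stacked gate pre-activations: i, f, g, o
        i = sigmoid(z[:hidden])               # input gate
        f = sigmoid(z[hidden:2 * hidden])     # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden]) # candidate cell state
        o = sigmoid(z[3 * hidden:])           # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h                                  # hidden state of the last time step

rng = np.random.default_rng(0)
T, d, hidden, actions = 8, 34, 16, 4          # 34 = 17 joints x 2 coords (assumed)
x_seq = rng.normal(size=(T, d))               # stand-in for stacked skeleton vectors
W = rng.normal(size=(4 * hidden, d)) * 0.1
U = rng.normal(size=(4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)
h_last = lstm_forward(x_seq, W, U, b, hidden)

fc = rng.normal(size=(actions, hidden)) * 0.1 # stand-in for the fc/ReLU/dropout head
logits = fc @ h_last
probs = np.exp(logits) / np.exp(logits).sum() # action probabilities
print(int(np.argmax(probs)))                  # class with the maximum probability
```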

Claims (10)

1. The multi-person scene behavior recognition method based on the skeleton key points is characterized by comprising the following steps of:
1) Acquiring the source video, detecting a human body in the source video by using a human body detector, and acquiring a human body boundary frame image;
2) Aligning a human body in the human body boundary frame image to a central position by utilizing a space transformation network;
3) Obtaining a heat map of the human body boundary frame image after the space transformation by using a FastPose network;
4) Processing the heat map of the human body boundary frame image by using a two-step heat map normalization method to obtain skeleton key point coordinate information;
5) Restoring the human body to an original position by using a spatial inverse transformation network;
6) Eliminating redundant coordinates by using a parameterized pose non-maximum suppression method, and inputting the remaining coordinate information into an LSTM-based behavior recognition network to obtain the human body action type.
2. The method for identifying multi-person scene behavior based on skeletal keypoints of claim 1, wherein the human body detector comprises a YOLO V3 network.
3. The method for identifying the behavior of the multi-person scene based on the skeletal key points according to claim 1, wherein the spatial transformation network aligns the human body in the human body boundary frame image to the central position, namely, performing 2D affine transformation on the human body boundary frame image.
4. The multi-person scene behavior recognition method based on skeleton key points according to claim 3, wherein the transformation formula of the 2D affine transformation is as follows:

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{bmatrix} \theta_1 & \theta_2 & \theta_3 \end{bmatrix} \begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix}$$

wherein $\theta_1$, $\theta_2$, $\theta_3$ are vectors in $\mathbb{R}^2$; $(x_s, y_s)$ and $(x_t, y_t)$ respectively represent the coordinates before and after the transformation; $\mathbb{R}$ denotes the set of real numbers.
5. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the FastPose network comprises a ResNet backbone network, a plurality of dense upsampling convolution modules and a convolution layer;
the ResNet backbone network extracts features from the spatially transformed human body bounding box image;
the dense upsampling convolution modules upsample the extracted features;
the convolution layer outputs the heat map of the spatially transformed human body bounding box image.
6. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the step of processing the heat map of the human body bounding box image using the two-step heat map normalization method comprises:
1) normalizing the heat map element by element to generate a confidence heat map C, the skeleton-point confidence being conf = max(C), where max(C) is the maximum value of the heat map;
2) performing global normalization to generate a probability heat map P for predicting the skeleton point coordinates, wherein the pixel probability $p_x$ for position x in the probability heat map is:

$$p_x = \frac{c_x}{\sum_{x'} c_{x'}}$$

wherein $c_x$ is the confidence heat map value at position x, the sum runs over all positions of the heat map, and $p_x$ represents the pixel probability at position x in the probability heat map.
7. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the spatial inverse transformation network restores the human body to the original position using an inverse transformation equation;
the inverse transformation equation is shown below:

$$\begin{pmatrix} x_s \\ y_s \end{pmatrix} = \begin{bmatrix} \gamma_1 & \gamma_2 & \gamma_3 \end{bmatrix} \begin{pmatrix} x_t \\ y_t \\ 1 \end{pmatrix}$$

wherein the vector $[\gamma_1\ \gamma_2] = [\theta_1\ \theta_2]^{-1}$ and the vector $\gamma_3 = -1 \times [\gamma_1\ \gamma_2]\,\theta_3$.
8. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the step of eliminating redundant coordinates using the parametric pose non-maximum suppression method comprises:
1) calculating the pose distance metric $d(P_i, P_j \mid \Lambda)$, i.e.:

$$d(P_i, P_j \mid \Lambda) = K_{Sim}(P_i, P_j \mid \sigma_1) + \lambda H_{Sim}(P_i, P_j \mid \sigma_2) \tag{4}$$

wherein $P_i$ and $P_j$ are two poses, each pose consisting of m skeleton points, each skeleton point having a coordinate and a confidence; $\Lambda$ is the hyper-parameter set comprising the three pose distance metric parameters $\sigma_1$, $\sigma_2$ and $\lambda$; $K_{Sim}(P_i, P_j \mid \sigma_1)$ is the skeleton-point matching degree function; $H_{Sim}(P_i, P_j \mid \sigma_2)$ is the spatial distance function;
2) determining whether the pose distance metric $d(P_i, P_j \mid \Lambda)$ is smaller than the threshold $\eta$, and if so, deleting one of the poses $P_i$ and $P_j$.
9. The multi-person scene behavior recognition method based on skeleton key points according to claim 8, wherein the skeleton-point matching degree function and the spatial distance function are respectively as follows:

$$K_{Sim}(P_i,P_j\mid\sigma_1)=\begin{cases}\sum_{n}\tanh\dfrac{c_i^{\,n}}{\sigma_1}\cdot\tanh\dfrac{c_j^{\,n}}{\sigma_1}, & \text{if } k_j^{\,n} \text{ lies within } \mathcal{B}(k_i^{\,n})\\[4pt] 0, & \text{otherwise}\end{cases}$$

$$H_{Sim}(P_i,P_j\mid\sigma_2)=\sum_{n}\exp\!\left[-\frac{(k_i^{\,n}-k_j^{\,n})^2}{\sigma_2}\right]$$

wherein $k_i^{\,n}$ and $k_j^{\,n}$ are the nth skeleton points of poses $P_i$ and $P_j$; $\mathcal{B}(k_i^{\,n})$ is the box region centred at the skeleton point $k_i^{\,n}$; $c_i^{\,n}$ and $c_j^{\,n}$ are the corresponding confidence heat map values.
10. The multi-person scene behavior recognition method based on skeleton key points according to claim 1, wherein the LSTM-based behavior recognition network comprises fully connected layers, ReLU layers, dropout layers and an output layer.
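The two-step heat map normalization (claim 6) and the parametric pose non-maximum suppression (claims 8 and 9) can be sketched together as follows. This is a minimal NumPy illustration under stated assumptions: the element-wise normalization is taken to be min-max scaling, the box test $\mathcal{B}(k_i^n)$ is approximated by a fixed radius, and `sigma1`, `sigma2`, `lam`, `eta` are placeholder values; the deletion rule follows the threshold direction stated in claim 8.

```python
import numpy as np

def two_step_normalize(heatmap):
    """Two-step heat map normalization (sketch): (1) element-wise min-max
    normalization to a confidence map C (assumed form), (2) global
    normalization to a probability map P whose entries sum to 1."""
    c = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    conf = c.max()        # skeleton-point confidence conf = max(C)
    p = c / c.sum()       # probability heat map P
    return c, conf, p

def pose_distance(ki, ci, kj, cj, sigma1=0.3, sigma2=1.0, lam=1.0, radius=1.0):
    """Pose distance d = K_sim + lambda * H_sim. ki, kj: (m, 2) keypoint
    coordinates; ci, cj: (m,) confidences. The box test is approximated
    by a fixed radius; all parameter values are assumptions."""
    diff = np.linalg.norm(ki - kj, axis=1)          # per-joint distance
    within = diff < radius                          # k_j^n inside box around k_i^n
    k_sim = np.sum(np.tanh(ci / sigma1) * np.tanh(cj / sigma1) * within)
    h_sim = np.sum(np.exp(-diff ** 2 / sigma2))
    return k_sim + lam * h_sim

def pose_nms(poses, confs, eta=2.0):
    """Greedy suppression following claim 8 as written: iterate poses in
    descending total confidence and drop a pose whose distance to an
    already-kept pose is smaller than the threshold eta."""
    order = np.argsort([-c.sum() for c in confs])
    kept = []
    for i in order:
        if all(pose_distance(poses[i], confs[i], poses[j], confs[j]) >= eta
               for j in kept):
            kept.append(i)
    return kept
```

A per-keypoint confidence map would be normalized independently in practice; the sketch treats a single map for brevity.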
CN202311711306.3A 2023-12-13 2023-12-13 Multi-person scene behavior recognition method based on skeleton key points Pending CN117894065A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311711306.3A CN117894065A (en) 2023-12-13 2023-12-13 Multi-person scene behavior recognition method based on skeleton key points

Publications (1)

Publication Number Publication Date
CN117894065A true CN117894065A (en) 2024-04-16

Family

ID=90649660




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination