CN115050055B

CN115050055B - Human skeleton sequence construction method based on Kalman filtering

Info

Publication number: CN115050055B
Application number: CN202210788077.4A
Authority: CN
Inventors: 彭倍; 刁宏健; 邵继业; 杨文章
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-07-06
Filing date: 2022-07-06
Publication date: 2024-04-30
Anticipated expiration: 2042-07-06
Also published as: CN115050055A

Abstract

The invention discloses a human skeleton sequence construction method based on Kalman filtering, comprising the following steps: S1, performing pose estimation and normalizing the joint point features; S2, numbering the skeletons in the first video frame whose features are not all invalid values, and adding them to a processing queue; S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into a Kalman filtering module for constructing human skeleton sequences, and updating the processing queue according to the processing result; S4, repeating S3 for each video frame until all frames are processed, at which point each numbered skeleton sequence in the processing queue is a constructed human skeleton sequence. Compared with conventional Kalman filtering, the invention adds a newly defined decision module that further processes the observations produced by the pose estimation method, giving the Kalman filtering module the ability to track human motion; feature extraction errors caused by false detections and missed detections in the pose estimation method are corrected by the Kalman filtering algorithm.

Description

Human skeleton sequence construction method based on Kalman filtering

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a human skeleton sequence construction method based on Kalman filtering.

Background

Video-based behavior recognition is one of the representative tasks in video understanding; the task of recognizing human actions in video is called video behavior recognition. Current deep-learning-based video behavior recognition methods include two-stream networks, three-dimensional convolutional neural networks (3D CNNs), and other non-end-to-end methods.

The introduction of spatial temporal graph convolutional networks (STGCN) and P-CNN (Pose-based CNN) provided a new non-end-to-end approach to behavior recognition. These methods extract joint points frame by frame with an advanced pose estimation method, group the joint points into skeletons that serve as the network input, and then extract and fuse joint features for video behavior recognition.

Because pose information is closely related to human behavior, these methods work well for behavior recognition that does not depend on background information. However, mainstream human pose estimation methods such as OpenPose do not associate skeletons across frames, whereas these recognition methods analyze the behavior of one and the same person. They depend heavily on the joint point information extracted by the pose estimation method and are very sensitive to the false and missed detections that disturbances in a video frame cause. Moreover, when a video contains several people at once, the pose estimation method numbers the skeletons arbitrarily, which leads to incorrect behavior recognition results.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art by providing a human skeleton sequence construction method based on Kalman filtering. A Kalman filtering module predicts the joint features, makes an observation decision on the output of the pose estimation method for the current frame, and uses the decision result to update a processing queue, thereby obtaining the human skeleton sequences. This corrects the feature extraction errors caused by false detections and missed detections in the pose estimation method.

The object of the invention is achieved by the following technical solution: a human skeleton sequence construction method based on Kalman filtering, comprising the following steps:

S1, performing pose estimation frame by frame on a video containing human behavior information, and normalizing all joint point features to obtain a skeleton set carrying joint point feature information;

S2, numbering the skeletons in the first video frame whose features are not all invalid values, and adding them to a processing queue;

S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into a Kalman filtering module for constructing human skeleton sequences, and updating the processing queue frame by frame according to the processing result;

S4, repeating S3 for each video frame until all frames are processed, at which point each numbered skeleton sequence in the processing queue is a constructed human skeleton sequence.

Further, in step S3, the Kalman filtering module for constructing human skeleton sequences comprises a forward prediction module, a decision module and an update module.

The prediction module is the prediction part of the Kalman filtering algorithm. Its input is all numbered skeletons in the processing queue, and its output is:

$$\hat{C}^{-}_{t,v,m} = A\,\hat{C}_{t-1,v,m}$$

$$P^{-}_{t,v,m} = A\,P_{t-1,v,m}\,A^{T} + Q$$

where $\hat{C}^{-}_{t,v,m}$ is the prior state estimate at time t, predicted by the motion model from the posterior state estimate $\hat{C}_{t-1,v,m}$ at time t-1; T is the total number of frames; A is the state transition matrix; $P^{-}_{t,v,m}$ is the prior state estimate covariance at time t, calculated from the posterior state estimate covariance $P_{t-1,v,m}$ at time t-1; Q is the process covariance matrix; and the superscript T denotes transposition.

The posterior state estimate $\hat{C}_{t,v,m}$ is defined as:

$$\hat{C}_{t,v,m} = [x,\ y,\ v_x,\ v_y]^{T}$$

When t = 0, x and y are defined as $Z_{x,t,v,n}$ and $Z_{y,t,v,n}$, the normalized joint point features of joint v of the skeleton with sequence number n in frame t; n is the skeleton sequence number, and n equals m at initialization; $v_x$ and $v_y$ are defined as 0. When t ≠ 0, $\hat{C}_{t,v,m}$ is the iteration result of the Kalman filtering module and represents the posterior state estimate of joint v of skeleton number m in frame t.

When t = 0, $P_{t,v,m}$ is defined as the identity matrix; when t ≠ 0, $P_{t,v,m}$ is the iteration result of the Kalman filtering module and represents the posterior state estimate covariance of joint v of skeleton number m in frame t.

The state transition matrix A is defined as:

$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where dt is the video frame time interval.

The process covariance matrix Q is defined in terms of two parameters: the variance of the human joint coordinates and the variance of the coordinate movement velocity.
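The prediction step described above can be sketched in Python with NumPy. Several details here are assumptions made for illustration: the state vector is ordered [x, y, v_x, v_y], A is a constant-velocity transition matrix, Q is taken to be diagonal, and the variance values are placeholders. The names `make_A`, `make_Q`, `predict`, `sigma_p2` and `sigma_v2` are hypothetical.

```python
import numpy as np

def make_A(dt):
    """Constant-velocity transition matrix for an assumed state [x, y, vx, vy]."""
    return np.array([[1.0, 0.0, dt, 0.0],
                     [0.0, 1.0, 0.0, dt],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def make_Q(sigma_p2, sigma_v2):
    """Process covariance, assumed diagonal: coordinate variance, then velocity variance."""
    return np.diag([sigma_p2, sigma_p2, sigma_v2, sigma_v2])

def predict(c_post, p_post, A, Q):
    """Kalman prediction: prior state and prior covariance from the previous posterior."""
    c_prior = A @ c_post
    p_prior = A @ p_post @ A.T + Q
    return c_prior, p_prior

# Initialisation at t = 0: position from the normalized joint feature,
# zero velocity, identity covariance.
c0 = np.array([0.40, 0.50, 0.0, 0.0])
P0 = np.eye(4)

A = make_A(dt=1.0)                        # illustrative frame interval
Q = make_Q(sigma_p2=1e-4, sigma_v2=1e-4)  # illustrative variances

c1_prior, P1_prior = predict(c0, P0, A, Q)
print(c1_prior)
print(P1_prior)
```

Because the initial velocity is zero, the first prior estimate keeps the joint at its initial position, while the prior covariance grows, reflecting the uncertainty added by the motion model.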

The decision module specifically comprises the following steps:

S31, calculating the degree of matching between the processing queue and all skeletons detected in the current frame to obtain the Mahalanobis distance matrix $D_t$. Specifically: a candidate set is calculated for each skeleton in the processing queue, whose elements are the successfully matched skeletons of the current frame; using the prediction module outputs $\hat{C}^{-}_{t,v,m}$ and $P^{-}_{t,v,m}$, the Mahalanobis distance $D_{t,v,nm}$ is calculated between the feature $Z_{t,v,n}$ of each joint point v of every skeleton n in the current frame t and the prior estimate $\hat{C}^{-}_{t,v,m}$ of the corresponding joint of each skeleton m in the processing queue;

where the Mahalanobis distance matrix used to measure the degree of matching is:

$$D_{t,nm} = \min_{v \in V} D_{t,v,nm}$$

and H is the observation matrix, which maps a state estimate to the observed joint coordinates.

S32, obtaining the optimal matching of the matrix $D_t$ with the Hungarian algorithm, using a matching threshold α: if $D_{t,nm} \le \alpha$, the matching succeeds and the observed value $C^{-}_{t,v,m}$ of the current frame is set to the joint point feature $Z_{t,v,n}$; if $D_{t,nm} > \alpha$, the matching fails and the observed value is set to the predicted value;

S33, adding the skeletons left unmatched in S32, because $D_{t,nm} > \alpha$ or N > M, to the processing queue under new numbers.
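The distance computation of S31 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented formula: the state is assumed to be [x, y, v_x, v_y], the observation matrix `H` is assumed to select x and y, and the distance is assumed to be taken under the projected prior covariance H P⁻ Hᵀ. The function names and sample values are hypothetical. Only the rule that the skeleton-level distance is the minimum over joints comes directly from the text.

```python
import numpy as np

# Assumed observation matrix: only the x and y components of the state are observed.
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

def joint_distance(z, c_prior, p_prior):
    """Mahalanobis distance between a detected joint z = (x, y) and a prior estimate."""
    innovation = z - H @ c_prior
    s = H @ p_prior @ H.T  # assumed covariance of the predicted observation
    return float(np.sqrt(innovation @ np.linalg.solve(s, innovation)))

def skeleton_distance(z_joints, c_priors, p_priors):
    """D_{t,nm}: the minimum of the per-joint distances D_{t,v,nm} over all joints v."""
    return min(joint_distance(z, c, p)
               for z, c, p in zip(z_joints, c_priors, p_priors))

# One queue skeleton m with two joints (illustrative prior estimates).
c_priors = [np.array([0.40, 0.50, 0.0, 0.0]), np.array([0.42, 0.60, 0.0, 0.0])]
p_priors = [np.eye(4) * 0.01, np.eye(4) * 0.01]

# One detected skeleton n in the current frame.
z_joints = [np.array([0.43, 0.54]), np.array([0.42, 0.61])]

d_nm = skeleton_distance(z_joints, c_priors, p_priors)
print(d_nm)
```

Evaluating `skeleton_distance` for every pair of detected skeleton n and queue skeleton m fills the matrix $D_t$ that S32 then matches.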

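The update module computes the Kalman gain $K_{t,v,m}$, the posterior estimate $\hat{C}_{t,v,m}$ of the joint point features, and the posterior estimate covariance $P_{t,v,m}$:

$$K_{t,v,m} = P^{-}_{t,v,m} H^{T} \left( H P^{-}_{t,v,m} H^{T} + R \right)^{-1}$$

$$\hat{C}_{t,v,m} = \hat{C}^{-}_{t,v,m} + K_{t,v,m} \left( C^{-}_{t,v,m} - H \hat{C}^{-}_{t,v,m} \right)$$

$$P_{t,v,m} = \left( I - K_{t,v,m} H \right) P^{-}_{t,v,m}$$

where $\hat{C}^{-}_{t,v,m}$ and $C^{-}_{t,v,m}$ are the outputs of the prediction module and the decision module, respectively; I is the identity matrix; and R is the observation noise covariance matrix, whose parameter is the pose estimation variance.

Finally, $C_{t,v,m}$ is updated with the posterior estimate of the joint point features.

$C_{t,v,m}$ is an element of the set $C_{v,m}$ and represents the feature of joint v of skeleton number m in frame t.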
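The update step follows the standard Kalman filter equations and can be sketched as below. The layout of `H`, the diagonal form of `R` and all numeric values are assumptions for illustration, and `update` is a hypothetical name. The second call shows the case of a failed match, where the decision module sets the observed value to the predicted value, so the innovation is zero and the state stays at its prior.

```python
import numpy as np

# Assumed observation matrix for a state [x, y, vx, vy].
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

def update(c_prior, p_prior, observation, R):
    """Standard Kalman update: gain, posterior state, posterior covariance."""
    s = H @ p_prior @ H.T + R
    k = p_prior @ H.T @ np.linalg.inv(s)
    c_post = c_prior + k @ (observation - H @ c_prior)
    p_post = (np.eye(4) - k @ H) @ p_prior
    return c_post, p_post, k

c_prior = np.array([0.40, 0.50, 0.0, 0.0])
p_prior = np.eye(4)
R = np.eye(2)  # illustrative observation noise, assumed diagonal

# Successful match: the observation is the detected joint feature.
c_hit, p_hit, _ = update(c_prior, p_prior, np.array([0.50, 0.60]), R)

# Failed match: the observation is the predicted position itself.
c_miss, p_miss, _ = update(c_prior, p_prior, H @ c_prior, R)

print(c_hit)   # moves halfway towards the detection for these values
print(c_miss)  # unchanged from the prior
```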

The beneficial effects of the invention are as follows: the invention initializes the processing queue from the output of the pose estimation method; then, frame by frame, the Kalman filtering module predicts the features, makes an observation decision on the output of the pose estimation method for the current frame, and uses the decision result to update the processing queue, thereby obtaining the human skeleton sequences. The method has the following advantages:

1. Compared with conventional Kalman filtering, the newly defined decision module further processes the observations of the pose estimation method, giving the Kalman filtering module the ability to track human motion;

2. Feature extraction errors caused by false detections and missed detections in the pose estimation method are corrected by the Kalman filtering algorithm;

3. For video behavior recognition networks such as STGCN, which need the motion information of one and the same skeleton, the invention solves the problem that the network cannot be used directly in multi-person scenes because the pose estimation method assigns skeleton sequence numbers arbitrarily.

Drawings

FIG. 1 is a block diagram of an algorithm flow of the present invention;

FIG. 2 is a block diagram of a Kalman filtering module for constructing a human skeleton sequence according to the invention;

FIG. 3 is a comparison of the construction result with the original data when a frame is skipped in skeleton detection;

FIG. 4 is a comparison of the construction result with the original data when a false detection occurs in skeleton detection;

FIG. 5 is a comparison of the construction result with the original data when skeleton sequence numbers are frequently swapped in a multi-person scene;

FIG. 6 is a comparison of the construction result with the original data when simulated multi-person trajectories partially overlap;

FIG. 7 is a comparison of the STGCN recognition results for the original data and for the skeleton sequences constructed according to the invention.

Detailed Description

The invention provides a human skeleton sequence construction method based on Kalman filtering. Instead of feeding the skeletons from pose estimation directly into the recognition network, the method uses Kalman filtering to predict the probability distribution of each target joint point over time and sets a probability threshold to select the best-matching skeleton, which handles outliers and missing values. This yields a more stable skeleton sequence and enables target skeleton tracking in multi-person scenes, which to some extent avoids behavior recognition errors caused by confused numbering among several people. The technical solution of the invention is further described below with reference to the accompanying drawings.

As shown in FIG. 1, the human skeleton sequence construction method based on Kalman filtering comprises the following steps:

The skeleton set is defined as $Z_{F,T,V,N}$, where F = {f | f = x, f = y} is the set of joint features, i.e. the two-dimensional image coordinates; T is the total number of frames; V is the total number of joints, and if the human pose estimation method OpenPose is used, the valid joint indices are v = [v₀, v₁, ..., v₂₄]; N is the total number of skeletons, which are numbered arbitrarily according to how many skeletons are identified in a given frame, so the numbers in different frames are not related;

The normalization method is:

$$Z_{x,t,v,n} = \frac{X_{t,v,n} - V_{xmin}}{V_{xmax} - V_{xmin}}, \qquad Z_{y,t,v,n} = \frac{Y_{t,v,n} - V_{ymin}}{V_{ymax} - V_{ymin}}$$

where $Z_{x,t,v,n}$ and $Z_{y,t,v,n}$ are the normalized joint point features of joint v of the skeleton with sequence number n in frame t; $X_{t,v,n}$ and $Y_{t,v,n}$ are the pose estimation results for that joint; $V_{xmax}$ and $V_{ymax}$ are the maximum valid values of the corresponding feature, generally the image width and height; and $V_{xmin}$ and $V_{ymin}$ are the minimum valid values of the corresponding feature, generally zero.
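As a small worked example of this normalization, the sketch below assumes min-max scaling with the minimum valid value fixed at zero and the maximum valid value equal to the image width or height. The function names and the 640 x 480 frame size are hypothetical.

```python
def normalize(value, v_min, v_max):
    """Min-max normalization of one coordinate to the range [0, 1]."""
    return (value - v_min) / (v_max - v_min)

def normalize_joint(x, y, width, height):
    """Normalize pixel coordinates, taking 0 as the minimum valid value."""
    return normalize(x, 0.0, width), normalize(y, 0.0, height)

# A joint detected at pixel (320, 120) in an illustrative 640 x 480 frame.
zx, zy = normalize_joint(320.0, 120.0, width=640.0, height=480.0)
print(zx, zy)  # 0.5 0.25
```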

S2, numbering the skeletons in the first video frame whose features are not all invalid values, and adding them to a processing queue; the skeletons in the processing queue are defined as $C_{V,M}$, where M is the total number of skeletons and V is the total number of joints; $C_{v,m}$ denotes the set of all features of joint v of skeleton number m over the video, m = 1, 2, ...
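Step S2 can be sketched as follows. The sketch assumes that an invalid joint is reported as (0, 0), which matches how a missed detection appears in the experiments of this description, and that a skeleton is a list of (x, y) tuples; `init_queue` and the sample frame are hypothetical.

```python
def init_queue(first_frame):
    """Number every first-frame skeleton that has at least one valid joint.

    first_frame: list of skeletons, each a list of normalized (x, y) joints.
    Returns a dict mapping skeleton number m (starting at 1) to its sequence,
    which initially holds only the first-frame skeleton.
    """
    queue = {}
    for skeleton in first_frame:
        # Assumption: a joint is invalid when both coordinates are zero.
        if any(x != 0.0 or y != 0.0 for x, y in skeleton):
            queue[len(queue) + 1] = [skeleton]
    return queue

first_frame = [
    [(0.40, 0.50), (0.00, 0.00)],  # one valid joint: kept
    [(0.00, 0.00), (0.00, 0.00)],  # all joints invalid: skipped
    [(0.70, 0.20), (0.70, 0.30)],  # kept
]
queue = init_queue(first_frame)
print(sorted(queue))  # [1, 2]
```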

The Kalman filtering module for constructing the human skeleton sequence comprises a forward prediction module, a decision module and an updating module, which are connected in sequence, as shown in figure 2.

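The posterior state estimate $\hat{C}_{t,v,m}$ is defined as:

$$\hat{C}_{t,v,m} = [x,\ y,\ v_x,\ v_y]^{T}$$

When t = 0, x and y are defined as $Z_{x,t,v,n}$ and $Z_{y,t,v,n}$; n is the skeleton sequence number, and n equals m at initialization; $v_x$ and $v_y$ are defined as 0. When t ≠ 0, $\hat{C}_{t,v,m}$ is the iteration result of the Kalman filtering module and represents the posterior state estimate of joint v of skeleton number m in frame t.

The state transition matrix A is defined as:

$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where dt is the video frame time interval.

The process covariance matrix Q is defined as above, in terms of the variance of the human joint coordinates and the variance of the coordinate movement velocity.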

The decision module specifically comprises the following steps:

S31, calculating the degree of matching between the processing queue (the numbered skeleton sequences) and all skeletons detected in the current frame to obtain the Mahalanobis distance matrix $D_t$. Specifically: a candidate set is calculated for each skeleton in the processing queue, whose elements are the successfully matched skeletons of the current frame; using the prediction module outputs, namely the prior state estimate $\hat{C}^{-}_{t,v,m}$ and the prior state estimate covariance $P^{-}_{t,v,m}$, the Mahalanobis distance $D_{t,v,nm}$ is calculated between the feature $Z_{t,v,n}$ of each joint point v of every skeleton n in the current frame t and the prior estimate $\hat{C}^{-}_{t,v,m}$ of the corresponding joint of each skeleton m in the processing queue.

$$D_{t,nm} = \min_{v \in V} D_{t,v,nm}$$

H is the observation matrix, which maps a state estimate to the observed joint coordinates.

S32, the matching between the numbered skeleton sequences and all skeletons of the current frame is essentially an assignment problem. The Hungarian algorithm is used to obtain the optimal matching of the matrix $D_t$, with a matching threshold α: if $D_{t,nm} \le \alpha$, the matching succeeds and the observed value $C^{-}_{t,v,m}$ of the current frame is set to the joint point feature $Z_{t,v,n}$; if $D_{t,nm} > \alpha$, the matching fails and the observed value is set to the predicted value.
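The assignment and threshold decision can be sketched with the standard library alone. Note the substitution: instead of the Hungarian algorithm, this sketch searches all one-to-one assignments exhaustively, which returns a minimum-total-distance matching of the same cost on small matrices but scales factorially; in practice a routine such as `scipy.optimize.linear_sum_assignment` would normally be used. The function names, the distance values and the threshold are hypothetical.

```python
from itertools import permutations

def optimal_assignment(dist):
    """Minimum-total-distance one-to-one assignment by exhaustive search.

    dist[n][m] is the distance between detected skeleton n and queue skeleton m.
    Returns a list of (n, m) pairs. Stand-in for the Hungarian algorithm.
    """
    if not dist:
        return []
    n_det, n_trk = len(dist), len(dist[0])
    if n_det <= n_trk:
        candidates = ([(n, m) for n, m in enumerate(cols)]
                      for cols in permutations(range(n_trk), n_det))
    else:
        candidates = ([(n, m) for m, n in enumerate(rows)]
                      for rows in permutations(range(n_det), n_trk))
    return min(candidates, key=lambda pairs: sum(dist[n][m] for n, m in pairs))

def decide(dist, alpha):
    """Keep assigned pairs within the threshold; list detections needing a new number."""
    pairs = optimal_assignment(dist)
    matches = {m: n for n, m in pairs if dist[n][m] <= alpha}
    matched = set(matches.values())
    new_detections = [n for n in range(len(dist)) if n not in matched]
    return matches, new_detections

# Three detections, two queue skeletons (N > M), illustrative distances.
dist = [[0.2, 3.0],
        [2.5, 0.4],
        [4.0, 5.0]]
matches, new_detections = decide(dist, alpha=1.0)
print(matches)         # {0: 0, 1: 1}
print(new_detections)  # [2]
```

A queue skeleton that is absent from `matches` receives no observation in this frame, which corresponds to setting its observed value to the predicted value; every index in `new_detections` would be added to the processing queue under a new number.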

The parameter of the observation noise covariance matrix R is the pose estimation variance.

Finally, $C_{t,v,m}$ is updated with the posterior estimate of the joint point features.

$C_{t,v,m}$ is an element of the set $C_{v,m}$ and represents the feature of joint v of skeleton number m in frame t.

In this embodiment, experiments were performed with Python 3.8 on a Windows operating system using the Kinetics-700 dataset. Videos that are typical examples of detection errors, missed-detection frames and multi-person scenes were selected as the material for constructing skeleton sequences. The common parameters used in the verification are listed in Table 1.

Table 1 common parameter table

The video is input into the OpenPose human pose estimation method; its output is parsed and normalized and then input into the Kalman filtering module for constructing human skeleton sequences, which yields the constructed skeleton sequences.

Fig. 3 compares the pose estimation result with the neck joint of the constructed skeleton sequence when pose estimation skips a frame. Fig. 3 (a) is the image drawn from the neck joint point feature in the OpenPose output at frame 2, where no human skeleton is detected; Fig. 3 (b) compares, frame by frame on the x-axis, the normalized pose estimation result (marked with asterisks) with the constructed skeleton sequence (dotted line); and Fig. 3 (c) is the corresponding frame-by-frame comparison on the y-axis. In frame 2, because the scene is complex, OpenPose does not detect the human skeleton, i.e. the x and y feature coordinates are 0; the Kalman filtering module filters out this outlier and replaces it with the prior estimate to update the skeleton sequence.

Fig. 4 compares the pose estimation result with the neck joint of the constructed skeleton sequence when pose estimation produces a false detection. Fig. 4 (a) is the image drawn from the neck joint point feature in the OpenPose output at frame 4; Fig. 4 (b) compares, frame by frame on the x-axis, the normalized pose estimation result (marked with asterisks) with the constructed skeleton sequence (dotted line); and Fig. 4 (c) is the corresponding frame-by-frame comparison on the y-axis. In frames 4 and 5, background interference causes OpenPose detection errors, so the x and y coordinates show a large offset; the x and y values are corrected after passing through the Kalman filtering module.

Fig. 5 compares the pose estimation result with the constructed skeleton sequences when the skeleton sequence numbers obtained by pose estimation are switched repeatedly. Fig. 5 (a) is an image drawn from the OpenPose output. Fig. 5 (b) shows the neck joint features of the two skeletons in the OpenPose output: the horizontal axis is the normalized x feature and the vertical axis is the normalized y feature; asterisks mark the neck joint feature $Z_{f,t,1,0}$ of the skeleton that pose estimation outputs with number 0, and plus signs mark the feature $Z_{f,t,1,1}$ of the skeleton with number 1; the solid and broken lines are the feature curves of the neck joints $C_{t,1,0}$ and $C_{t,1,1}$ in the constructed skeleton sequences numbered 0 and 1, respectively; the subscript v = 1 denotes the neck joint. Fig. 5 (c) and Fig. 5 (d) compare, against the frame number, the construction results for the x feature and the y feature of Fig. 5 (b), respectively.

Fig. 5 (b) shows that the two detected human skeletons lie in the upper left and lower right regions. However, OpenPose processes each frame separately and its detection results are independent between frames, so the skeleton numbers are not fixed and the numbers of the skeletons in the upper left and lower right regions are frequently switched. The results show that, after passing through the Kalman filtering module for constructing skeleton sequences, the inter-frame feature changes of the two skeletons are tracked effectively.

Fig. 6 compares the data before and after skeleton sequence construction for simulated multi-person trajectories that partially overlap. A program generates two paths, from (0.35, 0.47) to (0.39, 0.32) and from (0.5, 0.37) to (0.31, 0.46), samples them, and adds Gaussian noise distributed as N(0, 0.0001). The resulting discrete points simulate joint coordinates and joint motion, and each point is assigned skeleton number 0 or 1 at random. Fig. 6 (a) shows the generated discrete point coordinates. Fig. 6 (b) and Fig. 6 (c) compare, against the frame number, the construction results for the x feature and the y feature of Fig. 6 (a), respectively.

It can be seen that, in a multi-person scene, provided the correct joint features are detected, the invention can still capture inter-frame joint information and distinguish different skeletons even when the joint points are close together and their motion trajectories overlap.
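The simulated input described for Fig. 6 could be generated along the following lines. The path endpoints and the noise distribution come from the text; the straight-line shape of the paths, the reading of 0.0001 as a variance, the number of frames and the random seed are assumptions, and `simulate_paths` is a hypothetical name.

```python
import random

def simulate_paths(n_frames=50, noise_var=1e-4, seed=0):
    """Sample two noisy paths and shuffle the skeleton numbering per frame.

    Returns a list of frames; each frame is a list of two (x, y) points whose
    list index plays the role of the arbitrary skeleton number 0 or 1.
    """
    rng = random.Random(seed)
    paths = [((0.35, 0.47), (0.39, 0.32)),
             ((0.50, 0.37), (0.31, 0.46))]
    sigma = noise_var ** 0.5  # assumed: 0.0001 is the variance of the noise
    frames = []
    for t in range(n_frames):
        s = t / (n_frames - 1)  # fraction of the path travelled
        points = []
        for (x0, y0), (x1, y1) in paths:
            x = x0 + s * (x1 - x0) + rng.gauss(0.0, sigma)
            y = y0 + s * (y1 - y0) + rng.gauss(0.0, sigma)
            points.append((x, y))
        if rng.random() < 0.5:
            points.reverse()  # swap the numbers of the two skeletons
        frames.append(points)
    return frames

frames = simulate_paths()
print(len(frames), frames[0])
```

Feeding such frames to a tracker makes it easy to check whether each constructed sequence follows a single path despite the shuffled numbering.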

All constructed skeleton sequences, together with the original data, are input into STGCN. The model weights are the officially provided st_gcn.graphics.pt, and the STGCN inference results for some of the videos are shown in Table 2.

Sequence number: the video sequence number. All videos are taken from the abseiling label of Kinetics-700. In video 1 the camera is stable and the person's behavior is clear. Video 2 is the video shown in Fig. 3, whose pose estimation result is unstable, with missed and false detections. Video 3 is the video shown in Fig. 5, which has the problem of skeleton numbers switching between several people. 1', 2' and 3' are the STGCN inference results for the skeleton sequences constructed by the method of the invention from videos 1, 2 and 3;

Frame number: the total number of recognized video frames;

Output frame number: the number of frames of the STGCN output features; STGCN compresses the original frame number T;

Abseiling: the number of output frames recognized as the abseiling behavior;

Water skiing: the number of output frames recognized as the water skiing behavior, which is easily confused with abseiling;

Number of other behaviors: the number of distinct behaviors, among the output frames, recognized as other behaviors;

Other behavior frame number: the number of output frames recognized as other behaviors;

Voting result: the behavior recognized in the largest number of output frames is taken as the voting result and is regarded as the main behavior in the video.
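The voting rule is a plain majority vote over the per-frame labels and can be sketched as follows. The label list is made up for illustration and does not reproduce any row of Table 2; `vote` is a hypothetical name.

```python
from collections import Counter

def vote(frame_labels):
    """Return the most frequent per-frame label together with all label counts."""
    counts = Counter(frame_labels)
    label, _ = counts.most_common(1)[0]
    return label, counts

# Made-up per-frame recognition results for one video.
frame_labels = ["abseiling"] * 5 + ["water skiing"] * 2 + ["rock climbing"]

result, counts = vote(frame_labels)
other = {k: v for k, v in counts.items() if k not in ("abseiling", "water skiing")}
print(result)               # abseiling
print(len(other))           # number of other behaviors: 1
print(sum(other.values()))  # other behavior frame number: 1
```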

TABLE 2 comparison of the results of reasoning on STGCN for part of the video

Abseiling and water skiing have similar features once the background is ignored, so in this case both abseiling and water skiing can be regarded as correct recognition results. The comparison, based on Table 2, between the recognition results for the original data and for the human skeleton sequences constructed by the invention is shown in Fig. 7, where, in each pair of adjacent bars, the left bar is the data before processing and the right bar is the processed data. The comparison shows that the skeleton sequences constructed by the method improve recognition accuracy to varying degrees compared with recognition directly on the OpenPose pose estimation results. The constructed skeleton sequences also reduce the adverse effect that false detections of the pose estimation method have on the recognition result, i.e. the number of misrecognized behaviors decreases. For STGCN, which needs to extract temporal information, the problem of the pose estimation method outputting skeleton sequence numbers arbitrarily is effectively solved.

Those of ordinary skill in the art will recognize that the embodiments described herein are for the purpose of aiding the reader in understanding the principles of the present invention and should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations from the teachings of the present disclosure without departing from the spirit thereof, and such modifications and combinations remain within the scope of the present disclosure.

Claims

1. A human skeleton sequence construction method based on Kalman filtering, characterized by comprising the following steps:

S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into a Kalman filtering module for constructing human skeleton sequences, and updating the processing queue frame by frame according to the processing result; the Kalman filtering module for constructing human skeleton sequences comprises a forward prediction module, a decision module and an update module;

the posterior state estimate $\hat{C}_{t,v,m}$ is defined as:

$$\hat{C}_{t,v,m} = [x,\ y,\ v_x,\ v_y]^{T}$$

the state transition matrix A is defined as:

$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

dt is the video frame time interval;

the process covariance matrix Q is defined in terms of two parameters: the variance of the human joint coordinates and the variance of the coordinate movement velocity;

the decision module specifically comprises the following steps:

$$D_{t,nm} = \min_{v \in V} D_{t,v,nm}$$

H is the observation matrix;

S33, adding the skeletons left unmatched in S32, because $D_{t,nm} > \alpha$ or N > M, to the processing queue under new numbers;

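2. The human skeleton sequence construction method based on Kalman filtering according to claim 1, wherein the update module computes the Kalman gain $K_{t,v,m}$, the posterior estimate $\hat{C}_{t,v,m}$ of the joint point features, and the posterior estimate covariance $P_{t,v,m}$:

$$K_{t,v,m} = P^{-}_{t,v,m} H^{T} \left( H P^{-}_{t,v,m} H^{T} + R \right)^{-1}$$

$$\hat{C}_{t,v,m} = \hat{C}^{-}_{t,v,m} + K_{t,v,m} \left( C^{-}_{t,v,m} - H \hat{C}^{-}_{t,v,m} \right)$$

$$P_{t,v,m} = \left( I - K_{t,v,m} H \right) P^{-}_{t,v,m}$$

where $\hat{C}^{-}_{t,v,m}$ and $C^{-}_{t,v,m}$ are the outputs of the prediction module and the decision module, respectively; I is the identity matrix; R is the observation noise covariance matrix, whose parameter is the pose estimation variance;

finally, $C_{t,v,m}$ is updated with the posterior estimate of the joint point features.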