CN115050055A - Human body skeleton sequence construction method based on Kalman filtering - Google Patents

Human body skeleton sequence construction method based on Kalman filtering

Info

Publication number
CN115050055A
CN115050055A
Authority
CN
China
Prior art keywords
frame
skeleton
kalman filtering
processing queue
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210788077.4A
Other languages
Chinese (zh)
Other versions
CN115050055B (en)
Inventor
彭倍
刁宏健
邵继业
杨文章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202210788077.4A
Publication of CN115050055A
Application granted
Publication of CN115050055B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body skeleton sequence construction method based on Kalman filtering, which comprises the following steps: S1, performing pose estimation and normalizing the joint point features; S2, numbering all skeletons of the first video frame whose features are not all invalid values and adding them to a processing queue; S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into a Kalman filtering module for constructing human skeleton sequences, and updating the processing queue according to the processing result; S4, repeating S3 for each video frame until all frames are processed, at which point each numbered skeleton sequence in the processing queue is a constructed human skeleton sequence. Compared with traditional Kalman filtering, the newly defined decision module further processes the observations produced by the pose estimation method, so that the Kalman filtering module can track human motion; the Kalman filtering algorithm also corrects feature extraction errors caused by false detections and missed detections of the pose estimation method.

Description

Human skeleton sequence construction method based on Kalman filtering
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a human skeleton sequence construction method based on Kalman filtering.
Background
Video-based behavior recognition is one of the representative tasks of video understanding; the task of recognizing human actions in video is called video behavior recognition. Current deep-learning-based video behavior recognition methods include Two-stream Networks, 3D Convolutional Neural Networks, and other non-end-to-end methods.
The Spatial Temporal Graph Convolutional Network (STGCN) and the Pose-based CNN (P-CNN) introduced a new non-end-to-end approach to behavior recognition. Such methods extract joint points frame by frame with an advanced pose estimation method, cluster the joint points into skeletons that serve as the network input, and extract and fuse joint features to recognize the video behavior.
Because pose information is closely related to human behavior, these methods achieve good results on behavior recognition that does not depend on background information. However, mainstream human pose estimation methods such as OpenPose do not associate skeletons across frames, and some methods can only analyze the behavior of a single person. Such approaches also depend heavily on the joint information extracted by the pose estimation method and are quite sensitive to its false and missed detections caused by disturbances in the video frames; moreover, when several people appear in a video at the same time, the pose estimation method numbers each skeleton randomly, which leads to erroneous behavior recognition results.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a human body skeleton sequence construction method based on Kalman filtering, which compares the features predicted by the Kalman filtering module with the output of the pose estimation method for the current frame to decide the observations, and updates the processing queue with the decision results to obtain the human skeleton sequences. This solves the problem of feature extraction errors caused by false and missed detections of the pose estimation method.
The purpose of the invention is achieved by the following technical scheme: a human body skeleton sequence construction method based on Kalman filtering comprises the following steps:
S1, performing pose estimation frame by frame on a video containing human behavior information, and normalizing all joint point features to obtain a skeleton set carrying the joint feature information;
S2, numbering all skeletons of the first video frame whose features are not all invalid values, and adding them to a processing queue;
S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into a Kalman filtering module for constructing human skeleton sequences, and updating the processing queue frame by frame according to the processing result;
S4, repeating S3 for each video frame until all frames are processed; each numbered skeleton sequence in the processing queue is then a constructed human skeleton sequence.
Further, in step S3, the Kalman filtering module for constructing the human skeleton sequences comprises a prediction module, a decision module and an update module connected in series.
The prediction module is the prediction part of the Kalman filtering algorithm. Its input is all numbered skeletons in the processing queue, and its output is:

$$\hat{x}^{-}_{t,v,m} = A\,\hat{x}_{t-1,v,m}$$

$$P^{-}_{t,v,m} = A\,P_{t-1,v,m}\,A^{T} + Q$$

where $\hat{x}^{-}_{t,v,m}$ is the prior state estimate at frame $t$ predicted by the motion model from the posterior state estimate $\hat{x}_{t-1,v,m}$ at frame $t-1$, with $t = 1, \dots, T$ and $T$ the total number of frames; $A$ is the state transition matrix; $P^{-}_{t,v,m}$ is the prior estimate covariance at frame $t$ computed from the posterior estimate covariance $P_{t-1,v,m}$ at frame $t-1$; $Q$ is the process covariance matrix; and the superscript $T$ denotes transposition.
The posterior state estimate $\hat{x}_{t,v,m}$ is defined as:

$$\hat{x}_{t,v,m} = \begin{bmatrix} x & y & v_x & v_y \end{bmatrix}^{T}$$

When $t = 0$, $x$ and $y$ are initialized to $Z_{x,t,v,n}$ and $Z_{y,t,v,n}$, the normalized features of joint $v$ of the skeleton numbered $n$ at frame $t$ ($n$ is the skeleton number, and $n = m$ at initialization), and $v_x = v_y = 0$. When $t \neq 0$, $\hat{x}_{t,v,m}$ is the iteration result of the Kalman filtering module, i.e. the posterior state estimate of joint $v$ of the skeleton numbered $m$ at frame $t$.
When $t = 0$, $P_{t,v,m}$ is defined as the identity matrix; when $t \neq 0$, $P_{t,v,m}$ is the iteration result of the Kalman filtering module, i.e. the posterior estimate covariance of joint $v$ of the skeleton numbered $m$ at frame $t$.
The state transition matrix $A$ is defined as:

$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where $dt$ is the video frame time interval.
The process covariance matrix $Q$ is defined as:

$$Q = \begin{bmatrix} \sigma_p^{2} & 0 & 0 & 0 \\ 0 & \sigma_p^{2} & 0 & 0 \\ 0 & 0 & \sigma_v^{2} & 0 \\ 0 & 0 & 0 & \sigma_v^{2} \end{bmatrix}$$

where $\sigma_p^{2}$ is the coordinate variance of the human joints and $\sigma_v^{2}$ is the coordinate velocity variance parameter.
The decision module comprises the following steps:
S31, calculating the matching degree between the processing queue and all skeletons detected in the current frame to obtain the Mahalanobis distance matrix $D_t$. Specifically, a candidate set is computed for each skeleton in the processing queue, whose elements are the current-frame skeletons that match it successfully. Using the prediction module outputs $\hat{x}^{-}_{t,v,m}$ and $P^{-}_{t,v,m}$, the Mahalanobis distance $D_{t,v,nm}$ between the feature $Z_{t,v,n}$ of each joint $v$ of every skeleton $n$ detected in the current frame $t$ and the prior estimate $\hat{x}^{-}_{t,v,m}$ of each skeleton $m$ in the processing queue is computed as:

$$D_{t,v,nm} = \sqrt{\left(Z_{t,v,n} - H\hat{x}^{-}_{t,v,m}\right)^{T}\left(H P^{-}_{t,v,m} H^{T}\right)^{-1}\left(Z_{t,v,n} - H\hat{x}^{-}_{t,v,m}\right)}$$

The Mahalanobis distance matrix used to measure the matching degree is:

$$D_t = \left[D_{t,nm}\right]_{N \times M}, \qquad D_{t,nm} = \min_{v \in V} D_{t,v,nm}$$

where $H$ is the observation matrix:

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

S32, obtaining the optimal matching of the matrix $D_t$ with the Hungarian algorithm, using a matching threshold $\alpha$. If $D_{t,nm} \le \alpha$, the match succeeds and the current-frame observation $C^{-}_{t,v,m}$ is set to the joint feature $Z_{t,v,n}$; if $D_{t,nm} > \alpha$, the match fails and the observation is set to the predicted value:

$$C^{-}_{t,v,m} = H\,\hat{x}^{-}_{t,v,m}$$

S33, skeletons of the current frame that remain unmatched in S32 because $D_{t,nm} > \alpha$ or $N > M$ are assigned new numbers and added to the processing queue.
The update module is specifically as follows: calculate the Kalman gain $K_{t,v,m}$, the posterior estimate of the joint point feature $\hat{x}_{t,v,m}$ and the posterior estimate covariance $P_{t,v,m}$:

$$K_{t,v,m} = P^{-}_{t,v,m} H^{T}\left(H P^{-}_{t,v,m} H^{T} + R\right)^{-1}$$

$$\hat{x}_{t,v,m} = \hat{x}^{-}_{t,v,m} + K_{t,v,m}\left(C^{-}_{t,v,m} - H\hat{x}^{-}_{t,v,m}\right)$$

$$P_{t,v,m} = \left(I - K_{t,v,m} H\right) P^{-}_{t,v,m}$$

where $\hat{x}^{-}_{t,v,m}$, $P^{-}_{t,v,m}$ and $C^{-}_{t,v,m}$ are the outputs of the prediction and decision modules, and $R$ is the observation noise covariance matrix:

$$R = \begin{bmatrix} \sigma_z^{2} & 0 \\ 0 & \sigma_z^{2} \end{bmatrix}$$

where $\sigma_z^{2}$ is the pose estimation variance.
Finally, $C_{t,v,m}$ is updated as:

$$C_{t,v,m} = H\,\hat{x}_{t,v,m}$$

where $C_{t,v,m}$, an element of the set $C_{v,m}$, denotes the feature of joint $v$ in the skeleton numbered $m$ at frame $t$.
The beneficial effects of the invention are as follows: the invention initializes the processing queue with the results of the pose estimation method and then, frame by frame, predicts features with the Kalman filtering module, compares them with the output of the pose estimation method for the current frame to decide the observations, and updates the processing queue with the decision results, thereby obtaining the human skeleton sequences. The method has the following advantages:
1. Compared with traditional Kalman filtering, the newly defined decision module further processes the observations of the pose estimation method, so that the Kalman filtering module can track human motion;
2. The Kalman filtering algorithm corrects feature extraction errors caused by false detections and missed detections of the pose estimation method;
3. For video behavior recognition networks such as STGCN that must extract the motion information of the same skeleton over time, the method solves the problem that such networks cannot be used directly in multi-person scenes because the skeleton numbers produced by the pose estimation method are random.
Drawings
FIG. 1 is a block diagram of the algorithm flow of the present invention;
FIG. 2 is a block diagram of a Kalman filtering module for constructing a human skeleton sequence according to the present invention;
FIG. 3 is a comparison of the constructed result and the original data when a frame is missed in skeleton detection;
FIG. 4 is a comparison of the constructed result and the original data when a false detection occurs in skeleton detection;
FIG. 5 is a comparison of the constructed result and the original data when the skeleton numbers in a multi-person scene are frequently swapped;
FIG. 6 is a comparison of the constructed result and the original data when the trajectories in a simulated multi-person scene partially overlap;
FIG. 7 is a comparison of the recognition results on STGCN between the original data and the skeleton sequences constructed by the present invention.
Detailed Description
The invention provides a human body skeleton sequence construction method based on Kalman filtering. Instead of feeding the pose-estimation skeletons directly into a behavior recognition network, the method uses Kalman filtering to predict the probability distribution of the target joint points over time and handles outliers and missing values by setting a probability threshold to match the best-fitting skeleton, so that a more stable skeleton sequence is obtained. The technical scheme of the invention is further explained below with reference to the drawings.
As shown in FIG. 1, the human skeleton sequence construction method based on Kalman filtering of the invention comprises the following steps:
S1, performing pose estimation frame by frame on the video containing human behavior information, and normalizing all joint point features to obtain a skeleton set carrying the joint feature information;
the skeleton set is defined as Z F,T,V,N Where F ═ { F ═ x, F ═ y } is an articulation point feature, i.e., image twoDimensional coordinate information; t is the total frame number, V is the total joint number, if the human body posture estimation method openposition is used, the effective value of the joint number is V ═ V 0 ,v 1 ,...,v 24 ](ii) a N is a total framework serial number which is a random number carried out in a certain frame according to the quantity of the identified frameworks, and the numbers of different frames are not related;
the normalization method comprises the following steps:
Figure BDA0003732351630000051
Figure BDA0003732351630000052
Figure BDA0003732351630000053
wherein Z is x,t,v,n 、Z y,t,v,n Normalized joint point feature of joint v at frame t, X, with sequence number n t,v,n 、Y t,v,n The attitude estimation result of the joint V with the sequence number n in the t frame, V xmax 、V ymax For maximum significant value on the corresponding feature, typically image width and height, V xmin 、V ymin Is the smallest valid value on the corresponding feature, typically zero.
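As an illustration of this normalization, a minimal NumPy sketch is given below. The array layout (frames × skeletons × joints × 2) and the function name normalize_joints are assumptions made for the example, not prescribed by the patent; the code simply applies the min-max formulas above with $V_{x\min} = V_{y\min} = 0$ and $V_{x\max}$, $V_{y\max}$ equal to the image width and height.

```python
import numpy as np

def normalize_joints(keypoints, img_w, img_h):
    """Min-max normalize 2-D joint coordinates to [0, 1].

    keypoints: array of shape (T, N, V, 2) holding the (X, Y) pose-estimation
    results per frame t, skeleton n and joint v; undetected joints stay at 0.
    """
    z = np.empty_like(keypoints, dtype=np.float64)
    z[..., 0] = keypoints[..., 0] / float(img_w)   # Z_x = (X - V_xmin) / (V_xmax - V_xmin), V_xmin = 0
    z[..., 1] = keypoints[..., 1] / float(img_h)   # Z_y = (Y - V_ymin) / (V_ymax - V_ymin), V_ymin = 0
    return z
```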
S2, numbering all skeletons of the first video frame whose features are not all invalid values, and adding them to the processing queue. A skeleton in the processing queue is defined as $C_{V,M}$, where $M$ is the total number of skeletons and $V$ the total number of joints; $C_{v,m}$ denotes the set of all features of joint $v$ of the skeleton numbered $m$ over the video, with $m = 1, 2, \dots, M$.
S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into the Kalman filtering module for constructing human skeleton sequences, and updating the processing queue frame by frame according to the processing result;
The Kalman filtering module for constructing the human skeleton sequences comprises a prediction module, a decision module and an update module connected in series, as shown in FIG. 2.
The prediction module is the prediction part of the Kalman filtering algorithm. Its input is all numbered skeletons in the processing queue, and its output is:

$$\hat{x}^{-}_{t,v,m} = A\,\hat{x}_{t-1,v,m}$$

$$P^{-}_{t,v,m} = A\,P_{t-1,v,m}\,A^{T} + Q$$

where $\hat{x}^{-}_{t,v,m}$ is the prior state estimate at frame $t$ predicted by the motion model from the posterior state estimate $\hat{x}_{t-1,v,m}$ at frame $t-1$, with $t = 1, \dots, T$ and $T$ the total number of frames; $A$ is the state transition matrix; $P^{-}_{t,v,m}$ is the prior estimate covariance at frame $t$ computed from the posterior estimate covariance $P_{t-1,v,m}$ at frame $t-1$; $Q$ is the process covariance matrix; and the superscript $T$ denotes transposition.
The posterior state estimate $\hat{x}_{t,v,m}$ is defined as:

$$\hat{x}_{t,v,m} = \begin{bmatrix} x & y & v_x & v_y \end{bmatrix}^{T}$$

When $t = 0$, $x$ and $y$ are initialized to $Z_{x,t,v,n}$ and $Z_{y,t,v,n}$, the normalized features of joint $v$ of the skeleton numbered $n$ at frame $t$ ($n$ is the skeleton number, and $n = m$ at initialization), and $v_x = v_y = 0$. When $t \neq 0$, $\hat{x}_{t,v,m}$ is the iteration result of the Kalman filtering module, i.e. the posterior state estimate of joint $v$ of the skeleton numbered $m$ at frame $t$.
When $t = 0$, $P_{t,v,m}$ is defined as the identity matrix; when $t \neq 0$, $P_{t,v,m}$ is the iteration result of the Kalman filtering module, i.e. the posterior estimate covariance of joint $v$ of the skeleton numbered $m$ at frame $t$.
The state transition matrix $A$ is defined as:

$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where $dt$ is the video frame time interval.
The process covariance matrix $Q$ is defined as:

$$Q = \begin{bmatrix} \sigma_p^{2} & 0 & 0 & 0 \\ 0 & \sigma_p^{2} & 0 & 0 \\ 0 & 0 & \sigma_v^{2} & 0 \\ 0 & 0 & 0 & \sigma_v^{2} \end{bmatrix}$$

where $\sigma_p^{2}$ is the coordinate variance of the human joints and $\sigma_v^{2}$ is the coordinate velocity variance parameter.
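A minimal NumPy sketch of this prediction step for a single joint is shown below, assuming the constant-velocity transition matrix and diagonal process covariance given above; the frame interval and variance values are illustrative placeholders rather than parameters fixed by the patent.

```python
import numpy as np

def kalman_predict(x_post, P_post, dt=1.0 / 30, sigma_p2=1e-4, sigma_v2=1e-4):
    """One prediction step for the per-joint state [x, y, v_x, v_y]."""
    A = np.array([[1, 0, dt, 0],        # constant-velocity state transition matrix A
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=np.float64)
    Q = np.diag([sigma_p2, sigma_p2, sigma_v2, sigma_v2])   # process covariance matrix Q
    x_prior = A @ x_post                 # prior state estimate
    P_prior = A @ P_post @ A.T + Q       # prior estimate covariance
    return x_prior, P_prior

# Initialization at t = 0 as defined above: normalized position, zero velocity,
# identity posterior covariance.
x0 = np.array([0.42, 0.55, 0.0, 0.0])
P0 = np.eye(4)
x_prior, P_prior = kalman_predict(x0, P0)
```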
The decision module comprises the following steps:
S31, calculating the matching degree between the processing queue (the numbered skeleton sequences) and all skeletons detected in the current frame to obtain the Mahalanobis distance matrix $D_t$. Specifically, a candidate set is computed for each skeleton in the processing queue, whose elements are the current-frame skeletons that match it successfully. Using the prediction module outputs $\hat{x}^{-}_{t,v,m}$ and $P^{-}_{t,v,m}$, the Mahalanobis distance $D_{t,v,nm}$ between the feature $Z_{t,v,n}$ of each joint $v$ of every skeleton $n$ detected in the current frame $t$ and the prior estimate $\hat{x}^{-}_{t,v,m}$ of each skeleton $m$ in the processing queue is computed as:

$$D_{t,v,nm} = \sqrt{\left(Z_{t,v,n} - H\hat{x}^{-}_{t,v,m}\right)^{T}\left(H P^{-}_{t,v,m} H^{T}\right)^{-1}\left(Z_{t,v,n} - H\hat{x}^{-}_{t,v,m}\right)}$$

The Mahalanobis distance matrix used to measure the matching degree is:

$$D_t = \left[D_{t,nm}\right]_{N \times M}, \qquad D_{t,nm} = \min_{v \in V} D_{t,v,nm}$$

where $H$ is the observation matrix:

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

S32, the matching between the numbered skeleton sequences and all skeletons of the current frame is essentially an assignment problem, so the optimal matching of the matrix $D_t$ is obtained with the Hungarian algorithm, using a matching threshold $\alpha$. If $D_{t,nm} \le \alpha$, the match succeeds and the current-frame observation $C^{-}_{t,v,m}$ is set to the joint feature $Z_{t,v,n}$; if $D_{t,nm} > \alpha$, the match fails and the observation is set to the predicted value:

$$C^{-}_{t,v,m} = H\,\hat{x}^{-}_{t,v,m}$$

S33, skeletons of the current frame that remain unmatched in S32 because $D_{t,nm} > \alpha$ or $N > M$ are assigned new numbers and added to the processing queue.
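The gating-and-assignment logic of the decision module can be sketched as follows. scipy.optimize.linear_sum_assignment is used here as one standard implementation of the Hungarian algorithm; the threshold value and the array layouts are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=np.float64)   # observation matrix

def mahalanobis_cost(Z, x_priors, P_priors):
    """Cost matrix D_t between N detected and M tracked skeletons.

    Z: (N, V, 2) normalized joint features of the current frame.
    x_priors: (M, V, 4) prior states and P_priors: (M, V, 4, 4) prior
    covariances produced by the prediction module.
    """
    N, V, M = Z.shape[0], Z.shape[1], x_priors.shape[0]
    D = np.empty((N, M))
    for n in range(N):
        for m in range(M):
            d_joint = []
            for v in range(V):
                r = Z[n, v] - H @ x_priors[m, v]        # innovation in feature space
                S = H @ P_priors[m, v] @ H.T            # projected prior covariance
                d_joint.append(np.sqrt(r @ np.linalg.solve(S, r)))
            D[n, m] = min(d_joint)                      # D_{t,nm} = min over joints v
    return D

def decide(D, alpha=3.0):
    """Hungarian assignment on D with gating threshold alpha."""
    rows, cols = linear_sum_assignment(D)
    matches = [(n, m) for n, m in zip(rows, cols) if D[n, m] <= alpha]
    matched_n = {n for n, _ in matches}
    unmatched = [n for n in range(D.shape[0]) if n not in matched_n]
    return matches, unmatched   # unmatched detections receive new numbers (S33)
```

For a matched pair (n, m) the observation is the detected feature $Z_{t,v,n}$; for a tracked skeleton left unmatched, the observation falls back to the predicted value $H\hat{x}^{-}_{t,v,m}$, as described in S32.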
The update module is specifically as follows: calculate the Kalman gain $K_{t,v,m}$, the posterior estimate of the joint point feature $\hat{x}_{t,v,m}$ and the posterior estimate covariance $P_{t,v,m}$:

$$K_{t,v,m} = P^{-}_{t,v,m} H^{T}\left(H P^{-}_{t,v,m} H^{T} + R\right)^{-1}$$

$$\hat{x}_{t,v,m} = \hat{x}^{-}_{t,v,m} + K_{t,v,m}\left(C^{-}_{t,v,m} - H\hat{x}^{-}_{t,v,m}\right)$$

$$P_{t,v,m} = \left(I - K_{t,v,m} H\right) P^{-}_{t,v,m}$$

where $\hat{x}^{-}_{t,v,m}$, $P^{-}_{t,v,m}$ and $C^{-}_{t,v,m}$ are the outputs of the prediction and decision modules, and $R$ is the observation noise covariance matrix:

$$R = \begin{bmatrix} \sigma_z^{2} & 0 \\ 0 & \sigma_z^{2} \end{bmatrix}$$

where $\sigma_z^{2}$ is the pose estimation variance.
Finally, $C_{t,v,m}$ is updated as:

$$C_{t,v,m} = H\,\hat{x}_{t,v,m}$$

where $C_{t,v,m}$, an element of the set $C_{v,m}$, denotes the feature of joint $v$ in the skeleton numbered $m$ at frame $t$.
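A corresponding sketch of the update step for a single joint is given below; the scalar observation variance sigma_z2 is an illustrative placeholder.

```python
import numpy as np

def kalman_update(x_prior, P_prior, c_obs, sigma_z2=1e-3):
    """Fuse the decided observation c_obs = (x, y) with the prior estimate."""
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=np.float64)
    R = np.diag([sigma_z2, sigma_z2])              # observation noise covariance R
    S = H @ P_prior @ H.T + R                      # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)           # Kalman gain K_{t,v,m}
    x_post = x_prior + K @ (c_obs - H @ x_prior)   # posterior state estimate
    P_post = (np.eye(4) - K @ H) @ P_prior         # posterior estimate covariance
    c_out = H @ x_post                             # feature written back to C_{t,v,m}
    return x_post, P_post, c_out
```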
S4, repeating S3 for each video frame until all frames are processed; each numbered skeleton sequence in the processing queue is then a constructed human skeleton sequence.
This example was experimentally verified with Python 3.8 on the Windows operating system using the Kinetics-700 data set. Representative videos exhibiting detection errors, missed detections and multi-person scenes were selected as material for constructing skeleton sequences. The common parameters used in the verification are listed in Table 1.
TABLE 1 Common parameters
The videos were input into the OpenPose human pose estimation method; the output was parsed and normalized, and then input into the Kalman filtering module for constructing human skeleton sequences to obtain the constructed skeleton sequences.
Fig. 3 compares the pose estimation result and the neck joint of the constructed skeleton sequence when the pose estimation skips a frame. Fig. 3(a) is drawn from the neck joint feature in the OpenPose output at frame 2 of the data, where no human skeleton is detected; Fig. 3(b) compares the normalized pose estimation result (asterisks) and the constructed skeleton sequence (dotted line) on the x axis over the frames, and Fig. 3(c) is the corresponding comparison on the y axis. At frame 2 the scene is complex and OpenPose detects no human skeleton, i.e. the x and y feature coordinates are 0; the output of the Kalman filtering module filters out this outlier and replaces it with the prior estimate to update the skeleton sequence.
Fig. 4 compares the pose estimation result and the neck joint of the constructed skeleton sequence when the pose estimation produces a false detection. Fig. 4(a) is drawn from the neck joint feature in the OpenPose output at frame 4 of the data; Fig. 4(b) compares the normalized pose estimation result (asterisks) and the constructed skeleton sequence (dotted line) on the x axis over the frames, and Fig. 4(c) is the corresponding comparison on the y axis. At frames 4 and 5, background interference causes OpenPose to detect incorrectly and the x and y coordinates show large offsets; after the Kalman filtering module, the x and y values are corrected.
Fig. 5 compares the skeleton sequences and the pose estimation result when the skeleton numbers produced by pose estimation are repeatedly swapped. Fig. 5(a) is drawn from the OpenPose output. Fig. 5(b) shows the neck joint features of the two skeletons in the OpenPose output, with the normalized x feature on the horizontal axis and the normalized y feature on the vertical axis; asterisks mark the neck joint features $Z_{f,t,1,0}$ output with skeleton number 0, the features $Z_{f,t,1,1}$ output with skeleton number 1 are marked with a different symbol, and the solid and dotted lines respectively show the neck joint features $C_{t,1,0}$ and $C_{t,1,1}$ of the skeleton sequences numbered 0 and 1 after construction, where the subscript $v = 1$ denotes the neck joint. Figs. 5(c) and 5(d) compare the construction results of the x feature and the y feature of Fig. 5(b), respectively, against the frame number.
Fig. 5(b) shows that the two detected human skeletons are distributed in the upper-left and lower-right regions, but OpenPose processes each frame independently, so the detection results of different frames are unrelated, the skeleton numbers are not fixed, and the numbers of the upper-left and lower-right regions are frequently swapped. The results show that, after the Kalman filtering module for constructing skeleton sequences, the feature changes of the two skeletons within and between frames are effectively tracked.
Fig. 6 compares the data before and after constructing the skeleton sequences when the trajectories in a simulated multi-person scene partially overlap. A program generates two paths, from (0.35, 0.47) to (0.39, 0.32) and from (0.5, 0.37) to (0.31, 0.46), samples them, and adds Gaussian noise following N(0, 0.0001); the discrete points obtained on the paths simulate joint coordinates, the skeleton number is randomly 0 or 1, and these points simulate the joint motion. Fig. 6(a) shows the generated discrete point coordinates; Figs. 6(b) and 6(c) compare the construction results of the x feature and the y feature of Fig. 6(a), respectively, against the frame number.
It can be seen that, in a multi-person scene, as long as the correct joint features can be detected, the method still captures the inter-frame joint information and distinguishes the different skeletons even when the joints are close together and the motion trajectories overlap.
All constructed skeleton sequences and the original data were input into STGCN using the official model weights st_gcn.kinetics.pt; the inference results of some of the videos on STGCN are shown in Table 2.
Sequence number: the video number; all videos carry the abseiling label in Kinetics-700. The camera of video 1 is stable and the person's behavior is clear. Video 2 is the video shown in Fig. 3; its pose estimation result is unstable, with missed and false detections. Video 3 is the video shown in Fig. 5 and suffers from swapping of the skeleton numbers of multiple people. 1', 2', 3' denote the STGCN inference results for the skeleton sequences constructed from videos 1, 2, 3 by the method of the invention:
Frame number: the total number of recognized video frames;
Output frame number: the number of frames of the STGCN output features, which the multi-layer TCN convolutions compress relative to the original frame number T;
abseiling: the number of output frames recognized as the abseiling behavior;
water skiing: the number of output frames recognized as the water skiing behavior, which is easily confused with abseiling;
Number of other actions: the number of other behaviors recognized among the output frames;
Other behavior frame number: the number of output frames recognized as other behaviors;
Voting result: the behavior recognized for the largest number of output frames is output as the voting result, i.e. the main behavior in the video (see the short sketch below).
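The voting rule described above is a per-frame majority vote over the STGCN output labels; a minimal sketch, with hypothetical label strings, is:

```python
from collections import Counter

def vote(frame_labels):
    """Return the behavior predicted for the largest number of output frames."""
    return Counter(frame_labels).most_common(1)[0][0]

# e.g. vote(["abseiling"] * 40 + ["water skiing"] * 9) == "abseiling"
```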
TABLE 2 comparison of inference results of partial videos on STGCN
When the background is ignored, the features of abseiling and water skiing are similar, so recognition results of either class can be considered correct in this case. The comparison between the original data and the recognition results of the human skeleton sequences constructed by the invention, derived from Table 2, is shown in Fig. 7; in each pair of adjacent bars, the left bar is the data before processing and the right bar the data after processing. The comparison shows that the skeleton sequences constructed by the method improve the recognition accuracy to varying degrees compared with recognizing the OpenPose pose estimation results directly. Meanwhile, the constructed skeleton sequences reduce the adverse effect of false and missed recognitions of the pose estimation method on the recognition result, i.e. the number of falsely recognized behaviors decreases. For STGCN, which must extract temporal information, the problem that the skeleton numbers output by the pose estimation method are random is effectively solved.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to these specifically described embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

Claims (5)

1. A human body skeleton sequence construction method based on Kalman filtering, characterized by comprising the following steps:
S1, performing pose estimation frame by frame on a video containing human behavior information, and normalizing all joint point features to obtain a skeleton set carrying the joint feature information;
S2, numbering all skeletons of the first video frame whose features are not all invalid values, and adding them to a processing queue;
S3, inputting all numbered skeleton sequences in the processing queue, together with the skeleton set, into a Kalman filtering module for constructing human skeleton sequences, and updating the processing queue frame by frame according to the processing result;
S4, repeating S3 for each video frame until all frames are processed; each numbered skeleton sequence in the processing queue is then a constructed human skeleton sequence.
2. The method for constructing the human body skeleton sequence based on the kalman filter according to claim 1, wherein in the step S3, the kalman filter module for constructing the human body skeleton sequence includes a prediction module, a decision module and an update module connected in series.
3. The Kalman filtering-based human body skeleton sequence construction method according to claim 2, characterized in that the prediction module is the prediction part of the Kalman filtering algorithm, its input is all numbered skeletons in the processing queue, and its output is:

$$\hat{x}^{-}_{t,v,m} = A\,\hat{x}_{t-1,v,m}$$

$$P^{-}_{t,v,m} = A\,P_{t-1,v,m}\,A^{T} + Q$$

where $\hat{x}^{-}_{t,v,m}$ is the prior state estimate at frame $t$ predicted by the motion model from the posterior state estimate $\hat{x}_{t-1,v,m}$ at frame $t-1$, with $t = 1, \dots, T$ and $T$ the total number of frames; $A$ is the state transition matrix; $P^{-}_{t,v,m}$ is the prior estimate covariance at frame $t$ computed from the posterior estimate covariance $P_{t-1,v,m}$ at frame $t-1$; $Q$ is the process covariance matrix; and the superscript $T$ denotes transposition;
the posterior state estimate $\hat{x}_{t,v,m}$ is defined as:

$$\hat{x}_{t,v,m} = \begin{bmatrix} x & y & v_x & v_y \end{bmatrix}^{T}$$

when $t = 0$, $x$ and $y$ are initialized to $Z_{x,t,v,n}$ and $Z_{y,t,v,n}$, the normalized features of joint $v$ of the skeleton numbered $n$ at frame $t$ ($n$ is the skeleton number, and $n = m$ at initialization), and $v_x = v_y = 0$; when $t \neq 0$, $\hat{x}_{t,v,m}$ is the iteration result of the Kalman filtering module, i.e. the posterior state estimate of joint $v$ of the skeleton numbered $m$ at frame $t$;
when $t = 0$, $P_{t,v,m}$ is defined as the identity matrix; when $t \neq 0$, $P_{t,v,m}$ is the iteration result of the Kalman filtering module, i.e. the posterior estimate covariance of joint $v$ of the skeleton numbered $m$ at frame $t$;
the state transition matrix $A$ is defined as:

$$A = \begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

where $dt$ is the video frame time interval;
the process covariance matrix $Q$ is defined as:

$$Q = \begin{bmatrix} \sigma_p^{2} & 0 & 0 & 0 \\ 0 & \sigma_p^{2} & 0 & 0 \\ 0 & 0 & \sigma_v^{2} & 0 \\ 0 & 0 & 0 & \sigma_v^{2} \end{bmatrix}$$

where $\sigma_p^{2}$ is the coordinate variance of the human joints and $\sigma_v^{2}$ is the coordinate velocity variance parameter.
4. The Kalman filtering-based human body skeleton sequence construction method according to claim 2, characterized in that the decision module comprises the following steps:
S31, calculating the matching degree between the processing queue and all skeletons detected in the current frame to obtain the Mahalanobis distance matrix $D_t$; specifically, a candidate set is computed for each skeleton in the processing queue, whose elements are the current-frame skeletons that match it successfully, and, using the prediction module outputs $\hat{x}^{-}_{t,v,m}$ and $P^{-}_{t,v,m}$, the Mahalanobis distance $D_{t,v,nm}$ between the feature $Z_{t,v,n}$ of each joint $v$ of every skeleton $n$ detected in the current frame $t$ and the prior estimate $\hat{x}^{-}_{t,v,m}$ of each skeleton $m$ in the processing queue is computed as:

$$D_{t,v,nm} = \sqrt{\left(Z_{t,v,n} - H\hat{x}^{-}_{t,v,m}\right)^{T}\left(H P^{-}_{t,v,m} H^{T}\right)^{-1}\left(Z_{t,v,n} - H\hat{x}^{-}_{t,v,m}\right)}$$

the Mahalanobis distance matrix used to measure the matching degree is:

$$D_t = \left[D_{t,nm}\right]_{N \times M}, \qquad D_{t,nm} = \min_{v \in V} D_{t,v,nm}$$

where $H$ is the observation matrix:

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

S32, obtaining the optimal matching of the matrix $D_t$ with the Hungarian algorithm, using a matching threshold $\alpha$; if $D_{t,nm} \le \alpha$, the match succeeds and the current-frame observation $C^{-}_{t,v,m}$ is set to the joint feature $Z_{t,v,n}$; if $D_{t,nm} > \alpha$, the match fails and the observation is set to the predicted value:

$$C^{-}_{t,v,m} = H\,\hat{x}^{-}_{t,v,m}$$

S33, skeletons of the current frame that remain unmatched in S32 because $D_{t,nm} > \alpha$ or $N > M$ are assigned new numbers and added to the processing queue.
5. The Kalman filtering-based human body skeleton sequence construction method according to claim 2, characterized in that the update module specifically calculates the Kalman gain $K_{t,v,m}$, the posterior estimate of the joint point feature $\hat{x}_{t,v,m}$ and the posterior estimate covariance $P_{t,v,m}$:

$$K_{t,v,m} = P^{-}_{t,v,m} H^{T}\left(H P^{-}_{t,v,m} H^{T} + R\right)^{-1}$$

$$\hat{x}_{t,v,m} = \hat{x}^{-}_{t,v,m} + K_{t,v,m}\left(C^{-}_{t,v,m} - H\hat{x}^{-}_{t,v,m}\right)$$

$$P_{t,v,m} = \left(I - K_{t,v,m} H\right) P^{-}_{t,v,m}$$

where $\hat{x}^{-}_{t,v,m}$, $P^{-}_{t,v,m}$ and $C^{-}_{t,v,m}$ are the outputs of the prediction and decision modules, and $R$ is the observation noise covariance matrix:

$$R = \begin{bmatrix} \sigma_z^{2} & 0 \\ 0 & \sigma_z^{2} \end{bmatrix}$$

where $\sigma_z^{2}$ is the pose estimation variance;
finally, $C_{t,v,m}$ is updated as:

$$C_{t,v,m} = H\,\hat{x}_{t,v,m}$$

where $C_{t,v,m}$, an element of the set $C_{v,m}$, denotes the feature of joint $v$ in the skeleton numbered $m$ at frame $t$.
CN202210788077.4A 2022-07-06 2022-07-06 Human skeleton sequence construction method based on Kalman filtering Active CN115050055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210788077.4A CN115050055B (en) 2022-07-06 2022-07-06 Human skeleton sequence construction method based on Kalman filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210788077.4A CN115050055B (en) 2022-07-06 2022-07-06 Human skeleton sequence construction method based on Kalman filtering

Publications (2)

Publication Number Publication Date
CN115050055A true CN115050055A (en) 2022-09-13
CN115050055B CN115050055B (en) 2024-04-30

Family

ID=83164491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210788077.4A Active CN115050055B (en) 2022-07-06 2022-07-06 Human skeleton sequence construction method based on Kalman filtering

Country Status (1)

Country Link
CN (1) CN115050055B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522793A (en) * 2018-10-10 2019-03-26 华南理工大学 More people's unusual checkings and recognition methods based on machine vision
CN110458944A (en) * 2019-08-08 2019-11-15 西安工业大学 A kind of human skeleton method for reconstructing based on the fusion of double-visual angle Kinect artis
CN110530365A (en) * 2019-08-05 2019-12-03 浙江工业大学 A kind of estimation method of human posture based on adaptive Kalman filter
CN111932580A (en) * 2020-07-03 2020-11-13 江苏大学 Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm
US20210000404A1 (en) * 2019-07-05 2021-01-07 The Penn State Research Foundation Systems and methods for automated recognition of bodily expression of emotion
CN112633205A (en) * 2020-12-28 2021-04-09 北京眼神智能科技有限公司 Pedestrian tracking method and device based on head and shoulder detection, electronic equipment and storage medium
CN114038056A (en) * 2021-10-29 2022-02-11 同济大学 Skip and squat type ticket evasion behavior identification method
CN114609912A (en) * 2022-03-18 2022-06-10 电子科技大学 Angle-only target tracking method based on pseudo-linear maximum correlation entropy Kalman filtering

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522793A (en) * 2018-10-10 2019-03-26 华南理工大学 More people's unusual checkings and recognition methods based on machine vision
US20210000404A1 (en) * 2019-07-05 2021-01-07 The Penn State Research Foundation Systems and methods for automated recognition of bodily expression of emotion
CN110530365A (en) * 2019-08-05 2019-12-03 浙江工业大学 A kind of estimation method of human posture based on adaptive Kalman filter
CN110458944A (en) * 2019-08-08 2019-11-15 西安工业大学 A kind of human skeleton method for reconstructing based on the fusion of double-visual angle Kinect artis
CN111932580A (en) * 2020-07-03 2020-11-13 江苏大学 Road 3D vehicle tracking method and system based on Kalman filtering and Hungary algorithm
CN112633205A (en) * 2020-12-28 2021-04-09 北京眼神智能科技有限公司 Pedestrian tracking method and device based on head and shoulder detection, electronic equipment and storage medium
CN114038056A (en) * 2021-10-29 2022-02-11 同济大学 Skip and squat type ticket evasion behavior identification method
CN114609912A (en) * 2022-03-18 2022-06-10 电子科技大学 Angle-only target tracking method based on pseudo-linear maximum correlation entropy Kalman filtering

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SUNGPHILL MOON et al.: "Multiple kinect sensor fusion for human skeleton tracking using Kalman filtering", SAGE Journals, 15 May 2017 (2017-05-15) *
刁宏健: "Research on fight behavior recognition technology based on deep learning", Wanfang Data, 2 October 2023 (2023-10-02) *
李扬: "Moving target tracking algorithm based on video sequences", Electronic Science and Technology, no. 08, 15 August 2012 (2012-08-15) *

Also Published As

Publication number Publication date
CN115050055B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN109460702B (en) Passenger abnormal behavior identification method based on human body skeleton sequence
Du et al. Hierarchical recurrent neural network for skeleton based action recognition
CN109871750A (en) A kind of gait recognition method based on skeleton drawing sequence variation joint repair
Dreuw et al. Tracking using dynamic programming for appearance-based sign language recognition
CN107833239B (en) Optimization matching target tracking method based on weighting model constraint
JP2008257425A (en) Face recognition device, face recognition method and computer program
CN114187665B (en) Multi-person gait recognition method based on human skeleton heat map
CN112131908A (en) Action identification method and device based on double-flow network, storage medium and equipment
Abdelkader et al. Integrated motion detection and tracking for visual surveillance
CN114582030A (en) Behavior recognition method based on service robot
CN110969078A (en) Abnormal behavior identification method based on human body key points
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN112966628A (en) Visual angle self-adaptive multi-target tumble detection method based on graph convolution neural network
Martinez-Contreras et al. Recognizing human actions using silhouette-based HMM
CN112200020A (en) Pedestrian re-identification method and device, electronic equipment and readable storage medium
Kishore et al. Selfie sign language recognition with convolutional neural networks
CN112926522A (en) Behavior identification method based on skeleton attitude and space-time diagram convolutional network
Ali et al. Deep Learning Algorithms for Human Fighting Action Recognition.
CN114694261A (en) Video three-dimensional human body posture estimation method and system based on multi-level supervision graph convolution
Parisi et al. Human action recognition with hierarchical growing neural gas learning
CN114332157A (en) Long-term tracking method controlled by double thresholds
Mattheus et al. A review of motion segmentation: Approaches and major challenges
CN113378799A (en) Behavior recognition method and system based on target detection and attitude detection framework
CN110348395B (en) Skeleton behavior identification method based on space-time relationship
CN115050055A (en) Human body skeleton sequence construction method based on Kalman filtering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant