CN112686153A - Three-dimensional skeleton key frame selection method for human behavior recognition - Google Patents

Three-dimensional skeleton key frame selection method for human behavior recognition

Info

Publication number
CN112686153A
CN112686153A (application number CN202011608049.7A)
Authority
CN
China
Prior art keywords
frame
frames
key
key frame
sequence
Prior art date
Legal status
Granted
Application number
CN202011608049.7A
Other languages
Chinese (zh)
Other versions
CN112686153B (en)
Inventor
Chen Hao (陈皓)
Pan Yuekai (潘跃凯)
Zhang Kailun (张凯伦)
Current Assignee
Xi'an University of Posts and Telecommunications
Original Assignee
Xi'an University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Xi'an University of Posts and Telecommunications
Priority to CN202011608049.7A
Publication of CN112686153A
Application granted
Publication of CN112686153B
Legal status: Active


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a key frame selection method for behavior recognition based on the three-dimensional human skeleton, belonging to the field of computer vision and pattern recognition. The method comprises the following steps: first, a stream of three-dimensional skeleton joint data is acquired from video images by a depth sensor or a pose estimation algorithm; second, pose features are extracted and inflection-point frames in the sequence are determined from the momentum changes of each body part's motion; then, the pose feature vectors are fed into a key frame selection model that fuses domain information with the number of key frames to obtain a key frame sequence. The model uses binary coding, takes the inflection-point frames as population initialization markers, and optimizes the key frame coding with a multi-objective binary differential evolution algorithm. The extracted key frame sequence summarizes the motion more effectively, and the number of key frames adapts to the complexity of the behavior, so the optimized key frame sequence achieves higher accuracy in human behavior recognition.

Description

Three-dimensional skeleton key frame selection method for human behavior recognition
Technical Field
The invention belongs to the field of computer vision and pattern recognition, and particularly relates to a key frame selection method for human behavior recognition based on three-dimensional skeleton characteristics.
Background
Human behavior analysis studies the behaviors of people in video: behavior signals of a target human body are collected, and the behavior categories are classified and recognized. In recent years, human behavior recognition has gradually become a hot research topic in computer vision, with broad application prospects and potential economic value in public security, human-computer interaction, sports, medical care and other fields.
Current research on human behavior recognition can be divided, according to the data used, into behavior recognition based on skeletal joint point features and behavior recognition based on non-skeletal features, the latter relying mainly on traditional image data. With the development and popularization of depth sensors, acquiring high-precision three-dimensional skeleton joint point information has become simple and convenient. At the same time, the skeletal pose has inherent advantages in describing behavior: it accurately captures body posture and motion state and is unaffected by factors such as background and illumination. Most behavior recognition methods based on three-dimensional skeleton joint point features process the whole behavior sequence, yet not all frames in a sequence are meaningful for recognition; effective key frame selection therefore reduces data redundancy and computational complexity while expressing behavior features more effectively. Clustering is a common key frame selection method with strong generalization ability for motion description, but the clustering process ignores the specific motion significance of the data: frames far apart in the clustering space may be grouped into the same class, the temporal order of the motion is ignored (which easily distorts motion analysis), the number of clusters must be specified manually, and automatic processing is therefore difficult.
Disclosure of Invention
To address the problems in existing key frame extraction for behavior recognition, the invention provides a key frame selection method for behavior recognition on three-dimensional skeleton joint point features. The key technical problems to be solved include: extracting pose features; establishing and solving the optimization problem; and generating the decoded key frame sequence and performing behavior classification and recognition. To achieve this purpose, the specific technical scheme of the invention is as follows:
A method for key frame selection and optimization for human three-dimensional skeleton behavior recognition, comprising the following steps:
Step 1: read the three-dimensional skeleton joint point data, specifically as follows:
acquire three-dimensional skeleton joint data from video images by a depth sensor or a pose estimation algorithm, and read a behavior sequence with the following structure: a behavior sequence contains T frames and each frame provides the position coordinates of N joint points; the joint position matrix of the behavior sequence can then be represented as:
P = [ p_11  p_12  ...  p_1N
      p_21  p_22  ...  p_2N
      ...
      p_T1  p_T2  ...  p_TN ]
where p_tn denotes the n-th joint point in the t-th frame, p_tn = (x_tn, y_tn, z_tn), t ∈ {1, 2, ..., T}, n ∈ {1, 2, ..., N}, and x_tn, y_tn, z_tn respectively denote the x-, y- and z-axis coordinates of the joint point.
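The data layout above can be sketched as a numeric array (a hypothetical illustration; the patent does not prescribe a storage format, and the random stand-in data is an assumption):

```python
import numpy as np

# A behavior sequence stored as a T x N x 3 array, so that P[t - 1, n - 1]
# plays the role of p_tn = (x_tn, y_tn, z_tn).  The sizes follow the
# embodiment (42 frames, 20 joints); the values are random stand-ins
# for real depth-sensor output.
T, N = 42, 20
rng = np.random.default_rng(0)
P = rng.normal(size=(T, N, 3))

assert P.shape == (T, N, 3)
```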
Step 2: extract the pose features by calculating the normalized feature vector of each frame, specifically as follows:
d_tn = (p_tn - p_t0) / ||p_t3 - p_t0||
where p_t0 is the abdomen-center joint taken as the central reference point, p_t3 is the neck joint point, and ||·|| denotes the Euclidean distance. For each skeleton joint point the vector d_tn relative to the central reference point is calculated and normalized by the neck-to-abdomen-center distance to obtain the pose feature; the pose feature vector of the t-th frame is expressed as f_t = [d_t1, d_t2, ..., d_tN].
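Step 2 can be sketched as follows (a minimal illustration: the assumption that joint index 0 is the abdomen center and index 3 is the neck mirrors the notation p_t0 and p_t3, and the input array is random stand-in data):

```python
import numpy as np

def pose_features(P, center=0, neck=3):
    # d_tn = (p_tn - p_t0) / ||p_t3 - p_t0||: each joint's vector relative
    # to the abdomen-center joint, normalized by the neck-to-abdomen distance.
    ref = P[:, center:center + 1, :]                                # p_t0
    scale = np.linalg.norm(P[:, neck, :] - P[:, center, :], axis=-1)
    d = (P - ref) / scale[:, None, None]
    return d.reshape(P.shape[0], -1)                                # f_t

P = np.random.default_rng(1).normal(size=(42, 20, 3))
F = pose_features(P)
assert F.shape == (42, 60)   # 20 joints x 3 coordinates, as in the embodiment
```

The 60-dimensional result per frame matches the feature length used in the embodiment below.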
Step 3: determine the inflection-point frames. Frames at which the motion trajectories of important joints reach extreme values are defined as inflection-point frames. Because the joint points of the human body have a certain motion linkage relation, and terminal joints such as the left and right hands and feet carry more motion information, the inflection-point frames are extracted from the motion trajectory curves of the important joints. The specific solution is as follows: let the motion trajectory of one of the joints be S = {p_1n, p_2n, ..., p_Tn}, and map S to a two-dimensional space S → f(t, m), where f(t, m) gives for frame t the momentum m, i.e. the displacement distance of the joint relative to its initial position; the frames at the local extreme points of f(t, m) are taken as inflection-point frames, and their sequence numbers are recorded.
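The inflection-point detection of step 3 can be sketched as follows (an illustrative implementation; the synthetic sinusoidal trajectory and the strict-inequality extremum test are assumptions):

```python
import numpy as np

def inflection_frames(traj):
    # f(t, m): momentum m = displacement distance of the joint from its
    # initial position; return the interior frames where m is a local extremum.
    m = np.linalg.norm(traj - traj[0], axis=-1)
    prev, mid, nxt = m[:-2], m[1:-1], m[2:]
    extremum = ((mid > prev) & (mid > nxt)) | ((mid < prev) & (mid < nxt))
    return np.where(extremum)[0] + 1

# A synthetic oscillating hand trajectory over 42 frames.
t = np.arange(42, dtype=float)
traj = np.stack([np.sin(t / 5.0), np.zeros(42), np.zeros(42)], axis=-1)
print(inflection_frames(traj))   # -> [ 8 16 24 31 39]
```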
Step 4: construct the problem space for key frame selection. The adjacent frames before and after each key frame are defined as belonging to the same domain, and the intra-domain information of the key frames is defined as:
[equation image: intra-domain information of the key frames]
where f_i^r denotes the pose feature vector of the i-th frame in the r-th domain, the key frame vector in the r-th domain is denoted f_0^r, and dis(·,·) represents the information between two frames:

dis(f_i^r, f_0^r) = w · (||d_i1 - d_01||, ||d_i2 - d_02||, ..., ||d_iN - d_0N||)
i.e. the dot product of the weight vector w with the per-joint Euclidean distances between the two frames; w is a column vector representing the weight of the joint features of each body part, obtained from the ratio of the movement amounts of the important joint points of each body part in step 3. The inter-domain information of the key frames is defined as:
[equation image: inter-domain information of the key frames]
where dis(f_0^r, f_0^j) denotes the information between the key frames in the r-th and j-th domains, and u_rj is a weight coefficient between the two domains related to the size of the domain interval: the information difference between key frames of adjacent domains is relatively small, while the information difference between key frames of widely separated domains is relatively large. Integrating the intra-domain and inter-domain information fully retains the temporal variation characteristics of the pose, and the domain information objective function for evaluating key frame quality is defined as:
[equation image: domain information objective function DI]
another objective function that measures the number of key frames is defined as the frame compression ratio:
FC = frames_key / frames_total
where frames_key is the number of selected key frames and frames_total is the total number of frames contained in the behavior sequence. The final objective function is:
min{DI,FC}
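The two objectives can be sketched as follows (the dis(·) implementation follows the verbal description above, a dot product of w with per-joint Euclidean distances; the array shapes and example values are assumptions, and the exact DI formula appears only as an image in the original):

```python
import numpy as np

def dis(fi, fj, w):
    # dis(f_i, f_j): dot product of the body-part weight vector w with the
    # per-joint Euclidean distances between the two frames' features.
    di, dj = fi.reshape(-1, 3), fj.reshape(-1, 3)
    return float(w @ np.linalg.norm(di - dj, axis=-1))

def frame_compression_ratio(code):
    # FC = frames_key / frames_total for a 0-1 key frame chromosome.
    code = np.asarray(code)
    return code.sum() / code.size

code = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(frame_compression_ratio(code))   # 3 key frames out of 10 -> 0.3
```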
Step 5: key frame selection seeks an optimal subset of the three-dimensional skeleton feature frames, and the process is essentially an optimal search. Key frame selection is therefore converted into a multi-objective optimization problem in a binary coding space, and an improved multi-objective differential evolution algorithm is adopted to solve the model, specifically as follows:
step 5.1: initialize the parameters: set the current generation G = 0, the maximum number of generations Gmax, the population size NP, the crossover probability CR, and the generation probability PG;
step 5.2: binary coding is adopted for the key frame extraction problem: a 0-1 variable represents the state of each frame in the sequence, 0 meaning the frame is not a key frame and 1 meaning it is; the initial frame, the end frame and the inflection-point frames are fixed as key frames, the remaining chromosome positions are initialized randomly to generate the initial population, and the fitness of each individual is calculated according to the objective functions;
step 5.3: set the evolution generation G = G + 1;
step 5.4: perform non-dominated sorting of the individuals according to their fitness values, and divide the population into three sub-populations Pop1, Pop2 and Pop3;
step 5.5: set the population individual index i = 0;
step 5.6: select individuals r1, r2 and r3 from the three sub-populations respectively;
step 5.7: generate the mutation vector V_i,G:

V_i,G = X_r1,G ∨ (F ∧ (X_r2,G ⊕ X_r3,G))

where ⊕ denotes the XOR operation, ∧ the AND operation and ∨ the OR operation, and X_r1,G, X_r2,G and X_r3,G are parents taken from Pop1, Pop2 and Pop3, respectively. To fully retain the prior knowledge of the inflection-point frames, when generating the random vector F the chromosome positions of the inflection-point frames are fixed at 1, while the values of the other positions are determined by a random number and the generation probability PG;
step 5.8: perform the crossover operation according to the crossover operator:

U_i,G(j) = V_i,G(j) if rand_j ≤ CR or j = j_rand; otherwise U_i,G(j) = X_i,G(j)

where j is the gene index, j_rand is a random index, and rand_j ∈ [0, 1] is a randomly generated number, so that the crossover occurs with probability CR;
step 5.9: perform the selection operation according to the selection operator:

X_i,G+1 = U_i,G if f(U_i,G) ≤ f(X_i,G); otherwise X_i,G+1 = X_i,G

where f(·) is the objective function;
step 5.10: set i = i + 1; if i < NP, go to step 5.6; otherwise go to step 5.11;
step 5.11: if G = Gmax, the algorithm ends and the key frame set is output; otherwise go to step 5.3.
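One generation of the binary mutation and crossover in steps 5.7-5.8 can be sketched as follows (an illustrative reading of the operators, V = X_r1 OR (F AND (X_r2 XOR X_r3)) with binomial crossover; the parameter values, chromosome length and random-vector details are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

def mutate(x1, x2, x3, pg, inflection_pos):
    # V = x1 | (F & (x2 ^ x3)); the random vector F is 1 with generation
    # probability pg, and fixed to 1 at the inflection-frame positions.
    F = (rng.random(x1.size) < pg).astype(int)
    F[inflection_pos] = 1
    return x1 | (F & (x2 ^ x3))

def crossover(v, x, cr):
    # Binomial crossover: take the mutant gene where rand_j <= CR,
    # and always at one randomly chosen index j_rand.
    mask = rng.random(v.size) <= cr
    mask[rng.integers(v.size)] = True
    return np.where(mask, v, x)

n = 12
x1, x2, x3 = (rng.integers(0, 2, n) for _ in range(3))  # parents from Pop1-3
v = mutate(x1, x2, x3, pg=0.3, inflection_pos=[0, 5, 11])
u = crossover(v, x1, cr=0.9)
assert u.shape == (n,) and set(u.tolist()) <= {0, 1}
```

Selection (step 5.9) would then keep u or x1 according to the objective values.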
Step 6: behavior classification based on the key frames. Decode the binary code obtained in step 5 to obtain the key frame sequence, input the three-dimensional skeleton features corresponding to the sequence into a human behavior classifier, and output the behavior classification result.
Drawings
Fig. 1 is a flowchart of the key frame extraction method for three-dimensional skeleton behavior recognition in an embodiment of the present invention;
FIG. 2 is the human skeletal joint point model in an embodiment of the present invention;
FIG. 3 is the momentum curve used to determine the inflection-point frames in an embodiment of the present invention;
FIG. 4 shows the key frames extracted in an embodiment of the present invention;
FIG. 5 is the confusion matrix of the recognition results in an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The technical solution and implementation process of the invention are illustrated by taking the action "high throw" as an example, which should not be taken as limiting the protection scope of the invention. As shown in fig. 1, the key frame extraction method for three-dimensional skeleton behavior recognition is implemented as follows:
Step 1: read the three-dimensional skeleton joint point data. As shown in fig. 2, the behavior sequence contains 42 frames and each frame contains 20 joint points, so the joint position matrix of the whole behavior sequence is the 42 × 20 matrix P = [p_tn], t ∈ {1, 2, ..., 42}, n ∈ {1, 2, ..., 20}.
Step 2: extract the pose features by calculating the normalized feature vector of each frame; for example, the feature of the 1st joint in frame 1 is:
d_1,1 = {-0.3924, 0.8008, -0.1231}
Then the 60-dimensional pose feature vector of frame 1 is:
f_1 = [-0.3924, 0.8008, -0.1231, ..., -0.0242, 1.4841, -0.4189]
Step 3: determine the inflection-point frames. In this action sequence, the momentum of the joint's displacement distance relative to the initial position is as shown in fig. 3; local extrema are obtained at points 1, 8, 15, 27, 32 and 37 in the figure, so the inflection-point frame numbers are {1, 8, 15, 27, 32, 37}.
Step 4: construct the multi-objective key frame extraction model fusing the domain information and the number of key frames.
Step 5: solve the key frame extraction model with the multi-objective differential evolution algorithm based on binary coding. After the optimization calculation, the optimal binary code of the behavior sequence is [0100000010001001000100110001000010010000000].
Step 6: behavior classification based on the key frames. The key frame sequence obtained by decoding the optimal binary code from step 5 is {0, 8, 12, 15, 19, 22, 23, 27, 32, 35}, visualized in fig. 4. The features of the extracted key frame sequence are used as input to a support vector machine classifier, which outputs the behavior class "high throw". In experiments on the MSR-Action3D dataset, the confusion matrix of the recognition results is as shown in fig. 5, with an average recognition accuracy of about 92.88%.
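Decoding the optimal binary code into key frame indices can be sketched as follows (0-based enumeration is an assumption; the embodiment's decoded sequence begins at frame 0, so the patent may number frames slightly differently):

```python
# Every chromosome position whose gene is 1 is kept as a key frame.
code = "0100000010001001000100110001000010010000000"
key_frames = [i for i, bit in enumerate(code) if bit == "1"]
print(key_frames)   # -> [1, 8, 12, 15, 19, 22, 23, 27, 32, 35]
```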

Claims (1)

1. A method for key frame selection and optimization for human three-dimensional skeleton behavior recognition, comprising the following steps:
step 1: reading three-dimensional skeleton joint point data, specifically:
acquire three-dimensional skeleton joint data from video images by a depth sensor or a pose estimation algorithm, and read a behavior sequence with the following structure: a behavior sequence contains T frames and each frame provides the position coordinates of N joint points; the joint position matrix of the behavior sequence can then be represented as:
P = [ p_11  p_12  ...  p_1N
      p_21  p_22  ...  p_2N
      ...
      p_T1  p_T2  ...  p_TN ]
where p_tn denotes the n-th joint point in the t-th frame, p_tn = (x_tn, y_tn, z_tn), t ∈ {1, 2, ..., T}, n ∈ {1, 2, ..., N}, and x_tn, y_tn, z_tn respectively denote the x-, y- and z-axis coordinates of the joint point;
step 2: extract the pose features by calculating the normalized feature vector of each frame, specifically:
d_tn = (p_tn - p_t0) / ||p_t3 - p_t0||
where p_t0 is the abdomen-center joint taken as the central reference point, p_t3 is the neck joint point, and ||·|| denotes the Euclidean distance; for each skeleton joint point the vector d_tn relative to the central reference point is calculated and normalized by the neck-to-abdomen-center distance to obtain the pose feature, and the pose feature vector of the t-th frame is expressed as f_t = [d_t1, d_t2, ..., d_tN];
step 3: determine the inflection-point frames; frames at which the motion trajectories of important joints reach extreme values are defined as inflection-point frames; because the joint points of the human body have a certain motion linkage relation and terminal joints such as the left and right hands and feet carry more motion information, the inflection-point frames are extracted from the motion trajectory curves of the important joints, specifically as follows: let the motion trajectory of one of the joints be S = {p_1n, p_2n, ..., p_Tn}, and map S to a two-dimensional space S → f(t, m), where f(t, m) gives for frame t the momentum m, i.e. the displacement distance of the joint relative to its initial position; the frames at the local extreme points of f(t, m) are taken as inflection-point frames, and their sequence numbers are recorded;
step 4: construct the problem space for key frame selection; the adjacent frames before and after each key frame are defined as belonging to the same domain, and the intra-domain information of the key frames is defined as:
[equation image: intra-domain information of the key frames]
where f_i^r denotes the pose feature vector of the i-th frame in the r-th domain, the key frame vector in the r-th domain is denoted f_0^r, and dis(·,·) represents the information between two frames:

dis(f_i^r, f_0^r) = w · (||d_i1 - d_01||, ||d_i2 - d_02||, ..., ||d_iN - d_0N||)
i.e. the dot product of the weight vector w with the per-joint Euclidean distances between the two frames; w is a column vector representing the weight of the joint features of each body part, obtained from the ratio of the movement amounts of the important joint points of each body part in step 3; the inter-domain information of the key frames is defined as:
[equation image: inter-domain information of the key frames]
where dis(f_0^r, f_0^j) denotes the information between the key frames in the r-th and j-th domains, and u_rj is a weight coefficient between the two domains related to the size of the domain interval: the information difference between key frames of adjacent domains is relatively small, while the information difference between key frames of widely separated domains is relatively large; integrating the intra-domain and inter-domain information fully retains the temporal variation characteristics of the pose, and the domain information objective function for evaluating key frame quality is defined as:
[equation image: domain information objective function DI]
another objective function that measures the number of key frames is defined as the frame compression ratio:
FC = frames_key / frames_total
where frames_key is the number of selected key frames and frames_total is the total number of frames contained in the behavior sequence; the final objective function is:
min{DI,FC}
step 5: key frame selection seeks an optimal subset of the three-dimensional skeleton feature frames, and the process is essentially an optimal search; key frame selection is therefore converted into a multi-objective optimization problem in a binary coding space, and an improved multi-objective differential evolution algorithm is adopted to solve the model, specifically comprising:
step 5.1: initialize the parameters: set the current generation G = 0, the maximum number of generations Gmax, the population size NP, the crossover probability CR, and the generation probability PG;
step 5.2: binary coding is adopted for the key frame extraction problem: a 0-1 variable represents the state of each frame in the sequence, 0 meaning the frame is not a key frame and 1 meaning it is; the initial frame, the end frame and the inflection-point frames are fixed as key frames, the remaining chromosome positions are initialized randomly to generate the initial population, and the fitness of each individual is calculated according to the objective functions;
step 5.3: set the evolution generation G = G + 1;
step 5.4: perform non-dominated sorting of the individuals according to their fitness values, and divide the population into three sub-populations Pop1, Pop2 and Pop3;
step 5.5: set the population individual index i = 0;
step 5.6: select individuals r1, r2 and r3 from the three sub-populations respectively;
step 5.7: generate the mutation vector V_i,G:

V_i,G = X_r1,G ∨ (F ∧ (X_r2,G ⊕ X_r3,G))

where ⊕ denotes the XOR operation, ∧ the AND operation and ∨ the OR operation, and X_r1,G, X_r2,G and X_r3,G are parents taken from Pop1, Pop2 and Pop3, respectively; to fully retain the prior knowledge of the inflection-point frames, when generating the random vector F the chromosome positions of the inflection-point frames are fixed at 1, while the values of the other positions are determined by a random number and the generation probability PG;
step 5.8: perform the crossover operation according to the crossover operator:

U_i,G(j) = V_i,G(j) if rand_j ≤ CR or j = j_rand; otherwise U_i,G(j) = X_i,G(j)

where j is the gene index, j_rand is a random index, and rand_j ∈ [0, 1] is a randomly generated number, so that the crossover occurs with probability CR;
step 5.9: perform the selection operation according to the selection operator:

X_i,G+1 = U_i,G if f(U_i,G) ≤ f(X_i,G); otherwise X_i,G+1 = X_i,G

where f(·) is the objective function;
step 5.10: set i = i + 1; if i < NP, go to step 5.6; otherwise go to step 5.11;
step 5.11: if G = Gmax, the algorithm ends and the key frame set is output; otherwise go to step 5.3;
step 6: behavior classification based on the key frames: decode the binary code obtained in step 5 to obtain the key frame sequence, input the three-dimensional skeleton features corresponding to the sequence into a human behavior classifier, and output the behavior classification result.
CN202011608049.7A 2020-12-30 2020-12-30 Three-dimensional skeleton key frame selection method for human behavior recognition Active CN112686153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011608049.7A CN112686153B (en) 2020-12-30 2020-12-30 Three-dimensional skeleton key frame selection method for human behavior recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011608049.7A CN112686153B (en) 2020-12-30 2020-12-30 Three-dimensional skeleton key frame selection method for human behavior recognition

Publications (2)

Publication Number Publication Date
CN112686153A true CN112686153A (en) 2021-04-20
CN112686153B CN112686153B (en) 2023-04-18

Family

ID=75454957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011608049.7A Active CN112686153B (en) 2020-12-30 2020-12-30 Three-dimensional skeleton key frame selection method for human behavior recognition

Country Status (1)

Country Link
CN (1) CN112686153B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627365A (en) * 2021-08-16 2021-11-09 南通大学 Group movement identification and time sequence analysis method
CN114926910A (en) * 2022-07-18 2022-08-19 科大讯飞(苏州)科技有限公司 Action matching method and related equipment thereof
CN115665359A (en) * 2022-10-09 2023-01-31 西华县环境监察大队 Intelligent compression method for environmental monitoring data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000465A1 (en) * 2015-07-01 2017-01-05 中国矿业大学 Method for real-time selection of key frames when mining wireless distributed video coding
WO2018049581A1 (en) * 2016-09-14 2018-03-22 浙江大学 Method for simultaneous localization and mapping
CN109858406A (en) * 2019-01-17 2019-06-07 西北大学 A kind of extraction method of key frame based on artis information
CN111310659A (en) * 2020-02-14 2020-06-19 福州大学 Human body action recognition method based on enhanced graph convolution neural network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HASHIM YASIN et al.: "Keys for Action: An Efficient Keyframe-Based Approach for 3D Action Recognition Using a Deep Neural Network", Sensors *
CHEN Hao et al.: "Aggressive behavior recognition based on human joint point data" (基于人体关节点数据的攻击性行为识别), Journal of Computer Applications (计算机应用) *


Also Published As

Publication number Publication date
CN112686153B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
Wang et al. Enhancing sketch-based image retrieval by cnn semantic re-ranking
CN112686153B (en) Three-dimensional skeleton key frame selection method for human behavior recognition
Lu et al. A bag-of-importance model with locality-constrained coding based feature learning for video summarization
Cheng et al. Cross-modality compensation convolutional neural networks for RGB-D action recognition
Tu et al. Skeleton-based human action recognition using spatial temporal 3D convolutional neural networks
Xiao et al. Multimodal fusion based on LSTM and a couple conditional hidden Markov model for Chinese sign language recognition
CN110555387B (en) Behavior identification method based on space-time volume of local joint point track in skeleton sequence
Hu et al. Overview of behavior recognition based on deep learning
Kadu et al. Automatic human mocap data classification
CN111046732B (en) Pedestrian re-recognition method based on multi-granularity semantic analysis and storage medium
CN109858406A (en) A kind of extraction method of key frame based on artis information
CN110163117B (en) Pedestrian re-identification method based on self-excitation discriminant feature learning
Xia et al. LAGA-Net: Local-and-global attention network for skeleton based action recognition
Song et al. Temporal action localization in untrimmed videos using action pattern trees
Liu et al. Dual-stream generative adversarial networks for distributionally robust zero-shot learning
Wang et al. A deep clustering via automatic feature embedded learning for human activity recognition
Xu et al. Motion recognition algorithm based on deep edge-aware pyramid pooling network in human–computer interaction
Zhai et al. Adaptive two-stream consensus network for weakly-supervised temporal action localization
Pang et al. Analysis of computer vision applied in martial arts
Yang et al. Sampling agnostic feature representation for long-term person re-identification
Wategaonkar et al. Sign gesture interpreter for better communication between a normal and deaf person
Chao et al. Adversarial refinement network for human motion prediction
Hong et al. Fine-grained feature generation for generalized zero-shot video classification
Wang et al. Amanet: Adaptive multi-path aggregation for learning human 2d-3d correspondences
CN112200260A (en) Figure attribute identification method based on discarding loss function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant