CN114373146A - Participant action identification method based on skeleton information and space-time characteristics - Google Patents


Info

Publication number
CN114373146A
CN114373146A
Authority
CN
China
Prior art keywords
frame image
atomic
sequence
action
actions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111568652.1A
Other languages
Chinese (zh)
Inventor
马丕明
陈思颖
栾春芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University
Priority to CN202111568652.1A
Publication of CN114373146A
Legal status: Pending (current)

Landscapes

  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A participant action identification method based on skeleton information and space-time characteristics belongs to the field of computer vision and comprises the following steps: acquiring the coordinate sequence of human skeleton joint points in a video-conference monitoring picture; obtaining the spatial feature sequence of the human body action by calculating joint angle features and joint point distance features; classifying the atomic action of each single-frame image according to the spatial features, and thereby determining the atomic action number sequence of the multi-frame video; learning the temporal variation characteristics of the atomic actions by constructing hidden Markov models (HMMs) corresponding to the different participant actions; and identifying the participant action by computing the log-likelihood of an unclassified atomic action number sequence under the HMM of each participant action and selecting the maximum. The invention can accurately and efficiently identify the actions of human participants in a video conference.

Description

Participant action identification method based on skeleton information and space-time characteristics
Technical Field
The invention relates to a participant action identification method based on skeletal information and space-time characteristics, and belongs to the field of computer vision.
Background
With the development of image processing technology, research on video conference systems has changed significantly as new techniques are introduced. To meet the diversified demands of users, image processing technology can be used to identify the actions of participants in a video conference; this reflects the meeting state of the participants in a timely and effective manner and helps the relevant management departments accurately grasp the effect of the meeting, thereby helping video conferencing become automated and intelligent. The invention automatically identifies human actions in the meeting state from the acquired human skeleton data of the participants, so that the management department can arrange and schedule meetings more effectively, which has practical significance and application value.
When deep learning is used for human action recognition, the data are analyzed by constructing a hierarchical neural network with learning capability; the drawback is that a huge amount of data is needed, otherwise the model may overfit during training. Patent CN113255616A, entitled "A method for identifying video behaviors based on deep learning", discloses a method for identifying human behaviors in video. The method comprises the following steps: constructing a video behavior recognition network; using a two-dimensional convolutional neural network, ResNet, as the backbone of the video behavior recognition network, and inserting a convolutional inter-frame temporal information extraction module into the backbone; the two-dimensional convolutional ResNet extracts static features of targets in the video; the inter-frame temporal information extraction module optimizes the backbone, extracts inter-frame features with a bilinear operation, and fuses the intra-frame and inter-frame information to obtain highly discriminative spatio-temporal features for behavior classification. The method trains and optimizes the parameters of the neural network model with human behavior training samples and thereby recognizes human behaviors in video. However, the neural network model constructed by this method is deep, so a large number of training samples are required. Participant actions in a video conference are characterized by few action types and few data samples, and using a deep neural network model may therefore lead to overfitting during training and poor action recognition.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a participant action recognition method based on skeleton information and space-time characteristics, aiming to solve the problem of poor action recognition caused by model overfitting when the video-conference participant-action dataset is small.
The technical scheme adopted by the invention is as follows:
a conference participation action identification method based on skeleton information and space-time characteristics determines a human conference participation action by processing a coordinate sequence of human skeleton joint points in a video conference monitoring picture, and comprises the following steps:
1) skeletal joint point coordinate sequence acquisition
Acquire the coordinate information of 8 skeletal joint points of the upper half of the human body in the video picture, namely, in order: the nose tip, the neck center, the right shoulder end, the right upper arm center, the right wrist center, the left shoulder end, the left upper arm center and the left wrist center. A video comprises $T$ frame images, and the joint point coordinate sequence of the video is represented as $[X_0, X_1, \dots, X_t, \dots, X_{T-1}]$; the coordinate sequence contained in the $t$-th frame image is $X_t = (x_{t,0}, x_{t,1}, \dots, x_{t,l}, \dots, x_{t,15})$, $l = 0, 1, \dots, 15$, where $(x_{t,0}, x_{t,1})$ are the coordinates of the first joint point (the nose tip) in the $t$-th frame image, $(x_{t,2}, x_{t,3})$ are the coordinates of the second joint point (the neck center), and so on, with $(x_{t,14}, x_{t,15})$ the coordinates of the eighth joint point (the left wrist center);
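As an illustration only (not part of the patent), the following Python sketch shows one way to pack the eight upper-body joints returned by any 2D pose estimator into the 16-dimensional frame vector X_t defined above; the joint-name list and the function pack_frame are hypothetical helpers, and the pose estimator itself is outside the scope of the sketch.

import numpy as np

# Assumed joint order, matching the patent's enumeration of the 8 upper-body joints.
JOINT_NAMES = [
    "nose_tip", "neck_center", "right_shoulder_end", "right_upper_arm_center",
    "right_wrist_center", "left_shoulder_end", "left_upper_arm_center", "left_wrist_center",
]

def pack_frame(joints_xy):
    """joints_xy maps joint name -> (x, y) pixel coordinates; returns the 16-dim vector X_t."""
    vec = np.empty(16, dtype=float)
    for l, name in enumerate(JOINT_NAMES):
        x, y = joints_xy[name]
        vec[2 * l] = x        # even index: horizontal coordinate
        vec[2 * l + 1] = y    # odd index: vertical coordinate
    return vec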
2) spatial feature sequence computation
a. For the $t$-th frame image, calculate the horizontal distance $F_{1,t,c}$ and the vertical distance $F_{2,t,c}$ between every two joint points to extract the distance features of atomic actions. Two joint points $l_1$ and $l_2$, $l_1 \neq l_2$, are selected arbitrarily from the eight joint points, where $l_1, l_2 \in \{0, 2, \dots, 14\}$ index coordinates in the horizontal direction and $l_1, l_2 \in \{1, 3, \dots, 15\}$ index coordinates in the vertical direction. The ordinal number of each combination of 2 different joint points is denoted $c$; there are 28 combinations in total, so $c = 0, 1, \dots, 27$. In the $t$-th frame image the horizontal coordinates of joint points $l_1$ and $l_2$ are $x_{t,l_1}$ and $x_{t,l_2}$, and the corresponding vertical coordinates are $x_{t,l_1+1}$ and $x_{t,l_2+1}$. The horizontal distance $F_{1,t,c}$ and the vertical distance $F_{2,t,c}$ between $l_1$ and $l_2$ are expressed as:

$$F_{1,t,c} = \left| x_{t,l_1} - x_{t,l_2} \right|, \qquad F_{2,t,c} = \left| x_{t,l_1+1} - x_{t,l_2+1} \right|;$$
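A minimal sketch of the distance-feature computation, assuming the frame-vector layout above (joints indexed 0 to 7, with horizontal and vertical coordinates at the even and odd positions); the 28 joint pairs are enumerated in a fixed order to give the combination ordinal c:

from itertools import combinations
import numpy as np

def distance_features(frame_vec):
    """Return the 28 horizontal distances F_{1,t,c} and 28 vertical distances F_{2,t,c}."""
    f1, f2 = [], []
    for l1, l2 in combinations(range(8), 2):   # 28 pairs of joints, ordinal c = 0..27
        f1.append(abs(frame_vec[2 * l1] - frame_vec[2 * l2]))          # horizontal
        f2.append(abs(frame_vec[2 * l1 + 1] - frame_vec[2 * l2 + 1]))  # vertical
    return np.array(f1), np.array(f2)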
b. For the $t$-th frame image, calculate the joint angles $F_{3,t,d}$, $d = 0, 1, \dots, 4$, related to the participant action to extract the angle features of atomic actions. The joint angles related to the participant action are, respectively: the angle between the lines connecting the nose tip to the left shoulder end and to the right shoulder end; the angle between the line from the neck center to the nose tip and the line from the neck center to the left shoulder end; the angle between the line from the neck center to the nose tip and the line from the neck center to the right shoulder end; the angle between the upper arm and the lower arm of the left hand; and the angle between the upper arm and the lower arm of the right hand. Taking the angle between the lines connecting the nose tip to the left and right shoulder ends as an example, it is calculated as:

$$F_{3,t,0} = \arccos \frac{(x_{t,4} - x_{t,0})(x_{t,10} - x_{t,0}) + (x_{t,5} - x_{t,1})(x_{t,11} - x_{t,1})}{\sqrt{(x_{t,4} - x_{t,0})^2 + (x_{t,5} - x_{t,1})^2}\,\sqrt{(x_{t,10} - x_{t,0})^2 + (x_{t,11} - x_{t,1})^2}}$$

where $(x_{t,0}, x_{t,1})$, $(x_{t,4}, x_{t,5})$ and $(x_{t,10}, x_{t,11})$ are the coordinates of the nose tip, the right shoulder end and the left shoulder end, respectively;
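The angle features can be computed with a generic helper based on the arccosine of the normalized dot product; the sketch below is illustrative, and the commented example applies it to F_{3,t,0}, the angle at the nose tip between the rays to the two shoulder ends, using the coordinate layout assumed above:

import numpy as np

def joint_angle(p_vertex, p_a, p_b):
    """Angle (in radians) at p_vertex between the rays p_vertex->p_a and p_vertex->p_b."""
    v1 = np.asarray(p_a, dtype=float) - np.asarray(p_vertex, dtype=float)
    v2 = np.asarray(p_b, dtype=float) - np.asarray(p_vertex, dtype=float)
    cos_val = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return float(np.arccos(np.clip(cos_val, -1.0, 1.0)))

# Example for F_{3,t,0}:
# nose, r_shoulder, l_shoulder = frame_vec[0:2], frame_vec[4:6], frame_vec[10:12]
# f3_0 = joint_angle(nose, r_shoulder, l_shoulder)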
c. Extract the distance features and angle features of each frame image in the video according to steps a and b to obtain the spatial feature sequence of the participant action, expressed as $[F_0, F_1, \dots, F_t, \dots, F_{T-1}]$, where $F_t = (F_{1,t,0}, \dots, F_{1,t,27}, F_{2,t,0}, \dots, F_{2,t,27}, F_{3,t,0}, \dots, F_{3,t,4})$;
3) Atomic action number sequence acquisition
a. According to the specific application scenario, set $V$ classes of atomic actions and rank them in priority order according to the degree of attention paid to each action, obtaining the set of atomic action numbers $\{0, 1, \dots, V-1\}$;
b. Since the spatial features differ markedly between different atomic actions, the judgment criterion of the $v$-th class of atomic action, $v = 0, 1, \dots, V-1$, is formulated by finding the ranges of the joint angles and joint point distances that are most representative of that atomic action: define $U_{1,v,c}$, $L_{1,v,c}$, $U_{2,v,c}$, $L_{2,v,c}$ as the upper and lower limits of the value ranges of the $c$-th horizontal and vertical joint distance features of the $v$-th class of atomic action, and $U_{3,v,d}$, $L_{3,v,d}$ as the upper and lower limits of the value range of the $d$-th joint angle feature of the $v$-th class of atomic action;
c. Classify the human body action in the $t$-th frame image according to the atomic action priority: if the spatial features of the $t$-th frame satisfy $L_{1,v,c} < F_{1,t,c} < U_{1,v,c}$, $L_{2,v,c} < F_{2,t,c} < U_{2,v,c}$ and $L_{3,v,d} < F_{3,t,d} < U_{3,v,d}$, the action in the $t$-th frame image belongs to the $v$-th class of atomic action, and the observed value at the moment corresponding to the $t$-th frame image is $o_t = v$;
d. Classify every frame image in the video by atomic action according to step c to obtain the observation sequence of the video, i.e. the atomic action number sequence $O = (o_0, o_1, \dots, o_{T-1})$;
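A sketch of the rule-based atomic-action classifier of steps b and c, under the assumption that each class's bounds are stored as 61-dimensional lower/upper arrays (with -inf/+inf for features that class does not constrain) and that the list is already sorted by the priority of step a; the fallback to the last class for frames matching no rule is an assumption, not something stated in the patent:

import numpy as np

def classify_atomic_action(feature_vec, bounds):
    """
    feature_vec: 61-dim spatial feature F_t = (F_1, F_2, F_3) of one frame.
    bounds: list over atomic-action classes v (sorted by priority) of pairs
            (lower, upper) of 61-dim arrays holding L_{.,v,.} and U_{.,v,.};
            unconstrained entries are set to -inf / +inf.
    Returns the number v of the first (highest-priority) class whose bounds all hold.
    """
    for v, (lower, upper) in enumerate(bounds):
        if np.all((feature_vec > lower) & (feature_vec < upper)):
            return v
    return len(bounds) - 1   # assumed fallback: lowest-priority (e.g. "other") class

Applying classify_atomic_action to every frame's feature vector yields the observation sequence O = (o_0, o_1, ..., o_{T-1}).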
4) Construction of the hidden Markov models (HMMs) corresponding to the different participant actions
a. According to the specific application scenario, set $K$ classes of participant actions; the set of participant action numbers is $\{0, 1, \dots, K-1\}$. The training data of the $k$-th class of participant action, i.e. its atomic action number sequence, is $O_k = (o_{0,k}, o_{1,k}, \dots, o_{T-1,k})$, $o_{t,k} \in \{0, 1, \dots, V-1\}$, where $o_{0,k}, o_{1,k}, \dots, o_{T-1,k}$ is the observation sequence; the hidden sequence corresponding to the observation sequence is $I_k = (i_{0,k}, i_{1,k}, \dots, i_{T-1,k})$, $i_{t,k} \in Q$, where $Q = (q_0, q_1, \dots, q_{N-1})$ is the hidden state set and $N$ is the number of hidden states;
b. Perform HMM modeling of the participant action corresponding to the $k$-th class of training data $O_k$. The HMM parameters of the $k$-th class of participant action are defined as $\lambda_k^{(r)} = (A_k^{(r)}, B_k^{(r)}, \pi_k^{(r)})$, where $r$ is the iteration number. The state transition matrix is $A_k^{(r)} = \big[a_{nm,k}^{(r)}\big]_{N \times N}$, where $a_{nm,k}^{(r)}$ denotes the probability of being in state $q_n$ at the moment corresponding to the $t$-th frame image and transferring to state $q_m$ at the moment corresponding to the $(t+1)$-th frame image; the observation probability matrix is $B_k^{(r)} = \big[b_{n,k}^{(r)}(v)\big]_{N \times V}$, where $b_{n,k}^{(r)}(v)$ denotes the probability that the observed value is the atomic action number $v$ given that the state at the moment corresponding to the $t$-th frame image is $q_n$; and the initial state probability vector is $\pi_k^{(r)} = \big[\pi_{n,k}^{(r)}\big]_{1 \times N}$, where $\pi_{n,k}^{(r)}$ denotes the probability that the state at the moment corresponding to the 0-th frame image is $q_n$. The symbol $[\cdot]$ with subscripts $N \times N$, $N \times V$ and $1 \times N$ denotes the dimensions of the matrices;
c. Initialize the HMM parameters to $\lambda_k^{(0)} = (A_k^{(0)}, B_k^{(0)}, \pi_k^{(0)})$, and define the maximum number of iterations $R$ and the maximum log-likelihood error $\Delta$;
d. Under the parameters $\lambda_k^{(r)}$, define $\alpha_{t,k}(n)$ as the forward probability of being in state $q_n$ at the moment corresponding to the $t$-th frame image with observation sequence $o_{0,k}, o_{1,k}, \dots, o_{t,k}$. The forward probability at the moment corresponding to the 0-th frame image is expressed as

$$\alpha_{0,k}(n) = \pi_{n,k}^{(r)}\, b_{n,k}^{(r)}(o_{0,k}), \qquad n = 0, 1, \dots, N-1.$$

Setting $t = 0, 1, \dots, T-2$, for all hidden states $q_m$ the forward probability is calculated as follows:

$$\alpha_{t+1,k}(m) = \Big[\sum_{n=0}^{N-1} \alpha_{t,k}(n)\, a_{nm,k}^{(r)}\Big]\, b_{m,k}^{(r)}(o_{t+1,k}).$$

Under the parameters $\lambda_k^{(r)}$, the probability log-likelihood of the observation sequence $O_k$ is expressed as

$$\log P\{O_k \mid \lambda_k^{(r)}\} = \log \sum_{n=0}^{N-1} \alpha_{T-1,k}(n);$$
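A direct (unscaled) implementation of the forward recursion and log-likelihood of step d is sketched below; for long sequences a scaled or log-space version would be needed to avoid underflow, but the unscaled form mirrors the formulas above:

import numpy as np

def forward_log_likelihood(obs, A, B, pi):
    """
    obs: atomic-action numbers o_0 .. o_{T-1}; A: N x N transition matrix;
    B: N x V observation matrix; pi: length-N initial state vector.
    Returns (alpha, log P(O | lambda)).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                         # alpha_0(n) = pi_n * b_n(o_0)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]] # recursion of step d
    return alpha, float(np.log(alpha[-1].sum()))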
e. For $r \geq 1$, use the probability log-likelihoods $\log P\{O_k \mid \lambda_k^{(r)}\}$ and $\log P\{O_k \mid \lambda_k^{(r-1)}\}$ obtained at the $r$-th and $(r-1)$-th iterations to calculate the log-likelihood error $\delta_r$:

$$\delta_r = \left|\log P\{O_k \mid \lambda_k^{(r)}\} - \log P\{O_k \mid \lambda_k^{(r-1)}\}\right|,$$

where $P\{O_k \mid \lambda_k^{(r-1)}\}$ denotes the probability of the observation sequence $O_k$ under the parameters $\lambda_k^{(r-1)}$, i.e. the HMM parameters obtained at the previous iteration;
f. Define $\beta_{t,k}(n)$ as the backward probability that, given state $q_n$ at the moment corresponding to the $t$-th frame image, the observation sequence is $o_{t+1,k}, o_{t+2,k}, \dots, o_{T-1,k}$. At the moment corresponding to the $(T-1)$-th frame image, $\beta_{T-1,k}(n) = 1$. Setting $t = T-2, T-3, \dots, 0$, for all hidden states $q_n$ the backward probability is expressed as:

$$\beta_{t,k}(n) = \sum_{m=0}^{N-1} a_{nm,k}^{(r)}\, b_{m,k}^{(r)}(o_{t+1,k})\, \beta_{t+1,k}(m);$$
g. Define the probability $\xi_{t,k}(m,n)$ that, given the observation sequence $O_k$, the state at the moment corresponding to the $t$-th frame image is $q_n$ and the state at the moment corresponding to the $(t+1)$-th frame image is $q_m$; its formula is:

$$\xi_{t,k}(m,n) = \frac{\alpha_{t,k}(n)\, a_{nm,k}^{(r)}\, b_{m,k}^{(r)}(o_{t+1,k})\, \beta_{t+1,k}(m)}{\sum_{n'=0}^{N-1}\sum_{m'=0}^{N-1} \alpha_{t,k}(n')\, a_{n'm',k}^{(r)}\, b_{m',k}^{(r)}(o_{t+1,k})\, \beta_{t+1,k}(m')}.$$

Further, $\gamma_{t,k}(n)$ denotes the probability that the state at the moment corresponding to the $t$-th frame image is $q_n$, expressed as:

$$\gamma_{t,k}(n) = \frac{\alpha_{t,k}(n)\, \beta_{t,k}(n)}{\sum_{n'=0}^{N-1} \alpha_{t,k}(n')\, \beta_{t,k}(n')};$$
h. Calculate the HMM parameters $\lambda_k^{(r+1)} = (A_k^{(r+1)}, B_k^{(r+1)}, \pi_k^{(r+1)})$ using the re-estimation formulas, expressed as:

$$a_{nm,k}^{(r+1)} = \frac{\sum_{t=0}^{T-2} \xi_{t,k}(m,n)}{\sum_{t=0}^{T-2} \gamma_{t,k}(n)}, \qquad b_{n,k}^{(r+1)}(v) = \frac{\sum_{t=0,\; o_{t,k}=v}^{T-1} \gamma_{t,k}(n)}{\sum_{t=0}^{T-1} \gamma_{t,k}(n)}, \qquad \pi_{n,k}^{(r+1)} = \gamma_{0,k}(n);$$
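Steps g and h can be combined into one re-estimation routine; the sketch below reuses forward_log_likelihood and backward_probabilities from the sketches above and is written for a single training sequence, which is an assumption (in practice several sequences per action class would typically be pooled):

import numpy as np

def baum_welch_update(obs, A, B, pi):
    """One re-estimation step: computes xi and gamma, then returns (A', B', pi')."""
    T, N = len(obs), len(pi)
    V = B.shape[1]
    alpha, _ = forward_log_likelihood(obs, A, B, pi)
    beta = backward_probabilities(obs, A, B)

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)              # gamma_t(n)

    xi = np.zeros((T - 1, N, N))                           # xi_t[n, m]
    for t in range(T - 1):
        num = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = num / num.sum()

    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    obs_arr = np.asarray(obs)
    for v in range(V):
        B_new[:, v] = gamma[obs_arr == v].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]
    return A_new, B_new, pi_new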
i. Execute steps d to h until the iteration number $r = R-1$ or the log-likelihood error $\delta_r < \Delta$, obtaining the trained HMM parameters $\lambda_k = (A_k, B_k, \pi_k)$;
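The full training loop of steps c to i, with random initialization and the two stopping criteria (maximum iteration count R, log-likelihood error below Delta), might look as follows; the random initialization and the default hyperparameter values are assumptions, since the patent does not fix them:

import numpy as np

def train_hmm(obs, N, V, R=100, delta=1e-4, seed=0):
    """Train one action-class HMM on the atomic-action number sequence obs."""
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, V)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N);     pi /= pi.sum()
    _, prev_ll = forward_log_likelihood(obs, A, B, pi)
    for r in range(1, R):
        A, B, pi = baum_welch_update(obs, A, B, pi)
        _, ll = forward_log_likelihood(obs, A, B, pi)
        if abs(ll - prev_ll) < delta:          # delta_r < Delta
            break
        prev_ll = ll
    return A, B, pi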
5) Human body participant action recognition
a. For an atomic action number sequence of unknown class, $O = (o_0, o_1, \dots, o_{T-1})$, calculate the log-likelihood $\log P\{O \mid \lambda_k\}$ of $O$ under the HMM parameters $\lambda_k$ of each participant action according to the method of step d in step 4);
b. Perform participant action identification by calculating the maximum log-likelihood; the identified participant action number is expressed as:

$$k^{\ast} = \arg\max_{k \in \{0, 1, \dots, K-1\}} \log P\{O \mid \lambda_k\},$$

where the right-hand side takes the value of the parameter $k$ that maximizes $\log P\{O \mid \lambda_k\}$; the participant action identification is then complete, and the identification result is the action type corresponding to that participant action number.
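Recognition then reduces to scoring the unknown sequence under each trained model and taking the maximum, as in this sketch (the dictionary of per-class models is an assumed data structure):

def recognize(obs, models):
    """models: {participant-action number k: (A_k, B_k, pi_k)}. Returns argmax_k log P(O | lambda_k)."""
    scores = {k: forward_log_likelihood(obs, A, B, pi)[1]
              for k, (A, B, pi) in models.items()}
    return max(scores, key=scores.get)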
The invention has the following beneficial effects: the method identifies participant actions in a video conference by combining human skeleton information with the spatio-temporal information of the video data. By calculating the spatial feature sequence of the human action and classifying the atomic action of each single-frame image, the atomic action number sequence of the multi-frame video is determined, so that the spatial features in the video images are fully exploited and the recognition effect is improved. Furthermore, exploiting the fact that an HMM requires only a small amount of training data, an HMM is constructed for each participant action; because the HMM models a temporal process, the temporal features of the video are better utilized and a better participant action recognition effect is achieved.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawing and an example, but is not limited thereto.
Example:
A participant action identification method based on skeleton information and space-time characteristics determines the participant action of a human body by processing the coordinate sequence of human skeleton joint points in a video-conference monitoring picture, as shown in FIG. 1, and comprises the following steps:
Steps 1) to 5) of this embodiment are carried out exactly as described in the disclosure above: acquiring the skeletal joint point coordinate sequence, computing the spatial feature sequence, obtaining the atomic action number sequence, constructing and training an HMM for each participant action, and recognizing the human participant action by the maximum log-likelihood.

Claims (1)

1. A conference participation action identification method based on skeleton information and space-time characteristics determines a human conference participation action by processing a coordinate sequence of human skeleton joint points in a video conference monitoring picture, and comprises the following steps:
1) skeletal joint point coordinate sequence acquisition
Acquire the coordinate information of 8 skeletal joint points of the upper half of the human body in the video picture, namely, in order: the nose tip, the neck center, the right shoulder end, the right upper arm center, the right wrist center, the left shoulder end, the left upper arm center and the left wrist center. A video comprises $T$ frame images, and the joint point coordinate sequence of the video is represented as $[X_0, X_1, \dots, X_t, \dots, X_{T-1}]$; the coordinate sequence contained in the $t$-th frame image is $X_t = (x_{t,0}, x_{t,1}, \dots, x_{t,l}, \dots, x_{t,15})$, $l = 0, 1, \dots, 15$, where $(x_{t,0}, x_{t,1})$ are the coordinates of the first joint point (the nose tip) in the $t$-th frame image, $(x_{t,2}, x_{t,3})$ are the coordinates of the second joint point (the neck center), and so on, with $(x_{t,14}, x_{t,15})$ the coordinates of the eighth joint point (the left wrist center);
2) spatial feature sequence computation
a. For the $t$-th frame image, calculate the horizontal distance $F_{1,t,c}$ and the vertical distance $F_{2,t,c}$ between every two joint points to extract the distance features of atomic actions. Two joint points $l_1$ and $l_2$, $l_1 \neq l_2$, are selected arbitrarily from the eight joint points, where $l_1, l_2 \in \{0, 2, \dots, 14\}$ index coordinates in the horizontal direction and $l_1, l_2 \in \{1, 3, \dots, 15\}$ index coordinates in the vertical direction. The ordinal number of each combination of 2 different joint points is denoted $c$; there are 28 combinations in total, so $c = 0, 1, \dots, 27$. In the $t$-th frame image the horizontal coordinates of joint points $l_1$ and $l_2$ are $x_{t,l_1}$ and $x_{t,l_2}$, and the corresponding vertical coordinates are $x_{t,l_1+1}$ and $x_{t,l_2+1}$. The horizontal distance $F_{1,t,c}$ and the vertical distance $F_{2,t,c}$ between $l_1$ and $l_2$ are expressed as:

$$F_{1,t,c} = \left| x_{t,l_1} - x_{t,l_2} \right|, \qquad F_{2,t,c} = \left| x_{t,l_1+1} - x_{t,l_2+1} \right|;$$
b. For the $t$-th frame image, calculate the joint angles $F_{3,t,d}$, $d = 0, 1, \dots, 4$, related to the participant action to extract the angle features of atomic actions; the joint angles related to the participant action are, respectively: the angle between the lines connecting the nose tip to the left shoulder end and to the right shoulder end; the angle between the line from the neck center to the nose tip and the line from the neck center to the left shoulder end; the angle between the line from the neck center to the nose tip and the line from the neck center to the right shoulder end; the angle between the upper arm and the lower arm of the left hand; and the angle between the upper arm and the lower arm of the right hand;
c. Extract the distance features and angle features of each frame image in the video according to steps a and b to obtain the spatial feature sequence of the participant action, expressed as $[F_0, F_1, \dots, F_t, \dots, F_{T-1}]$, where $F_t = (F_{1,t,0}, \dots, F_{1,t,27}, F_{2,t,0}, \dots, F_{2,t,27}, F_{3,t,0}, \dots, F_{3,t,4})$;
3) Atomic action number sequence acquisition
a. According to the specific application scenario, set $V$ classes of atomic actions and rank them in priority order according to the degree of attention paid to each action, obtaining the set of atomic action numbers $\{0, 1, \dots, V-1\}$;
b. Since the spatial features differ markedly between different atomic actions, the judgment criterion of the $v$-th class of atomic action, $v = 0, 1, \dots, V-1$, is formulated by finding the ranges of the joint angles and joint point distances that are most representative of that atomic action: define $U_{1,v,c}$, $L_{1,v,c}$, $U_{2,v,c}$, $L_{2,v,c}$ as the upper and lower limits of the value ranges of the $c$-th horizontal and vertical joint distance features of the $v$-th class of atomic action, and $U_{3,v,d}$, $L_{3,v,d}$ as the upper and lower limits of the value range of the $d$-th joint angle feature of the $v$-th class of atomic action;
c. Classify the human body action in the $t$-th frame image according to the atomic action priority: if the spatial features of the $t$-th frame satisfy $L_{1,v,c} < F_{1,t,c} < U_{1,v,c}$, $L_{2,v,c} < F_{2,t,c} < U_{2,v,c}$ and $L_{3,v,d} < F_{3,t,d} < U_{3,v,d}$, the action in the $t$-th frame image belongs to the $v$-th class of atomic action, and the observed value at the moment corresponding to the $t$-th frame image is $o_t = v$;
d. Classify every frame image in the video by atomic action according to step c to obtain the observation sequence of the video, i.e. the atomic action number sequence $O = (o_0, o_1, \dots, o_{T-1})$;
4) Construction of the hidden Markov models (HMMs) corresponding to the different participant actions
a. According to the specific application scenario, set $K$ classes of participant actions; the set of participant action numbers is $\{0, 1, \dots, K-1\}$. The training data of the $k$-th class of participant action, i.e. its atomic action number sequence, is $O_k = (o_{0,k}, o_{1,k}, \dots, o_{T-1,k})$, $o_{t,k} \in \{0, 1, \dots, V-1\}$, where $o_{0,k}, o_{1,k}, \dots, o_{T-1,k}$ is the observation sequence; the hidden sequence corresponding to the observation sequence is $I_k = (i_{0,k}, i_{1,k}, \dots, i_{T-1,k})$, $i_{t,k} \in Q$, where $Q = (q_0, q_1, \dots, q_{N-1})$ is the hidden state set and $N$ is the number of hidden states;
b. Perform HMM modeling of the participant action corresponding to the $k$-th class of training data $O_k$. The HMM parameters of the $k$-th class of participant action are defined as $\lambda_k^{(r)} = (A_k^{(r)}, B_k^{(r)}, \pi_k^{(r)})$, where $r$ is the iteration number. The state transition matrix is $A_k^{(r)} = \big[a_{nm,k}^{(r)}\big]_{N \times N}$, where $a_{nm,k}^{(r)}$ denotes the probability of being in state $q_n$ at the moment corresponding to the $t$-th frame image and transferring to state $q_m$ at the moment corresponding to the $(t+1)$-th frame image; the observation probability matrix is $B_k^{(r)} = \big[b_{n,k}^{(r)}(v)\big]_{N \times V}$, where $b_{n,k}^{(r)}(v)$ denotes the probability that the observed value is the atomic action number $v$ given that the state at the moment corresponding to the $t$-th frame image is $q_n$; and the initial state probability vector is $\pi_k^{(r)} = \big[\pi_{n,k}^{(r)}\big]_{1 \times N}$, where $\pi_{n,k}^{(r)}$ denotes the probability that the state at the moment corresponding to the 0-th frame image is $q_n$. The symbol $[\cdot]$ with subscripts $N \times N$, $N \times V$ and $1 \times N$ denotes the dimensions of the matrices;
c. Initialize the HMM parameters to $\lambda_k^{(0)} = (A_k^{(0)}, B_k^{(0)}, \pi_k^{(0)})$, and define the maximum number of iterations $R$ and the maximum log-likelihood error $\Delta$;
d. Under the parameters $\lambda_k^{(r)}$, define $\alpha_{t,k}(n)$ as the forward probability of being in state $q_n$ at the moment corresponding to the $t$-th frame image with observation sequence $o_{0,k}, o_{1,k}, \dots, o_{t,k}$. The forward probability at the moment corresponding to the 0-th frame image is expressed as

$$\alpha_{0,k}(n) = \pi_{n,k}^{(r)}\, b_{n,k}^{(r)}(o_{0,k}), \qquad n = 0, 1, \dots, N-1.$$

Setting $t = 0, 1, \dots, T-2$, for all hidden states $q_m$ the forward probability is calculated as follows:

$$\alpha_{t+1,k}(m) = \Big[\sum_{n=0}^{N-1} \alpha_{t,k}(n)\, a_{nm,k}^{(r)}\Big]\, b_{m,k}^{(r)}(o_{t+1,k}).$$

Under the parameters $\lambda_k^{(r)}$, the probability log-likelihood of the observation sequence $O_k$ is expressed as

$$\log P\{O_k \mid \lambda_k^{(r)}\} = \log \sum_{n=0}^{N-1} \alpha_{T-1,k}(n);$$
e. For $r \geq 1$, use the probability log-likelihoods $\log P\{O_k \mid \lambda_k^{(r)}\}$ and $\log P\{O_k \mid \lambda_k^{(r-1)}\}$ obtained at the $r$-th and $(r-1)$-th iterations to calculate the log-likelihood error $\delta_r$:

$$\delta_r = \left|\log P\{O_k \mid \lambda_k^{(r)}\} - \log P\{O_k \mid \lambda_k^{(r-1)}\}\right|,$$

where $P\{O_k \mid \lambda_k^{(r-1)}\}$ denotes the probability of the observation sequence $O_k$ under the parameters $\lambda_k^{(r-1)}$, i.e. the HMM parameters obtained at the previous iteration;
f. definition of betat,k(n) indicates that the t-th frame image is in the state q at the corresponding timenUnder the condition that the observed sequence is ot+1,k,ot+2,k,...,oT-1,kThe posterior probability of (1), the corresponding time of the T-1 frame image, betaT-11, T-2, T-3, 0, for all hidden states
Figure FDA00034226500000000224
The backward probability is expressed as:
Figure FDA00034226500000000225
g. Define the probability $\xi_{t,k}(m,n)$ that, given the observation sequence $O_k$, the state at the moment corresponding to the $t$-th frame image is $q_n$ and the state at the moment corresponding to the $(t+1)$-th frame image is $q_m$; its formula is:

$$\xi_{t,k}(m,n) = \frac{\alpha_{t,k}(n)\, a_{nm,k}^{(r)}\, b_{m,k}^{(r)}(o_{t+1,k})\, \beta_{t+1,k}(m)}{\sum_{n'=0}^{N-1}\sum_{m'=0}^{N-1} \alpha_{t,k}(n')\, a_{n'm',k}^{(r)}\, b_{m',k}^{(r)}(o_{t+1,k})\, \beta_{t+1,k}(m')}.$$

Further, $\gamma_{t,k}(n)$ denotes the probability that the state at the moment corresponding to the $t$-th frame image is $q_n$, expressed as:

$$\gamma_{t,k}(n) = \frac{\alpha_{t,k}(n)\, \beta_{t,k}(n)}{\sum_{n'=0}^{N-1} \alpha_{t,k}(n')\, \beta_{t,k}(n')};$$
h. Calculate the HMM parameters $\lambda_k^{(r+1)} = (A_k^{(r+1)}, B_k^{(r+1)}, \pi_k^{(r+1)})$ using the re-estimation formulas, expressed as:

$$a_{nm,k}^{(r+1)} = \frac{\sum_{t=0}^{T-2} \xi_{t,k}(m,n)}{\sum_{t=0}^{T-2} \gamma_{t,k}(n)}, \qquad b_{n,k}^{(r+1)}(v) = \frac{\sum_{t=0,\; o_{t,k}=v}^{T-1} \gamma_{t,k}(n)}{\sum_{t=0}^{T-1} \gamma_{t,k}(n)}, \qquad \pi_{n,k}^{(r+1)} = \gamma_{0,k}(n);$$

i. Execute steps d to h until the iteration number $r = R-1$ or the log-likelihood error $\delta_r < \Delta$, obtaining the trained HMM parameters $\lambda_k = (A_k, B_k, \pi_k)$;
5) Human body participant action recognition
a. For an atomic action number sequence of unknown class, $O = (o_0, o_1, \dots, o_{T-1})$, calculate the log-likelihood $\log P\{O \mid \lambda_k\}$ of $O$ under the HMM parameters $\lambda_k$ of each participant action according to the method of step d in step 4);
b. Perform participant action identification by calculating the maximum log-likelihood; the identified participant action number is expressed as:

$$k^{\ast} = \arg\max_{k \in \{0, 1, \dots, K-1\}} \log P\{O \mid \lambda_k\},$$

where the right-hand side takes the value of the parameter $k$ that maximizes $\log P\{O \mid \lambda_k\}$; the participant action identification is then complete, and the identification result is the action type corresponding to that participant action number.
CN202111568652.1A 2021-12-21 2021-12-21 Participant action identification method based on skeleton information and space-time characteristics Pending CN114373146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111568652.1A CN114373146A (en) 2021-12-21 2021-12-21 Participant action identification method based on skeleton information and space-time characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111568652.1A CN114373146A (en) 2021-12-21 2021-12-21 Participant action identification method based on skeleton information and space-time characteristics

Publications (1)

Publication Number Publication Date
CN114373146A true CN114373146A (en) 2022-04-19

Family

ID=81140973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111568652.1A Pending CN114373146A (en) 2021-12-21 2021-12-21 Participant action identification method based on skeleton information and space-time characteristics

Country Status (1)

Country Link
CN (1) CN114373146A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117747055A (en) * 2024-02-21 2024-03-22 北京万物成理科技有限公司 Training task difficulty determining method and device, electronic equipment and storage medium
CN117747055B (en) * 2024-02-21 2024-05-28 北京万物成理科技有限公司 Training task difficulty determining method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN106897670B (en) Express violence sorting identification method based on computer vision
CN109815826B (en) Method and device for generating face attribute model
CN109410168B (en) Modeling method of convolutional neural network for determining sub-tile classes in an image
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
CN108171133B (en) Dynamic gesture recognition method based on characteristic covariance matrix
CN108121950B (en) Large-pose face alignment method and system based on 3D model
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN113205595B (en) Construction method and application of 3D human body posture estimation model
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN112836597A (en) Multi-hand posture key point estimation method based on cascade parallel convolution neural network
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN110827304A (en) Traditional Chinese medicine tongue image positioning method and system based on deep convolutional network and level set method
CN113808047A (en) Human motion capture data denoising method
CN111339888B (en) Double interaction behavior recognition method based on joint point motion diagram
CN115205737B (en) Motion real-time counting method and system based on transducer model
CN114373146A (en) Participant action identification method based on skeleton information and space-time characteristics
CN111178141B (en) LSTM human body behavior identification method based on attention mechanism
CN115187633A (en) Six-degree-of-freedom visual feedback real-time motion tracking method
CN113192186B (en) 3D human body posture estimation model establishing method based on single-frame image and application thereof
Li et al. Few-shot meta-learning on point cloud for semantic segmentation
CN113205545A (en) Behavior recognition analysis method and system under regional environment
Hao et al. Evaluation System of Foreign Language Teaching Quality Based on Spatiotemporal Feature Fusion
Feng et al. An Analysis System of Students' Attendence State in Classroom Based on Human Posture Recognition
CN116469175B (en) Visual interaction method and system for infant education

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination