CN110956141B - Human body continuous action rapid analysis method based on local recognition - Google Patents

Human body continuous action rapid analysis method based on local recognition

Info

Publication number
CN110956141B
Authority
CN
China
Prior art keywords: coordinate, layer, score, human body, video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911216130.8A
Other languages
Chinese (zh)
Other versions
CN110956141A (en)
Inventor
赵红领
李润知
崔莉亚
刘皓东
王菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University
Priority to CN201911216130.8A
Publication of CN110956141A
Application granted
Publication of CN110956141B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention discloses a human body continuous action rapid analysis method based on local recognition, which mainly comprises rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-identification of coordinate points, feature sequence construction, and regression prediction model establishment and invocation. Video stream data of a person jumping with both feet while facing the camera is collected with a mobile phone; basic information of the target subject, including name, sex, age, height, weight, and the name of the action performed, is recorded and stored; and a regression prediction model is constructed using a deep neural network.

Description

Human body continuous action rapid analysis method based on local recognition
Technical Field
The invention relates to the technical field of human motion analysis, and in particular to a human body continuous action rapid analysis method based on local recognition.
Background
With the development and application of computer technology and machine learning in recent years, video-based time-series human motion analysis has emerged rapidly and attracted wide attention. Human motion analysis over video time series remains a very challenging topic in computer vision; it involves multiple disciplines, including image processing, pattern recognition, and machine learning, and has broad application prospects in intelligent monitoring, human-computer interaction, rehabilitation exercise, and physical training evaluation.
Traditional neural networks are easily misled by irrelevant information in the image when processing inputs with blurred features and high ambiguity, making a high recognition rate difficult to guarantee.
Existing video-based human behavior recognition algorithms suffer from problems such as high complexity, poor robustness, and low precision. In addition, little research has addressed regression prediction analysis of human actions. The invention therefore provides a human action analysis method based on time-series regression prediction with high robustness and stable time overhead, which is significant for modeling human actions, evaluating action quality, and discovering the potential of athletes.
Disclosure of Invention
In view of the above, and to overcome the defects of the prior art, the present invention aims to provide a human body continuous action rapid analysis method based on local recognition, which solves the problem of low accuracy in long-horizon regression prediction of actions during exercise training.
The technical scheme that solves this problem is a human body continuous action rapid analysis method based on local recognition, mainly comprising rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-identification of coordinate points, feature sequence construction, and regression prediction model establishment and invocation, and specifically comprising the following steps:
Step 1: collecting, with a mobile phone, video stream data of a person swinging the rope and jumping with both feet while facing the camera, and recording and storing basic information of the target subject, including name, sex, age, height, weight, and the name of the action performed;
Step 2: preprocessing the video data and performing pose estimation on the human body in each frame of the video to obtain keypoint coordinates, comprising the following steps:
Step A1: converting video data shot by different mobile phones to a uniform scale, with the video height set to 530 pixels and the width to 460 pixels;
Step A2: obtaining, using the OpenPose method, the coordinate positions of 14 joint points of the human body in each frame of the video, namely the nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle, each coordinate being denoted CP_i = (x_i, y_i), where i runs from 1 to 14 and (x_i, y_i) are the coordinates of a human keypoint;
Step A3: OpenPose uses the Gaussian distance between a predicted keypoint and its ground truth to define the keypoint's confidence and normalizes the confidence to [0, 1]; this value is defined here as the keypoint score score_i, yielding the output inputs_i = (x_i, y_i, score_i);
Step A4: the scores of the 14 keypoints are averaged and their standard deviation is computed, and the sum of the average and the standard deviation is taken as the score of the overall set of keypoints:
score_avg = (1/14) Σ_{i=1}^{14} score_i
score_std = √[ (1/14) Σ_{i=1}^{14} (score_i − score_avg)² ]
score_total = score_avg + score_std
Step 3: windowing and re-identifying keypoint positions with low prediction confidence using an image windowing technique, combining global information with the local window to improve keypoint prediction accuracy, comprising the following steps:
Step B1: using the inputs_i = (x_i, y_i, score_i) obtained from OpenPose, setting a threshold th on the score and finding the keypoints whose scores fall below th;
Step B2: opening a window around each keypoint whose score is below the threshold, and feeding the image patch inside the window into the OpenPose network with its input size modified;
Step B3: updating the keypoint coordinates obtained in the local window using the global information;
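A minimal sketch of steps B1 to B3, assuming a stand-in `estimate_pose(patch)` function for the input-modified OpenPose network and illustrative values for the threshold th and window half-size r (neither value is fixed by the patent):

```python
import numpy as np

def rewindow_low_score_keypoints(frame, inputs, estimate_pose, th=0.4, r=48):
    """Re-identify low-confidence keypoints inside local windows (step 3).

    `inputs` is an array of shape (14, 3) with rows (x_i, y_i, score_i).
    `estimate_pose(patch)` stands in for the modified OpenPose network and is
    assumed to return (x, y, score) in patch coordinates.
    """
    h, w = frame.shape[:2]
    out = inputs.copy()
    for i, (x, y, score) in enumerate(inputs):
        if score >= th:
            continue  # keypoint already confident enough (step B1)
        # Window around the keypoint, clipped to the image (step B2)
        x0, y0 = max(int(x) - r, 0), max(int(y) - r, 0)
        x1, y1 = min(int(x) + r, w), min(int(y) + r, h)
        patch = frame[y0:y1, x0:x1]
        px, py, pscore = estimate_pose(patch)
        # Map the local prediction back to global coordinates (step B3)
        if pscore > score:
            out[i] = (x0 + px, y0 + py, pscore)
    return out
```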
Step 4: to further improve the robustness of the algorithm to shooting angle, target distance, and jitter during recording, coordinate conversion is performed with the center of gravity of the three keypoints left hip C_1, right hip C_11, and neck C_8 taken as the origin, yielding the data
inputs_i′ = (x_i − cx_0, y_i − cy_0, score_i),
where (x_i − cx_0, y_i − cy_0) are the converted relative coordinates. The origin of coordinates is the center of gravity C_0 = (cx_0, cy_0) of the three points C_1, C_11, C_8, where
cx_0 = (x_1 + x_8 + x_11)/3
cy_0 = (y_1 + y_8 + y_11)/3.
All coordinate points are then updated with the origin coordinates as reference;
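A minimal sketch of the step-4 conversion; the index mapping (left hip C_1, right hip C_11, neck C_8) follows the patent's labels and is assumed to match the pose estimator's output order:

```python
import numpy as np

# Indices per the patent's labels (assumed 0-based here for illustration)
LEFT_HIP, RIGHT_HIP, NECK = 1, 11, 8

def to_relative_coordinates(inputs):
    """Center keypoints on the hip/neck centroid (step 4).

    `inputs` has shape (14, 3) with rows (x_i, y_i, score_i); scores are
    left unchanged, only x and y are shifted to the centroid origin.
    """
    pts = np.asarray(inputs, dtype=float)
    cx0 = (pts[LEFT_HIP, 0] + pts[RIGHT_HIP, 0] + pts[NECK, 0]) / 3.0
    cy0 = (pts[LEFT_HIP, 1] + pts[RIGHT_HIP, 1] + pts[NECK, 1]) / 3.0
    rel = pts.copy()
    rel[:, 0] -= cx0  # x relative to centroid
    rel[:, 1] -= cy0  # y relative to centroid
    return rel
```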
Step 5: accumulating the coordinate matrix inputs obtained from each frame into an accumulated coordinate matrix for each video segment, and segmenting the accumulated matrix with a sliding window, i.e., setting the sliding window length to the accumulated coordinates and scores of k frames, with a step size of 1;
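A minimal sketch of the step-5 segmentation; the window length k = 16 in the example is illustrative, since the patent leaves k open:

```python
import numpy as np

def sliding_windows(seq, k):
    """Cut the accumulated per-frame matrix into overlapping windows (step 5).

    `seq` is assumed to have shape (n_frames, 14, 3), the (x, y, score)
    rows accumulated over a video segment; step size is 1 as stated.
    """
    seq = np.asarray(seq)
    return np.stack([seq[t:t + k] for t in range(len(seq) - k + 1)])

# Example: a 100-frame segment with k = 16 yields 85 windows
windows = sliding_windows(np.zeros((100, 14, 3)), k=16)
print(windows.shape)  # (85, 16, 14, 3)
```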
Step 6: constructing a regression prediction model using a deep neural network, comprising the following steps:
Step C1: constructing a network model by fusing an RNN with a CNN, i.e., fusing a bidirectional LSTM (BiLSTM) model with a model consisting of two convolution layers and a global pooling layer to build the deep neural network model;
Step C2: dividing the data into a training set and a test set (see the split sketch after these steps), training the network model with the training set, and saving the pre-trained model;
Step C3: inputting the test data into the trained model to obtain the predicted result.
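For illustration, a minimal sketch of the step C2 split; the 80/20 ratio and fixed seed are assumptions, since the patent states neither:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays standing in for the real data: X holds sliding windows of
# shape (n, k, 42); y holds next-frame targets of shape (n, 42).
X = np.zeros((1000, 16, 42))
y = np.zeros((1000, 42))

# 80/20 split with a fixed seed (both assumed for illustration)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (800, 16, 42) (200, 16, 42)
```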
Due to the adoption of the above technical scheme, the invention has the following advantages over the prior art:
1. Through rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-identification of coordinate points, feature sequence construction, and regression prediction model establishment and invocation, the invention addresses low-confidence keypoint prediction during pose estimation as well as long-horizon action analysis and athletic-potential discovery during exercise training, and provides a reference basis for accurate action analysis of the exercise process.
Drawings
FIG. 1 is a schematic diagram of the single-swing two-foot jump analysis of the preferred embodiment of the present invention;
FIG. 2 is a diagram of the 14-point human pose estimation during rope skipping;
FIG. 3 is a view of windowing at a local position of the human body;
FIG. 4 is a graph of the coordinate transformation of human keypoints in a rectangular coordinate system;
FIG. 5 is a general network architecture diagram;
FIG. 6 is a visualization result diagram of each layer of the network structure.
Detailed Description
The foregoing and other aspects, features, and advantages of the invention will be apparent from the following more particular description of embodiments of the invention, as illustrated in the accompanying FIGS. 1 to 6. The structural contents mentioned in the following embodiments all refer to the attached drawings of the specification.
The human body continuous action rapid analysis method based on local recognition proceeds by the rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-identification of coordinate points, feature sequence construction, and regression prediction model establishment and invocation set forth in steps 1 to 6 of the Disclosure of Invention above; the details below supplement those steps.
In step 6, the BiLSTM is a bidirectional LSTM formed by stacking two LSTMs, with the output determined jointly by the states of the two. One recurrent network computes the hidden vector from front to back,
h^f_t = LSTM(x_t, h^f_{t−1}),
and the other recurrent neural network computes the hidden vector from back to front,
h^b_t = LSTM(x_t, h^b_{t+1});
the final output combines the two:
y_t = [h^f_t ; h^b_t].
The first convolution layer is a one-dimensional convolution with a kernel of size 5 and added regularization; the second convolution layer has a kernel of size 3, also with regularization; the third layer is a global pooling layer; and the numbers of filters in the two convolutions are 64 and 32, respectively.
A Dropout layer is added to the BiLSTM layer; to realize linear regression, the activation function of the last layer is set to a linear activation function, and the mean squared error (MSE) regression loss is selected as the loss function. To accelerate convergence of the network model and control overfitting, a batch normalization layer is added to each convolution layer, whose computation is as follows:
μ_B = (1/m) Σ_{i=1}^{m} z_i
σ²_B = (1/m) Σ_{i=1}^{m} (z_i − μ_B)²
ẑ_i = (z_i − μ_B) / √(σ²_B + ε)
h_i = γ·ẑ_i + β
where B = {z_1, …, z_m} is the input of the batch, m is the batch size, μ_B is the mean of the batch data, σ²_B is the variance of the batch, ẑ_i is the normalized result, ε is a very small constant, h_i is the result after scaling and shifting, and γ and β are parameters of the network model learned during training;
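The four equations above translate directly into a short sketch (ε = 1e-5 is an assumed value for the small constant):

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Batch normalization as written in the patent's four equations.

    `z` has shape (m, features); gamma and beta are the learned scale and
    shift. eps is an assumed value for the small constant ε.
    """
    mu_b = z.mean(axis=0)                      # batch mean μ_B
    var_b = z.var(axis=0)                      # batch variance σ²_B
    z_hat = (z - mu_b) / np.sqrt(var_b + eps)  # normalized ẑ_i
    return gamma * z_hat + beta                # scaled and shifted h_i
```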
On the basis of the above scheme, when step 6 is performed the convolution layer is expressed as
x^l_j = f( Σ_{i∈M_j} x^{l−1}_i * w^l_{ij} + bias^l_j ),
where l is the layer index, x^l_j is the output of the jth neuron of layer l, x^{l−1}_i is the ith input of layer l, * denotes the convolution, w_{ij} is the convolution kernel, bias is the bias term, M_j is the set of input feature maps, and f(·) denotes an activation function;
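Putting the step-6 pieces together, a minimal Keras sketch of the fused BiLSTM-CNN regression model. The kernel sizes (5 and 3), filter counts (64 and 32), batch normalization, global pooling, Dropout, linear output, and MSE loss follow the description above; the LSTM width, dropout rate, L2 coefficient, optimizer, window length k, and the concatenation used to fuse the two branches are assumptions:

```python
from tensorflow.keras import layers, models, regularizers

def build_model(k=16, n_features=42):
    """Fused BiLSTM + CNN regression model (step 6, a sketch).

    Input: a window of k frames, each flattened to 14 keypoints x (x, y,
    score) = 42 features. Output: the next frame's 42 values, with linear
    activation so the network performs regression.
    """
    inp = layers.Input(shape=(k, n_features))

    # Recurrent branch: bidirectional LSTM with a Dropout layer
    rnn = layers.Bidirectional(layers.LSTM(64))(inp)  # width assumed
    rnn = layers.Dropout(0.5)(rnn)                    # rate assumed

    # Convolutional branch: two regularized 1-D convolutions, each followed
    # by batch normalization, then a global pooling layer
    cnn = layers.Conv1D(64, 5, padding="same", activation="relu",
                        kernel_regularizer=regularizers.l2(1e-4))(inp)
    cnn = layers.BatchNormalization()(cnn)
    cnn = layers.Conv1D(32, 3, padding="same", activation="relu",
                        kernel_regularizer=regularizers.l2(1e-4))(cnn)
    cnn = layers.BatchNormalization()(cnn)
    cnn = layers.GlobalAveragePooling1D()(cnn)

    # Fuse the two branches and regress with a linear last layer
    fused = layers.Concatenate()([rnn, cnn])
    out = layers.Dense(n_features, activation="linear")(fused)

    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")  # MSE regression loss
    return model
```

The output dimension matches the next-frame keypoint vector, so the model can be rolled forward recursively as described next.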
Given k consecutive sub-segments, the regression prediction output gives the keypoint coordinates and scores of the next frame; the total score is obtained according to step 2, and the predicted frame is recursively appended to the current sequence to predict the data of the following frame;
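A minimal sketch of that recursion, assuming a model whose predict() maps a (1, k, 42) window to the next frame's 42 values, as in the model sketch above:

```python
import numpy as np

def recursive_forecast(model, window, n_future):
    """Roll the regression model forward frame by frame (a sketch).

    Each predicted frame is appended to the window and the oldest frame is
    dropped, mirroring the recursion described in the text.
    """
    window = np.asarray(window, dtype=float)       # shape (k, 42)
    future = []
    for _ in range(n_future):
        nxt = model.predict(window[None, ...])[0]  # next-frame prediction
        future.append(nxt)
        window = np.vstack([window[1:], nxt])      # slide the window forward
    return np.array(future)
```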
the image windowing technology is that required target information is highlighted in an original image, and the purpose is to detect the coordinate position of a key point in the window;
When step 3 is executed, the area of the window is S, and the coordinates of the four corner points of the window are ld(kx_1, ky_1), lu(kx_2, ky_2), rd(kx_3, ky_3), and ru(kx_4, ky_4);
The maximum area of the image window is determined by the processing speed d of the CPU and the total operation amount m of the software fusion algorithm: the larger d and the smaller m, the larger the permissible window area, and conversely the smaller the area; with the background image area denoted S_1, the maximum window area is S_2 = d·S_1/25;
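For illustration only, a worked example of the area cap with assumed numbers (the patent does not give units for d):

```python
# Worked example of the window-area cap S_2 = d * S_1 / 25, with assumed
# values: the CPU processing-speed factor d is not specified in the patent.
s1 = 530 * 460          # background image area S_1 = 243_800 (pixels²)
d = 2.0                 # assumed CPU processing-speed factor
s2 = d * s1 / 25        # maximum window area S_2 = 19_504 pixels²
side = int(s2 ** 0.5)   # largest square window ≈ 139 × 139 pixels
print(s2, side)
```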
The modified OpenPose method adjusts the size format of the input data to the length and width of the window, and outputs the coordinates of the target information converted according to the origin coordinates.
While the invention has been described in further detail with reference to specific embodiments, the invention is not limited to these embodiments; for those skilled in the art to which the invention pertains, extensions, operational methods, and data substitutions made on the basis of the technical solution of the invention shall all fall within the protection scope of the invention.

Claims (4)

1. A human body continuous action rapid analysis method based on local recognition, characterized by mainly comprising rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-identification of coordinate points, feature sequence construction, and regression prediction model establishment and invocation, and specifically comprising the following steps:
Step 1: collecting, with a mobile phone, video stream data of a person swinging the rope and jumping with both feet while facing the camera, and recording and storing basic information of the target subject, including name, sex, age, height, weight, and the name of the action to be performed;
Step 2: preprocessing the video data and performing pose estimation on the human body in each frame of the video to obtain keypoint coordinates, comprising the following steps:
Step A1: converting video data shot by different mobile phones to a uniform scale, with the video height set to 530 pixels and the width to 460 pixels;
Step A2: obtaining, using the OpenPose method, the coordinate positions of 14 joint points of the human body in each frame of the video, namely the nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, and left ankle, each coordinate being denoted CP_i = (x_i, y_i), where i runs from 1 to 14 and (x_i, y_i) are the coordinates of a human keypoint;
Step A3: OpenPose uses the Gaussian distance between a predicted keypoint and its ground truth to define the keypoint's confidence and normalizes the confidence to [0, 1]; this value is defined here as the keypoint score score_i, yielding the output inputs_i = (x_i, y_i, score_i);
Step A4: the scores of the 14 keypoints are averaged and their standard deviation is calculated, and the sum of the average and the standard deviation is taken as the score of the overall set of keypoints:
score_avg = (1/14) Σ_{i=1}^{14} score_i
score_std = √[ (1/14) Σ_{i=1}^{14} (score_i − score_avg)² ]
score_total = score_avg + score_std
Step 3: windowing and re-identifying keypoint positions with low prediction confidence using an image windowing technique, combining global information with the local window to improve keypoint prediction accuracy, comprising the following steps:
Step B1: using the inputs_i = (x_i, y_i, score_i) obtained from OpenPose, setting a threshold th on the score and finding the keypoints whose scores fall below th;
Step B2: opening a window around each keypoint whose score is below the threshold, and feeding the image patch inside the window into the OpenPose network with its input size modified;
Step B3: updating the keypoint coordinates obtained in the local window using the global information;
Step 4: to further improve the robustness of the algorithm to shooting angle, target distance, and jitter during recording, coordinate conversion is performed with the center of gravity of the three keypoints left hip C_1, right hip C_11, and neck C_8 taken as the origin, yielding the data
inputs_i′ = (x_i − cx_0, y_i − cy_0, score_i),
where (x_i − cx_0, y_i − cy_0) are the converted relative coordinates. The origin of coordinates is the center of gravity C_0 = (cx_0, cy_0) of the three points C_1, C_11, C_8, where
cx_0 = (x_1 + x_8 + x_11)/3
cy_0 = (y_1 + y_8 + y_11)/3.
All coordinate points are then updated with the origin coordinates as reference;
Step 5: accumulating the coordinate matrix inputs obtained from each frame into an accumulated coordinate matrix for each video segment, and segmenting the accumulated matrix with a sliding window, i.e., setting the sliding window length to the accumulated coordinates and scores of k frames, with a step size of 1;
Step 6: constructing a regression prediction model using a deep neural network, comprising the following steps:
Step C1: constructing a network model by fusing an RNN with a CNN, i.e., fusing a bidirectional LSTM (BiLSTM) model with a model consisting of two convolution layers and a global pooling layer to build the deep neural network model;
Step C2: dividing the data into a training set and a test set, training the network model with the training set, and saving the pre-trained model;
Step C3: inputting the test data into the trained model to obtain the predicted result.
2. The human body continuous action rapid analysis method based on local recognition as claimed in claim 1, wherein in step 6 the BiLSTM is a bidirectional LSTM formed by stacking two LSTMs, with the output determined jointly by the states of the two: one recurrent network computes the hidden vector from front to back, h^f_t = LSTM(x_t, h^f_{t−1}); the other recurrent neural network computes the hidden vector from back to front, h^b_t = LSTM(x_t, h^b_{t+1}); and the final output combines the two, y_t = [h^f_t ; h^b_t];
the first convolution layer is a one-dimensional convolution with a kernel of size 5 and added regularization, the second convolution layer has a kernel of size 3 with regularization, the third layer is a global pooling layer, and the numbers of filters in the convolutions are 64 and 32, respectively.
3. The human body continuous action rapid analysis method based on local recognition as claimed in claim 1, wherein a Dropout layer is added to the BiLSTM layer; to realize linear regression, the activation function of the last layer is set to a linear activation function, and the mean squared error (MSE) regression loss is selected as the loss function;
to accelerate convergence of the network model and control overfitting, a batch normalization layer is added to each convolution layer, whose computation is as follows:
μ_B = (1/m) Σ_{i=1}^{m} z_i
σ²_B = (1/m) Σ_{i=1}^{m} (z_i − μ_B)²
ẑ_i = (z_i − μ_B) / √(σ²_B + ε)
h_i = γ·ẑ_i + β
where B = {z_1, …, z_m} is the input of the batch, m is the batch size, μ_B is the mean of the batch data, σ²_B is the variance of the batch, ẑ_i is the normalized result, ε is a very small constant, h_i is the result after scaling and shifting, and γ and β are parameters of the network model learned during training;
when step 6 is performed, the convolution layer is expressed as
x^l_j = f( Σ_{i∈M_j} x^{l−1}_i * w^l_{ij} + bias^l_j ),
where l is the layer index, x^l_j is the output of the jth neuron of layer l, x^{l−1}_i is the ith input of layer l, * denotes the convolution, w_{ij} is the convolution kernel, bias is the bias term, M_j is the set of input feature maps, and f(·) denotes an activation function;
and k consecutive sub-segments are given so that the regression prediction output gives the keypoint coordinates and scores of the next frame; the total score is obtained according to step 2, and the predicted frame is recursively appended to the current sequence to predict the data of the following frame.
4. The method as claimed in claim 1, wherein the image windowing technique highlights the required target information in the original image in order to detect the coordinate position of the keypoint within the window;
when step 3 is executed, the area of the window is S, and the coordinates of the four corner points of the window, at the lower left, upper left, lower right, and upper right, are ld(kx_1, ky_1), lu(kx_2, ky_2), rd(kx_3, ky_3), and ru(kx_4, ky_4);
the maximum area of the image window is determined by the processing speed d of the CPU and the total operation amount m of the software fusion algorithm: the larger d and the smaller m, the larger the permissible window area, and conversely the smaller the area; with the background image area denoted S_1, the maximum window area is S_2 = d·S_1/25;
the modified OpenPose method adjusts the size format of the input data to the length and width of the window, and outputs the coordinates of the target information converted according to the origin coordinates.
CN201911216130.8A 2019-12-02 2019-12-02 Human body continuous action rapid analysis method based on local recognition Active CN110956141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911216130.8A CN110956141B (en) 2019-12-02 2019-12-02 Human body continuous action rapid analysis method based on local recognition


Publications (2)

Publication Number Publication Date
CN110956141A (en) 2020-04-03
CN110956141B (en) 2023-02-28

Family

ID=69979435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911216130.8A Active CN110956141B (en) 2019-12-02 2019-12-02 Human body continuous action rapid analysis method based on local recognition

Country Status (1)

Country Link
CN (1) CN110956141B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037310A (en) * 2020-08-27 2020-12-04 成都先知者科技有限公司 Game character action recognition generation method based on neural network
CN112569564A (en) * 2020-11-20 2021-03-30 融梦科技发展(北京)有限公司 Rope skipping detection method, device and system and storage medium
CN113095248B (en) * 2021-04-19 2022-10-25 中国石油大学(华东) Technical action correcting method for badminton
CN114596451B (en) * 2022-04-01 2022-11-11 此刻启动(北京)智能科技有限公司 Body fitness testing method and device based on AI vision and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107643759A (en) * 2016-07-22 2018-01-30 Autonomous system with target tracking and positioning from moving images captured by a drone
CN109708627A (en) * 2018-11-15 2019-05-03 Rapid detection method for spatially dynamic point targets from a moving platform
CN110503077A (en) * 2019-08-29 2019-11-26 Vision-based real-time human body action analysis method


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Human behavior analysis based on skeleton model; Zhu Lingfei et al.; Electronic Measurement Technology; 2019-04-23 (No. 08); full text *

Also Published As

Publication number Publication date
CN110956141A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110956141B (en) Human body continuous action rapid analysis method based on local recognition
CN110956139B (en) Human motion analysis method based on time sequence regression prediction
CN109460702B (en) Passenger abnormal behavior identification method based on human body skeleton sequence
WO2018120964A1 (en) Posture correction method based on depth information and skeleton information
CN110991340B (en) Human body action analysis method based on image compression
CN110503077B (en) Real-time human body action analysis method based on vision
Chaudhari et al. Yog-guru: Real-time yoga pose correction system using deep learning methods
CN109753891A (en) Football player's orientation calibration method and system based on human body critical point detection
CN110575663B (en) Physical education auxiliary training method based on artificial intelligence
CN111753747B (en) Violent motion detection method based on monocular camera and three-dimensional attitude estimation
CN108399435A (en) A kind of video classification methods based on sound feature
CN111476161A (en) Somatosensory dynamic gesture recognition method fusing image and physiological signal dual channels
CN110991268B (en) Depth image-based Parkinson hand motion quantization analysis method and system
CN106548194B (en) The construction method and localization method of two dimensional image human joint points location model
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN114783611B (en) Neural recovered action detecting system based on artificial intelligence
CN112200074A (en) Attitude comparison method and terminal
Tsai et al. Enhancing accuracy of human action Recognition System using Skeleton Point correction method
CN109993116B (en) Pedestrian re-identification method based on mutual learning of human bones
CN115346272A (en) Real-time tumble detection method based on depth image sequence
CN111046715A (en) Human body action comparison analysis method based on image retrieval
CN117137435A (en) Rehabilitation action recognition method and system based on multi-mode information fusion
Nandyal et al. Cricket event recognition and classification from umpire action gestures using convolutional neural network
Zhang et al. Automatic construction and extraction of sports moment feature variables using artificial intelligence
Zhang et al. Multi-STMT: Multi-level Network for Human Activity Recognition Based on Wearable Sensors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant