CN110956141A - Human body continuous action rapid analysis method based on local recognition - Google Patents

Human body continuous action rapid analysis method based on local recognition

Info

Publication number
CN110956141A
Authority
CN
China
Prior art keywords
coordinate
layer
human body
score
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911216130.8A
Other languages
Chinese (zh)
Other versions
CN110956141B (en)
Inventor
赵红领
李润知
崔莉亚
刘皓东
王菁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN201911216130.8A
Publication of CN110956141A
Application granted
Publication of CN110956141B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for rapidly analyzing continuous human body actions based on local recognition. The method mainly comprises rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-recognition of coordinate points, feature sequence construction, regression prediction model establishment and model invocation. Video stream data of a person performing front-facing two-foot rope-skipping jumps is captured with a mobile phone; basic information about the target subject, including name, gender, age, height, weight and the name of the performed action, is recorded and stored; and a regression prediction model is constructed using a deep neural network.

Description

Human body continuous action rapid analysis method based on local recognition
Technical Field
The invention relates to the technical field of human motion analysis, and in particular to a method for rapidly analyzing continuous human body actions based on local recognition.
Background
With the development and application of computer technology and machine learning in recent years, video-based time-series human motion analysis has emerged rapidly and attracted wide attention. Human action analysis over video time series remains a challenging problem in computer vision; it involves multiple disciplines such as image processing, pattern recognition and machine learning, and has broad application prospects in intelligent monitoring, human-computer interaction, rehabilitation exercise, physical-training evaluation and other fields.
When processing input images with blurred features and high ambiguity, traditional neural networks are easily misled by irrelevant information in the image, making a high recognition rate difficult to guarantee.
Existing video-based human behavior recognition algorithms suffer from high complexity, poor robustness and low precision. In addition, little research has addressed regression-based predictive analysis of human actions. The invention therefore provides a human action analysis method based on time-series regression prediction with high robustness and stable time overhead, which is of significance for modeling human actions, evaluating action quality and discovering the potential of athletes.
Disclosure of Invention
In view of the above, and to overcome the defects of the prior art, the present invention aims to provide a method for rapidly analyzing continuous human body actions based on local recognition, which addresses the low accuracy of long-horizon regression prediction of actions during exercise training.
The technical solution is a method for rapidly analyzing continuous human body actions based on local recognition, mainly comprising rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-recognition of coordinate points, feature sequence construction, regression prediction model establishment and model invocation, and specifically comprising the following steps:
Step 1: collect video stream data of the subject performing front-facing rope-swing two-foot jumps with a mobile phone, and record and store the subject's basic information, including name, sex, age, height, weight and the name of the performed action;
Step 2: preprocess the video data and perform pose estimation on the human body in each frame of the video to obtain keypoint coordinates, as follows:
Step A1: convert video data shot by different mobile phones to a uniform scale, setting the video height to 530 pixels and the width to 460 pixels;
Step A2: use the OpenPose method to obtain, for each frame, the coordinate positions of 14 joint points of the human body: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee and left ankle, each coordinate being denoted $CP_i = (x_i, y_i)$, where $i$ runs from 1 to 14 and $(x_i, y_i)$ are the coordinates of a human keypoint;
Step A3: OpenPose uses the Gaussian distance between a predicted keypoint and its ground truth to define the keypoint's confidence and normalizes the confidence to [0, 1]; this value is defined here as the keypoint's score, giving the output $inputs_i = (x_i, y_i, score_i)$;
Step A4: average the scores of the 14 keypoints and compute their standard deviation, taking the sum of the mean and the standard deviation as the score of the overall keypoints:
$score_{mean} = \frac{1}{14} \sum_{i=1}^{14} score_i$
$score_{std} = \sqrt{\frac{1}{14} \sum_{i=1}^{14} (score_i - score_{mean})^2}$
$score_{total} = score_{mean} + score_{std}$
Step 3: re-recognize keypoints whose predicted confidence is low by means of an image windowing technique, combining global information with local windows to improve keypoint prediction accuracy, as follows:
Step B1: from the OpenPose output $inputs_i = (x_i, y_i, score_i)$, set a threshold th on the score and find the keypoints whose score is below th;
Step B2: open a window around each keypoint below the threshold and feed the image patch inside the window into the input-modified OpenPose network;
Step B3: update the keypoint coordinates obtained in the local window using the global information;
Step 4: to further improve the robustness of the algorithm against shooting angle, target distance and camera shake during recording, take the left hip $C_1$, the right hip $C_{11}$ and the neck $C_8$ as reference keypoints and convert all coordinates using the center of gravity of these three points as the origin, obtaining the data
$inputs_i' = (x_i - cx_0, y_i - cy_0, score_i)$,
i.e. the relative coordinates after conversion. The coordinate origin is the center of gravity $C_0 = (cx_0, cy_0)$ of the three points $C_1$, $C_{11}$, $C_8$, where
$cx_0 = \frac{cx_1 + cx_{11} + cx_8}{3}$, $cy_0 = \frac{cy_1 + cy_{11} + cy_8}{3}$,
and all coordinate points are then updated with respect to the origin coordinates;
Step 5: accumulate the coordinate matrices obtained from each frame into an accumulated coordinate matrix for each video segment, and segment this matrix with a sliding window, i.e. set the sliding-window length to the accumulated coordinates and scores of k frames and the step size to 1;
Step 6: construct a regression prediction model with a deep neural network, as follows:
Step C1: build the network model by fusing a CNN with an RNN, i.e., fuse a bidirectional LSTM (BiLSTM) model with a model consisting of two convolution layers and a global pooling layer to construct the deep neural network model;
Step C2: divide the data into a training set and a test set, train the network model on the training set, and save the pre-trained model;
Step C3: feed the test data into the trained model to obtain the prediction result.
Due to the adoption of the above technical scheme, the invention has the following advantages over the prior art:
1. Through rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-recognition of coordinate points, feature sequence construction, regression prediction model establishment and model invocation, the invention solves the problems of low keypoint confidence during pose estimation and of long-horizon action analysis and athletic-potential discovery during exercise training, providing a reference basis for accurate action analysis during exercise.
Drawings
FIG. 1 is a schematic diagram of the single-swing two-foot jump analysis in the preferred embodiment of the present invention;
FIG. 2 is a diagram of the 14-point human body pose estimate during rope skipping;
FIG. 3 is a view of windowing at a local position of the human body;
FIG. 4 is a diagram of the coordinate transformation of human body keypoints in a rectangular coordinate system;
FIG. 5 is a diagram of the overall network architecture;
FIG. 6 is a diagram of the visualization results of each layer of the network.
Detailed Description
The foregoing and other aspects, features and advantages of the invention will be apparent from the following more particular description of embodiments, illustrated in the accompanying FIGS. 1 to 6. The structural contents mentioned in the following embodiments all refer to the accompanying drawings of the specification.
A method for rapidly analyzing continuous human body actions based on local recognition mainly comprises rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-recognition of coordinate points, feature sequence construction, regression prediction model establishment and model invocation, and specifically comprises the following steps:
Step 1: collect video stream data of the subject performing front-facing rope-swing two-foot jumps with a mobile phone, and record and store the subject's basic information, including name, sex, age, height, weight and the name of the performed action;
Step 2: preprocess the video data and perform pose estimation on the human body in each frame of the video to obtain keypoint coordinates, as follows:
Step A1: convert video data shot by different mobile phones to a uniform scale, setting the video height to 530 pixels and the width to 460 pixels;
Step A2: use the OpenPose method to obtain, for each frame, the coordinate positions of 14 joint points of the human body: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee and left ankle, each coordinate being denoted $CP_i = (x_i, y_i)$, where $i$ runs from 1 to 14 and $(x_i, y_i)$ are the coordinates of a human keypoint;
Step A3: OpenPose uses the Gaussian distance between a predicted keypoint and its ground truth to define the keypoint's confidence and normalizes the confidence to [0, 1]; this value is defined here as the keypoint's score, giving the output $inputs_i = (x_i, y_i, score_i)$;
Step A4: average the scores of the 14 keypoints and compute their standard deviation, taking the sum of the mean and the standard deviation as the score of the overall keypoints (a computational sketch follows):
$score_{mean} = \frac{1}{14} \sum_{i=1}^{14} score_i$
$score_{std} = \sqrt{\frac{1}{14} \sum_{i=1}^{14} (score_i - score_{mean})^2}$
$score_{total} = score_{mean} + score_{std}$
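For illustration, a minimal sketch of the Step A4 aggregation follows, assuming the per-frame OpenPose output is held as a NumPy array of (x, y, score) rows; the array layout and the demo values are assumptions for illustration, not part of the patent.

```python
import numpy as np

def frame_score(inputs: np.ndarray) -> float:
    """Score one frame: mean + standard deviation of the 14 keypoint scores.

    inputs: array of shape (14, 3) whose rows are (x_i, y_i, score_i).
    """
    scores = inputs[:, 2]
    # np.std defaults to the population form (divide by 14), matching the formula above.
    return float(scores.mean() + scores.std())

# Demo with random keypoints inside a 460 x 530 frame.
rng = np.random.default_rng(0)
demo = np.column_stack([rng.uniform(0, 460, 14),
                        rng.uniform(0, 530, 14),
                        rng.uniform(0, 1, 14)])
print(frame_score(demo))
```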
And step 3: windowing and re-identifying the low predicted position of the key point by utilizing an image windowing technology, and improving the prediction accuracy of the key point by utilizing global information and local parts, wherein the method comprises the following steps:
step B1, obtaining inputs by using Open-posi=(xi,yi,scorei) Setting a threshold th for the score, and finding out key points smaller than the threshold th;
b2, windowing the key points smaller than the threshold around the key points, and putting the image frames in the windows into the Open-pos network with modified input;
step B3, updating the key point coordinates obtained in the local frame by using the global information;
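A sketch of this windowed re-recognition loop follows. The helper `estimate_pose` is a hypothetical wrapper around OpenPose returning a (14, 3) array of (x, y, score) rows, and the window half-size `r` and threshold `th` defaults are illustrative values not specified in the patent.

```python
import numpy as np

def rerun_low_score_keypoints(frame, inputs, estimate_pose, th=0.4, r=48):
    """Re-run pose estimation inside local windows around low-confidence keypoints."""
    refined = inputs.copy()
    h, w = frame.shape[:2]
    for i, (x, y, score) in enumerate(inputs):
        if score >= th:
            continue                                   # Step B1: keep confident keypoints
        # Step B2: open a window centred on the low-confidence keypoint.
        x0, x1 = max(0, int(x) - r), min(w, int(x) + r)
        y0, y1 = max(0, int(y) - r), min(h, int(y) + r)
        local = estimate_pose(frame[y0:y1, x0:x1])
        lx, ly, lscore = local[i]
        if lscore > score:
            # Step B3: map the window-local coordinate back into global coordinates.
            refined[i] = (x0 + lx, y0 + ly, lscore)
    return refined
```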
Step 4: to further improve the robustness of the algorithm against shooting angle, target distance and camera shake during recording, take the left hip $C_1$, the right hip $C_{11}$ and the neck $C_8$ as reference keypoints and convert all coordinates using the center of gravity of these three points as the origin, obtaining the data
$inputs_i' = (x_i - cx_0, y_i - cy_0, score_i)$,
i.e. the relative coordinates after conversion. The coordinate origin is the center of gravity $C_0 = (cx_0, cy_0)$ of the three points $C_1$, $C_{11}$, $C_8$, where
$cx_0 = \frac{cx_1 + cx_{11} + cx_8}{3}$, $cy_0 = \frac{cy_1 + cy_{11} + cy_8}{3}$,
and all coordinate points are then updated with respect to the origin coordinates (see the sketch below);
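The coordinate conversion of Step 4 can be sketched as follows; the mapping of the labels $C_1$, $C_{11}$, $C_8$ to 0-based array rows 0, 10 and 7 is an assumption about how the keypoint array is ordered.

```python
import numpy as np

def to_relative_coords(inputs: np.ndarray) -> np.ndarray:
    """Re-express all keypoints relative to the hip/neck centre of gravity.

    inputs: (14, 3) array of (x, y, score); returns an array of the same shape.
    """
    ref = inputs[[0, 10, 7], :2]        # left hip C_1, right hip C_11, neck C_8 (assumed rows)
    cx0, cy0 = ref.mean(axis=0)         # centre of gravity C_0 = (cx_0, cy_0)
    out = inputs.copy()
    out[:, 0] -= cx0                    # x_i' = x_i - cx_0
    out[:, 1] -= cy0                    # y_i' = y_i - cy_0
    return out
```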
Step 5: accumulate the coordinate matrices obtained from each frame into an accumulated coordinate matrix for each video segment, and segment this matrix with a sliding window, i.e. set the sliding-window length to the accumulated coordinates and scores of k frames and the step size to 1 (see the sketch below);
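A sketch of this sliding-window segmentation, assuming k = 30 frames per window (the patent leaves k unspecified):

```python
import numpy as np

def sliding_windows(frames: list, k: int = 30) -> np.ndarray:
    """Cut a video's per-frame (14, 3) arrays into overlapping k-frame windows, stride 1.

    Returns an array of shape (T - k + 1, k, 14, 3) for T input frames.
    """
    seq = np.stack(frames)              # (T, 14, 3) accumulated coordinate matrix
    assert len(seq) >= k, "video segment shorter than the window length"
    return np.stack([seq[t:t + k] for t in range(len(seq) - k + 1)])
```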
Step 6: construct a regression prediction model with a deep neural network, as follows:
Step C1: build the network model by fusing a CNN with an RNN, i.e., fuse a bidirectional LSTM (BiLSTM) model with a model consisting of two convolution layers and a global pooling layer to construct the deep neural network model (see the model sketch after these steps);
Step C2: divide the data into a training set and a test set, train the network model on the training set, and save the pre-trained model;
Step C3: feed the test data into the trained model to obtain the prediction result.
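A minimal Keras sketch of the Step C1 architecture follows. The kernel sizes (5 and 3), filter counts (64 and 32), global pooling, Dropout, batch normalization, linear output and MSE loss come from the description below; the LSTM width, dropout rate, L2 weight and input shape (k frames × 42 features, i.e. 14 keypoints × 3 values) are assumptions, and the patent's "5 × 5" and "3 × 3" kernels are rendered here as 1-D kernels of widths 5 and 3 to match its "one-dimensional convolution" wording.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(k: int = 30, features: int = 42) -> keras.Model:
    inp = keras.Input(shape=(k, features))

    # RNN branch: bidirectional LSTM with a Dropout layer.
    rnn = layers.Bidirectional(layers.LSTM(64))(inp)
    rnn = layers.Dropout(0.5)(rnn)

    # CNN branch: two regularized 1-D convolutions, each followed by batch
    # normalization, then a global pooling layer.
    cnn = layers.Conv1D(64, 5, padding="same",
                        kernel_regularizer=regularizers.l2(1e-4))(inp)
    cnn = layers.BatchNormalization()(cnn)
    cnn = layers.Activation("relu")(cnn)
    cnn = layers.Conv1D(32, 3, padding="same",
                        kernel_regularizer=regularizers.l2(1e-4))(cnn)
    cnn = layers.BatchNormalization()(cnn)
    cnn = layers.Activation("relu")(cnn)
    cnn = layers.GlobalAveragePooling1D()(cnn)

    # Fuse the two branches and regress with a linear activation and MSE loss.
    out = layers.Dense(1, activation="linear")(layers.concatenate([rnn, cnn]))
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

# Usage sketch: model = build_model(); model.fit(X_train, y_train, validation_split=0.1)
```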
In step 6, the BiLSTM is a bidirectional LSTM formed by two LSTMs stacked one above the other, the output being jointly determined by the states of both: one recurrent network computes the hidden vector from front to back, $\overrightarrow{h_t}$, another recurrent neural network computes the hidden vector from back to front, $\overleftarrow{h_t}$, and the final output is their combination $y_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$;
The first convolution layer is a one-dimensional convolution with a 5 × 5 kernel and added regularization, the second convolution layer has a 3 × 3 kernel with added regularization, the third layer is a global pooling layer, and the numbers of filters in the two convolutions are 64 and 32, respectively;
A Dropout layer is added to the BiLSTM layer; to realize linear regression, the activation function of the last layer is set to a linear activation function, and the mean square error (MSE) regression loss is selected as the loss function. To accelerate convergence of the network model and control overfitting, a batch normalization layer is added to each convolution layer, computed as follows:
$\mu_B = \frac{1}{m} \sum_{i=1}^{m} z_i$
$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu_B)^2$
$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
$h_i = \gamma \hat{z}_i + \beta$
where $B = \{z_1, \ldots, z_m\}$ is the batch input, $m$ is the batch size, $\mu_B$ is the mean of the batch data, $\sigma_B^2$ is the variance of the batch, $\hat{z}_i$ is the normalized result, $\epsilon$ is a very small constant, $h_i$ is the result of scaling and shifting, and $\gamma$ and $\beta$ are parameters learned by the network model (a numerical walk-through follows);
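A numerical walk-through of these batch-normalization equations, assuming $\gamma = 1$ and $\beta = 0$ for illustration (in the network they are learned):

```python
import numpy as np

z = np.array([0.2, 0.8, 0.5, 0.9])            # one batch B = {z_1, ..., z_m}, m = 4
mu_B = z.mean()                                # batch mean mu_B = 0.6
var_B = z.var()                                # batch variance sigma_B^2 = 0.075
z_hat = (z - mu_B) / np.sqrt(var_B + 1e-5)     # normalized result
gamma, beta = 1.0, 0.0                         # learned scale and shift (assumed here)
h = gamma * z_hat + beta                       # scaled and shifted output
print(h)                                       # approx. [-1.46  0.73 -0.37  1.10]
```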
On the basis of the above scheme, when step 6 is performed the convolution layer is expressed in the form
$x_j^l = f\left( \sum_{i \in M_j} x_i^{l-1} * w_{ij} + bias \right)$
where $l$ is the layer index, $x_j^l$ is the output of the $j$-th neuron in layer $l$, $x_i^{l-1}$ is the $i$-th input to layer $l$, $*$ denotes convolution, $w_{ij}$ is the convolution kernel, $bias$ is the bias term, $M_j$ is the set of input feature maps, and $f(\cdot)$ denotes the activation function;
Given k consecutive sub-segments, the regression prediction outputs the keypoint coordinates and scores of the next frame; the total score is then obtained according to step 2, and the predicted frame is appended to the current sequence recursively to predict the data of the frame after it;
The image windowing technique highlights the required target information within the original image; its purpose is to detect the coordinate position of the keypoint inside the window;
When step 3 is executed, the window area is $S$ and the coordinates of the four corner points of the window are $ld(kx_1, ky_1)$, $lu(kx_2, ky_2)$, $rd(kx_3, ky_3)$ and $ru(kx_4, ky_4)$;
The maximum image-windowing area is determined by the processing speed $d$ of the CPU and the total operation amount $m$ of the software fusion algorithm: the larger $d$ and the smaller $m$, the larger the windowing area, and vice versa. With the background image area denoted $S_1$, the maximum window area is $S_2 = dS_1/25$;
The modified OpenPose method adjusts the size format of the input data to the length and width of the window, and outputs the coordinates of the target information converted according to the origin coordinates.
While the invention has been described in further detail with reference to specific embodiments, the invention is not limited to these embodiments; for those skilled in the art to which the invention pertains, extensions, operational methods and data substitutions made on the basis of the technical solution of the invention shall all fall within the protection scope of the invention.

Claims (4)

1. A method for rapidly analyzing continuous human body actions based on local recognition, characterized in that it mainly comprises rope-skipping video acquisition, video data preprocessing, coordinate point acquisition, windowed re-recognition of coordinate points, feature sequence construction, regression prediction model establishment and model invocation, and specifically comprises the following steps:
Step 1: collect video stream data of the subject performing front-facing rope-swing two-foot jumps with a mobile phone, and record and store the subject's basic information, including name, sex, age, height, weight and the name of the performed action;
Step 2: preprocess the video data and perform pose estimation on the human body in each frame of the video to obtain keypoint coordinates, as follows:
Step A1: convert video data shot by different mobile phones to a uniform scale, setting the video height to 530 pixels and the width to 460 pixels;
Step A2: use the OpenPose method to obtain, for each frame, the coordinate positions of 14 joint points of the human body: nose, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee and left ankle, each coordinate being denoted $CP_i = (x_i, y_i)$, where $i$ runs from 1 to 14 and $(x_i, y_i)$ are the coordinates of a human keypoint;
Step A3: OpenPose uses the Gaussian distance between a predicted keypoint and its ground truth to define the keypoint's confidence and normalizes the confidence to [0, 1]; this value is defined here as the keypoint's score, giving the output $inputs_i = (x_i, y_i, score_i)$;
Step A4: average the scores of the 14 keypoints and compute their standard deviation, taking the sum of the mean and the standard deviation as the score of the overall keypoints:
$score_{mean} = \frac{1}{14} \sum_{i=1}^{14} score_i$
$score_{std} = \sqrt{\frac{1}{14} \sum_{i=1}^{14} (score_i - score_{mean})^2}$
$score_{total} = score_{mean} + score_{std}$
Step 3: re-recognize keypoints whose predicted confidence is low by means of an image windowing technique, combining global information with local windows to improve keypoint prediction accuracy, as follows:
Step B1: from the OpenPose output $inputs_i = (x_i, y_i, score_i)$, set a threshold th on the score and find the keypoints whose score is below th;
Step B2: open a window around each keypoint below the threshold and feed the image patch inside the window into the input-modified OpenPose network;
Step B3: update the keypoint coordinates obtained in the local window using the global information;
Step 4: to further improve the robustness of the algorithm against shooting angle, target distance and camera shake during recording, take the left hip $C_1$, the right hip $C_{11}$ and the neck $C_8$ as reference keypoints and convert all coordinates using the center of gravity of these three points as the origin, obtaining the data
$inputs_i' = (x_i - cx_0, y_i - cy_0, score_i)$,
i.e. the relative coordinates after conversion. The coordinate origin is the center of gravity $C_0 = (cx_0, cy_0)$ of the three points $C_1$, $C_{11}$, $C_8$, where
$cx_0 = \frac{cx_1 + cx_{11} + cx_8}{3}$, $cy_0 = \frac{cy_1 + cy_{11} + cy_8}{3}$,
and all coordinate points are then updated with respect to the origin coordinates;
Step 5: accumulate the coordinate matrices obtained from each frame into an accumulated coordinate matrix for each video segment, and segment this matrix with a sliding window, i.e. set the sliding-window length to the accumulated coordinates and scores of k frames and the step size to 1;
Step 6: construct a regression prediction model with a deep neural network, as follows:
Step C1: build the network model by fusing a CNN with an RNN, i.e., fuse a bidirectional LSTM (BiLSTM) model with a model consisting of two convolution layers and a global pooling layer to construct the deep neural network model;
Step C2: divide the data into a training set and a test set, train the network model on the training set, and save the pre-trained model;
Step C3: feed the test data into the trained model to obtain the prediction result.
2. The method as claimed in claim 1, characterized in that in step 6 the BiLSTM is a bidirectional LSTM formed by two LSTMs stacked one above the other, the output being jointly determined by the states of both: one recurrent network computes the hidden vector from front to back, $\overrightarrow{h_t}$, another recurrent neural network computes the hidden vector from back to front, $\overleftarrow{h_t}$, and the final output is their combination $y_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$;
the first convolution layer is a one-dimensional convolution with a 5 × 5 kernel and added regularization, the second layer has a 3 × 3 kernel with added regularization, the third layer is a global pooling layer, and the numbers of filters in the two convolutions are 64 and 32, respectively.
3. The method for rapidly analyzing continuous human body actions based on local recognition as claimed in claim 1, characterized in that a Dropout layer is added to the BiLSTM layer; to realize linear regression, the activation function of the last layer is set to a linear activation function, and the mean square error (MSE) regression loss is selected as the loss function;
to accelerate convergence of the network model and control overfitting, a batch normalization layer is added to each convolution layer, computed as follows:
$\mu_B = \frac{1}{m} \sum_{i=1}^{m} z_i$
$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (z_i - \mu_B)^2$
$\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
$h_i = \gamma \hat{z}_i + \beta$
where $B = \{z_1, \ldots, z_m\}$ is the batch input, $m$ is the batch size, $\mu_B$ is the mean of the batch data, $\sigma_B^2$ is the variance of the batch, $\hat{z}_i$ is the normalized result, $\epsilon$ is a very small constant, $h_i$ is the result of scaling and shifting, and $\gamma$ and $\beta$ are parameters learned by the network model;
when step 6 is performed, the convolution layer is expressed in the form
$x_j^l = f\left( \sum_{i \in M_j} x_i^{l-1} * w_{ij} + bias \right)$
where $l$ is the layer index, $x_j^l$ is the output of the $j$-th neuron in layer $l$, $x_i^{l-1}$ is the $i$-th input to layer $l$, $*$ denotes convolution, $w_{ij}$ is the convolution kernel, $bias$ is the bias term, $M_j$ is the set of input feature maps, and $f(\cdot)$ denotes the activation function;
given k consecutive sub-segments, the regression prediction outputs the keypoint coordinates and scores of the next frame; the total score is obtained according to step 2, and the predicted frame is appended to the current sequence recursively to predict the data of the frame after it.
4. The method as claimed in claim 1, characterized in that the image windowing technique highlights the required target information in the original image in order to detect the coordinate position of the keypoint within the window;
when step 3 is executed, the window area is $S$ and the coordinates of the four corner points of the window are $ld(kx_1, ky_1)$, $lu(kx_2, ky_2)$, $rd(kx_3, ky_3)$ and $ru(kx_4, ky_4)$;
the maximum image-windowing area is determined by the processing speed $d$ of the CPU and the total operation amount $m$ of the software fusion algorithm: the larger $d$ and the smaller $m$, the larger the windowing area, and vice versa; with the background image area denoted $S_1$, the maximum window area is $S_2 = dS_1/25$;
the modified OpenPose method adjusts the size format of the input data to the length and width of the window, and outputs the coordinates of the target information converted according to the origin coordinates.
CN201911216130.8A 2019-12-02 2019-12-02 Human body continuous action rapid analysis method based on local recognition Active CN110956141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911216130.8A CN110956141B (en) 2019-12-02 2019-12-02 Human body continuous action rapid analysis method based on local recognition


Publications (2)

Publication Number Publication Date
CN110956141A 2020-04-03
CN110956141B 2023-02-28

Family

ID=69979435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911216130.8A Active CN110956141B (en) 2019-12-02 2019-12-02 Human body continuous action rapid analysis method based on local recognition

Country Status (1)

Country Link
CN (1) CN110956141B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107643759A (en) * 2016-07-22 2018-01-30 鹦鹉无人机股份有限公司 From the autonomous system with target following and positioning of unmanned plane shooting mobile image
CN109708627A (en) * 2018-11-15 2019-05-03 上海航天控制技术研究所 A kind of moving platform down space dynamic point target rapid detection method
CN110503077A (en) * 2019-08-29 2019-11-26 郑州大学 A kind of real-time body's action-analysing method of view-based access control model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU Lingfei et al., "Human Behavior Analysis Based on a Skeleton Model," Electronic Measurement Technology (《电子测量技术》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112037310A (en) * 2020-08-27 2020-12-04 成都先知者科技有限公司 Game character action recognition generation method based on neural network
CN112569564A (en) * 2020-11-20 2021-03-30 融梦科技发展(北京)有限公司 Rope skipping detection method, device and system and storage medium
CN113095248A (en) * 2021-04-19 2021-07-09 中国石油大学(华东) Technical action correction method for badminton
CN114596451A (en) * 2022-04-01 2022-06-07 此刻启动(北京)智能科技有限公司 Body fitness testing method and device based on AI vision and storage medium

Also Published As

Publication number Publication date
CN110956141B (en) 2023-02-28

Similar Documents

Publication Publication Date Title
CN110956141B (en) Human body continuous action rapid analysis method based on local recognition
CN106650687B (en) Posture correction method based on depth information and skeleton information
CN110991340B (en) Human body action analysis method based on image compression
CN109460702B (en) Passenger abnormal behavior identification method based on human body skeleton sequence
CN110956139B (en) Human motion analysis method based on time sequence regression prediction
CN110503077B (en) Real-time human body action analysis method based on vision
Chaudhari et al. Yog-guru: Real-time yoga pose correction system using deep learning methods
CN110575663B (en) Physical education auxiliary training method based on artificial intelligence
Zhou et al. Learning to estimate 3d human pose from point cloud
CN111753747B (en) Violent motion detection method based on monocular camera and three-dimensional attitude estimation
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN106548194B (en) The construction method and localization method of two dimensional image human joint points location model
Jia et al. Two-stream temporal convolutional networks for skeleton-based human action recognition
Xu et al. Robust hand gesture recognition based on RGB-D Data for natural human–computer interaction
CN109325408A (en) A kind of gesture judging method and storage medium
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN117541994A (en) Abnormal behavior detection model and detection method in dense multi-person scene
CN111626152B (en) Space-time line-of-sight direction estimation prototype design method based on Few-shot
CN109993116B (en) Pedestrian re-identification method based on mutual learning of human bones
CN113408435B (en) Security monitoring method, device, equipment and storage medium
CN111046715A (en) Human body action comparison analysis method based on image retrieval
Hoang et al. Advances in skeleton-based fall detection in RGB videos: From handcrafted to deep learning approaches
CN117137435B (en) Rehabilitation action recognition method and system based on multi-mode information fusion
CN115205744A (en) Intelligent exercise assisting method and device for figure skating
CN113065504A (en) Behavior identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant