CN112257534B

CN112257534B - Method for estimating three-dimensional human body posture from video

Info

Publication number: CN112257534B
Application number: CN202011100735.3A
Authority: CN
Inventors: 刘晓平; 李书杰; 王冬; 沈子祺
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2022-08-09
Anticipated expiration: 2040-10-15
Also published as: CN112257534A

Abstract

The invention discloses a method for estimating three-dimensional human body posture from a video. The method comprises: acquiring a continuous sequence of video frames and extracting the character pose of each single frame with a first neural network model to obtain initialized two-dimensional human body joints for each frame; acquiring N consecutive frames of initialized two-dimensional human body joints and estimating three-dimensional human body joints from them with a second neural network model to obtain initialized three-dimensional human body joints; acquiring M consecutive frames of initialized two-dimensional human body joints and processing them with a third neural network model to obtain a three-dimensional human body joint filter associated with those joints; and denoising the initialized three-dimensional human body joints with the three-dimensional human body joint filter to obtain denoised three-dimensional human body joints. The method estimates a BVH-format human pose directly from video; because the BVH format gives a better visual result, the final pose estimate is better suited to driving animated characters.

Description

Method for estimating three-dimensional human body posture from video

Technical Field

The invention relates to the technical field of three-dimensional human body estimation, in particular to a method for estimating a three-dimensional human body posture from a video.

Background

Three-dimensional human body posture estimation is a technology that obtains the three-dimensional posture of a human body directly from a video. It is widely used for video evaluation and for analyzing human posture in video, is important for judging the motion state of a human body, and can greatly benefit video surveillance and the analysis of people in video.

Three-dimensional posture estimation networks almost always output data in a three-dimensional coordinate format. Once three-dimensional coordinates are used as the ground-truth data, the pose produced by the network is also expressed as three-dimensional coordinates. Such data lacks rigid-body constraints and is therefore unsuitable for driving an animated character, so existing networks cannot be adapted to that requirement.

Therefore, it is valuable to design a three-dimensional human pose network whose output, estimated directly from video, can drive an animated character.

Disclosure of Invention

To address these problems, the invention provides a method for estimating three-dimensional human body posture from a video. The method estimates a BVH-format human pose directly from video; because the BVH format gives a better visual result, the final pose estimate is better suited to driving animated characters, which effectively solves the problems described in the Background.

In order to achieve the purpose, the invention provides the following technical scheme: a method of estimating a three-dimensional human pose from a video, comprising:

acquiring a continuous video frame sequence, and extracting the character posture of a single frame based on a first neural network model to obtain an initialized two-dimensional human body joint of each frame;

acquiring continuous N frames of initialized two-dimensional human body joints, and extracting three-dimensional human body joint estimation of the N frames of initialized two-dimensional human body joints based on a second neural network model to obtain initialized three-dimensional human body joints;

acquiring continuous M frames of initialized two-dimensional human body joints, and performing three-dimensional joint filtering on the M frames of initialized two-dimensional human body joints based on a third neural network model to obtain a three-dimensional human body joint filter related to the initialized two-dimensional human body joints;

and denoising the initialized three-dimensional human body joint by using a three-dimensional human body joint filter to obtain a denoised three-dimensional human body joint.

As a preferred technical scheme, the training target of the denoised three-dimensional human body joint is a 51-channel BVH human body optimized posture;

the optimized posture of the BVH human body is obtained by bone optimization of the unoptimized posture of the BVH human body;

the unoptimized BVH human body pose contains 78 channels; the 51-channel optimized BVH human body pose, consisting of 51 bone rotation channels, is obtained by removing the useless bone rotation channels and the initial three-dimensional coordinates.

As a preferred technical solution of the present invention, the second neural network model includes a second main frame network and a second loss function;

the second main frame network comprises a joint expansion layer, an inter-frame joint association capturing layer, an inter-frame joint association extracting layer and an output layer;

the joint expansion layer expands the number of two-dimensional human body joints; the expanded joints are fed in turn to the inter-frame joint association capturing layer and the inter-frame joint association extracting layer, which extract multi-frame associations among the expanded joints; the extracted multi-frame associations between the two-dimensional human body joints are fed to the output layer to obtain initialized three-dimensional human body joints associated with the multi-frame two-dimensional human body pose; the training target of the initialized three-dimensional human body joints is the concatenation of the 51-channel BVH human body optimized pose, the three-dimensional coordinates of the root node, and the 51-channel three-dimensional skeleton coordinates;

in the 51-channel BVH human body optimized pose, Euler angles are used to describe bone rotation, and the second loss function uses a consistent-matching error offset calculation unit to obtain the difference d between the bone rotation estimation result θ_estimation and the bone rotation label data θ_label:

the bone rotation estimation result θ_estimation and the bone rotation label data θ_label are both rotation angles between 0 and 360 degrees, so that rotation angles are represented more comprehensively.

After the bone rotation difference is determined, a second loss function based on the Gaussian distribution is introduced to quickly find the bone rotations of the correct human pose, and the second loss function is determined as:

as a preferred technical solution of the present invention, the third neural network model includes a third main frame network and a third loss function,

the third main frame network comprises a joint expansion layer, an inter-frame joint noise capturing layer, an inter-frame joint noise extracting layer and an output layer;

the joint expansion layer expands the number of two-dimensional human body joints; the expanded joints are fed in turn to the inter-frame joint noise capturing layer and the inter-frame joint noise extracting layer, which capture the noise among multiple frames of the expanded joints; the captured multi-frame noise of the two-dimensional human body joints is fed to the output layer to obtain the denoised three-dimensional human body joints associated with the multi-frame two-dimensional human body pose; the training target of the third neural network model is the secondarily optimized 51-channel BVH human body optimized pose;

the secondarily optimized 51-channel BVH human body optimized pose is obtained by secondarily optimizing the 51 rotation channels of the 51-channel BVH human body optimized pose, through the following steps:

step one: reduce the value range of θ by letting θ′ = θ mod 360, so that θ′ ∈ [0°, 360°];

step two: if the first frame of channel i satisfies […], subtract 360 degrees from the value of every frame of channel i, that is, for the j-th frame of channel i, […];

step three: calculate the error E between the j-th and (j+1)-th frames of channel i, wherein […];

step four: if E is less than or equal to 180 degrees, do nothing; otherwise, expand or contract […] by 360 degrees;

step five: repeat step three and step four until E is less than or equal to 180 degrees for all […];

the input is the unprocessed BVH rotation channel θ and the output is the processed BVH rotation channel θ′; the optimization goal of the third neural network model is to bring the filtered result numerically close to the label data, and the secondary optimization follows from the periodicity of the Euler angle θ;

the final result after processing by the filter network is a combination of three parts: the unfiltered coordinate channels, the few rotation channels that are left unfiltered, and the majority of channels, which are filtered;

the third loss function uses the MSE to describe the distance; denoting the filtered result as θ_filtered and the label data as θ_label, at this time:

as a preferred technical solution of the present invention, the first neural network model processes the video with Mask R-CNN and CPN to obtain two-dimensional coordinate points, which are input to the second neural network model and the third neural network model; supervised learning with different label data then yields the initialized three-dimensional human body joints and the three-dimensional human body joint filter, respectively.

As a preferred technical solution of the present invention, the first neural network model processes a video to obtain an initialized two-dimensional human joint;

respectively inputting the initialized two-dimensional human body joint into the second neural network model and the third neural network model to respectively obtain an initialized three-dimensional human body joint and a three-dimensional human body joint filter;

the second neural network model uses a bone rotation loss function obtained from the consistent-matching error offset calculation unit, and the third neural network model uses the secondarily optimized 51-channel BVH human body optimized pose as its label data;

carrying out convolution processing on the initialized three-dimensional human body joint and the three-dimensional human body joint filter to obtain a denoised three-dimensional human body joint;

and the number of frames of initialized three-dimensional human body joints output by the second neural network model is kept equal to the number of frames of the three-dimensional human body joint filter output by the third neural network model; the group of three-dimensional coordinates is removed from the initialized three-dimensional human body joints, and the three-dimensional human body joint filter then filters the 51 bone rotation channels of the initialized three-dimensional human body joints.

As a preferred technical solution of the present invention, the inputting the initialized two-dimensional body joint into the second neural network model and the third neural network model respectively to obtain the initialized three-dimensional body joint and the three-dimensional body joint filter respectively includes:

the second neural network model takes 27 consecutive frames of initialized two-dimensional human body joints and outputs one frame of 105-channel initialized three-dimensional human body joints (the 105 channels comprise the 51-channel three-dimensional human body joint coordinates, the 3-channel three-dimensional coordinates of the root node, and the 51-channel BVH human body optimized pose);

the third neural network model takes 57 consecutive frames of initialized two-dimensional human body joints and outputs a 31-frame, 51-channel three-dimensional human body joint filter; it acquires consecutive input frames with a sliding window whose size is 27 and whose receptive field is 57, that is, 57 frames of initialized two-dimensional human body joints with 17 coordinate points each are input and a 31-frame, 51-channel three-dimensional human body joint filter is obtained;

the convolution processing of the initialized three-dimensional human body joint and the three-dimensional human body joint filter to obtain the denoised three-dimensional human body joint comprises the following steps:

acquiring 31 consecutive frames of initialized three-dimensional human body joints output by the second neural network model and removing their root three-dimensional coordinates to obtain 31 frames of the 51-channel BVH human body optimized pose, namely the noisy initialized three-dimensional human body joints;

and performing convolutional denoising on the noisy initialized three-dimensional human body joints with the 31-frame, 51-channel three-dimensional human body joint filter to obtain one frame of 51-channel denoised three-dimensional human body joints, namely a smooth human body optimized pose in BVH format.
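The frame counts above fit together arithmetically: sliding a 27-frame window over 57 frames gives 57 - 27 + 1 = 31 window positions, which matches the 31 frames of the filter. The Python sketch below checks that count and then applies a 31-frame, 51-channel filter to 31 noisy frames. It assumes that the convolution reduces to a per-channel weighted sum over time (stride 1, no padding, no kernel flipping); that reading, the helper names `window_starts` and `apply_filter`, and the sample data are illustrative assumptions rather than details given in the text.

```python
WINDOW = 27      # frames per estimate of the second model (from the text)
RECEPTIVE = 57   # frames consumed by the third model (from the text)
CHANNELS = 51    # bone rotation channels (from the text)


def window_starts(total_frames, window):
    """Start indices of every full window that fits in total_frames."""
    return list(range(total_frames - window + 1))


def apply_filter(noisy, filt):
    """Per-channel weighted sum over time.

    noisy and filt are lists of T frames, each a list of C channel values.
    Returns a single frame of C values.
    ASSUMPTION: the filter is applied without kernel flipping.
    """
    if len(noisy) != len(filt):
        raise ValueError("noisy frames and filter must have the same length")
    channels = len(noisy[0])
    return [
        sum(noisy[t][c] * filt[t][c] for t in range(len(noisy)))
        for c in range(channels)
    ]


starts = window_starts(RECEPTIVE, WINDOW)
frames = len(starts)  # 57 - 27 + 1 window positions

# Hypothetical data: a constant signal and a uniform averaging filter.
noisy = [[2.0] * CHANNELS for _ in range(frames)]
uniform = [[1.0 / frames] * CHANNELS for _ in range(frames)]
smoothed = apply_filter(noisy, uniform)
```

With a uniform filter the result is a plain temporal average; a learned filter would instead weight each frame and channel differently.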

As a preferred technical solution of the present invention, the second main frame network and the third main frame network are the same network;

the joint expansion layer of the second main frame network receives the initialized two-dimensional human body joints and expands the human body joints through a convolution kernel;

the inter-frame joint association capturing layer and the inter-frame joint association extracting layer use dilated convolution to capture the inter-frame associations of the expanded human body joints, thereby extracting the joint associations between input frames;

the output layer reduces the data dimension to output initialized three-dimensional human body joints that carry inter-frame associations;

the joint expansion layer of the third main frame network receives the initialized two-dimensional human body joints and expands the human body joints through a convolution kernel;

the inter-frame joint noise capturing layer and the inter-frame joint noise extracting layer use dilated convolution to capture the inter-frame joint noise of the expanded human body joints;

the output layer reduces the data dimension to output a three-dimensional human body joint filter suited to capturing the noise in the output of the second main frame network.
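As a rough illustration of how dilated convolution lets a short stack of layers span many frames, the sketch below computes the temporal receptive field of stacked one-dimensional convolutions. The kernel sizes and dilation rates are assumptions chosen for illustration, since the text does not state them; a kernel of 3 with dilations 1, 3 and 9 happens to span 27 frames, the window size given for the second neural network model.

```python
def receptive_field(kernel_sizes, dilations):
    """Temporal receptive field of stacked 1-D convolutions with stride 1.

    Each layer with kernel size k and dilation d widens the field by
    (k - 1) * d frames.
    """
    if len(kernel_sizes) != len(dilations):
        raise ValueError("one dilation rate is needed per layer")
    field = 1
    for k, d in zip(kernel_sizes, dilations):
        field += (k - 1) * d
    return field


# ASSUMED configuration, not taken from the patent text.
assumed_field = receptive_field([3, 3, 3], [1, 3, 9])
```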

Compared with the prior art, the invention has the beneficial effects that:

1. The method provides a network that estimates a BVH-format human pose directly from video. Of the three networks designed, the first neural network model obtains initialized two-dimensional human body joints directly from the video, which facilitates the subsequent BVH-format three-dimensional joint estimation and denoising.

2. With respect to the characteristics of the BVH format, the loss function uses a consistent-matching error offset calculation unit to deal specifically with the periodic discontinuity of bone rotation angles, and ensures that a good gradient descent speed is maintained even when the error is small. The three-dimensional human body joint filter of the third neural network model takes full account of the BVH format characteristics and refines the optimization target, so that it accurately captures the noise introduced when two-dimensional joints are converted to three-dimensional joints and denoises the initialized three-dimensional human body joints in a targeted manner; jitter, dislocation and inversion caused by bone rotation noise are thereby quickly removed, yielding a highly accurate BVH-format human pose. The second neural network model focuses on loss construction, so that its estimate reflects all joints more comprehensively, including both three-dimensional joint points and rotations; attending to the global three-dimensional joint information helps construct the three-dimensional key points and improves the accuracy of the estimated joint positions. The third neural network model focuses more on rotation errors. Working together, the two models produce a three-dimensional human pose with stable rotation.

3. Because the third neural network model processes its label data and uses a main frame network similar to that of the second neural network model, it can accurately correct those estimates of the second neural network model that deviate strongly from the label data. After the second neural network model extracts the inter-frame joint association information, a relatively complete BVH-format three-dimensional human pose is formed, and the third neural network model then further removes errors caused by bone rotation angles, such as jitter, dislocation and inversion.

4. The second and third neural network models use similar main frames. This gives a good denoising effect while reducing the number of parameters in the computation; model complexity is reduced without sacrificing the high precision of the BVH-format human pose, which improves practicality and further reduces network redundancy.

5. Experiments were carried out on the Human3.6M data set. Quantitative experiments show that the method achieves better accuracy, and qualitative experiments show that its output is smoother.

Drawings

FIG. 1 is a schematic diagram of the overall network structure of the method of the present invention;

FIG. 2 is a detailed diagram of a second network architecture of the present invention;

FIG. 3 is a diagram illustrating a second network architecture according to the present invention;

FIG. 4 is a diagram illustrating a third network architecture according to the present invention;

FIG. 5 is a diagram showing visualization results of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Existing three-dimensional human body posture estimation almost always outputs data in a three-dimensional coordinate format. Once three-dimensional coordinates are used as the ground-truth data, the pose produced by the network is also expressed as three-dimensional coordinates. Current networks therefore concentrate on training toward a human pose in three-dimensional coordinate format, yet such data lacks rigid-body constraints and is unsuitable for driving animated characters.

Therefore, existing networks cannot be adapted to the requirement of driving animated characters: they cannot directly generate data in a form that drives an animated character, which is an evident shortcoming.

A BVH file contains the bone and limb-joint rotation data of a character. BVH is a general-purpose character animation file format that is widely supported by popular animation software, and BVH files are typically obtained from motion capture hardware that records human movement. The first part of a BVH file defines the joint tree, the name of each joint, the number of channels, and the relative positions of the joints, that is, the bone lengths of the parts of the human body. Because human bodies differ, people of different ages and body proportions can be represented by changing the relative position values. Each joint has three rotation parameters (the rotation angles about the X, Y and Z axes) that describe its motion. The Hips joint, as the root joint of the whole body, additionally carries three-dimensional position parameters, which completes the description of the body's motion. The second part of the BVH file records the motion data: it defines the length of the motion (the number of frames) and the time interval between frames. Each frame of data follows the joint order defined in the first part and records the position and rotation information of each joint in that frame.
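To make the two-part layout concrete, the following Python sketch parses a deliberately tiny BVH file. The two-joint skeleton, its offsets and its motion values are made up for illustration, and `summarize_bvh` is a hypothetical helper that handles only this simple layout, not a full BVH parser.

```python
SAMPLE_BVH = """HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0.0 10.0 0.0
    CHANNELS 3 Zrotation Xrotation Yrotation
    End Site
    {
      OFFSET 0.0 10.0 0.0
    }
  }
}
MOTION
Frames: 2
Frame Time: 0.02
0.0 90.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0
0.0 90.0 0.0 1.0 0.0 0.0 6.0 0.0 0.0
"""


def summarize_bvh(text):
    """Return (joints, frame_count, frame_time, frames) for a simple BVH text."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    motion_at = lines.index("MOTION")

    # First part: joint tree with per-joint channel declarations.
    joints = []
    current = None
    for ln in lines[:motion_at]:
        parts = ln.split()
        if not parts:
            continue
        if parts[0] in ("ROOT", "JOINT"):
            current = parts[1]
        elif parts[0] == "CHANNELS":
            joints.append((current, int(parts[1]), parts[2:]))

    # Second part: frame count, frame interval, then one line per frame.
    frame_count = int(lines[motion_at + 1].split(":")[1])
    frame_time = float(lines[motion_at + 2].split(":")[1])
    frames = [[float(v) for v in ln.split()]
              for ln in lines[motion_at + 3:] if ln]
    return joints, frame_count, frame_time, frames


joints, frame_count, frame_time, frames = summarize_bvh(SAMPLE_BVH)
total_channels = sum(count for _, count, _ in joints)
```

Here the root joint Hips declares six channels (three position, three rotation) and the child joint declares three rotation channels, so every motion line carries nine values in the joint order of the first part.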

A human pose in BVH form drives an animated character better, because it explicitly records the rotation of the human joints and does not suffer from the lack of rigid-body constraints found in the three-dimensional coordinate form. Before this work, the applicant surveyed the prior art extensively and found that existing neural network models are not adapted to generating BVH human poses and pay almost no attention to the rotation information characteristic of BVH, so the BVH poses they generate fall far short of expectations. The present application is therefore designed as follows.

Example:

Referring to FIGS. 1 to 5, the present invention provides a technical solution: a method for estimating three-dimensional human body posture from a video.

Most existing three-dimensional human pose estimation networks output data in three-dimensional coordinate format, which is unsuitable for driving animated characters because it lacks rigid-body constraints. The overall design here is a method that specifically removes noise from the data: when a BVH-format human pose is estimated directly from video, the BVH pose carries rotation information, so internal noise must be eliminated promptly to reduce rotational interference between video frames and drive the animated character better.

Preferably, the training target of the denoised three-dimensional human body joint is a 51-channel BVH human body optimized posture;

The unoptimized BVH human body pose contains 78 channels; the 51-channel optimized BVH human body pose, consisting of 51 bone rotation channels, is obtained by removing the useless bone rotation channels and the initial three-dimensional coordinates. BVH (Biovision Hierarchy) is a data format that expresses the rigid-body character of a moving subject and is suitable for driving animated characters. It represents the human skeleton as a tree, describes bone rotation with Euler angles, and is the motion data format used by most animation software. The final denoised three-dimensional human body joints are obtained by computing initialized three-dimensional human body joints and denoising them. The video frames used in this application are taken from Human3.6M: the first neural network model obtains the initialized two-dimensional human body joints of the Human3.6M video frames, and the unoptimized BVH human body poses in the Human3.6M data undergo bone optimization to obtain the 51-channel optimized BVH human body poses with 51 bone rotation channels. Notably, both the second and the third neural network models are trained against these 51-channel optimized poses, so the final three-dimensional BVH human pose is close to the real BVH data while inter-frame association and inter-frame noise each receive targeted attention.

Preferably, the second neural network model comprises a second main frame network and a second loss function;

BVH uses Euler angles to describe bone rotation, but Euler angles are discontinuous, and this discontinuity disturbs network training; the mean squared error (MSE) therefore cannot be used directly as the loss function, as it is for the three-dimensional coordinate part. The discontinuity of Euler angles stems from their periodicity: rotation components that are actually similar can differ greatly in value because they lie in different periods. The key to correctly describing the distance d between the estimation result θ_estimation and the label data θ_label is therefore to describe the periodicity of the Euler angles, and the applicant has verified that a sine function satisfies this requirement. For the periodic Euler angle form, the second loss function of the method uses the consistent-matching error offset calculation unit to capture this periodicity, which is a major innovation of the method, and obtains the difference d between the bone rotation estimation result θ_estimation and the bone rotation label data θ_label:

The bone rotation estimation result θ_estimation and the bone rotation label data θ_label are both rotation angles between 0 and 360 degrees, so that rotation angles are represented more comprehensively; d is the distance calculated by the consistent-matching error offset calculation unit.
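The sketch below illustrates why a plain numeric difference misjudges Euler angles and how a sine-based distance avoids the problem. The exact expression for d is not reproduced in this text, so the form |sin((θ_estimation - θ_label) / 2)| used here is only an assumption, chosen because it is a sine function whose value repeats every 360 degrees of angular difference; the function names are likewise hypothetical.

```python
import math


def naive_difference(est_deg, label_deg):
    """Plain numeric gap between two angles given in degrees."""
    return abs(est_deg - label_deg)


def periodic_distance(est_deg, label_deg):
    """ASSUMED sine-based distance in [0, 1]; not the patent's own formula.

    It is 0 when the two angles describe the same rotation and 1 when
    they are 180 degrees apart, regardless of which period they lie in.
    """
    delta = math.radians(est_deg - label_deg)
    return abs(math.sin(delta / 2.0))


# 359 degrees and 1 degree are only 2 degrees apart as rotations.
naive_gap = naive_difference(359.0, 1.0)
periodic_gap = periodic_distance(359.0, 1.0)
reference_gap = periodic_distance(2.0, 0.0)
```

Under this assumed form, the pair (359°, 1°) receives the same small distance as the pair (2°, 0°), whereas the naive gap treats the first pair as almost a full turn apart.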

However, because the sine function reduces the value of d, using d directly as the loss function would make the network's gradient descent slow. The application therefore proposes a second loss function in the form of a Gaussian distribution, where μ is the mathematical expectation of the Gaussian distribution; with a small μ, a good gradient descent speed is maintained even when the error is small. Based on the Gaussian distribution, the second loss function quickly finds the bone rotations of the correct human pose, and is determined as:

preferably, the third neural network model comprises a third main frame network and a third loss function,

The second neural network model introduces noise when it estimates the three-dimensional human pose, so a third neural network model is designed to remove that noise. In view of the discontinuity of Euler angles, a special loss function form is used for the second neural network model. The third neural network model, however, cannot rely on a special loss function alone, because its optimization goal is to make the filtered result numerically close to the label data.

According to the periodicity of the Euler angle θ, the secondarily optimized 51-channel BVH human body optimized pose is obtained by secondarily optimizing the 51 rotation channels of the 51-channel BVH human body optimized pose, through the following steps:

Step one: reduce the value range of θ by letting θ′ = θ mod 360, so that θ′ ∈ [0°, 360°].

Step two: if the first frame of channel i satisfies […], subtract 360 degrees from the value of every frame of channel i.

Step three: calculate the error E between the j-th and (j+1)-th frames of channel i, wherein […].

Step four: if E is less than or equal to 180 degrees, do nothing; otherwise, expand or contract […] by 360 degrees.

Step five: repeat step three and step four until E is less than or equal to 180 degrees for all […].

Here the input is the unprocessed BVH rotation channel θ and the output is the processed BVH rotation channel θ′; the optimization goal of the third neural network model is to bring the filtered result numerically close to the label data, and the secondary optimization follows from the periodicity of the Euler angle θ.

In this process, for a very small portion of the training data the distance between the label data and the corresponding channel of the estimation result exceeds 180 degrees, and the third neural network model cannot denoise these channels effectively. Observation shows that they are concentrated in a fixed few channels whose values have a distinctive character: the channel waveform is close to a straight line, with values that stay at about 180 degrees. Large-amplitude noise rarely arises in these channels, so leaving them unfiltered does not cause serious jitter. The final BVH data after the filter network is therefore a combination of three parts: the unfiltered coordinate channels, the few rotation channels left unfiltered, and the majority of channels, which are filtered.

This preprocessing of the rotation component channels avoids the discontinuity defect of Euler angles and bounds the distance between adjacent frames. The rotation component values are changed only in accordance with their periodicity, and the distance between the label data and the estimation result is reduced by a fixed rule, so that the data suits the third neural network model.
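The preprocessing of one rotation channel can be sketched in Python as follows. The condition that triggers the whole-channel shift in step two is not reproduced in this text, so the threshold used here (first frame greater than 180 degrees) is an assumption, as is the function name `unwrap_channel`; the remaining steps follow the description, shifting a frame by 360 degrees whenever it differs from its predecessor by more than 180 degrees.

```python
def unwrap_channel(theta, first_frame_threshold=180.0):
    """Return a copy of one rotation channel with bounded frame-to-frame gaps.

    theta: list of Euler angle values in degrees, one per frame.
    first_frame_threshold: ASSUMED trigger for the step-two shift.
    """
    # Step one: reduce every value into one period.
    out = [t % 360.0 for t in theta]

    # Step two (ASSUMED condition): shift the whole channel down a period.
    if out and out[0] > first_frame_threshold:
        out = [t - 360.0 for t in out]

    # Steps three to five: compare each frame with its predecessor and
    # shift it by whole periods until the gap is at most 180 degrees.
    for j in range(len(out) - 1):
        while out[j + 1] - out[j] > 180.0:
            out[j + 1] -= 360.0
        while out[j] - out[j + 1] > 180.0:
            out[j + 1] += 360.0
    return out


raw = [350.0, 10.0, 30.0, 350.0]
processed = unwrap_channel(raw)
gaps = [abs(b - a) for a, b in zip(processed, processed[1:])]
```

Every processed value differs from its raw value by a multiple of 360 degrees, so the rotation it describes is unchanged; only the numeric jumps between adjacent frames are removed.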

Complete BVH data consists of two parts: the skeleton information and the skeleton coordinate/rotation channels. The skeleton information is located at the front of the BVH file. Only the coordinate/rotation channels at the back of the file take part in network training; they provide the motion trajectory and the bone rotation information but not the key skeleton information. Skeleton information is critical to human pose estimation, because people of different body types performing the same motion differ in the form and amplitude of that motion owing to their skeletal differences. To let the network learn bone length information from the training data, the output of the second neural network model contains three-dimensional human body joint coordinates as well as rotation angle information. In the training phase, the output of the pose estimation network therefore has 105 dimensions, of which dimensions 1 to 51 are joint coordinate information and dimensions 52 to 105 are rotation angle information. Because the bone rotation information often contains jitter, dislocation and inversion noise, the third neural network model filters and denoises only the rotation angle part.
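A short Python sketch of how one 105-dimensional output frame might be split is given below. Dimensions 52 to 105 span 54 values, which matches the 3 root-node coordinates plus the 51 rotation channels named in the training target; the order of those two groups inside the tail is an assumption made here (root coordinates first, as in a BVH motion line), and `split_output` is a hypothetical helper.

```python
COORD_CHANNELS = 51  # three-dimensional joint coordinates (from the text)
ROOT_CHANNELS = 3    # root-node three-dimensional coordinates (from the text)
ROT_CHANNELS = 51    # bone rotation channels (from the text)
TOTAL = COORD_CHANNELS + ROOT_CHANNELS + ROT_CHANNELS


def split_output(frame):
    """Split one 105-value frame into (coords, root, rotations).

    ASSUMPTION: the layout is [joint coords | root coords | rotations].
    """
    if len(frame) != TOTAL:
        raise ValueError("expected %d values, got %d" % (TOTAL, len(frame)))
    coords = frame[:COORD_CHANNELS]
    root = frame[COORD_CHANNELS:COORD_CHANNELS + ROOT_CHANNELS]
    rotations = frame[COORD_CHANNELS + ROOT_CHANNELS:]
    return coords, root, rotations


# Hypothetical frame whose values equal their own zero-based index.
coords, root, rotations = split_output(list(range(TOTAL)))
```

Only the `rotations` part would be passed to the filter; the coordinate parts are kept unfiltered.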

It should be noted that the second neural network model and the third neural network model of the present application reinforce each other. Specifically, the third neural network model adjusts the optimization target, focuses on the bone rotation information, and reduces the joint form of the BVH rotation channels, so that it can better approach the true value of the bone rotation of the BVH-format human body posture; that is, noise such as inter-frame jitter, misalignment and inversion can be captured during training. Processing the optimization target is therefore necessary; otherwise the third neural network model cannot effectively remove the inter-frame joint noise in the output of the second neural network model. The second neural network model, in turn, is designed around the fact that Euler angles are used to describe bone rotation and around the problem that this description is discontinuous: a consistent matched error offset calculation unit processes the distance d between the estimation result θ_estimation and the label data θ_label, so that the inter-frame joint association can be better acquired without reducing the joint form of the BVH rotation channels. Through the combined design of the second neural network model and the third neural network model, a three-dimensional human body posture capable of driving an animated character is obtained, and the noise in the estimated BVH three-dimensional human body posture is greatly reduced.
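
The exact form of the consistent matched error offset calculation unit is not given in this passage. As a hedged illustration of the underlying idea only, one common way to make an angular distance insensitive to the 360-degree discontinuity of Euler angles is shown below; the function name `periodic_distance` is hypothetical and the formula is a generic technique, not the unit defined by the application.

```python
def periodic_distance(theta_estimation, theta_label):
    """Distance between two angles in degrees, ignoring 360-degree wrap-around.

    Generic illustration, not the calculation unit defined in the source.
    """
    d = abs(theta_estimation - theta_label) % 360.0
    return min(d, 360.0 - d)


print(periodic_distance(179.0, -179.0))  # 2.0
```

With a plain difference, 179 degrees and -179 degrees would appear 358 degrees apart even though the two rotations are nearly identical; a periodic distance avoids penalizing such pairs.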

The third loss function uses the mean squared error (MSE) to describe the distance between the filtered result, denoted θ_filtered, and the label data, denoted θ_label:
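
In its standard form, the MSE is the mean of the squared differences between θ_filtered and θ_label. A minimal sketch follows, assuming both are flat lists of equal length; the function name `mse_loss` is illustrative, and any weighting or normalization used by the application is not shown.

```python
def mse_loss(theta_filtered, theta_label):
    """Standard mean squared error between two equally long sequences.

    Illustrative name; the source may apply a different normalization.
    """
    if len(theta_filtered) != len(theta_label):
        raise ValueError("inputs must have the same length")
    n = len(theta_filtered)
    return sum((f - t) ** 2 for f, t in zip(theta_filtered, theta_label)) / n


print(mse_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
```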

Preferably, the first neural network model processes the video through Mask R-CNN and CPN to obtain two-dimensional coordinate points; the two-dimensional coordinate points are input into the second neural network model and the third neural network model respectively, and supervised learning is carried out with different label data to obtain the initialized three-dimensional human body joint and the three-dimensional human body joint filter, respectively.

Preferably, the first neural network model processes the video to obtain the initialized two-dimensional human body joint; the present application may adopt a two-dimensional human body posture extraction network combining Mask R-CNN and CPN to obtain the two-dimensional human body joints from the video.

Preferably, the step of inputting the initialized two-dimensional human body joint into the second neural network model and the third neural network model respectively, to obtain the initialized three-dimensional human body joint and the three-dimensional human body joint filter respectively, includes:

performing convolution denoising on the noisy initialized three-dimensional human body joint using a three-dimensional human body joint filter of 31 frames and 51 channels, to obtain a denoised three-dimensional human body joint of 1 frame and 51 channels, namely the smooth BVH-format optimized human body posture. After the convolution denoising calculation is carried out between the 31-frame filter and the 31 frames of estimated three-dimensional BVH, the 51 rotation channels of 1 frame of denoised simplified BVH data are obtained. These 51 rotation channels are combined with the first 3 coordinate channels of the BVH output by the second neural network model, the rotation channels known to be left unfiltered are replaced by the original output of the second neural network model, and the previously deleted zero-valued channels are re-inserted among the resulting 54 channels, finally restoring the complete 78-channel BVH three-dimensional human body posture.
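
The filtering and channel restoration steps above can be sketched as follows. This is a sketch under stated assumptions: "convolution denoising" is read as a per-channel weighted sum over the 31-frame window, and the positions of the unfiltered channels and of the deleted zero-valued channels are passed in as parameters because the text does not list them. Both function names are hypothetical.

```python
def apply_joint_filter(filter_weights, noisy_frames):
    """Collapse T frames x C channels into 1 frame x C channels.

    Assumption: the filter is applied as a per-channel weighted sum over
    time; the source does not spell out the exact operation.
    """
    if len(filter_weights) != len(noisy_frames):
        raise ValueError("filter and window must cover the same number of frames")
    channels = len(noisy_frames[0])
    return [
        sum(w[c] * x[c] for w, x in zip(filter_weights, noisy_frames))
        for c in range(channels)
    ]


def restore_full_bvh(root_xyz, filtered_rot, raw_rot, unfiltered_idx, zero_idx, total=78):
    """Rebuild one 78-channel BVH frame from 3 root coordinates and 51 rotations.

    unfiltered_idx and zero_idx are hypothetical inputs: the source does
    not state which channels they are.
    """
    keep_raw = set(unfiltered_idx)
    # Channels known to be left unfiltered take the estimator's original output.
    rotations = [raw_rot[i] if i in keep_raw else filtered_rot[i]
                 for i in range(len(filtered_rot))]
    kept = list(root_xyz) + rotations  # 3 + 51 = 54 channels
    zeros = set(zero_idx)
    if len(kept) + len(zeros) != total:
        raise ValueError("kept and zero-valued channels must add up to the total")
    values = iter(kept)
    # Re-insert the previously deleted channels as zeros at their positions.
    return [0.0 if i in zeros else next(values) for i in range(total)]
```

For example, with a 31 × 51 filter and a 31 × 51 window of estimated rotations, `apply_joint_filter` returns 51 values, and `restore_full_bvh` then needs 24 zero-channel positions to reach 78 channels.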

Preferably, the second main frame network and the third main frame network are the same network;

In order to verify the overall effect of the present application, the following experiments were performed:

Table 1. Precision evaluation of estimation/filtering results

Qualitative experiments show that the output of the method has better smoothness.

In table 1:

Method 1: Qammaz A, Argyros A A. MocapNET: Ensemble of SNN Encoders for 3D Human Pose Estimation in RGB Images [C]. British Machine Vision Conference (BMVC), 2019;

NET2 (without calculation unit) denotes the method in which the second neural network model does not use the consistent matched error offset calculation unit; NET3 (raw data) denotes the method in which the third neural network model does not adjust the optimization target.

In a comparative experiment:

Owing to the network structure of the present application, the human body posture accuracy after filtering and denoising is superior to that of the prior-art Method 1 in most action categories.

In ablation experiments:

Comparing the output of the filter network before and after data processing on the BVH three-dimensional human body joint estimation results shows that the data preprocessing step effectively improves the precision of the filter network output. The result is better than both the method in which the second neural network model does not adopt the consistent matched error offset calculation unit and the method in which the third neural network model does not adjust the optimization target.

The third neural network model can effectively filter the initialized three-dimensional human body posture output by the second neural network model, and the smoothness of the waveform is clearly improved compared with the output of the second neural network model. The core of the method is that, once label data processing is carried out for the third neural network model and a main frame network similar to that of the second neural network model is adopted, estimation results of the second neural network model that deviate strongly from the label data can be accurately corrected. After the second neural network model extracts the inter-frame joint association information, a relatively complete BVH-format three-dimensional human body posture is formed, and bone rotation errors such as jitter, misalignment and inversion are further eliminated by the third neural network model.

As shown in Fig. 5, visualizing the results of the second neural network model and of the third neural network model for a segment of motion video shows that the output obtained from unprocessed input data contains significant errors, which indicates that the third neural network model can effectively remove the noise in the output of the second neural network model.

It should be further noted that the second neural network model and the third neural network model of the present application reinforce each other and adopt similar main frame networks, so that the noise in the estimated initialized three-dimensional human body joint is consistent with what the filter model itself can handle and can be removed quickly. The design characteristics of the present application are reflected in the following. The third neural network model adjusts the optimization target and reduces the joint form of the BVH rotation channels, so that it can better approach the true value of the BVH-format human body posture; that is, noise such as inter-frame jitter, misalignment and inversion can be captured during training. Processing the optimization target is therefore necessary; otherwise the third neural network model cannot effectively remove the inter-frame joint noise in the output of the second neural network model. The second neural network model, in turn, is designed around the fact that Euler angles are used to describe bone rotation and around the problem that this description is discontinuous: a consistent matched error offset calculation unit processes the distance d between the estimation result θ_estimation and the label data θ_label, so that the inter-frame joint association can be better acquired without reducing the joint form of the BVH rotation channels. Through the combined design of the second neural network model and the third neural network model, a three-dimensional human body posture capable of driving an animated character is obtained, and the noise in the estimated BVH three-dimensional human body posture is greatly reduced.

As shown in Fig. 4, the second neural network model uses dilated convolution to capture and extract the inter-frame joint association information. Dilated convolution, with its large receptive field, can integrate information over a wide range of two-dimensional joints. The input is 27 frames of two-dimensional postures with 17 coordinate points each; after multiple layers of one-dimensional convolution, one-dimensional batch normalization, rectified linear units (ReLU) and dropout, 1 frame of 105-channel data is obtained. In the present application, the second neural network model is designed to estimate the initialized three-dimensional human body joint from the initialized two-dimensional human body joint. In this process, in view of the characteristics of the BVH format, the loss function design uses the consistent matched error offset calculation unit to deal specifically with the periodic discontinuity of the bone rotation angle, so that a good gradient descent speed is maintained even when the error is small.
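
How stacked dilated convolutions reach a 27-frame receptive field can be checked with a short calculation. The kernel sizes and dilation rates below are assumptions chosen for illustration (three layers with kernel size 3 and dilations 1, 3 and 9); the text states only the 27-frame input and the 1-frame output.

```python
def receptive_field(kernel_sizes, dilations):
    """Input frames seen by one output frame of stacked stride-1 1D convolutions."""
    field = 1
    for k, d in zip(kernel_sizes, dilations):
        # Each layer widens the field by (kernel size - 1) * dilation frames.
        field += (k - 1) * d
    return field


# Hypothetical configuration: kernel size 3 with dilations 1, 3, 9.
print(receptive_field([3, 3, 3], [1, 3, 9]))  # 27
```

Without dilation, three such layers would cover only 7 frames, which is why dilated convolution is suited to integrating information across many frames with few layers.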

As shown in Fig. 5, the third neural network model uses dilated convolution to capture and extract the inter-frame joint noise information. In the present application, two different dilated convolutions are used: after the expanded joint information is fed to the input layer, the inter-frame joint noise capture layer quickly captures the noise between the expanded joints, including jitter, misalignment and inversion, and dilated convolution is then applied again to obtain the correlation of similar noise between joints, so that the noise characteristics are better extracted and the output of the second neural network model can be denoised. The sliding window size of the input layer is 27 and the receptive field is 57; that is, 57 frames of two-dimensional postures with 17 coordinate points each are input, and a filter of 31 frames and 51 channels is finally obtained. A 31-frame filter is a fairly common filtering configuration in the prior art and is not an innovative point of the present application, so it is not described further here. The three-dimensional human body joint filter of the third neural network model takes full account of the characteristics of the BVH format and refines the optimization target, so that the noise characteristics arising in the conversion from two-dimensional to three-dimensional joints can be accurately captured and the initialized three-dimensional human body joint can be denoised in a targeted manner. The noise contained in the initialized three-dimensional human body joint is thereby removed quickly, and a high-accuracy BVH-format human body posture is formed.
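
The frame counts given above are consistent with sliding a 27-frame window over 57 input frames one frame at a time, which yields 57 - 27 + 1 = 31 positions. A minimal sketch of that arithmetic follows; this reading is an assumption, since the text does not state how the 31 frames arise, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(frames, window):
    """Return every contiguous run of `window` frames, moving one frame at a time."""
    if window > len(frames):
        raise ValueError("window is longer than the sequence")
    return [frames[i:i + window] for i in range(len(frames) - window + 1)]


windows = sliding_windows(list(range(57)), 27)
print(len(windows))  # 31
```
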
The second neural network model and the third neural network model provided by the present application use similar main frames. The denoising effect is good, the number of parameters involved in model computation is reduced, and the complexity of the model is lowered while a high-precision BVH-format human body posture is still obtained. On a single NVIDIA GeForce GTX Titan GPU, the estimation network and the filter network complete 100 training iterations in 20 hours and 22 hours, respectively. When the trained model is used to process multiple video segments of 1000 frames each, the average time consumed per segment is less than 20 ms, so that nearly real-time processing can be achieved, which enhances practicability and further reduces network redundancy.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for estimating a three-dimensional human body posture from a video, characterized by comprising:

denoising the initialized three-dimensional human body joint by using a three-dimensional human body joint filter to obtain a denoised three-dimensional human body joint;

the training target of the denoised three-dimensional human body joint is a secondary optimized 51-channel BVH human body optimized posture;

the unoptimized BVH human body posture contains 78 channels, and the 51-channel optimized BVH human body posture, consisting of 51 bone rotation channels, is obtained by removing the useless bone rotation channels and the initial three-dimensional coordinates;

the second neural network model comprises a second main frame network and a second loss function;

the joint expansion layer comprises a joint number expansion processing layer for two-dimensional human body joints, the expanded joint number is sequentially input into an inter-frame joint association capturing layer and an inter-frame joint association extracting layer, the inter-frame joint association capturing layer and the inter-frame joint association extracting layer capture and extract the association among multiple frames of the expanded human body joints, the association among the multiple frames of the two-dimensional human body joints is extracted and input into an output layer, the initialized three-dimensional human body joint related to the multi-frame two-dimensional human body posture is obtained, and the training target of the initialized three-dimensional human body joint is the superposition of a 51-channel BVH human body optimization posture, a root node three-dimensional coordinate and a 51-channel three-dimensional skeleton coordinate;

in the 51-channel BVH human body optimized posture, Euler angles are adopted to describe the bone rotation, and the second loss function adopts a consistent matched error offset calculation unit to obtain the difference d between the bone rotation estimation result θ_estimation and the bone rotation label data θ_label:

the third neural network model includes a third main frame network and a third loss function,

the joint expansion layer comprises a joint number expansion processing layer for two-dimensional human body joints, the expanded joint number is sequentially input to an inter-frame joint noise capture layer and an inter-frame joint noise extraction layer, the inter-frame joint noise capture layer and the inter-frame joint noise extraction layer capture and extract noise among multiple frames of the expanded human body joints, the noise among the multiple frames of the two-dimensional human body joints is extracted and input to an output layer, the de-noised three-dimensional human body joints related to the multiple frames of the two-dimensional human body postures are obtained, and the training target of the third neural network model is a secondary optimized 51-channel BVH human body optimized posture;

step two, if the first frame of channel i satisfies θ_i^1′ > 180 degrees, subtracting 360 degrees from every frame value of channel i, namely, for the j-th frame of channel i

if E is less than or equal to 180 degrees, no processing is performed; otherwise

is increased or decreased by 360 degrees,

step five, repeating step three and step four until all

satisfy E less than or equal to 180 degrees,

the second main frame network and the third main frame network are the same network;

the inter-frame joint noise capture layer and the inter-frame joint noise extraction layer capture and extract the inter-frame noise of the expanded human body joints by means of dilated convolution;

and the output layer adopts data dimension reduction output to form a three-dimensional human body joint filter which is suitable for noise extraction output by the second main frame network.

2. The method of estimating three-dimensional human body posture from video according to claim 1, wherein: the first neural network model processes the video to obtain two-dimensional coordinate points, the two-dimensional coordinate points are input into the second neural network model and the third neural network model respectively, and supervised learning is carried out with different label data to obtain the initialized three-dimensional human body joint and the three-dimensional human body joint filter, respectively.

3. The method of estimating three-dimensional human pose from video according to claim 1, wherein:

the first neural network model processes the video to obtain an initialized two-dimensional human body joint;

4. The method of estimating three-dimensional human body pose from video according to claim 3, wherein:

the step of respectively inputting the initialized two-dimensional human body joint into the second neural network model and the third neural network model to respectively obtain the initialized three-dimensional human body joint and the three-dimensional human body joint filter comprises the following steps:

the second neural network model takes 27 consecutive frames of initialized two-dimensional human body joints and obtains one frame of 105-channel initialized three-dimensional human body joints, wherein the 105 channels comprise 51 channels of three-dimensional human body joint coordinates and 51 channels of BVH human body optimized posture;