CN110009717A - Animated-character binding and recording system based on monocular depth maps - Google Patents
- Publication number
- CN110009717A (application CN201910256680.6A / CN201910256680A)
- Authority
- CN
- China
- Prior art keywords
- value
- animated character
- joint point
- coordinate
- smoothing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
Abstract
The invention discloses an animated-character binding and recording system based on monocular depth maps, belonging to the technical field of video-based human pose estimation. In the system, the data-processing pipeline is built on machine-learning and deep-learning frameworks: starting from a monocular depth map, a three-dimensional-information deep-learning network estimates the coordinates of the human joint points in the image; the estimated joint coordinates are fed into the animated-character binding and recording system and smoothed with a filtering algorithm, thereby binding the joint points to the animated character in the system. Because the joint coordinates are estimated with the three-dimensional-information deep-learning network, the estimates of the human joint-point coordinates in the image are more accurate, so that during binding and recording the human motion in the captured footage is faithfully reproduced on the animated character, achieving an accurate binding between the joint points and the animated character in the system.
Description
Technical field
The present invention relates to an animated-character binding and recording system based on monocular depth maps, and belongs to the technical field of video-based human pose estimation.
Background art
Human pose estimation is the reconstruction of a person's joints and limbs from images. Image-based human pose tracking and joint-point estimation have enormous application potential and market value in human-computer interaction, security monitoring, motion analysis, augmented reality, virtual reality, healthcare, games, and animation. Current methods fall mainly into the following two categories:
(1) Top-down methods, also called model-driven methods, rely on a pre-built model or prior knowledge: the corresponding model variables are computed by matching against the image sequence or solving a posterior probability, and the pre-built model is then updated with those variables so that it approaches the specific pose of the person in the image. The main processing loop of such methods is predict, match, update. Since they start from the model and use the data only to drive it, their final accuracy is jointly limited by both the model and the data, which leads to limited precision.
(2) Bottom-up methods, also called data-driven methods, regress the target joint points directly from the data through large-scale matching computations on image data; their final accuracy fluctuates strongly, is heavily affected by the specific data, and generalizes poorly.
Summary of the invention
To solve the problem that existing human pose estimation methods are insufficiently accurate, so that in monocular-depth-map-based animated-character binding and recording the human motion in the captured footage cannot be accurately reproduced on the animated character, the present invention provides an animated-character binding and recording system based on monocular depth maps.
The present application provides a data processing method in an animated-character binding and recording system, the method comprising the following steps:
Step (1): process the 2D monocular depth map with the DeepPrior++ network and output the spatial offset xyz_Offset;
Step (2), data augmentation: rotate, scale, and translate the 2D monocular depth map and map it into three-dimensional Euclidean space to form a point cloud; the mapping equation is as follows:

z_c · [u, v, 1]^T = [[f/dx, 0, u_0], [0, f/dy, v_0], [0, 0, 1]] · [R | T] · [x_w, y_w, z_w, 1]^T

where (u, v) is an arbitrary coordinate point in the image coordinate system; (u_0, v_0) is the center coordinate of the image; (x_w, y_w, z_w) is the corresponding three-dimensional coordinate point in the world coordinate system; z_c is the z-axis value of the camera coordinate in the world coordinate system, i.e. the distance from the subject in the 2D monocular depth map to the camera; R and T are respectively the 3x3 rotation matrix and 3x1 translation vector of the extrinsic matrix; and f/dx and f/dy are the focal length expressed in pixel units along the x and y axes;
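The mapping above is the standard pinhole camera model. The sketch below inverts it for one depth pixel (pixel plus depth to 3D point) under the simplifying assumption of identity extrinsics (R = I, T = 0); the function name and parameters are illustrative and not from the patent:

```python
def backproject(u, v, z_c, fx, fy, u0, v0):
    """Invert the pinhole mapping for one depth pixel.

    fx = f/dx and fy = f/dy are the focal length in pixel units,
    (u0, v0) is the image center, z_c the measured depth. With
    identity extrinsics the camera frame coincides with the world
    frame, so (x_w, y_w, z_w) is returned directly.
    """
    x_w = (u - u0) * z_c / fx
    y_w = (v - v0) * z_c / fy
    return (x_w, y_w, z_c)
```

Applying this to every valid pixel of the depth map yields the point cloud formed in step (2).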
Step (3): correct the point cloud obtained in step (2) with the spatial offset xyz_Offset obtained in step (1), then trim the corrected point cloud with preset parameters to preliminarily form a set of points, called the voxel set Cubic; the voxel set Cubic is a cube of spatial size 88x88x88 in which positions containing a point are set to 1 and positions containing no point are set to 0;
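The voxelization of step (3) can be sketched as follows; the bounds and the sparse set representation are illustrative assumptions, since the patent only fixes the 88x88x88 size and the 0/1 occupancy:

```python
def voxelize(points, size=88, lo=-1.0, hi=1.0):
    """Map 3D points into a size^3 binary occupancy cube.

    Occupied cells are kept in a set of (i, j, k) indices: a cell is 1
    if it contains at least one point and 0 otherwise. Points outside
    [lo, hi)^3 are trimmed away, mimicking the preset-parameter trim.
    """
    step = (hi - lo) / size
    occupied = set()
    for x, y, z in points:
        i = int((x - lo) // step)
        j = int((y - lo) // step)
        k = int((z - lo) // step)
        if all(0 <= t < size for t in (i, j, k)):
            occupied.add((i, j, k))
    return occupied
```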
Step (4): feed the voxel set Cubic obtained in step (3) into the three-dimensional-information deep-learning network FeSHEN to obtain the maximum-likelihood response positions of the character's joint points; these response positions are then mapped into the world coordinate system, finally predicting 18 joint points of the character and obtaining their spatial coordinates in the world coordinate system;
Step (5): process the world-coordinate positions of the 18 joint points with smoothing methods, including variation limiting and jitter smoothing; variation limiting prevents joint motions that exceed the limits of the human body, while jitter smoothing suppresses joint jitter caused by noise;
The jitter-smoothing algorithm is as follows:
Input: the coordinate input value X_t of the current frame; the input value X_{t-1} of the previous frame.
Output: the smoothed output value X̂_t.
S1: compute the Euclidean distance dis between X_t and X_{t-1};
S2: compare dis against a preset jitter limit Jitter:
if dis > Jitter, then X'_t = X_t;
if dis ≤ Jitter, X_t is judged to be jitter and is smoothed to obtain X'_t;
S3: compute the smoothed value Y_t of the current frame with the Holt double-exponential smoothing formula:
Y_t = X'_t × (1 - Smoothing) + (X_{t-1} + T_{t-1}) × Smoothing
where Smoothing is the smoothing parameter with value range [0, 1], and T_{t-1} is the trend value of the previous frame, computed by the trend formula of the previous frame;
S4: compute the difference Dis between the smoothed value Y_t and the input value X_{t-1}:
Dis = Y_t - X_{t-1}
S5: compute the trend value T_t of the current frame with the Holt double-exponential trend formula:
T_t = Dis × Correction + T_{t-1} × (1 - Correction)
where Correction is the correction parameter with value range [0, 1];
S6: compute the final predicted value X̂_t with the Holt double-exponential prediction formula, where the parameter Prediction takes values in [0, n];
S7: using the maximum filtering distance MaxDist, check the Euclidean distance DisOut between the predicted value X̂_t and the input value X_t; if DisOut > MaxDist, the predicted value is pulled back toward X_t.
The result is the smoothed output value X̂_t for the input X_t: a three-dimensional vector containing the joint's coordinate values on the x, y, and z axes;
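The steps above can be sketched as a runnable filter operating on one joint's 3D coordinate. S3 and S5 follow the formulas given in the text; the S2 smoothing, the S6 Holt forecast (level plus Prediction times trend), and the S7 pull-back toward X_t are hedged assumptions, since those formulas are not reproduced here:

```python
import math

def holt_jitter_smooth(x_t, x_prev, trend_prev,
                       jitter=0.05, smoothing=0.5, correction=0.5,
                       prediction=0.0, max_dist=0.1):
    """One step of the joint-coordinate jitter-smoothing filter (S1-S7).

    x_t, x_prev: current / previous joint coordinates (3-tuples).
    trend_prev: previous trend vector T_{t-1}.
    Returns (x_hat, y_t, t_t): smoothed output, level, and trend.
    The S2 interpolation, S6 forecast, and S7 clamp are assumptions;
    only S3 and S5 appear explicitly in the source.
    """
    dist = math.dist(x_t, x_prev)                          # S1
    if dist > jitter:                                      # S2: not jitter
        x_c = x_t
    else:                                                  # S2: assumed lerp toward x_t
        a = dist / jitter if jitter > 0 else 0.0
        x_c = tuple(p + (c - p) * a for c, p in zip(x_t, x_prev))
    y_t = tuple(c * (1 - smoothing) + (p + tr) * smoothing  # S3: Holt level
                for c, p, tr in zip(x_c, x_prev, trend_prev))
    dis = tuple(y - p for y, p in zip(y_t, x_prev))        # S4
    t_t = tuple(d * correction + tr * (1 - correction)     # S5: Holt trend
                for d, tr in zip(dis, trend_prev))
    x_hat = tuple(y + prediction * tr                      # S6: assumed Holt forecast
                  for y, tr in zip(y_t, t_t))
    d_out = math.dist(x_hat, x_t)                          # S7: assumed clamp to max_dist
    if d_out > max_dist:
        x_hat = tuple(x + (h - x) * max_dist / d_out
                      for h, x in zip(x_hat, x_t))
    return x_hat, y_t, t_t
```

Running the filter frame by frame over a joint trajectory, feeding each returned trend back in as trend_prev, yields the smoothed output sequence X̂_t.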
Step (6): build the animated-character binding and recording system based on monocular depth maps from the smoothed output values X̂_t.
Optionally, the 2D monocular depth map is captured with a Microsoft Kinect V2 or an ASUS Xtion Pro.
Optionally, step (6) builds the animated-character binding and recording system based on monocular depth maps from the smoothed output values X̂_t as follows:
(1) the output values X̂_t are read into the system's human pose estimation module, where the joint coordinates are smoothed with a filtering algorithm; the smoothed joint parameters are then output to the system's editing and recording module;
(2) an animated-character model is built in the model binding module, and the character's joint points are bound to the skeleton model to generate a humanoid animation model, which is output to the system's editing and recording module;
(3) the editing and recording module binds the human footage captured by the camera device and the smoothed joint parameters output by the human pose estimation module to the animated-character model built by the model binding module, completing the animated-character recording and binding task;
the editing and recording module is the main interface of the system: through it the user selects and previews models, and after choosing the animation scene and character, clicks to record the animated video with the embedded video-recording method, finally producing the recorded video.
Optionally, the network structure of the three-dimensional-information deep-learning network FeSHEN in step (4) is as follows:
a convolution block followed by a pooling block, then four ME modules in series, then two residual blocks in succession, and finally a convolution block; the ME module is composed of four sub-modules, a pooling block, a residual block, a deconvolution block, and a supervision block, and is used for three-dimensional voxel estimation;
the convolution block consists of a voxel convolution, a voxel batch-normalization layer, and an activation function, and performs the convolution computation on three-dimensional information; the pooling block consists of voxel down-sampling, a voxel batch-normalization layer, and an activation function, and reduces the spatial size of the three-dimensional feature map; the residual block has a main branch and a shortcut branch, the main branch containing two convolution blocks and the shortcut branch one convolution block, and is used to adjust the channel number of the three-dimensional feature map; the deconvolution block consists of voxel up-sampling, a voxel batch-normalization layer, and an activation function, and fuses three-dimensional feature maps while reducing the channel number to obtain the output features; the supervision block has an upper and a lower branch, each composed of two residual blocks: the upper branch performs a channel transformation on the three-dimensional feature map, the first residual block of the lower branch compresses the feature map and extracts nonlinear features, and the second residual block of the lower branch expands the channels of the compressed supervision features and fuses them with the original output features;
the kernel size of the residual blocks is 3x3x3, and the kernel size of the convolution and deconvolution blocks is 2x2x2 with stride 2; the supervision parameters of the four ME modules are [2, 4, 8, 16] in order, and their output parameters are [8, 16, 32, 64] in order.
Optionally, the rotation of the 2D monocular depth map in step (2) rotates it within the angular range [-40, 40] degrees in the XY plane.
Optionally, the scaling of the 2D monocular depth map in step (2) uses scale factors in the range [0.8, 1.2].
Optionally, the translation of the 2D monocular depth map in step (2) is a translation within [-8, 8] voxels.
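The three augmentation ranges can be illustrated for a single 3D point as follows; applying the transform per point is an illustrative assumption (in practice one sampled transform would be applied to the whole depth map), and the function name is not from the patent:

```python
import math
import random

def augment_point(p, rng=random):
    """Apply the described augmentation ranges to one 3D point:
    rotation in the XY plane by an angle in [-40, 40] degrees,
    uniform scaling in [0.8, 1.2], and translation in [-8, 8] voxels.
    """
    ang = math.radians(rng.uniform(-40, 40))
    s = rng.uniform(0.8, 1.2)
    t = [rng.uniform(-8, 8) for _ in range(3)]
    x, y, z = p
    xr = x * math.cos(ang) - y * math.sin(ang)  # rotate in the XY plane
    yr = x * math.sin(ang) + y * math.cos(ang)
    return (xr * s + t[0], yr * s + t[1], z * s + t[2])
```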
Optionally, the variation limiting in step (5) restricts the frame-to-frame variation of the joint points and comprises three methods:
1. coordinate limiting, which limits the variation range of the coordinate value on each axis and thereby controls the angle change;
2. swing limiting, which limits the left-right and front-back angle changes of a joint point;
3. angle limiting, which limits the angle change of a joint point in all directions.
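Coordinate limiting (method 1) amounts to clamping the per-axis change between frames; a minimal sketch, with the clamp thresholds as assumed parameters (swing and angle limiting, methods 2 and 3, would clamp the corresponding angles analogously):

```python
def limit_coordinate_change(curr, prev, max_delta):
    """Clamp each axis change between frames to [-m, +m].

    curr, prev: joint coordinates of this and the previous frame;
    max_delta: per-axis limits (dx, dy, dz). Returns the limited
    coordinate, so a joint can never move beyond the set range
    between consecutive frames.
    """
    return tuple(p + max(-m, min(m, c - p))
                 for c, p, m in zip(curr, prev, max_delta))
```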
Optionally, in S3 of the jitter-smoothing algorithm of step (5), the smaller the value of the smoothing parameter Smoothing, the less the smoothed value Y_t of the current frame is influenced by the previous frame.
Optionally, in S5 of the jitter-smoothing algorithm of step (5), the larger the value of the correction parameter Correction, the faster the deviation of the joint point is corrected.
Optionally, the method, based on machine-learning and deep-learning frameworks, starts from the monocular depth map, estimates the human joint-point coordinates in the image with the three-dimensional-information deep-learning network, feeds the estimated joint coordinates into the animated-character binding and recording system, smooths the joint points with a filtering algorithm, and finally binds the joint points to the animated character in the system.
The beneficial effects of the invention are as follows:
By estimating the human joint-point coordinates in the image with the three-dimensional-information deep-learning network and feeding the estimates into the animated-character binding and recording system, the estimates of the joint coordinates become more accurate, so that during binding and recording the human motion in the captured footage is faithfully reproduced on the animated character, achieving an accurate binding between the joint points and the animated character in the system.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is the structure diagram of DeepPrior++.
Fig. 2 illustrates the joint-point variation limits: horizontal stripes denote coordinate limiting, shown at (1) in Fig. 2, with parameters Δx, Δy, Δz denoting the respective coordinate changes; diagonal stripes denote movement limiting, shown at (2) in Fig. 2, with parameters ΔS and ΔT denoting the swing amplitude and the front-back rocking amplitude; dotted stripes denote angle limiting, shown at (3) in Fig. 2, with parameter ΔA denoting the angle change.
Fig. 3 is the structure diagram of the animated-character binding and recording system of the present invention.
Fig. 4 is the design diagram of the basic modules of the deep-learning network of the present invention.
Fig. 5 is the design diagram of the deep-learning network of the present invention.
Fig. 6 compares the results of the FeSHEN network provided by the invention and the V2V-PoseNet network on pose estimation for a compound human action sequence.
Fig. 7 compares the results of the FeSHEN network provided by the invention and the V2V-PoseNet network on pose estimation for a continuous human action sequence.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the invention are described in further detail below with reference to the drawings.
Embodiment one:
This embodiment provides a data processing method in an animated-character binding and recording system, used in the process of binding and recording a human and an animated character. Based on machine-learning and deep-learning frameworks, the method starts from a monocular depth map, estimates the human joint-point coordinates in the image with the three-dimensional-information deep-learning network FeSHEN, feeds the estimated joint coordinates into the animated-character binding and recording system, smooths the joint points with a filtering algorithm, and finally binds the joint points to the animated character in the system.
The monocular depth map is captured with a Microsoft Kinect V2 or an ASUS Xtion Pro.
The three-dimensional-information deep-learning network FeSHEN is an end-to-end network; its construction is introduced below from three aspects: the modules of the network, its structure, and its loss function.
(1) Modules of the network
The present application designs a network component for three-dimensional voxel estimation, the Monitored-Endecoder (ME module), whose structure is shown in Fig. 4. The ME module consists of four sub-modules: a pooling block, a residual block, a deconvolution block, and a supervision block; to simplify the figure, the z-axis information is not drawn in Fig. 4.
The two rows below each module indicate its input and output: the numbers give the spatial size and number of feature channels of the data processed by the module, the size of each cube represents the size of the feature map in the module, and its thickness represents the channel count.
The dashed box on the left of the figure is the encoder: the pooling block reduces the size of the feature map, and the residual block increases the number of feature channels. Increasing the channel count is equivalent to increasing the number of convolution-kernel types; the more kernel types there are, the more features can be learned, which benefits model performance.
The dashed box in the middle is the decoder, in which the deconvolution block enlarges the spatial size of the feature map while using fewer convolution kernels to fuse the feature maps and reduce the channel count, achieving compression and decoding. During encoding, using a smaller stride, compressing features, increasing channels, and enlarging the feature-map space makes the network features richer and makes it easier to converge on dense key-point positions, thereby localizing the joint points.
The dashed box on the right represents the monitor, composed mainly of residual blocks with different input/output channel parameters, denoted Out (the output parameter) and Moni (the supervision parameter). The monitor has two branches, each consisting of two residual blocks. The upper branch performs a channel transformation on the feature map; its residual blocks have channel number Out. The first residual block of the lower branch is the Intermediate Monitor Block: its input channel number is Out and its output channel number is Moni; it compresses the feature map and extracts nonlinear features. The second residual block of the lower branch has input channel number Moni and output channel number Out, and is mainly used to expand the channels of the compressed supervision features so they can be fused with the original output features. Different supervision sizes refine the learned features to different degrees, so different supervision parameters can be set for different image resolutions to reach the best prediction performance.
(2) Network structure
The basic units used to build the network are 3D convolution blocks, of four kinds in total.
The first kind, called the convolution block, consists of a voxel convolution, a voxel batch-normalization (Batch Normalize) layer, and an activation function (ReLU), and mainly performs the convolution computation on three-dimensional information.
The second kind, called the residual block, is mainly used to adjust the channel number of the three-dimensional feature map.
The third kind, called the pooling block, performs 3D voxel down-sampling on the same principle as 2D pooling, and is mainly used to reduce the three-dimensional size of the feature map.
The fourth kind, called the deconvolution block, up-samples 3D voxels and consists of voxel up-sampling (Upsampling), a voxel batch-normalization (Batch Normalize) layer, and an activation function (ReLU). The batch-normalization layer and the activation function used in the convolution and deconvolution blocks help simplify the learning process and speed up gradient descent. In the network design, the kernel size of the residual blocks is 3x3x3, and the kernel size of the convolution and deconvolution blocks is 2x2x2 with stride 2.
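The pooling block's voxel down-sampling can be illustrated with a plain-Python stand-in, a 2x2x2 max pool with stride 2 matching the stated kernel size and stride (a real network would implement this as a framework layer, not nested lists):

```python
def voxel_downsample(vol, k=2):
    """2x2x2 max pooling with stride 2 over a cubic nested list,
    halving each spatial dimension (e.g. 88^3 -> 44^3), as the
    pooling block does for three-dimensional feature maps.
    """
    n = len(vol)
    m = n // k
    return [[[max(vol[k * i + a][k * j + b][k * l + c]
                  for a in range(k) for b in range(k) for c in range(k))
              for l in range(m)]
             for j in range(m)]
            for i in range(m)]
```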
The main body of the supervised high-dimensional-information encoding-decoding network (FeSHEN) is composed of the four ME modules connected in series with different supervision parameters. Chaining the ME modules deepens the network, and using different supervision parameters extracts features at different degrees of refinement. The beginning and end are handled with convolution blocks; the specific network structure is shown in Fig. 5.
The numbers under the modules in Fig. 4 give the size and channel count of the feature map processed by each module: the red numbers are the supervision parameters Moni, which for the ME components are [2, 4, 8, 16] in order, and the blue numbers are the output parameters Out, which are [8, 16, 32, 64] in order.
(3) Network loss function
The three-dimensional-information deep-learning network FeSHEN uses as its loss function L the mean squared error of the Gaussian peak means,
where N is the number of joint points, Ĥ_n and H_n are respectively the Gaussian peak means of the ground truth and the prediction for the n-th joint point, and (i, j, k) are the three-dimensional coordinate values of the n-th joint point.
The Gaussian peak mean (mean of Gaussian peak) is used as the feature of a coordinate point in order to account for the prediction likelihood of each joint point: a Gaussian mean is taken over each predicted point and the target point and used as the feature of the target point,
where Ĥ_n is the Gaussian peak mean of the n-th joint point, (i_n, j_n, k_n) are the labeled coordinate values of the n-th joint point, and σ = 1.7 is the standard deviation of the Gaussian peak.
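The two formulas referred to above do not survive in this text. Under the stated definitions (mean squared error of Gaussian peak means over N joints, labeled coordinates (i_n, j_n, k_n), σ = 1.7), a plausible reconstruction, offered only as a reading aid and not as the patent's exact formulas, is:

```latex
% Loss: mean squared error between the ground-truth and predicted
% Gaussian peak means over the N joint points (reconstruction).
L = \frac{1}{N} \sum_{n=1}^{N} \left( \hat{H}_n - H_n \right)^2

% Gaussian peak mean of joint n, averaged over the voxel positions
% (i, j, k) in the response volume \Omega, with labeled coordinates
% (i_n, j_n, k_n) and standard deviation \sigma = 1.7 (assumed form).
\hat{H}_n = \frac{1}{\lvert \Omega \rvert} \sum_{(i,j,k) \in \Omega}
  \exp\!\left( -\frac{(i - i_n)^2 + (j - j_n)^2 + (k - k_n)^2}{2\sigma^2} \right)
```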
The structure diagram of the established three-dimensional-information deep-learning network FeSHEN is shown in Fig. 5.
After the three-dimensional-information deep-learning network FeSHEN has been established, the following steps complete the binding of the human joint points to the animated character:
A 2D monocular depth map containing a human body is captured with a Microsoft Kinect V2 or an ASUS Xtion Pro;
Step (1): process the 2D monocular depth map with the DeepPrior++ network and output the spatial offset xyz_Offset; the structure of the DeepPrior++ network is shown in Fig. 1.
For an introduction to the DeepPrior++ network, refer to "DeepPrior++: Improving Fast and Accurate 3D Hand Pose Estimation", published in 2017.
Step (2), data augmentation: rotate, scale, and translate the 2D monocular depth map and map it into three-dimensional Euclidean space to form a point cloud. The rotation rotates the 2D monocular depth map within the angular range [-40, 40] degrees in the XY plane; the scaling uses scale factors in the range [0.8, 1.2]; the translation is within [-8, 8] voxels. The mapping equation for mapping into three-dimensional Euclidean space to form the point cloud is as follows:

z_c · [u, v, 1]^T = [[f/dx, 0, u_0], [0, f/dy, v_0], [0, 0, 1]] · [R | T] · [x_w, y_w, z_w, 1]^T

where (u, v) is an arbitrary coordinate point in the image coordinate system; (u_0, v_0) is the center coordinate of the image; (x_w, y_w, z_w) is the corresponding three-dimensional coordinate point in the world coordinate system; z_c is the z-axis value of the camera coordinate in the world coordinate system, i.e. the distance from the subject in the 2D monocular depth map to the camera; R and T are respectively the 3x3 rotation matrix and 3x1 translation vector of the extrinsic matrix; and f/dx and f/dy are the focal length expressed in pixel units along the x and y axes;
Step (3): correct the point cloud obtained in step (2) with the spatial offset xyz_Offset obtained in step (1), then trim the corrected point cloud with preset parameters to preliminarily form a set of points, called the voxel set Cubic; the voxel set Cubic is a cube of spatial size 88x88x88 in which positions containing a point are set to 1 and positions containing no point are set to 0;
Step (4): feed the voxel set Cubic obtained in step (3) into the three-dimensional-information deep-learning network FeSHEN established above to obtain the maximum-likelihood response positions of the joint points; these response positions are then mapped into the world coordinate system, finally predicting 18 joint points of the character and obtaining their spatial coordinates in the world coordinate system;
Step (5): process the world-coordinate positions of the 18 joint points with smoothing methods, including variation limiting and jitter smoothing; variation limiting prevents joint motions that exceed the limits of the human body, while jitter smoothing suppresses joint jitter caused by noise;
The variation limiting restricts the frame-to-frame variation of the joint points; referring to Fig. 2, it comprises three methods:
1. coordinate limiting, which limits the variation range of the coordinate value on each axis and thereby controls the angle change;
2. swing limiting, which limits the left-right and front-back angle changes of a joint point;
3. angle limiting, which limits the angle change of a joint point in all directions.
In Fig. 2, the limiting methods are marked on the joint points in different colors, together with the limiting parameters. Horizontal stripes denote coordinate limiting, shown at (1) in the figure, with parameters Δx, Δy, Δz denoting the coordinate changes; diagonal stripes denote movement limiting, shown at (2) in the figure, with parameters ΔS and ΔT denoting the swing amplitude and the front-back rocking amplitude; dotted stripes denote angle limiting, shown at (3) in the figure, with parameter ΔA denoting the angle change.
The jitter-smoothing algorithm is as follows:
Input: the coordinate input value X_t of the current frame; the input value X_{t-1} of the previous frame.
Output: the smoothed output value X̂_t.
S1: compute the Euclidean distance dis between X_t and X_{t-1};
S2: compare dis against a preset jitter limit Jitter:
if dis > Jitter, then X'_t = X_t;
if dis ≤ Jitter, X_t is judged to be jitter and is smoothed to obtain X'_t;
S3: compute the smoothed value Y_t of the current frame with the Holt double-exponential smoothing formula:
Y_t = X'_t × (1 - Smoothing) + (X_{t-1} + T_{t-1}) × Smoothing
where Smoothing is the smoothing parameter with value range [0, 1], and T_{t-1} is the trend value of the previous frame, computed by the trend formula of the previous frame. From this formula it can be seen that the smaller the smoothing parameter Smoothing, the less the smoothed value Y_t of the current frame is influenced by the previous frame;
S4: compute the difference Dis between the smoothed value Y_t and the input value X_{t-1}:
Dis = Y_t - X_{t-1}
S5: compute the trend value T_t of the current frame with the Holt double-exponential trend formula:
T_t = Dis × Correction + T_{t-1} × (1 - Correction)
where Correction is the correction parameter with value range [0, 1]. From this formula it can be seen that the larger the correction parameter Correction, the faster the deviation of the joint point is corrected;
S6: compute the final predicted value X̂_t with the Holt double-exponential prediction formula, where the parameter Prediction takes values in [0, n] and determines how strongly the prediction influences the next n frames;
S7: using the maximum filtering distance MaxDist, check the Euclidean distance DisOut between the predicted value X̂_t and the input value X_t; if DisOut > MaxDist, the predicted value is pulled back toward X_t.
The result is the smoothed output value X̂_t for the input X_t: a three-dimensional vector containing the joint's coordinate values on the x, y, and z axes.
The maximum filtering distance MaxDist is set according to the specific smoothness requirements: too large a value means DisOut is never filtered; too small a value makes the smoothed output X̂_t tend toward a single value.
Step (6): build the animated-character binding and recording system based on monocular depth maps from the smoothed output values X̂_t. The binding and recording system mainly comprises three modules, a human pose estimation module, a model binding module, and an editing and recording module; the specific structure is shown in Fig. 3:
(1) the output values X̂_t are output to the system's editing and recording module;
(2) an animated-character model is built in the model binding module, and the character's joint points are bound to the skeleton model to generate a humanoid animation model, which is output to the system's editing and recording module;
(3) the editing and recording module binds the human footage captured by the camera device and the smoothed joint parameters output by the human pose estimation module to the animated-character model built by the model binding module, completing the animated-character recording and binding task;
the editing and recording module is the main interface of the system: through it the user selects and previews models, and after choosing the animation scene and character, clicks to record the animated video with the embedded video-recording method, finally producing the recorded video.
In the above data-processing pipeline, the human joint-point coordinates in the image are estimated with the three-dimensional-information deep-learning network. Compared with estimating them in other ways, for example with the V2V-PoseNet network, whose structure can be found in "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map", published in 2018:
The comparison results are shown in Fig. 6 and Fig. 7. Fig. 6 shows a human motion sequence in which the motion takes the feet beyond the detection range of the depth camera; V2V-PoseNet then makes a wrong estimate and returns the foot key points to their initial positions, as is evident in columns (2) and (3) of the comparison in Fig. 6. The FeSHEN used in this application instead makes an appropriate adjustment and gives a prediction closer to the ground truth.
In columns (1) and (2) of Fig. 7, V2V-PoseNet predicts the two points corresponding to the person's two feet as a single point, whereas FeSHEN refines and separates them, showing that the FeSHEN provided by this application has stronger nonlinear prediction ability. Columns (3), (4), and (5) of Fig. 7 show that FeSHEN also has stronger robustness: for complex human motions, for example when the body occludes itself, FeSHEN responds more accurately to the changing depth-map information and adjusts its predictions closer to the ground truth.
Synthesis is it is found that using human joint points coordinate in the network-evaluated figure out of three-dimensional information deep learning provided by the present application
Closer true value, and it is non-linear more preferable, and robustness is stronger, may make prediction to tie to bind recording system in animated character
Fruit is more accurate.
Some of the steps in the embodiments of the present invention may be implemented in software, and the corresponding software programs may be stored in a readable storage medium such as an optical disc or a hard disk.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A data processing method in an animated-character binding and recording system, characterized in that the method comprises the following steps:
Step (1): processing the 2D monocular depth map with the deeprior++ network and outputting a spatial offset xyz_Offset;
Step (2): data enhancement: rotating, scaling and translating the 2D monocular depth map and mapping it into three-dimensional Euclidean space to form a point cloud; the mapping equation is as follows:
z_c · [u, v, 1]^T = K · [R | T] · [x_w, y_w, z_w, 1]^T, with K = [[f_x, 0, u_0], [0, f_y, v_0], [0, 0, 1]]
wherein u, v are an arbitrary coordinate point in the image coordinate system; u_0, v_0 are the center coordinates of the image; f_x, f_y are the focal lengths of the camera in pixels; x_w, y_w, z_w denote a three-dimensional coordinate point in the world coordinate system; z_c denotes the z-axis value of the camera coordinate in the world coordinate system, i.e. the distance from the animated person in the 2D monocular depth map to the camera; R and T are respectively the 3x3 rotation matrix and the 3x1 translation matrix of the extrinsic matrix;
Step (3): correcting the point cloud obtained in step (2) with the spatial offset xyz_Offset obtained in step (1), and then trimming the corrected point cloud with preset parameters to preliminarily form a set of points; the preliminarily formed set of points is collectively referred to as the voxel set Cubic, a cube of spatial size 88x88x88 in which positions containing a point are marked 1 and positions containing no point are marked 0;
Step (4): inputting the voxel set Cubic obtained in step (3) into the three-dimensional-information deep-learning network FeSHEN to obtain the maximum-likelihood response positions of the animated character's joint points; the maximum-likelihood response positions of the joint points are then mapped into the world coordinate system, finally predicting 18 joint points of the animated character and obtaining the spatial coordinates of the 18 joint points in the world coordinate system;
Step (5): processing the spatial coordinates of the 18 joint points in the world coordinate system with smoothing methods, including variation limiting and jitter smoothing; wherein variation limiting prevents a joint point from moving beyond the limits of the human body, and jitter smoothing avoids joint-point jitter caused by noise;
the jitter smoothing algorithm is as follows:
Input: the coordinate input value X_t of this frame; the input value X_{t-1} of the previous frame;
Output: the smoothed output value X̂_t;
S1: calculate the Euclidean distance dis between X_t and X_{t-1};
S2: judge the magnitude of dis against a set jitter limit value Jitter:
if dis > Jitter, then X'_t = X_t;
if dis ≤ Jitter, X_t is judged to be jitter, and X_t is smoothed by the following formula:
X'_t = X_t × (dis / Jitter) + X_{t-1} × (1 − dis / Jitter)
S3: calculate the smoothed value Y_t of this frame using the Holt double exponential smoothing formula:
Y_t = X'_t × (1 − Smoothing) + (X_{t-1} + T_{t-1}) × Smoothing
wherein Smoothing is the smoothing parameter, with value range [0, 1], and T_{t-1} is the trend value of the previous frame, calculated by the trend formula of the previous frame;
S4: calculate the difference Dis between the smoothed value Y_t and the input value X_{t-1}:
Dis = Y_t − X_{t-1}
S5: calculate the trend value T_t of this frame using the Holt double exponential trend formula:
T_t = Dis × Correction + T_{t-1} × (1 − Correction)
wherein Correction is the correction parameter, with value range [0, 1];
S6: calculate the final predicted value X̂_t using the Holt double exponential prediction formula:
X̂_t = Y_t + T_t × Prediction
wherein the value range of Prediction is [0, n];
S7: use the maximum filtering distance MaxDist to check the Euclidean distance DisOut between the predicted value X̂_t and the input value X_t; if DisOut > MaxDist, then
X̂_t = X̂_t × (MaxDist / DisOut) + X_t × (1 − MaxDist / DisOut)
The smoothed output value X̂_t for the input value X_t is thus obtained; X̂_t is a three-dimensional vector containing the coordinate values of the joint point on the x, y and z axes;
Step (6): establishing, according to the smoothed output value X̂_t, the animated-character binding and recording system based on the monocular depth map.
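The jitter-smoothing steps S1–S7 above follow the well-known Holt double exponential smoothing filter. A minimal per-joint sketch is given below. The closed forms used for the formulas of S2, S6 and S7 (whose images did not survive extraction) are assumptions based on the standard form of this filter, not values from the disclosure:

```python
import math

def holt_filter(x_t, x_prev, t_prev, jitter=0.05, smoothing=0.5,
                correction=0.5, prediction=0.5, max_dist=0.2):
    """One step of the jitter-smoothing algorithm S1-S7 for a 3D joint.

    x_t, x_prev : (x, y, z) input of this frame and the previous frame
    t_prev      : (x, y, z) trend value of the previous frame
    Returns (x_hat, t_t): smoothed output and this frame's trend value.
    """
    # S1: Euclidean distance between this frame's and the previous frame's input.
    dis = math.dist(x_t, x_prev)

    # S2: below the jitter limit, blend toward the previous input
    # (reconstructed formula -- an assumption).
    if dis > jitter:
        x_p = x_t
    else:
        w = dis / jitter
        x_p = tuple(a * w + b * (1 - w) for a, b in zip(x_t, x_prev))

    # S3: Holt double exponential smoothing.
    y_t = tuple(a * (1 - smoothing) + (b + c) * smoothing
                for a, b, c in zip(x_p, x_prev, t_prev))

    # S4 + S5: trend update from the deviation Dis = Y_t - X_{t-1}.
    t_t = tuple((y - b) * correction + c * (1 - correction)
                for y, b, c in zip(y_t, x_prev, t_prev))

    # S6: prediction (reconstructed formula -- an assumption).
    x_hat = tuple(y + t * prediction for y, t in zip(y_t, t_t))

    # S7: if the prediction drifted farther than max_dist from the raw
    # input, pull it back (reconstructed formula -- an assumption).
    dis_out = math.dist(x_hat, x_t)
    if dis_out > max_dist:
        w = max_dist / dis_out
        x_hat = tuple(h * w + a * (1 - w) for h, a in zip(x_hat, x_t))
    return x_hat, t_t

# One filtering step for a joint that moved 2 cm between frames.
x_hat, t_t = holt_filter((1.0, 1.0, 1.0), (1.02, 1.0, 1.0), (0.0, 0.0, 0.0))
```

Per claims 8 and 9, a smaller `smoothing` makes Y_t follow the raw input more closely, while a larger `correction` makes the trend react faster to deviations.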
2. The method according to claim 1, characterized in that step (6) of establishing, according to the smoothed output value X̂_t, the animated-character binding and recording system based on the monocular depth map comprises:
(1) outputting the value X̂_t to the editing and recording module of the system;
(2) establishing an animated character model in the model binding module, binding the animated character's joint points to the skeleton model, generating a humanoid animation model, and outputting it to the editing and recording module of the system;
(3) the editing and recording module binding the human-body picture captured by the camera device and the smoothed joint-point parameters output by the human-body estimation module to the animated character model established by the model binding module, completing the animated-character recording and binding task;
wherein the editing and recording module is the main interface of the system: after selecting and previewing a model and determining the animation scene and the animated character through the main interface, the user clicks to record an animated video using the embedded video-recording method, and a recorded video is ultimately generated.
3. The method according to claim 1, characterized in that the network structure of the three-dimensional-information deep-learning network FeSHEN in step (4) is: a convolution block followed by a pooling block, then four ME modules connected in series, then two residual blocks in succession, and finally a convolution block; each ME module consists of four sub-modules, namely a pooling block, a residual block, a deconvolution block and a supervision block, and is used for three-dimensional voxel estimation;
wherein the kernel size of the residual blocks is 3x3x3, the kernel size of the convolution blocks and deconvolution blocks is 2x2x2, each with a stride of 2; the supervision parameters of the four ME modules are successively [2, 4, 8, 16], and their output parameters are successively [8, 16, 32, 64].
4. The method according to claim 1, characterized in that when the 2D monocular depth map is rotated in step (2), it is rotated in the XY plane within the angle range [−40, 40] degrees.
5. The method according to claim 1, characterized in that when the 2D monocular depth map is scaled in step (2), the scaling factor is within the range [0.8, 1.2].
6. The method according to claim 1, characterized in that when the 2D monocular depth map is translated in step (2), it is translated within [−8, 8] in voxel space.
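The augmentation ranges of claims 4–6 (rotation in [−40, 40] degrees in the XY plane, scale factor in [0.8, 1.2], translation in [−8, 8] voxels) can be sampled and applied to a point as in the sketch below; the function names and the way the three transforms are composed are illustrative assumptions, not the disclosed implementation:

```python
import math
import random

def sample_augmentation(rng=random):
    """Sample one augmentation per claims 4-6: a rotation angle in degrees,
    a uniform scale factor, and a per-axis translation in voxels."""
    angle = rng.uniform(-40.0, 40.0)                    # claim 4: XY-plane rotation
    scale = rng.uniform(0.8, 1.2)                       # claim 5: scale factor
    shift = [rng.uniform(-8.0, 8.0) for _ in range(3)]  # claim 6: voxel translation
    return angle, scale, shift

def transform_point(p, angle_deg, scale, shift):
    """Apply the sampled augmentation to one (x, y, z) point: rotate about
    the z axis (i.e. in the XY plane), scale uniformly, then translate."""
    a = math.radians(angle_deg)
    x, y, z = p
    xr = x * math.cos(a) - y * math.sin(a)
    yr = x * math.sin(a) + y * math.cos(a)
    return (xr * scale + shift[0], yr * scale + shift[1], z * scale + shift[2])

angle, scale, shift = sample_augmentation(random.Random(0))
q = transform_point((1.0, 0.0, 0.0), angle, scale, shift)
```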
7. The method according to claim 1, characterized in that variation limiting in step (5) limits the joint points by setting variation limits between frames, including three methods:
1. coordinate limiting, which limits the variation range of the coordinate value on each coordinate axis and thereby controls the angle change;
2. swing limiting, which limits the left-right and front-back angle changes of a joint point;
3. angle limiting, which limits the angle change of a joint point in all directions.
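The first of these methods, coordinate limiting, amounts to clamping each axis of a joint coordinate to a per-joint allowed range. A minimal sketch follows; the limit values shown are hypothetical, not values from the disclosure:

```python
def limit_coordinates(joint, limits):
    """Coordinate limiting (method 1 of claim 7): clamp each axis of a
    joint position to its allowed [lo, hi] range."""
    return tuple(min(max(v, lo), hi) for v, (lo, hi) in zip(joint, limits))

# Hypothetical per-axis limits for one joint, in metres.
elbow_limits = [(-0.5, 0.5), (0.8, 1.6), (1.0, 4.0)]
clamped = limit_coordinates((0.7, 1.2, 0.5), elbow_limits)
print(clamped)  # -> (0.5, 1.2, 1.0)
```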
8. The method according to claim 1, characterized in that in S3 of the jitter smoothing algorithm of step (5), the smaller the value of the smoothing parameter Smoothing, the less the smoothed value Y_t of this frame is influenced by the previous frame.
9. The method according to claim 1, characterized in that in S5 of the jitter smoothing algorithm of step (5), the larger the value of the correction parameter Correction, the faster the deviation of the joint point is corrected.
10. The method according to any one of claims 1 to 9, characterized in that the method is based on machine-learning and deep-learning frameworks: the human joint-point coordinates in the image are estimated from the monocular depth map using the three-dimensional-information deep-learning network, the estimated joint-point coordinates are introduced into the animated-character binding and recording system, and the joint points are smoothed using a filtering algorithm, finally realizing the binding of the joint points to the animated character in the animated-character binding and recording system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910256680.6A CN110009717B (en) | 2019-04-01 | 2019-04-01 | Animation figure binding recording system based on monocular depth map |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110009717A true CN110009717A (en) | 2019-07-12 |
CN110009717B CN110009717B (en) | 2020-11-03 |
Family
ID=67169200
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910256680.6A Active CN110009717B (en) | 2019-04-01 | 2019-04-01 | Animation figure binding recording system based on monocular depth map |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110009717B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110321964A (en) * | 2019-07-10 | 2019-10-11 | 重庆电子工程职业学院 | Identification model update method and relevant apparatus |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102486816A (en) * | 2010-12-02 | 2012-06-06 | 三星电子株式会社 | Device and method for calculating human body shape parameters |
CN102622591A (en) * | 2012-01-12 | 2012-08-01 | 北京理工大学 | 3D (three-dimensional) human posture capturing and simulating system |
CN106846403A (en) * | 2017-01-04 | 2017-06-13 | 北京未动科技有限公司 | The method of hand positioning, device and smart machine in a kind of three dimensions |
US20170345183A1 (en) * | 2016-04-27 | 2017-11-30 | Bellus 3D, Inc. | Robust Head Pose Estimation with a Depth Camera |
CN108573231A (en) * | 2018-04-17 | 2018-09-25 | 中国民航大学 | Human bodys' response method based on the Depth Motion figure that motion history point cloud generates |
CN108665496A (en) * | 2018-03-21 | 2018-10-16 | 浙江大学 | A kind of semanteme end to end based on deep learning is instant to be positioned and builds drawing method |
CN109003301A (en) * | 2018-07-06 | 2018-12-14 | 东南大学 | A kind of estimation method of human posture and rehabilitation training system based on OpenPose and Kinect |
US20190026548A1 (en) * | 2017-11-22 | 2019-01-24 | Intel Corporation | Age classification of humans based on image depth and human pose |
CN109492578A (en) * | 2018-11-08 | 2019-03-19 | 北京华捷艾米科技有限公司 | A kind of gesture remote control method and device based on depth camera |
Non-Patent Citations (2)
Title |
---|
YING CHEN: "Automatic Facial Feature Correspondence Based on Pose Estimation", 2010 Second International Workshop on Education Technology and Computer Science *
CHEN YING: "Markerless Human Pose Estimation from Monocular Depth Maps Based on Feature Regression", Journal of System Simulation (系统仿真学报) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||