CN110310351B - Sketch-based three-dimensional human skeleton animation automatic generation method - Google Patents


Info

Publication number
CN110310351B
CN110310351B (application CN201910597737.9A)
Authority
CN
China
Prior art keywords: network, sketch, animation, data, dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910597737.9A
Other languages
Chinese (zh)
Other versions
CN110310351A (en)
Inventor
马昊 (Ma Hao)
李淑琴 (Li Shuqin)
丁濛 (Ding Meng)
孟坤 (Meng Kun)
Current Assignee
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201910597737.9A
Publication of CN110310351A
Application granted
Publication of CN110310351B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Abstract

The invention relates to a sketch-based method for automatically generating three-dimensional human skeleton animation. Taking the simplification of the skeleton animation production workflow as its starting point, the method accepts the head and tail frames of the animation to be generated as sketch image input, uses tensorflow to construct neural network frameworks that respectively realize sketch-based three-dimensional reconstruction and automatic synthesis of skeleton animation interpolation-frame information, and thereby achieves automatic generation of three-dimensional human skeleton animation from sketches. The front-end and back-end code of the system are logically separated, giving the system high flexibility. The system automatically preprocesses the images input by the user so that they satisfy the input format of each network component, reducing the interaction complexity of the whole system.

Description

Sketch-based three-dimensional human skeleton animation automatic generation method
Technical Field
The invention relates to the technical field of computer animation, in particular to a three-dimensional human skeleton animation automatic generation method based on sketches.
Background
In computer graphics, three-dimensional modeling techniques provide the methods needed to convert objects in the real world into mathematical representations in a three-dimensional coordinate system and render them with a computer program, thereby simulating the real world in a virtual space. Mature three-dimensional modeling software such as CAD, Maya, and 3ds Max is widely used across many fields. Although this software is well established in production environments, using it typically requires specialized training and a relatively steep learning curve. Conventional three-dimensional modeling software constrains the user's input with cumbersome interaction rules in order to improve modeling accuracy, so a modeling task can carry a huge time overhead even for a professional modeler.
To address this problem, sketch-based three-dimensional modeling techniques provide an efficient way to build geometric models from hand-drawn sketches as input. Early sketch-based modeling introduced a gesture-based approach to constructive solid geometry (CSG) modeling that enables rapid construction of simple three-dimensional models. Sketch-based three-dimensional modeling mainly takes two-dimensional sketch curves as input and outputs a corresponding three-dimensional model according to the sketched contour information. The approach was proposed and designed primarily for users who can draw but lack experience with three-dimensional modeling software. In recent years, sketch-based modeling has been used for an increasing range of three-dimensional modeling tasks and is widely applied in special fields such as animal modeling, game character modeling, clothing design, and hair modeling. With simple two-dimensional contour strokes, a user can model complex objects with free-form geometric surfaces through a sketch input interface, reducing the time overhead of the modeling cycle in a more efficient manner.
Compared with three-dimensional modeling, computer animation has long been a focus and a difficulty of research. Computer animation exists in the form of an animation sequence; on this basis, the animation's actions are summarized with a storyboard, and fixed scenes are drawn in detail in specific key frames on the animation timeline, so each key frame must reflect the dynamic state of the whole animation at a specific moment. To ensure the continuity of the animation sequence, interpolation frames must be added between each pair of key frames, and the density of interpolation frames determines the quality of the animation. Three-dimensional animation requires more rule constraints than two-dimensional animation, and the study of three-dimensional human skeletal animation is more challenging still. In three-dimensional human skeletal animation, the human body is modeled as a chain of joints, and each fixed key frame records the coordinate position of every skeletal joint point in the world coordinate system at a specific moment. In the overall production process, the human skeletal structure must first be modeled in three dimensions; to simulate the positional relations of real human skeletal joints, the modeler must master professional three-dimensional modeling and human skeletal structure knowledge during modeling, at great labor and time cost. Second, key frame and interpolation frame models are determined by reference to the coordinate trajectories of the joint points during real human skeletal motion, followed by the necessary editing work.
Although the entire production process can be carried out with animation software, the required expertise and complex interaction rules remain a bottleneck limiting its use.
Disclosure of Invention
In view of this, the present application provides a sketch-based method for automatically generating three-dimensional human skeleton animation. The system uses sketch modeling techniques and deep learning to automatically generate three-dimensional human skeleton animation from any two input human-action sketch images, improving production efficiency by reducing the frequency and complexity of interaction with the system.
The application is realized by the following technical scheme:
a three-dimensional human skeleton animation automatic generation method based on sketch comprises the following steps
Step 1, realizing interaction with a user, and receiving a human body action sketch image file input by the user;
step 2, calling a background model according to the sketch image file;
step 3, according to the human body action information in the head and tail frames in the animation sequence, completing the automatic synthesis of the missing interpolation frame in the animation sequence, thereby realizing the generation of the complete animation sequence;
and 4, rendering the generated data of the complete animation sequence to a screen to obtain the visualized three-dimensional human skeleton animation.
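The data flow through the four steps can be sketched as follows. The function bodies are illustrative stand-ins only: the patent's steps 2 and 3 use trained networks, whereas here recognition is a pass-through and interpolation is linear, purely to make the pipeline concrete; all names are hypothetical.

```python
import numpy as np

def recognize_sketch(sketch_pose):
    # Step 2 stand-in: a real system maps the sketch image to 3D joint
    # coordinates through the trained recognition network; here the pose
    # is passed through unchanged for illustration.
    return np.asarray(sketch_pose, dtype=float)

def synthesize_interpolation(head_pose, tail_pose, n_frames=5):
    # Step 3 stand-in: the patent uses a neural network; plain linear
    # interpolation is shown here only to make the data flow concrete.
    ts = np.linspace(0.0, 1.0, n_frames)
    return np.stack([(1 - t) * head_pose + t * tail_pose for t in ts])

head = recognize_sketch([0.0, 0.0, 0.0])   # head-frame joint coordinates (toy)
tail = recognize_sketch([1.0, 1.0, 1.0])   # tail-frame joint coordinates (toy)
sequence = synthesize_interpolation(head, tail)
print(sequence.shape)  # (5, 3)
```

Step 4 would then render each row of `sequence` as one animation frame.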
Further, in the step 2, calling a background model according to the sketch image file specifically includes:
step 201, performing an image preprocessing method according to a human body action sketch image input by a user to obtain sketch image data conforming to a network input format;
step 202, formulating a sketch image recognition network label to obtain an output result for describing the recognition capability of the network to the sketch image;
step 203, performing sketch image recognition network training, obtaining a specific sketch recognition result according to an input sketch image and realizing mapping to coordinate information of a three-dimensional space skeleton joint point;
step 204, obtaining coordinate information of bone joint points in the three-dimensional space of the human body.
Further, in step 201, the image preprocessing method specifically includes:
and carrying out image transformation on sketch image data input by a user by sequentially using a contour detection method, a filling method and an equal-proportion scaling method so as to obtain network input meeting a three-dimensional human skeleton model reconstruction model based on the sketch image.
Further, the image transformation is performed on the sketch image data input by the user by sequentially using a contour detection method, a filling method and an equal-proportion scaling method, and the method specifically comprises the following steps:
human body closed curve contour detection is carried out on a sketch image input by a user, so that a main area part capable of describing human body actions in the image is obtained;
filling the closed part according to the human body curve contour obtained by contour detection to improve the description capability of the image on human body actions;
the original image is converted into a network input format which satisfies a three-dimensional human skeleton model reconstruction model based on a sketch image, and unnecessary detail information in the original image is shielded.
Further, in step 202, the formulating of sketch image recognition network labels specifically includes:
The labels are divided into three levels according to the relations among human body actions: action category, action pattern category, and action frame category. The descriptive power of the three labels over the action image runs from coarse to fine, and the final action frame category label describes the action information of a single frame in a specific animation sequence.
Further, in step 203, the performing sketch image recognition network training specifically includes:
the method for identifying and classifying the action sketch images by using convolutional neural network layering according to the formulation of sketch identification network labels comprises the following steps: training mode, parameter adjustment and error function setting;
the training mode uses tensorflow as a deep learning tool, the network is trained step by using a layered classification mode, a model fusion mode is used in the training to decompose a model which is difficult to train and train a weak classification model, and each part of models are fused to obtain a final result.
The parameter adjustment is used for adjusting parameters of each part of the network to achieve the optimal effect, and the parameters comprise: convolutional kernel size, weight and bias initialization settings, convolutional layer number, optimizer settings, and learning rate initialization settings;
the number of the convolution layers determines the dimension and the network calculated amount of the feature representation, the more the feature representations of the convolution layers are abstracted, the larger the calculated amount is, the less the feature representations of the convolution layers are close to the original data, and the smaller the calculated amount is.
Further, in step 3, the automatic synthesis of the missing interpolation frame in the animation sequence is performed according to the human motion information in the head and tail frames in the animation sequence, which specifically includes:
when any two motion frame data are given, the given data are used as the head and tail frames of a section of complete three-dimensional human skeleton animation, interpolation frame data missing between the two frames are automatically generated, and the used methods comprise a skeleton animation feature extraction method and an interpolation frame automatic synthesis method.
Furthermore, the skeleton animation feature extraction method uses a convolution self-coding network structure to undergo a data regeneration process through coding and decoding operations, input data of the network are complete skeleton animation sequence data, final output of the network is regeneration data of an animation sequence, an optimization strategy in network training is to minimize variance distance between original data and the regeneration data, and a trained model can realize feature extraction of the original skeleton animation through coding calculation.
Furthermore, the automatic synthesis method of the interpolation frame gradually restores the variation trend between actions by using a mode of combining a convolution feedforward network and interpolation operation, and finally generates a complete skeleton animation sequence, wherein the network layer structurally comprises an interpolation layer, a convolution layer and an activation layer which use a nearest neighbor interpolation strategy.
Further, the interpolation layer using the nearest neighbor interpolation strategy specifically includes: the nearest neighbor interpolation strategy enlarges the size of the original data while retaining the original data information; by passing step by step through the interpolation layers during network calculation, the data reaches the size required by the final output format;
the convolution layer abstracts and twists the data passing through the interpolation layer to realize fitting to target output;
the activation layer is used for increasing the nonlinearity of the network, reducing the interdependence relation in network parameters and relieving the over fitting of the network.
Compared with the prior art, the invention has the advantages that:
1) A network model combining nearest-neighbor interpolation with a convolution strategy is proposed for the automatic interpolation-frame synthesis model, and an error function is set according to the variance distance between the real animation sequence and the network model's output, so that the model can be improved by minimizing the error value during the network training stage.
2) The sketch three-dimensional reconstruction model and the automatic three-dimensional human skeleton animation synthesis model are packaged into dedicated functional modules using a layered architecture, realizing a complete interaction function.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention.
FIG. 1 is a block diagram showing the constitution of an automatic generation method of the present invention;
FIG. 2 is a list of images generated by rendering three-dimensional skeletal joint point position information;
FIG. 3 is a schematic diagram of a two-dimensional sketch image recognition model employing a hierarchical structure;
FIG. 4 is a graph of error versus accuracy for training;
FIG. 5 is a schematic diagram of a three-dimensional bone model reconstruction from an input sketch of hand-drawn actions;
FIG. 6 is a schematic diagram of a skeleton animation automatic synthesis model structure;
FIG. 7 is a graph of error trend during training of an animated feature extraction model;
FIG. 8 is a diagram of an animation feature extraction model test result;
FIG. 9 is an error trend graph of the training period of the interpolated frame automatic synthesis network model;
FIG. 10 is a system timing diagram of an automatic generation method;
fig. 11 is an exemplary diagram of the operation result of the automatic generation method.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
The invention will now be described in further detail with reference to the drawings and examples.
In producing three-dimensional animation, sketch drawing is simpler and easier for the user than traditional animation software: the user need not master extensive professional art knowledge or interaction rules, and this production concept, by simplifying the input form to a two-dimensional drawing, can greatly improve the efficiency of the whole production cycle. On the other hand, with the successful use of deep learning, particularly convolutional neural network (Convolutional Neural Networks, CNN) models, in computer graphics and computer vision, and with the availability of large online motion capture databases, the production of human skeletal animation is also moving in a more intelligent direction. Deep learning algorithms can extract and learn the positional relations between motions and joint points in a motion capture database, after which a computer program automatically generates key frames and interpolation frames in place of the traditional, complex manual editing of skeleton animation, further reducing interaction complexity and improving production efficiency.
Fig. 1 illustrates a framework for a sketch-based three-dimensional human skeletal animation automatic generation method in accordance with embodiment 1 of the present disclosure.
Referring to fig. 1, embodiment 1 of the present disclosure provides a sketch-based three-dimensional human skeleton animation automatic generation method. It should be noted that the steps illustrated in the flow chart may be performed in a computer system, such as by a set of computer-executable instructions, and in some cases may be performed in a different order than described here. Referring to fig. 1, the sketch-based three-dimensional human skeleton animation automatic generation method comprises the following components:
head-to-tail frame human motion sketch image: sketch images which are input by a user and can describe human body actions are respectively used for describing the actions of a first frame and a last frame of an animation sequence to be generated;
image preprocessing: preprocessing an original input image by using contour detection, filling and equal proportion scaling respectively, so as to meet the input size of a subsequent network and amplify key pixel information in the image;
action type classification network: a model fusion strategy classifies action categories; each weak classifier consists of two convolution layers and two fully connected layers, with 32 and 64 convolution kernels of size 3×3 respectively and 1024 nodes in each fully connected layer; a single-layer BP network serves as the fusion device for model fusion;
action pattern form classification network: a single-step convolutional neural network with two convolution layers and two fully connected layers, with 32 and 64 convolution kernels of size 3×3 respectively and 1024 nodes in each fully connected layer, used to identify the specific action pattern form within a given action category;
key frame positioning network: a single-step convolutional neural network consisting of two convolution layers and two fully connected layers; the convolution layers have 32 and 128 kernels of size 3×3 respectively, the fully connected layers have 2056 nodes, and the network is used to identify specific action frame information;
three-dimensional human skeleton joint point coordinate information: the key frame positioning network outputs the final classification result of the input sketch, and the coordinate information of the three-dimensional human skeleton joint points is obtained from the known mapping between classification results and three-dimensional skeleton joint coordinates;
automatic synthesis network of three-dimensional human skeleton animation interpolation frames: a feedforward network structure combining nearest-neighbor interpolation with a convolution strategy automatically synthesizes complete motion trend information from the known head and tail frame motion information;
bone animation feature extraction and recovery model: the model obtains animation feature information from an existing animation sequence and can recover the final complete animation sequence from that feature information. The output of the interpolation-frame automatic synthesis network yields a segment of complete action feature information, and the recovery model then produces the complete skeleton animation sequence data.
Complete skeletal animation sequence data: the complete skeleton animation sequence data is finally generated from the calculations of the interpolation-frame automatic synthesis network and the skeleton animation feature extraction and recovery model;
three-dimensional rendering: according to absolute position information of human skeleton joint points expressed frame by frame in the obtained skeleton animation sequence data under a three-dimensional coordinate system, performing three-dimensional rendering by using OpenGL as a graphic class library;
three-dimensional human skeleton animation: and performing three-dimensional rendering according to the frame by frame, controlling the refreshing frequency of the screen, and finally obtaining three-dimensional human skeleton animation and outputting the three-dimensional human skeleton animation to the screen.
Referring to fig. 2, to provide a training data set for the action type classification network, the action pattern form classification network, and the key frame positioning network, the three-dimensional animation sequences in a human motion capture database are used as raw data and parsed frame by frame to obtain the skeleton structure in three-dimensional space. Rendering converts each three-dimensional skeleton structure into a list of two-dimensional frame motion images, and a mapping is established to the position information of each skeleton node in the three-dimensional coordinate system.
Referring to fig. 3, the model structure of the sketch classification network is given. The action category classification network uses a model fusion method in which each weak classification network is formed of two convolution layers and two fully connected layers, with softmax as the classifier and the cross-entropy function shown in formula (1) as the error function,
Loss(p, q) = -∑_j p_j · log q_j    (1)
performing model fusion by using a single-layer BP network according to a method shown in a formula (2);
p′ = ∑_i w_i · p_i + b    (2)
where p_i represents the classification result of the i-th weak classifier, p′ represents the final classification result of the strong classifier, w_i represents the weight of the output of weak classifier i, and b represents the offset.
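A small numeric illustration of the fusion in formula (2), with made-up softmax outputs and learned weights:

```python
import numpy as np

# Three weak-classifier softmax outputs over three classes (made-up values).
p = np.array([[0.7, 0.2, 0.1],   # weak classifier 1
              [0.6, 0.3, 0.1],   # weak classifier 2
              [0.2, 0.5, 0.3]])  # weak classifier 3
w = np.array([0.5, 0.3, 0.2])    # learned fusion weights of the BP layer
b = 0.0                          # learned offset

p_fused = w @ p + b              # p' = sum_i w_i * p_i + b
print(p_fused)                   # [0.57 0.29 0.14]
```

The fused scores favor class 0 here, even though one weak classifier voted for class 1, because its fusion weight is small.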
The action pattern form classification network and the key frame positioning classification network are both composed of a convolutional neural network composed of two convolutional layers and two fully-connected layers, and a softmax is used as a classifier, and a cross entropy function shown in a formula (1) is used as an error function.
Referring to fig. 4, in order to demonstrate the convergence of the classification network and monitor the convergence change of the network during training, the variation trend of the error and accuracy during the training of the three networks is recorded, respectively.
The test results shown in the following table were obtained by testing three kinds of classification networks using the test set data.
Referring to fig. 5, to demonstrate the effectiveness of the model, OpenGL is used as a tool library to perform three-dimensional rendering according to the specific coordinate positions of the bone joint points in three-dimensional space identified from the input sketch, obtaining a visualized three-dimensional bone model.
Referring to fig. 6, the structure of the three-dimensional human skeleton animation automatic synthesis model is given. The whole model can be divided into two parts. The first part (left) is the motion information feature extraction and restoration unit, in which the feature-extracted original motion data is represented in the hidden unit in the form of a motion manifold. To allow recovery from feature data in the motion manifold back to the original motion data, this unit is designed as a convolutional self-coding network that can perform both feature extraction and feature recovery. The second part (right) is the motion trend information recovery unit; it connects to the top of the motion information extraction module, gradually completes the prediction and recovery of the missing motion trend feature information during calculation, and maps its output to the hidden unit in the form of a motion manifold. This part of the network uses a feedforward structure combining nearest-neighbor interpolation with a convolution strategy. Finally, the predicted and restored feature information is passed through the restoration operation of the first unit to recover the complete motion information, realizing interpolation and completion of the missing frames of the animation sequence.
Wherein the coding calculation in the convolutional self-coding network is as shown in formula (3):
H = EC(X) = ReLU(Ψ(ReLU(Ψ(X * W0 + b0)) * W1 + b1))    (3)
where W0 and W1 are the weights and b0 and b1 the offsets, m = 256 represents the number of hidden units, ω0 represents the convolution kernel size of 3×3, the operation * represents convolution, and Ψ represents a max-pooling operation with a sampling kernel size of 3 and a sliding step of 2 in the first dimension, which downsamples the input data and halves the length of its first dimension. Finally, ReLU is used as the activation function to increase the nonlinearity of the network model. Through the complete encoding operation EC, the input data X is abstracted to obtain the action manifold H, which is stored in the hidden unit.
The decoding calculation in the convolutional self-coding network is shown in equation (4):
in the decoding calculation, the action manifold H in the hidden unit is taken as input, W 0 T And W is equal to 1 T Respectively is weight W 0 And W is 1 Is subjected to operationIn practice, deconvolution is performed.
The error function of the motion feature extraction network based on the convolutional self-coding network structure is shown in formula (5):
Loss(X) = ‖X − DC(EC(X))‖²    (5)
where X represents the original skeletal animation fragment.
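A minimal NumPy sketch of one encoding layer of the convolutional self-coding network: convolution over the time dimension, max pooling, then ReLU. The pooling here uses size 2 with stride 2 (a simplification of the text's size-3, stride-2 pooling), and all shapes are toy values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    # pool along the first (time) dimension, halving its length
    t = (x.shape[0] // size) * size
    return x[:t].reshape(t // size, size, *x.shape[1:]).max(axis=1)

def conv1d(x, w, b):
    # x: (T, C_in), w: (k, C_in, C_out), 'same' padding along time
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])]) + b

def encode(X, W0, b0):
    # one encoding layer: ReLU(pool(X * W0 + b0)), per the encoding operation
    return relu(max_pool(conv1d(X, W0, b0)))

X = np.arange(16.0).reshape(8, 2)                       # 8 frames, 2 channels (toy)
H = encode(X, np.full((3, 2, 4), 0.1), np.zeros(4))     # 4 hidden channels (toy)
print(H.shape)  # (4, 4)
```

Pooling halves the time dimension exactly as the text describes for Ψ, and the trained stack of such layers yields the action manifold stored in the hidden unit.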
Referring to fig. 7, an error trend based on an animation feature extraction model of a self-encoding network is given.
Referring to fig. 8, test results of an animation feature extraction model based on a self-encoding network are given.
The feedforward network calculation formula combining nearest neighbor interpolation and convolution strategy is shown as formula (6):
H_l = ReLU(NN(H_{l−1}, s_l) * W_l + b_{0l}),  l = 1, …, 5,  with H_0 = H    (6)
where the function NN(x, s) enlarges the first dimension of x to size s by nearest-neighbor repetition (for example, NN([h1, h2], 4) = [h1, h1, h2, h2]); W_1, …, W_5 represent the convolution weights of each layer, where m1 = 64, m2 = 64, and m3 = 128 are the channel numbers, ω1 and ω2 represent convolution kernels of size 3×3, ω3 and ω4 kernels of size 5×5, and ω5 a kernel of size 7×7; b01, b02, b03, b04, and b05 represent the biases of the convolution operation in each layer. To increase the nonlinearity of the network, each layer's result is nonlinearly processed using ReLU as the activation function. The error function during network training is shown in formula (7):
Loss = ‖FF(H) − Ĥ‖²    (7)
This formula measures the variance of the distance between the motion-trend information predicted by the feed-forward network and the real motion-trend feature data, thereby reflecting the quality of the feed-forward network's motion-manifold prediction.
Referring to fig. 9, an error trend for an automatic synthesis network model training period using interpolation frames of nearest neighbor interpolation in combination with a convolution strategy is given.
Referring to fig. 10, a system timing diagram of a sketch-based three-dimensional human skeleton animation automatic generation method is provided.
Referring to fig. 11, an example of the result of the system operation is given, and automatic conversion from given two human motion sketch images to three-dimensional human skeleton animation can be achieved.
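The overall flow in the system timing diagram can be summarised as a four-step pipeline. The sketch below is a schematic composition only; every function name is a placeholder, not an API from the patent:

```python
def generate_animation(first_sketch, last_sketch,
                       preprocess, recognizer, synthesizer, renderer):
    """Hypothetical pipeline mirroring steps 1-4 of the method."""
    head = recognizer(preprocess(first_sketch))   # sketch -> 3-D joint coordinates
    tail = recognizer(preprocess(last_sketch))
    sequence = synthesizer(head, tail)            # fill the missing interpolation frames
    return renderer(sequence)                     # draw the complete skeleton animation
```

Each stage corresponds to one of the invention's components: image preprocessing, the sketch-recognition network, the interpolation-frame synthesis network, and the renderer.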
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the methods described above may be implemented by a program that instructs associated hardware, and the program may be stored on a computer readable storage medium such as a read-only memory, a magnetic or optical disk, etc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits, and accordingly, each module/unit in the above embodiments may be implemented in hardware or may be implemented in a software functional module. The present invention is not limited to any specific form of combination of hardware and software.
It is to be understood that those skilled in the art may make various other embodiments, changes, and modifications of the present invention without departing from its spirit and scope as defined in the following claims.

Claims (6)

1. The automatic three-dimensional human skeleton animation generation method based on sketch is characterized by comprising the following steps:
Step 1, realizing interaction with a user, and receiving a human body action sketch image file input by the user;
step 2, calling a background model according to the sketch image file;
step 3, according to the human body action information in the head and tail frames in the animation sequence, completing the automatic synthesis of the missing interpolation frame in the animation sequence, thereby realizing the generation of the complete animation sequence;
step 4, rendering the generated data of the complete animation sequence to a screen to obtain a visual three-dimensional human skeleton animation;
in the step 2, a background model is called according to the sketch image file, which specifically includes:
step 201, performing an image preprocessing method according to a human body action sketch image input by a user to obtain sketch image data conforming to a network input format;
step 202, formulating a sketch image recognition network label to obtain an output result for describing the recognition capability of the network to the sketch image;
step 203, performing sketch image recognition network training, obtaining a specific sketch recognition result according to an input sketch image and realizing mapping to coordinate information of a three-dimensional space skeleton joint point;
step 204, obtaining coordinate information of bone joint points in a human body three-dimensional space;
the image preprocessing method in step 201 specifically includes:
sequentially using a contour detection method, a filling method and an equal-proportion scaling method to carry out image transformation on sketch image data input by a user so as to obtain network input meeting a three-dimensional human skeleton model reconstruction model based on sketch images;
the method comprises the steps of sequentially carrying out image transformation on sketch image data input by a user by using a contour detection method, a filling method and an equal-proportion scaling method, and specifically comprises the following steps:
human body closed curve contour detection is carried out on a sketch image input by a user, so that a main area part capable of describing human body actions in the image is obtained;
filling the closed part according to the human body curve contour obtained by contour detection to improve the description capability of the image on human body actions;
converting the original image into a network input format meeting a three-dimensional human skeleton model reconstruction model based on a sketch image, and shielding unnecessary detail information in the original image;
in step 202, the creating a sketch image identifies a network tag, which specifically includes:
the labels are divided into three layers according to the relation among human body actions, namely action category, action pattern category and action frame category, the description capacity of the three labels on the action image is from thick to thin, and the final action frame category label is used for describing the action information of a single frame in a specific animation sequence.
2. The method according to claim 1, wherein in step 203, the training of the sketch image recognition network specifically comprises:
the method for identifying and classifying the action sketch images by using convolutional neural network layering according to the formulation of sketch identification network labels comprises the following steps: training mode, parameter adjustment and error function setting;
in the training mode, TensorFlow is used as the deep-learning tool and the network is trained step by step in a hierarchical-classification manner; model fusion is used during training to decompose a model that is difficult to train into weak classification models, which are then fused to obtain the final result;
the parameter adjustment is used for adjusting parameters of each part of the network to achieve the optimal effect, and the parameters comprise: convolutional kernel size, weight and bias initialization settings, convolutional layer number, optimizer settings, and learning rate initialization settings;
the number of convolution layers determines the dimensionality of the feature representation and the computational cost of the network: more convolution layers produce more abstract feature representations at a larger computational cost, while fewer convolution layers keep the feature representation closer to the original data at a smaller computational cost.
3. The method according to claim 1, wherein in step 3, the automatic synthesis of the missing interpolation frame in the animation sequence is performed according to the human motion information in the head and tail frames in the animation sequence, specifically including:
when any two motion-frame data are given, the given data are used as the head and tail frames of a complete three-dimensional human skeleton animation segment, and the interpolation-frame data missing between the two frames are automatically generated; the methods used include a skeleton-animation feature extraction method and an automatic interpolation-frame synthesis method.
4. The method for automatically generating three-dimensional skeletal animation of a human body according to claim 3, wherein,
the skeleton-animation feature extraction method uses a convolutional self-coding network structure that undergoes a data-regeneration process through encoding and decoding operations. The input to the network is complete skeleton-animation sequence data, and the final output is the regenerated data of the animation sequence; the optimization strategy in network training is to minimize the variance distance between the original data and the regenerated data. The trained model can extract features of the original skeleton animation through the encoding calculation.
5. The method for automatically generating three-dimensional skeletal animation of a human body according to claim 3, wherein,
the automatic interpolation-frame synthesis method gradually restores the trend of change between actions by combining a convolutional feed-forward network with interpolation operations, finally generating a complete skeleton-animation sequence. Structurally, the network comprises an interpolation layer using a nearest-neighbour interpolation strategy, a convolution layer, and an activation layer.
6. The method for automatically generating three-dimensional skeletal animation of a human body according to claim 5, wherein,
the interpolation layer using the nearest-neighbour interpolation strategy specifically comprises: the nearest-neighbour interpolation strategy enlarges the size of the original data while retaining the original data information, and by passing through interpolation layers step by step during network calculation, the final output data format size is met;
the convolution layer abstracts and twists the data passing through the interpolation layer to realize fitting to target output;
the activation layer is used to increase the nonlinearity of the network, reduce the interdependence among network parameters, and alleviate overfitting of the network.
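Claims 5 and 6 describe a layer stack of interpolation, convolution, and activation. A toy forward pass consistent with that structure might be sketched as follows; the shapes, layer count, and function names are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def nn_upsample(x, s):
    # interpolation layer: nearest neighbour, enlarges the first dimension
    # to s while retaining the original values
    return x[np.arange(s) * x.shape[0] // s]

def conv_relu(x, w, b):
    # convolution layer ('same' padding along time) plus ReLU activation layer
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])]) + b
    return np.maximum(out, 0.0)

def synthesize(head_tail, layers, target_len):
    # alternate interpolation + convolution + activation until the sequence
    # reaches the full animation length
    x = head_tail
    for w, b in layers:
        x = conv_relu(nn_upsample(x, min(2 * x.shape[0], target_len)), w, b)
    return x
```

Starting from the two given head and tail frames, each pass doubles the sequence length (capped at the target) and then refines it with a convolution, which matches the claimed interpolation-then-convolution layering.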
CN201910597737.9A 2019-07-04 2019-07-04 Sketch-based three-dimensional human skeleton animation automatic generation method Active CN110310351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910597737.9A CN110310351B (en) 2019-07-04 2019-07-04 Sketch-based three-dimensional human skeleton animation automatic generation method


Publications (2)

Publication Number Publication Date
CN110310351A CN110310351A (en) 2019-10-08
CN110310351B true CN110310351B (en) 2023-07-21

Family

ID=68078118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910597737.9A Active CN110310351B (en) 2019-07-04 2019-07-04 Sketch-based three-dimensional human skeleton animation automatic generation method

Country Status (1)

Country Link
CN (1) CN110310351B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401272B (en) * 2020-03-19 2021-08-24 支付宝(杭州)信息技术有限公司 Face feature extraction method, device and equipment
CN111862276B (en) * 2020-07-02 2023-12-05 南京师范大学 Automatic skeletal animation production method based on formalized action description text
CN113505751B (en) * 2021-07-29 2022-10-25 同济大学 Human skeleton action recognition method based on difference map convolutional neural network
CN114067091B (en) * 2022-01-17 2022-08-16 深圳慧拓无限科技有限公司 Multi-source data labeling method and system, electronic equipment and storage medium
CN114842155B (en) * 2022-07-04 2022-09-30 埃瑞巴蒂成都科技有限公司 High-precision automatic bone binding method
CN116704553B (en) * 2023-06-13 2024-01-26 长江大学 Human body characteristic identification auxiliary system based on computer vision technology

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101958007A (en) * 2010-09-20 2011-01-26 南京大学 Three-dimensional animation posture modeling method by adopting sketch
WO2017084204A1 (en) * 2015-11-19 2017-05-26 广州新节奏智能科技有限公司 Method and system for tracking human body skeleton point in two-dimensional video stream
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102252730B1 (en) * 2015-02-06 2021-05-18 한국전자통신연구원 Apparatus and methdo for generating animation


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
A survey of hierarchical classification across different application domains; Silla C N, Freitas A A; 《Data Mining and Knowledge Discovery》; 20111230; pp. 31-72 *
Shape synthesis from sketches via procedural models and convolutional networks; HUANG Haibin, KALOGERAKIS E, YUMER E, et al; 《IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS》; 20170801; Vol. 23, No. 8; pp. 2003-2013 *
A survey of sketch-based three-dimensional modeling technology; Ma Hao, Li Shuqin, Ding Meng, Meng Kun; 《Intelligent Computer and Applications》; 20190331; Vol. 9, No. 2; pp. 169-171 *
A multi-scale differential-pattern similarity corner detection algorithm; Wang Fuping et al.; 《Opto-Electronic Engineering》; 20161015; No. 10; pp. 56-62 *
An Adaboost large-scale image classification algorithm with deep feature learning; Wang Junling et al.; 《Video Engineering》; 20171231; pp. 40-45 *
Key human posture recognition for aiding indoor positioning; Liu Xu et al.; 《Science Technology and Engineering》; 20170428; No. 12; pp. 212-217 *

Also Published As

Publication number Publication date
CN110310351A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
CN110310351B (en) Sketch-based three-dimensional human skeleton animation automatic generation method
Lewis et al. Practice and theory of blendshape facial models.
US11348314B2 (en) Fast and deep facial deformations
CN113344777B (en) Face changing and replaying method and device based on three-dimensional face decomposition
CN109829972B (en) Three-dimensional human standard skeleton extraction method for continuous frame point cloud
Tang et al. Skeletonnet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images
Zhang et al. Combining active learning and local patch alignment for data-driven facial animation with fine-grained local detail
CN115170559A (en) Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding
Poursaeed et al. Neural puppet: Generative layered cartoon characters
Lombardi et al. Latenthuman: Shape-and-pose disentangled latent representation for human bodies
CN113989928A (en) Motion capturing and redirecting method
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
Lannan et al. Filter guided manifold optimization in the autoencoder latent space
RU2713695C1 (en) Textured neural avatars
Tejera et al. Animation control of surface motion capture
CN115578298A (en) Depth portrait video synthesis method based on content perception
Zhu et al. Attention-Based Recurrent Autoencoder for Motion Capture Denoising
Zhang et al. Hair-gans: Recovering 3d hair structure from a single image
Regateiro et al. Deep4d: A compact generative representation for volumetric video
CN111862276A (en) Automatic skeleton animation production method based on formalized action description text
Li et al. 3D textured shape recovery with learned geometric priors
Kang et al. F2RPC: Fake to Real Portrait Control from a Virtual Character
de Aguiar et al. Representing and manipulating mesh-based character animations
Tian et al. Augmented Reality Animation Image Information Extraction and Modeling Based on Generative Adversarial Network
WO2022222058A1 (en) Multi-hairstyle real-time animation method based on neural network interpolator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant