CN117152311A - Three-dimensional expression animation editing method and system based on double-branch network - Google Patents

Three-dimensional expression animation editing method and system based on double-branch network

Info

Publication number
CN117152311A
Authority
CN
China
Prior art keywords
editing
expression
network
vertex
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310967179.7A
Other languages
Chinese (zh)
Inventor
迟静
任明国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Finance and Economics
Original Assignee
Shandong University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Finance and Economics filed Critical Shandong University of Finance and Economics
Priority to CN202310967179.7A priority Critical patent/CN117152311A/en
Publication of CN117152311A publication Critical patent/CN117152311A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G06T17/205 Re-meshing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T2219/20 Indexing scheme for editing of 3D models
    • G06T2219/2024 Style variation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • Architecture (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of animation editing and provides a three-dimensional expression animation editing method and system based on a dual-branch network. The technical scheme is as follows: a high-frequency region division module automatically identifies the high-frequency facial regions using curvature together with a K-Means clustering algorithm improved by a new spatio-temporal correlation criterion, which improves the rationality and accuracy of region division. A coarse-editing branch network edits the whole facial mesh to generate the basic expression, while a fine-editing branch network edits the high-frequency facial regions to refine expression details; combining the two guarantees both the accuracy of expression editing and the running efficiency of the model. The introduction of spatially varying convolution allows both branch networks to extract spatial features directly from the irregular three-dimensional facial mesh, further improving the accuracy of expression editing. The invention generates expressions that satisfy the user's editing requirements and are realistic, natural, and rich in detail.

Description

Three-dimensional expression animation editing method and system based on double-branch network
Technical Field
The invention belongs to the technical field of animation editing, and particularly relates to a three-dimensional expression animation editing method and system based on a double-branch network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Three-dimensional expression animation editing is a research hotspot and difficulty in computer graphics, computer animation, virtual reality, and related fields. Human expressions are rich and varied, and even subtle expression changes play an important role in conveying information; at the same time, because people are so familiar with human faces, any unnatural expression is easily perceived. How to synthesize realistic, natural expressions is therefore a highly challenging problem in three-dimensional expression animation editing research. Traditional expression animation production requires a large number of control elements on the animated model and entails a huge workload. Providing a convenient mode of operation that allows users to complete expression editing simply, in real time, and efficiently, thereby lowering the professional threshold of expression animation editing and improving expression synthesis efficiency, is another key problem in this research area.
Traditional three-dimensional expression animation editing methods can be divided into physics-based, parameterized, and sample-based approaches. Physics-based editing mainly builds skeleton or muscle structures on the three-dimensional face and realizes expression changes by controlling the displacement of those structures. These methods require complicated operations by animators, are labor-intensive, and have difficulty simulating skin details such as wrinkles. Parameterized editing constructs a facial-expression change function from observed expression changes and adjusts the facial expression by means of that function. Sample-based editing mostly adopts blend-shape deformation: the face model is expressed as a weighted linear combination of several known sample models, and new expressions are generated by modifying the shapes of the sample models or adjusting their fusion weights. This approach is simple to implement but requires a large number of samples.
With the development of deep learning, researchers have applied it to three-dimensional expression animation editing, synthesizing facial expressions by extracting the spatial features of facial mesh deformation and its influencing factors from a sample database. Conventional convolutional neural networks (CNNs) are widely used to capture the spatial features of regular grids, but facial meshes typically have an irregular topology, so convolution kernels designed for regular two-dimensional or three-dimensional grid data cannot be applied to them directly. One common compromise is to map the three-dimensional mesh into a predefined UV space, train a classical two-dimensional CNN to learn features in UV space, and then map the result back to the three-dimensional mesh; this approach, however, inevitably suffers from parameterization distortion and from seams in the UV map. Another way to apply convolutional networks to three-dimensional meshes is to define a mesh convolution operation; for example, the spatially varying convolution method linearly samples the convolution kernel when convolving an irregular mesh and thereby learns the spatial features of the mesh. However, these methods still suffer from result accuracy that depends on the size of the sample data set, long training times, and varying degrees of loss of expression realism.
In summary, existing three-dimensional expression animation editing methods still fall short in expression-detail simulation accuracy, expression realism, and convenience of user operation.
Disclosure of Invention
In order to solve at least one of the technical problems in the background art, the invention provides a three-dimensional expression animation editing method and system based on a dual-branch network, which allow a user to freely designate a small number of control points on the facial mesh and drag them, so that facial expressions can be edited intuitively, simply, and in real time, and the generated new expression satisfies the user's editing requirements while being realistic, natural, and rich in detail.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The first aspect of the invention provides a three-dimensional expression animation editing method based on a dual-branch network, comprising the following steps:
calculating the Gaussian curvature of each vertex of the three-dimensional facial mesh, clustering the vertices after curvature screening, and determining the high-frequency facial regions;
performing coarse editing through a coarse-editing network that combines the three-dimensional facial mesh and the control-point constraints, wherein spatially varying convolution and a sampling residual layer are introduced into all sampling residual blocks of the coarse-editing network to process the irregular three-dimensional facial mesh and approximate the basic expression desired by the user;
performing fine editing through a fine-editing network based on the high-frequency facial regions and the control-point constraints, wherein the sampling residual blocks of the fine-editing network adopt the same design as those of the coarse-editing network, and the vertex positions within the set of high-frequency facial regions are changed by training the parameters of the fine-editing network, so that each high-frequency region is finely deformed to approximate the expression details desired by the user;
and fusing the basic expression with the expression details to obtain the final new expression.
A second aspect of the present invention provides a three-dimensional expression animation editing system based on a dual-branch network, comprising:
a high-frequency region determination module configured to: calculate the Gaussian curvature of each vertex of the three-dimensional facial mesh, cluster the vertices after curvature screening, and determine the high-frequency facial regions;
a coarse editing module configured to: perform coarse editing through a coarse-editing network that combines the three-dimensional facial mesh and the control-point constraints, wherein spatially varying convolution and a sampling residual layer are introduced into all sampling residual blocks of the coarse-editing network to process the irregular three-dimensional facial mesh and approximate the basic expression desired by the user;
a fine editing module configured to: perform fine editing through a fine-editing network based on the high-frequency facial regions and the control-point constraints, wherein the sampling residual blocks of the fine-editing network adopt the same design as those of the coarse-editing network, and the vertex positions within the set of high-frequency facial regions are changed by training the parameters of the fine-editing network, so that each high-frequency region is finely deformed to approximate the expression details desired by the user;
and an expression generation module configured to: fuse the basic expression with the expression details to obtain the final new expression.
A third aspect of the present invention provides a computer-readable storage medium.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the dual-branch-network-based three-dimensional expression animation editing method according to the first aspect.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps in the dual-branch-network-based three-dimensional expression animation editing method according to the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
1. The coarse editing module generates the basic expression and the fine editing module enriches and refines the expression details; combining the two ensures both the accuracy of expression editing and the running efficiency of the model. The introduction of spatially varying convolution allows both branch editing modules to extract spatial features directly from the irregular three-dimensional facial mesh, further improving the accuracy of expression editing.
2. The high-frequency region division method proposed by the invention automatically identifies the high-frequency facial regions using curvature together with a K-Means clustering algorithm improved by the new spatio-temporal correlation criterion, which greatly improves the rationality and accuracy of region division.
3. The invention introduces a new loss function built with a vertex-normal-vector constraint, which effectively improves the accuracy of the whole model and solves the seam problem that arises when the high-frequency regions are fused back. The model generates expressions that satisfy the user's editing requirements and are realistic, natural, and rich in detail.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a three-dimensional expression animation editing method based on a dual-branch network provided by an embodiment of the invention;
FIG. 2 is a diagram of facial high-frequency region division results provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of the coarse editing module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a sampling residual block provided by an embodiment of the invention;
FIG. 5 is a schematic diagram of the fine editing module according to an embodiment of the present invention;
FIG. 6 shows partial expression editing results of the method of the present invention;
FIG. 7 is a comparison of the expression editing effects of different methods provided by the embodiment of the invention on the Ray model;
FIG. 8 is a comparison of the expression editing effects of different methods provided by the embodiment of the invention on the Monardo model;
FIG. 9 is a comparison of the expression editing effects of different methods provided by the embodiment of the invention on the Yoda model;
FIG. 10 is a diagram of continuous expressions generated by changing the positions of control points according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, it is to be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
In order to solve the technical problems mentioned in the background art, the invention provides a three-dimensional expression animation editing method based on a dual-branch spatial convolutional neural network. Given a three-dimensional face mesh, the method allows the user to freely select control points on the mesh and easily achieve real-time editing of facial expressions by changing the positions of those control points.
The overall network architecture of the method is shown in FIG. 1 and consists of a high-frequency region division module, a coarse editing module, and a fine editing module. After the user selects and adjusts the control points, the three-dimensional facial mesh and the position changes of the control points are input into the network; the control-point position changes constrain and guide the overall deformation of the facial mesh and reflect the user's expectations for the generated new expression. First, the high-frequency region division module automatically locates the large-curvature vertices of the face using Gaussian curvature and the elbow rule, and then, from these vertices, automatically obtains the number of high-frequency regions and the range of each region using a K-Means clustering algorithm improved with the new spatio-temporal correlation criterion together with the elbow rule. Next, the whole mesh and the control-point constraints are fed into the coarse editing module to obtain a coarse-grained mesh deformation; this deformation produces a basic expression that meets the user's requirements and is fast to compute. Meanwhile, the high-frequency facial regions and the control-point constraints are fed into the fine editing module to obtain a fine-grained mesh deformation that produces fine expression details, such as skin wrinkles, meeting the user's requirements and thereby enriching the expression detail information. Finally, the basic expression obtained by the coarse editing module is fused with the detail information obtained by the fine editing module to produce the final new expression, which meets the user's expectations and is realistic and natural.
Example 1
The embodiment provides a three-dimensional expression animation editing method based on a double-branch network, which comprises the following steps:
step 1: data acquisition
Given a three-dimensional face mesh $I$ containing control points, let $V = \{v_1, v_2, \ldots, v_n\}$ be the vertex set of the face mesh $I$, where $n$ denotes the number of vertices; let $P = \{p_1, p_2, \ldots, p_m\}$ be the control-point set of the face mesh $I$, where $m$ denotes the number of control points; and let $P' = \{p'_1, p'_2, \ldots, p'_m\}$ be the new positions of the control points after adjustment by the user. Let $\Delta P = \{\Delta p_1, \Delta p_2, \ldots, \Delta p_m\}$ denote the position changes of the control points, i.e., the control-point constraint, where $\Delta p_i = p'_i - p_i$ is the displacement vector of control point $p_i$. The face mesh $I$ and the control-point constraint $\Delta P$ are input into the network model;
the object of the present invention is to calculate the new positions of the vertices in the face mesh I via a network model, denoted V '= { V' 1 ,v′ 2 ,...,v′ n }, i.e. estimating a functionThe change of the vertex position drives the deformation of the whole grid, and the new expression represented by the deformed grid I 'determined by the new vertex position V' accords with the editing expectation of a user, is rich in detail, and is true and natural.
Step 2: high frequency region division
The above process of estimating the new vertex positions of the mesh can also be understood as estimating a deformation of the mesh that makes the expression presented by the new mesh conform to the user's expectations while remaining realistic and natural. Deformations at different scales reflect different characteristics and differ in processing efficiency;
this embodiment therefore divides the deformation of the mesh into two deformations of different scales, coarse and fine.
The coarse-grained deformation targets the whole mesh; it reflects changes of basic expressions such as happiness, anger, and sadness, and can roughly approximate the expression desired by the user. The fine-grained deformation targets the high-frequency regions of the mesh; it reflects changes of expression details such as skin wrinkles and can accurately approximate the expression details desired by the user. The former involves little computation and approximates quickly; the latter is highly accurate and enables precise simulation of expression details. This hierarchical mesh deformation therefore allows the overall network model to generate fine expression details while maintaining good running efficiency.
In this embodiment, the coarse editing module is used to estimate the coarse-grained mesh deformation, i.e., to process the whole mesh, and the fine editing module is used to estimate the fine-grained mesh deformation, i.e., to process the high-frequency regions of the mesh. Obviously, whether the high-frequency regions are identified accurately has an important influence on the subsequent branch processing and on the final generated expression. On this basis, this embodiment proposes a new spatio-temporal correlation clustering criterion and a high-frequency region division method to effectively improve the rationality and accuracy of region division.
The step of dividing the high-frequency region specifically comprises the following steps:
Step 201: first, compute the Gaussian curvature of each vertex of the input mesh, automatically screen out the large-curvature vertices according to the elbow rule, and increase the weight of these vertices during clustering;
specifically, the gaussian curvature of the vertices is calculated for the input three-dimensional face mesh. Vertex v i Is:
wherein R (v) i ) Representing vertex v i Is the first order neighbor of (a)Connect vertex set, alpha i,j Representing vertex v in the jth first-order contiguous triangle i Corresponding angle, A (v i ) Represented by vertex v i The area of the polygon formed by the intersection of the perpendicular bisectors of the first-order adjoining sides.
After the curvatures are sorted by absolute value, the curvature at the elbow position is selected automatically using the elbow rule, and vertices whose curvature exceeds the elbow curvature are marked as large-curvature vertices. Here the elbow rule measures the rate of change of curvature, so its cost function is set to the absolute curvature value, and the elbow position is the position where the absolute curvature drops by the greatest amount. Since locally sharp deformations on a mesh tend to occur where the curvature is large, and such places reflect the detail information of the mesh, a large-curvature vertex is very likely part of a high-frequency region. Accordingly, this embodiment increases the weight of large-curvature vertices during clustering to raise the probability that they are assigned to a high-frequency region.
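For illustration, the per-vertex curvature computation of formula (1) and the elbow screening described above can be sketched as follows. This is a minimal NumPy sketch and not the patent's reference implementation: the `verts`/`faces` array layout is an assumption, and the Voronoi-cell area $A(v_i)$ is approximated by one third of the incident triangle area.

```python
import numpy as np

def gaussian_curvature(verts, faces):
    """Discrete Gaussian curvature via angle deficit, as in formula (1):
    K(v) = (2*pi - sum of incident triangle angles at v) / A(v). A(v) is
    approximated here by one third of the total incident triangle area."""
    n = len(verts)
    angle_sum = np.zeros(n)
    area = np.zeros(n)
    for f in faces:                                  # f = (i, j, k) vertex indices
        for a, b, c in ((0, 1, 2), (1, 2, 0), (2, 0, 1)):
            vi, vj, vk = verts[f[a]], verts[f[b]], verts[f[c]]
            e1, e2 = vj - vi, vk - vi
            cos_a = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))
            angle_sum[f[a]] += np.arccos(np.clip(cos_a, -1.0, 1.0))
        tri_area = 0.5 * np.linalg.norm(np.cross(verts[f[1]] - verts[f[0]],
                                                 verts[f[2]] - verts[f[0]]))
        area[list(f)] += tri_area / 3.0
    return (2.0 * np.pi - angle_sum) / np.maximum(area, 1e-12)

def large_curvature_vertices(curvature):
    """Elbow rule on |K| sorted in descending order: the elbow is where the
    drop between consecutive values is largest; vertices at or above the
    elbow curvature are marked as large-curvature vertices."""
    mags = np.sort(np.abs(curvature))[::-1]
    elbow = int(np.argmax(mags[:-1] - mags[1:]))
    return np.where(np.abs(curvature) >= mags[elbow])[0]
```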
Step 202: cluster the vertices using the improved K-Means clustering algorithm and the elbow rule, and automatically determine the number of high-frequency regions and the range of each region.
After the large-curvature vertices are determined and given high weights, the facial mesh vertices are clustered, and the number of high-frequency regions and the range of each region are determined automatically.
In this embodiment, the elbow rule is adopted to automatically select the optimal number of clusters. The elbow rule here must account for the degree of distortion within a class, so its cost function should reflect the correlation between mesh vertices and the class center. Conventional correlation measures often use the Euclidean distance between a mesh vertex and the class center, but for a facial mesh the correlation between vertices depends not only on their spatial distance but also on the consistency of their motion during expression changes. Clearly, the more consistently two mesh vertices move during an expression change, the more closely they are related and the more likely they belong to the same region. On this basis, the invention proposes a new spatio-temporal correlation criterion to measure the closeness between a mesh vertex and a class center, improves the K-Means algorithm based on this criterion, and automatically determines the number of high-frequency regions and the range of each region using the elbow rule together with the improved K-Means algorithm.
When measuring the closeness of the relationship between vertices, the new spatio-temporal correlation criterion proposed in this embodiment considers both the spatial adjacency of the vertices and the consistency of their motion during expression changes. Unlike the Euclidean distance used in conventional methods, however, this embodiment uses the geodesic distance to compute the spatial proximity between vertices. The geodesic distance is the shortest distance between vertices measured along the facial mesh surface; compared with the Euclidean distance, it clearly reflects the spatial proximity of vertices on the mesh surface more accurately.
The new spatio-temporal correlation criterion is expressed concretely as follows:
Suppose the known training samples of the input face mesh comprise $T$ frames in total; these samples constitute an expression animation sequence of the character model. Without loss of generality, the input face mesh is taken as the first frame of the sequence. For any two vertices $v_i$ and $v_j$ on that mesh, the geodesic distance $d(v_i, v_j)$ between them is defined as:

$$d(v_i, v_j) = \min L(o(v_i, v_j)) \tag{2}$$

where $o(v_i, v_j)$ is a path among all paths between $v_i$ and $v_j$, and $L(o(v_i, v_j))$ is the length of the path, i.e., the sum of the Euclidean distances between adjacent points along it.
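Since formula (2) minimizes the path length over all paths between two vertices, it can be approximated in practice by a shortest path along the mesh edges. The following sketch is one such approximation (an implementation assumption, since the patent does not prescribe one), using Dijkstra's algorithm with each edge weighted by the Euclidean distance between its endpoints:

```python
import heapq

def geodesic_distance(adj, src, dst):
    """Approximation of d(v_i, v_j) in formula (2) as the shortest edge path
    on the mesh. adj[u] is a list of (neighbor, edge_length) pairs, where
    edge_length is the Euclidean distance between the two endpoints."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float('inf')):
            continue                       # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float('inf')
```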
The correlation value of vertices $v_i$ and $v_j$ at the $t$-th frame of the expression animation sequence is defined as the product of three factors:

$$z_t(v_i, v_j) = z_t^{1}(v_i, v_j) \cdot z_t^{2}(v_i, v_j) \cdot z_t^{3}(v_i, v_j) \tag{3}$$

where $v_{i,t}$ and $v_{j,t}$ denote the positions of vertices $v_i$ and $v_j$ on the mesh of the $t$-th frame of the expression animation sequence, and each of the three factors takes values in $[0, 1]$.

In formula (3), $z_t^{1}$ computes the spatial correlation between vertices $v_i$ and $v_j$ using the geodesic distance; the larger its value, the higher their spatial proximity. $z_t^{2}$ computes the similarity in motion direction using the angle between the displacement vectors of $v_i$ and $v_j$ between two consecutive frames of the sequence. $z_t^{3}$ computes the closeness of their motion rates using the ratio of the distances moved by $v_i$ and $v_j$ between the two frames. $z_t^{2}$ and $z_t^{3}$ embody the motion consistency of vertices $v_i$ and $v_j$; the larger their values, the more consistently the two vertices move as the character model's expression changes.

After the correlation value of vertices $v_i$ and $v_j$ is computed with formula (3) for every frame of the expression animation sequence, all the values are averaged to obtain the spatio-temporal correlation coefficient of $v_i$ and $v_j$:

$$z(v_i, v_j) = \frac{1}{T-1} \sum_{t=1}^{T-1} z_t(v_i, v_j) \tag{4}$$
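The per-frame factors of formula (3) and the averaging of formula (4) can be sketched as below. The exact functional forms of the three factors are not given above; the exponential decay of the geodesic distance, the cosine-based direction term, and the min/max rate ratio used here are plausible stand-ins (assumptions) chosen only to satisfy the stated $[0,1]$ ranges:

```python
import numpy as np

def spatiotemporal_correlation(seq, i, j, d_geo):
    """Spatio-temporal correlation coefficient of vertices i and j, formula
    (4): per-frame values (formula (3)) are products of a spatial-proximity
    term, a direction term, and a rate term, averaged over the animation
    sequence. seq has shape (T, n_vertices, 3); d_geo is the geodesic
    distance between i and j on the first frame."""
    z1 = np.exp(-d_geo)                          # assumed decay; larger => closer
    vals = []
    for t in range(len(seq) - 1):
        di = seq[t + 1, i] - seq[t, i]           # displacement of i between frames
        dj = seq[t + 1, j] - seq[t, j]
        ni, nj = np.linalg.norm(di), np.linalg.norm(dj)
        if ni < 1e-12 or nj < 1e-12:
            # full correlation only if both vertices are stationary
            vals.append(z1 * float(ni < 1e-12 and nj < 1e-12))
            continue
        z2 = 0.5 * (1.0 + di @ dj / (ni * nj))   # direction similarity in [0, 1]
        z3 = min(ni, nj) / max(ni, nj)           # rate closeness in [0, 1]
        vals.append(z1 * z2 * z3)
    return float(np.mean(vals))
```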
After the spatio-temporal correlation coefficient between any two vertices on the mesh is obtained according to the new criterion, a clustering operation is performed on the facial mesh vertices to obtain the high-frequency regions.
The invention improves the K-Means clustering algorithm based on the spatio-temporal correlation criterion: during clustering, the spatio-temporal correlation between a vertex and a cluster center is used as the index measuring their similarity, i.e., the larger the spatio-temporal correlation coefficient, the higher the similarity between the vertex and the cluster center and the higher the probability that the vertex is assigned to that class. This index jointly considers the spatial adjacency and the motion consistency of mesh vertices over the whole animation sequence, which makes the clustering result more accurate.
The elbow rule is combined with the improved K-Means algorithm to automatically determine the number of clusters. The cost function of the elbow rule is defined as:

$$\mathrm{cost}(k) = \sum_{i=1}^{k} \sum_{v \in C_i} z(v, u_i) \tag{5}$$

where $z(\cdot, \cdot)$ is the spatio-temporal correlation coefficient between vertices, $k$ is the number of clusters, $C_i$ is the $i$-th class, $u_i$ is the center of the $i$-th class, and $v$ is a vertex of $C_i$.
For each candidate number of clusters, the range of each class is obtained with the improved K-Means clustering algorithm and the cost function value of formula (5) is computed. The position where the decrease of the cost function value drops sharply is the elbow position; the cluster number at that position is the number of high-frequency regions, and the clustering result produced by the improved K-Means algorithm gives the range of each high-frequency region.
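A sketch of the improved K-Means and the elbow selection of the cluster number follows. It treats a precomputed spatio-temporal correlation matrix `Z` as the similarity measure and uses a medoid-style center update (correlations are only defined between actual vertices); the increased weighting of large-curvature vertices and the limit on the maximum class extent are omitted for brevity, so this illustrates the clustering criterion rather than the full method:

```python
import numpy as np

def kmeans_spatiotemporal(Z, k, iters=50, rng=None):
    """K-Means on mesh vertices where similarity is the spatio-temporal
    correlation matrix Z (n x n): each vertex joins the class whose center
    it is most correlated with; the new center of a class is the member
    vertex with the highest total correlation to the class."""
    rng = rng or np.random.default_rng(0)
    n = Z.shape[0]
    centers = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmax(Z[:, centers], axis=1)       # most-correlated center
        new_centers = centers.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                sub = Z[np.ix_(members, members)]
                new_centers[c] = members[np.argmax(sub.sum(axis=1))]
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def elbow_k(Z, k_max=10):
    """Formula (5): cost(k) sums the correlation between class members and
    their centers; the elbow is where the gain in cost drops most sharply."""
    costs = []
    for k in range(1, k_max + 1):
        labels, centers = kmeans_spatiotemporal(Z, k)
        costs.append(sum(Z[v, centers[labels[v]]] for v in range(Z.shape[0])))
    gains = np.diff(costs)
    return int(np.argmax(gains[:-1] - gains[1:])) + 2   # cluster count at the elbow
```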
During clustering, this embodiment limits the maximum extent of a class to prevent all, or redundant, vertices from being assigned to high-frequency regions, thereby guaranteeing the accuracy of the high-frequency regions. This maximum extent can also be specified by the user. In the experiments, the maximum boundary extent of each class was limited to no more than one half of the width of the neutral face mesh.
FIG. 2 illustrates the high-frequency region division results obtained with the method of the invention on different character models, where the regions $R_1$, $R_2$, and $R_3$ denote different high-frequency regions.
Step 3: coarse editing
When editing expressions, the running speed of the whole network must be guaranteed while the accuracy of the facial expression is taken into account. The coarse editing module designed in this embodiment's model estimates the coarse-grained mesh deformation, so that the deformed facial mesh roughly approximates the basic expression desired by the user. The module has a simple network structure with few parameters and can rapidly compute the vertex positions of the deformed mesh; the fine-grained deformation is then carried out on this basis, giving the whole model good running efficiency. The module introduces a spatially varying convolution layer and a sampling residual layer, which overcomes the inability of conventional convolutional neural networks to directly process irregular three-dimensional meshes and captures more feature information on the mesh; since its network structure is relatively simple, the module quickly approximates the basic expression desired by the user.
As shown in FIG. 3, the coarse editing module consists of 2 downsampling residual blocks and 2 upsampling residual blocks. The input to the module is the facial mesh $I$ and the control-point constraint $\Delta P$. The control points on the facial mesh are first moved to the new positions specified by $\Delta P$; then all vertices $V$ of the facial mesh pass through the 2 downsampling residual blocks with stride 4, which compress the vertex features to a lower-dimensional representation, and subsequently through the 2 upsampling residual blocks with the same stride, which restore the original dimension of the mesh vertices, yielding the new vertex positions and hence the deformed facial mesh $I_c$. The coarse-grained deformation process can be expressed as a function $I_c = f_c(I, \Delta P; \theta)$, where $f_c$ denotes the coarse editing module's neural network composed of spatially varying convolution and sampling residual layers, parameterized by $\theta$. The facial mesh $I$ and the control-point constraint $\Delta P$ are input into the coarse editing module; by training the module parameters $\theta$, the positions $V$ of the mesh vertices are changed to approximate the basic expression desired by the user, and the deformed facial mesh $I_c$ is obtained according to the topology of the input facial mesh $I$.
The network learning of the invention is an end-to-end mapping from the control-point constraint to the final vertex positions of the facial mesh, so the process of changing the mesh vertices inside the network does not need supervision.
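Conceptually, the coarse branch can be sketched in PyTorch as below. This is a minimal sketch, not the patent's implementation: the stand-in `VertexBlock` replaces the spatially varying convolution block of FIG. 4 with a per-vertex linear map, the true down/upsampling of vertex counts is elided, and the feature widths, the 6-channel input layout, and the residual prediction of positions are assumptions.

```python
import torch
import torch.nn as nn

class VertexBlock(nn.Module):
    """Stand-in for a sampling residual block: per-vertex linear map with
    ELU plus a residual branch. The real block (Fig. 4) replaces the linear
    map with a spatially varying convolution and adds Monte Carlo pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.lin = nn.Linear(c_in, c_out)
        self.res = nn.Identity() if c_in == c_out else nn.Linear(c_in, c_out, bias=False)
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.lin(x)) + self.res(x)

class CoarseEditNet(nn.Module):
    """Coarse branch f_c(I, dP; theta): 2 "down" and 2 "up" blocks mapping
    per-vertex inputs (vertex position concatenated with its control-point
    displacement channels, zero for non-control vertices) to new positions."""
    def __init__(self, hidden=32):
        super().__init__()
        self.blocks = nn.Sequential(
            VertexBlock(6, hidden), VertexBlock(hidden, 2 * hidden),   # down
            VertexBlock(2 * hidden, hidden), VertexBlock(hidden, 3),   # up
        )

    def forward(self, verts, ctrl_disp):
        x = torch.cat([verts, ctrl_disp], dim=-1)   # (batch, n_verts, 6)
        return verts + self.blocks(x)               # predicted new positions V'
```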
In order to overcome the problem that conventional convolutional neural networks cannot directly process the three-dimensional facial mesh because of its irregularity, the invention introduces spatially varying convolution and a sampling residual layer into all sampling residual blocks of the coarse editing module to process the irregular three-dimensional facial mesh. In addition, this convolution operation extracts more spatial feature information from the facial mesh, giving the whole coarse editing module a stronger approximation capability.
As shown in FIG. 4, each sampling residual block combines a spatially varying convolution layer, an ELU activation layer, and a residual layer: the input mesh passes through the spatially varying convolution layer with activation and through the sampling residual layer, and the two results are added to produce the output. When a conventional convolutional neural network convolves a regular grid, the convolution kernel is structurally regular and fixed, and the sampling weights are identical over the grid and globally shared; these globally shared sampling weights are called the weight basis.
The facial three-dimensional mesh, however, is irregular and cannot be convolved with a fixed kernel. Therefore, unlike conventional convolution, when the spatially varying convolution operates on the facial mesh, the convolution kernel is obtained by linearly sampling the weight basis with a set of sampling parameters, and this set of parameters differs across facial mesh regions; the irregular facial mesh can thus be handled well and more spatial feature information extracted.
When training the network, the weight basis and the sampling parameters of each local region are trained simultaneously. Let $N(v_i)$ be the local region of the facial mesh over which the spatial convolution kernel extracts features, with $v_i$ the center of that region. The feature extraction function of the spatially varying convolution over this local region of the facial mesh is

$$y_i = b + \sum_{v_j \in N(v_i)} W_j x_j$$

where $y_i$ is the feature output of the local region, $x_j$ the input feature of vertex $v_j$, $b$ a learnable bias vector, and $W_j$ the weight coefficient of each vertex in the local region of the facial mesh. Each weight coefficient $W_i$ is obtained as a linear combination of the weight basis $B$ and the sampling coefficients $E_i$ of vertex $v_i$ in the different local regions, i.e., $W_i = E_i B$, where $\eta$ denotes the number of local regions in the facial mesh. Both $B$ and $E_i$ are parameters that need to be trained.
Conventional pooling cannot process the facial mesh because, owing to the mesh's irregularity, the sampling density differs at every point; the pooling operation is therefore carried out by Monte Carlo integration, whose operation function is

$$y = \frac{1}{|N(v_i)|} \sum_{v_j \in N(v_i)} \rho_j x_j$$

where $\rho_j$ is the density coefficient of each vertex. The density coefficients $\rho_i$ are training parameters shared over the whole data set.
Similarly, the residual layer is defined as $y_i = x_i G$, where $G$ is an identity matrix when the input and output feature dimensions of the residual layer are the same; otherwise $G$ is an $L \times O$ matrix shared over all facial mesh vertices that must be obtained by training. Here $L$ is the dimension of the input facial mesh features and $O$ is the dimension of the output feature matrix. The upsampling and downsampling residual blocks built from these operations handle the irregular facial mesh well, giving the whole coarse editing module a good approximation effect.
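The weight-basis sampling and the residual branch described above can be sketched as follows. The sketch applies region-specific kernels per vertex to show the basis-sampling idea; the gathering of features over each local neighborhood and the Monte Carlo pooling are elided, and the basis size and initialization are assumptions:

```python
import torch
import torch.nn as nn

class SpatiallyVaryingConv(nn.Module):
    """Per-region kernels sampled from a shared weight basis: W_r = sum_m
    E[r, m] * B[m], so each of the `regions` local areas of the irregular
    mesh gets its own kernel while the parameter count stays compact. Both
    the basis B and the sampling coefficients E are trained, as described
    above. `region_id` assigns each vertex to a local region."""
    def __init__(self, c_in, c_out, regions, basis=8):
        super().__init__()
        self.B = nn.Parameter(torch.randn(basis, c_in, c_out) * 0.01)  # weight basis
        self.E = nn.Parameter(torch.randn(regions, basis) * 0.01)      # sampling coeffs
        self.bias = nn.Parameter(torch.zeros(c_out))

    def forward(self, x, region_id):                      # x: (batch, n, c_in)
        W = torch.einsum('rm,mio->rio', self.E, self.B)   # per-region kernels
        Wv = W[region_id]                                 # (n, c_in, c_out) per vertex
        return torch.einsum('bni,nio->bno', x, Wv) + self.bias

class SamplingResidualBlock(nn.Module):
    """Spatially varying convolution -> ELU, added to a residual branch
    y = x @ G (G = identity when dimensions match, else a trained L x O
    map shared over all vertices), mirroring the block structure of Fig. 4."""
    def __init__(self, c_in, c_out, regions):
        super().__init__()
        self.conv = SpatiallyVaryingConv(c_in, c_out, regions)
        self.act = nn.ELU()
        self.G = None if c_in == c_out else nn.Parameter(torch.randn(c_in, c_out) * 0.01)

    def forward(self, x, region_id):
        res = x if self.G is None else x @ self.G
        return self.act(self.conv(x, region_id)) + res
```

A block factory such as `lambda i, o: SamplingResidualBlock(i, o, regions=16)` (the region count is likewise an assumption) could then instantiate the blocks of the coarse-branch sketch above.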
Step 4: fine editing
The coarse editing module can make the facial mesh quickly approximate the rough expression desired by the user, but it has difficulty accurately simulating high-frequency deformations such as skin wrinkles. The fine editing module focuses only on approximating the deformation of the high-frequency regions and achieves high-precision reconstruction of surface detail changes. Combining the two quickly and accurately approximates the realistic, natural, and richly detailed expression the user expects.
As shown in FIG. 5, the fine editing module consists of 3 downsampling residual blocks and 3 upsampling residual blocks. These residual blocks have the same design as the residual blocks in the coarse editing module.
After the facial mesh $I$ is divided into high-frequency regions, the resulting set of high-frequency regions is $I_s = \{I_1, I_2, \ldots, I_S\}$, where $S$ is the number of high-frequency regions.
The set of high-frequency regions $I_s$ and the control-point constraint $\Delta P$ are input into the fine editing module. First, the control points contained in $I_s$ are moved to the new positions specified by $\Delta P$; then $I_s$ passes through the 3 downsampling residual blocks, which compress the features of its $r$ vertices from $\mathbb{R}^{r \times 3}$ to a lower dimension, where $r$ is the number of vertices in $I_s$; then, through the 3 upsampling residual blocks with the same stride, the vertices of $I_s$ are restored to their original dimension, yielding the new vertex positions and hence the deformed high-frequency regions. The fine-grained deformation process can be expressed as a function $I_r = f_r(I_s, \Delta P; \theta_r)$, where $I_r$ is the set of deformed high-frequency regions and $\theta_r$ are the network parameters of the fine editing module. The set $I_s$ and the constraint $\Delta P$ are input into the fine editing module, and the vertex positions in $I_s$ are changed by training the network parameters $\theta_r$, so that each high-frequency region is finely deformed to approximate the expression details desired by the user.
After the coarse editing module applies the coarse-grained deformation to the three-dimensional facial mesh $I$, the mesh $I_c$ containing the basic expression is obtained; after the fine editing module applies the fine-grained deformation to the set of high-frequency regions $I_s$ of mesh $I$, the high-frequency regions $I_r$ containing expression detail information are obtained. Fusing the expression information of the two yields the three-dimensional facial mesh $I'$ that meets the user's requirements and contains expression detail information. Specifically, the positions of the vertices in the high-frequency regions of mesh $I_c$ are replaced by the positions of the corresponding vertices in $I_r$, giving the final new positions $V'$ of all vertices and hence the final deformed three-dimensional facial mesh $I'$.
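The fusion step itself is a simple vertex replacement, sketched below (tensor layouts and names are assumptions):

```python
import torch

def fuse_expressions(coarse_verts, fine_regions, region_vertex_ids):
    """Final fusion step: vertex positions of the coarse result I_c are
    overwritten, inside each high-frequency region, by the positions from
    the fine-editing branch I_r, yielding the final vertices V'.
    region_vertex_ids[s] holds the mesh-level vertex indices of region s."""
    fused = coarse_verts.clone()                 # (n_verts, 3)
    for region_pos, idx in zip(fine_regions, region_vertex_ids):
        fused[idx] = region_pos                  # replace high-frequency vertices
    return fused
```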
Loss function:
In order to improve the approximation accuracy of the model, the invention proposes a new loss function for solving the optimal network parameters; the networks of both the coarse editing module and the fine editing module use it. The new loss function constrains both the difference in vertex positions between the generated mesh and the reference mesh and the difference between the vertex normal vectors of the two. Obviously, for two meshes, if the positions of their corresponding vertices are the same, the normal vectors of the corresponding vertices are also the same, and the shapes of the two meshes, i.e., the facial expressions, must be identical. The new loss function is thus defined as follows:

$$L = L_{pose} + L_{normal}, \qquad L_{pose} = \sum_{i=1}^{n} \left\| v_i - v'_i \right\|_1, \qquad L_{normal} = \sum_{i=1}^{n} \left\| n_i - n'_i \right\|_1$$

where $L_{pose}$ is the loss term constraining the vertex positions of the reference mesh and the generated mesh, $L_{normal}$ is the loss term constraining the vertex normal vectors of the two, $v_i$ is a vertex position of the reference mesh, $v'_i$ the corresponding vertex position of the generated mesh, $n_i$ a vertex normal vector of the reference mesh, and $n'_i$ the corresponding vertex normal vector of the generated mesh.
This embodiment uses the L1 loss instead of the L2 loss because the L1 loss produces sharper features and is more robust, and introducing the vertex-normal-vector loss further improves the accuracy of the model. Moreover, because the network model processes the high-frequency facial regions separately, boundary and seam problems inevitably arise when the high-frequency deformation information from the fine editing module is fused onto the globally deformed mesh from the coarse editing module; by constraining the generated mesh so that each vertex normal agrees as closely as possible with the normal of the corresponding vertex on the reference mesh, the vertex-normal loss solves exactly this problem.
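A sketch of this loss in PyTorch; the equal weighting of the two terms is an assumption, since no relative weight is stated above:

```python
import torch

def edit_loss(v_pred, v_ref, n_pred, n_ref):
    """L1 position term plus L1 vertex-normal term. The normal term keeps
    generated normals aligned with the reference, which suppresses seams
    when the deformed high-frequency regions are fused back onto the
    coarsely deformed mesh. Inputs: (..., n_verts, 3) tensors."""
    l_pose = (v_pred - v_ref).abs().sum(dim=-1).mean()
    l_normal = (n_pred - n_ref).abs().sum(dim=-1).mean()
    return l_pose + l_normal    # assumed 1:1 weighting
```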
Experimental results
1. Experimental environment and parameters
The invention trains the model parameters $\theta$ of the coarse editing module and $\theta_r$ of the fine editing module to minimize the loss function $L$. The model is trained with the Adam optimizer with the batch size set to 64. The learning rate starts at 0.0001 and is reduced by 1% per epoch. The forward and backward passes of the whole network are implemented in PyTorch and run in parallel on the GPU. The experimental environment is Ubuntu 20.04, a GeForce RTX 2080, CUDA 10.1, and PyTorch 1.4.0.
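The training setup described above maps directly onto standard PyTorch components, as in the following sketch; the batch layout yielded by `loader` and the `vertex_normals` helper (recomputing normals from predicted positions) are assumptions, and `edit_loss` is the loss sketched earlier. `ExponentialLR` with `gamma=0.99` gives exactly the stated 1% decay per epoch.

```python
import torch

def train(model, loader, epochs, device='cuda'):
    """Adam, batch size 64 (set in the loader), initial learning rate 1e-4
    decayed by 1% per epoch. loader is assumed to yield batches of
    (mesh_verts, ctrl_disp, ref_verts, ref_normals)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
    model.to(device)
    for _ in range(epochs):
        for verts, ctrl, v_ref, n_ref in loader:
            verts, ctrl = verts.to(device), ctrl.to(device)
            v_pred = model(verts, ctrl)
            n_pred = vertex_normals(v_pred)      # assumed helper: normals from positions
            loss = edit_loss(v_pred, v_ref.to(device), n_pred, n_ref.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                             # 1% learning-rate decay per epoch
```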
2. Data set
The invention uses three character models, Ray, Monardo, and Yoda, for experimental verification. The expression data set of each character model contains 10,000 expression samples. Because the mesh model of each character differs in complexity, a different number of expression samples is uniformly and randomly selected from the data set of each model as its experimental data set: 3,000 samples for the Ray model, 5,000 for the Monardo model, and 8,000 for the Yoda model. The experimental data set of each character is split into training, validation, and test data at a ratio of 8:1:1.
3. Loss function ablation experiment
The invention proposes a new loss function for model training that constrains not only the vertex positions ($L_{pose}$) but also the vertex normal vectors ($L_{normal}$). To verify the benefit of the vertex-normal constraint for model accuracy, the following ablation experiment is designed: for each character model, the network is trained 1) using only the vertex-position constraint as the loss function, and 2) using both the vertex-position constraint and the vertex-normal constraint as the loss function. The average error between the generated meshes and the reference meshes on the character model's test set is compared in the two cases. For each pair of generated and reference meshes, the error is defined as the average Euclidean distance between their corresponding vertex positions. Table 1 shows the error comparison of the ablation experiment on each character model: introducing the vertex-normal constraint clearly reduces the error and effectively improves the accuracy of mesh approximation.
Table 1 error comparison results of ablation experiments
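The error metric used in Tables 1 and 2, i.e. the mean Euclidean distance between corresponding vertices of a generated mesh and its reference mesh, can be written as:

```python
import numpy as np

def mean_vertex_error(v_gen, v_ref):
    """Mean Euclidean distance between corresponding vertices of a generated
    mesh and its reference mesh; both arrays have shape (n_verts, 3)."""
    return float(np.linalg.norm(v_gen - v_ref, axis=-1).mean())
```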
4. Comparing the experimental results
Once the network model is trained, the user can freely select control points on the character model and change the model's expression by changing the positions of the control points, obtaining the expression the user desires. FIG. 6 shows some of the control points selected by users and the new expressions edited for the different character models, where the round dots are the control points selected by the user and the square dots are the new positions to which the user moved them. The method of the invention achieves good results in expression editing: the generated mesh has the same expression as the reference mesh and rich detail features.
To further verify the effectiveness of the method, the network model of the invention is compared with an LBS model and with the network model proposed by Bailey et al. The LBS model guides mesh deformation by assigning bones and vertex weights to the facial mesh; in the experiments, 16 bones are generated in total and each vertex is given 8 non-zero weights. Bailey's method uses a network model built from conventional convolutional neural networks; it divides the high-frequency regions according to approximation error and processes a UV map of the three-dimensional mesh.
FIGS. 7 to 9 show some expression editing results generated for each character model with the method of the invention, the LBS method, and Bailey's method. The meshes generated by the LBS method lose many expression details; the meshes generated by Bailey's method show larger errors, and parts of some generated meshes are even distorted; the expressions generated by the method of the invention are very close to the reference meshes and show rich, accurate details.
Table 2 lists the average error between the generated meshes and the reference meshes on the test sets of the three models Ray, Monardo, and Yoda; the error is defined as in Table 1. As Table 2 shows, the error of the method of the invention is greatly reduced for all character models compared with the LBS and Bailey methods, indicating that the deformed meshes generated by the method are closer to the reference meshes and the approximation accuracy is higher.
Table 2. Average vertex position error

Method                     Ray     Monardo   Yoda
LBS                        0.64    0.95      7.34
Bailey's method            0.85    3.94      5.11
Method of the invention    0.37    0.38      2.77
FIG. 10 illustrates the process of editing a neutral expression into arbitrary expressions with the method of the invention. These expressions have no reference mesh; they are entirely new expressions generated from the control-point constraints. As the figure shows, by gradually changing the positions of the control points (the round dots are the original positions of the control points and the square dots the new positions to which they are moved), the method gradually deforms the facial mesh and generates continuous facial expressions. The method thus provides the user with an intuitive, simple, and fast expression editing tool that generates facial expressions meeting the user's expectations, realistic and natural and containing rich detail features.
Example two
This embodiment provides a three-dimensional expression animation editing system based on a dual-branch network, comprising:
a high-frequency region determination module configured to: calculate the Gaussian curvature of each vertex of the three-dimensional facial mesh, cluster the vertices after curvature screening, and determine the high-frequency facial regions;
a coarse editing module configured to: perform coarse editing through a coarse-editing network that combines the three-dimensional facial mesh and the control-point constraints, wherein spatially varying convolution and a sampling residual layer are introduced into all sampling residual blocks of the coarse-editing network to process the irregular three-dimensional facial mesh and approximate the basic expression desired by the user;
a fine editing module configured to: perform fine editing through a fine-editing network based on the high-frequency facial regions and the control-point constraints, wherein the sampling residual blocks of the fine-editing network adopt the same design as those of the coarse-editing network, and the vertex positions within the set of high-frequency facial regions are changed by training the parameters of the fine-editing network, so that each high-frequency region is finely deformed to approximate the expression details desired by the user;
and an expression generation module configured to: fuse the basic expression with the expression details to obtain the final new expression.
Example III
This embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in the dual-branch-network-based three-dimensional expression animation editing method described in Embodiment One.
Example IV
This embodiment provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps in the dual-branch-network-based three-dimensional expression animation editing method described in Embodiment One.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that all or part of the above-described methods of the embodiments may be implemented by a computer program stored on a computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A three-dimensional expression animation editing method based on a dual-branch network, characterized by comprising the following steps:
calculating the Gaussian curvature of each vertex of the three-dimensional face grid, clustering the vertices after curvature screening, and determining a face high-frequency area;
performing rough editing by combining a three-dimensional face grid and control point constraints through a rough editing network, wherein in the rough editing network, space change convolution and a sampling residual layer are introduced into all sampling residual blocks to process an irregular three-dimensional face grid so as to approximate to a basic expression expected by a user;
fine editing is carried out based on the constraint of the face high-frequency region and the control point through a fine editing network, a sampling residual block in the fine editing network adopts the same design as a sampling residual block in the rough editing network, and the vertex position in the face high-frequency region set is changed through training parameters of the fine editing network, so that each high-frequency region is finely deformed to approximate expression details expected by a user;
And fusing the basic expression and the expression details to obtain a final new expression.
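As a reading aid rather than part of the claims, the following minimal Python sketch outlines the overall flow of claim 1. Every name in it (find_regions, coarse_net, fine_net, fuse) is an illustrative placeholder supplied by the caller, not an API defined by the patent:

```python
def edit_expression(mesh_vertices, control_constraints,
                    find_regions, coarse_net, fine_net, fuse):
    """Overall flow of claim 1: locate high-frequency regions, edit coarsely,
    edit finely, then fuse the basic expression with the details."""
    regions = find_regions(mesh_vertices)                   # curvature screening + clustering
    base = coarse_net(mesh_vertices, control_constraints)   # coarse branch: basic expression
    details = fine_net(regions, control_constraints)        # fine branch: expression details
    return fuse(base, details)                              # final new expression
```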
2. The three-dimensional expression animation editing method based on the double-branch network according to claim 1, wherein the curvature screening process comprises:
sorting the Gaussian curvatures by absolute value, automatically selecting the curvature at the elbow position using the elbow rule, and marking vertices whose curvature exceeds the elbow-position curvature as large-curvature vertices;
wherein automatically selecting the curvature at the elbow position using the elbow rule comprises: setting the cost function of the elbow rule to the absolute value of the curvature, the elbow position being the position where the absolute value of the curvature decreases by the largest extent.
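As a reading aid, a minimal numpy sketch of the curvature screening in claim 2. The claim fixes only the cost function and the largest-drop criterion, so the descending sort and the strict threshold comparison below are assumptions:

```python
import numpy as np

def screen_large_curvature_vertices(gaussian_curvature):
    """Mark large-curvature vertices via the elbow rule of claim 2.

    gaussian_curvature: (N,) per-vertex Gaussian curvatures.
    Returns indices of vertices whose |curvature| exceeds the elbow value.
    """
    abs_k = np.abs(gaussian_curvature)
    sorted_k = np.sort(abs_k)[::-1]                # cost function |curvature|, descending
    drops = sorted_k[:-1] - sorted_k[1:]           # decrease between neighbouring positions
    elbow_value = sorted_k[int(np.argmax(drops))]  # elbow = position of largest decrease
    return np.where(abs_k > elbow_value)[0]
```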
3. The three-dimensional expression animation editing method based on the double-branch network according to claim 1, wherein clustering the vertices to determine the facial high-frequency regions specifically comprises:
forming an expression animation sequence of the corresponding character model from known training samples of face meshes;
treating the input face mesh as the first frame of the sequence;
calculating a correlation value for any two vertices in each frame of the expression animation sequence, the correlation value being the product of three parts: the first part is the geodesic distance between the two vertices on the first frame, used to measure their spatial correlation; the second part is the included angle between the displacement vectors of the two vertices between two frames; the third part is the ratio of the distances moved by the two vertices between two frames; the second and third parts measure the motion consistency of the two vertices;
averaging all the correlation values to obtain the spatio-temporal correlation coefficient of any two vertices;
and clustering the face mesh vertices according to the spatio-temporal correlation coefficients to obtain the facial high-frequency regions.
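A hedged numpy sketch of the per-pair coefficient in claim 3. The claim names the three factors but not their functional forms, so the inverse geodesic distance (nearer vertices score higher), the cosine of the displacement angle, and the min/max distance ratio are assumptions, with the geodesic distance assumed precomputed on the first frame:

```python
import numpy as np

def spatiotemporal_coefficient(seq, i, j, geodesic_ij, eps=1e-8):
    """Spatio-temporal correlation coefficient of vertices i and j (claim 3).

    seq: (F, N, 3) vertex positions of the expression animation sequence,
         whose first frame is the input face mesh.
    geodesic_ij: geodesic distance between i and j on the first frame.
    """
    values = []
    for f in range(1, len(seq)):
        di = seq[f, i] - seq[f - 1, i]                 # displacement of vertex i
        dj = seq[f, j] - seq[f - 1, j]                 # displacement of vertex j
        li, lj = np.linalg.norm(di), np.linalg.norm(dj)
        spatial = 1.0 / (geodesic_ij + eps)            # part 1: spatial correlation
        angle = np.dot(di, dj) / (li * lj + eps)       # part 2: displacement-vector angle
        ratio = min(li, lj) / (max(li, lj) + eps)      # part 3: moving-distance ratio
        values.append(spatial * angle * ratio)         # product of the three parts
    return float(np.mean(values))                      # average over all frame pairs
```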
4. The three-dimensional expression animation editing method based on the double-branch network according to claim 1, wherein, when the vertices are clustered, the K-Means clustering algorithm is improved based on a spatio-temporal correlation criterion: during clustering, the spatio-temporal correlation between a vertex and a cluster center serves as the index measuring their similarity, that is, the higher the spatio-temporal correlation coefficient, the more similar the vertex is to the cluster center and the more likely it is to be assigned to that class;
based on the improved K-Means clustering algorithm combined with the elbow rule, the number and extent of the high-frequency regions are determined automatically, which specifically comprises:
for different cluster numbers, obtaining the extent of each class by using the improved K-Means clustering algorithm;
defining and calculating cost function values according to the cost function of the elbow rule, wherein the position where the cost function value drops most sharply is the elbow position, the cluster number corresponding to that position is the number of high-frequency regions, and the clustering result obtained by the improved K-Means algorithm with that cluster number gives the extent of each high-frequency region.
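A sketch of the improved K-Means of claim 4 combined with the elbow rule, assuming a precomputed (N, N) matrix `corr` of pairwise spatio-temporal coefficients. The medoid-style center update and the negative mean vertex-to-center correlation used as the elbow cost are assumptions, since the claim fixes only the similarity criterion:

```python
import numpy as np

def kmeans_spatiotemporal(corr, k, iters=20, seed=0):
    """K-Means variant of claim 4: similarity between a vertex and a cluster
    center is the spatio-temporal correlation coefficient (higher = closer)."""
    rng = np.random.default_rng(seed)
    centres = rng.choice(corr.shape[0], size=k, replace=False)  # centers = vertex indices
    for _ in range(iters):
        labels = np.argmax(corr[:, centres], axis=1)  # assign to most-correlated center
        for c in range(k):                            # medoid-style center update
            members = np.where(labels == c)[0]
            if len(members):
                centres[c] = members[np.argmax(corr[np.ix_(members, members)].sum(axis=1))]
    return labels, centres

def choose_region_count(corr, k_range=range(2, 11)):
    """Elbow rule over cluster counts: pick the k after the sharpest cost drop."""
    costs = []
    for k in k_range:
        labels, centres = kmeans_spatiotemporal(corr, k)
        costs.append(-np.mean(corr[np.arange(corr.shape[0]), centres[labels]]))
    return list(k_range)[int(np.argmin(np.diff(costs))) + 1]
```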
5. The three-dimensional expression animation editing method based on the double-branch network according to claim 1, wherein introducing spatially varying convolution and a sampling residual layer into all sampling residual blocks of the coarse editing network to process the irregular three-dimensional face mesh specifically comprises:
when the spatially varying convolution performs a convolution operation on the face mesh, the globally shared sampling weights are linearly sampled through a group of sampling parameters, and this group of parameters differs across different regions of the face mesh;
according to the definition of the residual layerWhen the input and output feature dimensions of the residual layer are the same, G is an identity matrix, otherwise G is an L X O matrix shared in the whole face grid vertex and needs to be obtained by training; wherein (1)>Local region of face mesh for feature extraction for spatial convolution kernel, v i For the local area->Vertex of (p)' i For the density coefficient of each vertex, L is the dimension of the input face mesh and O is the dimension of the output feature matrix.
6. The three-dimensional expression animation editing method based on the double-branch network according to claim 1, wherein changing the vertex positions in the facial high-frequency region set through the trained parameters of the fine editing network, so that each high-frequency region is finely deformed, specifically comprises:
the fine editing network comprises 3 downsampling residual blocks and 3 upsampling residual blocks; the control points contained in the high-frequency region set are moved to the new positions specified by the control-point constraints; all vertices in the high-frequency region set are then compressed from a first scale to a second scale by the downsampling residual blocks with a stride of 3, and restored to the original dimension by the upsampling residual blocks with a stride of 3, yielding the new vertex positions and thus the deformed high-frequency regions.
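A heavily simplified PyTorch sketch of claim 6's fine-editing branch: three downsampling stages compress the high-frequency-region vertices to a coarser scale and three upsampling stages restore the original dimension. Plain linear layers with ELU stand in for the patent's sampling residual blocks (which claim 6 says match the coarse branch), and the widths and the displacement-style output are assumptions:

```python
import torch
import torch.nn as nn

class FineEditingNet(nn.Module):
    """Three down- and three up-sampling stages producing new vertex positions (claim 6)."""

    def __init__(self, dim=3, width=64):
        super().__init__()
        def stage(i, o):
            return nn.Sequential(nn.Linear(i, o), nn.ELU())
        self.down = nn.Sequential(stage(dim, width), stage(width, 2 * width),
                                  stage(2 * width, 4 * width))
        self.up = nn.Sequential(stage(4 * width, 2 * width), stage(2 * width, width),
                                nn.Linear(width, dim))

    def forward(self, verts):
        # verts: (N, 3) vertices of the high-frequency region set, with the
        # control points already moved to their constrained target positions
        return verts + self.up(self.down(verts))   # new vertex positions
```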
7. The three-dimensional expression animation editing method based on the double-branch network according to claim 1, wherein the loss function used when training the coarse editing network and the fine editing network is:
L = L_pose + L_normal

wherein L_pose is the loss function constraining the vertex positions of the reference mesh and the generated mesh, L_normal is the loss function constraining the vertex normal vectors of the two meshes, v_i is a vertex position of the reference mesh, v'_i is the corresponding vertex position of the generated mesh, n_i is a vertex normal vector of the reference mesh, and n'_i is the corresponding vertex normal vector of the generated mesh.
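A minimal PyTorch reading of the claim-7 loss. The claim gives only L = L_pose + L_normal and the meaning of each term, so mean-squared error on positions and cosine dissimilarity on unit normals are assumptions:

```python
import torch

def editing_loss(v_ref, v_gen, n_ref, n_gen):
    """L = L_pose + L_normal (claim 7).

    v_ref, v_gen: (N, 3) reference / generated vertex positions.
    n_ref, n_gen: (N, 3) unit vertex normals of the two meshes.
    """
    l_pose = torch.mean((v_ref - v_gen) ** 2)                      # vertex-position term
    l_normal = torch.mean(1.0 - torch.sum(n_ref * n_gen, dim=-1))  # normal-agreement term
    return l_pose + l_normal
```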
8. A three-dimensional expression animation editing system based on a double-branch network, characterized by comprising:
a high-frequency region determination module configured to: calculate the Gaussian curvature of each vertex of the three-dimensional face mesh, cluster the vertices after curvature screening, and determine the facial high-frequency regions;
a coarse editing module configured to: perform coarse editing through a coarse editing network that combines the three-dimensional face mesh and control-point constraints, wherein spatially varying convolution and a sampling residual layer are introduced into all sampling residual blocks of the coarse editing network to process the irregular three-dimensional face mesh, so as to approximate the basic expression expected by the user;
a fine editing module configured to: perform fine editing through a fine editing network based on the facial high-frequency regions and the control-point constraints, wherein the sampling residual blocks in the fine editing network adopt the same design as those in the coarse editing network, and the vertex positions in the facial high-frequency region set are changed through the trained parameters of the fine editing network, so that each high-frequency region is finely deformed to approximate the expression details expected by the user;
an expression generation module configured to: fuse the basic expression and the expression details to obtain the final new expression.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the double-branch network based three-dimensional expression animation editing method of any one of claims 1-7.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the double-branch network based three-dimensional expression animation editing method of any one of claims 1-7.
CN202310967179.7A 2023-08-02 2023-08-02 Three-dimensional expression animation editing method and system based on double-branch network Pending CN117152311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310967179.7A CN117152311A (en) 2023-08-02 2023-08-02 Three-dimensional expression animation editing method and system based on double-branch network

Publications (1)

Publication Number Publication Date
CN117152311A true CN117152311A (en) 2023-12-01

Family

ID=88897746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310967179.7A Pending CN117152311A (en) 2023-08-02 2023-08-02 Three-dimensional expression animation editing method and system based on double-branch network

Country Status (1)

Country Link
CN (1) CN117152311A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101739712A (en) * 2010-01-25 2010-06-16 四川大学 Video-based 3D human face expression cartoon driving method
CN106709975A (en) * 2017-01-11 2017-05-24 山东财经大学 Interactive three-dimensional human face expression animation editing method and system and extension method
WO2020062120A1 (en) * 2018-09-29 2020-04-02 浙江大学 Method for generating facial animation from single image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHARAFI, MASOUMEH et al.: "Audio-Visual Emotion Recognition Using K-Means Clustering and Spatio-Temporal CNN", 2023 6th International Conference on Pattern Recognition and Image Analysis (IPRIA), 14 July 2023 (2023-07-14) *

Similar Documents

Publication Publication Date Title
Qian et al. PUGeo-Net: A geometry-centric network for 3D point cloud upsampling
CN107067473A (en) 3D modeling object is reconstructed
CN109993825A (en) A kind of three-dimensional rebuilding method based on deep learning
Zheng et al. A New Approach for Direct Manipulation of Free‐Form Curve
US7643026B2 (en) NURBS surface deformation apparatus and the method using 3D target curve
Zhao et al. IGA-based point cloud fitting using B-spline surfaces for reverse engineering
Chao et al. Realistic data-driven traffic flow animation using texture synthesis
CN111382778A (en) Forming datasets for inferring CAD features of entities
CN113077553A (en) Three-dimensional model segmentation method based on surface attributes
CN111033560A (en) Information processing device, model data creation program, and model data creation method
CN111028335B (en) Point cloud data block surface patch reconstruction method based on deep learning
Tang et al. Skeletonnet: A topology-preserving solution for learning mesh reconstruction of object surfaces from rgb images
Sieger et al. A comprehensive comparison of shape deformation methods in evolutionary design optimization
CN104331933A (en) Slicing direction self-adaptive rapid selection method
US20220252906A1 (en) Computer-implemented method for individualising a spectacle frame element by determining a parametric substitution model of a spectacle frame element, and device and systems using such a method
JP2022036918A (en) Uv mapping on 3d object with the use of artificial intelligence
Shi et al. Graph-guided deformation for point cloud completion
CN110176063B (en) Clothing deformation method based on human body Laplace deformation
CN113205609A (en) Learning based on deformation
Zhou et al. Image deformation with vector-field interpolation based on MRLS-TPS
CN113436224B (en) Intelligent image clipping method and device based on explicit composition rule modeling
CN115661367B (en) Dynamic hybrid deformation modeling method and system based on photo collection
Wannarumon An aesthetics driven approach to jewelry design
CN117152311A (en) Three-dimensional expression animation editing method and system based on double-branch network
Shang et al. Effective re-parameterization and GA based knot structure optimization for high quality T-spline surface fitting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination