CN106023288A - Image-based dynamic avatar construction method - Google Patents


Info

Publication number
CN106023288A
Authority
CN
China
Prior art keywords
image
face
hair
images
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610331428.3A
Other languages
Chinese (zh)
Other versions
CN106023288B (en)
Inventor
周昆
曹晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Lenovo Beijing Ltd
Original Assignee
Zhejiang University ZJU
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Lenovo Beijing Ltd filed Critical Zhejiang University ZJU
Priority to CN201610331428.3A
Publication of CN106023288A
Application granted
Publication of CN106023288B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses an image-based dynamic avatar construction method. The method first performs data acquisition and preprocessing: a series of facial images of a user performing prescribed motions and expressions is captured with an ordinary webcam, and preprocessing operations such as segmentation and feature point calibration are carried out on the images. A face fusion model and a hair deformation model of the user are then generated from the processed images, yielding an image-based avatar representation of the user. During real-time facial animation driving, the avatar representation is driven to generate the corresponding face and hair geometry according to the facial motion and expression parameters obtained by tracking. Finally, the captured images are mapped using the obtained face and hair geometry, and the mapped images are fused according to image confidence to generate a realistic facial animation image. The facial animation results generated with this method are characterized by strong realism, strong expressiveness, rich detail and high fidelity.

Description

Image-based dynamic avatar construction method
Technical Field
The invention relates to the technical fields of performance-based facial animation and facial expression and motion retargeting, and in particular to an image-based dynamic avatar facial animation method.
Background
Compared with face muscle models (such as VENKATARAMAN, K., LODHA, S., AND RAGHAVAN, R. 2005. A kinematic-variational model for animating skin with wrinkles. Computers & Graphics 29, 5 (Oct), 756-770.) and procedural parametric models (such as JIMENEZ, J., ECHEVARRIA, J. I., OAT, C., AND GUTIERREZ, D. 2011. GPU Pro 2. AK Peters Ltd., ch. Practical and Realistic Facial Wrinkles Animation.), data-driven methods are more common in creating dynamic avatars; their advantage is that realistic face motion can be obtained at very low computational cost. For example, the multilinear face model (VLASIC, D., BRAND, M., PFISTER, H., AND POPOVIĆ, J. 2005. Face transfer with multilinear models. ACM Trans. Graph. 24, 3 (July), 426-433.) uses a unified model that represents user identity and expression as coefficients along different dimensions of face shape. A multilinear face model is often used to create a user-specific face fusion model by fitting an input depth map or video sequence. The dynamic geometric changes in the face fusion model can be modeled linearly for use in real-time face tracking and animation systems, such as real-time facial animation methods based on depth cameras. These linear models are widely used in real-time face tracking and animation methods because of their computational efficiency, but they cannot express details on the face, such as wrinkles.
In high-end productions such as film making, special hardware devices (e.g., the LightStage system) are used to produce highly realistic dynamic avatars containing rich face details such as skin folds. These special hardware devices, however, cannot be used in applications intended for the ordinary user.
Still other techniques create a dynamic avatar from a single image. For example, (SARAGIH, J., LUCEY, S., AND COHN, J. 2011. Real-time avatar animation from a single image. AFGR, 213 ff.) takes as input one facial image each of the user and the target avatar. In the real-time stage, the facial motion of the user in the input video is tracked by fitting a deformable face model, and the user's motion is then transferred to the avatar using a mapping function learned in the preprocessing stage, generating the corresponding motion and shape of the avatar. (CAO, C., WENG, Y., ZHOU, S., TONG, Y., AND ZHOU, K. 2014. FaceWarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20, 3 (Mar.), 413-425.) introduces a technique that allows any user to drive a static face image to produce various facial animation effects through performance. The technique first fits a face fusion model to the face in the static image using a multilinear face model and extracts the texture of the face model from the image. During real-time operation, the tracked rigid head motion parameters and non-rigid expression coefficients are transferred to the face fusion model to generate facial animation at a brand-new viewpoint. For hair, the technique uses a single-view hair modeling method to create a three-dimensional hair strand model, which moves and is rendered together with the head at run time to produce the final result. Because only one image is used, the resulting avatar animation is not expressive and, in particular, cannot produce details such as expression wrinkles. In addition, large head rotations and exaggerated facial expressions are problems that this family of methods cannot handle.
Some dynamic avatar methods based on multiple images of the user have been proposed recently; they aim to help ordinary users generate complete, detail-rich facial animation avatars. (ICHIM, A. E., BOUAZIZ, S., AND PAULY, M. 2015. Dynamic 3D avatar creation from hand-held video input. ACM Trans. Graph. 34, 4 (July), 45:1-45:14.) proposes a method that creates a three-dimensional dynamic avatar, represented at two scales, from hand-held device video. From a series of natural-expression images input by the user, a structure-from-motion algorithm first produces a point cloud of the face shape, and a natural-expression model of the user is fitted to this point cloud. A medium-scale user-specific face fusion model is then generated by fitting the natural-expression model to the facial expressions in the captured video. Fine-scale face details, such as wrinkles, are subsequently recovered with a shading-based algorithm and expressed with a normal map and an ambient occlusion map. (GARRIDO, P., ZOLLHÖFER, M., CASAS, D., VALGAERTS, L., VARANASI, K., PEREZ, P., AND THEOBALT, C. 2016. Reconstruction of personalized 3D face rigs from monocular video. ACM Trans. Graph.) proposes a fully automatic method that constructs a three-dimensional dynamic face avatar from monocular face video data (such as a traditional movie). The avatar of that method is represented with a geometric hierarchy of three scales, from the basic facial motion described at a coarse geometric scale to face details such as skin wrinkles described at a fine geometric scale. Both methods use multi-layer geometric representations for the avatar, which is complex to operate. Furthermore, and more importantly, the motion of hair has not been effectively addressed in previous work.
Disclosure of Invention
The object of the invention is to provide, in view of the deficiencies of the prior art, a novel image-based dynamic face avatar construction method whose avatar representation can deliver realistic and expressive facial animation results when driven by a facial expression tracking and animation system. Compared with other avatar representations, the avatar model proposed by the invention is more complete, so the resulting animation is more realistic and convincing. In the invention, the data required for preprocessing can be acquired by ordinary users with common equipment, such as a webcam and a home computer, and at run time the avatar can be driven by the performance of any user, so the method is well suited to various virtual reality applications such as online games, video chat and distance education.
The object of the invention is achieved by the following technical solution:
An image-based dynamic face avatar construction method mainly comprising the following steps:
1. Data acquisition and preprocessing: a series of facial images of the user performing prescribed motions and expressions is captured with an ordinary webcam, and preprocessing such as segmentation and feature point calibration is performed on the images.
2. Image-based avatar construction: a face fusion model and a hair deformation model of the user are generated from the processed images, and an image-based avatar representation of the user is then obtained.
3. Real-time face and hair geometry generation: during real-time facial animation driving, the avatar representation is driven to generate the corresponding face and hair geometry according to the facial motion and expression parameters obtained by tracking.
4. Real-time facial animation synthesis: the captured images are mapped using the obtained face and hair geometry, and the mapped images are fused according to image confidence to generate a realistic facial animation image.
The advantage of the invention lies in the completeness of the avatar representation. Compared with previous dynamic avatar representations, the image-based dynamic avatar proposed by the invention includes not only organs such as the face, eyes and teeth, but also the hair, which is difficult to model, and the head ornaments on it, so the avatar representation is more complete. In addition, the details of facial expression changes, such as skin folds and wrinkles, need not be reconstructed as three-dimensional geometry in the avatar representation; they are implicitly contained in the captured images. In summary, compared with other face avatar animation methods, the facial animation produced with the proposed avatar is more realistic and expressive, has low equipment requirements, is easy to use and is suitable for any user; it can therefore be applied to a variety of applications such as online games and video chat and has broad application prospects.
Drawings
FIG. 1 is an exemplary diagram of the data acquisition and preprocessing of the invention, from left to right: captured image, image segmentation layers, alpha channel of the hair layer, and labeled feature points.
FIG. 2 is an exemplary diagram of the face fusion model and hair deformation model constructed according to the invention, showing an example with three expressions of a user; the first row shows the input images and the second row the reconstructed face and hair mesh models.
FIG. 3 is an exemplary illustration of animation results of the image-based dynamic avatar of the invention, showing examples of 7 different users; each row, from left to right, shows the frontal captured image, the reconstructed face and hair mesh model, and 5 synthesized facial animation results with different poses and expressions.
FIG. 4 is an example comparison between animation results of the image-based dynamic avatar and real images; the first row shows the facial animation generated by the method, and the second row the real images used to drive that animation.
Detailed Description
In the dynamic avatar construction process, a set of images of the user performing prescribed motions and expressions is first captured, and the images are then preprocessed, including layer segmentation and feature point calibration. Based on the processed images, three-dimensional face and hair models are built. The face model can be fitted by matching the model to the feature points in the images. For the hair geometry there is no universal hair model, because hair shapes differ greatly between users and are difficult to represent with a single global model; the invention therefore has to rely on the input images alone to build the three-dimensional hair model. Furthermore, hair, especially long hair, undergoes non-rigid deformation under different head poses due to factors such as gravity and interaction with the body. To address these problems, the invention builds a deformable hair model that closely approximates the dynamic changes of the hair during head rotation. To create this deformable hair model, the invention first estimates the depth of the hair region separately in each captured image, then performs a joint optimization of the hair depths over all images, and finally constructs a global deformable hair model from the optimized depths. In addition, for the remaining parts of the image avatar, the eyes, teeth and body are modeled separately and represented by flat plates.
During real-time animation driving, for each input video frame the invention first uses an existing real-time face tracking method to obtain the facial motion parameters in the image, including rigid head transformation parameters and non-rigid facial expression coefficients; these parameters are transferred to the dynamic image avatar to generate the three-dimensional meshes corresponding to the avatar. The system then maps and fuses the captured input images with the help of these three-dimensional meshes to generate the facial animation image of the avatar at the new viewpoint. During fusion, the invention uses the three-dimensional geometry to compute, for every pixel of the final image, the weight of each mapped image, so that different regions in the final result are smooth and seamlessly connected. The invention has been applied to different users with different hair styles, with convincing results.
1. Data acquisition and preprocessing
1.1 data acquisition
For each user, the invention uses an ordinary webcam to capture 32 images: 15 images with different head poses and 17 images with different expressions. The first set of 15 images records different head poses of the user, including different rotations, while a natural expression is maintained. These head poses are expressed as rotational Euler angles: yaw from -60° to 60°, sampled at 20° intervals (pitch and roll kept at 0°); pitch from -30° to 30°, sampled at 15° intervals with 0° removed (other directions kept at 0°); and roll sampled in the same way as pitch. The user's head rotations need not match the prescribed angles exactly; approximate angles are sufficient (see the sketch after this paragraph).
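As an illustration of the pose sampling described above, the following minimal Python sketch enumerates the 15 capture poses as (yaw, pitch, roll) Euler angles in degrees; the function name and list layout are illustrative assumptions rather than part of the capture protocol.

```python
def capture_pose_angles():
    """Enumerate the 15 head poses (yaw, pitch, roll) in degrees:
    yaw -60..60 every 20 deg (pitch = roll = 0), pitch -30..30 every
    15 deg excluding 0 (other axes 0), roll sampled like pitch."""
    poses = [(float(yaw), 0.0, 0.0) for yaw in range(-60, 61, 20)]   # 7 poses
    poses += [(0.0, float(p), 0.0) for p in (-30, -15, 15, 30)]      # 4 poses
    poses += [(0.0, 0.0, float(r)) for r in (-30, -15, 15, 30)]      # 4 poses
    assert len(poses) == 15
    return poses
```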
Next, two preprocessing steps are applied to the captured images: image segmentation and feature point calibration.
1.2 image segmentation
The first step of preprocessing is to segment the captured images; each image is segmented into different layers: face, hair (including head ornaments on the hair), glasses, teeth, body and background. The method uses the Lazy Snapping tool to segment each image with a small amount of manual interaction; in the invention, drawing only a few strokes on the image is enough to complete the segmentation of each layer. In addition, because hair is semi-transparent and therefore complex at its boundary, the invention further applies an image matting algorithm to the hair layer to refine it and add a transparency (alpha) channel (an illustrative sketch follows).
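The segmentation itself is interactive (Lazy Snapping with a few strokes), which is hard to show in a few lines; the sketch below is only a rough stand-in that uses OpenCV's GrabCut from a bounding rectangle and a blurred mask in place of a true matting algorithm, to illustrate how a hair layer with an alpha channel might be produced. All names and parameter values here are assumptions.

```python
import cv2
import numpy as np

def rough_hair_layer(image_bgr, hair_rect):
    """Stand-in for the stroke-based segmentation: GrabCut from a rectangle
    gives a binary hair mask; a Gaussian blur of that mask mimics the soft
    alpha channel that a real matting algorithm would compute at the boundary."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, hair_rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
    binary = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.float32)
    alpha = cv2.GaussianBlur(binary, (9, 9), 0)                 # soft hair boundary
    rgba = np.dstack([image_bgr.astype(np.float32) / 255.0, alpha])
    return rgba                                                  # H x W x 4 hair layer
```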
1.3 feature point calibration
In the second step of preprocessing, the invention performs a semi-automatic calibration of the feature points S_i of each captured image I_i. These feature points describe the two-dimensional positions of a series of facial features in the image, including the contours of the mouth, eyes and face. The method first uses the real-time face tracking algorithm described in (CAO, C., HOU, Q., AND ZHOU, K. 2014. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. Graph. 33, 4 (July), 43:1-43:10.) to calibrate the feature points automatically, and then uses a dragging tool for manual correction.
2. Image-based avatar construction
2.1 construction of the face fusion model
Based on the calibrated captured images {(I_i, S_i)}, the invention constructs a face fusion model to represent the low-resolution dynamic face geometry. First, from the FaceWarehouse face database, the invention fits an initial face mesh F_i^init to each image. It then corrects each F_i^init by mesh deformation to obtain F_i; finally, the user's expression fusion model {B_j} is computed from {F_i}.
First, for each image (I_i, S_i), the invention uses the bilinear model of FaceWarehouse with its third-order data tensor C to compute a global identity coefficient w_id of the user and an expression coefficient e_i for each image, thereby fitting the initial face mesh F_i^init. The fitting follows (CAO, C., WENG, Y., LIN, S., AND ZHOU, K. 2013. 3D shape regression for real-time facial animation. ACM Trans. Graph. 32, 4 (July), 41:1-41:10.): a user-specific fusion model is first generated from the face data tensor C, and a mesh matching each image is then generated from this fusion model.
The initial mesh F_i^init obtained by fitting FaceWarehouse is only a rough approximation. To match the mesh to the image more accurately, the invention refines it further with a mesh deformation algorithm. The goal of this correction is that, for each image I_i, the corresponding vertices of the mesh F_i match the two-dimensional feature points of the image; the matching energy is:
E_{ld} = \sum_k \left\| \Pi\left( F_i^{init}(v_k) \right) - s_{i,k} \right\|^2
where Π(·) is the camera projection operator that projects a three-dimensional point in the camera coordinate system to a two-dimensional position in the image, F_i^init is the initial face mesh fitted in the previous step, s_{i,k} is the two-dimensional position of the k-th feature point in S_i, and v_k is the index of the mesh vertex corresponding to that feature point.
To keep the mesh smooth during deformation, the method adds a Laplacian regularization term:
E_{lap} = \sum_k \left\| \Delta v_k - \delta_k \frac{\Delta v_k}{\left| \Delta v_k \right|} \right\|^2
where Δ is the discrete Laplacian operator on the mesh based on the cotangent formula, and δ_k is the length of the Laplacian vector at the k-th vertex of the initial mesh F_i^init. After this deformation-based correction, each image obtains an accurately matched mesh F_i (a sketch of the combined fitting objective follows).
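The two energies above can be combined into a single objective and minimized numerically. The following Python sketch evaluates E_ld + ω·E_lap for a flattened vertex array and hands it to a generic optimizer; the pinhole projection, the uniform (rather than cotangent-weighted) Laplacian matrix L, and all parameter names are simplifying assumptions of this sketch rather than the exact solver used by the method.

```python
import numpy as np
from scipy.optimize import minimize

def project(points_cam, K):
    """Pinhole projection of an N x 3 array of camera-space points (intrinsics K)."""
    q = points_cam @ K.T
    return q[:, :2] / q[:, 2:3]

def fit_energy(x, V0, L, K, lm_vidx, lm_2d, w_lap=1.0):
    """E_ld + w_lap * E_lap. V0: initial FaceWarehouse fit (N x 3), L: Laplacian
    matrix, lm_vidx / lm_2d: landmark vertex indices and their calibrated 2D
    positions s_{i,k}."""
    V = x.reshape(-1, 3)
    E_ld = np.sum((project(V[lm_vidx], K) - lm_2d) ** 2)
    delta, delta0 = L @ V, L @ V0
    len0 = np.linalg.norm(delta0, axis=1, keepdims=True)   # delta_k of the initial mesh
    lend = np.linalg.norm(delta, axis=1, keepdims=True) + 1e-12
    E_lap = np.sum((delta - len0 * delta / lend) ** 2)     # preserve Laplacian lengths
    return E_ld + w_lap * E_lap

# F_i = minimize(fit_energy, V0.ravel(), args=(V0, L, K, lm_vidx, lm_2d)).x.reshape(-1, 3)
```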
Based on these corrected meshes {F_i} and the expression coefficients e_i obtained for each image when fitting the initial meshes, the user's corrected face fusion model {B_j} is obtained with the example-based facial rigging algorithm described in (LI, H., WEISE, T., AND PAULY, M. 2010. Example-based facial rigging. ACM Trans. Graph. 29, 4 (July), 32:1-32:6.).
2.2 construction of Hair deformation model
Constructing the hair model of an animated avatar is much more difficult than constructing the face model. First, different people have very different hair styles, and it is difficult to represent all of them with one global hair model; the hair model must therefore be reconstructed entirely from the images. Furthermore, hair, particularly long hair, moves non-rigidly as the head rotates, owing to gravity, interaction with the body and other factors. If the hair were treated as a rigid object moving with the head, it could not be matched to the captured images, and the generated facial animation would lack realism. The invention therefore creates a deformable hair model for the image avatar, which approximately simulates the dynamic changes of the hair.
The invention only needs a low-resolution hair model to assist in mapping and fusing images at run time. To construct such a hair model, the invention first estimates depth values in the hair region of each image; all depth maps are then optimized jointly, and for the joint optimization of long hair undergoing non-rigid motion, correspondences between hair pixels of different images have to be found; finally, a hair mesh with a consistent topology is generated for each image from these depth maps.
The invention uses the single-view hair modeling method of (CHAI, M., WANG, L., WENG, Y., YU, Y., GUO, B., AND ZHOU, K. 2012. Single-view hair modeling for portrait manipulation.) to estimate the hair depth of a single image. The method optimizes a combination of a boundary energy term and a smoothing energy term. First, from the segmented hair region Ω_h, the invention computes the hair contour. The depth values of the contour pixels are initialized directly from the face mesh F_i generated in the previous step. The initial depth D^0 is set as follows: for an interior contour pixel (one that overlaps the face in the image), its depth is set to the depth of the face mesh F_i at that pixel; for an exterior contour pixel, its depth is set to the average depth of the outer contour points of the face mesh F_i. The boundary energy term can thus be written as:
E_{sil} = \sum_{p \in \Omega_h} \left( \left\| D_p - D_p^0 \right\|^2 + \left\| n_p - \nabla \Omega_h \right\|^2 \right)
where D_p is the depth value at each pixel to be solved, D_p^0 is its initial depth, n_p is the normal at the pixel, and ∇Ω_h denotes the image gradient along the hair contour in the two-dimensional image.
The second term, the smoothing energy, ensures that the solved hair depths and normals are as smooth as possible; it can be written as:
E_{sm} = \sum_{p \in \Omega_h} \sum_{q \in N(p)} \left( \omega_d \left\| D_p - D_q \right\|^2 + \omega_n \left\| n_p - n_q \right\|^2 \right)
where p is a pixel in the hair region Ω_h, N(p) is the 4-neighborhood of p, q is one of its neighbors, D_p and D_q are the depths of p and q, n_p and n_q are their normals, and ω_d and ω_n are weights controlling the smoothness of depth and normal respectively. By jointly optimizing the energy E_sil + E_sm, the depth map D_i of each image I_i is obtained.
The depth computation described above processes each image separately and does not take the consistency of hair depth across different images into account, so the hair model generated this way cannot match every image. The invention therefore enforces hair depth consistency among the images and uses a global optimization to solve the hair depths of all images jointly. The joint optimization is executed iteratively in an alternating fashion, visiting each depth map in turn in every iteration; while a depth map D_i is being corrected, the other depth maps are held fixed and act as constraints on D_i. Specifically, the invention first transforms the other depth maps {D_j}_{j≠i} into the camera coordinate system of D_i, denoted {D̂_j}. The depth consistency between different images can then be expressed as the sum of pixel differences between D_i and {D̂_j}, i.e.:
E_{con} = \sum_j \sum_p \left\| d_{i,p} - \hat{d}_{j,p} \right\|^2
where d_{i,p} and d̂_{j,p} are the depth values of D_i and D̂_j at pixel p, with D_i the depth map of the i-th image and D̂_j a transformed depth map. This joint consistency energy is combined with the boundary energy term E_sil and the smoothing energy term E_sm described above and optimized jointly, yielding the jointly optimized depth maps D_i.
Next, the invention describes how a depth map D_j is transformed into the camera coordinate system of D_i to generate D̂_j. For a user with short hair, the hair can be assumed to move rigidly with the head, so D_j can be transformed rigidly into the camera coordinate system of D_i using the rigid transformation parameters of the head meshes fitted to the two images; the formula is:
P(\hat{d}_{j,p}) = (R_i, T_i) \cdot (R_j, T_j)^{-1} \cdot P(d_{j,p})
where P(·) converts a pixel of a depth map into a three-dimensional point in the camera coordinate system, R and T are the rotation and translation parameters that transform the mesh from object coordinates to camera coordinates when fitting the face mesh, and d_{j,p} and d̂_{j,p} are the depth values of D_j and D̂_j at pixel p (a sketch of this conversion follows).
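For the short-hair case, the conversion of D_j into camera i's coordinate system is a plain rigid warp. The sketch below unprojects every pixel of D_j, applies (R_i, T_i)·(R_j, T_j)^(-1), and re-projects into camera i; the shared pinhole intrinsics K, the nearest-pixel splatting, and the handling of empty pixels are assumptions of this illustration.

```python
import numpy as np

def warp_depth_rigid(D_j, K, R_i, T_i, R_j, T_j):
    """Return D_hat_j: depth map D_j (H x W) expressed in camera i's frame,
    assuming the hair moves rigidly with the fitted head transforms."""
    H, W = D_j.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)
    P_j = np.linalg.inv(K) @ pix * D_j.reshape(1, -1)        # unproject into camera j
    P_obj = R_j.T @ (P_j - T_j.reshape(3, 1))                # camera j -> object space
    P_i = R_i @ P_obj + T_i.reshape(3, 1)                    # object space -> camera i
    q = K @ P_i
    z = np.where(q[2] > 1e-9, q[2], 1.0)                     # guard against empty pixels
    ui = np.round(q[0] / z).astype(int)
    vi = np.round(q[1] / z).astype(int)
    D_hat = np.zeros_like(D_j)
    ok = (D_j.reshape(-1) > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    D_hat[vi[ok], ui[ok]] = P_i[2, ok]                       # nearest-pixel splat
    return D_hat
```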
For long hair, however, the motion of the hair cannot be described simply as rigid motion with the head, because gravity and contact with the body cause non-rigid changes. In this case the invention computes the correspondence C_ij between D_i and D_j; how the correspondence between the hair in two images is computed is described in detail below. Based on this correspondence, the invention applies a mesh deformation to D_j to obtain D̂_j. The deformation-optimization energy is:
E_j = \sum_{(c_i, c_j) \in C_{ij}} \left\| \hat{d}_{j,c_j} - d_{i,c_i} \right\|^2 + \omega_l \sum_k \left\| \Delta v_k - \delta_k \frac{\Delta v_k}{\left| \Delta v_k \right|} \right\|^2
where (c_i, c_j) is one pair in the correspondence set C_ij, d̂_{j,c_j} and d_{i,c_i} are the depth values of the corresponding pixels in D̂_j and D_i, the vertex v_k is the k-th vertex of the mesh constructed from the depth map D_j, Δ is, as in the Laplacian energy term above, the discrete Laplacian operator on the mesh based on the cotangent formula, δ_k is the length of the Laplacian vector at the k-th vertex of the initial mesh, and ω_l is the weight controlling the Laplacian energy, set to 10 in the invention.
As noted above, the hair may undergo some non-rigid motion as it moves with the rotating head. Before performing the joint depth optimization, the invention therefore needs to compute the correspondence between a depth map D_i and another depth map D_j. The correspondence-finding algorithm has three steps: computing image-space correspondences, building a rough matching, and correcting the correspondences.
In the first step, the invention computes the correspondence of the hair regions of images I_i and I_j with the PatchMatch algorithm (BARNES, C., SHECHTMAN, E., FINKELSTEIN, A., AND GOLDMAN, D. B. 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 28, 3 (July), 24:1-24:11.). The correspondences produced this way are not accurate enough at some pixels, so in the second step the invention uses a mesh deformation algorithm to compute a rough matching. The invention first constructs a regular grid P_i over the hair region of image I_i, in which each pixel, combined with D_i, forms a vertex of the grid. Then, for every vertex of P_i, if all pixels in the surrounding 3x3 neighborhood have PatchMatch matching errors below the threshold of 0.05 and their PatchMatch offsets are similar, the invention averages the offsets of these neighborhood pixels, applies the average to the vertex and moves it to a new position; if a vertex does not satisfy these conditions, the offset given by PatchMatch is considered unreliable and the vertex is left unconstrained (see the sketch after this paragraph). Using the moved vertices as position constraints, the Laplacian mesh deformation algorithm is applied to deform P_i. The deformed mesh P_i' is rendered into the image space of I_j, so that from the rendered projections of P_i and P_i' in the two images a rough per-pixel matching between I_i and I_j is obtained. In the last step, based on this rough matching, a corresponding position in I_j is determined for each pixel of the hair region of I_i; if the error given by PatchMatch is still larger than the threshold after this correction step, the correspondence of that pixel is marked as invalid and the pixel is removed from the constraints in the joint consistency energy.
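A minimal sketch of the rough-matching rule described above: a grid vertex receives an averaged PatchMatch offset as a position constraint only when every pixel of its 3x3 neighbourhood matches with error below 0.05 and the neighbourhood offsets agree; the offset-agreement tolerance and the array layout are assumptions of this illustration.

```python
import numpy as np

def reliable_vertex_offsets(pm_offset, pm_error, err_thresh=0.05, offset_tol=2.0):
    """pm_offset: H x W x 2 PatchMatch offsets from I_i to I_j, pm_error: H x W
    matching errors. Returns an H x W x 2 array of position constraints for the
    regular hair grid P_i, with NaN where the vertex is left unconstrained and
    handled by the Laplacian mesh deformation alone."""
    H, W, _ = pm_offset.shape
    constraint = np.full((H, W, 2), np.nan)
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            nb_off = pm_offset[y - 1:y + 2, x - 1:x + 2].reshape(-1, 2)
            nb_err = pm_error[y - 1:y + 2, x - 1:x + 2]
            if (nb_err < err_thresh).all() and np.ptp(nb_off, axis=0).max() < offset_tol:
                constraint[y, x] = nb_off.mean(axis=0)   # average the 3x3 offsets
    return constraint
```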
Based on the hair depth maps obtained above, the deformable hair model of the user can be constructed. Specifically, the depth maps of all images are transformed and deformed into the same camera coordinate system, namely the camera space of the first image I_1. For each depth map D_j (j ≠ 1), a regular grid is first created from the pixel depths and then deformed with the same algorithm used when computing the correspondences. Each vertex of the deformed meshes is taken as a three-dimensional point, giving the three-dimensional point cloud of the hair model. The invention removes outliers from this point cloud with the following rule: if the z component of the normal at a point is smaller than the given threshold of 0.5, the point is marked as an outlier (a sketch of this rule follows). From the remaining point cloud, the invention uses Poisson surface reconstruction to generate the hair mesh H_1 corresponding to image I_1.
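The outlier rule for the merged hair point cloud is a simple threshold on the z component of the point normals; the sketch below applies it before the Poisson step, with the Open3D calls shown only as a commented, assumed usage.

```python
import numpy as np

def remove_hair_outliers(points, normals, z_thresh=0.5):
    """Keep only points whose normal z component is at least z_thresh (= 0.5),
    as in the rule above; points and normals are N x 3 arrays."""
    keep = normals[:, 2] >= z_thresh
    return points[keep], normals[keep]

# Assumed downstream usage with Open3D for the Poisson reconstruction:
# import open3d as o3d
# pcd = o3d.geometry.PointCloud()
# pcd.points  = o3d.utility.Vector3dVector(pts)
# pcd.normals = o3d.utility.Vector3dVector(nrm)
# H_1, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
```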
Since the deformations from the other images I_j (j ≠ 1) to I_1 are already known from the previous computation, the invention finally applies rigid transformations and non-rigid mesh deformations to H_1 to transfer it into the other 14 images with different head poses, obtaining the hair mesh shapes {H_i}, i = 1, 2, ..., 15, for the images with different head poses. The invention takes this set of 15 hair meshes as the deformable hair model of the user, which spans the space of non-rigid hair motion under different head poses. Note that the invention assumes that facial expressions have no effect on the shape of the hair.
2.3 construction of eye, tooth and body models
To make the avatar more complete, the invention also constructs models of the other parts, including the eyes, teeth and body. Unlike the face and hair, the motion of these parts is relatively simple and does not vary much across expressions. The invention therefore constructs the models of the eyes and body from the frontal natural-expression image (i.e., I_1) and the model of the teeth from the image of the expression with exposed teeth.
The invention uses two flat plates to express an eye: one for the iris and one for the white of the eye. The invention first computes, from the face mesh fitted to each image, the bounding box of the corresponding eye vertices on the mesh as the rectangular region of the eye. The position and size of the iris are then obtained from the largest ellipse automatically detected inside this rectangle, and the detected iris is copied into the iris plate. For the eye-white plate, the invention first copies the eye image into the plate and removes the iris region from it; the missing pixels left by removing the iris are synthesized with the PatchMatch algorithm, using the eye-white region as the source.
The invention uses two flat plates to represent the upper-jaw and lower-jaw teeth of the avatar. In the image of the expression with exposed teeth, similarly to the eye model, the invention first computes a bounding box from the corresponding vertex positions in the face mesh to determine the tooth region. The size of the tooth plates is determined by the corresponding vertices of the face mesh. The invention also provides a dragging tool to manually correct inaccurate plate positions in the tooth construction.
The invention uses one flat plate to approximate the upper body; the color of the plate is taken directly from the body layer of image I_1. The depth of the body plate is set to the average depth of the contour points of the face mesh.
3. Real-time face and hair geometry generation
During real-time operation, the invention uses the monocular-camera face tracking system of (CAO, C., HOU, Q., AND ZHOU, K. 2014. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. Graph. 33, 4 (July), 43:1-43:10.) to capture the facial motion of the user and drive the image avatar. Specifically, for each input video frame, the face tracking system obtains the facial motion parameters, including the rigid head transformation R, T and the facial expression coefficients e. From these parameters, the invention generates the geometric models of the avatar's face and hair for the current frame.
3.1 face geometry Generation
Based on the pre-computed avatar face fusion model {B_j}, the invention generates the face geometry mesh F of the current frame with the following formula:
F = R \left( B_0 + \sum_{j=1}^{46} e_j B_j \right) + T
where R and T are the rigid rotation and translation parameters of the current-frame face given by the face tracking system, B_0 is the natural-expression mesh, e_j is the j-th component of the expression coefficient vector e, and B_j is the j-th expression mesh of the fusion model (a sketch of this evaluation follows).
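The face geometry formula above is a direct blendshape evaluation; a minimal sketch follows. How the expression meshes B_j are stored (as offsets from B_0 or as absolute meshes) follows the fusion-model convention, so this sketch simply implements the formula as written; the array shapes are assumptions.

```python
import numpy as np

def current_frame_face(B, e, R, T):
    """F = R (B_0 + sum_{j=1..46} e_j B_j) + T.
    B: (47, N, 3) fusion model with B[0] the neutral mesh, e: (46,) expression
    coefficients, R: 3x3 rigid rotation, T: (3,) translation from the tracker."""
    shape = B[0] + np.tensordot(e, B[1:], axes=(0, 0))   # N x 3 blended shape
    return shape @ R.T + T
```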
To connect the head and neck seamlessly, the invention fixes the positions of the neck vertices of the head model to the body plate and then updates the positions of the other vertices with a Laplacian-based mesh deformation algorithm. The updated head geometry mesh is denoted F̂. Finally, the system recomputes the rigid transformation parameters R̂, T̂ by a three-dimensional registration between F and F̂.
3.2 Hair geometry Generation
Based on the generated face geometry mesh F̂, the invention proceeds to generate the hair geometry mesh H. Similar to the construction of the face mesh, the hair mesh is interpolated from the hair meshes {H_i} of the pre-captured images:
H = \hat{R} \cdot \left( \sum_{i=1}^{15} r_i H_i \right) + \hat{T}
where R̂ and T̂ are the face rigid transformation parameters computed in the previous step, and r_i is the interpolation weight of the hair mesh H_i of pre-captured image I_i, computed from the current-frame head rotation R̂ and the head rotations {R_i} of the pre-captured images:
r_i = \frac{e^{-\omega_r \left\| \hat{R} - R_i \right\|}}{\sum_{j=1}^{15} e^{-\omega_r \left\| \hat{R} - R_j \right\|}}
where ω_r is an interpolation parameter set to 10 in the invention, e is the natural base, and the current-frame head rotation R̂ and the head rotations {R_i} of the pre-captured images are all expressed as quaternions (a sketch follows).
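The hair interpolation above reduces to a softmax-like weighting of the 15 pre-captured hair meshes by quaternion distance to the current head rotation. A minimal sketch, assuming sign-aligned unit quaternions and a (15, N, 3) mesh array:

```python
import numpy as np

def hair_blend_weights(q_current, q_captured, omega_r=10.0):
    """r_i = exp(-omega_r ||q - q_i||) / sum_j exp(-omega_r ||q - q_j||),
    with q_current (4,) and q_captured (15, 4) as sign-aligned unit quaternions."""
    d = np.linalg.norm(q_captured - q_current[None, :], axis=1)
    w = np.exp(-omega_r * d)
    return w / w.sum()

def current_frame_hair(H_meshes, q_current, q_captured, R_hat, T_hat):
    """H = R_hat (sum_i r_i H_i) + T_hat, with H_meshes of shape (15, N, 3)."""
    r = hair_blend_weights(q_current, q_captured)
    blended = np.tensordot(r, H_meshes, axes=(0, 0))     # N x 3 blended hair mesh
    return blended @ R_hat.T + T_hat
```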
4. Real-time face animation synthesis
4.1 mapping images
Having generated the current-frame geometry meshes F̂ and H, the invention maps the pre-captured images {I_i} using the face meshes {F_i} and hair meshes {H_i} of those images, obtaining a series of mapped images denoted {Î_i}.
The avatar-driven animation image is finally obtained by taking each of its pixels as a weighted average of the corresponding pixels of the mapped images. To obtain the weight of each pixel in each mapped image, the invention first computes weights on the vertices of the corresponding meshes and then interpolates a weight for every pixel with radial basis functions.
4.2 vertex weight calculation
Specifically, for each mapped image Î_i, the invention first computes a weight w(v_k) for every vertex v_k of the meshes F̂ and H. The core idea is that a vertex of a captured image receives a larger weight when the normal and expression of the current-frame mesh are similar to those in that captured image and when the corresponding mesh vertex in that captured image faces the line of sight more directly. Mathematically:
w(v_{i,k}) = e^{-\omega_z \left( 1 - n_{i,k}^z \right)} \cdot e^{-\omega_n \left( 1 - n_{i,k} \cdot n_k \right)} \cdot \alpha_{i,k} \, e^{-\omega_e \left( 1 - \psi(e_i, e) \right)}
where v_k is a vertex of the mesh F̂/H, v_{i,k} is the corresponding vertex of the mesh F_i/H_i of captured image I_i, n_{i,k}/n_k are the normals at the vertices v_{i,k}/v_k, n_{i,k}^z is the z component of the normal n_{i,k}, ω_z, ω_n and ω_e control the relative importance of each factor and are set to 5, 10 and 30 in the invention, and α_{i,k} is a binary mask on the mesh manually calibrated for a specific expression, with regions semantically related to the expression set to 1 and the rest set to 0. Note that manually calibrating these binary masks is a one-time step independent of the avatar itself: the masks only need to be calibrated once on a generic expression fusion model and can then be used for all image avatars. The term ψ(e_i, e) is computed as:
\psi(e_i, e) = \frac{e_i \cdot e}{\left\| e_i \right\| \left\| e \right\|}
where (·) denotes the dot product of two vectors, and e_i and e are the expression coefficients of the captured image I_i and the current frame I, respectively.
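The per-vertex weight is a product of three exponential terms and the expression mask; a minimal sketch of the formula above, with a small epsilon guard as the only addition:

```python
import numpy as np

def vertex_weight(n_ik, n_k, alpha_ik, e_i, e, w_z=5.0, w_n=10.0, w_e=30.0):
    """w(v_{i,k}) = exp(-w_z (1 - n_{i,k}^z)) * exp(-w_n (1 - n_{i,k} . n_k))
                    * alpha_{i,k} * exp(-w_e (1 - psi(e_i, e)))."""
    psi = float(np.dot(e_i, e)) / (np.linalg.norm(e_i) * np.linalg.norm(e) + 1e-12)
    return (np.exp(-w_z * (1.0 - n_ik[2]))
            * np.exp(-w_n * (1.0 - float(np.dot(n_ik, n_k))))
            * alpha_ik
            * np.exp(-w_e * (1.0 - psi)))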
4.3 Pixel weight calculation and image Synthesis
After the weight of each mesh vertex in each image is obtained, the system computes the weight w_{i,p} of each image I_i at an individual pixel p by radial basis function interpolation:
w_{i,p} = \sum_k e^{-\omega_u \left\| u_p - u_{i,k} \right\|^2} \beta_k \, w(v_{i,k})
where u_p is the two-dimensional coordinate of pixel p, u_{i,k} is the projected position of vertex v_{i,k} on the image, ω_u controls the size of the image region influenced by each vertex v_{i,k}, and β_k is a visibility term that is 1 if the vertex v_k is visible in the current-frame mesh and 0 otherwise. After the per-pixel weights are computed, all images are normalized so that the weights of the same pixel over all images sum to 1, i.e. Σ_i w_{i,p} = 1.
Note that in practice the invention does not use all vertices of the mesh; instead it uniformly samples about 1/10 of the vertices and performs the radial basis function interpolation only on these sampled vertices. Experiments show that this smaller set of points not only speeds up the computation but also makes the blending between different images smoother, while still producing satisfactory results (see the sketch below).
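Putting the last two subsections together, the following sketch interpolates the sampled vertex weights to pixels with the Gaussian radial basis of the formula above and then normalizes across images; the bandwidth value ω_u and the 0/1 visibility vector β are assumptions of the illustration.

```python
import numpy as np

def pixel_weights(pixel_uv, sample_uv, sample_w, sample_vis, w_u=1.0):
    """w_{i,p} = sum_k exp(-w_u ||u_p - u_{i,k}||^2) * beta_k * w(v_{i,k}),
    evaluated on the ~1/10 uniformly sampled mesh vertices.
    pixel_uv: P x 2, sample_uv: K x 2, sample_w: (K,), sample_vis: (K,) in {0, 1}."""
    d2 = ((pixel_uv[:, None, :] - sample_uv[None, :, :]) ** 2).sum(-1)   # P x K
    return (np.exp(-w_u * d2) * (sample_vis * sample_w)[None, :]).sum(-1)

def normalize_across_images(per_image_w):
    """Normalize so that the weights of the same pixel over all mapped images sum to 1."""
    W = np.asarray(per_image_w, dtype=np.float64)        # (num_images, P)
    return W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
```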
In addition to the face and hair, images of the other parts of the avatar, including the eyes, teeth and body, are generated during facial animation synthesis and combined naturally with the face and hair to produce the final result.
For the eyes, the invention first adds two feature points indicating the iris positions to all training data in (CAO, C., HOU, Q., AND ZHOU, K. 2014. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. Graph. 33, 4 (July), 43:1-43:10.), so that the iris positions can be tracked at run time to describe the rotation of the eyeballs. When generating the eye images of the avatar, the eye-white plates move rigidly with the head, i.e. they are transformed directly to the corresponding positions by the rigid transformation R̂, T̂. For the iris plates, in addition to the rigid transformation R̂, T̂, the invention computes the translation of the iris from the tracked iris feature point positions and applies it to the iris plate to realize the rotation of the eyeball.
For the teeth, the plate of the upper-jaw teeth is attached directly to the head and moves rigidly with it, transformed directly by the rigid transformation parameters R̂, T̂; the lower-jaw teeth move with the jaw.
For the body, the invention makes it follow only the translation of the head, i.e. the translation parameter T̂ is applied to the body plate, which is transformed to the corresponding position and rendered in image space as the background behind the other parts (face, hair, teeth and eyes).
In the invention, only a low-resolution mesh is used to express the avatar; detail information on the face, such as skin folds and wrinkles, is implicitly contained in the image-based avatar representation. Furthermore, and more importantly, the image-based avatar representation of the invention can effectively handle hair and the head ornaments on it, which previous work did not address.
Examples of the embodiments
The method described in the invention was implemented on an ordinary desktop computer (Intel i7 3.6 GHz CPU, 32 GB memory, NVIDIA GTX 760 graphics card). All images were captured with an ordinary webcam providing a resolution of 1280 x 720. Constructing an image animation avatar typically takes 10 minutes for image acquisition, 40 minutes for image preprocessing and 15 minutes for computing the face fusion model and the hair deformation model. At run time, the invention works with the real-time face tracking system of (CAO, C., HOU, Q., AND ZHOU, K. 2014. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Trans. Graph. 33, 4 (July), 43:1-43:10.) and needs about 30 milliseconds per input frame to drive the image avatar and generate the animation. Overall, the invention achieves more than 25 frames per second on an ordinary desktop computer.
Practical results show that the invention can create realistic dynamic avatars for different users with different hair styles and head ornaments. Unlike previous work that generates geometric hierarchies for facial details such as wrinkles, the results of the invention naturally include the user's various wrinkle details thanks to the image-based avatar representation. The invention can also handle large head rotations and the hair changes they cause, fully demonstrating the roles of the face fusion model and the hair deformation model.

Claims (5)

1. An image-based dynamic avatar construction method, characterized by comprising the following steps:
(1) Data acquisition and preprocessing: facial images of a user performing a series of motions and expressions are captured with a webcam, and preprocessing such as segmentation and feature point calibration is performed on the images.
(2) Image-based avatar construction: a face fusion model and a hair deformation model of the user are generated from the processed images, and an image-based avatar representation of the user is then obtained.
(3) Real-time face and hair geometry generation: during real-time facial animation driving, the avatar representation is driven to generate the corresponding face and hair geometry according to the facial motion and expression parameters obtained by tracking.
(4) Real-time facial animation synthesis: the captured images are mapped using the obtained face and hair geometry, and the mapped images are fused according to image confidence to generate a realistic facial animation image.
2. The image-based dynamic avatar construction method according to claim 1, wherein said step (1) mainly comprises the following sub-steps:
(1.1) The user is asked to perform a series of motions and expressions, and the corresponding facial images are captured.
(1.2) The captured images are segmented with the Lazy Snapping algorithm into different layers: face, hair, teeth, eyes, body and background.
(1.3) Two-dimensional feature points of each image are calibrated with a two-dimensional feature point regressor, and unsatisfactory parts of the automatic calibration result are repaired manually with a simple dragging tool.
3. The image-based dynamic avatar construction method according to claim 1, wherein said step (2) mainly comprises the following sub-steps:
(2.1) Based on an existing three-dimensional facial expression database, a unified user identity coefficient and, for each image, rigid transformation parameters and non-rigid expression coefficients of the face are fitted to obtain a face mesh matching each image; the meshes are then further corrected by deformation so that they agree better with the images; and a face fusion model of the user is generated from the corrected meshes.
(2.2) Based on the face mesh of each image, the depth value of each pixel in the hair region of each image is solved; combining the face meshes and hair regions of all images, the depth values of the hair regions are optimized jointly; and finally a deformation model of the user's hair is created from the hair depth information of all images.
(2.3) Based on the face fusion model generated by fitting, corresponding eye, tooth and body models are constructed for the avatar representation.
4. The image-based dynamic avatar construction method according to claim 1, wherein said step (3) mainly comprises the following sub-steps:
(3.1) The facial motion parameters of the current frame, including rigid rotation and translation parameters and non-rigid expression coefficients, are obtained from the input face video with an existing real-time face tracking system, and the face geometry of the user is generated from these parameters.
(3.2) The hair geometry of the user is generated from the rigid rotation and translation parameters and the non-rigid expression coefficients.
5. The image-based dynamic avatar construction method according to claim 1, wherein said step (4) mainly comprises the following sub-steps:
(4.1) Based on the face and hair geometry generated for the current frame, the captured images are mapped to obtain a series of mapped images.
(4.2) The weight of each vertex in each mapped image is computed from the difference between the face and hair models generated for the current frame and the corresponding meshes in the captured images.
(4.3) In each mapped image, the weight at each pixel is computed from the vertex weights using radial basis functions; the mapped images are fused with these weights to obtain the final facial animation image.
CN201610331428.3A 2016-05-18 2016-05-18 An image-based dynamic avatar construction method Active CN106023288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610331428.3A CN106023288B (en) 2016-05-18 2016-05-18 An image-based dynamic avatar construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610331428.3A CN106023288B (en) 2016-05-18 2016-05-18 An image-based dynamic avatar construction method

Publications (2)

Publication Number Publication Date
CN106023288A true CN106023288A (en) 2016-10-12
CN106023288B CN106023288B (en) 2019-11-15

Family

ID=57097457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610331428.3A Active CN106023288B (en) 2016-05-18 2016-05-18 An image-based dynamic avatar construction method

Country Status (1)

Country Link
CN (1) CN106023288B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599811A (en) * 2016-11-29 2017-04-26 叶飞 Facial expression tracking method of VR head-mounted display
CN107707839A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Image processing method and device
CN107705276A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Image processing method and device, electronic installation and computer-readable recording medium
CN107784689A (en) * 2017-10-18 2018-03-09 唐越山 Method for running, system and collecting device based on data acquisition avatar service
CN108154550A (en) * 2017-11-29 2018-06-12 深圳奥比中光科技有限公司 Face real-time three-dimensional method for reconstructing based on RGBD cameras
CN109215131A (en) * 2017-06-30 2019-01-15 Tcl集团股份有限公司 The driving method and device of conjecture face
CN110060205A (en) * 2019-05-08 2019-07-26 北京迈格威科技有限公司 Image processing method and device, storage medium and electronic equipment
CN110163957A (en) * 2019-04-26 2019-08-23 李辉 A kind of expression generation system based on aestheticism face program
CN110443885A (en) * 2019-07-18 2019-11-12 西北工业大学 Three-dimensional number of people face model reconstruction method based on random facial image
CN111182350A (en) * 2019-12-31 2020-05-19 广州华多网络科技有限公司 Image processing method, image processing device, terminal equipment and storage medium
CN111292276A (en) * 2018-12-07 2020-06-16 北京字节跳动网络技术有限公司 Image processing method and device
CN111382634A (en) * 2018-12-29 2020-07-07 河南中原大数据研究院有限公司 Three-dimensional face recognition method based on depth video stream
CN111445384A (en) * 2020-03-23 2020-07-24 杭州趣维科技有限公司 Universal portrait photo cartoon stylization method
CN111460872A (en) * 2019-01-18 2020-07-28 北京市商汤科技开发有限公司 Image processing method and apparatus, image device, and storage medium
CN111583367A (en) * 2020-05-22 2020-08-25 构范(厦门)信息技术有限公司 Hair simulation method and system
CN113014832A (en) * 2019-12-19 2021-06-22 志贺司 Image editing system and image editing method
CN113269888A (en) * 2021-05-25 2021-08-17 山东大学 Hairstyle three-dimensional modeling method, character three-dimensional modeling method and system
CN116012497A (en) * 2023-03-29 2023-04-25 腾讯科技(深圳)有限公司 Animation redirection method, device, equipment and medium
CN117808943A (en) * 2024-02-29 2024-04-02 天度(厦门)科技股份有限公司 Three-dimensional cartoon face reconstruction method, device, equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117730345A (en) * 2021-07-29 2024-03-19 数字王国虚拟人(美国)股份有限公司 System and method for animating secondary features

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800129A (en) * 2012-06-20 2012-11-28 浙江大学 Hair modeling and portrait editing method based on single image
CN103093490A (en) * 2013-02-02 2013-05-08 浙江大学 Real-time facial animation method based on single video camera
CN103606186A (en) * 2013-02-02 2014-02-26 浙江大学 Virtual hair style modeling method of images and videos
CN103942822A (en) * 2014-04-11 2014-07-23 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800129A (en) * 2012-06-20 2012-11-28 浙江大学 Hair modeling and portrait editing method based on single image
CN103093490A (en) * 2013-02-02 2013-05-08 浙江大学 Real-time facial animation method based on single video camera
CN103606186A (en) * 2013-02-02 2014-02-26 浙江大学 Virtual hair style modeling method of images and videos
CN103942822A (en) * 2014-04-11 2014-07-23 浙江大学 Facial feature point tracking and facial animation method based on single video vidicon

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN CAO et al.: "FaceWarehouse: a 3D Facial Expression Database for Visual Computing", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599811A (en) * 2016-11-29 2017-04-26 叶飞 Facial expression tracking method of VR heat-mounted display
CN106599811B (en) * 2016-11-29 2019-11-05 苏州虚现数字科技有限公司 A facial expression tracking method for a VR head-mounted display
CN109215131A (en) * 2017-06-30 2019-01-15 Tcl集团股份有限公司 Virtual face driving method and device
CN109215131B (en) * 2017-06-30 2021-06-01 Tcl科技集团股份有限公司 Virtual face driving method and device
CN107707839A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Image processing method and device
CN107705276A (en) * 2017-09-11 2018-02-16 广东欧珀移动通信有限公司 Image processing method and device, electronic installation and computer-readable recording medium
CN107784689A (en) * 2017-10-18 2018-03-09 唐越山 Method for running, system and collecting device based on data acquisition avatar service
CN108154550B (en) * 2017-11-29 2021-07-06 奥比中光科技集团股份有限公司 RGBD camera-based real-time three-dimensional face reconstruction method
CN108154550A (en) * 2017-11-29 2018-06-12 深圳奥比中光科技有限公司 Face real-time three-dimensional method for reconstructing based on RGBD cameras
CN111292276A (en) * 2018-12-07 2020-06-16 北京字节跳动网络技术有限公司 Image processing method and device
CN111382634B (en) * 2018-12-29 2023-09-26 河南中原大数据研究院有限公司 Three-dimensional face recognition method based on depth video stream
CN111382634A (en) * 2018-12-29 2020-07-07 河南中原大数据研究院有限公司 Three-dimensional face recognition method based on depth video stream
CN111460872A (en) * 2019-01-18 2020-07-28 北京市商汤科技开发有限公司 Image processing method and apparatus, image device, and storage medium
CN111460872B (en) * 2019-01-18 2024-04-16 北京市商汤科技开发有限公司 Image processing method and device, image equipment and storage medium
CN110163957A (en) * 2019-04-26 2019-08-23 李辉 A kind of expression generation system based on aestheticism face program
CN110060205B (en) * 2019-05-08 2023-08-08 北京迈格威科技有限公司 Image processing method and device, storage medium and electronic equipment
CN110060205A (en) * 2019-05-08 2019-07-26 北京迈格威科技有限公司 Image processing method and device, storage medium and electronic equipment
CN110443885A (en) * 2019-07-18 2019-11-12 西北工业大学 Three-dimensional number of people face model reconstruction method based on random facial image
CN113014832A (en) * 2019-12-19 2021-06-22 志贺司 Image editing system and image editing method
CN113014832B (en) * 2019-12-19 2024-03-12 志贺司 Image editing system and image editing method
CN111182350A (en) * 2019-12-31 2020-05-19 广州华多网络科技有限公司 Image processing method, image processing device, terminal equipment and storage medium
CN111445384A (en) * 2020-03-23 2020-07-24 杭州趣维科技有限公司 Universal portrait photo cartoon stylization method
CN111583367B (en) * 2020-05-22 2023-02-10 构范(厦门)信息技术有限公司 Hair simulation method and system
CN111583367A (en) * 2020-05-22 2020-08-25 构范(厦门)信息技术有限公司 Hair simulation method and system
CN113269888A (en) * 2021-05-25 2021-08-17 山东大学 Hairstyle three-dimensional modeling method, character three-dimensional modeling method and system
CN116012497A (en) * 2023-03-29 2023-04-25 腾讯科技(深圳)有限公司 Animation redirection method, device, equipment and medium
CN116012497B (en) * 2023-03-29 2023-05-30 腾讯科技(深圳)有限公司 Animation redirection method, device, equipment and medium
CN117808943A (en) * 2024-02-29 2024-04-02 天度(厦门)科技股份有限公司 Three-dimensional cartoon face reconstruction method, device, equipment and storage medium
CN117808943B (en) * 2024-02-29 2024-07-05 天度(厦门)科技股份有限公司 Three-dimensional cartoon face reconstruction method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN106023288B (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN106023288B (en) An image-based dynamic avatar construction method
Hasler et al. Multilinear pose and body shape estimation of dressed subjects from image sets
Plänkers et al. Tracking and modeling people in video sequences
Hasler et al. Estimating body shape of dressed humans
CN101916454B (en) Method for reconstructing high-resolution human face based on grid deformation and continuous optimization
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
CN113744374B (en) Expression-driven 3D virtual image generation method
Ichim et al. Building and animating user-specific volumetric face rigs
CN108564619B (en) Realistic three-dimensional face reconstruction method based on two photos
CN113421328A (en) Three-dimensional human body virtual reconstruction method and device
CN115951784B (en) Method for capturing and generating motion of wearing human body based on double nerve radiation fields
CN115861525A (en) Multi-view face reconstruction method based on parameterized model
Achenbach et al. Accurate Face Reconstruction through Anisotropic Fitting and Eye Correction.
Neophytou et al. Shape and pose space deformation for subject specific animation
Zhang et al. Anatomy-based face reconstruction for animation using multi-layer deformation
Tejera et al. Animation control of surface motion capture
Ju et al. Individualising Human Animation Models.
Oliveira et al. Animating scanned human models
CN115471632A (en) Real human body model reconstruction method, device, equipment and medium based on 3D scanning
JP7251003B2 (en) Face mesh deformation with fine wrinkles
CN115049764A (en) Training method, device, equipment and medium for SMPL parameter prediction model
Remondino et al. Human motion reconstruction and animation from video sequences
Huang et al. Detail-preserving controllable deformation from sparse examples
Zeitvogel et al. Towards end-to-end 3D human avatar shape reconstruction from 4D data
Jian et al. Realistic face animation generation from videos

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant