CN115512039A - 3D face construction method - Google Patents

3D face construction method Download PDF

Info

Publication number
CN115512039A
Authority
CN
China
Prior art keywords
face
dimensional
vector
sample
fusion
Prior art date
2022-08-10
Legal status
Pending
Application number
CN202210954071.XA
Other languages
Chinese (zh)
Inventor
杜兰
陈彬
杜轶锋
周国华
叶国伟
廖森平
曾文君
蒋仕坚
Current Assignee
Ke Da Southern China Co ltd
Iflytek South China Artificial Intelligence Research Institute Guangzhou Co ltd
Original Assignee
Ke Da Southern China Co ltd
Iflytek South China Artificial Intelligence Research Institute Guangzhou Co ltd
Priority date: 2022-08-10
Filing date: 2022-08-10
Publication date: 2022-12-23
Application filed by Ke Da Southern China Co ltd and Iflytek South China Artificial Intelligence Research Institute Guangzhou Co ltd
Priority to CN202210954071.XA
Publication of CN115512039A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/04 Texture mapping
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/10012 Stereo images
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D face construction method and relates to the technical field of bionics. The method comprises the following steps: acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector; generating a three-dimensional face shape model from the fused feature vector; and filling the three-dimensional shape model with texture to generate a face model. The method aims to improve the construction precision of digital human faces.

Description

3D face construction method
Technical Field
The invention relates to the technical field of bionics, and in particular to a 3D face construction method.
Background
3D modeling is fundamental research in the field of computer graphics processing, with typical applications in 3D games, visual communication systems, virtual and augmented reality, and medical simulation; such models may be referred to as digital humans. In particular, with the emergence of the MPEG-4 face coding system, which targets real-time, ultra-low-bit-rate visual applications under current network bandwidth conditions, 3D face modeling has become a research hotspot in video telephony and virtual net-meeting applications. However, fast 3D face modeling remains a very challenging topic. Owing to its wide range of application scenarios, more and more researchers have studied 3D face modeling extensively and achieved substantial results. For example, the authors of "Rapid and Automatic 3D Face Modeling Using Active Appearance Models" propose an improved active appearance model that uses an initial position and an example model for selection during face feature extraction, and adjust the face model using eigenfaces and spatial affine transformation rules to match the position and shape of the corresponding face. The method is simple, but its imaging precision is not high, local ghosting appears, and it is only suitable for applications with low imaging requirements.
Therefore, how to improve the accuracy of digital human face construction has become an urgent technical problem to be solved.
Disclosure of Invention
The main object of the invention is to provide a 3D face construction method that improves the construction precision of digital human faces.
To achieve the above object, the present invention provides a 3D face construction method comprising the following steps:
acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector;
generating a three-dimensional face shape model from the fused feature vector;
and filling the three-dimensional shape model with texture to generate a face model.
In an embodiment of the present application, acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector includes:
acquiring face data of the same face from multiple different views;
encoding each view's face data through a weight-sharing deep convolutional neural network to obtain a face feature vector for each view;
and concatenating the face feature vectors and performing feature fusion through a multilayer perceptron network to generate the fused feature vector.
In an embodiment of the present application, generating a three-dimensional face shape model from the fused feature vector includes:
acquiring the fused feature vector;
decoding the fused feature vector through a multilayer perceptron network to obtain a three-dimensional model head pose vector;
decoding the fused feature vector through a graph convolutional neural network to obtain three-dimensional vertex vectors; and
generating the three-dimensional face shape model from the head pose vector and the vertex vectors.
In an embodiment of the present application, the head pose vector comprises:
at least one of the pitch angle of the face, the yaw angle of the face, the roll angle of the face, and a displacement vector in three-dimensional space.
In an embodiment of the present application, texture filling the three-dimensional shape model includes:
acquiring the fused feature vector;
and decoding the fused feature vector through a large-scale generative adversarial network to obtain face texture parameters, and filling the three-dimensional shape model according to the obtained parameters.
In an embodiment of the present application, the method further includes calculating the similarity between any two of the multiple view images of the face; when the similarity between two view images does not meet a preset condition, the feature vectors of those two images are not fused.
In an embodiment of the present application, the similarity between any two face images is calculated as follows:

l(x, y) = (2·μ_x·μ_y + c_1) / (μ_x² + μ_y² + c_1)

c(x, y) = (2·σ_x·σ_y + c_2) / (σ_x² + σ_y² + c_2)

s(x, y) = (σ_xy + c_3) / (σ_x·σ_y + c_3)

where l(x, y) represents the similarity of sample x and sample y in luminance; c(x, y) represents their similarity in contrast; s(x, y) represents their similarity in structure; generally c_3 = c_2/2;

μ_x and μ_y are the means of sample x and sample y, σ_x and σ_y their standard deviations, and σ_xy their covariance; the constants c_1 and c_2 are calculated as:

c_1 = (k_1·L)²

c_2 = (k_2·L)²

where k_1 defaults to 0.01, k_2 defaults to 0.03, and L is the dynamic range of the pixel values;

the structural similarity (SSIM) formula is:

SSIM(x, y) = f[l(x, y)·c(x, y)·s(x, y)].
By adopting this technical scheme, the feature vectors of multiple view images of the face are acquired and then fused, which reduces errors caused by viewpoint differences and improves the accuracy of face texture and shape reconstruction.
Drawings
The invention is described in detail below with reference to specific embodiments and the accompanying drawings, wherein:
Fig. 1 is a schematic structural diagram of a first embodiment of the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, the invention is described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific examples described here are intended only to illustrate the invention and are not to be construed as limiting it.
As shown in Fig. 1, to achieve the above object, the present invention provides a 3D face construction method comprising the following steps:
acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector;
generating a three-dimensional face shape model from the fused feature vector;
and filling the three-dimensional shape model with texture to generate a face model.
Specifically, in the 3D face construction method, multiple view images of a face are acquired. A VR3D instantaneous three-dimensional imaging system may be used for acquisition: it instantly captures omnidirectional photographic information of the same object at the same moment, which reduces errors among the multiple view images of the face, correspondingly improves their mutual similarity, and facilitates construction of the face model at a later stage.
Of course, it is also conceivable that the multiple view images of the face are captured sequentially with an ordinary optical camera.
The feature vectors of the multiple view images of the face may be obtained with a convolutional neural network or with a weight-sharing deep convolutional neural network.
A convolutional neural network is a feedforward neural network with a deep structure that performs convolution operations, and it is one of the representative algorithms of deep learning. It has feature-learning capability and can classify input information in a translation-invariant way according to its hierarchical structure. Convolutional neural networks are constructed by imitating the biological visual perception mechanism and can be trained with supervised or unsupervised learning. The sharing of convolution-kernel parameters within hidden layers and the sparsity of inter-layer connections allow a convolutional neural network to learn grid-like features, such as pixels and audio, with little computation; the network therefore performs stably and imposes no additional feature-engineering requirements on the data.
The input layer of a convolutional neural network can process multidimensional data. Commonly, the input layer of a one-dimensional convolutional neural network receives a one- or two-dimensional array, where a one-dimensional array is usually a time-domain or spectral sample and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a two- or three-dimensional array; and the input layer of a three-dimensional convolutional neural network receives a four-dimensional array.
The hidden layers of a convolutional neural network comprise convolutional layers, pooling layers, and fully connected layers; the convolution kernels in a convolutional layer contain weight coefficients. A convolutional layer extracts features from the input data and contains multiple convolution kernels, where each element of a kernel corresponds to a weight coefficient and a bias value. Each neuron in a convolutional layer is connected to several neurons in a nearby region of the previous layer, with the size of that region depending on the kernel size; in operation, the kernel sweeps regularly over the input features, multiplying and summing matrix elements and adding the bias value.
After feature extraction in the convolutional layers, the output feature maps are passed to the pooling layers for feature selection and information filtering. A pooling layer applies a preset pooling function that replaces the value at a single point in the feature map with a statistic of its neighborhood. The pooling layer selects pooling regions in the same way a convolution kernel scans the feature map, controlled by the pooling size, stride, and padding. Since feature-vector extraction by convolutional neural networks is conventional prior art, its specific workflow is not described in detail here.
Feature vectors of the multiple view images of the face are obtained through the convolutional neural network, and the obtained feature vectors are fused to generate a fused feature vector. Fusing the feature vectors of multiple view images significantly reduces errors caused by viewpoint differences. The fusion is performed with a multilayer perceptron network, which in this application refers to a neural network composed of fully connected layers with at least one hidden layer, where the output of each hidden layer is transformed by an activation function. The multilayer perceptron network fuses the feature vectors of the multiple view images quickly and also reduces errors introduced during the fusion process.
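The following is a minimal PyTorch sketch of this encode-and-fuse step. The network sizes, the view count, and the feature dimension are illustrative assumptions, not values specified in this application:

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Encode each view with one shared-weight CNN, concatenate the
    per-view feature vectors, and fuse them with a multilayer perceptron."""
    def __init__(self, num_views=8, feat_dim=256):
        super().__init__()
        # The same encoder (hence the same weights) is applied to every view.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Multilayer perceptron that fuses the concatenated view features.
        self.fuse = nn.Sequential(
            nn.Linear(num_views * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, views):                       # views: (B, V, 3, H, W)
        b, v = views.shape[:2]
        feats = self.encoder(views.flatten(0, 1))   # shared weights per view
        feats = feats.view(b, -1)                   # concatenate (splice)
        return self.fuse(feats)                     # fused feature vector

fused = MultiViewFusion()(torch.randn(2, 8, 3, 128, 128))  # -> (2, 256)
```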
The three-dimensional face shape model is generated from the fused feature vector. The model is constructed from a pose vector and a shape vector, with the fused feature vector as the input. Since the fused feature vector markedly reduces errors caused by viewpoint differences, the precision of face shape reconstruction is directly improved.
After the three-dimensional face shape model is complete, it is filled with texture. Because the input to texture filling is also the fused feature vector, which reduces viewpoint-difference errors, the precision of the texture filling is greatly improved as well.
The three-dimensional face shape model in this application refers to a mathematical model through which a three-dimensional face frame can be rendered on a display terminal. The face model in this application likewise refers to a mathematical model through which a face can be rendered on a display terminal; it is obtained by combining the three-dimensional face shape model with the texture filling.
By adopting this technical scheme, the feature vectors of multiple view images of the face are acquired and then fused, which reduces errors caused by viewpoint differences and improves the accuracy of face texture and shape reconstruction.
In an embodiment of the present application, acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector includes:
acquiring face data of the same face from multiple different views;
encoding each view's face data through a weight-sharing deep convolutional neural network to obtain a face feature vector for each view;
and concatenating the face feature vectors and performing feature fusion through a multilayer perceptron network to generate the fused feature vector.
Specifically, acquiring and fusing the feature vectors of multiple view images of a face proceeds as follows.
Face data of the same face is acquired from multiple different views; conceivably, to improve the accuracy of face model construction, at least eight view images of the face are used.
After the data from the multiple views of the same face is obtained, each view's data is encoded through a weight-sharing deep convolutional neural network. In this application, weight sharing means that the filter parameters of a convolution kernel remain fixed during computation: the kernel is moved over the scanned surface by a preset stride until all positions are covered, and all scanned pixels share the same filter weights.
The weight-sharing deep convolutional neural network guarantees consistency of the computation, and encoding each view's face data with it yields a face feature vector for each view. The face feature vectors are then concatenated, and feature fusion is performed on the concatenated vector through a multilayer perceptron network to generate the fused feature vector.
By adopting this technical scheme, the face data of the multiple view images is encoded by a weight-sharing deep convolutional neural network encoder to obtain a face feature vector for each view; the feature vectors of the multiple views are then concatenated and fused through a multilayer perceptron network, providing the input for shape reconstruction and texture reconstruction. Fusing the feature vectors of multiple view images significantly reduces errors caused by viewpoint differences and improves the accuracy of face texture and shape reconstruction.
In an embodiment of the present application, generating a three-dimensional face shape model from the fused feature vector includes:
acquiring the fused feature vector;
decoding the fused feature vector through a multilayer perceptron network to obtain a three-dimensional model head pose vector;
decoding the fused feature vector through a graph convolutional neural network to obtain three-dimensional vertex vectors; and
generating the three-dimensional face shape model from the head pose vector and the vertex vectors.
Specifically, generating the three-dimensional face shape model from the fused feature vector includes: acquiring the fused feature vector; decoding it through a multilayer perceptron network to obtain the three-dimensional model head pose vector; and predicting the head pose parameters of the three-dimensional model from that pose vector.
The fused feature vector is also decoded through a graph convolutional neural network to obtain the three-dimensional vertex vectors. Methods based on two-dimensional image convolution lead to deep networks with large numbers of parameters and fail to fully exploit the local geometric features of the three-dimensional face structure. Decoding the fused feature vector with a graph convolutional neural network instead propagates information between the three-dimensional vertices via graph convolution and makes full use of the local geometry of the face mesh. This enables end-to-end reconstruction of the three-dimensional face model and improves the construction precision of the face model.
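A minimal PyTorch sketch of this two-branch decoder follows. The multilayer perceptron branch outputs six pose values (three Euler angles plus a translation), and the graph-convolution branch propagates per-vertex features over a fixed mesh adjacency. All sizes, and the identity matrix standing in for the real face-mesh adjacency, are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ShapeDecoder(nn.Module):
    def __init__(self, feat_dim=256, num_verts=1220, adj=None):
        super().__init__()
        # MLP branch: fused feature vector -> head pose vector.
        self.pose_mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 6),              # pitch, yaw, roll, tx, ty, tz
        )
        # Graph branch: lift to per-vertex features, then graph convolutions.
        self.to_verts = nn.Linear(feat_dim, num_verts * 16)
        self.gc1 = nn.Linear(16, 16)        # per-vertex weights of layer 1
        self.gc2 = nn.Linear(16, 3)         # layer 2 outputs xyz per vertex
        # Normalized mesh adjacency; identity here is only a stand-in for
        # the real face-mesh connectivity.
        self.register_buffer(
            "adj", adj if adj is not None else torch.eye(num_verts))

    def forward(self, fused):               # fused: (B, feat_dim)
        pose = self.pose_mlp(fused)         # (B, 6) head pose vector
        x = self.to_verts(fused).view(fused.size(0), -1, 16)
        x = torch.relu(self.adj @ self.gc1(x))  # message passing, layer 1
        verts = self.adj @ self.gc2(x)          # (B, num_verts, 3)
        return pose, verts
```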
In an embodiment of the present application, the head pose vector comprises:
at least one of the pitch angle of the face, the yaw angle of the face, the roll angle of the face, and a displacement vector in three-dimensional space.
Specifically, the head pose vector includes the pitch angle of the face, the yaw angle of the face, the roll angle of the face, and a displacement vector in three-dimensional space, covering motion in all six degrees of freedom. These components cover the various views of the face and improve the precision of the constructed face model.
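As a worked illustration, a six-degree-of-freedom pose of this kind can be applied to the decoded vertices as a rotation built from the three Euler angles followed by the translation. The sketch below assumes angles in radians and a roll-yaw-pitch composition order, neither of which is specified in this application:

```python
import numpy as np

def apply_head_pose(verts, pitch, yaw, roll, t):
    """Apply a 6-DoF head pose to mesh vertices of shape (N, 3)."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])  # pitch (x-axis)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])  # yaw (y-axis)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])  # roll (z-axis)
    return verts @ (Rz @ Ry @ Rx).T + np.asarray(t)        # rotate, then shift

posed = apply_head_pose(np.random.rand(1220, 3), 0.1, -0.2, 0.0, [0, 0, 5])
```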
In an embodiment of the application, texture filling the three-dimensional shape model comprises:
acquiring the fused feature vector;
and decoding the fused feature vector through a large-scale generative adversarial network to obtain face texture parameters, and filling the three-dimensional shape model according to the obtained parameters.
Specifically, traditional three-dimensional face texture reconstruction schemes express face texture through low-dimensional texture parameters and suffer from insufficient realism and resolution. In this application, the fused feature vector is acquired and decoded with a large-scale generative adversarial network to generate high-resolution face texture. The generated face texture can be combined with a differentiable renderer and the face shape to produce multi-view reconstructed face images, and the generator network is trained driven by reconstruction loss and multi-view face discrimination loss. During network training a hinge loss function is applied together with batch normalization and spectral normalization, which effectively stabilizes the training of the adversarial network and improves the resolution and quality of the generated face texture images.
Batch normalization in this application is a deep-neural-network training technique: it accelerates model convergence and, more importantly, alleviates to some extent the gradient-dispersion problem (scattered feature distributions) in deep networks, making deep models easier and more stable to train.
Spectral normalization in this application rescales each weight matrix by its largest singular value, which constrains the Lipschitz constant of the discriminator and thereby facilitates training of the adversarial network.
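A short PyTorch sketch of the hinge losses and spectral normalization referred to above; the convolution layer shown is an illustrative discriminator layer, not one specified in this application:

```python
import torch
import torch.nn as nn

def d_hinge_loss(real_scores, fake_scores):
    # Discriminator hinge loss: push real scores above +1, fake below -1.
    return (torch.relu(1.0 - real_scores).mean()
            + torch.relu(1.0 + fake_scores).mean())

def g_hinge_loss(fake_scores):
    # Generator hinge loss: raise the discriminator's score on fakes.
    return -fake_scores.mean()

# Spectral normalization rescales the layer's weight by its largest
# singular value on every forward pass, stabilizing discriminator training.
disc_layer = nn.utils.spectral_norm(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1))
```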
In an embodiment of the present application, the method further includes calculating the similarity between any two of the multiple view images of the face; when the similarity between two view images does not meet a preset condition, the feature vectors of those two images are not fused.
Specifically, the similarity between any two of the multiple view images of the face is calculated. Structural similarity measures how alike two given images are, and its value is determined by three key features of the samples: luminance, contrast, and structure.
When the similarity between two view images does not meet the preset condition, a large discrepancy exists between them; the feature vectors of those two images are therefore not fused, preventing the discrepancy from being amplified. The scheme is simple in structure and convenient to implement.
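One possible reading of this gating rule is sketched below; the similarity threshold and the ssim() helper (one such helper is sketched after the formulas that follow) are illustrative assumptions:

```python
def select_views_for_fusion(images, ssim, threshold=0.5):
    """Drop any view involved in a pair whose similarity fails the
    preset condition, so its feature vector is excluded from fusion."""
    keep = set(range(len(images)))
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            if ssim(images[i], images[j]) < threshold:
                # Large viewpoint discrepancy: do not fuse this pair.
                keep.discard(i)
                keep.discard(j)
    return [images[k] for k in sorted(keep)]
```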
In an embodiment of the present application, the similarity between any two face images is calculated as follows:

l(x, y) = (2·μ_x·μ_y + c_1) / (μ_x² + μ_y² + c_1)

c(x, y) = (2·σ_x·σ_y + c_2) / (σ_x² + σ_y² + c_2)

s(x, y) = (σ_xy + c_3) / (σ_x·σ_y + c_3)

where l(x, y) represents the similarity of sample x and sample y in luminance; c(x, y) represents their similarity in contrast; s(x, y) represents their similarity in structure; generally c_3 = c_2/2;

μ_x and μ_y are the means of sample x and sample y, σ_x and σ_y their standard deviations, and σ_xy their covariance; the constants c_1 and c_2 are calculated as:

c_1 = (k_1·L)²

c_2 = (k_2·L)²

where k_1 defaults to 0.01, k_2 defaults to 0.03, and L is the dynamic range of the pixel values;

the structural similarity (SSIM) formula is:

SSIM(x, y) = f[l(x, y)·c(x, y)·s(x, y)].
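The formulas above can be implemented directly. The sketch below computes a single global SSIM value over two equally sized grayscale images (rather than the windowed average used in common SSIM implementations) and takes f as the identity; both are simplifying assumptions:

```python
import numpy as np

def ssim(x, y, k1=0.01, k2=0.03, L=255):
    """Structural similarity of two grayscale images of equal shape."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()                  # means
    sigma_x, sigma_y = x.std(), y.std()              # standard deviations
    sigma_xy = ((x - mu_x) * (y - mu_y)).mean()      # covariance
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2
    l = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)              # luminance
    c = (2 * sigma_x * sigma_y + c2) / (sigma_x**2 + sigma_y**2 + c2)  # contrast
    s = (sigma_xy + c3) / (sigma_x * sigma_y + c3)                     # structure
    return l * c * s
```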
the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (7)

1. A 3D face construction method, characterized by comprising the following steps:
acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector;
generating a three-dimensional face shape model from the fused feature vector;
and filling the three-dimensional shape model with texture to generate a face model.
2. The 3D face construction method according to claim 1, wherein acquiring feature vectors of multiple view images of a face and fusing them to generate a fused feature vector comprises:
acquiring face data of the same face from multiple different views;
encoding each view's face data through a weight-sharing deep convolutional neural network to obtain a face feature vector for each view;
and concatenating the face feature vectors and performing feature fusion through a multilayer perceptron network to generate the fused feature vector.
3. The 3D face construction method according to claim 2, wherein generating a three-dimensional face shape model from the fused feature vector comprises:
acquiring the fused feature vector;
decoding the fused feature vector through a multilayer perceptron network to obtain a three-dimensional model head pose vector;
decoding the fused feature vector through a graph convolutional neural network to obtain three-dimensional vertex vectors; and
generating the three-dimensional face shape model from the head pose vector and the vertex vectors.
4. The 3D face construction method according to claim 2, wherein the head pose vector comprises:
at least one of the pitch angle of the face, the yaw angle of the face, the roll angle of the face, and a displacement vector in three-dimensional space.
5. The 3D face construction method according to claim 2, wherein texture filling the three-dimensional shape model comprises:
acquiring the fused feature vector;
and decoding the fused feature vector through a large-scale generative adversarial network to obtain face texture parameters, and filling the three-dimensional shape model according to the obtained parameters.
6. The 3D face construction method according to claim 1, further comprising calculating the similarity between any two of the multiple view images of the face; when the similarity between two view images does not meet a preset condition, the feature vectors of those two images are not fused.
7. The 3D face construction method according to claim 6, wherein the similarity between any two face images is calculated as follows:

l(x, y) = (2·μ_x·μ_y + c_1) / (μ_x² + μ_y² + c_1)

c(x, y) = (2·σ_x·σ_y + c_2) / (σ_x² + σ_y² + c_2)

s(x, y) = (σ_xy + c_3) / (σ_x·σ_y + c_3)

where l(x, y) represents the similarity of sample x and sample y in luminance; c(x, y) represents their similarity in contrast; s(x, y) represents their similarity in structure; generally c_3 = c_2/2;

μ_x and μ_y are the means of sample x and sample y, σ_x and σ_y their standard deviations, and σ_xy their covariance; the constants c_1 and c_2 are calculated as:

c_1 = (k_1·L)²

c_2 = (k_2·L)²

where k_1 defaults to 0.01, k_2 defaults to 0.03, and L is the dynamic range of the pixel values;

the structural similarity (SSIM) formula is:

SSIM(x, y) = f[l(x, y)·c(x, y)·s(x, y)].
Application CN202210954071.XA, filed 2022-08-10 (priority 2022-08-10): 3D face construction method. Status: Pending. Published as CN115512039A (en).

Priority Applications (1)

Application Number: CN202210954071.XA
Priority Date / Filing Date: 2022-08-10
Title: 3D face construction method

Publications (1)

Publication Number: CN115512039A (en)
Publication Date: 2022-12-23

Family ID: 84502197

Family Applications (1)

Application Number: CN202210954071.XA
Title: 3D face construction method
Priority Date / Filing Date: 2022-08-10

Country Status (1)

Country: CN
Document: CN115512039A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116563432A (en) * 2023-05-15 2023-08-08 摩尔线程智能科技(北京)有限责任公司 Three-dimensional digital person generating method and device, electronic equipment and storage medium
CN116563432B (en) * 2023-05-15 2024-02-06 摩尔线程智能科技(北京)有限责任公司 Three-dimensional digital person generating method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
Yuan et al. Plug-and-play algorithms for video snapshot compressive imaging
DE69934478T2 (en) Method and apparatus for image processing based on metamorphosis models
DE69932619T2 (en) METHOD AND SYSTEM FOR RECORDING AND REPRESENTING THREE-DIMENSIONAL GEOMETRY, COLOR AND SHADOWING OF ANIMATED OBJECTS
US20230316651A1 (en) Three-dimensional reconstruction and angle of view synthesis method for moving human body
DE102020000810A1 (en) 3D object reconstruction using a photometric network representation
CN112613609A (en) Nerve radiation field enhancement method based on joint pose optimization
Ghorai et al. Multiple pyramids based image inpainting using local patch statistics and steering kernel feature
CN115082639A (en) Image generation method and device, electronic equipment and storage medium
CN113065546A (en) Target pose estimation method and system based on attention mechanism and Hough voting
CN112446835B (en) Image restoration method, image restoration network training method, device and storage medium
CN115588038A (en) Multi-view depth estimation method
CN115761178A (en) Multi-view three-dimensional reconstruction method based on implicit neural representation
CN111862278B (en) Animation obtaining method and device, electronic equipment and storage medium
CN115170746A (en) Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN112489198A (en) Three-dimensional reconstruction system and method based on counterstudy
CN115471611A (en) Method for improving visual effect of 3DMM face model
CN115512039A (en) 3D face construction method
CN117274501B (en) Drivable digital person modeling method, device, equipment and medium
WO2021057091A1 (en) Viewpoint image processing method and related device
Tseng et al. Semi-supervised image depth prediction with deep learning and binocular algorithms
CN117372644A (en) Three-dimensional content generation method based on period implicit representation
CN117237501A (en) Hidden stylized new view angle synthesis method
CN116863053A (en) Point cloud rendering enhancement method based on knowledge distillation
Qiu et al. A GAN-based motion blurred image restoration algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination