CN108230438B - Face reconstruction method and device for voice-driven auxiliary side face image - Google Patents


Info

Publication number
CN108230438B
Authority
CN
China
Prior art keywords
face
model
voice
information
parameters
Prior art date
Legal status
Active
Application number
CN201711461073.0A
Other languages
Chinese (zh)
Other versions
CN108230438A (en)
Inventor
刘烨斌
苏肇祺
戴琼海
Current Assignee
Hangzhou Xinchangyuan Technology Co ltd
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201711461073.0A
Publication of CN108230438A
Application granted
Publication of CN108230438B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face reconstruction method and device for voice-driven auxiliary side-face images. The method comprises the following steps: extracting a training data set; obtaining the correspondence between speech features and expression parameters; extracting the speech features of the speech waveform window corresponding to each frame and storing the feature information into an input file; obtaining the facial expression parameters corresponding to each frame of speech information; acquiring the feature point information of the side face so as to obtain, by a deep learning method, the correspondence between the side-face image and the face feature points; fitting the feature point positions on the side face with an existing face model, solving the expression and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result. The method allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving the reliability of the reconstruction.

Description

Face reconstruction method and device for voice-driven auxiliary side face image
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method and a device for reconstructing a human face from side-face images with voice-driven assistance.
Background
High-quality three-dimensional models have important application value in fields such as film and television entertainment, cultural relic protection, and machining. Because human facial expressions are rich, three-dimensional face reconstruction is a major problem in the field of three-dimensional reconstruction.
Face reconstruction techniques in the related art mainly target frontal face images. For AR (augmented reality) devices, if the face of the wearer needs to be reconstructed without affecting the wearer's line of sight, miniature cameras must be arranged on both sides of the AR glasses. This makes face reconstruction difficult, especially the reconstruction of mouth movement, because the mouth information captured in side-face images is often incomplete.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a face reconstruction method for voice-driven auxiliary side-face images, which can track and reconstruct mouth motion information that is difficult to obtain from side-face images alone and effectively improve reconstruction reliability.
Another object of the present invention is to provide a face reconstruction device for voice-driven auxiliary side-face images.
In order to achieve the above object, an embodiment of the present invention provides a face reconstruction method for voice-driven auxiliary side-face images, comprising the following steps: extracting a training data set, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of a deep learning training set; performing deep learning training on the training data set with a convolutional neural network to obtain the correspondence between the speech features and the expression parameters; in the process of testing and using the neural network model, acquiring the side-face information and the spoken voice of the user through miniature cameras mounted on both sides of the AR device, extracting the speech features of the speech waveform window corresponding to each frame, and storing the feature information into an input file; feeding the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information; for the side-face information, likewise building a training set, mapping the feature points on the frontal face onto the side face through the calibrated camera parameters by means of existing frontal-face feature point matching, thereby obtaining the feature point information of the side face and, through a deep learning method, the correspondence between the side-face image and the face feature points; fitting the positions of the feature points on the side face with an existing face model, solving the expression parameters and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result.
According to the face reconstruction method for voice-driven auxiliary side-face images, deep learning can be used to match the collected sound information with the facial motion, the extracted facial motion parameters can then be used to guide the three-dimensional reconstruction of the face model, and the added voice-assisted reconstruction allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving reconstruction reliability.
In addition, the face reconstruction method for the voice-driven auxiliary side face image according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the extracting the training data set further includes: extracting parameters of the collected voice waveform window of each frame through a linear predictive coding method to serve as the voice characteristics of the frame, and inputting the parameters serving as a training set; and carrying out face model fitting on the collected front face image sequence and extracting the expression parameters to be output as a training set.
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced for weighting through the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
In order to achieve the above object, another embodiment of the present invention provides a face reconstruction device for voice-driven auxiliary side-face images, comprising: a first extraction module, configured to extract a training data set, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of a deep learning training set; a first acquisition module, configured to perform deep learning training on the training data set with a convolutional neural network to obtain the correspondence between the speech features and the expression parameters; a second extraction module, configured to, in the process of testing and using the neural network model, acquire the side-face information and the spoken voice of the user through miniature cameras mounted on both sides of the AR device, extract the speech features of the speech waveform window corresponding to each frame, and store the feature information into an input file; a placement module, configured to feed the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information; a second acquisition module, configured to likewise build a training set for the side-face information, map the feature points on the frontal face onto the side face through the calibrated camera parameters by means of existing frontal-face feature point matching, obtain the feature point information of the side face, and obtain the correspondence between the side-face image and the face feature points by a deep learning method; a weighting module, configured to fit the positions of the feature points on the side face with an existing face model, solve the expression parameters and shape parameters of the face, and introduce the expression parameters solved from the sound information for weighting; and a processing module, configured to perform texture mapping and matching on the fitted face model to obtain the final face reconstruction result.
The face reconstruction device for voice-driven auxiliary side-face images according to the embodiment of the present invention can use deep learning to match the collected sound information with the facial motion and then use the extracted facial motion parameters to guide the three-dimensional reconstruction of the face model; with the added voice-assisted reconstruction, mouth motion information that is difficult to obtain from side-face images can be tracked and reconstructed well, effectively improving reconstruction reliability.
In addition, the face reconstruction device for voice-driven auxiliary side face image according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the first extraction module is further configured to extract parameters from the collected speech waveform window of each frame through a linear predictive coding method, where the parameters are used as the speech features of the frame to be input as a training set, perform face model fitting on the collected front face image sequence, and extract the expression parameters to be output as the training set.
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced for weighting through the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for face reconstruction with voice-driven auxiliary side-face images according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The following describes a face reconstruction method and apparatus for a voice-driven auxiliary side face image according to an embodiment of the present invention with reference to the accompanying drawings, and first, a face reconstruction method for a voice-driven auxiliary side face image according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart of a face reconstruction method for voice-driven auxiliary side face images according to an embodiment of the present invention.
As shown in fig. 1, the method for reconstructing a human face by using voice to drive an auxiliary side face image comprises the following steps:
In step S101, a training data set is extracted, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of the deep learning training set.
It can be understood that, in the embodiment of the invention, miniature cameras mounted on both sides of the user's AR glasses collect the facial motion and the speech of the wearer, and, while the side-face information is collected and reconstructed, waveform features are extracted from the voice information and exploited with a deep learning method.
That is, miniature cameras are mounted on both sides of the AR device and the relative parameters of the two cameras are calibrated with a calibration plate; the wearer then puts on the AR device, and a camera is set up directly in front of the wearer to collect the training set required for deep learning.
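For illustration, the relative parameters of the two side cameras can be estimated from synchronized views of the calibration plate with OpenCV's stereo calibration. The sketch below is a minimal example only; the checkerboard geometry, square size, and file names are assumptions rather than values given in this description.

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry (inner corners) and square size in meters.
BOARD = (9, 6)
SQUARE = 0.025

# Template of 3D corner coordinates on the calibration plate (z = 0 plane).
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, pts_left, pts_right = [], [], []
for fl, fr in zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png"))):
    gl = cv2.imread(fl, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(fr, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(gl, BOARD)
    ok_r, corners_r = cv2.findChessboardCorners(gr, BOARD)
    if ok_l and ok_r:                     # keep only frames where both cameras see the board
        obj_pts.append(objp)
        pts_left.append(corners_l)
        pts_right.append(corners_r)

# Intrinsics of each miniature camera, then the rotation R and translation T between them.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, pts_left, gl.shape[::-1], None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, pts_right, gr.shape[::-1], None, None)
_, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, pts_left, pts_right, K1, d1, K2, d2, gl.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)
print("relative rotation:\n", R, "\nrelative translation:\n", T)
```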
Optionally, in an embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
For example, after three cameras are synchronized, the training set starts to be acquired, and the acquired results include a side face image sequence captured by the camera on the AR device, a front face image sequence captured by the front camera, and the voice information of the wearer acquired by the camera.
Further, in an embodiment of the present invention, extracting the training data set further includes: extracting parameters from the collected voice waveform window of each frame by a linear predictive coding method to serve as the voice characteristics of the frame to be input as a training set; and carrying out face model fitting on the collected front face image sequence and extracting expression parameters to be output as a training set.
It can be understood that, in the embodiment of the present invention, parameters of the acquired speech waveform window of each frame may be extracted by an LPC (linear predictive coding) method to serve as speech feature information of the frame, and the extracted parameters serve as input of a training set; and carrying out face model fitting on the collected front face image sequence, extracting expression parameters and outputting the expression parameters as a training set.
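As a concrete, hypothetical illustration of this step, the LPC coefficients of the speech window aligned with each video frame can be computed as below; the sample rate, the window length (one video frame of audio), and the LPC order are assumptions, not values specified here.

```python
import numpy as np
import librosa

def lpc_features(wav_path, fps=30, order=16):
    """Per-frame LPC coefficients for the speech window aligned with each video frame.

    The 16 kHz sample rate, the 1/fps-second window and the LPC order are
    illustrative assumptions.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    win = sr // fps                                 # audio samples covered by one video frame
    feats = []
    for start in range(0, len(y) - win, win):
        frame = y[start:start + win]
        frame = frame * np.hanning(len(frame))      # taper the window before LPC analysis
        a = librosa.lpc(frame, order=order)         # [1, a_1, ..., a_order]
        feats.append(a[1:])                         # drop the leading 1
    return np.stack(feats)                          # shape: (num_frames, order)

# Example: feats = lpc_features("wearer_speech.wav"); np.save("audio_input.npy", feats)
```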
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves. The specific form is:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
where R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function. The energy is minimized with a Gauss-Newton gradient descent method, solving for R, t, w_id, w_exp; w_exp is then extracted and used as the training-set output.
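A minimal numerical sketch of this fitting is given below, assuming a rotation-vector parameterization, a pinhole projection with known intrinsics K, and a smooth hinge in place of the step-function regularizer; it uses a generic nonlinear least-squares solver rather than a hand-written Gauss-Newton loop, and all shapes and helper names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def fit_face(Cr_lmk, p2d, K, lam=1.0, n_id=50, n_exp=47):
    """Sketch of minimizing E = E_data + lam * E_reg over R, t, w_id, w_exp.

    Cr_lmk : (L, 3, n_id, n_exp) bilinear tensor slices for the L landmark vertices
    p2d    : (L, 2) feature points detected on the image
    K      : (3, 3) camera intrinsic matrix
    """
    def unpack(x):
        return x[:3], x[3:6], x[6:6 + n_id], x[6 + n_id:]

    def residuals(x):
        rvec, t, w_id, w_exp = unpack(x)
        R = Rotation.from_rotvec(rvec).as_matrix()
        # Landmark positions from the bilinear model: contract identity and expression weights.
        V = np.einsum('lxie,i,e->lx', Cr_lmk, w_id, w_exp)       # (L, 3)
        cam = R @ V.T + t[:, None]                               # (3, L) points in the camera frame
        proj = K @ cam
        proj = (proj[:2] / proj[2]).T                            # (L, 2) projected pixels
        r_data = (proj - p2d).ravel()
        # Hinge penalty standing in for epsilon(|w| - 3): active only outside [-3, 3].
        r_reg = np.sqrt(lam) * np.concatenate([
            np.maximum(np.abs(w_id) - 3.0, 0.0),
            np.maximum(np.abs(w_exp) - 3.0, 0.0),
        ])
        return np.concatenate([r_data, r_reg])

    x0 = np.zeros(6 + n_id + n_exp)
    x0[5] = 2.0                     # place the face roughly in front of the camera
    sol = least_squares(residuals, x0)
    return unpack(sol.x)
```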
In step S102, a convolutional neural network is used to perform deep learning training on the training data set, and a corresponding relationship between the speech feature and the expression parameter is obtained.
It can be understood that, in the embodiment of the invention, deep learning training is performed on the training data set with a convolutional neural network so as to obtain the correspondence between the speech features and the expression parameters; training on the training set through the convolutional neural network yields the trained model.
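The description does not fix a network architecture, so the following is only a plausible sketch of a convolutional regressor from per-frame speech features to expression parameters; the context length, channel widths, number of expression parameters, and training loop are assumptions.

```python
import torch
import torch.nn as nn

class AudioToExpressionNet(nn.Module):
    """Maps a short window of per-frame LPC features to facial expression parameters."""

    def __init__(self, lpc_order=16, context=9, n_exp=47):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(lpc_order, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_exp),
        )

    def forward(self, x):            # x: (batch, lpc_order, context)
        return self.net(x)

def train(model, loader, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()           # regress the expression parameters fitted on frontal frames
    for _ in range(epochs):
        for feats, w_exp in loader:  # feats: LPC context windows, w_exp: target expression params
            opt.zero_grad()
            loss = loss_fn(model(feats), w_exp)
            loss.backward()
            opt.step()
    return model
```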
In step S103, in the process of testing and using the neural network model, the miniature cameras erected on both sides of the AR device collect the side face information and the speaking voice information of the user, extract the voice features of the voice waveform window corresponding to each frame, and store the feature information into the input file.
It can be understood that in the process of testing and using the model, the miniature cameras erected at two sides of the AR device are used for collecting the side face information and the speaking voice information of the user, extracting the voice characteristics of the voice waveform window corresponding to each frame, and storing the characteristic information into the input file.
For example, a training set is made for the side face information, feature points on the front face are mapped onto the side face through calibrated camera parameters by an existing front face feature point matching method, the feature point information of the side face is obtained, and the corresponding relation between the side face image and the face feature points is obtained through a deep learning method; or detecting the feature point information of the side face by using the existing side face feature point detection method.
In step S104, the input file is placed in the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information.
It can be understood that the embodiment of the invention can put the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of voice information.
In step S105, a training set is also created for the side face information, and feature points on the front face are mapped onto the side face through calibrated camera parameters by existing front face feature point matching, so as to obtain feature point information of the side face, so as to obtain a corresponding relationship between the side face image and the face feature points through a deep learning method.
It can be understood that, for the side face information, a training set is also made, and the feature points on the front face are mapped onto the side face through calibrated camera parameters by using the existing method for matching the feature points of the front face, so as to obtain the feature point information of the side face, thereby obtaining the corresponding relationship between the side face image and the feature points of the face through a deep learning method.
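One plausible reading of this mapping step, sketched below, is to take the 3D landmark positions from the frontal fit (expressed in the front camera's coordinate frame) and project them into a side camera using the calibrated relative pose; the helper name and the use of the fitted model for depth are assumptions, not details stated in the description.

```python
import cv2
import numpy as np

def transfer_landmarks_to_side(V_front, R_rel, T_rel, K_side, dist_side=None):
    """Project 3D landmarks in the front camera's frame into a side camera image,
    using the calibrated relative rotation R_rel and translation T_rel."""
    rvec, _ = cv2.Rodrigues(R_rel)                      # rotation matrix -> rotation vector
    dist = np.zeros(5) if dist_side is None else dist_side
    pts2d, _ = cv2.projectPoints(V_front.astype(np.float64), rvec,
                                 T_rel.reshape(3, 1), K_side, dist)
    return pts2d.reshape(-1, 2)                         # (num_landmarks, 2) side-image pixels

# side_pts = transfer_landmarks_to_side(V_landmarks, R, T, K_side)
# Each (side image, side landmark) pair then serves as a training sample for the
# side-face landmark detector.
```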
In step S106, the existing face model is used to fit the positions of the feature points on the side face, the expression parameters and shape parameters of the face are solved, and the expression parameters obtained by solving the sound information are introduced for weighting.
It can be understood that, in the embodiment of the present invention, an existing face model (e.g., FaceWarehouse) may be used to fit the positions of the feature points on the side face and to solve the expression parameters and shape parameters of the face, and the expression parameters solved from the sound information are then introduced for weighting.
It can be understood that the expression parameters obtained by solving the sound information according to the embodiment of the present invention are introduced into the following formulas for weighting:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
For example, during testing and use there are only the AR device and the miniature cameras on its two sides; there is no front-facing camera. Feature points are extracted from the side-face information by the method described above, and the energy terms above are minimized to solve for R, t, w_id, w_sideface. At the same time, speech features are extracted from the collected voice information by the method described above and fed to the trained model, which outputs the facial expression parameters w_audio. Finally, the two sets of expression parameters w_sideface and w_audio are linearly weighted by the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
and the face model parameters are converted into the three-dimensional face model.
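A minimal sketch of this final blend is shown below, assuming the bilinear tensor is stored as an array of shape (num_vertices, 3, n_id, n_exp) and using equal example weights λ1 = λ2 = 0.5; both assumptions are for illustration only.

```python
import numpy as np

def blended_face_vertices(Cr, w_id, w_audio, w_sideface, lam_audio=0.5, lam_side=0.5):
    """Blend the audio- and side-face-derived expression parameters, then evaluate
    the bilinear face model."""
    w_exp = lam_audio * np.asarray(w_audio) + lam_side * np.asarray(w_sideface)
    # V = Cr x2 w_id x3 w_exp : contract the identity and expression modes.
    V = np.einsum('vxie,i,e->vx', Cr, w_id, w_exp)
    return V, w_exp     # (num_vertices, 3) mesh vertices and the blended expression weights
```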
In step S107, texture mapping and matching are performed on the fitted face model to obtain a final face reconstruction result.
That is to say, the embodiment of the invention can perform texture matching and mapping on the generated face model to complete three-dimensional modeling.
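As a rough illustration only (the description does not detail the texture step), per-vertex colors could be sampled by projecting the reconstructed mesh back into a side-face image; a real pipeline would build a UV texture atlas and blend the two side views.

```python
import cv2
import numpy as np

def sample_vertex_colors(V, image, R, T, K):
    """Project each mesh vertex into a side-face image and sample the pixel color there."""
    rvec, _ = cv2.Rodrigues(R)
    pts, _ = cv2.projectPoints(V.astype(np.float64), rvec, T.reshape(3, 1), K, np.zeros(5))
    pts = np.round(pts.reshape(-1, 2)).astype(int)
    h, w = image.shape[:2]
    pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)    # clamp to the image bounds
    pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)
    return image[pts[:, 1], pts[:, 0]]          # (num_vertices, 3) per-vertex colors
```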
According to the face reconstruction method for voice-driven auxiliary side-face images provided by the embodiment of the invention, deep learning can be used to match the collected voice information with the facial motion, the extracted facial motion parameters can then be used to guide the three-dimensional reconstruction of the face model, and the added voice-assisted reconstruction allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving reconstruction reliability.
Next, a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 2 is a schematic structural diagram of a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention.
As shown in fig. 2, the face reconstruction device 10 for voice-driven auxiliary side-face images includes: a first extraction module 100, a first acquisition module 200, a second extraction module 300, a placement module 400, a second acquisition module 500, a weighting module 600, and a processing module 700.
The first extraction module 100 is configured to extract a training data set, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of the deep learning training set. The first acquisition module 200 is configured to perform deep learning training on the training data set with a convolutional neural network and obtain the correspondence between the speech features and the expression parameters. The second extraction module 300 is configured to, during testing and use of the neural network model, collect the side-face information and the spoken voice of the user through the miniature cameras mounted on both sides of the AR device, extract the speech features of the speech waveform window corresponding to each frame, and store the feature information in an input file. The placement module 400 is configured to feed the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information. The second acquisition module 500 is configured to likewise build a training set for the side-face information, map the feature points on the frontal face onto the side face through the calibrated camera parameters by means of existing frontal-face feature point matching, obtain the feature point information of the side face, and obtain the correspondence between the side-face image and the face feature points by a deep learning method. The weighting module 600 is configured to fit the positions of the feature points on the side face with an existing face model, solve the expression parameters and shape parameters of the face, and introduce the expression parameters solved from the sound information for weighting. The processing module 700 is configured to perform texture mapping and matching on the fitted face model to obtain the final face reconstruction result. The device 10 of the embodiment of the invention can track and reconstruct mouth motion information that is difficult to obtain from side-face images and effectively improves reconstruction reliability.
Further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side-face images captured by a camera on the AR device, a sequence of frontal face images captured by the front camera, and the voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the first extraction module 100 is further configured to extract parameters from the acquired speech waveform window of each frame through a linear predictive coding method, as speech features of the frame, to serve as a training set input, perform face model fitting on the acquired front face image sequence, and extract expression parameters, to serve as a training set output.
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced for weighting through the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
It should be noted that the foregoing explanation on the embodiment of the method for reconstructing a human face by using a voice to drive an auxiliary side face image is also applicable to the device for reconstructing a human face by using a voice to drive an auxiliary side face image in this embodiment, and details are not repeated here.
According to the face reconstruction device for voice-driven auxiliary side-face images provided by the embodiment of the invention, deep learning can be used to match the collected voice information with the facial motion, the extracted facial motion parameters can then be used to guide the three-dimensional reconstruction of the face model, and the added voice-assisted reconstruction allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving reconstruction reliability.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A human face reconstruction method for driving an auxiliary side face image by sound is characterized by comprising the following steps:
extracting a training data set, wherein if a wearer speaks a section of corpus to the camera lens through the front face of the AR device, voice information collected by the camera is extracted and converted into voice features, facial image information collected by the camera is extracted, and expression parameters are extracted, wherein the voice features and the expression parameters are respectively used as input and output of a deep learning training set;
performing deep learning training on the training data set by using a convolutional neural network to obtain the corresponding relation between the voice characteristics and the expression parameters;
in the process of testing and using a neural network model, acquiring side face information and speaking voice information of a user through miniature cameras erected on two sides of the AR device, extracting voice characteristics of a voice waveform window corresponding to each frame, and storing the characteristic information into an input file;
putting the input file into a trained neural network model to obtain a facial expression parameter corresponding to each frame of voice information;
for the side face information, a training set is also made, the feature points on the front face are mapped onto the side face through calibrated camera parameters through the existing front face feature point matching, the feature point information of the side face is obtained, and the corresponding relation between the side face image and the face feature points is obtained through a deep learning method;
the method comprises the following steps of fitting the positions of characteristic points on a side face by using an existing face model, solving expression parameters and shape parameters of the face, and introducing expression parameters obtained by solving sound information for weighting, wherein the fitting process constrains the distance between the positions of the model characteristic points projected on an image and the positions of the characteristic points detected on a front face image, and the regular constraint of the model parameters per se, and specifically comprises the following steps: e ═ Edata+λEreg
Figure FDA0002416331480000011
Ereg=∑id∈(|wid|-3)+∑exp∈(|wexpL-3), wherein R, t, wid,wexpFor the parameters to be solved, Proj is the camera projection matrix,
Figure FDA0002416331480000012
tensor base, p, for the ith eigenpoint on the modeliIs the coordinate of the ith characteristic point on the image, and epsilon (-) is a step function, widIs a shape parameter of the model, wexpExpression parameters of the model; and
and performing texture mapping and matching on the fitted face model to obtain a final face reconstruction result.
2. The method of claim 1, wherein, when the training set is collected, the collected results comprise a sequence of side-face images captured by a camera on the AR device, a sequence of frontal face images captured by a front camera, and the voice information of the wearer collected by the camera.
3. The method for reconstructing a human face from an audio-driven auxiliary side-face image according to claim 2, wherein the extracting a training data set further comprises:
extracting parameters of the collected voice waveform window of each frame through a linear predictive coding method to serve as the voice characteristics of the frame, and inputting the parameters serving as a training set;
and carrying out face model fitting on the collected front face image sequence and extracting the expression parameters to be output as a training set.
4. The method of claim 1, wherein the expression parameters obtained by the solution of the audio information are introduced into the formula for weighting as follows:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is a shape parameter of the model, w_exp is an expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
5. A face reconstruction device that uses voice to drive an auxiliary side face image, comprising:
the system comprises a first extraction module, a second extraction module and a third extraction module, wherein if a face of an AR device wearer speaks a section of corpus to a camera lens, voice information collected by a camera is extracted and converted into voice features, face image information collected by the camera is extracted, and expression parameters are extracted, and the voice features and the expression parameters are respectively used as input and output of a deep learning training set;
the first acquisition module is used for carrying out deep learning training on the training data set by using a convolutional neural network to acquire the corresponding relation between the voice feature and the expression parameter;
the second extraction module is used for acquiring side face information and speaking voice information of a user through miniature cameras erected on two sides of the AR device in the process of testing and using the neural network model, extracting voice characteristics of a voice waveform window corresponding to each frame and storing the characteristic information into an input file;
the input file is placed into a trained neural network model to obtain facial expression parameters corresponding to each frame of voice information;
the second acquisition module is used for making a training set for the side face information, mapping the feature points on the front face to the side face through the calibrated camera parameters by matching the existing front face feature points, acquiring the feature point information of the side face and acquiring the corresponding relation between the side face image and the face feature points by a deep learning method;
the weighting module is used for fitting the positions of the feature points on the side face by using an existing face model, solving the expression parameters and the shape parameters of the face, and introducing the expression parameters obtained by solving the sound information for weighting, wherein the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, ε(·) is a step function, w_id is a shape parameter of the model, and w_exp is an expression parameter of the model; and
and the processing module is used for performing texture mapping and matching on the fitted face model to obtain a final face reconstruction result.
6. The device for reconstructing a human face by using voice to drive auxiliary side-face images according to claim 5, wherein the results collected when the training set is collected include a sequence of side-face images captured by a camera on the AR device, a sequence of front-face images captured by a front camera, and the voice information of the wearer collected by the camera.
7. The device for reconstructing a human face according to claim 6, wherein the first extracting module is further configured to extract parameters from the collected speech waveform window of each frame by a linear predictive coding method as the speech features of the frame to be input as a training set, perform face model fitting on the collected front face image sequence, and extract the expression parameters to be output as the training set.
8. The device for reconstructing a human face by using a voice-driven auxiliary side face image according to claim 5, wherein the expression parameters obtained by solving the voice information are introduced into the formula for weighting:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is a shape parameter of the model, w_exp is an expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
CN201711461073.0A 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image Active CN108230438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711461073.0A CN108230438B (en) 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711461073.0A CN108230438B (en) 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image

Publications (2)

Publication Number Publication Date
CN108230438A CN108230438A (en) 2018-06-29
CN108230438B true CN108230438B (en) 2020-06-19

Family

ID=62645827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711461073.0A Active CN108230438B (en) 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image

Country Status (1)

Country Link
CN (1) CN108230438B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW202009924A (en) * 2018-08-16 2020-03-01 國立臺灣科技大學 Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
KR102537784B1 (en) * 2018-08-17 2023-05-30 삼성전자주식회사 Electronic device and control method thereof
CN110211582A (en) * 2019-05-31 2019-09-06 量子动力(深圳)计算机科技有限公司 A kind of real-time, interactive intelligent digital virtual actor's facial expression driving method and system
CN111243626B (en) * 2019-12-30 2022-12-09 清华大学 Method and system for generating speaking video
CN111294665B (en) * 2020-02-12 2021-07-20 百度在线网络技术(北京)有限公司 Video generation method and device, electronic equipment and readable storage medium
CN111432233B (en) * 2020-03-20 2022-07-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN112001992A (en) * 2020-07-02 2020-11-27 超维视界(北京)传媒科技有限公司 Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
CN112102468B (en) * 2020-08-07 2022-03-04 北京汇钧科技有限公司 Model training method, virtual character image generation device, and storage medium
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113822969B (en) * 2021-09-15 2023-06-09 宿迁硅基智能科技有限公司 Training neural radiation field model, face generation method, device and server

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102157010A (en) * 2011-05-25 2011-08-17 上海大学 Method for realizing three-dimensional facial animation based on layered modeling and multi-body driving
US9302179B1 (en) * 2013-03-07 2016-04-05 Posit Science Corporation Neuroplasticity games for addiction
CN103218842B (en) * 2013-03-12 2015-11-25 西南交通大学 A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN104951743A (en) * 2015-03-04 2015-09-30 苏州大学 Active-shape-model-algorithm-based method for analyzing face expression
CN106878677B (en) * 2017-01-23 2020-01-07 西安电子科技大学 Student classroom mastery degree evaluation system and method based on multiple sensors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Simultaneous Facial Feature Tracking and Facial Expression Recognition; Yongqiang Li et al.; IEEE Transactions on Image Processing; 2013-07-31; pp. 2559-2573 *

Also Published As

Publication number Publication date
CN108230438A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108230438B (en) Face reconstruction method and device for voice-driven auxiliary side face image
CN110321754B (en) Human motion posture correction method and system based on computer vision
CN102697508B (en) Method for performing gait recognition by adopting three-dimensional reconstruction of monocular vision
CN110428493B (en) Single-image human body three-dimensional reconstruction method and system based on grid deformation
CN103971408B (en) Three-dimensional facial model generating system and method
CN110544301A (en) Three-dimensional human body action reconstruction system, method and action training system
Uddin et al. Human activity recognition using body joint‐angle features and hidden Markov model
CN112069933A (en) Skeletal muscle stress estimation method based on posture recognition and human body biomechanics
CN106997605B (en) A method of foot type video is acquired by smart phone and sensing data obtains three-dimensional foot type
CN111931804B (en) Human body action automatic scoring method based on RGBD camera
CN110969124A (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN109919141A (en) A kind of recognition methods again of the pedestrian based on skeleton pose
CN108734776A (en) A kind of three-dimensional facial reconstruction method and equipment based on speckle
CN105138980A (en) Identify authentication method and system based on identity card information and face identification
CN110544302A (en) Human body action reconstruction system and method based on multi-view vision and action training system
CN109063643B (en) Facial expression pain degree identification method under condition of partial hiding of facial information
CN111012353A (en) Height detection method based on face key point recognition
CN110717391A (en) Height measuring method, system, device and medium based on video image
JP2002197443A (en) Generator of three-dimensional form data
CN112365578A (en) Three-dimensional human body model reconstruction system and method based on double cameras
CN104732586B (en) A kind of dynamic body of 3 D human body and three-dimensional motion light stream fast reconstructing method
He et al. A New Kinect‐Based Posture Recognition Method in Physical Sports Training Based on Urban Data
CN107704851A (en) Character recognition method, Public Media exhibiting device, server and system
JP2021086274A (en) Lip reading device and lip reading method
CN110060334A (en) Calculating integration imaging image reconstructing method based on Scale invariant features transform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221220

Address after: Room 3346, Floor 3, International Innovation Expo Center, No. 267, Kejiyuan Road, Baiyang Street, Qiantang District, Hangzhou, Zhejiang 310020

Patentee after: Hangzhou Xinchangyuan Technology Co.,Ltd.

Address before: 100084 Tsinghua Yuan, Beijing, Haidian District

Patentee before: TSINGHUA University