CN108230438B - Face reconstruction method and device for voice-driven auxiliary side face image - Google Patents
Face reconstruction method and device for voice-driven auxiliary side face image
- Publication number: CN108230438B (application CN201711461073.0A)
- Authority: CN (China)
- Prior art keywords: face, model, voice, information, parameters
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a face reconstruction method and device for voice-driven auxiliary side face images, wherein the method comprises the following steps: extracting a training data set; obtaining the correspondence between voice features and expression parameters; extracting the voice features of the voice waveform window corresponding to each frame and storing the feature information into an input file; obtaining the facial expression parameters corresponding to each frame of voice information; acquiring feature point information of the side face so as to obtain, by a deep learning method, the correspondence between the side face image and the facial feature points; fitting the feature point positions on the side face with an existing face model, solving the expression and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result. The method enables mouth movement information that is difficult to obtain from side face images alone to be tracked and reconstructed well, effectively improving the reliability of reconstruction.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method and a device for reconstructing a human face from side face images assisted by voice-driven information.
Background
High-quality three-dimensional models have important application value in fields such as film and television entertainment, cultural relic protection, and machining. Because human facial expressions are rich, three-dimensional face reconstruction remains a difficult problem in the field of three-dimensional reconstruction.
Face reconstruction in the related art mainly targets frontal face images. For AR (augmented reality) devices, however, reconstructing the wearer's face requires mounting miniature cameras on both sides of the AR glasses so as not to block the wearer's line of sight. This makes face reconstruction difficult, especially reconstruction of mouth movement, because the mouth information captured in side face images is often incomplete.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a face reconstruction method for voice-driven auxiliary side face images, which can better track and reconstruct the mouth motion information that is difficult to obtain through the side face images, and effectively improve the reconstruction reliability.
Another object of the present invention is to provide a face reconstruction device for voice-driving auxiliary side face images.
In order to achieve the above object, an embodiment of the present invention provides a face reconstruction method for voice-driven auxiliary side face images, comprising the following steps: extracting a training data set, wherein the AR device wearer faces a camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are derived from it, and the speech features and the expression parameters serve respectively as the input and output of a deep learning training set; performing deep learning training on the training data set with a convolutional neural network to obtain the correspondence between the speech features and the expression parameters; in the process of testing and using the neural network model, acquiring the user's side face information and spoken voice information through miniature cameras mounted on both sides of the AR device, extracting the speech features of the voice waveform window corresponding to each frame, and storing the feature information into an input file; feeding the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of voice information; likewise making a training set for the side face information, mapping the feature points on the front face onto the side face through the calibrated camera parameters using existing frontal feature point matching, thereby obtaining the feature point information of the side face and, by a deep learning method, the correspondence between the side face image and the facial feature points; fitting the feature point positions on the side face with an existing face model, solving the expression and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result.
According to the face reconstruction method for voice-driven auxiliary side face images, deep learning can be used to match the collected voice information with facial movement, and the extracted facial movement parameters then guide the three-dimensional reconstruction of the face model. With the added voice-information-assisted reconstruction, the mouth movement information that is difficult to obtain from side face images can be well tracked and reconstructed, effectively improving reconstruction reliability.
In addition, the face reconstruction method for the voice-driven auxiliary side face image according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the extracting the training data set further includes: extracting parameters of the collected voice waveform window of each frame through a linear predictive coding method to serve as the voice characteristics of the frame, and inputting the parameters serving as a training set; and carrying out face model fitting on the collected front face image sequence and extracting the expression parameters to be output as a training set.
Further, in an embodiment of the present invention, the fitting process constrains the distances between the positions of the model feature points projected onto the image and the feature point positions detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:

E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·C_i + t) − p_i‖²,
E_reg = Σ_id ε(|w_id| − 3) + Σ_exp ε(|w_exp| − 3),

wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_i is the tensor base corresponding to the i-th feature point on the model, p_i is the coordinate of the i-th feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced into a weighting formula:

V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,

wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side face image information, and λ1, λ2 are the weights for the two sets of fitted expression parameters.
In order to achieve the above object, another embodiment of the present invention provides a face reconstruction device for voice-driven auxiliary side face images, comprising: a first extraction module for extracting a training data set, wherein the AR device wearer faces a camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are derived from it, and the speech features and the expression parameters serve respectively as the input and output of a deep learning training set; a first acquisition module for performing deep learning training on the training data set with a convolutional neural network to obtain the correspondence between the speech features and the expression parameters; a second extraction module for acquiring, in the process of testing and using the neural network model, the user's side face information and spoken voice information through miniature cameras mounted on both sides of the AR device, extracting the speech features of the voice waveform window corresponding to each frame, and storing the feature information into an input file; a placement module for feeding the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of voice information; a second acquisition module for likewise making a training set for the side face information, mapping the feature points on the front face onto the side face through the calibrated camera parameters using existing frontal feature point matching, thereby obtaining the feature point information of the side face and, by a deep learning method, the correspondence between the side face image and the facial feature points; a weighting module for fitting the feature point positions on the side face with an existing face model, solving the expression and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and a processing module for performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result.
The face reconstruction device for voice-driven auxiliary side face images can use deep learning to match the collected voice information with facial movement, and the extracted facial movement parameters then guide the three-dimensional reconstruction of the face model. With the added voice-information-assisted reconstruction method, the mouth movement information that is difficult to obtain from side face images can be better tracked and reconstructed, effectively improving reconstruction reliability.
In addition, the face reconstruction device for voice-driven auxiliary side face image according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the first extraction module is further configured to extract parameters from the collected voice waveform window of each frame through a linear predictive coding method, using the parameters as the speech features of that frame for the training-set input, and to perform face model fitting on the collected front face image sequence and extract the expression parameters as the training-set output.
Further, in an embodiment of the present invention, the fitting process constrains the distances between the positions of the model feature points projected onto the image and the feature point positions detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:

E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·C_i + t) − p_i‖²,
E_reg = Σ_id ε(|w_id| − 3) + Σ_exp ε(|w_exp| − 3),

wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_i is the tensor base corresponding to the i-th feature point on the model, p_i is the coordinate of the i-th feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced into a weighting formula:

V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,

wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side face image information, and λ1, λ2 are the weights for the two sets of fitted expression parameters.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for face reconstruction with voice-driven auxiliary side-face images according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a face reconstruction method and apparatus for a voice-driven auxiliary side face image according to an embodiment of the present invention with reference to the accompanying drawings, and first, a face reconstruction method for a voice-driven auxiliary side face image according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart of a face reconstruction method for voice-driven auxiliary side face images according to an embodiment of the present invention.
As shown in fig. 1, the method for reconstructing a human face by using voice to drive an auxiliary side face image comprises the following steps:
In step S101, a training data set is extracted, wherein the AR device wearer faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are derived from it, and the speech features and the expression parameters serve respectively as the input and output of the deep learning training set.
It can be understood that the embodiment of the invention can collect the face movement and the speaking voice of the person by erecting the miniature cameras on the two sides of the AR glasses of the user, and extract the waveform characteristics in the voice information by using a deep learning method while collecting and reconstructing the side face information.
Namely, micro cameras are erected on two sides of the AR equipment, relative parameters of the two cameras are calibrated through a calibration plate, a wearer wears the AR equipment, and the cameras are erected right in front of the wearer to collect a training set required by deep learning.
Optionally, in an embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
For example, after three cameras are synchronized, the training set starts to be acquired, and the acquired results include a side face image sequence captured by the camera on the AR device, a front face image sequence captured by the front camera, and the voice information of the wearer acquired by the camera.
Further, in an embodiment of the present invention, extracting the training data set further includes: extracting parameters from the collected voice waveform window of each frame by a linear predictive coding method to serve as the voice characteristics of the frame to be input as a training set; and carrying out face model fitting on the collected front face image sequence and extracting expression parameters to be output as a training set.
It can be understood that, in the embodiment of the present invention, parameters of the acquired speech waveform window of each frame may be extracted by an LPC (linear predictive coding) method to serve as speech feature information of the frame, and the extracted parameters serve as input of a training set; and carrying out face model fitting on the collected front face image sequence, extracting expression parameters and outputting the expression parameters as a training set.
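As an illustrative sketch of the LPC step (the patent gives no implementation; the window length, LPC order of 12, and the Hann taper below are assumptions), per-frame coefficients can be computed with the autocorrelation method and the Levinson-Durbin recursion:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """Estimate LPC coefficients for one speech frame via the
    autocorrelation method (Levinson-Durbin recursion)."""
    frame = frame * np.hanning(len(frame))            # taper the window
    # autocorrelation at lags 0..order
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                        # prediction error power
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])    # correlation with past coeffs
        k = -acc / err                                # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a[1:]                                      # the per-frame feature vector
```

The returned vector would be written out per frame as one row of the training-set input file.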
Further, in an embodiment of the present invention, the fitting process constrains the distances between the positions of the model feature points projected onto the image and the feature point positions detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:

E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·C_i + t) − p_i‖²,
E_reg = Σ_id ε(|w_id| − 3) + Σ_exp ε(|w_exp| − 3),

wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_i is the tensor base corresponding to the i-th feature point on the model, p_i is the coordinate of the i-th feature point on the image, and ε(·) is a step function.
The fitting process thus constrains the distance between the position of each model feature point projected onto the image and the position of the corresponding feature point detected on the frontal image, together with the regularization constraint on the model parameters themselves. The energy E is minimized with a Gauss-Newton descent method, yielding R, t, w_id, w_exp; w_exp is then extracted and output as part of the training set.
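The energy evaluation can be sketched as follows (a minimal illustration only; the tensor layout, the squared-reprojection form of E_data, and the hard step regularizer penalizing |w| > 3 are assumptions based on the parameter descriptions above):

```python
import numpy as np

def fitting_energy(R, t, w_id, w_exp, C, landmarks_2d, K, lam=0.1):
    """Evaluate E = E_data + lam * E_reg for a bilinear landmark model.

    C: (n_lm, n_id, n_exp, 3) tensor basis for the landmark vertices,
    landmarks_2d: (n_lm, 2) feature points detected on the image,
    K: 3x3 camera projection matrix (Proj)."""
    verts = np.einsum("nijk,i,j->nk", C, w_id, w_exp)  # bilinear contraction
    cam = verts @ R.T + t                              # rigid transform
    proj = cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]                  # perspective divide
    e_data = np.sum((proj - landmarks_2d) ** 2)
    # step regularizer ε(|w| - 3): count parameters exceeding magnitude 3
    e_reg = np.sum(np.abs(w_id) > 3.0) + np.sum(np.abs(w_exp) > 3.0)
    return e_data + lam * e_reg
```

A Gauss-Newton solver would iterate on the smooth data term while the regularizer keeps the parameters inside the plausible range.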
In step S102, a convolutional neural network is used to perform deep learning training on the training data set, and a corresponding relationship between the speech feature and the expression parameter is obtained.
It can be understood that the embodiment of the invention can train a data set and use a convolutional neural network for deep learning training, so as to obtain the corresponding relation between the voice feature and the expression parameter. And carrying out deep learning training on the training set through a convolutional neural network to obtain a trained model.
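A minimal numpy sketch of such a network's forward pass is shown below (the kernel width, layer sizes, 9-frame window, 12 LPC coefficients, and 46-dimensional expression vector are all assumptions; a real implementation would train the weights by backpropagation in a deep learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w, b):
    """Valid 1-D convolution with ReLU: x (L, Cin), w (K, Cin, Cout), b (Cout)."""
    K = w.shape[0]
    out = np.stack([np.einsum("kc,kco->o", x[i:i + K], w) + b
                    for i in range(x.shape[0] - K + 1)])
    return np.maximum(out, 0.0)

def audio_to_expression(lpc_window, params):
    """Map a window of per-frame LPC features to expression weights."""
    h = conv1d(lpc_window, params["w1"], params["b1"])  # temporal convolution
    h = h.mean(axis=0)                                  # global average pooling
    return h @ params["w2"] + params["b2"]              # linear head -> w_exp

# hypothetical sizes: 9-frame window, 12 LPC coeffs, 46 expression weights
params = {
    "w1": rng.normal(0, 0.1, (3, 12, 32)), "b1": np.zeros(32),
    "w2": rng.normal(0, 0.1, (32, 46)),    "b2": np.zeros(46),
}
w_audio = audio_to_expression(rng.normal(size=(9, 12)), params)
```

At training time the targets for `w_audio` are the expression parameters fitted on the synchronized frontal images.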
In step S103, in the process of testing and using the neural network model, the miniature cameras erected on both sides of the AR device collect the side face information and the speaking voice information of the user, extract the voice features of the voice waveform window corresponding to each frame, and store the feature information into the input file.
It can be understood that in the process of testing and using the model, the miniature cameras erected at two sides of the AR device are used for collecting the side face information and the speaking voice information of the user, extracting the voice characteristics of the voice waveform window corresponding to each frame, and storing the characteristic information into the input file.
For example, a training set is made for the side face information, feature points on the front face are mapped onto the side face through calibrated camera parameters by an existing front face feature point matching method, the feature point information of the side face is obtained, and the corresponding relation between the side face image and the face feature points is obtained through a deep learning method; or detecting the feature point information of the side face by using the existing side face feature point detection method.
In step S104, the input file is placed in the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information.
It can be understood that the embodiment of the invention can put the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of voice information.
In step S105, a training set is also created for the side face information, and feature points on the front face are mapped onto the side face through calibrated camera parameters by existing front face feature point matching, so as to obtain feature point information of the side face, so as to obtain a corresponding relationship between the side face image and the face feature points through a deep learning method.
It can be understood that, for the side face information, a training set is also made, and the feature points on the front face are mapped onto the side face through calibrated camera parameters by using the existing method for matching the feature points of the front face, so as to obtain the feature point information of the side face, thereby obtaining the corresponding relationship between the side face image and the feature points of the face through a deep learning method.
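Given the calibrated relative pose between the frontal and side cameras, a frontal landmark with a known 3-D position can be transferred into the side view. A sketch under a pinhole camera model (the variable names are assumptions):

```python
import numpy as np

def map_landmark_to_side(X_front, R_rel, t_rel, K_side):
    """Transfer a 3-D landmark given in the frontal camera frame into
    pixel coordinates of the side camera, using the calibrated relative
    pose (R_rel, t_rel) and the side-camera intrinsics K_side."""
    X_side = R_rel @ X_front + t_rel   # change of camera frame
    x = K_side @ X_side                # pinhole projection
    return x[:2] / x[2]               # perspective divide -> (u, v)
```

Running this over all frontal landmarks yields the side-face feature point annotations used as deep-learning training labels.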
In step S106, the existing face model is used to fit the positions of the feature points on the side face, the expression parameters and shape parameters of the face are solved, and the expression parameters obtained by solving the sound information are introduced for weighting.
It can be understood that, in the embodiment of the present invention, the existing face model (e.g., FaceWarehouse) may be used to fit the positions of the feature points on the side face, and the expression parameters and the shape parameters of the face may be solved. And introducing expression parameters obtained by solving the sound information for weighting.
It can be understood that, in the embodiment of the present invention, the expression parameters solved from the sound information are introduced into a weighting formula:

V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,

wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side face image information, and λ1, λ2 are the weights for the two sets of fitted expression parameters.
For example, during testing and use only the AR device and the miniature cameras on its two sides are available; there is no front-facing camera. Feature points are extracted from the side face information by the method described above, and the energy terms are minimized to solve for R, t, w_id, w_sideface. Meanwhile, speech features are extracted from the collected voice information by the method described above and fed to the trained model, which outputs the facial expression parameters w_audio. Finally, the two groups of expression parameters w_sideface and w_audio are linearly weighted by the formula:

w_exp = λ1·w_audio + λ2·w_sideface,

and the resulting face model parameters are converted into a three-dimensional face model.
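The fusion and the conversion to mesh vertices can be sketched together (the tensor sizes and the equal default weights are assumptions):

```python
import numpy as np

def reconstruct_vertices(C, w_id, w_audio, w_sideface, lam1=0.5, lam2=0.5):
    """Fuse the two expression estimates,
        w_exp = lam1 * w_audio + lam2 * w_sideface,
    then evaluate the bilinear model V = C x2 w_id x3 w_exp
    to obtain the 3-D vertex positions of the face mesh."""
    w_exp = lam1 * w_audio + lam2 * w_sideface
    # contract the (n_verts, n_id, n_exp, 3) basis with both weight vectors
    return np.einsum("vijk,i,j->vk", C, w_id, w_exp)
```

In practice λ1 and λ2 would be tuned so that the audio channel dominates for the mouth region, where side-face observations are weakest.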
In step S107, texture mapping and matching are performed on the fitted face model to obtain a final face reconstruction result.
That is to say, the embodiment of the invention can perform texture matching and mapping on the generated face model to complete three-dimensional modeling.
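A crude nearest-neighbour texture lookup illustrating this mapping step (border clamping and per-vertex colouring are assumptions; a real pipeline would also handle visibility and occlusion):

```python
import numpy as np

def sample_texture(verts_cam, K, image):
    """Project each model vertex into the image with intrinsics K and
    sample the nearest pixel value as its texture colour."""
    p = verts_cam @ K.T
    uv = np.rint(p[:, :2] / p[:, 2:3]).astype(int)   # round to nearest pixel
    h, w = image.shape[:2]
    u = np.clip(uv[:, 0], 0, w - 1)                  # column index
    v = np.clip(uv[:, 1], 0, h - 1)                  # row index
    return image[v, u]
```

The sampled colours are then baked into the texture map of the reconstructed mesh.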
According to the face reconstruction method for voice-driven auxiliary side face images provided by the embodiment of the invention, deep learning can be used to match the collected voice information with facial movement, and the extracted facial movement parameters then guide the three-dimensional reconstruction of the face model. With the added voice-information-assisted reconstruction, the mouth movement information that is difficult to obtain from side face images can be better tracked and reconstructed, effectively improving reconstruction reliability.
Next, a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 2 is a schematic structural diagram of a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention.
As shown in fig. 2, the face reconstruction device 10 for voice-driving an auxiliary side face image includes: a first extraction module 100, a first acquisition module 200, a second extraction module 300, a placement module 400, a second acquisition module 500, a weighting module 600, and a processing module 700.
The first extraction module 100 is configured to extract a training data set, wherein the AR device wearer faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are derived from it, and the speech features and the expression parameters serve respectively as the input and output of the deep learning training set. The first obtaining module 200 is configured to perform deep learning training on the training data set with a convolutional neural network and obtain the correspondence between the speech features and the expression parameters. The second extraction module 300 is configured to collect, during testing and use of the neural network model, the user's side face information and spoken voice information through the miniature cameras mounted on both sides of the AR device, extract the speech features of the voice waveform window corresponding to each frame, and store the feature information in an input file. The placement module 400 is configured to feed the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of voice information. The second obtaining module 500 is configured to likewise make a training set for the side face information, map the feature points on the front face onto the side face through the calibrated camera parameters using existing frontal feature point matching, obtain the feature point information of the side face, and obtain the correspondence between the side face image and the facial feature points by a deep learning method.
The weighting module 600 is configured to fit the feature point positions on the side face with an existing face model, solve the expression and shape parameters of the face, and introduce the expression parameters solved from the sound information for weighting. The processing module 700 is configured to perform texture mapping and matching on the fitted face model to obtain the final face reconstruction result. The device 10 of the embodiment of the invention can well track and reconstruct the mouth movement information that is difficult to obtain from side face images, effectively improving reconstruction reliability.
Further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and the wearer's voice information collected by the camera.
Further, in an embodiment of the present invention, the first extraction module 100 is further configured to extract parameters from the acquired speech waveform window of each frame through a linear predictive coding method as the speech features of that frame, to serve as the training set input, and to perform face model fitting on the acquired front face image sequence and extract the expression parameters, to serve as the training set output.
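The linear predictive coding step above can be sketched as follows. This is a generic autocorrelation/Levinson-Durbin implementation, not the patent's own code; the frame length, prediction order, and Hamming windowing are assumptions:

```python
import numpy as np

def lpc_coefficients(frame, order=12):
    """Estimate LPC coefficients for one speech frame using the
    autocorrelation method and the Levinson-Durbin recursion."""
    frame = frame * np.hamming(len(frame))                      # taper the frame
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags 0..N-1
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                                  # prediction error power
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                                          # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

# Synthetic AR(1) "speech" frame: x[n] = 0.9*x[n-1] + noise,
# so the first LPC coefficient should come out near -0.9.
rng = np.random.default_rng(0)
x = np.zeros(400)
for n in range(1, 400):
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()
coeffs = lpc_coefficients(x, order=2)
```

Per frame, the vector of coefficients (excluding the leading 1) would be stored as that frame's speech feature in the training-set input.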
Further, in an embodiment of the present invention, the fitting process constrains both the distances between the positions of the model feature points projected onto the image and the feature point positions detected on the frontal face image, and a regularization of the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_reg = Σ_id ε(|w_id| − 3) + Σ_exp ε(|w_exp| − 3),
wherein R, t, w_id and w_exp are the parameters to be solved, Proj is the camera projection matrix, the tensor base of the i-th feature point on the model gives that point's position, p_i is the coordinate of the i-th feature point on the image, and ε(·) is a step function.
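The fitting energy can be sketched numerically. The text does not reproduce the exact data term, so the sketch below assumes a squared reprojection residual for E_data and interprets the step function ε(·) as a 0/1 penalty on any parameter whose magnitude exceeds 3 (i.e. three standard deviations of the parameter basis):

```python
import numpy as np

def step(x):
    """Step-function penalty: 1 where x > 0 (parameter outside its +/-3 bound)."""
    return (x > 0).astype(float)

def fitting_energy(projected, detected, w_id, w_exp, lam=0.1):
    """E = E_data + lam * E_reg, with an assumed squared-reprojection data term
    and the step-function regularizer on shape and expression parameters."""
    e_data = np.sum((projected - detected) ** 2)   # reprojection residual
    e_reg = np.sum(step(np.abs(w_id) - 3)) + np.sum(step(np.abs(w_exp) - 3))
    return e_data + lam * e_reg
```

A solver would minimize this energy over R, t, w_id and w_exp; `projected` stands in for the model feature points pushed through Proj(R·(model point) + t).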
Further, in an embodiment of the present invention, the expression parameters solved from the sound information are introduced into a weighting formula:
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1 and λ2 are the fitting weights of the two sets of expression parameters.
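The weighting itself is a simple linear blend of the two expression-parameter vectors. A minimal sketch, with hypothetical values for λ1 and λ2 (the patent leaves their choice open):

```python
import numpy as np

def blend_expression(w_audio, w_sideface, lam1=0.4, lam2=0.6):
    """w_exp = lam1 * w_audio + lam2 * w_sideface.
    lam1 and lam2 are hypothetical fitting weights, not values from the patent."""
    return lam1 * np.asarray(w_audio) + lam2 * np.asarray(w_sideface)

# One expression dimension driven mostly by audio, another by the side-face fit.
w_exp = blend_expression([1.0, 0.0], [0.0, 1.0])
```

The blended w_exp then replaces the purely image-derived expression parameters when the bilinear face model is evaluated, which is how the audio track compensates for mouth motion the side cameras cannot see.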
It should be noted that the foregoing explanation of the embodiment of the method for reconstructing a human face with a voice-driven auxiliary side face image also applies to the device for reconstructing a human face with a voice-driven auxiliary side face image of this embodiment; the details are not repeated here.
According to the face reconstruction device for voice-driven auxiliary side face images provided by the embodiment of the invention, the collected voice information can be matched to face movement through deep learning, and the extracted face movement parameters are then used to guide the three-dimensional reconstruction of the face model. By adding this voice-information-assisted reconstruction method, mouth movement information that is difficult to obtain from the side face image alone can be well tracked and reconstructed, effectively improving reconstruction reliability.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Also, a first feature "on," "over," or "above" a second feature may be directly or obliquely above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature "under," "below," or "beneath" a second feature may be directly or obliquely under the second feature, or may simply indicate that the first feature is at a lower level than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (8)
1. A face reconstruction method for voice-driven auxiliary side face images, characterized by comprising the following steps:
extracting a training data set, wherein when a wearer of the AR device, facing the camera lens, speaks a passage of corpus, voice information collected by the camera is extracted and converted into voice features, and facial image information collected by the camera is extracted and expression parameters are derived from it, the voice features and the expression parameters being respectively used as the input and the output of a deep learning training set;
performing deep learning training on the training data set by using a convolutional neural network to obtain the corresponding relation between the voice characteristics and the expression parameters;
in the process of testing and using a neural network model, acquiring side face information and speaking voice information of a user through miniature cameras erected on two sides of the AR device, extracting voice characteristics of a voice waveform window corresponding to each frame, and storing the characteristic information into an input file;
putting the input file into a trained neural network model to obtain a facial expression parameter corresponding to each frame of voice information;
for the side face information, a training set is also made, the feature points on the front face are mapped onto the side face through calibrated camera parameters through the existing front face feature point matching, the feature point information of the side face is obtained, and the corresponding relation between the side face image and the face feature points is obtained through a deep learning method;
fitting the positions of the feature points on the side face by using an existing face model, solving the expression parameters and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting, wherein the fitting process constrains both the distances between the positions of the model feature points projected onto the image and the feature point positions detected on the frontal face image, and a regularization of the model parameters themselves, specifically: E = E_data + λ·E_reg, E_reg = Σ_id ε(|w_id| − 3) + Σ_exp ε(|w_exp| − 3), wherein R, t, w_id and w_exp are the parameters to be solved, Proj is the camera projection matrix, the tensor base of the i-th feature point on the model gives that point's position, p_i is the coordinate of the i-th feature point on the image, ε(·) is a step function, w_id is the shape parameter of the model, and w_exp is the expression parameter of the model; and
and performing texture mapping and matching on the fitted face model to obtain a final face reconstruction result.
2. The method of claim 1, wherein when the training set is collected, the collected results comprise a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and the wearer's voice information collected by the camera.
3. The method for reconstructing a human face from an audio-driven auxiliary side-face image according to claim 2, wherein the extracting a training data set further comprises:
extracting parameters of the collected voice waveform window of each frame through a linear predictive coding method to serve as the voice characteristics of the frame, and inputting the parameters serving as a training set;
and carrying out face model fitting on the collected front face image sequence and extracting the expression parameters to be output as a training set.
4. The method of claim 1, wherein the expression parameters solved from the sound information are introduced into the following formula for weighting:
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1 and λ2 are the fitting weights of the two sets of expression parameters.
5. A face reconstruction device that uses voice to drive an auxiliary side face image, comprising:
the system comprises a first extraction module, a second extraction module and a third extraction module, wherein if a face of an AR device wearer speaks a section of corpus to a camera lens, voice information collected by a camera is extracted and converted into voice features, face image information collected by the camera is extracted, and expression parameters are extracted, and the voice features and the expression parameters are respectively used as input and output of a deep learning training set;
the first acquisition module is used for carrying out deep learning training on the training data set by using a convolutional neural network to acquire the corresponding relation between the voice feature and the expression parameter;
the second extraction module is used for acquiring side face information and speaking voice information of a user through miniature cameras erected on two sides of the AR device in the process of testing and using the neural network model, extracting voice characteristics of a voice waveform window corresponding to each frame and storing the characteristic information into an input file;
the input file is placed into a trained neural network model to obtain facial expression parameters corresponding to each frame of voice information;
the second acquisition module is used for making a training set for the side face information, mapping the feature points on the front face to the side face through the calibrated camera parameters by matching the existing front face feature points, acquiring the feature point information of the side face and acquiring the corresponding relation between the side face image and the face feature points by a deep learning method;
the weighting module is used for solving the expression parameters and the shape parameters of the face by using the feature point positions on the existing face model fitting side face, and introducing the expression parameters obtained by solving the sound information for weighting, wherein the fitting process constrains the distance between the positions of the model feature points projected on the image and the feature point positions detected on the frontal face image, and the regular constraints of the model parameters are specifically as follows: e ═ Edata+λEreg, Ereg=∑id∈(|wid|-3)+∑exp∈(|wexpL-3), wherein R, t, wid,wexpFor the parameters to be solved, Proj is the camera projection matrix,tensor base, p, for the ith eigenpoint on the modeliIs the coordinate of the ith characteristic point on the image, and epsilon (-) is a step function, widIs a shape parameter of the model, wexpExpression parameters of the model; and
and the processing module is used for performing texture mapping and matching on the fitted face model to obtain a final face reconstruction result.
6. The device for reconstructing a human face by using voice-driven auxiliary side face images according to claim 5, wherein the results collected when the training set is collected include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and the wearer's voice information collected by the camera.
7. The device for reconstructing a human face according to claim 6, wherein the first extraction module is further configured to extract parameters from the collected speech waveform window of each frame by a linear predictive coding method as the speech features of that frame, to serve as the training set input, and to perform face model fitting on the collected front face image sequence and extract the expression parameters, to serve as the training set output.
8. The device for reconstructing a human face by using a voice-driven auxiliary side face image according to claim 5, wherein the expression parameters solved from the voice information are introduced into the following formula for weighting:
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1 and λ2 are the fitting weights of the two sets of expression parameters.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN201711461073.0A (CN108230438B) | 2017-12-28 | 2017-12-28 | Face reconstruction method and device for voice-driven auxiliary side face image |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN108230438A (en) | 2018-06-29 |
| CN108230438B (en) | 2020-06-19 |
Legal Events

| Date | Code | Title |
| --- | --- | --- |
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |
| 2022-12-20 | TR01 | Transfer of patent right, from TSINGHUA University (100084 Tsinghua Yuan, Haidian District, Beijing) to Hangzhou Xinchangyuan Technology Co., Ltd. (Room 3346, Floor 3, International Innovation Expo Center, No. 267, Kejiyuan Road, Baiyang Street, Qiantang District, Hangzhou, Zhejiang 310020) |