CN108230438B - Face reconstruction method and device for voice-driven auxiliary side face image - Google Patents


Info

Publication number
CN108230438B
Authority
CN
China
Prior art keywords
face
model
voice
information
parameters
Prior art date
Legal status
Active
Application number
CN201711461073.0A
Other languages
Chinese (zh)
Other versions
CN108230438A (en)
Inventor
刘烨斌
苏肇祺
戴琼海
Current Assignee
Hangzhou Xinchangyuan Technology Co ltd
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201711461073.0A
Publication of CN108230438A
Application granted
Publication of CN108230438B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a face reconstruction method and device for voice-driven auxiliary side-face images. The method comprises the following steps: extracting a training data set; obtaining the correspondence between speech features and expression parameters; extracting the speech features of the speech waveform window corresponding to each frame and storing the feature information into an input file; obtaining the facial expression parameters corresponding to each frame of speech information; acquiring the feature point information of the side face so as to obtain, by a deep learning method, the correspondence between the side-face image and the face feature points; fitting the feature point positions on the side face with an existing face model, solving the expression and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result. The method allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving the reliability of the reconstruction.

Description

Face reconstruction method and device for voice-driven auxiliary side face image
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method and a device for reconstructing a human face from side-face images with voice-driven assistance.
Background
High-quality three-dimensional models have important application value in fields such as film and television entertainment, cultural relic protection, and machining. Because human facial expressions are rich, three-dimensional face reconstruction is a major problem in the field of three-dimensional reconstruction.
Face reconstruction techniques in the related art mainly target frontal face images. For AR (augmented reality) devices, if the face of the wearer needs to be reconstructed without affecting the wearer's line of sight, miniature cameras must be arranged on both sides of the AR glasses. This makes face reconstruction difficult, especially the reconstruction of mouth movement, because the mouth information captured in side-face images is often incomplete.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a face reconstruction method for voice-driven auxiliary side-face images, which can track and reconstruct mouth motion information that is difficult to obtain from side-face images alone and effectively improve reconstruction reliability.
Another object of the present invention is to provide a face reconstruction device for voice-driven auxiliary side-face images.
In order to achieve the above object, an embodiment of the present invention provides a face reconstruction method for voice-driven auxiliary side-face images, comprising the following steps: extracting a training data set, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of a deep learning training set; performing deep learning training on the training data set with a convolutional neural network to obtain the correspondence between the speech features and the expression parameters; in the process of testing and using the neural network model, acquiring the side-face information and the spoken voice of the user through miniature cameras mounted on both sides of the AR device, extracting the speech features of the speech waveform window corresponding to each frame, and storing the feature information into an input file; feeding the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information; for the side-face information, likewise building a training set, mapping the feature points on the frontal face onto the side face through the calibrated camera parameters by means of existing frontal-face feature point matching, thereby obtaining the feature point information of the side face and, through a deep learning method, the correspondence between the side-face image and the face feature points; fitting the positions of the feature points on the side face with an existing face model, solving the expression parameters and shape parameters of the face, and introducing the expression parameters solved from the sound information for weighting; and performing texture mapping and matching on the fitted face model to obtain the final face reconstruction result.
According to the face reconstruction method for voice-driven auxiliary side-face images, deep learning can be used to match the collected sound information with the facial motion, the extracted facial motion parameters can then be used to guide the three-dimensional reconstruction of the face model, and the added voice-assisted reconstruction allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving reconstruction reliability.
In addition, the face reconstruction method for the voice-driven auxiliary side face image according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the extracting the training data set further includes: extracting parameters of the collected voice waveform window of each frame through a linear predictive coding method to serve as the voice characteristics of the frame, and inputting the parameters serving as a training set; and carrying out face model fitting on the collected front face image sequence and extracting the expression parameters to be output as a training set.
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced for weighting through the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
In order to achieve the above object, another embodiment of the present invention provides a face reconstruction device for voice-driven auxiliary side-face images, comprising: a first extraction module, configured to extract a training data set, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of a deep learning training set; a first acquisition module, configured to perform deep learning training on the training data set with a convolutional neural network to obtain the correspondence between the speech features and the expression parameters; a second extraction module, configured to, in the process of testing and using the neural network model, acquire the side-face information and the spoken voice of the user through miniature cameras mounted on both sides of the AR device, extract the speech features of the speech waveform window corresponding to each frame, and store the feature information into an input file; a placement module, configured to feed the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information; a second acquisition module, configured to likewise build a training set for the side-face information, map the feature points on the frontal face onto the side face through the calibrated camera parameters by means of existing frontal-face feature point matching, obtain the feature point information of the side face, and obtain the correspondence between the side-face image and the face feature points by a deep learning method; a weighting module, configured to fit the positions of the feature points on the side face with an existing face model, solve the expression parameters and shape parameters of the face, and introduce the expression parameters solved from the sound information for weighting; and a processing module, configured to perform texture mapping and matching on the fitted face model to obtain the final face reconstruction result.
The face reconstruction device for voice-driven auxiliary side-face images according to the embodiment of the present invention can use deep learning to match the collected sound information with the facial motion and then use the extracted facial motion parameters to guide the three-dimensional reconstruction of the face model; with the added voice-assisted reconstruction, mouth motion information that is difficult to obtain from side-face images can be tracked and reconstructed well, effectively improving reconstruction reliability.
In addition, the face reconstruction device for voice-driven auxiliary side face image according to the above embodiment of the present invention may further have the following additional technical features:
further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the first extraction module is further configured to extract parameters from the collected speech waveform window of each frame through a linear predictive coding method, where the parameters are used as the speech features of the frame to be input as a training set, perform face model fitting on the collected front face image sequence, and extract the expression parameters to be output as the training set.
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced for weighting through the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for face reconstruction with voice-driven auxiliary side-face images according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or to elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The following describes a face reconstruction method and apparatus for a voice-driven auxiliary side face image according to an embodiment of the present invention with reference to the accompanying drawings, and first, a face reconstruction method for a voice-driven auxiliary side face image according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart of a face reconstruction method for voice-driven auxiliary side face images according to an embodiment of the present invention.
As shown in fig. 1, the method for reconstructing a human face by using voice to drive an auxiliary side face image comprises the following steps:
In step S101, a training data set is extracted, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of the deep learning training set.
It can be understood that, in the embodiment of the invention, miniature cameras mounted on both sides of the user's AR glasses collect the facial motion and the speech of the wearer, and, while the side-face information is collected and reconstructed, waveform features are extracted from the voice information and exploited with a deep learning method.
That is, miniature cameras are mounted on both sides of the AR device and the relative parameters of the two cameras are calibrated with a calibration plate; the wearer then puts on the AR device, and a camera is set up directly in front of the wearer to collect the training set required for deep learning.
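For illustration, the relative parameters of the two side cameras can be estimated from synchronized views of the calibration plate with OpenCV's stereo calibration. The sketch below is a minimal example only; the checkerboard geometry, square size, and file names are assumptions rather than values given in this description.

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry (inner corners) and square size in meters.
BOARD = (9, 6)
SQUARE = 0.025

# Template of 3D corner coordinates on the calibration plate (z = 0 plane).
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, pts_left, pts_right = [], [], []
for fl, fr in zip(sorted(glob.glob("left_*.png")), sorted(glob.glob("right_*.png"))):
    gl = cv2.imread(fl, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(fr, cv2.IMREAD_GRAYSCALE)
    ok_l, corners_l = cv2.findChessboardCorners(gl, BOARD)
    ok_r, corners_r = cv2.findChessboardCorners(gr, BOARD)
    if ok_l and ok_r:                     # keep only frames where both cameras see the board
        obj_pts.append(objp)
        pts_left.append(corners_l)
        pts_right.append(corners_r)

# Intrinsics of each miniature camera, then the rotation R and translation T between them.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, pts_left, gl.shape[::-1], None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, pts_right, gr.shape[::-1], None, None)
_, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
    obj_pts, pts_left, pts_right, K1, d1, K2, d2, gl.shape[::-1],
    flags=cv2.CALIB_FIX_INTRINSIC)
print("relative rotation:\n", R, "\nrelative translation:\n", T)
```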
Optionally, in an embodiment of the present invention, when the training set is collected, the collected results include a sequence of side face images captured by a camera on the AR device, a sequence of front face images captured by a front camera, and voice information of the wearer collected by the camera.
For example, after three cameras are synchronized, the training set starts to be acquired, and the acquired results include a side face image sequence captured by the camera on the AR device, a front face image sequence captured by the front camera, and the voice information of the wearer acquired by the camera.
Further, in an embodiment of the present invention, extracting the training data set further includes: extracting parameters from the collected voice waveform window of each frame by a linear predictive coding method to serve as the voice characteristics of the frame to be input as a training set; and carrying out face model fitting on the collected front face image sequence and extracting expression parameters to be output as a training set.
It can be understood that, in the embodiment of the present invention, parameters of the acquired speech waveform window of each frame may be extracted by an LPC (linear predictive coding) method to serve as speech feature information of the frame, and the extracted parameters serve as input of a training set; and carrying out face model fitting on the collected front face image sequence, extracting expression parameters and outputting the expression parameters as a training set.
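As a concrete, hypothetical illustration of this step, the LPC coefficients of the speech window aligned with each video frame can be computed as below; the sample rate, the window length (one video frame of audio), and the LPC order are assumptions, not values specified here.

```python
import numpy as np
import librosa

def lpc_features(wav_path, fps=30, order=16):
    """Per-frame LPC coefficients for the speech window aligned with each video frame.

    The 16 kHz sample rate, the 1/fps-second window and the LPC order are
    illustrative assumptions.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    win = sr // fps                                 # audio samples covered by one video frame
    feats = []
    for start in range(0, len(y) - win, win):
        frame = y[start:start + win]
        frame = frame * np.hanning(len(frame))      # taper the window before LPC analysis
        a = librosa.lpc(frame, order=order)         # [1, a_1, ..., a_order]
        feats.append(a[1:])                         # drop the leading 1
    return np.stack(feats)                          # shape: (num_frames, order)

# Example: feats = lpc_features("wearer_speech.wav"); np.save("audio_input.npy", feats)
```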
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves. The specific form is:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
where R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function. The energy is minimized with a Gauss-Newton gradient descent method, solving for R, t, w_id, w_exp; w_exp is then extracted and used as the training-set output.
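A minimal numerical sketch of this fitting is given below, assuming a rotation-vector parameterization, a pinhole projection with known intrinsics K, and a smooth hinge in place of the step-function regularizer; it uses a generic nonlinear least-squares solver rather than a hand-written Gauss-Newton loop, and all shapes and helper names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def fit_face(Cr_lmk, p2d, K, lam=1.0, n_id=50, n_exp=47):
    """Sketch of minimizing E = E_data + lam * E_reg over R, t, w_id, w_exp.

    Cr_lmk : (L, 3, n_id, n_exp) bilinear tensor slices for the L landmark vertices
    p2d    : (L, 2) feature points detected on the image
    K      : (3, 3) camera intrinsic matrix
    """
    def unpack(x):
        return x[:3], x[3:6], x[6:6 + n_id], x[6 + n_id:]

    def residuals(x):
        rvec, t, w_id, w_exp = unpack(x)
        R = Rotation.from_rotvec(rvec).as_matrix()
        # Landmark positions from the bilinear model: contract identity and expression weights.
        V = np.einsum('lxie,i,e->lx', Cr_lmk, w_id, w_exp)       # (L, 3)
        cam = R @ V.T + t[:, None]                               # (3, L) points in the camera frame
        proj = K @ cam
        proj = (proj[:2] / proj[2]).T                            # (L, 2) projected pixels
        r_data = (proj - p2d).ravel()
        # Hinge penalty standing in for epsilon(|w| - 3): active only outside [-3, 3].
        r_reg = np.sqrt(lam) * np.concatenate([
            np.maximum(np.abs(w_id) - 3.0, 0.0),
            np.maximum(np.abs(w_exp) - 3.0, 0.0),
        ])
        return np.concatenate([r_data, r_reg])

    x0 = np.zeros(6 + n_id + n_exp)
    x0[5] = 2.0                     # place the face roughly in front of the camera
    sol = least_squares(residuals, x0)
    return unpack(sol.x)
```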
In step S102, a convolutional neural network is used to perform deep learning training on the training data set, and a corresponding relationship between the speech feature and the expression parameter is obtained.
It can be understood that, in the embodiment of the invention, deep learning training is performed on the training data set with a convolutional neural network so as to obtain the correspondence between the speech features and the expression parameters; training on the training set through the convolutional neural network yields the trained model.
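The description does not fix a network architecture, so the following is only a plausible sketch of a convolutional regressor from per-frame speech features to expression parameters; the context length, channel widths, number of expression parameters, and training loop are assumptions.

```python
import torch
import torch.nn as nn

class AudioToExpressionNet(nn.Module):
    """Maps a short window of per-frame LPC features to facial expression parameters."""

    def __init__(self, lpc_order=16, context=9, n_exp=47):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(lpc_order, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_exp),
        )

    def forward(self, x):            # x: (batch, lpc_order, context)
        return self.net(x)

def train(model, loader, epochs=50, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()           # regress the expression parameters fitted on frontal frames
    for _ in range(epochs):
        for feats, w_exp in loader:  # feats: LPC context windows, w_exp: target expression params
            opt.zero_grad()
            loss = loss_fn(model(feats), w_exp)
            loss.backward()
            opt.step()
    return model
```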
In step S103, in the process of testing and using the neural network model, the miniature cameras erected on both sides of the AR device collect the side face information and the speaking voice information of the user, extract the voice features of the voice waveform window corresponding to each frame, and store the feature information into the input file.
It can be understood that in the process of testing and using the model, the miniature cameras erected at two sides of the AR device are used for collecting the side face information and the speaking voice information of the user, extracting the voice characteristics of the voice waveform window corresponding to each frame, and storing the characteristic information into the input file.
For example, a training set is made for the side face information, feature points on the front face are mapped onto the side face through calibrated camera parameters by an existing front face feature point matching method, the feature point information of the side face is obtained, and the corresponding relation between the side face image and the face feature points is obtained through a deep learning method; or detecting the feature point information of the side face by using the existing side face feature point detection method.
In step S104, the input file is placed in the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information.
It can be understood that the embodiment of the invention can put the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of voice information.
In step S105, a training set is also created for the side face information, and feature points on the front face are mapped onto the side face through calibrated camera parameters by existing front face feature point matching, so as to obtain feature point information of the side face, so as to obtain a corresponding relationship between the side face image and the face feature points through a deep learning method.
It can be understood that, for the side face information, a training set is also made, and the feature points on the front face are mapped onto the side face through calibrated camera parameters by using the existing method for matching the feature points of the front face, so as to obtain the feature point information of the side face, thereby obtaining the corresponding relationship between the side face image and the feature points of the face through a deep learning method.
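One plausible reading of this mapping step, sketched below, is to take the 3D landmark positions from the frontal fit (expressed in the front camera's coordinate frame) and project them into a side camera using the calibrated relative pose; the helper name and the use of the fitted model for depth are assumptions, not details stated in the description.

```python
import cv2
import numpy as np

def transfer_landmarks_to_side(V_front, R_rel, T_rel, K_side, dist_side=None):
    """Project 3D landmarks in the front camera's frame into a side camera image,
    using the calibrated relative rotation R_rel and translation T_rel."""
    rvec, _ = cv2.Rodrigues(R_rel)                      # rotation matrix -> rotation vector
    dist = np.zeros(5) if dist_side is None else dist_side
    pts2d, _ = cv2.projectPoints(V_front.astype(np.float64), rvec,
                                 T_rel.reshape(3, 1), K_side, dist)
    return pts2d.reshape(-1, 2)                         # (num_landmarks, 2) side-image pixels

# side_pts = transfer_landmarks_to_side(V_landmarks, R, T, K_side)
# Each (side image, side landmark) pair then serves as a training sample for the
# side-face landmark detector.
```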
In step S106, the existing face model is used to fit the positions of the feature points on the side face, the expression parameters and shape parameters of the face are solved, and the expression parameters obtained by solving the sound information are introduced for weighting.
It can be understood that, in the embodiment of the present invention, an existing face model (e.g., FaceWarehouse) may be used to fit the positions of the feature points on the side face and to solve the expression parameters and shape parameters of the face, and the expression parameters solved from the sound information are then introduced for weighting.
It can be understood that the expression parameters obtained by solving the sound information according to the embodiment of the present invention are introduced into the following formulas for weighting:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
For example, during testing and use there are only the AR device and the miniature cameras on its two sides; there is no front-facing camera. Feature points are extracted from the side-face information by the method described above, and the energy terms above are minimized to solve for R, t, w_id, w_sideface. At the same time, speech features are extracted from the collected voice information by the method described above and fed to the trained model, which outputs the facial expression parameters w_audio. Finally, the two sets of expression parameters w_sideface and w_audio are linearly weighted by the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
and the face model parameters are converted into the three-dimensional face model.
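A minimal sketch of this final blend is shown below, assuming the bilinear tensor is stored as an array of shape (num_vertices, 3, n_id, n_exp) and using equal example weights λ1 = λ2 = 0.5; both assumptions are for illustration only.

```python
import numpy as np

def blended_face_vertices(Cr, w_id, w_audio, w_sideface, lam_audio=0.5, lam_side=0.5):
    """Blend the audio- and side-face-derived expression parameters, then evaluate
    the bilinear face model."""
    w_exp = lam_audio * np.asarray(w_audio) + lam_side * np.asarray(w_sideface)
    # V = Cr x2 w_id x3 w_exp : contract the identity and expression modes.
    V = np.einsum('vxie,i,e->vx', Cr, w_id, w_exp)
    return V, w_exp     # (num_vertices, 3) mesh vertices and the blended expression weights
```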
In step S107, texture mapping and matching are performed on the fitted face model to obtain a final face reconstruction result.
That is to say, the embodiment of the invention can perform texture matching and mapping on the generated face model to complete three-dimensional modeling.
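As a rough illustration only (the description does not detail the texture step), per-vertex colors could be sampled by projecting the reconstructed mesh back into a side-face image; a real pipeline would build a UV texture atlas and blend the two side views.

```python
import cv2
import numpy as np

def sample_vertex_colors(V, image, R, T, K):
    """Project each mesh vertex into a side-face image and sample the pixel color there."""
    rvec, _ = cv2.Rodrigues(R)
    pts, _ = cv2.projectPoints(V.astype(np.float64), rvec, T.reshape(3, 1), K, np.zeros(5))
    pts = np.round(pts.reshape(-1, 2)).astype(int)
    h, w = image.shape[:2]
    pts[:, 0] = np.clip(pts[:, 0], 0, w - 1)    # clamp to the image bounds
    pts[:, 1] = np.clip(pts[:, 1], 0, h - 1)
    return image[pts[:, 1], pts[:, 0]]          # (num_vertices, 3) per-vertex colors
```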
According to the face reconstruction method for voice-driven auxiliary side-face images provided by the embodiment of the invention, deep learning can be used to match the collected voice information with the facial motion, the extracted facial motion parameters can then be used to guide the three-dimensional reconstruction of the face model, and the added voice-assisted reconstruction allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving reconstruction reliability.
Next, a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 2 is a schematic structural diagram of a face reconstruction apparatus for voice-driving an auxiliary side face image according to an embodiment of the present invention.
As shown in fig. 2, the face reconstruction device 10 for voice-driven auxiliary side-face images includes: a first extraction module 100, a first acquisition module 200, a second extraction module 300, a placement module 400, a second acquisition module 500, a weighting module 600, and a processing module 700.
The first extraction module 100 is configured to extract a training data set, wherein, while the wearer of the AR device faces the camera lens and speaks a segment of corpus, the sound information collected by the camera is extracted and converted into speech features, the facial image information collected by the camera is extracted and expression parameters are extracted from it, the speech features and the expression parameters serving respectively as the input and output of the deep learning training set. The first acquisition module 200 is configured to perform deep learning training on the training data set with a convolutional neural network and obtain the correspondence between the speech features and the expression parameters. The second extraction module 300 is configured to, during testing and use of the neural network model, collect the side-face information and the spoken voice of the user through the miniature cameras mounted on both sides of the AR device, extract the speech features of the speech waveform window corresponding to each frame, and store the feature information in an input file. The placement module 400 is configured to feed the input file into the trained neural network model to obtain the facial expression parameters corresponding to each frame of speech information. The second acquisition module 500 is configured to likewise build a training set for the side-face information, map the feature points on the frontal face onto the side face through the calibrated camera parameters by means of existing frontal-face feature point matching, obtain the feature point information of the side face, and obtain the correspondence between the side-face image and the face feature points by a deep learning method. The weighting module 600 is configured to fit the positions of the feature points on the side face with an existing face model, solve the expression parameters and shape parameters of the face, and introduce the expression parameters solved from the sound information for weighting. The processing module 700 is configured to perform texture mapping and matching on the fitted face model to obtain the final face reconstruction result. The device 10 of the embodiment of the invention can track and reconstruct mouth motion information that is difficult to obtain from side-face images and effectively improves reconstruction reliability.
Further, in one embodiment of the present invention, when the training set is collected, the collected results include a sequence of side-face images captured by a camera on the AR device, a sequence of frontal face images captured by the front camera, and the voice information of the wearer collected by the camera.
Further, in an embodiment of the present invention, the first extraction module 100 is further configured to extract parameters from the acquired speech waveform window of each frame through a linear predictive coding method, as speech features of the frame, to serve as a training set input, perform face model fitting on the acquired front face image sequence, and extract expression parameters, to serve as a training set output.
Further, in an embodiment of the present invention, the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, and ε(·) is a step function.
Further, in an embodiment of the present invention, the expression parameters obtained by solving the sound information are introduced for weighting through the formulas:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is the shape parameter of the model, w_exp is the expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
It should be noted that the foregoing explanation on the embodiment of the method for reconstructing a human face by using a voice to drive an auxiliary side face image is also applicable to the device for reconstructing a human face by using a voice to drive an auxiliary side face image in this embodiment, and details are not repeated here.
According to the face reconstruction device for voice-driven auxiliary side-face images provided by the embodiment of the invention, deep learning can be used to match the collected voice information with the facial motion, the extracted facial motion parameters can then be used to guide the three-dimensional reconstruction of the face model, and the added voice-assisted reconstruction allows mouth motion information that is difficult to obtain from side-face images to be tracked and reconstructed well, effectively improving reconstruction reliability.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A human face reconstruction method for driving an auxiliary side face image by sound is characterized by comprising the following steps:
extracting a training data set, wherein if a wearer speaks a section of corpus to the camera lens through the front face of the AR device, voice information collected by the camera is extracted and converted into voice features, facial image information collected by the camera is extracted, and expression parameters are extracted, wherein the voice features and the expression parameters are respectively used as input and output of a deep learning training set;
performing deep learning training on the training data set by using a convolutional neural network to obtain the corresponding relation between the voice characteristics and the expression parameters;
in the process of testing and using a neural network model, acquiring side face information and speaking voice information of a user through miniature cameras erected on two sides of the AR device, extracting voice characteristics of a voice waveform window corresponding to each frame, and storing the characteristic information into an input file;
putting the input file into a trained neural network model to obtain a facial expression parameter corresponding to each frame of voice information;
for the side face information, a training set is also made, the feature points on the front face are mapped onto the side face through calibrated camera parameters through the existing front face feature point matching, the feature point information of the side face is obtained, and the corresponding relation between the side face image and the face feature points is obtained through a deep learning method;
the method comprises the following steps of fitting the positions of characteristic points on a side face by using an existing face model, solving expression parameters and shape parameters of the face, and introducing expression parameters obtained by solving sound information for weighting, wherein the fitting process constrains the distance between the positions of the model characteristic points projected on an image and the positions of the characteristic points detected on a front face image, and the regular constraint of the model parameters per se, and specifically comprises the following steps: e ═ Edata+λEreg
Figure FDA0002416331480000011
Ereg=∑id∈(|wid|-3)+∑exp∈(|wexpL-3), wherein R, t, wid,wexpFor the parameters to be solved, Proj is the camera projection matrix,
Figure FDA0002416331480000012
tensor base, p, for the ith eigenpoint on the modeliIs the coordinate of the ith characteristic point on the image, and epsilon (-) is a step function, widIs a shape parameter of the model, wexpExpression parameters of the model; and
and performing texture mapping and matching on the fitted face model to obtain a final face reconstruction result.
2. The method of claim 1, wherein, when the training set is collected, the collected results comprise a sequence of side-face images captured by a camera on the AR device, a sequence of frontal face images captured by a front camera, and the voice information of the wearer collected by the camera.
3. The method for reconstructing a human face from an audio-driven auxiliary side-face image according to claim 2, wherein the extracting a training data set further comprises:
extracting parameters of the collected voice waveform window of each frame through a linear predictive coding method to serve as the voice characteristics of the frame, and inputting the parameters serving as a training set;
and carrying out face model fitting on the collected front face image sequence and extracting the expression parameters to be output as a training set.
4. The method of claim 1, wherein the expression parameters obtained by the solution of the audio information are introduced into the formula for weighting as follows:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is a shape parameter of the model, w_exp is an expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
5. A face reconstruction device that uses voice to drive an auxiliary side face image, comprising:
the system comprises a first extraction module, a second extraction module and a third extraction module, wherein if a face of an AR device wearer speaks a section of corpus to a camera lens, voice information collected by a camera is extracted and converted into voice features, face image information collected by the camera is extracted, and expression parameters are extracted, and the voice features and the expression parameters are respectively used as input and output of a deep learning training set;
the first acquisition module is used for carrying out deep learning training on the training data set by using a convolutional neural network to acquire the corresponding relation between the voice feature and the expression parameter;
the second extraction module is used for acquiring side face information and speaking voice information of a user through miniature cameras erected on two sides of the AR device in the process of testing and using the neural network model, extracting voice characteristics of a voice waveform window corresponding to each frame and storing the characteristic information into an input file;
the input file is placed into a trained neural network model to obtain facial expression parameters corresponding to each frame of voice information;
the second acquisition module is used for making a training set for the side face information, mapping the feature points on the front face to the side face through the calibrated camera parameters by matching the existing front face feature points, acquiring the feature point information of the side face and acquiring the corresponding relation between the side face image and the face feature points by a deep learning method;
the weighting module is used for fitting the positions of the feature points on the side face by using an existing face model, solving the expression parameters and the shape parameters of the face, and introducing the expression parameters obtained by solving the sound information for weighting, wherein the fitting process constrains the distance between the positions of the model feature points projected onto the image and the positions of the feature points detected on the frontal face image, together with a regularization constraint on the model parameters themselves, specifically:
E = E_data + λ·E_reg,
E_data = Σ_i ‖Proj(R·(C_r^i × w_id × w_exp) + t) - p_i‖²,
E_reg = Σ_id ε(|w_id| - 3) + Σ_exp ε(|w_exp| - 3),
wherein R, t, w_id, w_exp are the parameters to be solved, Proj is the camera projection matrix, C_r^i is the bilinear tensor basis corresponding to the ith feature point on the model, p_i is the coordinate of the ith feature point on the image, ε(·) is a step function, w_id is a shape parameter of the model, and w_exp is an expression parameter of the model; and
and the processing module is used for performing texture mapping and matching on the fitted face model to obtain a final face reconstruction result.
6. The device for reconstructing a human face by using voice to drive auxiliary side-face images according to claim 5, wherein the results collected when the training set is collected include a sequence of side-face images captured by a camera on the AR device, a sequence of front-face images captured by a front camera, and the voice information of the wearer collected by the camera.
7. The device for reconstructing a human face according to claim 6, wherein the first extracting module is further configured to extract parameters from the collected speech waveform window of each frame by a linear predictive coding method as the speech features of the frame to be input as a training set, perform face model fitting on the collected front face image sequence, and extract the expression parameters to be output as the training set.
8. The device for reconstructing a human face by using a voice-driven auxiliary side face image according to claim 5, wherein the expression parameters obtained by solving the voice information are introduced into the formula for weighting:
V = C_r × w_id × w_exp,
w_exp = λ1·w_audio + λ2·w_sideface,
wherein V is the vertex position after the face model is transformed, C_r is the bilinear tensor parameter basis of the model, w_id is a shape parameter of the model, w_exp is an expression parameter of the model, w_audio is the expression parameter solved from the sound information, w_sideface is the expression parameter solved from the side-face image information, and λ1, λ2 are the weights assigned to the two sets of fitted expression parameters.
CN201711461073.0A 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image Active CN108230438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711461073.0A CN108230438B (en) 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711461073.0A CN108230438B (en) 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image

Publications (2)

Publication Number Publication Date
CN108230438A CN108230438A (en) 2018-06-29
CN108230438B true CN108230438B (en) 2020-06-19

Family

ID=62645827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711461073.0A Active CN108230438B (en) 2017-12-28 2017-12-28 Face reconstruction method and device for voice-driven auxiliary side face image

Country Status (1)

Country Link
CN (1) CN108230438B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW202009924A (en) * 2018-08-16 2020-03-01 國立臺灣科技大學 Timbre-selectable human voice playback system, playback method thereof and computer-readable recording medium
KR102537784B1 (en) * 2018-08-17 2023-05-30 삼성전자주식회사 Electronic device and control method thereof
CN110211582A (en) * 2019-05-31 2019-09-06 量子动力(深圳)计算机科技有限公司 A kind of real-time, interactive intelligent digital virtual actor's facial expression driving method and system
CN111243626B (en) * 2019-12-30 2022-12-09 清华大学 Method and system for generating speaking video
CN111294665B (en) * 2020-02-12 2021-07-20 百度在线网络技术(北京)有限公司 Video generation method and device, electronic equipment and readable storage medium
CN111432233B (en) * 2020-03-20 2022-07-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN112001992A (en) * 2020-07-02 2020-11-27 超维视界(北京)传媒科技有限公司 Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
CN112102468B (en) * 2020-08-07 2022-03-04 北京汇钧科技有限公司 Model training method, virtual character image generation device, and storage medium
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113822969B (en) * 2021-09-15 2023-06-09 宿迁硅基智能科技有限公司 Training neural radiation field model, face generation method, device and server

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102157010A (en) * 2011-05-25 2011-08-17 上海大学 Method for realizing three-dimensional facial animation based on layered modeling and multi-body driving
US9302179B1 (en) * 2013-03-07 2016-04-05 Posit Science Corporation Neuroplasticity games for addiction
CN103218842B (en) * 2013-03-12 2015-11-25 西南交通大学 A kind of voice synchronous drives the method for the three-dimensional face shape of the mouth as one speaks and facial pose animation
CN104951743A (en) * 2015-03-04 2015-09-30 苏州大学 Active-shape-model-algorithm-based method for analyzing face expression
CN106878677B (en) * 2017-01-23 2020-01-07 西安电子科技大学 Student classroom mastery degree evaluation system and method based on multiple sensors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Simultaneous Facial Feature Tracking and Facial Expression Recognition; Yongqiang Li et al.; IEEE Transactions on Image Processing; 2013-07-31; pp. 2559-2573 *

Also Published As

Publication number Publication date
CN108230438A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108230438B (en) Face reconstruction method and device for voice-driven auxiliary side face image
CN110321754B (en) Human motion posture correction method and system based on computer vision
CN102697508B (en) Method for performing gait recognition by adopting three-dimensional reconstruction of monocular vision
CN110428493B (en) Single-image human body three-dimensional reconstruction method and system based on grid deformation
CN103971408B (en) Three-dimensional facial model generating system and method
CN110544301A (en) Three-dimensional human body action reconstruction system, method and action training system
Uddin et al. Human activity recognition using body joint‐angle features and hidden Markov model
CN112069933A (en) Skeletal muscle stress estimation method based on posture recognition and human body biomechanics
CN106997605B (en) A method of foot type video is acquired by smart phone and sensing data obtains three-dimensional foot type
CN111931804B (en) Human body action automatic scoring method based on RGBD camera
CN110969124A (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN109919141A (en) A kind of recognition methods again of the pedestrian based on skeleton pose
CN108734776A (en) A kind of three-dimensional facial reconstruction method and equipment based on speckle
CN105138980A (en) Identify authentication method and system based on identity card information and face identification
CN110544302A (en) Human body action reconstruction system and method based on multi-view vision and action training system
CN109063643B (en) Facial expression pain degree identification method under condition of partial hiding of facial information
CN111012353A (en) Height detection method based on face key point recognition
CN110717391A (en) Height measuring method, system, device and medium based on video image
JP2002197443A (en) Generator of three-dimensional form data
CN112365578A (en) Three-dimensional human body model reconstruction system and method based on double cameras
CN104732586B (en) A kind of dynamic body of 3 D human body and three-dimensional motion light stream fast reconstructing method
He et al. A New Kinect‐Based Posture Recognition Method in Physical Sports Training Based on Urban Data
CN107704851A (en) Character recognition method, Public Media exhibiting device, server and system
JP2021086274A (en) Lip reading device and lip reading method
CN110060334A (en) Calculating integration imaging image reconstructing method based on Scale invariant features transform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221220

Address after: Room 3346, Floor 3, International Innovation Expo Center, No. 267, Kejiyuan Road, Baiyang Street, Qiantang District, Hangzhou, Zhejiang 310020

Patentee after: Hangzhou Xinchangyuan Technology Co.,Ltd.

Address before: 100084 Tsinghua Yuan, Beijing, Haidian District

Patentee before: TSINGHUA University