CN115984452A - Head three-dimensional reconstruction method and equipment - Google Patents

Head three-dimensional reconstruction method and equipment

Info

Publication number
CN115984452A
Authority
CN
China
Prior art keywords
weight
model
parameter
expression parameter
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111197148.5A
Other languages
Chinese (zh)
Inventor
刘帅 (Liu Shuai)
吴连朋 (Wu Lianpeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Juhaokan Technology Co Ltd
Original Assignee
Juhaokan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Juhaokan Technology Co Ltd filed Critical Juhaokan Technology Co Ltd
Priority to CN202111197148.5A
Publication of CN115984452A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application relates to the technical field of three-dimensional reconstruction and provides a head three-dimensional reconstruction method and device. Expression parameters are extracted from face images and from the corresponding voice data respectively. A first weight for the expression parameters extracted from the voice data is adjusted according to the comparison of the acquisition frame rate of the face images with the rendering frame rate, and a second weight for the expression parameters is adjusted according to the distance between each pair of human body models in the virtual space. The extracted expression parameters and the determined weights are sent to the rendering display end, which drives a prestored parameterized head model to move according to the first and second weights, completing the three-dimensional reconstruction. Using expression parameters extracted from voice data remedies reconstruction failures caused by missing face images and improves the robustness of the three-dimensional reconstruction; and because the influence of sound on facial expression is taken into account and voice data is incorporated into the reconstruction, the reconstructed head three-dimensional model is more realistic.

Description

Head three-dimensional reconstruction method and equipment
Technical Field
The present application relates to the field of three-dimensional reconstruction technologies, and in particular, to a method and an apparatus for three-dimensional reconstruction of a head.
Background
Human body three-dimensional reconstruction is the basis of a remote three-dimensional communication system, and head three-dimensional reconstruction is its key part, directly influencing the quality of the overall reconstruction.
At present, head three-dimensional reconstruction generally uses a parameterized head model, which comprises shape parameters, expression parameters and pose parameters. The expression parameters can drive the parameterized head model to deform non-rigidly so as to express the various expression changes of the human face.
In a remote holographic communication scene, voice serves as the medium of information transmission during three-dimensional reconstruction and also causes facial expression changes, so head three-dimensional reconstruction can be performed in combination with voice data.
Disclosure of Invention
The embodiments of the application provide a head three-dimensional reconstruction method and device, which perform head three-dimensional reconstruction with the aid of sound and improve the robustness and realism of the three-dimensional reconstruction.
In a first aspect, an embodiment of the present application provides a method for three-dimensional reconstruction of a head, including:
acquiring voice data corresponding to each frame of face image;
extracting a first expression parameter from the face image and extracting a second expression parameter from corresponding voice data to obtain a target driving parameter, wherein the target driving parameter is used for driving a prestored parameterized head model to move;
adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image, and adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects;
and sending the first expression parameter, the second expression parameter, the first weight and the second weight to a rendering end, so that the rendering end drives the parameterized head model to move using the first expression parameter and the second expression parameter in accordance with the first weight and the second weight.
Optionally, the adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image includes:
if the acquisition frame rate is less than a preset rendering frame rate of the rendering end, setting, for a face image lacking a first expression parameter, the first weight of the second expression parameter extracted from the voice data corresponding to that face image to 1, and reducing, for a face image having a first expression parameter, the first weight of the second expression parameter extracted from the voice data corresponding to that face image; or
if the acquisition frame rate is not less than the preset rendering frame rate of the rendering end, reducing the first weight of the second expression parameter extracted from the voice data corresponding to the face image.
Optionally, the adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects includes:
determining the model grade of the parameterized head model of the target object according to a pre-established correspondence between distance and model grade;
and adjusting the second weight of the target driving parameter according to the model grade.
Optionally, the adjusting the second weight of the target driving parameter according to the model grade includes:
if the model grade is less than a first preset grade, reducing the second weight of the target driving parameter; or
if the model grade is greater than a second preset grade, increasing the second weight of the target driving parameter, wherein the first preset grade is less than or equal to the second preset grade.
Optionally, the method further includes:
extracting depth data from each frame of face depth image;
optimizing the parameterized head model based on the extracted depth data.
In a second aspect, an embodiment of the present application provides a reconstruction device, including a processor, a memory, a display, and at least one external communication interface, where the processor, the memory, the display, and the external communication interface are connected by a bus;
the at least one communication interface is configured to acquire an image of a target object and acquire voice data of the target object;
the memory having stored therein a computer program, the processor being configured to perform the following operations based on the computer program:
for each frame of face image, acquiring the voice data corresponding to the face image;
extracting a first expression parameter from the face image and extracting a second expression parameter from corresponding voice data to obtain a target driving parameter, wherein the target driving parameter is used for driving a prestored parameterized head model to move;
adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image, and adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects;
and sending the first expression parameter, the second expression parameter, the first weight and the second weight to a rendering end, so that the rendering end drives the parameterized head model to move using the first expression parameter and the second expression parameter in accordance with the first weight and the second weight.
Optionally, when adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image, the processor is configured to:
if the acquisition frame rate is less than a preset rendering frame rate of the rendering end, set, for a face image lacking a first expression parameter, the first weight of the second expression parameter extracted from the corresponding voice data to 1, and reduce, for a face image having a first expression parameter, the first weight of the second expression parameter extracted from the corresponding voice data; or
if the acquisition frame rate is not less than the preset rendering frame rate of the rendering end, reduce the first weight of the second expression parameter extracted from the voice data corresponding to the face image.
Optionally, when adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects, the processor is specifically configured to:
determine the model grade of each model according to a pre-established correspondence between distance and model grade;
and adjust the second weight of the target driving parameter according to the model grade.
Optionally, when adjusting the second weight of the target driving parameter according to the model grade, the processor is specifically configured to:
if the model grade is less than a first preset grade, reduce the second weight of the target driving parameter; or
if the model grade is greater than a second preset grade, increase the second weight of the target driving parameter, wherein the first preset grade is less than or equal to the second preset grade.
Optionally, the processor is further configured to:
extracting depth data from each frame of face depth image;
optimizing the parameterized head model based on the extracted depth data.
In a third aspect, embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the head three-dimensional reconstruction method described above.
In the embodiment of the application, the voice data corresponding to each frame of face image is acquired, and a first expression parameter and a second expression parameter are extracted from the face image and the voice data respectively to obtain the target driving parameters, so that model driving incorporates sound. Further, the first weight of the second expression parameter is adjusted according to the acquisition frame rate of the face images, and the second weight of the target driving parameters is adjusted according to the distance between each pair of models. The target driving parameters and the determined weights are sent to the rendering end, and the rendering end drives the prestored parameterized head model to move using the first and second expression parameters in accordance with the first and second weights, completing the three-dimensional reconstruction. Using the second expression parameter extracted from the voice data remedies reconstruction failures caused by a missing first expression parameter, improving the robustness of the three-dimensional reconstruction; moreover, because the influence of sound on facial expression is taken into account, reconstructing in combination with voice data makes the reconstructed head three-dimensional model more realistic.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 schematically illustrates a conventional three-dimensional reconstruction system architecture diagram provided by an embodiment of the present application;
FIG. 2 is a diagram illustrating an architecture of a three-dimensional reconstruction system provided by an embodiment of the present application;
fig. 3 is a flowchart illustrating a method for three-dimensional reconstruction of a head according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a remote three-dimensional communication interaction scenario provided by an embodiment of the application;
FIG. 5 is a diagram illustrating the relationship between three head parameters and a head model provided by an embodiment of the present application;
fig. 6 schematically illustrates a driving process provided by an embodiment of the present application;
fig. 7 schematically shows a structure diagram of a reconstruction terminal according to an embodiment of the present application.
Detailed Description
To make the objects, embodiments and advantages of the present application clearer, the following is a clear and complete description of exemplary embodiments of the present application with reference to the attached drawings in exemplary embodiments of the present application, and it is apparent that the exemplary embodiments described are only a part of the embodiments of the present application, and not all of the embodiments.
All other embodiments that a person skilled in the art can derive from the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. In addition, while the disclosure herein is presented through one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also be implemented separately as a complete embodiment.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and are not necessarily intended to limit the order or sequence of any particular one, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
The term "module," as used herein, refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
At present, the architecture of a remote three-dimensional communication interaction system is shown in fig. 1 and comprises an acquisition end, a transmission end, and a rendering display end. The acquisition end collects data (RGB images, RGBD images, and voice data), reconstructs a three-dimensional model from the collected data, and sends the three-dimensional model data to the transmission end; the transmission end encodes the data and transmits it to the rendering display end; the rendering display end receives and decodes the three-dimensional model data, then renders and displays the character model in the three-dimensional scene according to the decoded data.
Head reconstruction is a very important component of human body three-dimensional reconstruction: the fidelity of the head model directly affects the face-to-face immersion of VR/AR display during remote three-dimensional communication interaction.
Currently, head three-dimensional reconstruction is expressed through a parameterized head model. A parameterized head model is obtained by applying dimensionality-reduction analysis (including but not limited to principal component analysis and network auto-encoding) to a large number of pre-scanned high-precision three-dimensional human head models to generate a set of basis functions; blending these basis functions linearly or nonlinearly yields different head three-dimensional models, so the blended basis functions serve as a parameterized expression of the human head model.
During remote interaction, changes in voice also cause changes in facial motion, so voice can serve as a medium for conveying three-dimensional head information. At present, methods for driving facial animation with voice data mainly include: 1) extracting mouth-shape features from audio and video based on audio-track information or pronunciation phonemes, and driving the facial animation with the mouth-shape features; 2) constructing facial animation based on a parameterized head model or a physiological face model; 3) fusing the facial actions represented by the voice with expression vectors, and driving facial motion with the fused features; 4) selecting expressions of the corresponding emotion from the facial actions represented by the voice, and driving the facial animation based on the selected expressions.
In the embodiment of the application, the remote three-dimensional communication interaction system mainly relates to a real-time three-dimensional reconstruction technology, a three-dimensional data coding, decoding and transmission technology, an immersive VR/AR rendering display technology and the like.
As can be seen from the system architecture shown in fig. 1, three-dimensional reconstruction first obtains collected data from various sensors and then processes it with a three-dimensional reconstruction method to recover three-dimensional information. Human body three-dimensional information involves geometric shape, motion pose, and material data. A high-precision model usually means a larger data transmission volume and therefore rendering delay, while immersive VR/AR rendering at the display end usually requires high model precision; rendering delay thus conflicts with model precision.
In order to balance the transmitted data volume against rendering quality, an embodiment of the application provides the remote three-dimensional communication interaction system architecture shown in fig. 2. In this architecture, when the acquisition end performs head three-dimensional reconstruction using the collected image and voice data, it extracts the driving parameters that drive the parameterized head model to move; the cloud transmits only these driving parameters, and the rendering display end drives the parameterized head model based on them. Compared with transmitting the three-dimensional model data as in fig. 1, the data volume is reduced by nearly 100 times, so rendering delay caused by data transmission is reduced while the accuracy of the reconstructed model is maintained.
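A rough back-of-the-envelope check of that ratio, with assumed payload sizes (the vertex count comes from the FLAME description later in this document; the parameter count is an illustrative assumption, not a figure from the application):

```python
BYTES_PER_FLOAT = 4
VERTS = 5023                                  # FLAME mesh vertex count (see below)

# Transmitting reconstructed geometry: xyz coordinates per vertex, per frame.
mesh_bytes = VERTS * 3 * BYTES_PER_FLOAT      # ~60 KB per frame

# Transmitting only driving parameters: ~100 expression/pose coefficients
# plus the two weights (dimension assumed purely for illustration).
param_bytes = (100 + 2) * BYTES_PER_FLOAT     # ~0.4 KB per frame

print(round(mesh_bytes / param_bytes))        # ~148, i.e. on the order of 100x
```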
Specifically:
the acquisition end comprises an RGBD camera, a microphone and a host or a workstation, wherein the RGBD camera is used for acquiring an RGB image and an RGBD image; the microphone is used for collecting voice data in the interaction process; the host or the workstation is used for carrying out face recognition on the RGB image to obtain a face image, extracting expression parameters from the face image and the voice data, extracting depth data from the RGBD image, optimizing the expression parameters by using the depth data, and adjusting the weight of the expression parameters for driving the parameterized head model.
The transmission end may be a cloud server, a server cluster, or the like, and encodes the extracted expression parameters and transmits them to the rendering display end.
The rendering display end is a VR/AR display terminal with interaction capability. It stores the three-dimensional model data of both interacting parties (including texture data, material data, pose data, and the like), decodes the received expression parameters, drives the parameterized head model to move to obtain a realistic model matching the target object, and renders and displays the three-dimensional model.
It should be noted that the system architecture shown in fig. 2 may be deployed according to different usage scenarios. For example, in a live broadcast scenario, the anchor side deploys the acquisition end and the user side deploys the rendering display end, so that users can view the three-dimensional model through the rendering display end and experience the immersion of face-to-face interaction in the virtual world. In a conference scenario, both the acquisition end and the rendering display end are deployed in the two conference rooms of a teleconference at the same time, enabling real-time remote three-dimensional communication between the two rooms.
Based on the system architecture shown in fig. 2, fig. 3 exemplarily shows a head three-dimensional reconstruction method provided by an embodiment of the present application, which is performed by an acquisition end and mainly includes the following steps:
s301: and acquiring voice data corresponding to the face image aiming at each frame of face image.
In step S301, the acquisition end collects, using a sound sensor (e.g., a microphone), the voice data corresponding to each frame of face image during the remote interaction.
S302: extract a first expression parameter from the face image and a second expression parameter from the corresponding voice data to obtain the target driving parameters.
In S302, face detection is performed on each frame of RGB image of the target object collected by the RGBD camera to obtain a face image; a first expression parameter is extracted from the face image, and a second expression parameter is extracted from the voice data corresponding to the face image, yielding the target driving parameters. The first and second expression parameters are transmitted through the transmission end to the rendering display end, which drives the prestored parameterized head model to move according to the first and second expression parameters.
Methods for extracting the first expression parameter from the face image are well established, including but not limited to principal component analysis (PCA), convolutional neural networks (CNN), and hidden Markov models (HMM), and are not detailed in this embodiment. The process of extracting the second expression parameter from the voice data is described below.
Specifically, a training sample set is obtained from an audio-video database, feature vectors are extracted from the audio-video frames, and a neural network model is trained on the extracted feature vectors to obtain a model that captures the complex relation between speech and facial landmark points. The second expression parameter is then extracted from the voice data using the trained model, as sketched below.
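A minimal sketch of such a speech-to-expression regressor; the network architecture, feature type, and output dimension are assumptions chosen for illustration, since the application does not specify them:

```python
import torch
import torch.nn as nn

class Speech2Expression(nn.Module):
    """Regress per-frame expression parameters e2 from windowed speech features."""
    def __init__(self, n_mfcc=40, n_expr=50):    # feature/output sizes assumed
        super().__init__()
        self.gru = nn.GRU(input_size=n_mfcc, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, n_expr)

    def forward(self, mfcc):                     # mfcc: (batch, frames, n_mfcc)
        feats, _ = self.gru(mfcc)
        return self.head(feats)                  # (batch, frames, n_expr)

# Training would regress against expression parameters fitted to the
# synchronized video frames (the facial-landmark supervision mentioned above).
model = Speech2Expression()
e2 = model(torch.randn(1, 30, 40))               # 30 audio frames -> (1, 30, 50)
```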
S303: adjust the first weight of the second expression parameter according to the acquisition frame rate of the face image, and adjust the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects.
In S303, the acquisition frame rate of the face images is the frame rate at which the RGBD camera collects RGB images. It is compared with the rendering frame rate of the rendering display end, and the weight of the expression parameter extracted from the voice data corresponding to the face image is adjusted according to the comparison result. It should be noted that the acquisition frame rate and the rendering frame rate may be preset and are determined by the hardware of the respective devices.
In specific implementation, if the acquisition frame rate is less than the rendering frame rate, then for a face image lacking a first expression parameter, the first weight of the second expression parameter extracted from the voice data corresponding to that face image is set to 1, compensating for the effect of the missing first expression parameter on model reconstruction; for a face image that has a first expression parameter, the first weight of the second expression parameter extracted from the corresponding voice data is turned down. If the acquisition frame rate is not less than the rendering frame rate, every frame at the rendering display end has a corresponding first expression parameter to drive the parameterized head model to move; in this case, the first weight of the second expression parameter corresponding to the face image can be turned down to reduce the influence of the second expression parameter on the model.
For example, assume the rendering frame rate set by the rendering display end is 60 frames and the acquisition frame rate is only 40 frames; then 20 frames at the rendering display end lack the first expression parameter needed to drive the parameterized head model to move. For those 20 missing frames, the first weight of the second expression parameter extracted from the voice data corresponding to the face image is set to 1, i.e., the corresponding frames of the parameterized head model are driven by sound; for the 40 frames that have expression parameters, the first weight of the corresponding second expression parameter is reduced.
It should be noted that the embodiment of the present application does not limit the manner of adjusting the first weight: it may be decreased by a set step size or set to a fixed value.

When the first weight is turned down in S303, its magnitude may be set according to the actual situation.

For example, when the expression parameter has a large influence on the facial changes (e.g., laughing), the turned-down first weight may be set below a preset threshold (e.g., to 0.1) to emphasize the driving effect of the expression parameter on the parameterized head model; when the expression parameter has a small influence on the facial changes (e.g., smiling), the turned-down first weight may be set above the preset threshold (e.g., to 0.5) to emphasize the driving effect of the voice data on the parameterized head model. A sketch of this adjustment rule follows.
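A hedged sketch of the first-weight rule described above; the step size and default weight are assumptions, not values from the application:

```python
def adjust_first_weight(has_first_expr: bool,
                        acq_fps: float,
                        render_fps: float,
                        w1: float = 1.0,
                        step: float = 0.5) -> float:
    """Weight applied to the voice-derived second expression parameter e2."""
    if acq_fps < render_fps and not has_first_expr:
        return 1.0                    # missing image frame: drive by sound alone
    return max(0.0, w1 - step)        # image frame available: turn the weight down

# Example from the text: 40 fps capture vs. 60 fps rendering.
print(adjust_first_weight(False, 40, 60))   # 1.0  (the 20 voice-driven frames)
print(adjust_first_weight(True, 40, 60))    # 0.5  (the 40 image-driven frames)
```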
Fig. 4 illustrates a remote three-dimensional communication interaction scene provided by an embodiment of the application. As shown in fig. 4, four objects communicate remotely, and their human body three-dimensional models are placed in the same virtual space to realize an immersive face-to-face interaction. The four objects play different roles in the communication; multiple objects may share the same role, and the positions of the objects' human body three-dimensional models in the virtual space may be preset.
It should be noted that the embodiment of the present application places no restriction on the reconstruction method of the human body three-dimensional model, which includes but is not limited to: reconstructing the human body three-dimensional model of the target object in advance from scan data captured by a scanner; or extracting three-dimensional data of a parameterized human body model (e.g., the SMPL model or the STAR model) from RGB images of the target object collected by a camera, and generating the human body three-dimensional model from the extracted data.
Taking the communication scenario shown in fig. 4 as an example, when S303 is executed, the distance between each pair of models is determined according to the position information of the human body three-dimensional models corresponding to the different objects in the same virtual space. Optionally, the distance used may be the Euclidean distance in three-dimensional space.
After the distance between each pair of models is determined, the second weight of the target driving parameter is adjusted according to the distance.
In the embodiment of the present application, different distances correspond to different model grades, and the correspondence may be established in advance, as shown in Table 1.
Table 1 Correspondence between distance and model grade

Distance    Model grade
D1          K1
D2          K2
D3          K3
D4          K4
...         ...
The closer the distance between two models, the more similar their grades and hence their fineness, i.e., the more similar the degree to which facial expression changes affect model details. The weight of the expression parameter can therefore be adjusted according to the distance between the two models.
When S303 is executed, the model grade of the parameterized head model of the target object is determined according to the correspondence between distance and model grade, and the second weight of the target driving parameter is adjusted according to the determined model grade.
In specific implementation, if the model grade is less than a first preset grade, indicating that the distance between the two models is large, the second weight of the target driving parameter is reduced; if the model grade is greater than a second preset grade, indicating that the distance between the two models is small, the second weight of the target driving parameter is increased, where the first preset grade is less than or equal to the second preset grade. The weighted target driving parameter is computed as:

M = W × (e1 + W1 × e2)    (Formula 1)

where e1 denotes the first expression parameter, e2 the second expression parameter, W1 the first weight corresponding to the second expression parameter, W the second weight corresponding to the target driving parameter (the target driving parameter comprising the first and second expression parameters), and M the weighted target driving parameter. A sketch of this adjustment follows.
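A sketch combining the distance-to-grade lookup with Formula 1; the distance table, grade thresholds, and step size are illustrative assumptions, not values from the application:

```python
import math

# Assumed stand-in for Table 1: (max distance, model grade), nearer -> finer grade.
GRADE_BY_DISTANCE = [(1.0, 4), (2.0, 3), (4.0, 2), (math.inf, 1)]

def model_grade(pos_a, pos_b):
    d = math.dist(pos_a, pos_b)                 # Euclidean distance in 3D space
    return next(g for dmax, g in GRADE_BY_DISTANCE if d <= dmax)

def adjust_second_weight(w, grade, first_preset=2, second_preset=3, step=0.2):
    if grade < first_preset:                    # far apart: coarse model, damp driving
        return max(0.0, w - step)
    if grade > second_preset:                   # close together: fine model, boost driving
        return min(1.0, w + step)
    return w

def weighted_driving_parameter(e1, e2, w1, w):  # Formula 1: M = W * (e1 + W1 * e2)
    return [w * (a + w1 * b) for a, b in zip(e1, e2)]
```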
S304: send the first expression parameter, the second expression parameter, the first weight and the second weight to the rendering end, so that the rendering end drives the parameterized head model to move using the first expression parameter and the second expression parameter in accordance with the first weight and the second weight.
The parameterized head model can express a head model with real-time non-rigid deformation characteristics using a small number of parameters, can generate a three-dimensional head model from a single picture, and is not affected by missing geometry in invisible regions. Classical parameterized head models include 3DMM, DECA, and FLAME; the model parameters mainly comprise shape, expression, and pose, and the face shape in the head three-dimensional model can be regarded as the joint result of these parameters. Among them, the DECA model supports recovering a head three-dimensional model with detail features (e.g., wrinkles) from a single picture.
The parameterized head model adopted in the embodiment of the application is the FLAME model, which consists of two parts: standard linear blend skinning (LBS) and blend shapes. The standard mesh model used has N = 5023 mesh vertices and K = 4 joints (located at the neck, the jaw, and the two eyeballs). The parameterized head model is formulated as:

M(β, θ, ψ) = W(T_P(β, θ, ψ), J(β), θ, ω)

T_P(β, θ, ψ) = T + B_s(β; s) + B_p(θ; p) + B_e(ψ; e)

where β denotes the head shape parameters, θ the head pose parameters (including the motion parameters of the head skeleton), and ψ the facial expression parameters; M(β, θ, ψ) uniquely identifies the vertex coordinates of the head three-dimensional geometric model. W() denotes the linear skinning function that transforms the head model mesh T along the joints, J() the function predicting the positions of the different head joint points, T the head model mesh, B_s() the influence function of the head shape parameters on the head model mesh T, B_p() the influence function of the head pose parameters on T, B_e() the influence function of the facial expression parameters on T, and T_P() the mesh after these corrective offsets are applied. s, p, e and ω denote the head shape weights, head pose weights, facial expression weights and skinning weights, respectively; they are obtained by training on pre-constructed head sample data.
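Purely as an illustration of the blend-shape portion of the formula above (random stand-in bases with assumed dimensions; real FLAME bases are learned from scan data, and the pose corrective B_p and skinning step W() are omitted here):

```python
import numpy as np

N, n_shape, n_expr = 5023, 10, 10          # vertex count from the text; basis sizes assumed

T_bar = np.zeros((N, 3))                   # template head mesh T
S = np.random.randn(n_shape, N, 3) * 0.01  # shape blend-shape basis (stand-in)
E = np.random.randn(n_expr, N, 3) * 0.01   # expression blend-shape basis (stand-in)

def t_p(beta, psi):
    """T + B_s(beta) + B_e(psi): shape- and expression-corrected vertices."""
    return T_bar + np.tensordot(beta, S, axes=1) + np.tensordot(psi, E, axes=1)

verts = t_p(np.random.randn(n_shape), np.random.randn(n_expr))   # (5023, 3)
```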
Fig. 5 exemplarily shows the relationship between the three head parameters and the head model provided by an embodiment of the present application: part (a) shows the influence of the head shape parameters on the geometric model, part (b) the influence of the head pose parameters, and part (c) the influence of the facial expression parameters.
When S304 is executed, the acquisition end sends the determined weights and expression parameters to the rendering end, and the rendering end drives the FLAME model to move.
In an alternative embodiment, a Voice Operated Character Animation (VOCA) model may be used to map the voice data to expression parameters that drive the model.
The VOCA model uses a unique 4D face dataset comprising approximately 29 minutes of 4D scans captured at 60 fps together with synchronized audio from 12 speakers. A neural network trained on this face dataset can separate facial motion from identity. Because the dataset contains synchronized audio from multiple speakers, the trained model can also learn a variety of realistic speaking styles. During animation, the VOCA model additionally provides animation controls for changing the speaking style, the identity-dependent facial shape, and the pose (i.e., head, jaw, and eyeball rotation).
It should be noted that VOCA, as a learned model, can take any speech signal as input, even speech in languages other than English (e.g., Chinese, French, Japanese), and can realistically animate a wide range of adult faces.
The VOCA model outputs realistic facial animation from input speech data and static three-dimensional head mesh data (the embodiment of the application uses the FLAME model as the static three-dimensional head mesh data), following the data flow sketched below.
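The data flow of such a speech-driven step, in outline; all function names and feature shapes here are hypothetical placeholders for illustration, not the actual VOCA API:

```python
import numpy as np

def extract_speech_features(audio, sr, fps=60):
    """Placeholder: windowed speech features aligned to 60 fps animation frames."""
    n_frames = int(len(audio) / sr * fps)
    return np.zeros((n_frames, 29, 16))     # DeepSpeech-style windows (shape assumed)

def regress_offsets(features, template):
    """Placeholder for the trained network: features -> per-frame vertex offsets."""
    return np.zeros((features.shape[0], *template.shape))

template = np.zeros((5023, 3))              # static FLAME mesh as the identity template
feats = extract_speech_features(np.zeros(16000), sr=16000)   # 1 s of audio
animation = template + regress_offsets(feats, template)      # (60, 5023, 3)
```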
Fig. 6 exemplarily shows the driving process provided by an embodiment of the application. As shown in fig. 6, after the rendering end receives the first weight, the second weight, the first expression parameter and the second expression parameter sent by the acquisition end, it drives the parameterized head model to move using the first expression parameter weighted by the second weight and the second expression parameter weighted by the product of the first weight and the second weight, obtaining the driven head three-dimensional model.
In an alternative implementation manner, the parameterized head model in the embodiment of the present application may be fitted in advance according to geometric data extracted from a face image of a target object.
To improve the realism of the parameterized head model, in some embodiments the model may also be optimized using depth images captured by the RGBD camera. Specifically, a face depth image is segmented from the RGBD image of the target object collected by the RGBD camera; depth data of the target object is extracted from each frame of the face depth image; and the geometric parameters of the parameterized head model are updated according to the extracted depth data, thereby optimizing the parameterized head model. One way to realize such a refinement step is sketched below.
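A minimal sketch of depth-based refinement: back-project the face depth image into a point cloud and nudge the model vertices toward it (a simple nearest-point update chosen for illustration; the application does not specify the optimizer):

```python
import numpy as np
from scipy.spatial import cKDTree

def backproject(depth, fx, fy, cx, cy):
    """Face depth image (meters) -> Nx3 point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0
    x = (u.ravel() - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    return np.stack([x, y, z], axis=1)[valid]

def refine_vertices(verts, cloud, step=0.1):
    """One update: move each model vertex toward its nearest observed depth point."""
    _, idx = cKDTree(cloud).query(verts)
    return verts + step * (cloud[idx] - verts)
```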
In the embodiment of the application, when multiple target objects engage in remote three-dimensional communication interaction, the human body three-dimensional models of the different target objects change (in both distance and pose) according to the actual motion. To keep each human body three-dimensional model matched to the pose of the real target object, the embodiment of the application drives the parameterized head model with the expression parameters extracted from both the face image and the voice data, which improves the realism of the reconstructed model, remedies reconstruction failures caused by missing face images, and improves the robustness of the three-dimensional reconstruction. Specifically, the first weight of the second expression parameter extracted from the voice data is adjusted according to the comparison of the acquisition frame rate with the rendering frame rate, and the second weight shared by the first and second expression parameters is adjusted according to the distance between the human body three-dimensional models in the virtual space. The extracted first expression parameter, second expression parameter, first weight and second weight are sent to the rendering display end, which drives the parameterized head model to move using the first expression parameter weighted by the second weight and the second expression parameter weighted by the product of the first and second weights. The model can thus be driven by sound when geometric data is not transmitted or the acquisition frame rate is insufficient, improving the robustness of the three-dimensional reconstruction; and adjusting the second weight according to the distance between each pair of models in the virtual space lets the vertices of the target object's head three-dimensional model change adaptively, improving rendering efficiency.
Based on the same technical concept, an embodiment of the present application provides a reconstruction device that can execute the head three-dimensional reconstruction method flow provided by the embodiments of the present application and achieve the same technical effect; the process is not repeated here.
Referring to fig. 7, the reconstruction apparatus includes a processor 701, a memory 702, a display 703 and at least one external communication interface 704, the display 703 and the memory 702 are connected to the processor 701 through a bus 705; the at least one external communication interface 704 is configured to acquire image and voice data of a target object, the display 703 is configured to display a driven three-dimensional model of the head, the memory 702 stores a computer program, and the processor 701 implements the head three-dimensional reconstruction method in the foregoing embodiments by executing the computer program.
Embodiments of the present application also provide a computer-readable storage medium for storing instructions that, when executed, may implement the methods of the foregoing embodiments.
The embodiments of the present application also provide a computer program product storing a computer program for executing the method of the foregoing embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. A method of three-dimensional reconstruction of a head, comprising:
acquiring voice data corresponding to each frame of face image;
extracting a first expression parameter from the face image and extracting a second expression parameter from corresponding voice data to obtain a target driving parameter, wherein the target driving parameter is used for driving a prestored parameterized head model to move;
adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image, and adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects;
and sending the first expression parameter, the second expression parameter, the first weight and the second weight to a rendering end, so that the rendering end drives the parameterized head model to move using the first expression parameter and the second expression parameter in accordance with the first weight and the second weight.
2. The method of claim 1, wherein the adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image comprises:
if the acquisition frame rate is less than a preset rendering frame rate of the rendering end, setting, for a face image lacking a first expression parameter, the first weight of the second expression parameter extracted from the voice data corresponding to that face image to 1, and reducing, for a face image having a first expression parameter, the first weight of the second expression parameter extracted from the voice data corresponding to that face image; or
if the acquisition frame rate is not less than the preset rendering frame rate of the rendering end, reducing the first weight of the second expression parameter extracted from the voice data corresponding to the face image.
3. The method of claim 1, wherein the adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects comprises:
determining the model grade of the parameterized head model of the target object according to a pre-established correspondence between distance and model grade;
and adjusting the second weight of the target driving parameter according to the model grade.
4. The method of claim 3, wherein said adjusting the second weight of the target driving parameter according to the model grade comprises:
if the model grade is less than a first preset grade, reducing the second weight of the target driving parameter; or
if the model grade is greater than a second preset grade, increasing the second weight of the target driving parameter, wherein the first preset grade is less than or equal to the second preset grade.
5. The method of any one of claims 1-4, further comprising:
extracting depth data from each frame of face depth image;
optimizing the parameterized head model based on the extracted depth data.
6. A reconstruction device comprising a processor, a memory, a display and at least one external communication interface, said processor, said memory, said display and said external communication interface being connected by a bus;
the at least one communication interface is configured to acquire an image of a target object and acquire voice data of the target object;
the memory having stored therein a computer program, the processor being configured to perform the following operations based on the computer program:
acquiring voice data corresponding to each frame of face image;
extracting a first expression parameter from the face image and extracting a second expression parameter from corresponding voice data to obtain a target driving parameter, wherein the target driving parameter is used for driving a prestored parameterized head model to move;
adjusting the first weight of the second expression parameter according to the acquisition frame rate of the facial image, and adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects;
and sending the first expression parameter, the second expression parameter, the first weight and the second weight to a rendering end, so that the rendering end drives the parameterized head model to move using the first expression parameter and the second expression parameter in accordance with the first weight and the second weight.
7. The reconstruction device of claim 6, wherein, when adjusting the first weight of the second expression parameter according to the acquisition frame rate of the face image, the processor is configured to:
if the acquisition frame rate is less than a preset rendering frame rate of the rendering end, set, for a face image lacking a first expression parameter, the first weight of the second expression parameter extracted from the corresponding voice data to 1, and reduce, for a face image having a first expression parameter, the first weight of the second expression parameter extracted from the corresponding voice data; or
if the acquisition frame rate is not less than the rendering frame rate preset by the rendering end, reduce the first weight of the second expression parameter extracted from the voice data corresponding to the face image.
8. The reconstruction device of claim 6, wherein, when adjusting the second weight of the target driving parameter according to the distance between the human body three-dimensional models corresponding to every two target objects, the processor is specifically configured to:
determine the model grade of each model according to a pre-established correspondence between distance and model grade;
and adjust the second weight of the target driving parameter according to the model grade.
9. The reconstruction device of claim 8, wherein, when adjusting the second weight of the target driving parameter according to the model grade, the processor is specifically configured to:
if the model grade is less than a first preset grade, reduce the second weight of the target driving parameter; or
if the model grade is greater than a second preset grade, increase the second weight of the target driving parameter, wherein the first preset grade is less than or equal to the second preset grade.
10. The reconstruction device of any one of claims 6-9, wherein the processor is further configured to:
extracting depth data from each frame of face depth image;
optimizing the parameterized head model based on the extracted depth data.
CN202111197148.5A 2021-10-14 2021-10-14 Head three-dimensional reconstruction method and equipment Pending CN115984452A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111197148.5A CN115984452A (en) 2021-10-14 2021-10-14 Head three-dimensional reconstruction method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111197148.5A CN115984452A (en) 2021-10-14 2021-10-14 Head three-dimensional reconstruction method and equipment

Publications (1)

Publication Number Publication Date
CN115984452A true CN115984452A (en) 2023-04-18

Family

ID=85958613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111197148.5A Pending CN115984452A (en) 2021-10-14 2021-10-14 Head three-dimensional reconstruction method and equipment

Country Status (1)

Country Link
CN (1) CN115984452A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523051A (en) * 2024-01-08 2024-02-06 南京硅基智能科技有限公司 Method, device, equipment and storage medium for generating dynamic image based on audio
CN117523051B (en) * 2024-01-08 2024-05-07 南京硅基智能科技有限公司 Method, device, equipment and storage medium for generating dynamic image based on audio

Similar Documents

Publication Publication Date Title
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
GB2601162A (en) Methods and systems for video translation
US11393150B2 (en) Generating an animation rig for use in animating a computer-generated character based on facial scans of an actor and a muscle model
CN110413108B (en) Virtual picture processing method, device and system, electronic equipment and storage medium
CN110418095B (en) Virtual scene processing method and device, electronic equipment and storage medium
CN115909015B (en) Method and device for constructing deformable nerve radiation field network
CN115209180A (en) Video generation method and device
KR102373608B1 (en) Electronic apparatus and method for digital human image formation, and program stored in computer readable medium performing the same
CN117036583A (en) Video generation method, device, storage medium and computer equipment
CN115049016A (en) Model driving method and device based on emotion recognition
WO2022060230A1 (en) Systems and methods for building a pseudo-muscle topology of a live actor in computer animation
CN116431036A (en) Virtual online teaching system based on meta universe
CN115984452A (en) Head three-dimensional reconstruction method and equipment
US20220076409A1 (en) Systems and Methods for Building a Skin-to-Muscle Transformation in Computer Animation
CN116958353B (en) Holographic projection method based on dynamic capture and related device
US11875504B2 (en) Systems and methods for building a muscle-to-skin transformation in computer animation
US11158103B1 (en) Systems and methods for data bundles in computer animation
KR20200134623A (en) Apparatus and Method for providing facial motion retargeting of 3 dimensional virtual character
WO2022055366A1 (en) Systems and methods for building a skin-to-muscle transformation in computer animation
US11587278B1 (en) Systems and methods for computer animation of an artificial character using facial poses from a live actor
US20230154094A1 (en) Systems and Methods for Computer Animation of an Artificial Character Using Facial Poses From a Live Actor
Zand Attention-Based Audio Driven Facial Animation
CN117933318A (en) Method for constructing teaching digital person
JP2023038843A (en) Terminal, learning method and program
CN116129487A (en) Three-dimensional image pronunciation head posture simulation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination