CN115909015A - Construction method and device of deformable neural radiation field network

Construction method and device of deformable neural radiation field network

Info

Publication number
CN115909015A
CN115909015A
Authority
CN
China
Prior art keywords
video frame
video
radiation field
field network
deformable
Prior art date
Legal status
Granted
Application number
CN202310119675.7A
Other languages
Chinese (zh)
Other versions
CN115909015B (en)
Inventor
杨延东
朱红
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202310119675.7A
Publication of CN115909015A
Application granted
Publication of CN115909015B
Legal status: Active (current)


Abstract

The embodiment of the invention provides a method and a device for constructing a deformable neural radiation field network, relating to the technical field of model training. The method comprises the following steps: acquiring first image parameters and audio characteristic data corresponding to video frames in a training video containing a person; obtaining the face contour of the person in a video frame according to the first image parameter corresponding to the video frame; inputting the face contour and the audio characteristic data into a deformable neural radiation field network to obtain a rendered video frame containing a virtual character; acquiring a third image parameter corresponding to a video frame of a target video corresponding to the training video, and calculating error data between the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame; and when it is determined from the error data that the deformable neural radiation field network satisfies a convergence condition, obtaining the trained deformable neural radiation field network. A virtual character with greater realism and expressiveness can be synthesized through the trained deformable neural radiation field network.

Description

Construction method and device of deformable neural radiation field network
Technical Field
The embodiment of the invention relates to the technical field of model training, and in particular to a method and a device for constructing a deformable neural radiation field network.
Background
Currently, audio-driven dynamic digital speakers (talking heads) are receiving more and more attention. A digital speaker can be understood as a synthesized virtual character and can be widely applied in practical scenarios such as retail anchoring, image spokespersons, teleconferencing, and film production. However, synthesizing a realistic and expressive digital speaker is a very challenging task; the technical difficulties lie not only in real-time synchronization between the digital speaker's mouth and the audio, but also in handling geometric dynamic details such as eye contact, as well as the absence of 3D (three-dimensional) spatial supervision information.
Existing digital speaker synthesis is a hot research field, and the main technical routes can be divided into three categories: image-based models, implicit models, and explicit models. First, image-based model methods can synthesize high-quality results, but distortion may occur when dealing with large pose or expression changes, and they lack geometric and temporal consistency, possibly because the deformation information of the three-dimensional surface is obtained from 2D (two-dimensional) pictures. Second, implicit-model-based methods can solve the consistency problem of spatial geometry and time sequence to a certain extent, but are limited to static scene reconstruction and are difficult to generalize to unseen expressions or poses. Third, explicit model synthesis can produce geometrically consistent and easily controlled digital humans, but is limited to cranial structures and cannot synthesize hair information, or leads to spatio-temporal inconsistency problems due to loose constraints on the underlying geometry.
Disclosure of Invention
The embodiment of the invention provides a method and a device for constructing a deformable neural radiation field network, an electronic device, and a computer-readable storage medium, so as to solve or partially solve the problem that virtual characters synthesized based on audio driving in the prior art lack realism and expressiveness.
The embodiment of the invention discloses a method for constructing a deformable neural radiation field network, which comprises the following steps:
acquiring first image parameters and audio characteristic data corresponding to video frames in a training video containing a person; the training video has a corresponding target video, and the target video is a video containing a virtual character corresponding to the person in the training video;
obtaining the face contour of the person in the video frame according to the first image parameter corresponding to the video frame;
inputting the face contour and the audio characteristic data into a deformable neural radiation field network to obtain a rendered video frame containing a virtual character; the rendered video frame comprises a second image parameter of a preset visual angle;
acquiring a third image parameter corresponding to a video frame of the target video, and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame;
and when the deformable neural radiation field network is determined to meet the convergence condition according to the error data, obtaining the trained deformable neural radiation field network.
Optionally, the obtaining the face contour of the person in the video frame according to the first image parameter corresponding to the video frame includes:
inputting the first image parameters into a preset human face model for training;
when the preset face model meets the convergence condition, obtaining a trained face model;
and obtaining the human face contour of the figure in the video frame according to the trained human face model.
Optionally, the first image parameters comprise at least camera parameters, appearance data, expression data and pose data.
Optionally, the inputting the face contour and the audio feature data into a deformable neural radiation field network to obtain a rendered video frame including a virtual character includes:
inputting the face contour and the audio feature data into the deformable neural radiation field network;
determining incident light rays of the video frame corresponding to the face contour according to the face contour;
according to the incident light, determining the positions of the incident light and the sampling points of the human face contour;
and obtaining a rendering video frame containing the virtual character according to the sampling point position and the audio characteristic data.
Optionally, the face contour is used to distinguish a foreground and a background corresponding to a video frame in the training video.
Optionally, the method further comprises:
when the incident ray intersects with the human face contour, taking the incident ray as a foreground corresponding to a video frame in the training video;
and when the incident ray does not intersect with the face contour, taking the incident ray as a background corresponding to a video frame in the training video.
Optionally, the method further comprises:
and determining the coordinate code and the view angle direction of the sampling point position according to the sampling point position of the incident light.
Optionally, the deformable neural radiation field network comprises an implicit deformation code and an implicit appearance code; the implicit deformation code is used for constructing expression changes of the face appearance of the person in the video frame, and the implicit appearance code is used for constructing changes of different illumination and post-shooting processing of the video frame.
Optionally, the inputting the face contour and the audio feature data into a deformable neural radiation field network to obtain a rendered video frame including a virtual character includes:
acquiring sampling point positions of incident light rays corresponding to the video frames, and determining coordinate codes and visual angle directions of the sampling point positions;
and inputting the coordinate code and the visual angle direction of the sampling point position, together with the implicit deformation code and the implicit appearance code in the deformable neural radiation field network, into the deformable neural radiation field network to obtain a rendered video frame containing a virtual character.
Optionally, the deformable neural radiation field network comprises a radiation field network and an encoder network.
Optionally, the radiation field network is configured to generate a volume density of the video frame for a preset view angle, and the encoder network is configured to generate a color of the video frame for the preset view angle.
Optionally, the second image parameters include at least volume density and color.
Optionally, the method further comprises:
and inputting the first image parameters corresponding to the video frame and the implicit deformation code of the deformable neural radiation field network into the radiation field network of the deformable neural radiation field network to obtain the volume density corresponding to the video frame at the preset visual angle.
Optionally, the method further comprises:
and inputting the visual angle direction corresponding to the video frame, the implicit appearance code of the deformable neural radiation field network, and the audio characteristic data into the encoder network of the deformable neural radiation field network to obtain the color corresponding to the video frame at the preset visual angle.
Optionally, the method further comprises:
and obtaining a rendering video frame corresponding to the training video according to the volume density and the color.
Optionally, the obtaining a third image parameter corresponding to a video frame of the target video and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame includes:
acquiring the volume density and color corresponding to the video frame in the target video;
and calculating error data of the volume density and the color corresponding to the video frame in the target video and the volume density and the color corresponding to the rendered video frame according to the volume density and the color corresponding to the rendered video frame.
Optionally, after obtaining the trained deformable neural radiation field network when it is determined that the deformable neural radiation field network satisfies the convergence condition according to the error data, the method further includes:
and synthesizing the video containing the characters into the video containing the virtual characters according to the trained deformable nerve radiation field network.
Optionally, the synthesizing a video including a human being into a video including a virtual human being according to the trained deformable nerve radiation field network includes:
taking the error data as an optimization target of the deformable neural radiation field network;
and optimizing the optimization target, and synthesizing the video containing the characters into the video containing the virtual characters.
The embodiment of the invention also discloses a device for constructing a deformable neural radiation field network, which comprises:
the data acquisition module is used for acquiring first image parameters and audio characteristic data corresponding to video frames in a training video containing a person; the training video has a corresponding target video, and the target video is a video containing a virtual character corresponding to the person in the training video;
the face contour acquisition module is used for obtaining the face contour of the person in the video frame according to the first image parameter corresponding to the video frame;
the rendered video frame acquisition module is used for inputting the face contour and the audio characteristic data into a deformable neural radiation field network to obtain a rendered video frame containing a virtual character; the rendered video frame comprises a second image parameter of a preset visual angle;
the error data calculation module is used for acquiring a third image parameter corresponding to a video frame of the target video, and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame;
and the deformable neural radiation field network construction module is used for obtaining the trained deformable neural radiation field network when it is determined from the error data that the deformable neural radiation field network satisfies the convergence condition.
Optionally, the face contour acquiring module is specifically configured to:
inputting the first image parameters to a preset human face model for training;
when the preset face model meets the convergence condition, obtaining a trained face model;
and obtaining the human face contour of the figure in the video frame according to the trained human face model.
Optionally, the rendered video frame acquiring module is specifically configured to:
inputting the face contour and the audio feature data into the deformable neural radiation field network;
determining incident light rays of the video frame corresponding to the face contour according to the face contour;
according to the incident light, determining the positions of the incident light and the sampling points of the human face contour;
and obtaining a rendered video frame containing the virtual character according to the sampling point position and the audio characteristic data.
Optionally, the apparatus further comprises:
the foreground obtaining module is used for taking the incident ray as a foreground corresponding to a video frame in the training video when the incident ray is intersected with the face contour;
and the background acquisition module is used for taking the incident ray as a background corresponding to a video frame in the training video when the incident ray is not intersected with the face contour.
Optionally, the apparatus further comprises:
and the position data acquisition module is used for determining the coordinate code and the visual angle direction of the position of the sampling point according to the position of the sampling point of the incident light.
Optionally, the rendered video frame acquiring module is specifically configured to:
acquiring sampling point positions of incident light rays corresponding to the video frames, and determining coordinate codes and view angle directions of the sampling point positions;
and inputting the coordinate code and the visual angle direction of the sampling point position, together with the implicit deformation code and the implicit appearance code in the deformable neural radiation field network, into the deformable neural radiation field network to obtain a rendered video frame containing a virtual character.
Optionally, the apparatus further comprises:
and the volume density acquisition module is used for inputting the first image parameters corresponding to the video frame and the implicit deformation code of the deformable nerve radiation field network into the radiation field network of the deformable nerve radiation field network to obtain the volume density corresponding to the video frame at a preset visual angle.
Optionally, the apparatus further comprises:
and the color acquisition module is used for inputting the visual angle direction corresponding to the video frame, the implicit appearance code of the deformable nerve radiation field network and the audio characteristic data into an encoder network of the deformable nerve radiation field network to obtain the color corresponding to the video frame at the preset visual angle.
Optionally, the error data calculation module is specifically configured to:
acquiring the volume density and color corresponding to the video frame in the target video;
and calculating error data of the volume density and the color corresponding to the video frame in the target video and the volume density and the color corresponding to the rendered video frame according to the volume density and the color corresponding to the rendered video frame.
Optionally, the apparatus further comprises:
and the first virtual character video synthesis module is used for synthesizing the videos containing the characters into the video containing the virtual characters according to the trained deformable nerve radiation field network.
Optionally, the apparatus further comprises:
an optimization target determination module, configured to use the error data as an optimization target of the deformable neural radiation field network;
and the second virtual character video synthesis module is used for optimizing the optimization target and synthesizing the video containing the characters into the video containing the virtual characters.
The embodiment of the invention also discloses electronic equipment which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory finish mutual communication through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method according to the embodiment of the present invention when executing the program stored in the memory.
Embodiments of the present invention also disclose a computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform the method according to the embodiments of the present invention.
The embodiment of the invention has the following advantages:
In the embodiment of the invention, first image parameters and audio characteristic data corresponding to a video frame in a training video containing a person are obtained, wherein the training video has a corresponding target video and the target video is a video containing a virtual character corresponding to the person in the training video; the face contour of the person in the video frame is obtained according to the first image parameters corresponding to the video frame; the face contour and the audio characteristic data are then input into a deformable neural radiation field network to obtain a rendered video frame containing the virtual character, the rendered video frame containing a second image parameter of a preset visual angle; a third image parameter corresponding to the video frame of the target video is further acquired, error data between the third image parameter corresponding to the video frame and the second image parameter corresponding to the rendered video frame corresponding to the video frame is calculated, and when it is determined from the error data that the deformable neural radiation field network satisfies the convergence condition, the trained deformable neural radiation field network is obtained. By processing the input video frames containing the person through the deformable neural radiation field network to obtain the trained deformable neural radiation field network, and synthesizing the video containing the person into a video containing the virtual character according to the trained deformable neural radiation field network, the expressiveness and persuasiveness of the virtual character are effectively improved, and the smoothness and realism of the three-dimensional visual representation of the virtual character are improved at the same time.
Drawings
FIG. 1 is a schematic flow chart of a prior art method for synthesizing a virtual character according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a prior art virtual character synthesizing method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating steps of a method for constructing a network of deformable neural radiation fields according to an embodiment of the present invention;
fig. 4 is a schematic flowchart of a method for constructing a deformable neural radiation field network according to an embodiment of the present invention;
fig. 5 is a block diagram of a device for constructing a deformable neural radiation field network according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a computer-readable storage medium provided in an embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
As an example, audio-driven dynamic digital speakers are currently receiving more and more attention. A digital speaker can be understood as a synthesized virtual character and can be widely applied in practical scenarios such as retail anchoring, image spokespersons, teleconferencing, and film production. Existing digital speaker synthesis is a hot research field, and the main technical routes can be divided into three categories: image-based models, implicit models, and explicit models. First, image-based model methods can synthesize high-quality results, but distortion may occur when dealing with large pose or expression changes, and they lack geometric and temporal consistency, possibly because the deformation information of the three-dimensional surface is obtained from 2D pictures. Second, implicit-model-based methods can solve the consistency problem of spatial geometry and time sequence to a certain extent, but are limited to static scene reconstruction and are difficult to generalize to unseen expressions or poses. Third, explicit model synthesis can produce geometrically consistent and easily controlled digital humans, but is limited to cranial structures and cannot synthesize hair information, or leads to spatio-temporal inconsistency problems due to loose constraints on the underlying geometry. However, synthesizing a realistic and expressive digital speaker is a very challenging task; the technical difficulties lie not only in real-time synchronization between the digital speaker's mouth and the audio, but also in handling geometric dynamic details such as eye contact, as well as the lack of 3D spatial supervision information. Specifically, the existing technical solutions are mainly image-based models, implicit models, and explicit models, and their main characteristics are as follows:
Image-based models do not rely on any representation in 3D space; they either employ warping fields to transform an image to match a new pose or expression, or employ an encoder-decoder structure in which the encoder extracts an identity code from a given source image and the decoder synthesizes an output image based on this identity code and input features, possibly relying on information such as facial keypoints and facial contours. Although this approach can synthesize high-quality results, distortion may occur when dealing with large pose or expression changes, and it lacks geometric and temporal consistency, possibly because the deformation information of these three-dimensional surfaces is obtained from 2D pictures.
Implicit models generally use representations such as Signed Distance Functions (SDF) or voxels; this technical route represents human faces as discrete implicit feature voxel grids to synthesize dynamic transformations. Schemes combining Neural Radiance Fields (NeRF) with volume rendering have also received much attention, and such schemes generally utilize the low-dimensional parameters of a face model or an audio signal to synthesize digital speakers. Although the implicit model method can solve the consistency problem of spatial geometry and time sequence to a certain extent, it is limited to static scene reconstruction and is difficult to generalize to unseen expressions or poses.
Explicit models mainly adopt an explicit triangular mesh feature representation. Specifically, deformable model parameters are used as prior information to reconstruct the facial features of a digital speaker from incomplete (partially occluded) or noisy data (depth maps), where the explicit deformable model is constructed by fitting a series of 3D head scans and is used to provide statistical information on facial shape, expression, and geometric texture; a Generative Adversarial Network (GAN for short) is also adopted for generation and optimization.
In addition, there are methods that use 2D neural rendering to learn how to generate realistic digital speakers. Although these methods can generate geometrically consistent and easily controlled digital speakers, they are limited to cranial structures and cannot synthesize hair information, or lead to spatio-temporal inconsistency problems due to loose constraints on the underlying geometry.
Referring to fig. 1, there is shown a schematic flow diagram of a prior-art method for synthesizing a virtual character provided in an embodiment of the present invention. Specifically, as shown in fig. 1, this scheme synthesizes mouth motion and personalized features using a neural radiation field with decoupled facial attributes (DFA-NeRF for short): the mouth motion and personalized features are predicted from the audio and used as inputs to a dynamic neural radiation field, so that the mouth motion of the synthesized digital speaker is synchronized with the audio input, producing a more natural visual effect. As shown in fig. 1, the method uses the facial-attribute-decoupled neural radiation field framework combined with audio information to predict mouth activity and perform personalized modeling of facial expressions. The core observation is that audio information is highly correlated with mouth activity, while personalized motions such as head movement and blinking are weakly correlated with audio and differ from person to person. Specifically, motion and facial expression information are extracted from the video, where the facial motion and expression information can be decomposed into eye motion and mouth motion; a probabilistic model based on a Gaussian Process Variational Auto-Encoder (GP-VAE for short) is adopted to model personalized head and eye motion, and a contrastive learning strategy is adopted to jointly learn and associate audio semantics with mouth motion. Finally, the generated motion features are taken as the view direction, and the generated eye feature information and the synchronized audio feature information are concatenated to drive the facial-attribute-decoupled neural radiation field framework to render the virtual person (digital speaker).
However, this scheme currently cannot perform synthesis modeling of the hairstyle of the head, and its rendering process is slow, mainly because a probabilistic sampling model based on a Gaussian process is additionally adopted; moreover, the scheme does not support multi-language audio signal input.
Referring to fig. 2, there is shown a second schematic flow diagram of a prior-art method for synthesizing a virtual character provided in the embodiment of the present invention. As shown in fig. 2, this is an audio-driven digital speaker neural radiation field rendering scheme (AD-NeRF for short); specifically, the deformation of the head and the clothes is modeled by neural radiation fields, so as to solve the problem of inconsistent head movement and clothes movement.
Although this scheme can better solve the problem of inconsistent clothes and head movements, the mouth synthesis results can appear unnatural, which is caused by inconsistency between the audio signals in the inference and training processes; and when the mouth expression changes obviously, a certain degree of distortion occurs.
In contrast, one of the core points of the present invention is to obtain first image parameters and audio feature data corresponding to a video frame in a training video containing a person, wherein the training video has a corresponding target video and the target video is a video containing a virtual character corresponding to the person in the training video; the face contour of the person in the video frame is obtained according to the first image parameters corresponding to the video frame, and the face contour and the audio feature data are then input into a deformable neural radiation field network to obtain a rendered video frame containing the virtual character, wherein the rendered video frame contains a second image parameter of a preset view angle; a third image parameter corresponding to a video frame of the target video is then acquired, error data between the third image parameter corresponding to the video frame and the second image parameter corresponding to the rendered video frame corresponding to the video frame is calculated, and when it is determined from the error data that the deformable neural radiation field network satisfies the convergence condition, the trained deformable neural radiation field network is obtained, so that a video containing a person can be synthesized into a video containing a virtual character according to the trained deformable neural radiation field network. By processing the input video frames containing the person through the deformable neural radiation field network and then synthesizing the virtual character, the expressiveness and persuasiveness of the virtual character are effectively improved, and the fluency and realism of the three-dimensional visual representation of the virtual character are improved at the same time.
Referring to fig. 3, a flowchart illustrating steps of a method for constructing a deformable neural radiation field network provided in an embodiment of the present invention is shown, and specifically, the method may include the following steps:
Step 301, acquiring first image parameters and audio characteristic data corresponding to video frames in a training video containing a person; the training video has a corresponding target video, and the target video is a video containing a virtual character corresponding to the person in the training video;
As for the training video, it may be a video input into the deformable neural radiation field network for training, which may be understood as training data. The training video may be a monocular RGB (color mode) facial speaking video or a self-shot speaking video, which at least needs to contain a person; it can be understood that, besides the person, the video may also contain a background, such as a building. A video frame can also be understood as an image: one video frame generally corresponds to one image, and a video segment is composed of a plurality of video frames.
For the first image parameters, they may include camera parameters, appearance data, expression data, and pose data. The camera parameters may be the view direction, shooting light or illumination, and the like, when the camera is used for shooting the video; the appearance data may represent the shape of the person's head, or feature information of the hair, the face, and the like; the expression data may be the person's expression, such as happiness, anger, sorrow, or joy; the pose data may be the speaking motion of the person's mouth, or the pose of another part. It should be noted that, in the embodiment of the present invention, the listed data are kept simple for ease of understanding; that is, in practical applications, the data included in the first image parameters may be far more than those listed, and a person skilled in the art may select them according to the actual situation, which is not limited in the embodiment of the present invention.
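As an illustrative aid only (not part of the patent's disclosure), the per-frame first image parameters described above can be thought of as a simple record; the class and field names in the sketch below are assumptions chosen for illustration.

```python
# Illustrative sketch of the per-frame "first image parameters" described
# above; the class and field names are assumptions, not the patent's layout.
from dataclasses import dataclass
import numpy as np

@dataclass
class FirstImageParams:
    camera: np.ndarray       # camera parameters (e.g. pose / intrinsics) for the frame
    appearance: np.ndarray   # appearance / head-shape coefficients
    expression: np.ndarray   # facial expression coefficients
    pose: np.ndarray         # head and mouth pose coefficients
```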
The audio feature data may be feature information of an audio signal in a video, the audio feature information may be mel-frequency cepstral Coefficients (MFCC) features, and the mel-frequency cepstral Coefficients features are voice features most commonly used in terms of voice Recognition (Speech Recognition) and Speaker Recognition (Speaker Recognition).
For the target video, it may be the target video corresponding to the training video, and the target video is a video containing a virtual character corresponding to the person in the training video. The virtual character can be understood as a digital speaker. A digital speaker can be created by means of advanced technologies such as artificial intelligence and deep learning, with the five sense organs and body parts modeled according to human body proportions, so that it has mouth and lip expressions, body movements, voice and tone, image quality, and emotional expression similar to those of a real person, giving an overall impression of warmth, generosity, and naturalness.
In an example, referring to fig. 4, a schematic flowchart of a method for constructing a deformable neural radiation field network provided in an embodiment of the present invention is shown. As shown in the figure, a segment of video frames (Input Frames) is input, that is, a segment of video comprising a plurality of video frames, and the frame rate of the input speaker video is assumed to be sampled at 25 FPS. It should be noted that the synchronization of the audio signals corresponding to the video frames needs to be ensured at this time; audio alignment may be adopted to ensure this synchronization. The camera parameters, appearance data, expression data, and pose data of each frame may then be extracted by a facial expression detail capture and animation module (DECA, i.e., Detailed Expression Capture and Animation) and a facial keypoint fitting module (Landmark fitting), and at the same time the mel-frequency cepstral coefficient features of the audio signal may be extracted using a time window of 100 milliseconds.
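As a rough sketch of this audio preprocessing step (assuming librosa is available; the function name and window settings are illustrative, not taken from the patent), MFCC features can be extracted with a 100 ms analysis window and hopped once per video frame, so that each 25 FPS frame has an aligned audio feature vector:

```python
# Illustrative sketch only (assumed helper, not from the patent text):
# extract MFCC features with a 100 ms analysis window and align them
# to a 25 FPS video so that every video frame gets one audio feature vector.
import librosa
import numpy as np

def extract_aligned_mfcc(wav_path: str, fps: int = 25, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)   # load mono audio at 16 kHz
    win = int(0.100 * sr)                      # 100 ms analysis window
    hop = int(sr / fps)                        # one hop per video frame (40 ms at 25 FPS)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2048, win_length=win, hop_length=hop)
    return mfcc.T                              # shape: (num_video_frames, n_mfcc)
```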
In the embodiment of the invention, first image parameters and audio characteristic data corresponding to video frames in a training video containing a person are obtained, wherein the training video has a corresponding target video, and the target video is a video containing a virtual character corresponding to the person in the training video.
Step 302, obtaining the human face contour of the person in the video frame according to the first image parameter corresponding to the video frame;
Here, the face contour can be understood as the face shape of the person. The face contour can be obtained based on a face model (the FLAME model), which is a commonly used three-dimensional head statistical model that can fit model parameters based on a given face data set and output personalized expressions and poses.
In specific implementation, camera parameters, shape data, expression data and posture data, which are obtained by preprocessing each frame in a training video through a facial expression detail capturing and animation module and a facial key point fitting module, can be respectively input into a face model for fitting training to obtain a face contour, wherein the face contour can be used for separating the foreground and the background of a video frame, and it can be understood that the foreground is a part containing characters and the background is a part not containing characters.
In the embodiment of the invention, after the first image parameter and the audio characteristic data corresponding to the video frame in the training video containing the character are obtained, the face contour of the character in the video frame is obtained according to the first image parameter corresponding to the video frame.
Step 303, inputting the face contour and the audio characteristic data into a deformable neural radiation field network to obtain a rendered video frame containing a virtual character; the rendered video frame comprises a second image parameter of a preset visual angle;
Here, the deformable neural radiation field network can be used for training and rendering video data so as to obtain a model containing a virtual character. The rendered video frame may be a video frame obtained by rendering with the deformable neural radiation field network, and can also be understood as a newly rendered image; the rendered video frame comprises a second image parameter of a preset visual angle, where the second image parameter may comprise the volume density and the color of the video frame at the preset visual angle, that is, the volume density and color corresponding to the rendered video frame. It can be understood that a video frame or image with a new volume density and new color can be rendered through the deformable neural radiation field network. The preset visual angle may be a new view angle obtained by rendering, and may also be the view angle direction of the camera corresponding to the video frame.
In the embodiment of the invention, after the first image parameters and the audio characteristic data corresponding to the video frame in the training video containing the person are obtained, the face contour of the person in the video frame is obtained according to the first image parameters corresponding to the video frame, and the face contour and the audio characteristic data are input into the deformable neural radiation field network to obtain a rendered video frame containing a virtual character, wherein the rendered video frame contains a second image parameter of a preset visual angle.
Step 304, obtaining a third image parameter corresponding to a video frame of the target video, and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame;
the target video can be a target video corresponding to the training video, and the target video is a video of a virtual character corresponding to a character containing the training video; for the third image parameter, which is the volume density and color corresponding to each video frame in the target video, it may be understood as the volume density and color corresponding to each video frame in the original video, and different from the second image parameter, for the second image parameter, it may include the volume density and color of the video frame at the preset view angle, that is, the corresponding volume density and color of the rendered video frame.
For error data, it may be the loss of error between the bulk density and color corresponding to the video frame of the target video and the bulk density and color corresponding to the rendered video frame.
In a specific implementation, the corresponding volume density and color of the rendered video frame are obtained, the corresponding volume density and color of the video frame of the target video are further obtained, and error data of the corresponding volume density and color of the target video frame and the corresponding volume density and color of the rendered video frame are calculated.
Step 305, when it is determined from the error data that the deformable neural radiation field network satisfies the convergence condition, obtaining the trained deformable neural radiation field network.
Here, the convergence condition is a preset convergence value in the deformable neural radiation field network and can be adjusted according to the actual situation.
In a specific implementation, after the volume density and color corresponding to the rendered video frame are obtained, the volume density and color corresponding to the video frame of the target video are further acquired, and the error data between them is calculated; whether the deformable neural radiation field network satisfies the convergence condition can then be determined from the error data, so as to obtain the trained deformable neural radiation field network. After the trained deformable neural radiation field network is obtained, a video containing a person can be synthesized into a video containing a virtual character according to the trained deformable neural radiation field network.
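A minimal PyTorch-style sketch of the training procedure of steps 301 to 305 is given below, assuming a model that maps a face contour and audio features to a rendered frame and a data loader that also yields the corresponding target-video frame; all names, the MSE form of the error data, and the convergence threshold are illustrative assumptions rather than the patent's implementation.

```python
# Minimal training-loop sketch for steps 301-305 (all module and loader
# names are illustrative assumptions, not the patent's implementation).
import torch

def train(model, loader, epochs=100, tol=1e-4, lr=5e-4, device="cuda"):
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for face_contour, audio_feat, target_rgb in loader:
            pred_rgb = model(face_contour.to(device), audio_feat.to(device))  # rendered frame
            loss = torch.mean((pred_rgb - target_rgb.to(device)) ** 2)        # error data (photometric MSE)
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        avg = total / len(loader)
        if avg < tol:          # convergence condition checked on the error data
            break
    return model
```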
In the embodiment of the invention, first image parameters and audio characteristic data corresponding to a video frame in a training video containing a person are obtained, wherein the training video has a corresponding target video and the target video is a video containing a virtual character corresponding to the person in the training video; the face contour of the person in the video frame is obtained according to the first image parameters corresponding to the video frame, and the face contour and the audio characteristic data are then input into a deformable neural radiation field network to obtain a rendered video frame containing the virtual character, the rendered video frame containing a second image parameter of a preset visual angle; a third image parameter corresponding to the video frame of the target video is further acquired, error data between the third image parameter corresponding to the video frame and the second image parameter corresponding to the rendered video frame corresponding to the video frame is calculated, and when it is determined from the error data that the deformable neural radiation field network satisfies the convergence condition, the trained deformable neural radiation field network is obtained. By processing the input video frames containing the person through the deformable neural radiation field network to obtain the trained deformable neural radiation field network, and synthesizing the video containing the person into a video containing the virtual character according to the trained deformable neural radiation field network, the expressiveness and persuasiveness of the virtual character are effectively improved, and the smoothness and realism of the three-dimensional visual representation of the virtual character are improved at the same time.
In an optional embodiment, the step 302 of obtaining the face contour of the person in the video frame according to the first image parameter corresponding to the video frame includes:
inputting the first image parameters into a preset human face model for training;
when the preset face model meets the convergence condition, obtaining a trained face model;
and obtaining the human face outline of the figure in the video frame according to the trained human face model.
Here, the first image parameters may include camera parameters, appearance data, expression data, and pose data. The camera parameters may be the view direction, shooting light or illumination, and the like, when the camera is used for shooting the video; the appearance data may represent the shape of the person's head, or feature information of the hair, the face, and the like; the expression data may be the person's expression, such as happiness, anger, sorrow, or joy; the pose data may be the speaking motion of the person's mouth, or the pose of another part. It should be noted that, in the embodiment of the present invention, the listed data are kept simple for ease of understanding; that is, in practical applications, the data included in the first image parameters may be far more than those listed, and a person skilled in the art may select them according to the actual situation, which is not limited in the embodiment of the present invention.
In one example, as shown in fig. 4, a segment of video frames is input, i.e., a segment of video containing a plurality of video frames, wherein the camera parameters, shape data, expression data, and pose data of each frame can be extracted by the facial expression detail capture and animation module and the facial keypoint fitting module.
For a human face contour, it can be understood as the face shape of a person; the face contour can be obtained based on a face model, wherein the face model is a common three-dimensional head statistical model, model parameters can be fitted based on a given face data set, and personalized expressions and postures are output.
Alternatively, the face contour may be used to separate a foreground and a background of a video frame (which may be understood as an image), specifically, when an incident ray intersects with the face contour, the incident ray is taken as the foreground corresponding to the video frame in the training video, and when the incident ray does not intersect with the face contour, the incident ray is taken as the background corresponding to the video frame in the training video.
The incident light may be an incident light emitted from the center of the camera when the camera is used for shooting.
In the embodiment of the present invention, camera parameters, shape data, expression data, and pose data obtained by preprocessing each frame in a training video through a facial expression detail capture and animation production module and a facial key point fitting module may be respectively input into a face model for fitting training, so as to obtain a face contour, where the face contour may be used to separate a foreground and a background of a video frame, it may be understood that the foreground may be a portion including a character, and the background may be a portion not including a character, specifically, when an incident ray intersects with the face contour, the incident ray is taken as a foreground corresponding to the video frame in the training video, and when the incident ray does not intersect with the face contour, the incident ray is taken as a background corresponding to the video frame in the training video.
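The following is a deliberately simplified sketch of this foreground/background split: it approximates the fitted face contour with a bounding sphere for the ray-intersection test, whereas the method described above intersects the rays with the fitted contour itself; the function names and the sphere approximation are assumptions for illustration only.

```python
# Simplified sketch: classify camera rays as foreground (intersecting the
# face contour) or background. The fitted contour is approximated here by
# a bounding sphere, which is an illustrative assumption only.
import numpy as np

def ray_hits_sphere(origin, direction, center, radius):
    # Treats the ray as an infinite line for simplicity.
    oc = origin - center
    d = direction / np.linalg.norm(direction)
    b = np.dot(oc, d)
    c = np.dot(oc, oc) - radius ** 2
    return b * b - c >= 0.0          # discriminant of the ray/sphere equation

def split_rays(origins, directions, contour_center, contour_radius):
    foreground, background = [], []
    for o, d in zip(origins, directions):
        bucket = foreground if ray_hits_sphere(o, d, contour_center, contour_radius) else background
        bucket.append((o, d))
    return foreground, background
```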
In an optional embodiment, the step 303 of inputting the face contour and the audio feature data into a deformable neural radiation field network to obtain a rendered video frame including a virtual character includes:
inputting the face contour and the audio feature data into the deformable neural radiation field network;
determining incident light rays of the video frame corresponding to the face contour according to the face contour;
according to the incident light, determining the positions of the incident light and the sampling points of the human face contour;
and obtaining a rendering video frame containing the virtual character according to the sampling point position and the audio characteristic data.
Wherein, for the human face contour, the human face contour can be understood as the face shape of a person; for the audio feature data, it may be feature information of an audio signal in the video, and the audio feature information may be mel cepstral coefficient features, for the mel cepstral coefficient features, it is the most commonly used speech feature in terms of speech recognition and speaker recognition.
For the deformable neural radiation field network, it can be used for training and rendering video data so as to obtain a model containing a virtual character. The incident light may be an incident ray emitted from the camera center when the camera is used for shooting. The sampling point positions may be positions selected along the incident ray; the sampling points corresponding to the incident ray are usually selected manually according to experience.
For the rendered video frame, it may be a video frame obtained by rendering with the deformable neural radiation field network, and can also be understood as a newly rendered image; the rendered video frame comprises a second image parameter of a preset visual angle, where the second image parameter may comprise the volume density and the color of the video frame at the preset visual angle, that is, the volume density and color corresponding to the rendered video frame. It can be understood that a video frame or image with a new volume density and new color can be rendered through the deformable neural radiation field network. The preset visual angle may be a new view angle obtained by rendering, and may also be the view angle direction of the camera corresponding to the video frame.
Optionally, the coordinate code and the view direction of the sampling point position may be determined according to the sampling point position of the incident light, the coordinate code of the sampling point position, for example, (x, y), and the values of x and y may be selected according to the actual situation.
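One common way to compute such a coordinate code is the NeRF-style positional encoding applied to the sampling-point coordinates and the view direction; the short sketch below is an illustration under that assumption, and the frequency count and implementation details are not taken from the patent.

```python
# Sketch of a NeRF-style positional encoding gamma(.) applied to the
# sampling-point coordinates and view direction (illustrative only).
import torch

def positional_encoding(x: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    # x: (..., D) coordinates or directions; returns (..., D * 2 * num_freqs)
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device) * torch.pi
    scaled = x[..., None] * freqs                                # (..., D, num_freqs)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
    return enc.flatten(start_dim=-2)
```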
Optionally, the deformable neural radiation field network includes an implicit deformation code and an implicit appearance code, and specifically, the implicit deformation code and the implicit appearance code may be understood as a vector initialized randomly in the deformable neural radiation field network, where the implicit deformation code is used to construct an expression change of a facial appearance of a person in the video frame, and the implicit appearance code is used to construct a change of different illumination and post-photography processing of the video frame.
Specifically, the rendered video frame including the virtual character can be obtained by acquiring the sampling point position of the incident light corresponding to the video frame, determining the coordinate code and the view angle direction of the sampling point position, and inputting the coordinate code and the view angle direction of the sampling point position, and the implicit deformation code and the implicit appearance code in the deformable nerve radiation field network into the deformable nerve radiation field network.
Optionally, the deformable neural radiation field network includes a radiation field network and an encoder network, the radiation field network is configured to generate a volume density of a video frame for a preset view angle, the encoder network is configured to generate a color of the video frame for the preset view angle, as can be seen from fig. 4, the first image parameter and the implicit deformation code may be input to the radiation field network of the deformable neural radiation field network to obtain a volume density corresponding to the video frame for the preset view angle, the view angle direction, the implicit appearance code, and the audio feature data are input to the encoder network of the deformable neural radiation field network to obtain a color corresponding to the video frame for the preset view angle, and finally, the rendered video frame corresponding to the training video may be obtained according to the volume density and the color of the rendered video frame, that is, an image of a new volume density and a new color is obtained.
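A compact PyTorch sketch of the two-branch structure described above is given below: a density branch conditioned on the encoded position and the implicit deformation code, and a color branch conditioned on the view direction, the implicit appearance code, and the audio features. The layer sizes, dimensions, and names are assumptions, not the patent's actual architecture.

```python
# Compact sketch of the two-branch structure described above: a density
# branch conditioned on the implicit deformation code, and a color branch
# conditioned on view direction, implicit appearance code and audio
# features. Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class DeformableRadianceField(nn.Module):
    def __init__(self, pos_dim=63, dir_dim=27, deform_dim=32, appear_dim=32, audio_dim=64, hidden=256):
        super().__init__()
        self.density_net = nn.Sequential(              # "radiation field network": volume density
            nn.Linear(pos_dim + deform_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + hidden),             # sigma + intermediate feature
        )
        self.color_net = nn.Sequential(                # "encoder network": RGB color
            nn.Linear(hidden + dir_dim + appear_dim + audio_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc, deform_code, appear_code, audio_feat):
        h = self.density_net(torch.cat([pos_enc, deform_code], dim=-1))
        sigma, feat = h[..., :1], h[..., 1:]
        rgb = self.color_net(torch.cat([feat, dir_enc, appear_code, audio_feat], dim=-1))
        return sigma, rgb
```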
In an optional embodiment, the step 304 of obtaining a third image parameter corresponding to a video frame of the target video, and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame includes:
acquiring the volume density and color corresponding to the video frame in the target video;
and calculating error data of the volume density and the color corresponding to the video frame in the target video and the volume density and the color corresponding to the rendered video frame according to the volume density and the color corresponding to the rendered video frame.
Wherein, for the error data, it may be an error loss between a bulk density and a color corresponding to a video frame of the target video and a bulk density and a color corresponding to the rendered video frame; for the target video, the target video may be a target video corresponding to the training video, and the target video is a video of a virtual character corresponding to a character including the training video; for the third image parameter, which is the volume density and color corresponding to each video frame in the target video, it may be understood as the volume density and color corresponding to each video frame in the original video, and different from the second image parameter, for the second image parameter, it may include the volume density and color of the video frame at the preset view angle, that is, the corresponding volume density and color of the rendered video frame.
For the rendered video frame, it may be a video frame obtained by rendering with the deformable neural radiation field network, and can also be understood as a newly rendered image; the rendered video frame comprises a second image parameter of a preset visual angle, where the second image parameter may comprise the volume density and the color of the video frame at the preset visual angle, that is, the volume density and color corresponding to the rendered video frame. It can be understood that a video frame or image with a new volume density and new color can be rendered through the deformable neural radiation field network. The preset visual angle may be a new view angle obtained by rendering, and may also be the view angle direction of the camera corresponding to the video frame.
In the embodiment of the invention, error data of the volume density and the color corresponding to the video frame in the target video and the volume density and the color corresponding to the rendered video frame are calculated according to the volume density and the color corresponding to the video frame in the acquired target video and further according to the volume density and the color corresponding to the rendered video frame.
In an optional embodiment, the synthesizing a video including a character into a video including a virtual character according to the trained deformable nerve radiation field network includes:
taking the error data as an optimization target of the deformable neural radiation field network;
and optimizing the optimization target, and synthesizing the video containing the character into the video containing the virtual character.
For the optimization target, the error data obtained in the above steps may be used, that is, the error loss between the volume density and color corresponding to the video frame of the target video and the volume density and color corresponding to the rendered video frame. The virtual character can be understood as a digital speaker: built by means of advanced technologies such as artificial intelligence and deep learning, with the five sense organs and body parts modeled according to human body proportions, the digital person has lip movements, limb actions, voice tone, image quality and emotional expression similar to those of a real person, and gives an enthusiastic, generous and natural overall impression.
In a specific implementation, the error data is used as the optimization target of the deformable neural radiation field network, the optimization target is then optimized, and the video containing the person is synthesized into a video containing the virtual character. This can solve the problem of unnatural facial expressions of the virtual character, such as the eyes and mouth, provide richer geometric detail information such as wrinkles, address the problem of rendering the virtual character's hair, and extend the virtual character to practical dynamically changing scenes with a wider application range, thereby effectively improving the expressiveness and persuasiveness of the virtual character while improving the smoothness and realism of its three-dimensional visual representation.
In order to make those skilled in the art better understand the technical solutions of the embodiments of the present invention, the following exemplary descriptions are provided by specific examples.
In one example, during training of the Deformable Neural Radiance Field network (Deformable NeRF), suppose that an incident ray emitted by the camera samples the i-th frame image. The coordinate code and view direction corresponding to each sampling point position on the incident ray, the implicit deformation code (latent deformation code), the implicit appearance code (latent appearance code) and the audio features are then input together into the deformable neural radiation field network, and the volume density (Density) and color (RGB color) of a new camera view are obtained by rendering. Specifically, assume a 3D coordinate point $x$ and a view direction $d$; after passing through the position encoding function $\gamma(\cdot)$, they are input into the standard network function $F$ of the deformable neural radiation field, which outputs the color $c$ and the volume density $\sigma$, namely:

$$F\big(\gamma(x), \gamma(d)\big) = (c, \sigma)$$

For a ray $r(t) = o + t\,d$ emanating from the camera center $o$ in the incidence direction $d$, the RGB color $\hat{C}(r)$ can be calculated from the volume rendering equation, expressed as follows:

$$\hat{C}(r) = \int_{t_n}^{t_f} T(t)\,\sigma\big(r(t)\big)\,c\big(r(t), d\big)\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma\big(r(s)\big)\,ds\Big)$$

where $t_f$ and $t_n$ respectively represent the far point and the near point of the incident ray, and $T(t)$ is the cumulative transmittance from $t_n$ to $t$. The integral is generally approximated by a hierarchical sampling method. Finally, the color of each ray in the training sample pictures (video frames) of each batch is calculated, and error-minimization training is performed against the pixel colors of the real pictures (the video frames or images of the initially input video), expressed as follows:

$$\mathcal{L} = \sum_{p}\big\lVert \hat{C}(p; \Theta) - C(p)\big\rVert_2^2$$

where $p$ is an index variable over all pixels in a batch of pictures, $\Theta$ denotes the parameters of the neural radiation field function $F$, and $C(p)$ is the pixel color value of the real picture. However, the standard neural radiance field NeRF can only model static scenes and is therefore not suitable for modeling human heads, because human faces are rich in speech expressions and diversified gesture actions.
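Before turning to the deformable extension, note that in implementation the rendering integral $\hat{C}(r)$ and the cumulative transmittance $T(t)$ above are approximated by quadrature over the stratified samples of each ray. A minimal sketch of that discrete accumulation is given below; tensor shapes and sample counts are illustrative.

```python
import torch

def volume_render(sigma, rgb, t_vals):
    """Discrete approximation of the rendering integral.
    sigma:  [num_rays, n_samples]      volume density at each sample
    rgb:    [num_rays, n_samples, 3]   color at each sample
    t_vals: [num_rays, n_samples]      sample depths between t_n and t_f"""
    deltas = t_vals[..., 1:] - t_vals[..., :-1]
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # opacity of each sample
    ones = torch.ones_like(alpha[..., :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alpha + 1e-10], dim=-1), dim=-1)[..., :-1]
    weights = alpha * trans                                  # T(t_k) * alpha_k
    return torch.sum(weights.unsqueeze(-1) * rgb, dim=-2)    # [num_rays, 3] pixel colors
```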
In order to better model the dynamically changing facial geometric detail information, a deformable neural radiation field structure is adopted. For a 3D point $x$ of the i-th frame image in the video, a deformation function $T_i$ maps it into the standard (canonical) space, i.e. $x' = T_i(x)$. Specifically, $T_i$ is defined as $T_i(x) = T(x, \omega_i)$, where $\omega_i$ is the implicit deformation code of each frame. Furthermore, in view of the illumination changes that may occur in the actual scene, we additionally employ an implicit appearance code $\psi_i$. Finally, to model the relationship between the audio signal and the facial expression, the feature code $a_i$ of the synchronized audio signal output by the audio encoder is also input into the deformed neural radiation field. Thus, the final deformed neural radiation field of the i-th frame can be expressed as follows:

$$F\big(\gamma(T_i(x)),\, d,\, \omega_i,\, \psi_i,\, a_i\big) = (c_i, \sigma_i)$$
and finally, optimizing the hidden codes and the network parameters by a common random gradient descent method, and generating a vivid digital speaker by a rendering formula after training, namely generating a virtual character.
In the embodiment of the invention, first image parameters and audio feature data corresponding to the video frames in a training video containing a person are obtained, wherein the training video has a corresponding target video and the target video is a video of the virtual character corresponding to the person contained in the training video; the face contour of the person in the video frame is obtained according to the first image parameters corresponding to the video frame; the face contour and the audio feature data are then input into a deformable neural radiation field network to obtain a rendered video frame containing the virtual character, the rendered video frame containing a second image parameter for a preset view angle. A third image parameter corresponding to the video frame of the target video is further acquired, error data between the third image parameter corresponding to the video frame and the second image parameter corresponding to the rendered video frame corresponding to that video frame is calculated, and the trained deformable neural radiation field network is obtained when it is determined from the error data that the deformable neural radiation field network meets the convergence condition. By processing the input video frames containing the person through the deformable neural radiation field network to obtain a trained network, and synthesizing the video containing the person into a video containing the virtual character according to the trained network, the expressiveness and persuasiveness of the virtual character are effectively improved, and the smoothness and realism of its three-dimensional visual representation are improved at the same time.
It should be noted that, with respect to the above-mentioned construction method of the deformable nerve radiation field network, a virtual character with a sense of reality and expressive force can be synthesized, and in an actual application scenario, a person skilled in the art can also apply the method to the fields of business anchor, image introduction, and the like according to an actual application situation, which is not limited in this embodiment of the present invention.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 5, a block diagram of a structure of a device for constructing a deformable neural radiation field network provided in an embodiment of the present invention is shown, which may specifically include the following modules:
the data acquisition module 501 is configured to acquire first image parameters and audio feature data corresponding to video frames in a training video containing a person; the training video is provided with a corresponding target video, and the target video is a video of a virtual character corresponding to a character containing the training video;
a face contour obtaining module 502, configured to obtain a face contour of the person in the video frame according to the first image parameter corresponding to the video frame;
a rendering video frame obtaining module 503, configured to input the face contour and the audio feature data into a deformable nerve radiation field network, so as to obtain a rendering video frame including a virtual character; the rendered video frame comprises a second image parameter of a preset visual angle;
an error data calculation module 504, configured to obtain a third image parameter corresponding to a video frame of the target video, and calculate error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame;
and a deformable neural radiation field network constructing module 505, configured to obtain a trained deformable neural radiation field network when it is determined that the deformable neural radiation field network meets the convergence condition according to the error data.
In an optional embodiment, the face contour acquisition module 502 is specifically configured to:
inputting the first image parameters to a preset human face model for training;
when the preset face model meets the convergence condition, obtaining a trained face model;
and obtaining the human face contour of the figure in the video frame according to the trained human face model.
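One possible realization of this step, offered as an assumption rather than the patent's prescribed procedure: after the preset face model (for example a parametric 3D morphable face model) has been fitted from the first image parameters, its mesh vertices can be projected into the frame with the camera parameters, and the outline of the projected points can be taken as an approximation of the face contour.

```python
import numpy as np
from scipy.spatial import ConvexHull

def project_face_contour(vertices, intrinsics, pose):
    """Project fitted face-model vertices [N, 3] into the image with a pinhole
    camera (intrinsics: 3x3 K, pose: 3x4 [R|t]) and return the 2D points whose
    convex hull approximates the face contour."""
    homog = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)  # [N, 4]
    cam = (pose @ homog.T).T                    # camera-space coordinates
    pix = (intrinsics @ cam.T).T
    pix = pix[:, :2] / pix[:, 2:3]              # perspective divide -> pixel coordinates
    hull = ConvexHull(pix)                      # outermost projected points trace the silhouette
    return pix[hull.vertices]                   # ordered 2D contour points
```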
In an optional embodiment, the rendered video frame obtaining module 503 is specifically configured to:
inputting the face contour and the audio feature data into the deformable neural radiation field network;
determining incident light rays of the video frame corresponding to the face contour according to the face contour;
according to the incident light, determining the positions of the incident light and the sampling points of the human face contour;
and obtaining a rendered video frame containing the virtual character according to the sampling point position and the audio characteristic data.
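A minimal sketch of the ray-generation and sampling steps above is given below, assuming a pinhole camera model; the near/far bounds and the number of samples are illustrative.

```python
import torch

def get_rays(H, W, focal, c2w):
    """One ray (origin, direction) per pixel for a pinhole camera.
    c2w: [3, 4] camera-to-world matrix of the video frame."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(i - W * 0.5) / focal, -(j - H * 0.5) / focal, -torch.ones_like(i)], dim=-1)
    rays_d = torch.sum(dirs[..., None, :] * c2w[:3, :3], dim=-1)   # rotate into world space
    rays_o = c2w[:3, -1].expand(rays_d.shape)
    return rays_o, rays_d

def sample_points(rays_o, rays_d, near=0.1, far=4.0, n_samples=64):
    """Sampling point positions x = o + t * d along each incident ray."""
    t = torch.linspace(near, far, n_samples)
    t = t + torch.rand_like(t) * (far - near) / n_samples          # random jitter (approximate stratification)
    pts = rays_o[..., None, :] + rays_d[..., None, :] * t[..., :, None]
    return pts, t
```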
In an alternative embodiment, the apparatus further comprises:
the foreground obtaining module is used for taking the incident ray as a foreground corresponding to a video frame in the training video when the incident ray is intersected with the face contour;
and the background acquisition module is used for taking the incident ray as a background corresponding to a video frame in the training video when the incident ray is not intersected with the face contour.
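As an illustrative simplification of this foreground/background split, each ray can be tested against a binary mask filled from the face contour, using the mask as a proxy for the ray-contour intersection test.

```python
import numpy as np

def split_rays_by_contour(pixel_coords, face_mask):
    """pixel_coords: [num_rays, 2] integer (x, y) pixel positions of the rays;
    face_mask: [H, W] boolean mask filled from the face contour.
    Rays whose pixel falls inside the contour are foreground, the rest background."""
    x, y = pixel_coords[:, 0], pixel_coords[:, 1]
    is_fg = face_mask[y, x]
    return pixel_coords[is_fg], pixel_coords[~is_fg]
```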
In an alternative embodiment, the apparatus further comprises:
and the position data acquisition module is used for determining the coordinate coding and the visual angle direction of the position of the sampling point according to the position of the sampling point of the incident light.
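The coordinate code of a sampling point position is typically a frequency positional encoding of its 3D coordinates, and the view angle direction is the normalized ray direction; a minimal sketch follows, with the number of frequency bands chosen only for illustration.

```python
import torch

def positional_encoding(x, num_freqs=10):
    """gamma(x): concatenate the raw input with sin(2^k * pi * x) and
    cos(2^k * pi * x) for k = 0 .. num_freqs - 1."""
    out = [x]
    for k in range(num_freqs):
        out.append(torch.sin((2.0 ** k) * torch.pi * x))
        out.append(torch.cos((2.0 ** k) * torch.pi * x))
    return torch.cat(out, dim=-1)

def view_direction(rays_d):
    """Unit-normalized view angle direction of each incident ray."""
    return rays_d / torch.linalg.norm(rays_d, dim=-1, keepdim=True)
```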
In an optional embodiment, the rendered video frame obtaining module 503 is specifically configured to:
acquiring sampling point positions of incident light rays corresponding to the video frames, and determining coordinate codes and visual angle directions of the sampling point positions;
and inputting the coordinate code and the visual angle direction of the sampling point position, and the implicit deformation code and the implicit appearance code in the deformable nerve radiation field network into the deformable nerve radiation field network to obtain a rendered video frame containing a virtual character.
In an alternative embodiment, the apparatus further comprises:
and the volume density acquisition module is used for inputting the first image parameters corresponding to the video frame and the implicit deformation code of the deformable nerve radiation field network into the radiation field network of the deformable nerve radiation field network to obtain the volume density corresponding to the video frame at a preset visual angle.
In an alternative embodiment, the apparatus further comprises:
and the color acquisition module is used for inputting the visual angle direction corresponding to the video frame, the implicit appearance code of the deformable nerve radiation field network and the audio characteristic data into an encoder network of the deformable nerve radiation field network to obtain the color corresponding to the video frame at the preset visual angle.
In an alternative embodiment, the error data calculation module 504 is specifically configured to:
acquiring the volume density and color corresponding to the video frame in the target video;
and calculating error data of the volume density and the color corresponding to the video frame in the target video and the volume density and the color corresponding to the rendered video frame according to the volume density and the color corresponding to the rendered video frame.
In an alternative embodiment, the apparatus further comprises:
and the first virtual character video synthesis module is used for synthesizing the videos containing the characters into the video containing the virtual characters according to the trained deformable nerve radiation field network.
In an alternative embodiment, the apparatus further comprises:
an optimization target determination module, configured to use the error data as an optimization target of the deformable neural radiation field network;
and the second virtual character video synthesis module is used for optimizing the optimization target and synthesizing the video containing the characters into the video containing the virtual characters.
For the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for relevant points.
In addition, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor; when executed by the processor, the computer program implements each process of the above-mentioned method for constructing a deformable neural radiation field network and can achieve the same technical effect, and details are not repeated here to avoid repetition.
FIG. 6 is a schematic structural diagram of a computer-readable storage medium provided in an embodiment of the present invention;
the embodiment of the present invention further provides a computer-readable storage medium 601, where a computer program is stored on the computer-readable storage medium 601, and when being executed by a processor, the computer program implements each process of the above-mentioned method for constructing a deformable neural radiation field network, and can achieve the same technical effect, and is not described herein again to avoid repetition. The computer-readable storage medium 601 is, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Fig. 7 is a schematic diagram of a hardware structure of an electronic device for implementing various embodiments of the present invention.
The electronic device 700 includes, but is not limited to: a radio frequency unit 701, a network module 702, an audio output unit 703, an input unit 704, a sensor 705, a display unit 706, a user input unit 707, an interface unit 708, a memory 709, a processor 710, a power supply 711, and the like. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 7 does not constitute a limitation of electronic devices, which may include more or fewer components than shown, or some components may be combined, or a different arrangement of components. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 701 may be used for receiving and sending signals during a process of sending and receiving information or a call, and specifically, after receiving downlink data from a base station, the downlink data is processed by the processor 710; in addition, the uplink data is transmitted to the base station. In general, radio frequency unit 701 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 701 may also communicate with a network and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user via the network module 702, such as assisting the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 703 may convert audio data received by the radio frequency unit 701 or the network module 702 or stored in the memory 709 into an audio signal and output as sound. Also, the audio output unit 703 may also provide audio output related to a specific function performed by the electronic apparatus 700 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 703 includes a speaker, a buzzer, a receiver, and the like.
The input unit 704 is used to receive audio or video signals. The input unit 704 may include a Graphics Processing Unit (GPU) 7041 and a microphone 7042, and the graphics processor 7041 processes image data of a still picture or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 706. The image frames processed by the graphic processor 7041 may be stored in the memory 709 (or other storage medium) or transmitted via the radio frequency unit 701 or the network module 702. The microphone 7042 may receive sounds and may be capable of processing such sounds into audio data. In the case of a phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 701 and output.
The electronic device 700 also includes at least one sensor 705, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 7061 according to the brightness of ambient light, and a proximity sensor that can turn off the display panel 7061 and/or a backlight when the electronic device 700 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 705 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 706 is used to display information input by the user or information provided to the user. The Display unit 706 may include a Display panel 7061, and the Display panel 7061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 707 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. Specifically, the user input unit 707 includes a touch panel 7071 and other input devices 7072. The touch panel 7071, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 7071 (e.g., operations by a user on or near the touch panel 7071 using a finger, a stylus, or any other suitable object or attachment). The touch panel 7071 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 710, receives a command from the processor 710, and executes the command. In addition, the touch panel 7071 can be implemented by various types such as resistive, capacitive, infrared, and surface acoustic wave. The user input unit 707 may include other input devices 7072 in addition to the touch panel 7071. In particular, the other input devices 7072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described herein again.
Further, the touch panel 7071 may be overlaid on the display panel 7061, and when the touch panel 7071 detects a touch operation on or near the touch panel 7071, the touch operation is transmitted to the processor 710 to determine the type of the touch event, and then the processor 710 provides a corresponding visual output on the display panel 7061 according to the type of the touch event. Although the touch panel 7071 and the display panel 7061 are shown in fig. 7 as two separate components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 7071 and the display panel 7061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 708 is an interface for connecting an external device to the electronic apparatus 700. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 708 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 700 or may be used to transmit data between the electronic apparatus 700 and the external device.
The memory 709 may be used to store software programs as well as various data. The memory 709 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 709 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 710 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 709 and calling data stored in the memory 709, thereby monitoring the whole electronic device. Processor 710 may include one or more processing units; preferably, the processor 710 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 710.
The electronic device 700 may also include a power supply 711 (e.g., a battery) for providing power to the various components, and preferably, the power supply 711 may be logically coupled to the processor 710 via a power management system, such that functions of managing charging, discharging, and power consumption may be performed via the power management system.
In addition, the electronic device 700 includes some functional modules that are not shown, and are not described in detail herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but in many cases, the former is a better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

1. A method for constructing a deformable nerve radiation field network is characterized by comprising the following steps:
acquiring first image parameters and audio characteristic data corresponding to video frames in a training video containing characters; the training video is provided with a corresponding target video, and the target video is a video of a virtual character corresponding to a character containing the training video;
obtaining the human face contour of the figure in the video frame according to the first image parameter corresponding to the video frame;
inputting the face contour and the audio characteristic data into a deformable nerve radiation field network to obtain a rendering video frame containing a virtual character; the rendered video frame comprises a second image parameter of a preset visual angle;
acquiring a third image parameter corresponding to a video frame of the target video, and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame;
and when the deformable neural radiation field network is determined to meet the convergence condition according to the error data, obtaining the trained deformable neural radiation field network.
2. The method of claim 1, wherein obtaining the face contour of the person in the video frame according to the first image parameter corresponding to the video frame comprises:
inputting the first image parameters to a preset human face model for training;
when the preset face model meets the convergence condition, obtaining a trained face model;
and obtaining the human face contour of the figure in the video frame according to the trained human face model.
3. The method of claim 1, wherein the first image parameters include at least camera parameters, appearance data, expression data, and pose data.
4. The method of claim 1, wherein inputting the face contour and the audio feature data into a deformable neural radiation field network to obtain a rendered video frame containing a virtual character comprises:
inputting the face contour and the audio characteristic data into the deformable nerve radiation field network;
determining incident light rays of the video frame corresponding to the face contour according to the face contour;
according to the incident light, determining the positions of the incident light and the sampling points of the human face contour;
and obtaining a rendering video frame containing the virtual character according to the sampling point position and the audio characteristic data.
5. The method of claim 1, wherein the face contour is used to distinguish between a foreground and a background corresponding to a video frame in the training video.
6. The method of claim 4, further comprising:
when the incident ray intersects with the face contour, the incident ray is used as a foreground corresponding to a video frame in the training video;
and when the incident ray and the face contour are not intersected, taking the incident ray as a background corresponding to a video frame in the training video.
7. The method of claim 4, further comprising:
and determining the coordinate code and the view angle direction of the sampling point position according to the sampling point position of the incident light.
8. The method of claim 1, wherein the network of deformable neural radiation fields comprises an implicit deformation code and an implicit appearance code; the implicit deformation code is used for constructing expression changes of the face appearance of the person in the video frame, and the implicit appearance code is used for constructing changes of different illumination and post-shooting processing of the video frame.
9. The method of claim 8, wherein inputting the face contour and the audio feature data into a deformable neural radiation field network to obtain a rendered video frame containing a virtual character comprises:
acquiring sampling point positions of incident light rays corresponding to the video frames, and determining coordinate codes and view angle directions of the sampling point positions;
and inputting the coordinate code and the view angle direction of the sampling point position, and the implicit deformation code and the implicit appearance code in the deformable nerve radiation field network into the deformable nerve radiation field network to obtain a rendered video frame containing a virtual character.
10. The method of claim 1, wherein the deformable neural radiation field network comprises a radiation field network and an encoder network.
11. The method of claim 10, wherein the radiation field network is configured to generate a volume density of the video frame for a preset view, and wherein the encoder network is configured to generate a color of the video frame for the preset view.
12. The method of claim 1, wherein the second image parameters include at least volume density and color.
13. The method according to any one of claims 8-12, further comprising:
and inputting the first image parameters corresponding to the video frame and the implicit deformation code of the deformable nerve radiation field network into the radiation field network of the deformable nerve radiation field network to obtain the volume density corresponding to the video frame at a preset visual angle.
14. The method of claim 13, further comprising:
and inputting the visual angle direction corresponding to the video frame, the implicit appearance code of the deformable nerve radiation field network and the audio characteristic data into an encoder network of the deformable nerve radiation field network to obtain the color corresponding to the video frame at a preset visual angle.
15. The method of claim 12, further comprising:
and obtaining a rendering video frame corresponding to the training video according to the volume density and the color.
16. The method of claim 15, wherein the obtaining a third image parameter corresponding to a video frame of the target video and calculating error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame comprises:
acquiring the volume density and color corresponding to the video frame in the target video;
and calculating error data of the volume density and the color corresponding to the video frame in the target video and the volume density and the color corresponding to the rendered video frame according to the volume density and the color corresponding to the rendered video frame.
17. The method according to claim 1, wherein after the obtaining a trained deformable neural radiation field network when it is determined from the error data that the deformable neural radiation field network satisfies a convergence condition, the method further comprises:
and synthesizing the videos containing the characters into the video containing the virtual characters according to the trained deformable nerve radiation field network.
18. The method of claim 17, wherein synthesizing the video containing the human figure into the video containing the virtual human figure according to the trained deformable neural radiation field network comprises:
taking the error data as an optimization target of the deformable neural radiation field network;
and optimizing the optimization target, and synthesizing the video containing the character into the video containing the virtual character.
19. A device for constructing a deformable nerve radiation field network is characterized by comprising:
the data acquisition module is used for acquiring first image parameters and audio characteristic data corresponding to video frames in a training video containing characters; the training video is provided with a corresponding target video, and the target video is a video of a virtual character corresponding to a character containing the training video;
the face contour acquisition module is used for acquiring the face contour of the person in the video frame according to the first image parameter corresponding to the video frame;
the rendering video frame acquisition module is used for inputting the face contour and the audio characteristic data into the deformable nerve radiation field network to obtain a rendering video frame containing a virtual character; the rendered video frame comprises a second image parameter of a preset visual angle;
an error data calculation module, configured to obtain a third image parameter corresponding to a video frame of the target video, and calculate error data of the third image parameter corresponding to the video frame and a second image parameter corresponding to the rendered video frame corresponding to the video frame;
and the deformable nerve radiation field network construction module is used for obtaining the trained deformable nerve radiation field network when the deformable nerve radiation field network is determined to meet the convergence condition according to the error data.
20. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the method of any one of claims 1-18 when executing the program stored on the memory.
21. A computer-readable storage medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the method of any one of claims 1-18.
CN202310119675.7A 2023-02-15 2023-02-15 Method and device for constructing deformable nerve radiation field network Active CN115909015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310119675.7A CN115909015B (en) 2023-02-15 2023-02-15 Method and device for constructing deformable nerve radiation field network

Publications (2)

Publication Number Publication Date
CN115909015A true CN115909015A (en) 2023-04-04
CN115909015B CN115909015B (en) 2023-05-30

Family

ID=85737485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310119675.7A Active CN115909015B (en) 2023-02-15 2023-02-15 Method and device for constructing deformable nerve radiation field network

Country Status (1)

Country Link
CN (1) CN115909015B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113793408A (en) * 2021-09-15 2021-12-14 宿迁硅基智能科技有限公司 Real-time audio-driven face generation method and device and server
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation
CN114627223A (en) * 2022-03-04 2022-06-14 华南师范大学 Free viewpoint video synthesis method and device, electronic equipment and storage medium
CN114648613A (en) * 2022-05-18 2022-06-21 杭州像衍科技有限公司 Three-dimensional head model reconstruction method and device based on deformable nerve radiation field
CN115482323A (en) * 2022-08-10 2022-12-16 上海大学 Stereoscopic video parallax control and editing method based on nerve radiation field

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168137A (en) * 2023-04-21 2023-05-26 湖南马栏山视频先进技术研究院有限公司 New view angle synthesis method, device and memory based on nerve radiation field
CN116168137B (en) * 2023-04-21 2023-07-11 湖南马栏山视频先进技术研究院有限公司 New view angle synthesis method, device and memory based on nerve radiation field
CN117115331A (en) * 2023-10-25 2023-11-24 苏州元脑智能科技有限公司 Virtual image synthesizing method, synthesizing device, equipment and medium
CN117115331B (en) * 2023-10-25 2024-02-09 苏州元脑智能科技有限公司 Virtual image synthesizing method, synthesizing device, equipment and medium
CN117689783A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117689783B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field

Also Published As

Publication number Publication date
CN115909015B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
US9479736B1 (en) Rendered audiovisual communication
CN112379812B (en) Simulation 3D digital human interaction method and device, electronic equipment and storage medium
US10460512B2 (en) 3D skeletonization using truncated epipolar lines
US20200387698A1 (en) Hand key point recognition model training method, hand key point recognition method and device
CN115909015B (en) Method and device for constructing deformable nerve radiation field network
WO2020173329A1 (en) Image fusion method, model training method, and related device
WO2021098338A1 (en) Model training method, media information synthesizing method, and related apparatus
CN109685915B (en) Image processing method and device and mobile terminal
CN117036583A (en) Video generation method, device, storage medium and computer equipment
CN110555815A (en) Image processing method and electronic equipment
CN114332976A (en) Virtual object processing method, electronic device and storage medium
CN111405361B (en) Video acquisition method, electronic equipment and computer readable storage medium
CN110675413B (en) Three-dimensional face model construction method and device, computer equipment and storage medium
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
CN115578494B (en) Method, device and equipment for generating intermediate frame and storage medium
CN108830901B (en) Image processing method and electronic equipment
CN111597926A (en) Image processing method and device, electronic device and storage medium
CN110460833A (en) A kind of AR glasses and smart phone interconnected method and system
CN115223224A (en) Digital human speaking video generation method, system, terminal device and medium
Ding et al. Interactive multimedia mirror system design
CN113763517A (en) Facial expression editing method and electronic equipment
CN113643392B (en) Training method of face generation model, and face image generation method and device
CN115714888B (en) Video generation method, device, equipment and computer readable storage medium
WO2023116145A1 (en) Expression model determination method and apparatus, and device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant