WO2024078243A1 - Training method and apparatus for a video generation model, storage medium, and computer device - Google Patents

Training method and apparatus for a video generation model, storage medium, and computer device

Info

Publication number: WO2024078243A1
Authority: WO (WIPO, PCT)
Prior art keywords: video, training, target user, color value, mouth
Application number: PCT/CN2023/118459
Other languages: English (en), French (fr)
Inventors: 伍洋, 胡鹏飞, 齐晓娟, 吴秀哲, 单瀛, 徐静
Original assignee: 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024078243A1

Classifications

    • G06T 17/00: Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N 3/02, G06N 3/08: Neural networks; learning methods (computing arrangements based on biological models)
    • G06T 13/40: 3D animation of characters, e.g. humans, animals or virtual beings
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/90: Determination of colour characteristics
    • G06V 10/26: Segmentation of patterns in the image field; clustering-based techniques; detection of occlusion
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06T 2200/04: Indexing scheme involving 3D image data
    • G06T 2200/08: Indexing scheme involving all processing steps from image acquisition to 3D model generation
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/10021: Stereoscopic video; stereoscopic image sequence
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30196: Human being; person
    • G06T 2207/30201: Face

Definitions

  • the present application relates to the field of computer vision technology, and more specifically, to a training method, apparatus, storage medium and computer equipment for a video generation model.
  • The main principle of generating a talking portrait video is to use a better-looking reconstructed avatar of the user to reenact the user's actual portrait motion.
  • The talking portrait video generated by the related art is prone to uncoordinated movements of the user's body parts in the reconstructed video, which greatly reduces the realism of the video generation result presented to the user.
  • the embodiments of the present application provide a video generation model training method, apparatus, storage medium and computer equipment, aiming to improve the motion coordination when generating a speaking person portrait video.
  • an embodiment of the present application provides a training method for a video generation model, executed by a computer device. The method includes: obtaining a training video of a target user; extracting the target user's voice features, expression parameters and head parameters from the training video, where the head parameters are used to characterize the target user's head posture information and head position information; merging the target user's voice features, expression parameters and head parameters to obtain a conditional input of the training video; and performing network training on a preset single neural radiation field based on the conditional input, three-dimensional coordinates and viewing direction to obtain a video generation model. The video generation model is obtained by training based on a total loss; the total loss includes an image reconstruction loss, which is determined by a predicted object color value and a real object color value, and the predicted object color value is generated by the single neural radiation field according to the conditional input, three-dimensional coordinates and viewing direction. The video generation model is used to reconstruct the target user's video to be reconstructed to obtain a reconstructed video corresponding to the target user.
  • an embodiment of the present application also provides a training device for a video generation model, deployed on a computer device. The device includes: a conditional acquisition module, used to obtain a training video of a target user, extract the target user's voice features, expression parameters and head parameters from the training video (the head parameters being used to characterize the target user's head posture information and head position information), and merge the target user's voice features, expression parameters and head parameters to obtain a conditional input of the training video; and a network training module, used to perform network training on a preset single neural radiation field based on the conditional input, three-dimensional coordinates and viewing direction to obtain a video generation model. The video generation model is obtained by training based on a total loss; the total loss includes an image reconstruction loss, which is determined by the predicted object color value and the real object color value, and the predicted object color value is generated by the single neural radiation field according to the conditional input, three-dimensional coordinates and viewing direction.
  • an embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, wherein when the computer program is executed by a processor, the above-mentioned video generation model training method is executed.
  • an embodiment of the present application further provides a computer device, which includes a processor and a memory, wherein the memory stores a computer program, and when the computer program is called by the processor, the training method of the video generation model is executed.
  • an embodiment of the present application also provides a computer program product, which includes a computer program stored in a storage medium; a processor of a computer device reads the computer program from the storage medium, and the processor executes the computer program, so that the computer device executes the steps in the above-mentioned video generation model training method.
  • the present application provides a training method for a video generation model, which extracts voice features, expression parameters and head parameters from a training video of a target user, wherein the head parameters are used to characterize the head posture information and head position information of the target user, and the voice features, expression parameters and head parameters are merged to obtain the conditional input of the training video.
  • Network training is performed on a preset single neural radiation field based on the conditional input, three-dimensional coordinates and viewing direction to obtain a video generation model. The video generation model is obtained by training based on the total loss; the total loss includes the image reconstruction loss, which is determined by the predicted object color value and the real object color value, and the predicted object color value is generated by the single neural radiation field according to the conditional input, three-dimensional coordinates and viewing direction.
  • the video generation model obtained by network training can estimate the shoulder part and its motion state according to the head posture information and the head position information, so that when the video generation model is used to reconstruct the target user's to-be-reconstructed video, and the reconstructed video corresponding to the target user is obtained, the predicted video frame has a complete and realistic head and shoulder part, and the movement state of the head and shoulders is kept coordinated, thereby greatly improving the authenticity of the reconstructed video display.
  • FIG. 1 shows a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 2 shows a schematic flow chart of a method for training a video generation model provided in an embodiment of the present application.
  • FIG. 3 shows a network architecture diagram of a single neural radiation field provided by an embodiment of the present application.
  • FIG. 4 shows a schematic diagram of a camera ray provided in an embodiment of the present application.
  • FIG. 5 shows a schematic flow chart of another method for training a video generation model provided in an embodiment of the present application.
  • FIG. 6 shows a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 7 shows a schematic diagram of a performance comparison provided by an embodiment of the present application.
  • FIG. 8 shows an implementation effect diagram of a training method for a video generation model provided in an embodiment of the present application.
  • FIG. 9 is a module block diagram of a training device for a video generation model provided in an embodiment of the present application.
  • FIG. 10 is a module block diagram of a computer device provided in an embodiment of the present application.
  • FIG. 11 is a module block diagram of a computer-readable storage medium provided in an embodiment of the present application.
  • the training method of the video generation model of the present application involves artificial intelligence (AI) technology, which utilizes artificial intelligence technology to automatically train the video generation model and subsequently automatically generate videos.
  • a potential solution is to re-simulate the actual portrait motion based on a good-looking reconstructed avatar of the user, thereby generating a high-fidelity talking portrait video (Talking Portrait Video), in which the reconstructed avatar matches the user's voice audio and real head motion, facial expression, blinking, etc.
  • the above solution is also beneficial to many other applications, such as digital humans, filmmaking, and multiplayer online games.
  • modeling schemes for generating talking person portrait videos can be roughly divided into three categories: model-based, Generative Adversarial Network (GAN)-based, and Neural Radiance Fields (NeRF)-based.
  • model-based schemes usually create a three-dimensional (3D) model of a specific person based on red-green-blue (RGB) or red-green-blue-depth map (RGBD) data, and then assign facial expressions to the 3D model without considering head movement, and the resolution of the generated results is limited.
  • Generative adversarial network-based schemes generally use adversarial learning models to directly generate the appearance of a person, but their learning process cannot know the 3D geometry of the scene and requires additional reference images to provide identity information.
  • the solutions based on neural radiance fields mainly include two methods with audio and motion as driving sources.
  • Audio-driven methods include, for example, audio-driven neural radiance fields (AD-NeRF).
  • Motion-driven methods, for example, learn a mapping function to transfer the source motion or expression to the target face.
  • AD-NeRF relies on two independent neural radiance fields to simulate the head and torso respectively, so there is a problem of network structure separation.
  • NerFACE (a NeRF-based face modeling algorithm) cannot generate stable and natural torso sequences, which causes incoordination between the head and shoulders of the reconstructed portrait in the talking portrait video; moreover, the lip shape of the reconstructed portrait generated by the above methods cannot be synchronized with the user's actual lip shape.
  • an embodiment of the present application provides a training method for a video generation model.
  • the following first introduces the system architecture of the training method for a video generation model involved in the present application.
  • the training method of the video generation model provided in the embodiment of the present application can be applied in a system 300, and a data acquisition device 310 is used to acquire training data.
  • the training data may include a training video for training.
  • the data acquisition device 310 may store the training data in a database 320, and the training device 330 may train the target model 301 based on the training data maintained in the database 320.
  • the training device 330 can train the preset neural network based on the training video until the preset neural network meets the preset conditions and obtains the target model 301.
  • the preset neural network is a single neural radiation field.
  • the preset conditions may be: the total loss value of the total loss function is less than the preset value, the total loss value of the total loss function no longer changes, or the number of training times reaches the preset number of times.
  • the target model 301 can be used to realize the generation of the reconstructed video in the embodiment of the present application.
  • the training data maintained in the database 320 does not necessarily all come from the data acquisition device 310, but may also be received from other devices.
  • the client device 360 may also serve as a data acquisition terminal, and the acquired data is used as new training data and stored in the database 320.
  • the training device 330 does not necessarily train the preset neural network based entirely on the training data maintained in the database 320, but may also train the preset neural network based on the training data obtained from the cloud or other devices.
  • the above description should not be used as a limitation on the embodiments of the present application.
  • the target model 301 obtained by training the training device 330 can be applied to different systems or devices, such as the execution device 340 shown in Figure 1.
  • the execution device 340 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, augmented reality (AR)/virtual reality (VR), etc., and can also be a server or cloud, but is not limited to this.
  • the execution device 340 can be used to interact with external devices for data.
  • a user can use a client device 360 to send input data to the execution device 340 through a network.
  • the input data may include: a training video or a video to be reconstructed sent by the client device 360 in the embodiment of the present application.
  • the execution device 340 can call data, programs, etc. in the data storage system 350 for corresponding calculation processing, and store data and instructions such as processing results obtained by the calculation processing in the data storage system 350.
  • the execution device 340 can return the processing result, that is, the reconstructed video generated by the target model 301, to the client device 360 through the network, so that the user can query the processing result on the client device 360.
  • the training device 330 can generate a corresponding target model 301 based on different training data for different goals or different tasks, and the corresponding target model 301 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired results.
  • the system 300 shown in FIG1 may be a client-server (C/S) system architecture
  • the execution device 340 may be a cloud server deployed by a service provider
  • the client device 360 may be a laptop computer used by a user.
  • a user may use the video generation software installed in a laptop computer to upload the video to be reconstructed to the cloud server via the network.
  • After the cloud server receives the video to be reconstructed, it uses the target model 301 to reconstruct the portrait in the video and generate the corresponding reconstructed video, which is returned to the laptop computer; the user can then obtain the reconstructed video in the video generation software.
  • FIG. 1 is only a schematic diagram of the architecture of a system provided in an embodiment of the present application.
  • the architecture and application scenarios of the system described in the embodiment of the present application are intended to more clearly illustrate the technical solution of the embodiment of the present application, and do not constitute a limitation on the technical solution provided in the embodiment of the present application.
  • the data storage system 350 in FIG. 1 is an external memory relative to the execution device 340.
  • the data storage system 350 may also be placed in the execution device 340.
  • the execution device 340 may also be a client device directly. It is known to those skilled in the art that with the evolution of the system architecture and the emergence of new application scenarios, the technical solution provided in the embodiment of the present application is also applicable to solving similar technical problems.
  • Figure 2 shows a flow chart of a method for training a video generation model provided by an embodiment of the present application.
  • the method for training a video generation model is applied to a training device 500 for a video generation model as shown in Figure 9 and a computer device 600 ( Figure 10) equipped with the training device 500 for a video generation model.
  • the server can be an independent physical server, or a server cluster or distributed system composed of multiple physical servers. It can also be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), blockchain, and big data and artificial intelligence platforms.
  • the terminal can be a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited to this.
  • the training method of the video generation model can specifically include the following steps:
  • S110 Obtain a training video of the target user.
  • S120 Extract the target user's voice features, the target user's expression parameters, and the target user's head parameters from the training video, where the head parameters are used to characterize the target user's head posture information and head position information.
  • S130 The target user's voice features, the target user's expression parameters, and the target user's head parameters are combined to obtain a conditional input of the training video.
  • the method proposed by the related art to generate a video of a talking person using only voice or expression as the driving source will produce a non-negligible visual problem, namely, the incoordination of head-torso movement.
  • the reason for this problem is that the neural radiation field often models the complete portrait as a rigid entity, without distinguishing between head movement and torso movement. Therefore, whenever the camera's viewing direction and position are changed, the entire portrait will change direction rigidly, and the shoulder movement will shake, resulting in incoordination between head movement and shoulder movement.
  • the embodiment of the present application creatively introduces the user's head posture information and head position information into conditional input, so that the neural radiation field can implicitly estimate the movement state of the shoulder based on the head posture information and head position information, so that the subsequently generated reconstructed portrait can maintain the coordination between head movement and shoulder movement.
  • the conditional input may at least include the target user's voice features, expression parameters, and head parameters, where the head parameters can be used to characterize head posture information and head position information.
  • voice features can be used to characterize the audio information when the user speaks.
  • Expression parameters can be used to characterize the user's facial expression information when speaking, such as the eyes and mouth.
  • the head posture information can be used to represent the direction of the user's head, and the head position can be used to represent the shooting position of the camera.
  • the step of extracting the target user's voice features, the target user's expression parameters, and the target user's head parameters from the training video may include:
  • a speech recognition model may be used to extract speech features from the training video. For example, when the training video is not associated with independent audio data, the audio data of the target user may be extracted based on the training video. When the training video is associated with independent audio data, the audio data of the target user may be directly obtained from the data packet of the training video. Furthermore, the audio data may be input into a DeepSpeech model to output speech features.
  • the DeepSpeech model is composed of a plurality of RNN layers and a CTC Loss structure, which is used to learn the mapping from speech to text.
  • the DeepSpeech model can be used to extract the speech features of the target user's speech sound content.
  • the acquired audio data is sampled to obtain a sampling array, wherein the data format of the audio data can be MP3 (MPEG-1 Audio Layer 3) or WAV (WaveForm), etc.
  • the sampling array is subjected to a Fast Fourier Transform (FFT), and on this basis two layers of convolution (using the ReLU activation function) are computed to obtain the convolved data.
  • 3D face reconstruction can refer to reconstructing a 3D model of a face from one or more 2D images.
  • the 2D image is a video frame in the training video
  • the 3D face reconstruction in the embodiment of the present application refers to reconstructing the target user in the training video to obtain a 3D face.
  • the face shape representation includes the face shape and expression changes learned by the model from the 3D face, and the expression parameters are then determined from the expression changes in the face shape representation.
  • corresponding expression parameters can be obtained from each video frame of the training video.
  • expression parameters can be obtained from each video frame using a 3D deformable face model (3D Morphable Models, 3DMM), which can perform 3D reconstruction on a 2D face in a single video frame to obtain a corresponding 3D face, that is, a 3D face shape, and the face shape representation v of the 3D face shape is: v = v̄ + Es·s + Ee·e, where v̄ denotes the mean face shape.
  • Es and Ee represent the matrices of orthogonal basis vectors in shape space and expression space respectively.
  • s and e represent the shape coefficient and expression coefficient respectively.
  • N represents the number of vertices in the 3D Face Mesh.
  • the expression coefficient e can be used as the expression parameter of the reconstructed 3D face.
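  • For illustration, the following sketch evaluates the linear 3DMM face representation described above in Python; the mean shape v_bar, the basis matrices E_s and E_e, and all dimensions are placeholder assumptions rather than values taken from the present application.

```python
import numpy as np

# Illustrative dimensions (assumptions): N mesh vertices, 80 shape and 64 expression coefficients.
N, DIM_S, DIM_E = 5000, 80, 64

v_bar = np.zeros(3 * N)              # mean face shape (placeholder)
E_s = np.random.randn(3 * N, DIM_S)  # shape basis Es (placeholder; orthogonal in a real model)
E_e = np.random.randn(3 * N, DIM_E)  # expression basis Ee (placeholder)

def face_shape(s: np.ndarray, e: np.ndarray) -> np.ndarray:
    """v = v_bar + Es @ s + Ee @ e, reshaped into (N, 3) vertex coordinates."""
    v = v_bar + E_s @ s + E_e @ e
    return v.reshape(N, 3)

# The expression coefficient e fitted for a video frame is used directly as that
# frame's expression parameter.
vertices = face_shape(np.zeros(DIM_S), np.zeros(DIM_E))
```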
  • the 3D deformable face model can be used to reconstruct the 2D face in a single video frame. Conversely, the vertices of the 3D face mesh can be mapped to a 2D image plane. Transformation mapping refers to the operation of projecting the 3D face onto the image plane.
  • the three-dimensional face of the target user is transformed and mapped to obtain a rotation matrix and a translation vector corresponding to the three-dimensional face.
  • The transformation mapping can be written as v2D = f·Pr·(R·v + t), where f represents the scale factor, Pr represents the orthogonal projection matrix, R represents the rotation matrix (Rotation Matrix), and t represents the translation vector (Translation Vector). Therefore, the rotation matrix R and the translation vector t can be obtained from the above formula.
  • the head position can inversely represent the shooting position of the camera
  • the angle of the head posture changes relative to the shooting angle of the camera. Therefore, once the shooting position is known, the neural radiation field can account for the change of the head posture; based on the head posture and the shooting position of the camera, the shoulder shape and its movement state can be implicitly estimated well, so that the characters in the predicted video frames are complete and realistic and the movements of the head and shoulders are coordinated.
  • the rotation matrix R is converted into Euler angles, which consist of three elements and represent direction information, i.e., the head posture information.
  • the translation vector t, which inversely encodes the camera shooting position, is taken as the head position information.
  • positional encoding is performed on the head posture information and the head position information to obtain two encoded high-dimensional vectors, and the two high-dimensional vectors are connected into a vector representation P.
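  • A minimal sketch of the head-parameter construction described above, assuming a NeRF-style sinusoidal positional encoding; the number of frequency bands and the example angle and translation values are assumptions.

```python
import numpy as np

def positional_encoding(x: np.ndarray, num_freqs: int = 4) -> np.ndarray:
    """Encode each element of x with [sin(2^k * pi * x), cos(2^k * pi * x)], k = 0..num_freqs-1."""
    bands = []
    for k in range(num_freqs):
        bands.append(np.sin((2.0 ** k) * np.pi * x))
        bands.append(np.cos((2.0 ** k) * np.pi * x))
    return np.concatenate(bands)

def head_parameters(euler_angles: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Encode head posture (Euler angles from R) and head position (from t) separately,
    then connect the two high-dimensional vectors into one representation p."""
    pose_enc = positional_encoding(euler_angles)  # head posture information
    pos_enc = positional_encoding(translation)    # head position information
    return np.concatenate([pose_enc, pos_enc])

p = head_parameters(np.array([0.10, -0.05, 0.02]), np.array([0.0, 0.0, 2.5]))
```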
  • S140 Perform network training on a preset single neural radiation field based on conditional input, three-dimensional coordinates, and viewing direction to obtain a video generation model.
  • the neural radiation field is used in this application to render the RGB value of each pixel in the video frame of the two-dimensional video.
  • In the related art, the head and torso of the portrait are reconstructed by two independent neural radiation fields; however, generating the head and torso separately with two independent fields incurs a high computational cost, which is a disadvantage.
  • Moreover, because of the separation of the network structure, generating the head area and the torso area independently with separate neural radiation fields causes the head area and the torso area to be mismatched, so the final reconstructed portrait is not realistic and natural enough. Therefore, in the related technology, the two neural radiation fields cannot make the head and torso of the reconstructed portrait match each other, and the time complexity and space complexity of the algorithm also increase with the separation of the network structure.
  • this application proposes to use a single neural radiation field to reconstruct the head and torso of the portrait, so that the torso movement matches the head movement, thereby making the reconstructed portrait realistic, natural and stable; it can also greatly reduce the time complexity and space complexity of the algorithm, thereby effectively reducing the computing cost.
  • the video generation model is trained based on the total loss, and the total loss includes the image reconstruction loss.
  • the image reconstruction loss is determined by the predicted object color value and the real object color value.
  • the predicted object color value is generated by a single neural radiation field according to conditional input, three-dimensional coordinates and viewing direction.
  • the mouth image area is the most difficult part to learn in the process of neural radiation field generation, because the mouth shape is the part that changes the most with the audio. At the same time, the mouth area is also the most concerned and sensitive viewing area when the audience watches the generated speaking portrait video. Once the lip movement is out of sync with the audio to a certain extent, the audience can immediately notice it, which significantly reduces the display effect of the reconstructed video.
  • the present application proposes to enhance the lip image area to improve the synchronization performance of the mouth and lips.
  • the mouth emphasis loss can be determined, and the mouth emphasis loss is determined by the predicted mouth color value and the real mouth color value.
  • the predicted mouth color value is generated by a single neural radiation field according to the conditional input, three-dimensional coordinates and viewing direction, so as to construct the total loss based on the image reconstruction loss and the mouth emphasis loss.
  • the trained video generation model can not only improve the coordination of the head and shoulder movement, but also improve the synchronization of the mouth movement, thereby improving the authenticity of the reconstructed video display.
  • the three-dimensional coordinates and viewing direction of the spatial sampling points on the camera ray can be obtained first.
  • the camera ray is the light emitted by the camera when imaging the scene, and the camera ray corresponds to the pixel points on the video frame of the training video.
  • This application uses neural radiation fields to synthesize two-dimensional views based on the information of spatial sampling points.
  • a pixel point on the resulting two-dimensional image actually corresponds to the projection set of all continuous spatial sampling points on a camera ray starting from the camera.
  • the neural radiation field can predict the RGB color value (i.e., color value) and density information (i.e., volume density) of the spatial sampling point based on the three-dimensional coordinates and viewing direction of the input spatial sampling point. To this end, it is necessary to know the three-dimensional coordinates and viewing direction of the spatial sampling point on the camera ray.
  • the three-dimensional coordinates of the spatial sampling point can be set according to the position information of the pixel point on the two-dimensional plane image.
  • the pixel coordinates can be converted into the three-dimensional coordinates of the spatial sampling point on the camera ray under the unified world coordinates based on the internal and external parameters of the camera.
  • the viewing direction can be determined according to the shooting angle of the camera shooting scene set in advance, or the viewing direction can be set in advance based on the observation angle of the character in the acquired reference video.
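  • As an illustration of how pixel coordinates can be converted into camera rays and spatial sampling points, the sketch below assumes a pinhole camera with intrinsic matrix K and a camera-to-world extrinsic matrix c2w; the sign conventions and the sample count are assumptions.

```python
import numpy as np

def pixel_to_ray(u: float, v: float, K: np.ndarray, c2w: np.ndarray):
    """Back-project pixel (u, v) into a world-space camera ray (origin, unit viewing direction)."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # direction in camera coordinates
    d_world = c2w[:3, :3] @ d_cam                     # rotate into world coordinates
    d_world /= np.linalg.norm(d_world)                # unit viewing direction d
    origin = c2w[:3, 3]                               # camera centre in world coordinates
    return origin, d_world

def sample_points(origin, direction, t_near, t_far, n_samples=64):
    """Three-dimensional coordinates of spatial sampling points along the ray between the bounds."""
    t_vals = np.linspace(t_near, t_far, n_samples)
    points = origin[None, :] + t_vals[:, None] * direction[None, :]
    return points, t_vals
```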
  • the network training is performed on the preset single neural radiation field based on the conditional input, three-dimensional coordinates and viewing direction.
  • the specific process is as follows.
  • the step of performing network training on a preset single neural radiation field based on conditional input, three-dimensional coordinates, and viewing direction may include:
  • two temporal smoothing networks can be used to filter the speech feature a and the expression parameter e respectively.
  • the expression parameter e is subjected to temporal smoothing: in the temporal dimension, the smoothed expression parameter of the video frame at time t is calculated based on the linear combination of the expression parameter e of each video frame from time step t-T/2 to t+T/2, where T is the time interval, and the expression parameter e is used as the input of the temporal smoothing network, and the weight of the linear combination can be calculated.
  • the temporal smoothing network consists of five one-dimensional convolutions followed by a linear layer with Softmax activation.
  • the smoothed speech features of the video frame at time t are calculated based on the linear combination of the speech features a of each video frame from time step t-T/2 to t+T/2.
  • the speech features a are used as the input of the time smoothing network, and the weight of the linear combination can be calculated.
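  • The temporal smoothing network can be sketched as below; layer widths, the window length and the feature dimension are assumptions, and the module simply learns Softmax weights for the linear combination of features from frames t-T/2 to t+T/2.

```python
import torch
import torch.nn as nn

class TemporalSmoothing(nn.Module):
    """Predicts Softmax weights over a window of frames and returns the weighted
    (smoothed) feature of the centre frame; channel sizes and window length are assumptions."""
    def __init__(self, feat_dim: int, window: int = 8, hidden: int = 32):
        super().__init__()
        layers, in_ch = [], feat_dim
        for _ in range(5):                                    # five one-dimensional convolutions
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = hidden
        self.convs = nn.Sequential(*layers)
        self.to_weights = nn.Linear(hidden * window, window)  # linear layer ...
        self.window = window

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (window, feat_dim) features of the frames from t - T/2 to t + T/2
        h = self.convs(feats.t().unsqueeze(0))                    # (1, hidden, window)
        w = torch.softmax(self.to_weights(h.flatten(1)), dim=-1)  # ... with Softmax activation
        return (w.squeeze(0).unsqueeze(1) * feats).sum(dim=0)     # linear combination

smoother = TemporalSmoothing(feat_dim=29)      # e.g. 29-dim speech features (assumed)
smoothed_a = smoother(torch.randn(8, 29))      # smoothed speech feature of the centre frame
```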
  • a single neural radiation field can calculate the predicted color value c and volume density ⁇ of each spatial sampling point based on the three-dimensional coordinates of the spatial sampling point, the viewing direction, the smoothed speech features, the smoothed expression parameters, and the head parameters.
  • the neural network of a single neural radiation field can be a multi-layer perceptron (MLP), represented by an implicit function F_Θ: (x, d, a, e, p) → (c, σ).
  • the input of the implicit function F_Θ (i.e., the single neural radiation field) includes the three-dimensional coordinates x, viewing direction d, smoothed speech features a, smoothed expression parameters e, and head parameters p.
  • the output of the function F_Θ is the predicted color value c and volume density σ corresponding to the spatial sampling point.
  • a single neural radiation field can be a multi-layer perceptron composed of eight perception layers.
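  • A minimal sketch of such a conditional MLP is given below; all input dimensions and the layer width are assumptions. Following the description above, the density σ is predicted from the three-dimensional coordinates and the conditional input, and the color c from the intermediate feature together with the viewing direction d.

```python
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    """Sketch of F_theta: (x, d, a, e, p) -> (c, sigma). Dimensions and width are assumptions."""
    def __init__(self, dim_x=63, dim_d=27, dim_a=29, dim_e=64, dim_p=42, width=256):
        super().__init__()
        layers, d_in = [], dim_x + dim_a + dim_e + dim_p  # 3D coordinates + conditional input
        for _ in range(8):                                # eight perception layers
            layers += [nn.Linear(d_in, width), nn.ReLU()]
            d_in = width
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)             # volume density sigma
        self.color_head = nn.Sequential(                  # color from intermediate feature and d
            nn.Linear(width + dim_d, width // 2), nn.ReLU(), nn.Linear(width // 2, 3)
        )

    def forward(self, x, d, a, e, p):
        h = self.trunk(torch.cat([x, a, e, p], dim=-1))   # intermediate features
        sigma = torch.relu(self.sigma_head(h))            # predicted volume density
        c = torch.sigmoid(self.color_head(torch.cat([h, d], dim=-1)))  # predicted color value
        return c, sigma
```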
  • a video frame sequence of a training video is obtained, and the video frame sequence is associated with an audio track (i.e., audio data).
  • a three-dimensional deformable face model can be used to reconstruct a three-dimensional face for each video frame, and expression parameters e, head posture information, and head position information can be obtained, and head parameters p are determined based on the head posture information and head position information.
  • DeepSpeech is used to extract speech feature a from the audio track.
  • the smoothed speech features, smoothed expression parameters and head parameters p are used as conditional inputs to input the three-dimensional coordinates x and the viewing direction d into the neural radiation field (i.e., implicit function F ⁇ ).
  • the neural radiation field can predict the volume density and intermediate features corresponding to the spatial sampling point based on the conditional input and the three-dimensional coordinates x, and then predict the color value corresponding to the spatial sampling point based on the intermediate features and the viewing direction d. A complete head-torso image with coordinated motion is then generated based on the predicted color values c and volume densities σ of the spatial sampling points; this complete image is the reconstructed video frame.
  • a single neural radiation field is trained based on the image reconstruction loss and the mouth emphasis loss, where the mouth emphasis loss calculation uses the semantic segmentation map corresponding to the pre-obtained mouth area, and the intermediate feature is the intermediate value generated during the calculation of the neural radiation field.
  • Image reconstruction loss can also guide the neural radiation field to learn the color information of the entire image area, that is, the color value of the pixel point, and the movement state of the shoulder can be estimated based on the head parameters.
  • the trained video generation model can not only improve the coordination of head and shoulder movements, but also improve the synchronization of mouth movements, thereby improving the authenticity of the reconstructed video display.
  • the step of determining the image reconstruction loss corresponding to the entire image area of the video frame based on the predicted color value and volume density may include:
  • the color integration of the camera rays in the entire image area is performed to predict the predicted object color value corresponding to each camera ray in the entire image area.
  • the neural radiation field obtains the color information and density information of a three-dimensional spatial sampling point.
  • a pixel on the obtained two-dimensional image actually corresponds to all continuous spatial sampling points on a camera ray starting from the camera. Therefore, it is necessary to obtain the color value of the camera ray finally rendered on the two-dimensional image based on all spatial sampling points on the camera ray.
  • volume density can be understood as the probability that a camera ray r is terminated when passing through an infinitesimal particle at the position x of the spatial sampling point. This probability is differentiable, that is, the opacity of this spatial sampling point. Since the spatial sampling points on a camera ray are continuous, the color value of the pixel point on the two-dimensional image corresponding to this camera ray can be obtained by integration. Please refer to Figure 4, which shows a schematic diagram of a camera ray.
  • One way to perform color integration over the camera rays in the entire image area of the video frame and predict the predicted object color value corresponding to each camera ray is as follows: obtain the cumulative transparency corresponding to the spatial sampling points on each camera ray in the entire image area, where the cumulative transparency is generated by integrating the volume density of the camera ray over the first integration interval; determine the integrand based on the product of the cumulative transparency, the predicted color value and the volume density; and integrate the integrand over the second integration interval.
  • the color integral function is used to perform color integration to predict the predicted object color value corresponding to each camera ray in the entire image area; wherein the first integration interval is the sampling distance of the camera ray from the near boundary to the spatial sampling point, and the second integration interval is the sampling distance of the camera ray from the near boundary to the far boundary.
  • the cumulative transparency T(t) corresponding to each spatial sampling point on the camera ray in the entire image area of the video frame of the training video is obtained, wherein the cumulative transparency can be understood as the probability that the camera ray does not hit any particle in the first integral interval.
  • the cumulative transparency can be generated by integrating the volume density of the camera ray in the first integral interval.
  • the first integral interval is the sampling distance of the camera ray from the near-end boundary tn to the spatial sampling point t.
  • the integral formula is as follows: T(t) = exp( −∫_{t_n}^{t} σ(r(s)) ds ).
  • the integrand is determined, and the color integral of the integrand is performed on the second integral interval to predict the predicted object color value C(r) corresponding to each camera ray in the entire image area.
  • r(s) represents the camera ray.
  • the second integral interval is the sampling distance of the camera ray from the near boundary tn to the far boundary tf .
  • the color integral can be expressed as: C(r) = ∫_{t_n}^{t_f} T(t)·σ(r(t))·c(r(t), d) dt.
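  • In practice this color integral is approximated by numerical quadrature over the sampled points of each ray, for example as in the following sketch (the standard NeRF discretization; variable names are illustrative).

```python
import torch

def render_ray_color(colors: torch.Tensor, sigmas: torch.Tensor, t_vals: torch.Tensor) -> torch.Tensor:
    """Quadrature of the color integral C(r) for one camera ray.
    colors: (S, 3) predicted color values c; sigmas: (S,) volume densities; t_vals: (S,) sample depths."""
    deltas = torch.cat([t_vals[1:] - t_vals[:-1], torch.full((1,), 1e10)])  # spacing between samples
    alpha = 1.0 - torch.exp(-sigmas * deltas)                               # per-sample opacity
    # cumulative transparency T(t): probability that the ray reaches the sample without termination
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10]), dim=0)[:-1]
    weights = trans * alpha
    return (weights.unsqueeze(-1) * colors).sum(dim=0)                      # predicted object color C(r)
```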
  • the image reconstruction loss corresponding to the entire image area is determined.
  • the error between the predicted object color value C(r) corresponding to each camera ray in the entire image area and the corresponding real object color value C_gt(r) can be calculated.
  • the image reconstruction loss can be constructed based on the mean square error (MSE): L_photometric = Σ_{r∈R} ‖C(r) − C_gt(r)‖².
  • R is a camera ray set, which includes camera rays on all image regions. It should be noted that the original color values of all pixel points in the video frame of the training video can be used as the ground-truth color values of the camera rays corresponding to the pixel points.
  • the step of determining the mouth emphasis loss corresponding to the mouth image region of the video frame based on the predicted color value and the volume density may include:
  • the mouth emphasis loss corresponding to the mouth image area is determined.
  • the video frames in the training video can be semantically segmented to obtain the mouth image area corresponding to the video frame, and the mouth emphasis loss can be determined based on the predicted color value and volume density.
  • the camera rays in the mouth image area of the video frame are color integrated to predict the predicted mouth color value corresponding to each camera ray in the mouth image area.
  • the mouth emphasis loss corresponding to the mouth image area is determined.
  • the mouth emphasis loss can be constructed based on the mean square error: L_mouth = Σ_{r∈R_mouth} ‖C(r) − C_gt(r)‖².
  • R mouth is a camera ray set, which includes camera rays on the mouth image area. It should be noted that the original color value of the pixel point in the mouth area on the video frame in the training video can be used as the ground-truth mouth color value of the camera ray corresponding to the pixel point.
  • the total loss is constructed by combining the image reconstruction loss and the mouth emphasis loss, and the total loss is used to train the network for a single neural radiation field.
  • the present application multiplies the mouth emphasis loss L_mouth by an additional weight coefficient λ and adds it to the image reconstruction loss L_photometric to form the total loss L = L_photometric + λ·L_mouth, which is used to perform network training on the single neural radiation field.
  • the step of combining the image reconstruction loss and the mouth emphasis loss to construct a total loss, and using the total loss to perform network training on a single neural radiation field may include:
  • the weight coefficient can be set to an empirically optimal value based on experience gained from network training experiments.
  • the total loss is determined based on the image reconstruction loss, the weight coefficient and the mouth emphasis loss.
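  • A sketch of how the two losses and the weighted total loss could be computed for a batch of sampled camera rays is shown below; the weight value and the use of a boolean mouth mask derived from the semantic segmentation map are assumptions.

```python
import torch

def total_loss(pred_rgb: torch.Tensor, gt_rgb: torch.Tensor,
               mouth_mask: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    """pred_rgb, gt_rgb: (R, 3) predicted / real color values of the sampled camera rays;
    mouth_mask: (R,) boolean mask of rays falling inside the mouth image area."""
    l_photometric = ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()   # image reconstruction loss
    mouth_err = ((pred_rgb[mouth_mask] - gt_rgb[mouth_mask]) ** 2).sum(dim=-1)
    l_mouth = mouth_err.mean() if mouth_err.numel() > 0 else pred_rgb.new_zeros(())
    return l_photometric + lam * l_mouth                            # L = L_photometric + lambda * L_mouth
```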
  • the single neural radiation field is iteratively trained according to the total loss until the single neural radiation field meets the preset conditions.
  • the single neural radiation field can be iteratively trained according to the total loss until the single neural radiation field meets the preset conditions, wherein the preset conditions can be: the total loss value of the total loss function L is less than the preset value, the total loss value of the total loss function L no longer changes, or the number of training times reaches the preset number of times, etc.
  • an optimizer can be used to optimize the total loss function L, and the learning rate (Learning Rate), the batch size (Batch Size) during training, and the epoch (Epoch) of training can be set based on experimental experience.
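  • The optimization setup can be illustrated with the self-contained stand-in below; the Adam optimizer, learning rate, batch size and epoch count are assumed values, and the two-layer network merely stands in for the single neural radiation field.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(198, 256), nn.ReLU(), nn.Linear(256, 4))  # stand-in for F_theta
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)                 # optimizer and learning rate

num_epochs, batch_size = 40, 1024                   # assumed training schedule
for epoch in range(num_epochs):
    inputs = torch.randn(batch_size, 198)           # stand-in for (x, d, a, e, p) of sampled rays
    target = torch.rand(batch_size, 4)              # stand-in for supervision signals
    loss = ((model(inputs) - target) ** 2).mean()   # stands in for the total loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```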
  • the single neural radiation field that meets the preset conditions can be used as a video generation model.
  • the video generation model can be used to reconstruct the target user's video to be reconstructed, and finally obtain a reconstructed video.
  • the target user's video to be reconstructed can be obtained, and then the video to be reconstructed is reconstructed according to the video generation model to obtain the reconstructed video corresponding to the target user.
  • the video to be reconstructed includes at least a conference video in a video conference, a live video in a live broadcast, and a pre-recorded video, etc., which are not limited here.
  • the object of the video to be reconstructed is reconstructed according to the video generation model, and a method for obtaining the reconstructed video corresponding to the target user can be to obtain a preset number of frames of the video to be reconstructed from the video to be reconstructed, wherein the preset number of frames can be determined by the computing performance of the computer device currently performing object reconstruction.
  • each video frame to be reconstructed is input into the video generation model, and the video generation model predicts the reconstructed video frame of each video frame to be reconstructed. Since the video generation model introduces head posture information and head position information when reconstructing video frames, it can estimate the appropriate shoulder shape to adapt to changes in the head state and position, thereby making the shoulders and head of the generated character image appear more natural, stable and coordinated on the overall video frame, and based on all the calculated reconstructed video frames, the reconstructed video corresponding to the target user is synthesized.
  • voice features, expression parameters and head parameters are extracted from the training video of the target user.
  • the head parameters are used to characterize the head posture information and head position information of the target user.
  • the voice features, expression parameters and head parameters are merged to obtain the conditional input of the training video. Then, based on the conditional input, three-dimensional coordinates and viewing direction, the preset single neural radiation field is trained in the network to obtain a video generation model.
  • the video generation model can give the reconstructed portrait facial expressions while considering the head movement, so that the reconstructed portrait has high resolution, thereby improving the clarity of the reconstructed image, and the movement state of the shoulders can be implicitly estimated according to the head posture information and the head position information, so that the generated reconstructed portrait can maintain the coordination between the head movement and the shoulder movement, and can also ensure that the reconstructed portrait has the integrity of the head and shoulders.
  • the video generation model can be trained based on image reconstruction loss and mouth emphasis loss, wherein the image reconstruction loss is determined by the predicted object color value and the true object color value generated by a single neural radiation field according to the conditional input, and the mouth emphasis loss is determined by the predicted mouth color value and the true mouth color value generated by a single neural radiation field according to the conditional input.
  • the image reconstruction loss can guide a single neural radiation field to predict the different lighting effects at the spatial sampling point under different viewing angles.
  • the color integral can make the color of the pixel corresponding to the camera ray richer, thereby enhancing the display effect of the reconstructed video.
  • the reconstructed video can be synchronized with the mouth movement of the video to be reconstructed, and the change of the mouth shape can be accurately matched with the voice.
  • the reconstructed portrait can maintain the coordination between the head movement and the shoulder movement, thereby greatly improving the authenticity of the reconstructed video display.
  • the following will take the video generation model training device specifically integrated in a computer device as an example for explanation, and will elaborate in detail on the process shown in FIG5 in combination with the application scenario shown in FIG6.
  • the computer device may be a server or a terminal device, etc. Please refer to FIG5, which shows another video generation model training method provided in an embodiment of the present application.
  • the video generation model training method can be applied to the video conferencing scenario shown in FIG6.
  • the video conferencing service provider provides a service end, which includes a cloud training server 410 and a cloud execution server 430.
  • the cloud training server 410 is used to train a video generation model for object reconstruction
  • the cloud execution server 430 is used to deploy a video generation model for object reconstruction and a computer program for video conferencing related functions.
  • after the object reconstruction is completed, the generated reconstructed video is sent to the client.
  • the client may include the video conferencing software 421 opened on the smart TV 420 when the recipient uses the video conferencing service, and the video conferencing software 441 opened on the laptop 440 when the sender uses the video conferencing service.
  • the sender and the receiver conduct a video conference through their respective video conference software, i.e., the client.
  • the sender can use the object reconstruction function on the video conference software 441 for personal reasons to reconstruct his real portrait, so that the reconstructed ideal portrait is shown on the receiver's video conference software 421.
  • the reconstruction of the portrait is completed by the cloud execution server 430 on the service side using the video generation model.
  • Figure 6 is only an application scenario provided by the embodiment of the present application.
  • the application scenario described in the embodiment of the present application is to more clearly illustrate the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application.
  • the reconstruction of the real portrait in Figure 6 can also be completed directly on the video conferencing software 441, and the cloud execution server 430 can transmit the reconstructed portrait video generated by the video conferencing software 441 to the video conferencing software 421.
  • the technical solution provided in the embodiment of the present application is also applicable to solving similar technical problems.
  • the training method of the video generation model can specifically include the following steps:
  • S210 The computer device obtains an initial video of a preset duration.
  • the initial video records the audio content of the target user speaking.
  • in the related art, additional reference images are needed to provide identity information for network learning.
  • This application proposes to obtain a video of a specific person, that is, an initial video of a preset length as training data, which can be used for network learning of video reconstruction, avoiding the use of too much training data, thereby improving the efficiency of network training.
  • the sender may use a pre-recorded speech video with a preset duration of five minutes as the initial video, and send the initial video to the cloud training server 410 for preprocessing through the video conferencing software 441.
  • the video conferencing software 441 may also directly preprocess the initial video to obtain a training video, and then send the training video to the cloud training server 410.
  • S220 The computer device pre-processes the initial video according to a preset resolution and a preset sampling rate to obtain a training video.
  • through preprocessing, the present application can place the portrait of the target user in the initial video in the central area of the video frames of the training video, so that in the reconstructed video generated by the video generation model obtained after training, the character area occupies the center of the video picture.
  • the preset resolution and the preset sampling rate can be set according to the display requirements of the character content in the video screen in the actual application scenario. For example, after receiving the initial video sent by the video conferencing software 441, the cloud training server 410 can sample the initial video based on a sampling frequency of 25fps, and crop the video frames sampled from the initial video based on a resolution of 450 ⁇ 450 pixels to obtain a training video, so that the portrait of the target user occupies the central area of the video frame.
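  • A possible preprocessing sketch with OpenCV is shown below, assuming the initial video frames are larger than 450×450 pixels and that a simple centre crop keeps the portrait in the central area (in practice a face detector could be used to choose the crop window).

```python
import cv2

def preprocess(initial_video: str, out_path: str, target_fps: float = 25.0, size: int = 450) -> None:
    """Resample the initial video to the preset sampling rate and crop each kept frame to size x size."""
    cap = cv2.VideoCapture(initial_video)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), target_fps, (size, size))
    keep_every, next_keep, idx = src_fps / target_fps, 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx + 1e-6 >= next_keep:                  # keep frames at roughly the target rate
            h, w = frame.shape[:2]
            y0, x0 = (h - size) // 2, (w - size) // 2
            writer.write(frame[y0:y0 + size, x0:x0 + size])
            next_keep += keep_every
        idx += 1
    cap.release()
    writer.release()
```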
  • S230 The computer device extracts the conditional input corresponding to the training video of the target user.
  • This application introduces the user's head posture information and head position information into conditional input, so that the neural radiation field can implicitly estimate the movement state of the shoulder based on the head posture information and head position information, so that the generated reconstructed portrait can maintain the coordination between head movement and shoulder movement.
  • a method for extracting the conditional input corresponding to the training video is to obtain a training video of the target user; extract the target user's voice features, the target user's expression parameters and the target user's head parameters from the training video, where the head parameters are used to characterize the target user's head posture information and head position information; and merge the target user's voice features, the target user's expression parameters and the target user's head parameters to obtain the conditional input of the training video.
  • the step of extracting the target user's voice features, the target user's expression parameters, and the target user's head parameters from the training video by the computer device may include:
  • the computer device extracts speech features from the training video of the target user to obtain speech features.
  • the cloud training server 410 can use the DeepSpeech model to learn the mapping of speech to text in the training video, that is, to extract the speech features of the target user's speech sound content.
  • the cloud training server 410 can sample the audio data associated with the training video to obtain a sampling array, perform a fast Fourier transform on the sampling array, and perform a two-layer convolution calculation on this basis to obtain the convolved data.
  • the cloud training server 410 performs a Shape operation on the convolved data, slices the result into a preset number of data slices, inputs each data slice into its respective RNN layer, obtains the corresponding output data from each RNN layer, and merges the output data to obtain the speech feature a corresponding to the audio data.
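  • A rough sketch of such a feature-extraction pipeline is given below; it is not the actual DeepSpeech implementation, and the spectrogram settings, channel widths, number of slices and GRU cells are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Rough sketch of the described pipeline: spectrogram -> two Conv1d layers
    -> per-slice RNNs -> concatenated latent code; all dimensions are illustrative."""
    def __init__(self, n_freq=161, hidden=128, n_slices=4, feat_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.n_slices = n_slices
        self.rnns = nn.ModuleList(
            [nn.GRU(hidden, feat_dim, batch_first=True) for _ in range(n_slices)]
        )

    def forward(self, waveform, n_fft=320, hop=160):
        # waveform: (batch, samples) sampled from the training video's audio track
        spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                          return_complex=True).abs()       # (B, n_freq, frames)
        x = self.conv(spec)                                 # (B, hidden, frames)
        slices = torch.chunk(x.transpose(1, 2), self.n_slices, dim=1)
        outs = [rnn(s)[0][:, -1] for rnn, s in zip(self.rnns, slices)]
        return torch.cat(outs, dim=-1)                      # speech feature a
```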
  • the computer device performs three-dimensional face reconstruction on the training video of the target user to obtain a facial shape representation of the target user's three-dimensional face shape, and determines the expression parameters of the target user based on the facial shape representation.
  • the cloud training server 410 can use a three-dimensional deformable face model to obtain expression parameters from each video frame.
  • the three-dimensional deformable face model can perform three-dimensional reconstruction on a two-dimensional face in a single video frame to obtain a corresponding three-dimensional face shape representation.
  • the facial shape representation v of the reconstructed three-dimensional face shape can be written as v = v̄ + Es·s + Ee·e, where v̄ is the mean face shape computed over a selected face dataset, Es and Ee represent the matrices of orthogonal basis vectors of the shape space and the expression space respectively, s and e represent the shape coefficient and the expression coefficient respectively, and the reconstructed 3D face mesh contains N vertices.
  • the expression coefficient e can be used as the expression parameter of the reconstructed three-dimensional face shape.
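  • In code form, the facial shape representation reduces to the linear model below (a minimal illustration; the basis matrices and coefficient dimensions depend on the specific 3DMM that is fitted).

```python
import numpy as np

def face_shape(v_mean, E_s, E_e, s, e):
    """Facial shape representation v = v_mean + E_s @ s + E_e @ e.
    v_mean: (3N,) mean face, E_s: (3N, ks) shape basis, E_e: (3N, ke) expression
    basis; the expression coefficient e is taken as the expression parameter."""
    return v_mean + E_s @ s + E_e @ e
```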
  • the computer device transforms and maps the three-dimensional face shape of the target user to obtain the rotation matrix and translation vector corresponding to the three-dimensional face shape.
  • the cloud training server 410 can transform the three-dimensional face of the target user to obtain the rotation matrix and translation vector corresponding to the three-dimensional face.
  • the weak perspective projection of a mesh vertex v onto the two-dimensional image plane can be written as g = f·Pr·R·v + t, where f represents the scale factor, Pr represents the orthographic projection matrix, R represents the rotation matrix, and t represents the translation vector.
  • the computer device determines the head posture information based on the rotation matrix and determines the head position information based on the translation vector, and obtains the head parameters of the target user according to the head posture information and the head position information.
  • the cloud training server 410 can convert the rotation matrix into Euler angles, which consist of three elements and represent the orientation, that is, the head posture information, and take the translation vector as the head position information. Positional encoding is then applied to the head posture information and the head position information to obtain two encoded high-dimensional vectors, which are concatenated into a single vector representation P.
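  • The sketch below illustrates this head-parameter construction; the Euler-angle convention and the number of positional-encoding frequencies are assumptions chosen for illustration rather than values fixed by this application.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def positional_encoding(v, n_freqs=4):
    """NeRF-style encoding: map each element to [sin(2^k * v), cos(2^k * v)]."""
    out = []
    for k in range(n_freqs):
        out.append(np.sin((2.0 ** k) * v))
        out.append(np.cos((2.0 ** k) * v))
    return np.concatenate(out, axis=-1)

def head_parameters(R, t, n_freqs=4):
    """Build the head parameter vector P from the rotation matrix R (3x3) and
    the translation vector t (3,) of the fitted 3D face, as described above."""
    euler = Rotation.from_matrix(R).as_euler("xyz")          # head posture (3 angles)
    pose_enc = positional_encoding(euler, n_freqs)           # encoded posture
    pos_enc = positional_encoding(np.asarray(t), n_freqs)    # encoded position
    return np.concatenate([pose_enc, pos_enc], axis=-1)      # vector representation P
```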
  • S240 The computer device performs network training on a preset single neural radiation field based on conditional input, three-dimensional coordinates, and viewing direction to obtain a video generation model.
  • the training method of the video generation model provided in the embodiments of the present application includes the training of a preset single neural radiation field. It is worth noting that this training can be performed in advance based on an acquired set of training sample data; afterwards, whenever object reconstruction needs to be performed, the trained video generation model can be used for direct calculation without repeating the network training.
  • the step of the computer device performing network training on a preset single neural radiation field based on conditional input, three-dimensional coordinates, and viewing direction may include:
  • the computer device performs time smoothing processing on the speech features and expression parameters respectively to obtain corresponding smoothed speech features and smoothed expression parameters.
  • the cloud training server 410 can use two time smoothing networks to filter the speech feature a and the expression parameter e respectively.
  • the expression parameter e is subjected to time smoothing: in the time dimension, the smoothed expression parameter of the video frame at time t is calculated based on the linear combination of the expression parameter e of each video frame from time step t-T/2 to t+T/2, and the weight of the linear combination can be calculated using the expression parameter e as the input of the time smoothing network.
  • the time smoothing network consists of five one-dimensional convolutions followed by a linear layer with softmax activation.
  • the cloud training server 410 can calculate the smoothed speech features of the video frame at time t based on the linear combination of the speech features a of each video frame at time steps t-T/2 to t+T/2 in the time dimension, and use the speech features a as the input of the time smoothing network to calculate the weight of the linear combination.
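  • A hedged sketch of such a temporal smoothing network is given below; the window size T, channel widths and kernel sizes are illustrative assumptions, and the same module can be applied to the expression parameters and to the speech features.

```python
import torch
import torch.nn as nn

class TemporalSmoother(nn.Module):
    """Sketch of the described temporal smoothing network: five 1-D convolutions
    followed by a softmax-activated linear layer that predicts the weights of a
    linear combination over a window of T neighbouring frames."""
    def __init__(self, dim, T=8, hidden=64):
        super().__init__()
        self.T = T
        layers, c = [], dim
        for _ in range(5):
            layers += [nn.Conv1d(c, hidden, kernel_size=3, padding=1), nn.ReLU()]
            c = hidden
        self.convs = nn.Sequential(*layers)
        self.to_weights = nn.Linear(hidden, T + 1)

    def forward(self, seq):
        # seq: (frames, dim) per-frame features (expression params or speech features)
        half = self.T // 2
        padded = torch.cat([seq[:1].repeat(half, 1), seq, seq[-1:].repeat(half, 1)])
        windows = padded.unfold(0, self.T + 1, 1)           # (frames, dim, T+1)
        h = self.convs(windows).mean(dim=-1)                # (frames, hidden)
        w = torch.softmax(self.to_weights(h), dim=-1)       # (frames, T+1) weights
        return (windows * w.unsqueeze(1)).sum(dim=-1)       # smoothed features
```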
  • the computer device obtains the three-dimensional coordinates and viewing direction of the spatial sampling point on the camera ray.
  • the cloud training server 410 can convert pixel coordinates into the three-dimensional coordinates of the spatial sampling points on the camera ray under a unified world coordinate system, based on the intrinsic and extrinsic parameters of the camera.
  • the cloud training server 410 can determine the viewing direction according to a preset shooting angle of the camera imaging the scene, or set the viewing direction in advance based on the observation angle of the character in a pre-acquired reference video.
  • the computer device inputs the three-dimensional coordinates, viewing direction, smoothed speech features, smoothed expression parameters, and head parameters into a preset single neural radiation field, and calculates the predicted color value and volume density corresponding to the spatial sampling point.
  • the cloud training server 410 can use the three-dimensional coordinate x of the spatial sampling point, the viewing direction d, the smoothed speech feature a, the smoothed expression parameter e and the head parameter p as inputs of the implicit function Fθ, so that the implicit function Fθ calculates the predicted color value c and volume density σ of each spatial sampling point.
  • the implicit function Fθ is expressed as: Fθ: (x, d, a, e, p) → (c, σ).
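  • Below is a minimal sketch of such an implicit function as an eight-layer conditioned MLP; the layer width, the condition dimensions and the split into a density trunk plus a direction-dependent color head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalRadianceField(nn.Module):
    """Sketch of the implicit function F_theta: (x, d, a, e, p) -> (c, sigma).
    An 8-layer MLP predicts the volume density and an intermediate feature from
    the 3-D coordinate and the conditions; a small head then predicts the color
    from that feature together with the viewing direction."""
    def __init__(self, x_dim=3, d_dim=3, a_dim=64, e_dim=64, p_dim=48, width=256):
        super().__init__()
        in_dim = x_dim + a_dim + e_dim + p_dim
        layers, c = [], in_dim
        for _ in range(8):
            layers += [nn.Linear(c, width), nn.ReLU()]
            c = width
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)
        self.color_head = nn.Sequential(
            nn.Linear(width + d_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x, d, a, e, p):
        h = self.trunk(torch.cat([x, a, e, p], dim=-1))
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)   # volume density
        color = self.color_head(torch.cat([h, d], dim=-1))   # predicted color c
        return color, sigma
```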
  • the computer device determines the image reconstruction loss corresponding to the entire image area of the video frame based on the predicted color value and volume density, and determines the mouth emphasis loss corresponding to the mouth image area of the video frame based on the predicted color value and volume density.
  • the step of determining the image reconstruction loss corresponding to all image regions of the video frame of the training video based on the predicted color value and volume density may include:
  • the computer device performs color integration on the camera rays in the entire image area of the video frame based on the predicted color value and volume density, and predicts the predicted object color value corresponding to each camera ray in the entire image area.
  • the cloud training server 410 can obtain the cumulative transparency corresponding to the spatial sampling points on each camera ray in the entire image area, where the cumulative transparency can be understood as the probability that the camera ray does not hit any particle within the first integration interval.
  • the cumulative transparency can be generated by integrating the volume density along the camera ray over the first integration interval, where the first integration interval is the sampling distance of the camera ray from the near boundary to the spatial sampling point.
  • the cloud training server 410 can determine the integrand based on the product of the accumulated transparency, the predicted color value and the volume density, and perform color integration on the integrand over the second integration interval to predict the predicted object color value corresponding to each camera ray in the entire image area.
  • the second integration interval is the sampling distance of the camera ray from the near boundary to the far boundary.
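  • The color integration can be approximated numerically with the standard quadrature over the samples between the near and far boundaries, as in the hedged sketch below; the handling of sample spacing is illustrative.

```python
import torch

def render_ray_color(sigma, color, t_vals):
    """Numerical version of the color integral described above: accumulated
    transparency from the volume densities, then a weighted sum of the predicted
    colors along the ray between the near and far boundaries."""
    # sigma: (n_samples,), color: (n_samples, 3), t_vals: (n_samples,) sample depths
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, deltas[-1:]])            # distance between samples
    alpha = 1.0 - torch.exp(-sigma * deltas)             # opacity of each sample
    trans = torch.cumprod(
        torch.cat([torch.ones(1, device=sigma.device), 1.0 - alpha + 1e-10])[:-1],
        dim=0,
    )                                                    # accumulated transparency
    weights = trans * alpha
    return (weights.unsqueeze(-1) * color).sum(dim=0)    # predicted object color C(r)
```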
  • the cloud training server 410 can determine the image reconstruction loss corresponding to the entire image area based on the predicted object color value corresponding to each camera ray in the entire image area and the corresponding real object color value.
  • the image reconstruction loss can be constructed based on the mean square error.
  • the original color value of the pixel point in the entire area on the video frame in the training video is used as the real object color value of the camera ray corresponding to the pixel point.
  • the computer device determines the image reconstruction loss corresponding to the entire image area based on the predicted object color value corresponding to each camera ray in the entire image area and the corresponding real object color value.
  • the step of determining the mouth emphasis loss corresponding to the mouth image region of the video frame based on the predicted color value and the volume density may include:
  • the camera rays in the mouth image area are color integrated to predict the predicted mouth color value corresponding to each camera ray in the mouth image area.
  • the cloud training server 410 can perform image semantic segmentation on the video frames in the training video to obtain the mouth image area corresponding to the video frame, and based on the predicted color value and volume density, perform color integration on the camera rays in the mouth image area of the video frame to predict the predicted mouth color value corresponding to each camera ray in the mouth image area.
  • the cloud training server 410 can determine the mouth emphasis loss corresponding to the mouth image area based on the predicted mouth color value corresponding to each camera ray in the mouth image area and the corresponding real mouth color value, and use the original color value of the pixel point in the mouth area on the video frame in the training video as the real mouth color value of the camera ray corresponding to the pixel point.
  • the computer device combines the image reconstruction loss and the mouth emphasis loss to construct a total loss, and uses the total loss to train the network on a single neural radiation field.
  • this application multiplies the mouth emphasis loss by an additional weight coefficient λ and adds it to the image reconstruction loss to form the total loss, L = L_photometric + λ·L_mouth, which is used to perform network training on the single neural radiation field.
  • the cloud training server 410 may obtain the weight coefficient and determine the total loss based on the image reconstruction loss, the weight coefficient and the mouth emphasis loss, and then iteratively train the single neural radiation field according to the total loss until the single neural radiation field meets the preset conditions.
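  • A minimal sketch of this training objective on a batch of rays is given below; the weight value lam is an illustrative assumption, and the mouth mask is assumed to come from the semantic segmentation step described above.

```python
import torch

def total_loss(pred_rgb, gt_rgb, mouth_mask, lam=0.05):
    """Sketch of L = L_photometric + lambda * L_mouth.
    pred_rgb/gt_rgb: (n_rays, 3) predicted and ground-truth colors for the rays
    of one frame; mouth_mask: (n_rays,) bool flags for rays whose pixels fall
    inside the semantic mouth region; lam: illustrative weight coefficient."""
    photometric = ((pred_rgb - gt_rgb) ** 2).mean()           # MSE over all rays
    if mouth_mask.any():
        mouth = ((pred_rgb[mouth_mask] - gt_rgb[mouth_mask]) ** 2).mean()
    else:
        mouth = torch.zeros((), device=pred_rgb.device)
    return photometric + lam * mouth
```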
  • test set A and test set B are both speaking portrait videos.
  • Related technologies include MakeItTalk, AD-NeRF, Wav2Lip and NerFACE.
  • Evaluation indicators include PSNR and SSIM for evaluating the quality of reconstructed video frames (such as facial expressions); LPIPS for measuring the quality of realism; LMD for evaluating the accuracy of mouth shape; Sync for evaluating the synchronization of lips and audio.
  • for test set A and test set B, the evaluation indicators PSNR, SSIM and LPIPS are calculated on the entire image area, and the evaluation indicators LMD and Sync are calculated on the mouth image area.
  • the calculation results are shown in Table 1 below:
  • the method proposed in this application achieved the best performance in terms of evaluation indicators PSNR, SSIM, LPIPS and LMD. At the same time, it also has superiority in audio-lip synchronization and accuracy. For example, it can be observed that the human portrait in the reconstructed video frame created by the method of this application has more accurate facial expressions, higher lip synchronization accuracy and more natural head-torso coordination.
  • AD-NeRF relies on using two independent neural radiation fields to model the head and torso, which inevitably leads to separation and shaking of the neck of the portrait.
  • in contrast, this application introduces detailed head posture information and head position information as conditional input to a single neural radiation field, and can therefore generate more accurate visual details, such as facial expressions, than AD-NeRF.
  • the training method of the video generation model can be intuitively compared with related technologies on two test sets, that is, the reconstructed video frames generated by each method are compared together.
  • the related technologies include MakeItTalk, AD-NeRF, Wav2Lip, ATVG, PC-AVS and NerFACE.
  • Figure 7 shows a schematic diagram of the performance comparison; it should be explained that the schematic diagram is an example image obtained after processing.
  • S250 The computer device performs object reconstruction on the target user's to-be-reconstructed video according to the video generation model to obtain the reconstructed video corresponding to the target user.
  • the single neural radiation field that meets the preset conditions can be deployed as a video generation model on the cloud execution server 430. Then, the cloud execution server 430 can reconstruct the target user's video to be reconstructed based on the video generation model, and finally obtain the reconstructed video.
  • the cloud execution server 430 may obtain the conference video to be reconstructed, that is, the video to be reconstructed, sent by the sender through the video conferencing software 441 on the laptop computer 440, and then obtain the video frames to be reconstructed with a preset number of frames from the conference video, wherein the preset number of frames may be determined by the computing performance of the computer device currently performing object reconstruction.
  • the cloud execution server 430 may evaluate the computing performance by querying the memory utilization and the GPU computing performance.
  • the cloud execution server 430 may divide its own computing performance into different levels and match the corresponding preset number of frames for the computing performance of different levels.
  • the cloud execution server 430 can input each video frame to be reconstructed into the video generation model, predict the reconstructed video frame of each video frame to be reconstructed from the video generation model, and synthesize the reconstructed video corresponding to the sender based on the calculated frame sequence of all reconstructed video frames. Then, the reconstructed video is sent to the smart TV 420 of the receiver, and the reconstructed video can be displayed through the video conferencing software 421.
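  • As an illustration only, the per-frame inference loop on the execution server can be sketched as follows; extract_conditions and render_frame are hypothetical stand-ins for the condition-extraction and ray-rendering steps shown earlier.

```python
import torch

@torch.no_grad()
def reconstruct_video(model, frames_to_reconstruct, extract_conditions, render_frame):
    """Sketch of the inference loop: each frame to be reconstructed is turned
    into its conditional input and re-rendered by the trained model."""
    reconstructed = []
    for frame in frames_to_reconstruct:
        a, e, p = extract_conditions(frame)       # speech, expression, head params
        reconstructed.append(render_frame(model, a, e, p))
    return reconstructed                          # frame sequence of the output video
```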
  • Figure 8 shows an implementation effect diagram of a training method for a video generation model.
  • This application greatly improves the authenticity of the speaking portrait video based on the implicit representation ability of a single neural radiation field.
  • the training method of the video generation model can be applied to application scenarios such as video conferencing, video chatting, live broadcasting, and digital humans that require the reconstruction of speaking portrait videos.
  • by training with the expression parameters and speech features as driving sources of the single neural radiation field, the method can obtain a head posture and facial expressions that accurately match the video to be reconstructed ((1) in Figure 8), and a mouth shape that is synchronized with the speech of the video to be reconstructed ((2) in Figure 8), while presenting a different, good-looking appearance.
  • This application also adds the head posture information and head position information in each video frame to the conditional input of a single neural radiation field, thereby guiding the generation of the shoulder area and adapting to the position of the head, and finally generating the natural, stable and coordinated shoulders (3) in Figure 8, avoiding the head and shoulder incoordination problem caused by rigid head and shoulder modeling.
  • an initial video of a preset duration can be obtained, and the initial video is preprocessed according to a preset resolution and a preset sampling rate to obtain a training video.
  • the initial video of the preset duration is obtained as training data, which can be used for network learning for video reconstruction, avoiding the use of too much training data, and greatly improving the efficiency of network training.
  • a conditional input corresponding to a training video of a target user is extracted, the conditional input including voice features, expression parameters and head parameters, the head parameters being used to characterize head posture information and head position information, and network training is performed on a preset single neural radiation field based on the voice features, expression parameters and head parameters to obtain a video generation model.
  • the video generation model can give facial expressions to the reconstructed portrait while considering head movement, so that the reconstructed portrait has high resolution, and the movement state of the shoulders can be implicitly estimated based on the head posture information and the head position information, so that the generated reconstructed portrait can maintain the coordination between the head movement and the shoulder movement, and can also ensure that the reconstructed portrait has the integrity of the head and shoulders.
  • the video generation model is trained based on image reconstruction loss and mouth emphasis loss.
  • the image reconstruction loss is determined by the predicted object color value and the real object color value generated by a single neural radiation field according to conditional input
  • the mouth emphasis loss is determined by the predicted mouth color value and the real mouth color value generated by a single neural radiation field according to conditional input.
  • the training device 500 for the video generation model includes: a condition acquisition module 510, which is used to obtain a training video of a target user, extract the voice features of the target user, the expression parameters of the target user and the head parameters of the target user from the training video, the head parameters being used to characterize the head posture information and head position information of the target user, and merge the voice features of the target user, the expression parameters of the target user and the head parameters of the target user to obtain the conditional input of the training video; and a network training module 520, which is used to perform network training on a preset single neural radiation field based on the conditional input, three-dimensional coordinates and viewing direction to obtain a video generation model; wherein the video generation model is obtained based on total loss training, the total loss includes an image reconstruction loss, the image reconstruction loss is determined by the predicted object color value and the real object color value, and the predicted object color value is generated by the single neural radiation field according to the conditional input, the three-dimensional coordinates and the viewing direction; the video generation model is used to perform object reconstruction on the target user's to-be-reconstructed video to obtain the reconstructed video corresponding to the target user.
  • conditional acquisition module 510 can be specifically used to: extract voice features from the training video of the target user to obtain the voice features of the target user; perform three-dimensional face reconstruction on the training video of the target user to obtain a facial representation of the three-dimensional face of the target user, and determine the expression parameters of the target user based on the facial representation; perform transformation mapping on the three-dimensional face of the target user to obtain a rotation matrix and translation vector corresponding to the three-dimensional face; determine head posture information based on the rotation matrix and determine head position information based on the translation vector, and obtain the head parameters of the target user based on the head posture information and head position information.
  • the total loss includes a mouth emphasis loss, where the mouth emphasis loss is determined by a predicted mouth color value and a true mouth color value, where the predicted mouth color value is generated by a single neural radiation field based on conditional input, three-dimensional coordinates, and viewing direction.
  • the video generation model training device 500 further includes a sampling acquisition unit:
  • a sampling acquisition unit used to acquire the three-dimensional coordinates and viewing angle direction of a spatial sampling point on a camera ray, where the camera ray is the light emitted by the camera when imaging a scene, and the camera ray corresponds to a pixel point on a video frame;
  • the network training module 520 may include: a smoothing processing unit, which is used to perform temporal smoothing processing on speech features and expression parameters respectively to obtain smoothed speech features and smoothed expression parameters; a sampling calculation unit, which is used to input three-dimensional coordinates, viewing direction, smoothed speech features, smoothed expression parameters and head parameters into a preset single neural radiation field, and calculate the predicted color value and volume density corresponding to the spatial sampling point; a loss determination unit, which is used to determine the image reconstruction loss corresponding to the entire image area of the video frame of the training video based on the predicted color value and volume density, and determine the mouth emphasis loss corresponding to the mouth image area of the video frame based on the predicted color value and volume density; a network training unit, which is used to construct a total loss by combining the image reconstruction loss and the mouth emphasis loss, and use the total loss to perform network training on the single neural radiation field.
  • a smoothing processing unit, which is used to perform temporal smoothing processing on speech features and expression parameters respectively to obtain smoothed speech features and smoothed expression parameters
  • the loss determination unit may include: a prediction subunit, used to perform color integration on the camera rays in the entire image area based on the predicted color value and volume density, and predict the predicted object color value corresponding to each camera ray in the entire image area; a reconstruction loss subunit, used to determine the image reconstruction loss corresponding to the entire image area based on the predicted object color value corresponding to each camera ray in the entire image area and the corresponding real object color value.
  • the prediction subunit can be specifically used to: obtain the cumulative transparency corresponding to the spatial sampling point on each camera ray in the entire image area, the cumulative transparency is generated by integrating the volume density of the camera ray over a first integral interval; determine the integrand based on the product of the cumulative transparency, the predicted color value and the volume density; perform color integration on the integrand over a second integral interval to predict the predicted object color value corresponding to each camera ray in the entire image area; wherein the first integral interval is the sampling distance of the camera ray from the near boundary to the spatial sampling point, and the second integral interval is the sampling distance of the camera ray from the near boundary to the far boundary.
  • the loss determination unit can also be specifically used to: perform image semantic segmentation on the video frame to obtain the mouth image area corresponding to the video frame; based on the predicted color value and volume density, perform color integration on the camera rays in the mouth image area of the video frame to predict the predicted mouth color value corresponding to each camera ray in the mouth image area; based on the predicted mouth color value corresponding to each camera ray in the mouth image area and the corresponding real mouth color value, determine the mouth emphasis loss corresponding to the mouth image area.
  • the network training unit can be specifically used to: obtain weight coefficients; determine the total loss based on image reconstruction loss, weight coefficients and mouth emphasis loss; iteratively train a single neural radiation field according to the total loss until the single neural radiation field meets preset conditions.
  • the training device 500 for the video generation model may also include: an initial acquisition module, used to acquire an initial video of a preset length, the initial video records the audio content of the target user speaking; a preprocessing module, used to preprocess the initial video according to a preset resolution and a preset sampling rate to obtain a training video, and the preprocessing is used to determine the object content of the target user in the central area of the video frame of the training video.
  • an initial acquisition module used to acquire an initial video of a preset length, the initial video records the audio content of the target user speaking
  • a preprocessing module used to preprocess the initial video according to a preset resolution and a preset sampling rate to obtain a training video, and the preprocessing is used to determine the object content of the target user in the central area of the video frame of the training video.
  • the video generation model training apparatus 500 may further include an object reconstruction module 530:
  • the object reconstruction module 530 is used to obtain the target user's to-be-reconstructed video; perform object reconstruction on the target user's to-be-reconstructed video according to the video generation model to obtain the reconstructed video corresponding to the target user.
  • the video to be reconstructed includes a conference video
  • the object reconstruction module 530 may be specifically used to:
  • a preset number of video frames to be reconstructed are obtained from the video to be reconstructed; each video frame to be reconstructed is input into the video generation model to calculate a reconstructed video frame corresponding to each video frame to be reconstructed; and the reconstructed video corresponding to the target user is synthesized based on all the reconstructed video frames.
  • the coupling between modules may be electrical, mechanical or other forms of coupling.
  • each functional module in each embodiment of the present application can be integrated into a processing module, or each module can exist physically separately, or two or more modules can be integrated into one module.
  • the above integrated modules can be implemented in the form of hardware or software functional modules.
  • the solution provided by the present application extracts voice features, expression parameters and head parameters from the training video of the target user.
  • the head parameters are used to characterize the head posture information and head position information of the target user.
  • the voice features, expression parameters and head parameters are merged to obtain the conditional input of the training video.
  • the network training is performed on the preset single neural radiation field based on the conditional input, three-dimensional coordinates and viewing direction to obtain a video generation model.
  • the video generation model can give the reconstructed portrait facial expressions while considering the head movement, so that the reconstructed portrait has high resolution, and the movement state of the shoulders can be implicitly estimated according to the head posture information and head position information, so that the generated reconstructed portrait can maintain the coordination between the head movement and the shoulder movement, and can also ensure the integrity of the reconstructed portrait with the head and shoulders.
  • the video generation model is trained based on image reconstruction loss and mouth emphasis loss, wherein the image reconstruction loss is determined by the predicted object color value and the real object color value generated by a single neural radiation field according to conditional input, and the mouth emphasis loss is determined by the predicted mouth color value and the real mouth color value generated by a single neural radiation field according to conditional input.
  • the reconstructed video obtained can be synchronized with the mouth movement of the video to be reconstructed, thereby improving the authenticity of the reconstructed video display.
  • the embodiment of the present application further provides a computer device 600, which includes a processor 610, a memory 620, a power supply 630, and an input unit 640.
  • the memory 620 stores a computer program.
  • the various method steps provided in the above embodiment can be implemented.
  • the structure of the computer device shown in the figure does not constitute a limitation on the computer device, and may include more or fewer components than shown in the figure, or combine certain components, or arrange the components differently. Among them:
  • the processor 610 may include one or more processing cores.
  • the processor 610 uses various interfaces and lines to connect the various parts of the entire computer device, and performs the various functions of the computer device and processes data by running or executing the instructions, programs, instruction sets or program sets stored in the memory 620 and by calling the data stored in the memory 620, thereby performing overall control of the computer device.
  • the processor 610 can be implemented in at least one hardware form of digital signal processing (DSP), field programmable gate array (FPGA), and programmable logic array (PLA).
  • DSP digital signal processing
  • FPGA field programmable gate array
  • PLA programmable logic array
  • the processor 610 can integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), and a modem.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communications. It can be understood that the above-mentioned modem may not be integrated into the processor 610, but may be implemented separately through a communication chip.
  • the computer device 600 may also include a display unit, etc., which will not be described in detail herein.
  • the processor 610 in the computer device will load the executable files corresponding to the processes of one or more computer programs into the memory 620 according to the following instructions, and the processor 610 will run the data stored in the memory 620, such as the phone book and audio and video data, to implement the various method steps provided in the aforementioned embodiments.
  • an embodiment of the present application further provides a computer-readable storage medium 700 , in which a computer program 710 is stored.
  • the computer program 710 can be called by a processor to execute various method steps provided in the embodiment of the present application.
  • the computer-readable storage medium may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM.
  • the computer-readable storage medium includes a non-volatile computer-readable storage medium (Non-Transitory Computer-Readable Storage Medium).
  • the computer-readable storage medium 700 has storage space for a computer program that executes any of the method steps in the above embodiments. These computer programs can be read from or written to one or more computer program products. The computer program can be compressed in an appropriate form.
  • a computer program product comprising a computer program, the computer program being stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes various method steps provided in the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Geometry (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Processing Or Creating Images (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a training method and apparatus for a video generation model, a storage medium and a computer device. Voice features, expression parameters and head parameters are extracted from a training video of a target user, the head parameters being used to characterize the head posture information and head position information of the target user; the voice features, expression parameters and head parameters are merged to obtain the conditional input of the training video; network training is performed on a single neural radiation field on the basis of the conditional input, three-dimensional coordinates and a viewing direction to obtain the video generation model; the video generation model is obtained by training on the basis of a total loss, and the total loss includes an image reconstruction loss. Since head posture information and head position information are introduced during training, the trained video generation model can take the motion state of the shoulders into consideration, so that when video reconstruction is subsequently performed according to the video generation model, the motion between the head and the shoulders is more coordinated and stable, improving the realism of the displayed reconstructed video.

Description

视频生成模型的训练方法、装置、存储介质及计算机设备
本申请要求于2022年10月13日提交中国专利局、申请号202211255944.4、申请名称为“视频生成方法、装置、存储介质及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机视觉技术领域,更具体地,涉及一种视频生成模型的训练方法、装置、存储介质及计算机设备。
背景技术
近年来,人脸重演(Face Reenactment)技术因其在媒体、娱乐、虚拟现实等方面的应用前景而备受关注。说话人像视频的生成作为人脸重演的一项重要任务,被广泛用于视频会议、视频聊天和虚拟人场景中。例如,用户可以利用自己具有良好外观的重构人像,代替自己出镜参加视频会议。
其中,说话人像视频生成的主要原理为利用一个形象更佳的用户重构化身来重演用户实际的人像动作。然而,有关技术生成的说话人像视频容易出现重构视频中用户的人体组织运动不协调的情况,从而,大大降低了视频生成结果呈现给用户的真实感。
发明内容
本申请实施例提供一种视频生成模型的训练方法、装置、存储介质以及计算机设备。旨在提升说话人像视频生成时的运动协调性。
一方面,本申请实施例提供一种视频生成模型的训练方法,该方法由计算机设备执行,该方法包括:获取目标用户的训练视频;从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息;将目标用户的语音特征、目标用户的表情参数和目标用户的头部参数进行合并,得到训练视频的条件输入;基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型;其中,视频生成模型为基于总损失训练得到,总损失包括图像重建损失,图像重建损失是由预测对象颜色值和真实对象颜色值确定的,预测对象颜色值是单个神经辐射场根据条件输入、三维坐标和视角方向生成的,视频生成模型用于对目标用户的待重构视频进行对象重构,得到目标用户对应的重构视频。
另一方面,本申请实施例还提供一种视频生成模型的训练装置,该装置部署在计算机设备上,该装置包括:条件获取模块,用于获取目标用户的训练视频;从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息;将目标用户的语音特征、目标用户的表情参数和目标用户的头部参数进行合并,得到训练视频的条件输入;网络训练模块,用于基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型;其中,视频生成模型为基于总损失训练得到,总损失包括图像重建损失,图像重建损失是由预测对象颜色值和真实对象颜色值确定的,预测对象颜色值是单个 神经辐射场根据条件输入、三维坐标和视角方向生成的;视频生成模型用于对目标用户的待重构视频进行对象重构,得到目标用户对应的重构视频。
另一方面,本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,其中,在该计算机程序被处理器运行时执行上述的视频生成模型的训练方法。
另一方面,本申请实施例还提供一种计算机设备,该计算机设备包括处理器以及存储器,存储器存储有计算机程序,该计算机程序被处理器调用时执行上述的视频生成模型的训练方法。
另一方面,本申请实施例还提供一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序存储在存储介质中;计算机设备的处理器从存储介质读取该计算机程序,处理器执行该计算机程序,使得计算机设备执行上述视频生成模型的训练方法中的步骤。
本申请提供的一种视频生成模型的训练方法,从目标用户的训练视频中提取语音特征、表情参数和头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息,将语音特征、表情参数和头部参数进行合并,得到训练视频的条件输入。进一步地,基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型,该视频生成模型为基于总损失训练得到,总损失包括图像重建损失,图像重建损失是由预测对象颜色值和真实对象颜色值确定的,预测对象颜色值是单个神经辐射场根据条件输入、三维坐标和视角方向生成的。通过在条件输入中引入头部参数,使得网络训练得到的视频生成模型可以根据头部姿态信息与头部位置信息估算出肩膀部分及其运动状态,这样,在使用视频生成模型对目标用户的待重构视频进行对象重构,得到目标用户对应的重构视频时,使得预测出的视频帧中具有完整且逼真的头部与肩膀部分,并且使得头部与肩膀的动作状态保持协调,从而大大提升重构视频显示的真实性。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1示出了本申请实施例提供的一种系统架构示意图;
图2示出了本申请实施例提供的一种视频生成模型的训练方法的流程示意图;
图3示出了本申请实施例提供的一种单个神经辐射场的网络架构图;
图4示出了本申请实施例提供的一种相机射线的示意图;
图5示出了本申请实施例提供的另一种视频生成模型的训练方法的流程示意图;
图6示出了本申请实施例提供的一种应用场景示意图;
图7示出了本申请实施例提供的一种性能对比的示意图;
图8示出了本申请实施例提供的一种视频生成模型的训练方法的实现效果图;
图9是本申请实施例提供的一种视频生成模型的训练装置的模块框图;
图10是本申请实施例提供的一种计算机设备的模块框图;
图11是本申请实施例提供的一种计算机可读存储介质的模块框图。
具体实施方式
下面详细描述本申请的实施方式,实施方式的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施方式是示例性地,仅用于解释本申请,而不能理解为对本申请的限制。
为了使本技术领域的人员更好地理解本申请的方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整的描述。显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
需要说明的是,在本申请的具体实施方式中,涉及到的视频等相关数据,当运用到本申请实施例的具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。为了便于理解,下面先对本申请所涉及到的相关术语及概念进行介绍。
本申请的视频生成模型的训练方法涉及人工智能(Artificial Intelligence,AI)技术,利用人工智能技术自动化进行视频生成模型的训练,以及后续自动化进行视频生成。
在视频会议中,由于一些个人的关注或偏好,用户并不总是方便向所有参会者展示自己当前的真实面貌和周围环境。这种情况下,一个潜在的解决方案是基于用户的一个好看的重构化身重新模拟自身实际的人像运动,进而生成高保真的说话人像视频(Talking Portrait Video),该说话人像视频中的重构化身与用户的语音音频和真实的头部运动、面部表情、眨眼等运动相匹配。上述解决方案也有利于许多其他应用,如数字人、电影制作和多人在线游戏等。
目前,有关说话人像视频生成的建模方案大致可以分为三类:基于模型,基于生成对抗网络(Generative Adversarial Network,GAN)以及基于神经辐射场(Neural Radiance Fields,NeRF)。其中,基于模型的方案通常根据红绿蓝(Red-Green-Blue,RGB)或红绿蓝-深度信息(Red-Green-Blue-Depth map,RGBD)数据创建一个特定人物的三维(Three-Dimensional,3D)模型,然后在不考虑头部运动的情况下为该3D模型赋予面部表情,且生成结果的分辨率受限。基于生成对抗网络的方案一般采用对抗学习模式直接生成人物外观,但其学习过程不能知晓场景的3D几何形状,需要额外参考图像来提供身份信息。
基于神经辐射场的方案主要包括以音频和运动为驱动源(Driving Source)的两种方法。其中,音频驱动方法,如语音驱动神经辐射场(Audio Driven Neural Radiance Fields,AD-NeRF)专注于建立语音音频与视觉外观运动之间的关系。运动驱动方法,如学习一个映射函数,将源运动或表情迁移到目标人脸。然而,AD-NeRF依赖于两个独立的神经辐射场来分别模拟头部和躯干,因此存在网络结构分离的问题。NerFACE(一种基于NeRF的人脸建模算法)无法生成稳定和自然的躯干序列,从而导致说话人像视频中重构人像出现头部和肩部之间运行不协调的问题,且上述方法生成的重构人像的嘴唇形状与用户的嘴唇形状无法同步。
为了解决上述问题,本申请实施例提供了视频生成模型的训练方法,下面先对本申请所涉及到的视频生成模型的训练方法的系统的架构进行介绍。
如图1所示,本申请实施例提供的视频生成模型的训练方法可以应用在系统300中,数据获取设备310用于获取训练数据。针对本申请实施例的视频生成模型的训练方法来说,训练数据可以包括用于训练使用的训练视频。数据获取设备310在获取到训练数据之后,可以将该训练数据存入数据库320,训练设备330可以基于数据库320中维护的训练数据训练得到目标模型301。
训练设备330可以基于训练视频对预设的神经网络进行训练,直至该预设的神经网络满足预设条件,得到目标模型301。其中,预设的神经网络为单个神经辐射场。预设条件可以为:总损失函数的总损失值小于预设值、总损失函数的总损失值不再变化、或者训练次数达到预设次数等。该目标模型301能够用于实现本申请实施例中重构视频的生成。
需要说明的是,在实际的应用场景中,数据库320中维护的训练数据不一定都来自于数据获取设备310,也可以从其他设备接收得到,例如,客户端设备360也可以作为数据获取端,将获取的数据作为新的训练数据,并存入数据库320。此外,训练设备330也不一定完全基于数据库320维护的训练数据对预设的神经网络进行训练,也有可能基于从云端或其他设备获取的训练数据对预设的神经网络进行训练,上述描述不应该作为对本申请实施例的限定。
上述根据训练设备330训练得到的目标模型301可以应用于不同的系统或设备中,如应用于图1所示的执行设备340,该执行设备340可以是终端,例如,手机终端、平板电脑、笔记本电脑、增强现实(Augmented Reality,AR)/虚拟现实(Virtual Reality,VR)等,还可以是服务器或者云端等,但并不局限于此。
在图1中,执行设备340可以用于与外部设备进行数据交互,例如,用户可以使用客户端设备360通过网络向执行设备340发送输入数据。该输入数据在本申请实施例中可以包括:客户端设备360发送的训练视频或待重构视频。在执行设备340对输入数据进行预处理,或者在执行设备340的执行模块341执行计算等相关的处理过程中,执行设备340可以调用数据存储系统350中的数据、程序等以用于相应的计算处理,并将计算处理得到的处理结果等数据和指令存入数据存储系统350中。
最后,执行设备340可以将处理结果,也即,目标模型301生成的重构视频通过网络返回给客户端设备360,从而,用户可以在客户端设备360上查询处理结果。值得说明的是,训练设备330可以针对不同的目标或不同的任务,基于不同的训练数据生成相应的目标模型301,该相应的目标模型301即可以用于实现上述目标或者完成上述任务,从而为用户提供所需的结果。
示例性地,图1所示的系统300可以为客户端-服务器(Client-Server,C/S)系统架构,执行设备340可以为服务供应商部署的云服务器,客户端设备360可以为用户使用的笔记本电脑。例如,用户可以利用笔记本电脑中安装的视频生成软件,通过网络上传待重构视频至云服务器,云服务器在接受到待重构视频时,利用目标模型301 进行人像重构,生对应的重构视频,并将重构视频返回至笔记本电脑,进而用户即可在视频生成上获取重构视频。
值得注意的是,图1仅是本申请实施例提供的一种系统的架构示意图,本申请实施例描述的系统的架构以及应用场景是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定。例如,图1中的数据存储系统350相对执行设备340是外部存储器,在其它情况下,也可以将数据存储系统350置于执行设备340中。执行设备340也可以直接是客户端设备。本领域普通技术人员可知,随着系统架构的演变和新的应用场景的出现,本申请实施例提供的技术方案对于解决类似的技术问题,同样适用。
请参阅图2,图2示出了本申请一个实施例提供的视频生成模型的训练方法的流程示意图。在具体的实施例中,所述视频生成模型的训练方法应用于如图9所示的视频生成模型的训练装置500以及配置有视频生成模型的训练装置500的计算机设备600(图10)。
下面将以计算机设备为例,说明本实施例的具体流程,可以理解的是,本实施例所应用的计算机设备可以为服务器或者终端等,服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、区块链以及大数据和人工智能平台等基础云计算服务的云服务器。终端可以是平板电脑、笔记本电脑、台式电脑、智能音箱、智能手表等,但并不局限于此。所述视频生成模型的训练方法具体可以包括以下步骤:
S110:获取目标用户的训练视频。
S120:从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息。
S130:将目标用户的语音特征、目标用户的表情参数和目标用户的头部参数进行合并,得到训练视频的条件输入。
相关技术提出的仅以语音或者表情作为驱动源来生成说话人像视频的方法,会产生不可忽略的视觉问题,也即头部-躯干运动不协调。分析该问题出现的原因,是由于神经辐射场往往将完整的人像建模为一个刚性实体,而不会区分头部运动和躯干运动。所以,每当改变相机观察方向和位置时,整个人像就会僵硬地改变朝向,肩部运动出现晃动,导致头部运动与肩部运动不协调。
为此,本申请实施例创造性的将用户的头部姿态信息与头部位置信息引入条件输入,使得神经辐射场基于头部姿态信息与头部位置信息可以隐式估算出肩膀的运动状态,从而使得后续生成的重构人像可以保持头部运动与肩部运动之间的协调性。
基于此,条件输入至少可以包括目标用户的语音特征、表情参数以及头部参数,该头部参数可以用于表征头部姿态信息与头部位置信息。语音特征可以用于表征用户说话时的音频信息。表情参数可以用于表征用户说话时面部表情信息,如,眼睛和嘴 巴的动作。头部姿态信息可以用于表征用户头部的朝向,头部位置可以用于反向表征相机的拍摄位置。
在一些实施例中,该从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数的步骤可以包括:
(1)对目标用户的训练视频进行语音特征提取,得到语音特征。
作为一种实施方式,在获取目标用户的训练视频时,可以利用语音识别模型对该训练视频进行语音特征提取。例如,当训练视频没有关联独立的音频数据时,可以基于训练视频提取目标用户的音频数据,当训练视频关联独立的音频数据时,可以从训练视频的数据包直接获取目标用户的音频数据,进一步地,可以将音频数据输入至深度语音(DeepSpeech)模型,输出语音特征。
在一种可能的实现方式中,DeepSpeech模型是由多个RNN层和CTC Loss的结构组成,用来学习语音到文本的映射,在本申请实施例中,DeepSpeech模型可用于提取目标用户说话声音内容的语音特征。对获取的音频数据进行采样,得到采样数组,其中,该音频数据的数据格式可以为MP3(MPEG-1 Audio Layer 3)或WAV(WaveForm)等。进一步地,对采样数组进行快速傅里叶变换(Fast Fourier Transform,FFT),并在此基础上进行两层卷积(激活函数用Relu函数)计算,得到卷积后的数据。
对卷积后的数据进行Shape操作,并对该操作后的数据进行切片操作(Sclice Channel)得到预设数量的数据片,并将每个数据片分别输入每个RNN层,从每个RNN层对应得到输出数据,并对输出数据进行合并操作(Concat)得到音频数据(Audio Data)对应的隐式编码(Latent Code),即为,语音特征a。
(2)对目标用户的训练视频进行三维人脸重构,得到目标用户的三维脸型的脸型表示,并基于脸型表示确定目标用户的表情参数。
其中,三维人脸重构可以是指从一张或多张二维图像中重建出人脸的三维模型,在本申请实施例中,二维图像是训练视频中的视频帧,故本申请实施例的三维人脸重构指的是对训练视频中目标用户进行重新构建得到三维人脸。脸型表示包含了模型从三维人脸中学习到人脸脸型和表情变化,进而通过脸型表示中的表情变化来确定表情参数。
作为一种实施方式,可以从训练视频的每个视频帧中获取对应的表情参数。可选地,可以利用三维可变形人脸模型(3D Morphable Models,3DMM)从每个视频帧中获取表情参数,该三维可变性人脸模型可以对单张视频帧中的二维人脸进行三维重建,得到相应的三维人脸,也即三维脸型,该三维脸型的脸型表示v为:
其中, 表示为在选定的人脸数据集上计算的平均值。Es和Ee分别表示形状空间和表情空间的正交基向量的矩阵。s和e分别表示形状系数和表情系数。N表示三维脸型网格(3D Face Mesh)中的顶点数。进一步地,可以将表情系数e作为重构的三维脸型的表情参数。
(3)对目标用户的三维脸型进行变换映射,得到三维脸型对应的旋转矩阵和平移向量。
利用三维可变形人脸模型可以对单张视频帧中的二维人脸进行三维重建,相反地,也可以将三维脸型网格的顶点映射到一个二维的图像平面。其中,变换映射是指将三维脸型投影到图像平面上的操作。
作为一种实施方式,对目标用户的三维脸型进行变换映射,得到三维脸型对应的旋转矩阵和平移向量。可选地,变换映射可以使用弱透视投影模型,该模型对三维脸型网格的顶点在二维平面的函数输出g可以表示为:
g=f+Pr+R+t
其中,f表示比例因子,Pr表示正交投影矩阵,R表示旋转矩阵(Rotation Matrix)以及t表示平移向量(Translation Vector),以此,可以通过上述公式得到旋转矩阵R和平移向量t。
(4)基于旋转矩阵确定头部姿态信息以及基于平移向量确定头部位置信息,并根据头部姿态信息和头部位置信息得到目标用户的头部参数。
考虑到头部位置可以反向表示出相机的拍摄位置,头部姿态的角度会相对于相机的拍摄角度而改变,因此,神经辐射场在知道拍摄位置的情况下,可以得到头部姿态变化的原因,进而基于头部姿态和相机的拍摄位置,就能很好地隐式估算出肩膀形状及其运动状态,使得预测出的视频帧中的人物具有完整性和逼真性,并且头部与肩膀的动作保持协调。
作为一种实施方式,可以将旋转矩阵转换为欧拉角,欧拉角由3个元素组成,表示方向信息,也即头部姿态信息。并将带有相机拍摄位置信息的平移向量反向表示为头部位置信息。进一步地,对头部姿态信息和头部位置信息进行位置编码(Positional Encoding),分别得到两个编码后的高维向量,并将两个高维向量连接成一个向量表示P。
S140:基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型。
其中,神经辐射场在本申请中用于渲染出二维视频的视频帧中每个像素点的RGB值。有关技术中,通过两个独立的神经辐射场重构人像的头部和躯干,但是两个独立的神经辐射场是分开独自生成的重构人像的头部和躯干的,且计算成本较高,即缺点,然而,利用独立的神经辐射场分别独立生成头部区域和躯干区域的方法,由于网络结构的分离会导致发生头部区域和躯干区域存在不匹配的情况,使得最终重构的人像显示效果不够真实和自然,因此,在有关技术中,两个神经辐射场无法实现重构人像的头部和躯干相互匹配的效果,算法的时间复杂度和空间复杂度也随网络结构的分离变高。
为此,本申请提出使用一个简单的神经辐射场来重构人像的头部和躯干,使得躯干运动能够与头部运行相互匹配,进而使得重构的人像可以达到真实、自然和稳定的 显示效果。并且可以大大降低算法的时间复杂度和空间复杂度,进而有效将降低运算成本。
在本申请实施例中,视频生成模型为基于总损失训练得到,总损失包括图像重建损失,图像重建损失是由预测对象颜色值和真实对象颜色值确定的,预测对象颜色值是单个神经辐射场根据条件输入、三维坐标和视角方向生成的。
考虑到嘴部图像区域是神经辐射场生成图像过程中最难学习的部分,因为嘴部形状是随着音频变化而变化最大的部分。同时,观众在观看生成的说话人像视频时,嘴巴区域也是最关注和最敏感的视图区域。一旦唇动与音频在一定程度上不同步,观众可以立即注意到它,从而显著降低重构视频的显示效果。
因此,本申请提出对唇部图像区域进行增强以提高嘴巴唇部的同步性能。例如可以确定嘴部强调损失,嘴部强调损失是由预测嘴部颜色值和真实嘴部颜色值确定的,预测嘴部颜色值是单个神经辐射场根据条件输入、三维坐标和视角方向生成的,从而基于图像重建损失和嘴部强调损失共同构建总损失。如此,通过结合图像重建损失和嘴部强调损失,使得训练出的视频生成模型能够不仅提高头肩运动的协调性,还能提升嘴部运动的同步性,从而提升重构视频显示的真实性。
在总损失包括像重建损失和嘴部强调损失的情况下,为了实现网络训练,可以先获取相机射线上空间采样点的三维坐标和视角方向,相机射线为相机在对场景进行成像时发出的光线,且相机射线对应训练视频的视频帧上的像素点。
本申请利用神经辐射场可以基于空间采样点的信息来合成二维视图。其中,相机射线为相机在对场景进行成像时发出的光线,且相机射线对应视频帧上的像素点。当相机对三维场景进行成像时,所得到的二维图像上的一个像素点实际上对应了一条从相机出发的相机射线上的所有连续空间采样点的投影集合。
该神经辐射场可以基于输入的空间采样点的三维坐标和视角方向,预测出该空间采样点的RGB颜色值(即为颜色值)和密度信息(即为体积密度)为此,需要知道相机射线上空间采样点的三维坐标和视角方向。
作为一种实施方式,空间采样点的三维坐标x=(x,y,z)和视角方向d=(θ,φ)可以进行预先的制定设置,具体地,由于空间采样点的位置会决定最终二维平面图像像素点的位置,所以可以根据二维平面图像上像素点的位置信息来设定空间采样点的三维坐标,例如,可以基于相机的内外参数将像素坐标转换为了统一的世界坐标下,相机射线上的空间采样点的三维坐标。进一步地,可以根据预先设定的相机拍摄场景的拍摄角度来确定视角方向,也可以预先基于对获取的参考视频中角色的观察角度来设定视角方向。
然后,基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练。具体过程参照如下步骤。
在一些实施例中,该基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练的步骤可以包括:
(1)对语音特征和表情参数分别进行时间平滑处理,得到平滑语音特征和平滑表情参数。
由于每个视频帧的表情参数都是单独获取的,因此相邻两视频帧之间存在时间不连续性。类似地,语音特征也存在同样的问题,这会导致生成的重构视频出现画面抖动跳帧以及声音不流畅的情况。为了使最终生成的重构视频能够更加稳定,可以对语音特征和表情参数分别进行时间平滑处理。
作为一种实施方式,可以分别使用两个时间平滑网络(Temporal Smoothing Network)过滤语音特征a和表情参数e。例如,对表情参数e进行时间平滑处理:在时间维度上,基于时间步长t-T/2到t+T/2上每个视频帧的表情参数e的线性组合来计算出t时刻视频帧的平滑表情参数,其中,T为时间间隔,以表情参数e作为时间平滑网络的输入,可以计算出线性组合的权重。该时间平滑网络由五个一维卷积组成,后跟一个带有Softmax激活的线性层。
在时间维度上,基于时间步长t-T/2到t+T/2上每个视频帧的语音特征a的线性组合来计算出t时刻视频帧的平滑语音特征,以语音特征a作为时间平滑网络的输入,可以计算出线性组合的权重。
(2)将三维坐标、视角方向、平滑语音特征、平滑表情参数以及头部参数输入至预设的单个神经辐射场,计算得到空间采样点对应的预测颜色值和体积密度。
作为一种实施方式,单个神经辐射场可以基于空间采样点的三维坐标、视角方向以及平滑语音特征、平滑表情参数以及头部参数计算出每个空间采样点的预测颜色值c和体积密度σ。其中,单个神经辐射场的神经网络可以为多层感知机(Multi-Layer Perceptron,MLP),由隐函数Fθ表示:
Fθ:(x,d,a,e,p)→(c,σ)
其中,隐函数Fθ(即单个神经辐射场)的输入包括三维坐标x、视角方向d、平滑语音特征a、平滑表情参数e以及头部参数p,函数Fθ的输出为空间采样点对应的预测颜色值c和体积密度σ。
请参阅图3,图3示出了一种单个神经辐射场的网络架构图。其中,单个神经辐射场可以为八个感知层构成的多层感知机。如图3所示,获取训练视频的视频帧序列,该视频帧序列关联有音频轨迹(即音频数据)。在一种可能的实现方式中,可以利用三维可变形人脸模型对每个视频帧进行三维人脸重构,获取表情参数e、头部姿态信息和头部位置信息,并基于头部姿态信息和头部位置信息确定头部参数p。并利用DeepSpeech从音频轨迹中提取语音特征a。
然后,分别对表情参数和语音特征进行时间平滑处理得到平滑语音特和平滑表情参数。并将平滑语音特征、平滑表情参数以及头部参数p作为条件输入联合三维坐标x、视角方向d输入至神经辐射场(即隐函数Fθ)中。
在一种可能的实现方式中,神经辐射场可以基于条件输入和三维坐标x预测出空间采样点对应的体积密度和中间特征,再基于中间特征和视角方向d预测出空间采样点对应的预测颜色值。进而基于空间采样点对应的预测颜色值c和体积密度σ生成头部-躯干 协调运动的完整图像,也即重构视频帧。并基于图像重建损失和嘴部强调损失对单个神经辐射场进行训练,其中,嘴部强调损失计算利用预先得到的嘴部区域对应的语义分割图,中间特征为神经辐射场的计算过程中生成的中间值。
(3)针对训练视频帧的视频帧,基于预测颜色值和体积密度,确定视频帧的全部图像区域对应的图像重建损失,以及基于预测颜色值和所述体积密度,确定视频帧的嘴部图像区域对应的嘴部强调损失。
考虑到嘴部图像区域是神经辐射场生成图像过程中最难学习的部分,因为嘴部形状是随着音频变化而变化最大的部分。同时,观众在观看生成的说话人像视频时,嘴巴区域也是最关注和最敏感的视图区域。一旦唇动与音频在一定程度上不同步,观众可以立即注意到它,从而显著降低重构视频的显示效果。
因此,本申请提出对唇部图像区域进行增强以提高嘴巴唇部的同步性能。利用从每个视频帧中获取的嘴部区域的语义分割图,在每次迭代中找出来自嘴巴的光线,然后在渲染后计算嘴部强调损失的过程中给予较大的权重。图像重建损失也可以很好的指引神经辐射场学习到全部图像区域上的颜色信息,也即像素点的颜色值,同时基于头部参数可以估算出肩部的运动状态。如此,通过结合图像重建损失和嘴部强调损失,使得训练出的视频生成模型能够不仅提高头肩运动的协调性,还能提升嘴部运动的同步性,从而提升重构视频显示的真实性。
作为一种实施方式,该基于预测颜色值和体积密度,确定视频帧的全部图像区域对应的图像重建损失的步骤可以包括:
(3.1)基于预测颜色值和体积密度,对在全部图像区域内的相机射线进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值。
在本申请实施例中,神经辐射场得到的是一个三维空间采样点的颜色信息和密度信息,当用一个相机去对这个场景成像时,所得到的二维图像上的一个像素实际上对应了一条从相机出发的相机射线上的所有连续的空间采样点。因此,需要基于这相机条射线上的所有空间采样点得到这条相机射线最终在二维图像上渲染的颜色值。
此外,体积密度(Volume Density)可以被理解为一条相机射线r在经过空间采样点所处位置x的一个无穷小的粒子时被终止的概率,这个概率是可微的,也即,这个空间采样点的不透明度。由于一条相机射线上的空间采样点是连续的,这条相机射线对应在二维图像上像素点的颜色值可以由积分的方式得到,请参阅图4,图4示出了一种相机射线的示意图,该相机射线(Ray)可以标记为r(t)=o+td,其中,o表示相机射线的原点,d表示相机射线的角度,相机射线上t处近段边界和远端边界分别表示为tn以及tf
在一种可能的实现方式中,基于预测颜色值和体积密度,对在视频帧的全部图像区域内的相机射线进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值的方式可以是获取全部图像区域内每个相机射线上空间采样点对应的累计透明度,累计透明度为在第一积分区间上基于相机射线的体积密度进行积分生成的;基于累计透明度、预测颜色值和体积密度的乘积,确定被积函数;在第二积分区间上对被 积函数进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值;其中,第一积分区间为相机射线从近端边界到空间采样点的采样距离,第二积分区间为相机射线从近端边界到远端边界的采样距离。
具体地,获取训练视频的视频帧的全部图像区域内每个相机射线上空间采样点对应的累计透明度T(t),其中,累计透明度可以被理解为相机射线在第一积分区间上没有击中任何粒子的概率,累计透明度可通过在第一积分区间上基于相机射线的体积密度进行积分生成,第一积分区间为相机射线从近端边界tn到空间采样点处t的采样距离,积分公式如下:
然后,基于累计透明度T(t)、预测颜色值和体积密度的乘积,确定被积函数,并在第二积分区间上对被积函数进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值C(r),r(s)表示相机射线,第二积分区间为相机射线从近端边界tn到远端边界tf的采样距离,颜色积分可以表示为:
(3.2)基于全部图像区域内每个相机射线对应的预测对象颜色值和对应的真实对象颜色值,确定全部图像区域对应的图像重建损失。
在得到预测对象颜色值后,可以可以基于全部图像区域内每个相机射线对应的预测对象颜色值C(r)和对应的真实对象颜色值确定全部图像区域对应的图像重建损失。在一种可能的实现方式中,可以基于均方误差(Mean Square Error,MSE)构建图像重建损失:
其中,R是相机射线集合,该集合中包含了全部图像区域上的相机射线。需要说明的是,可以将训练视频中视频帧上全部区域像素点原有的颜色值作为该像素点对应的相机射线的真实对象颜色值(Ground-truth)。
作为一种实施方式,该基于预测颜色值和体积密度,确定视频帧的嘴部图像区域对应的嘴部强调损失的步骤可以包括:
(3.1)对视频帧进行图像语义分割,得到视频帧对应的嘴部图像区域。
(3.2)基于预测颜色值和体积密度,对在嘴部图像区域内的相机射线进行颜色积分,预测嘴部图像区域内每个相机射线对应的预测嘴部颜色值.
(3.3)基于嘴部图像区域内每个相机射线对应的预测嘴部颜色值和对应的真实嘴部颜色值,确定嘴部图像区域对应的嘴部强调损失。
在本申请实施例中,为了确定嘴部强调损失,可以对训练视频中的视频帧进行图像语义分割,得到视频帧对应的嘴部图像区域,并基于预测颜色值和体积密度,对在 视频帧的嘴部图像区域内的相机射线进行颜色积分,预测嘴部图像区域内每个相机射线对应的预测嘴部颜色值。
基于嘴部图像区域内每个相机射线对应的预测嘴部颜色值和对应的真实嘴部颜色值,确定嘴部图像区域对应的嘴部强调损失,在一种可能的实现方式中,可以基于均方误差构建嘴部强调损失:
其中,Rmouth是相机射线集合,该集合中包含了嘴部图像区域上的相机射线。需要说明的是,可以将训练视频中视频帧上嘴部区域像素点原有的颜色值作为该像素点对应的相机射线的真实嘴部颜色值(Ground-truth)。
(4)结合图像重建损失和嘴部强调损失构建总损失,并利用总损失对单个神经辐射场进行网络训练。
为了强调嘴部区域的训练,本申请将嘴部强调损失Lmouth乘以额外的权重系数与图像重建损失Lphotometic相加构成总损失来对单个神经辐射场进行网络训练。
作为一种实施方式,该结合图像重建损失和嘴部强调损失构建总损失,并利用总损失对单个神经辐射场进行网络训练的步骤可以包括:
(4.1)获取权重系数。
权重参数可以在网络训练实验过程中,根据训练经验选取最优值。该权重系数λ>0。
(4.2)基于图像重建损失、权重系数以及嘴部强调损失确定总损失。
将嘴部强调损失Lmouth乘以额外的权重系数λ与图像重建损失Lphotometic相加构成总损失:
L=Lphotometic+λLmouth
(4.3)根据总损失对单个神经辐射场进行迭代训练,直至单个神经辐射场满足预设条件。
在得到总损失后,可以根据总损失对单个神经辐射场进行迭代训练,直至单个神经辐射场满足预设条件,其中,预设条件可以为:总损失函数L的总损失值小于预设值、总损失函数L的总损失值不再变化、或者训练次数达到预设次数等。可选的,可以采用优化器去优化总损失函数L,基于实验经验设置学习率(Learning Rate)、训练时的批量大小(Batch Size)以及训练的时期(Epoch)。
当对单个神经辐射场的网络训练满足预设条件时,可以将该满足预设条件的单个神经辐射场作为视频生成模型。该视频生成模型可以用于对目标用户的待重构视频进行对象重构,最终得到重构视频。
作为一种实施方式,可以获取目标用户的待重构视频,进而根据视频生成模型对待重构视频进行对象重构,得到目标用户对应的重构视频。其中,待重构视频至少包括视频会议中的会议视频,直播过程中的实况视频,以及预先录制的视频等,在此不做限定。
在一种可能的实现方式中,根据视频生成模型对待重构视频进行对象重构,得到目标用户对应的重构视频的方式可以是从待重构视频中获取预设帧数的待重构视频帧,其中,预设帧数可以由当前进行对象重构的计算机设备的计算性能决定。
然后将每个待重构视频帧输入到视频生成模型中,从视频生成模型对应预测每个待重构视频帧的重构视频帧,由于视频生成模型在重构视频帧时引入了头部姿态信息和头部位置信息,从而能够估算出合适的肩膀形状来适应头部状态和位置的变化,进而使得生成的人物形象的肩部与头部在整体视频帧上显示的更加自然、稳定和协调,并基于计算得到的所有重构视频帧,合成目标用户对应的重构视频。
本申请实施例中,从目标用户的训练视频中提取语音特征、表情参数和头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息,将语音特征、表情参数和头部参数进行合并,得到训练视频的条件输入。进而基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型。如此,通过在条件输入中引入头部姿态信息与头部位置信息,视频生成模型可以在考虑头部运动的情况下赋予重构人像面部表情,使得重构人像具有高分辨率,从而提高重构图像的清晰度,并且根据头部姿态信息与头部位置信息可以隐式估算出肩膀的运动状态,从而使得生成的重构人像在保持头运动与肩部运动之间的协调性外,还能保证重构人像具有头部和肩部的完整性。
此外,该视频生成模型可以为基于图像重建损失和嘴部强调损失训练得到,其中,图像重建损失由单个神经辐射场根据条件输入生成的预测对象颜色值和真实对象颜色值确定,嘴部强调损失由单个神经辐射场根据条件输入生成的预测嘴部颜色值和真实嘴部颜色值确定。
由于颜色值与空间采样点的位置以及视角方向有关,图像重建损失可以引导单个神经辐射场能够预测不同视角下空间采样点处的不同光照效果,最后通过颜色积分可以使得相机射线对应的像素点的色彩更加丰富,进而增强了重构视频的显示效果。当根据视频生成模型对目标用户的待重构视频进行对象重构时,得到的重构视频可以与待重构视频的嘴部运动具有同步性,并且使得嘴部形状的变化与语音能够准确匹配,加上重构人像可以保持头运动与肩部运动之间的协调性,进而大大提升重构视频显示的真实性。
结合上述实施例所描述的方法,以下将举例作进一步详细说明。
下面将以视频生成模型的训练装置具体集成在计算机设备中为例进行说明,并将针对图5所示的流程结合图6所示的应用场景进行详细地阐述,该计算机设备可以为服务器或者终端设备等。请参阅图5,图5示出了本申请实施例提供的另一种视频生成模型的训练方法,在具体的实施例中,该视频生成模型的训练方法可以运用到如图6所示的视频会议场景中。
视频会议服务供应商提供服务端,该服务端包括云训练服务器410以及云执行服务器430。云训练服务器410用于训练出进行对象重构的视频生成模型,云执行服务器430用于部署进行对象重构的视频生成模型、进行视频会议相关功能的计算机程序,并 对客户端发送的生成的重构视频。其中,客户端可以包括接收方使用视频会议服务时,在智能电视420上打开的视频会议软件421,以及发送方使用视频会议服务时,笔记本电脑440上打开的视频会议软件441。
在上述视频会议场景中,发送方与接收方通过各自的视频会议软件,也即客户端进行视频会议,发送方由于个人原因可以使用视频会议软件441上的对象重构功能,对自己的真实人像进行重构,从而,在接受方的视频会议软件421上示出重构的理想人像。其中,人像的重构是服务端的云执行服务器430利用视频生成模型完成的。
需要说明的是,图6仅是本申请实施例提供的一种应用场景,本申请实施例描述的应用场景是为了更加清楚的说明本申请实施例的技术方案,并不构成对于本申请实施例提供的技术方案的限定。例如,在其它情况下,图6中真实人像的重构也可以是在视频会议软件441上直接完成,云执行服务器430可以将视频会议软件441生成的重构的人像视频传至视频会议软件421。本领域普通技术人员可知,随着系统架构的演变和新的应用场景(如,视频聊天和实况直播等)的出现,本申请实施例提供的技术方案对于解决类似的技术问题,同样适用。视频生成模型的训练方法具体可以包括以下步骤:
S210:计算机设备获取预设时长的初始视频。
其中,初始视频记录有目标用户说话的音频内容。考虑到有关技术在网络学习过程不能知晓场景的3D几何形状,需要额外参考图像来提供身份信息进行网络学习。本申请提出获取特定人物的一段视频,也即预设时长的初始视频作为训练数据,即可用于进行视频重构的网络学习,避免使用过多的训练数据,从而提高网络训练的效率。
示例性地,发送方可以利用预先录制一段预设时长为五分钟的说话视频作为初始视频,并将该初始视频通过视频会议软件441发送到云训练服务器410进行预处理。可选地,视频会议软件441也可以直接对初始视频进行预处理得到训练视频,再将训练视频发送至云训练服务器410。
S220:计算机设备根据预设分辨率和预设采样率对所述初始视频进行预处理,得到训练视频。
为了让生成的重构视频中人物区域能够占据画面的中心,提高观众观看视频的舒适度,本申请在网络训练阶段,通过预处理可以将初始视频中目标用户的人像确定在训练视频的视频帧的中心区域,从而训练后得到的视频生成模型生成的重构视频中,人物区域能够占据视频画面的中心。
其中,预设分辨率和预设采样率可以根据实际应用场景中,对视频画面中人物内容的显示需求进行设定。示例性地,云训练服务器410在接收到视频会议软件441发送来的初始视频后,可以基于25fps的采样频率对初始视频进行采样,并基于450×450像素的分辨率对初始视频采样出的视频帧进行裁剪,得到训练视频,使得目标用户的人像占据视频帧的中心区域。
S230:计算机设备提取目标用户的训练视频对应的条件输入。
本申请将用户的头部姿态信息与头部位置信息引入条件输入,使得神经辐射场基于头部姿态信息与头部位置信息可以隐式估算出肩膀的运动状态,从而使得生成的重构人像可以保持头运动与肩部运动之间的协调性。
在本申请实施例中,提取训练视频对应的条件输入的方式是获取目标用户的训练视频;从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息;将目标用户的语音特征、目标用户的表情参数和目标用户的头部参数进行合并,得到训练视频的条件输入。
在一些实施例中,该计算机设备从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数的步骤可以包括:
(1)计算机设备对目标用户的训练视频进行语音特征提取,得到语音特征。
示例性地,云训练服务器410在获取训练视频时,可以利用DeepSpeech模型来学习训练视频中的语音到文本的映射,也即提取目标用户说话声音内容的语音特征。具体地,云训练服务器410可以对训练视频关联的音频数据进行采样,得到采样数组,并对采样数组进行快速傅里叶变换,在此基础上进行两层卷积计算,得到卷积后的数据。
云训练服务器410对卷积后的数据进行Shape操作,并对该操作后的数据进行切片操作得到预设数量的数据片,并将每个数据片分别输入每个RNN层,对应从每个RNN层得到输出数据,并对输出数据进行合并操作得到音频数据对应的语音特征a。
(2)计算机设备对目标用户的训练视频进行三维人脸重构,得到目标用户的三维脸型的脸型表示,并基于脸型表示确定目标用户的表情参数。
示例性地,云训练服务器410可以利用三维可变性人脸模型从每个视频帧中获取表情参数,该三维可变性人脸模型可以对单张视频帧中的二维人脸进行三维重建,得到相应的三维脸型的脸型表示
其中,表示为在选定的人脸数据集上计算的平均值。Es和Ee分别表示形状空间和表情空间的正交基向量的矩阵。s和e分别表示形状系数和表情系数。进一步地,可以将表情系数e作为重构的三维脸型的表情参数。
(3)计算机设备对目标用户的三维脸型进行变换映射,得到三维脸型对应的旋转矩阵和平移向量。
示例性地,云训练服务器410可以对目标用户的三维脸型进行变换映射,得到三维脸型对应的旋转矩阵和平移向量。可选地,变换映射可以使用弱透视投影模型,该模型对三维脸型网格的顶点在二维平面的函数输出可以表示为g=f+Pr+R+t。其中,f表示比例因子,Pr表示正交投影矩阵,R表示旋转矩阵以及t表示平移向量。
(4)计算机设备基于旋转矩阵确定头部姿态信息以及基于平移向量确定头部位置信息,并根据头部姿态信息和头部位置信息得到目标用户的头部参数。
示例性地,云训练服务器410可以将旋转矩阵转换为欧拉角,欧拉角由3个元素组成,表示方向信息,也即头部姿态信息。并将平移向量表示为头部位置信息。进一 步地,对头部姿态信息和头部位置信息进行位置编码,分别得到两个编码后的高维向量,并将两个高维向量连接成一个向量表示P。
S240:计算机设备基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型。
本申请实施例中提供的视频生成模型的训练方法包括对预设的单个神经辐射场的训练,值得说明的是,对预设的单个神经辐射场的训练可以是根据获取的训练样本数据集合预先进行的,后续在每次需要执行对象重构时,可以利用训练得到的视频生成模型直接计算,而无需每次执行对象重构时,再次进行网络训练。
在一些实施例中,该计算机设备基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练的步骤可以包括:
(1)计算机设备对语音特征和表情参数分别进行时间平滑处理,得到对应的平滑语音特征和平滑表情参数。
示例性地,云训练服务器410可以分别使用两个时间平滑网络过滤语音特征a和表情参数e。例如,对表情参数e进行时间平滑处理:在时间维度上,基于时间步长t-T/2到t+T/2上每个视频帧的表情参数e的线性组合来计算出t时刻视频帧的平滑表情参数,以表情参数e作为时间平滑网络的输入,可以计算出线性组合的权重。该时间平滑网络由五个一维卷积组成,后跟一个带有softmax激活的线性层。
示例性地,云训练服务器410可以在时间维度上,基于时间步长t-T/2到t+T/2上每个视频帧的语音特征a的线性组合来计算出t时刻视频帧的平滑语音特征,以语音特征a作为时间平滑网络的输入,可以计算出线性组合的权重。
(2)计算机设备获取相机射线上空间采样点的三维坐标和视角方向。
示例性地,云训练服务器410可以基于相机的内外参数将像素坐标转换为了统一的世界坐标下的光线上的空间采样点的三维坐标。云训练服务器410可以根据预先设定的相机拍摄场景的拍摄角度来确定视角方向,也可以预先基于对预先获取的参考视频中角色的观察角度来设定视角方向。
(3)计算机设备将三维坐标、视角方向、平滑语音特征、平滑表情参数以及头部参数输入至预设的单个神经辐射场,计算得到空间采样点对应的预测颜色值和体积密度。
示例性地,云训练服务器410可以基于隐函数Fθ,将空间采样点的三维坐标x、视角方向d以及平滑语音特征a,平滑表情参数e和头部参数p作为函数输入,从而隐函数Fθ计算出每个空间采样点的预测颜色值c和体积密度σ。其中,隐函数Fθ表示为:Fθ:(x,d,a,e,p)→(c,σ)。
(4)针对训练视频帧的视频帧,计算机设备基于预测颜色值和体积密度,确定视频帧的全部图像区域对应的图像重建损失,以及基于预测颜色值和体积密度,确定视频帧的嘴部图像区域对应的嘴部强调损失。
作为一种实施方式,该基于预测颜色值和体积密度,确定训练视频的视频帧的全部图像区域对应的图像重建损失的步骤可以包括:
(4.1)计算机设备基于预测颜色值和体积密度,对在视频帧的全部图像区域内的相机射线进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值。
示例性地,云训练服务器410可以获取全部图像区域内每个相机射线上空间采样点对应的累计透明度,其中,累计透明度表示可以被理解为相机射线在第一积分区间上没有击中任何粒子的概率,累计透明度可通过在第一积分区间上基于相机射线的体积密度进行积分生成,第一积分区间为相机射线从近端边界到空间采样点处的采样距离。
云训练服务器410可以基于累计透明度、预测颜色值和体积密度的乘积,确定被积函数,并在第二积分区间上对被积函数进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值第二积分区间为相机射线从近端边界到远端边界的采样距离。
云训练服务器410可以基于全部图像区域内每个相机射线对应的预测对象颜色值和对应的真实对象颜色值,确定全部图像区域对应的图像重建损失。可选地,可以基于均方误差构建图像重建损失。将训练视频中视频帧上全部区域像素点原有的颜色值作为该像素点对应的相机射线的真实对象颜色值。
(4.2)计算机设备基于全部图像区域内每个相机射线对应的预测对象颜色值和对应的真实对象颜色值,确定全部图像区域对应的图像重建损失。
作为一种实施方式,该基于预测颜色值和体积密度,确定视频帧的嘴部图像区域对应的嘴部强调损失的步骤可以包括:
(4.1)对视频帧进行图像语义分割,得到视频帧对应的嘴部图像区域。
(4.2)基于预测颜色值和体积密度,对在嘴部图像区域内的相机射线进行颜色积分,预测嘴部图像区域内每个相机射线对应的预测嘴部颜色值。
(4.3)基于嘴部图像区域内每个相机射线对应的预测嘴部颜色值和对应的真实嘴部颜色值,确定嘴部图像区域对应的嘴部强调损失。
示例性地,云训练服务器410可以对训练视频中的视频帧进行图像语义分割,得到视频帧对应的嘴部图像区域,并基于预测颜色值和体积密度,对在视频帧的嘴部图像区域内的相机射线进行颜色积分,预测嘴部图像区域内每个相机射线对应的预测嘴部颜色值。
云训练服务器410可以基于嘴部图像区域内每个相机射线对应的预测嘴部颜色值和对应的真实嘴部颜色值,确定嘴部图像区域对应的嘴部强调损失。并将训练视频中视频帧上嘴部区域像素点原有的颜色值作为该像素点对应的相机射线的真实嘴部颜色值。
(5)计算机设备结合图像重建损失和嘴部强调损失构建总损失,并利用总损失对单个神经辐射场进行网络训练。
为了强调嘴部区域的训练,本申请将嘴部强调损失乘以额外的权重系数与图像重建损失相加构成总损失来对单个神经辐射场进行网络训练。
示例性地,云训练服务器410可以获取权重系数,并基于图像重建损失、权重系数以及嘴部强调损失确定总损失。进而根据总损失对单个神经辐射场进行迭代训练,直至单个神经辐射场满足预设条件。
在一种可能的实现方式中,为了定量分析本申请中的视频生成模型的性能。可以在两个测试集上将该视频生成模型的训练方法与有关技术(Baselines)进行比较。其中,测试集A和测试集B都是说话人像视频。有关技术包括MakeItTalk、AD-NeRF、Wav2Lip以及NerFACE。评估指标(Metrics)包括PSNR和SSIM用于评估重构视频帧的质量(如,面部表情);LPIPS用于测量真实感的质量;LMD用于评估嘴型的准确性;Sync用于评估嘴唇与音频同步性。
对测试集A和测试集B在全部图像区域上计算PSNR、SSIM和LPIPS的评估指标,在嘴部图像区域上计算LMD和Sync的评估指标,计算结果如下表1所示:
表1
根据表1可以看出,在两个测试集上,本申请提出的方法在评估指标PSNR、SSIM、LPIPS和LMD上获得了最好的性能表现。同时,在音频-嘴形同步性和准确性上也具有优越性。例如,可以观察到本申请的方法创建的重构视频帧中的人物人像具有更准确的面部表情,更高的嘴型同步精度和更自然的头部-躯干协调。
AD-NeRF的生成能力依赖于使用两个独立的神经辐射场进行建模头部和躯干,这不可避免地会导致人像颈部存在分离和晃动的问题。不同的是,本申请基于单个神经辐射场引入详细的头部姿态信息和头部位置信息作为条件输入,能够生成更准确的视觉细节,如面部表情比AD-NeRF更好。
在一种可能的实现方式中,为了定性分析本申请中的视频生成模型的性能。可以在两个测试集上将该视频生成模型的训练方法与有关技术进行直观地比较,也即,将各个方法生成的重构视频帧放在一起进行比较。其中,有关技术包括MakeItTalk、AD-NeRF、Wav2Lip、ATVG、PC-AVS以及NerFACE。如图7所示的一种性能对比的示意图,需要说明的是,该示意图是经过处理后的举例示图。
从图7可以观察出,与基于生成对抗网络的方法(ATVG、Wav2lip、MakeItTalk、PC-AVS)相比,本申请能够生成更加清晰和完整的说话人人像,并且具有更逼真的图像质量,表情也恢复的更加准确。观察现有的基于NeRF的方法(AD-NeRF、NerFACE)的生成结果,AD-NeRF存在头肩分离问题,NerFACE则存在头肩刚性建模带来的头肩不协调问题,所以肩膀会随着头部姿态的变化而发生过度旋转。与AD-NeRF和NerFACE相比,本申请生成的重构视频帧的人物人像完整而协调,真实感强。
S250:计算机设备根据视频生成模型对目标用户的待重构视频进行对象重构,得到目标用户对应的重构视频。
当云训练服务器410对单个神经辐射场的网络训练满足预设条件时,可以将该满足预设条件的单个神经辐射场作为视频生成模型部署在云执行服务器430上。进而云执行服务器430可以基于该视频生成模型对目标用户的待重构视频进行对象重构,最终得到重构视频。
示例性地,云执行服务器430可以获取发送方通过笔记本电脑440上的视频会议软件441发送的待重构的会议视频,也即待重构视频,进而,从会议视频中获取预设帧数的待重构视频帧,其中,预设帧数可以由当前进行对象重构的计算机设备的计算性能决定。例如,云执行服务器430可以通过查询内存利用率以及GPU运算性能来进行计算性能的评估。可选地,云执行服务器430可以对自身的计算性能进行不同等级的划分,并为不同等级的计算性能匹配对应的预设帧数。
云执行服务器430可以将每个待重构视频帧输入到视频生成模型中,从视频生成模型对应预测每个待重构视频帧的重构视频帧,并基于计算得到的所有重构视频帧的帧序列合成发送方对应的重构视频。进而将该重构视频发送至接收方的智能电视420上,并通过视频会议软件421可以对该重构视频进行显示。
请参阅图8,图8示出了一种视频生成模型的训练方法的实现效果图,本申请基于单个神经辐射场的隐式表征能力,大大提升了说话人像视频的真实度,该视频生成模型的训练方法可以应用于视频会议、视频聊天、实况直播以及数字人等需要进行说话肖像视频重构的应用场景中。通过将表情参数和语音特征作为单个神经辐射场的驱动源进行训练,可以获取图8中(1)与待重构视频准确匹配的头部姿态和面部表情,以及图8中(2)与待重构视频的语音同步的嘴型,而具有不同的良好外观。本申请并将每个视频帧中的头部姿态信息和头部位置信息加入到单个神经辐射场的条件输入中,从而指导肩膀区域的生成,并适应头部的位置,最终能够生成图8中(3)自然、稳定和协调的肩部,避免了由于头肩刚性建模所带来的头肩不协调问题。
本申请实施例中,可以获取预设时长的初始视频,并根据预设分辨率和预设采样率对所述初始视频进行预处理,得到训练视频。从而获取预设时长的初始视频作为训练数据,即可用于进行视频重构的网络学习,避免使用过多的训练数据,大大提高网络训练的效率。
本申请实施例中,提取目标用户的训练视频对应的条件输入,该条件输入包括语音特征、表情参数以及头部参数,该头部参数用于表征头部姿态信息与头部位置信息,并基于语音特征、表情参数以及头部参数对预设的单个神经辐射场进行网络训练,得到视频生成模型,通过在条件输入中引入头部姿态信息与头部位置信息,视频生成模型在考虑头部运动的情况下可以赋予重构人像面部表情,使得重构人像具有高分辨率,并且根据头部姿态信息与头部位置信息可以隐式估算出肩膀的运动状态,使得生成的重构人像在保持头运动与肩部运动之间的协调性外,还能保证重构人像具有头部和肩部的完整性。
此外,该视频生成模型为基于图像重建损失和嘴部强调损失训练得到,该图像重建损失由单个神经辐射场根据条件输入生成的预测对象颜色值和真实对象颜色值确定,改嘴部强调损失由单个神经辐射场根据条件输入生成的预测嘴部颜色值和真实嘴部颜色值确定,如此,当根据视频生成模型对目标用户的待重构视频进行对象重构时,得到的重构视频可以与待重构视频的嘴部运动具有同步性,从而提升重构视频显示的真实性。
请参阅图9,其示出了本申请实施例提供的一种视频生成模型的训练装置500的结构框图。该视频生成模型的训练装置500包括:条件获取模块510,用于获取目标用户的训练视频;从训练视频中提取目标用户的语音特征、目标用户的表情参数和目标用户的头部参数,头部参数用于表征目标用户的头部姿态信息与头部位置信息;将目标用户的语音特征、目标用户的表情参数和目标用户的头部参数进行合并,得到训练视频的条件输入;网络训练模块520,用于基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型;其中,视频生成模型为基于总损失训练得到,总损失包括图像重建损失,图像重建损失是由预测对象颜色值和真实对象颜色值确定的,预测对象颜色值是单个神经辐射场根据条件输入、三维坐标和视角方向生成的;视频生成模型用于对目标用户的待重构视频进行对象重构,得到目标用户对应的重构视频。
在一些实施例中,条件获取模块510可以具体用于:对目标用户的训练视频进行语音特征提取,得到目标用户的语音特征;对目标用户的训练视频进行三维人脸重构,得到目标用户的三维脸型的脸型表示,并基于脸型表示确定目标用户的表情参数;对目标用户的三维脸型进行变换映射,得到三维脸型对应的旋转矩阵和平移向量;基于旋转矩阵确定头部姿态信息以及基于平移向量确定头部位置信息,并根据头部姿态信息和头部位置信息得到目标用户的头部参数。
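作为一个示意性草图（假设语音特征、表情参数、旋转矩阵和平移向量均已按上述方式提取得到，数组形状与函数名仅为示例），条件输入的合并方式大致如下：
    import numpy as np

    def build_condition_input(audio_feat, expr_params, rotation_mat, translation_vec):
        # audio_feat: [Da] 语音特征;  expr_params: [De] 表情参数
        # rotation_mat: [3, 3] 旋转矩阵, 用于表征头部姿态信息
        # translation_vec: [3] 平移向量, 用于表征头部位置信息
        head_params = np.concatenate([rotation_mat.reshape(-1), translation_vec])
        return np.concatenate([audio_feat, expr_params, head_params])  # 合并得到条件输入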
在一些实施例中,总损失包括嘴部强调损失,嘴部强调损失是由预测嘴部颜色值和真实嘴部颜色值确定的,预测嘴部颜色值是单个神经辐射场根据条件输入、三维坐标和视角方向生成的。
在一些实施例中,视频生成模型的训练装置500还包括采样获取单元:
采样获取单元,用于获取相机射线上空间采样点的三维坐标和视角方向,相机射线为相机在对场景进行成像时发出的光线,且相机射线对应视频帧上的像素点;
网络训练模块520可以包括：平滑处理单元，用于对语音特征和表情参数分别进行时间平滑处理，得到平滑语音特征和平滑表情参数；采样计算单元，用于将三维坐标、视角方向、平滑语音特征、平滑表情参数以及头部参数输入至预设的单个神经辐射场，计算得到空间采样点对应的预测颜色值和体积密度；损失确定单元，用于针对训练视频的视频帧，基于预测颜色值和体积密度，确定该视频帧的全部图像区域对应的图像重建损失，以及基于预测颜色值和体积密度，确定该视频帧的嘴部图像区域对应的嘴部强调损失；网络训练单元，用于结合图像重建损失和嘴部强调损失构建总损失，并利用总损失对单个神经辐射场进行网络训练。
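其中的时间平滑处理可以用相邻帧上的滑动窗口平均来示意（窗口大小、边界填充方式均为假设，并非本申请限定的实现）：
    import numpy as np

    def temporal_smooth(features, window=5):
        # features: [T, D], 逐帧的语音特征或表情参数; 返回同形状的平滑结果
        half = window // 2
        padded = np.pad(features, ((half, half), (0, 0)), mode="edge")  # 边界帧复制填充
        return np.stack([padded[t:t + window].mean(axis=0)
                         for t in range(features.shape[0])])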
在一些实施例中,损失确定单元可以包括:预测子单元,用于基于预测颜色值和体积密度,对在全部图像区域内的相机射线进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值;重建损失子单元,用于基于全部图像区域内每个相机射线对应的预测对象颜色值和对应的真实对象颜色值,确定全部图像区域对应的图像重建损失。
在一些实施例中,预测子单元可以具体用于:获取全部图像区域内每个相机射线上空间采样点对应的累计透明度,累计透明度为在第一积分区间上基于相机射线的体积密度进行积分生成的;基于累计透明度、预测颜色值和体积密度的乘积,确定被积函数;在第二积分区间上对被积函数进行颜色积分,预测全部图像区域内每个相机射线对应的预测对象颜色值;其中,第一积分区间为相机射线从近端边界到空间采样点的采样距离,第二积分区间为相机射线从近端边界到远端边界的采样距离。
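上述颜色积分与神经辐射场中常用的体渲染公式一致。作为一个示意性的数学表述（记号为通用写法，并非本申请限定的符号），每条相机射线 r 的预测对象颜色值可写为：
    C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \quad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)
其中 r(t) 为相机射线上的空间采样点，d 为视角方向，σ 为体积密度，c 为预测颜色值；T(t) 即在第一积分区间（从近端边界 t_n 到空间采样点）上基于体积密度积分得到的累计透明度，外层积分即在第二积分区间（从近端边界 t_n 到远端边界 t_f）上对由累计透明度、预测颜色值和体积密度的乘积构成的被积函数进行的颜色积分。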
在一些实施例中,损失确定单元还可以具体用于:对视频帧进行图像语义分割,得到视频帧对应的嘴部图像区域;基于预测颜色值和体积密度,对在视频帧的嘴部图像区域内的相机射线进行颜色积分,预测嘴部图像区域内每个相机射线对应的预测嘴部颜色值;基于嘴部图像区域内每个相机射线对应的预测嘴部颜色值和对应的真实嘴部颜色值,确定嘴部图像区域对应的嘴部强调损失。
在一些实施例中,网络训练单元可以具体用于:获取权重系数;基于图像重建损失、权重系数以及嘴部强调损失确定总损失;根据总损失对单个神经辐射场进行迭代训练,直至单个神经辐射场满足预设条件。
在一些实施例中,视频生成模型的训练装置500还可以包括:初始获取模块,用于获取预设时长的初始视频,初始视频记录有目标用户说话的音频内容;预处理模块,用于根据预设分辨率和预设采样率对初始视频进行预处理,得到训练视频,预处理用于将目标用户的对象内容确定在训练视频的视频帧的中心区域。
在一些实施例中,视频生成模型的训练装置500还可以包括对象重构模块530:
对象重构模块530，用于获取目标用户的待重构视频；根据视频生成模型对目标用户的待重构视频进行对象重构，得到目标用户对应的重构视频。
在一些实施例中,待重构视频包括会议视频,对象重构模块530可以具体用于:
从待重构视频中获取预设帧数的待重构视频帧；将每个待重构视频帧输入至视频生成模型，计算出每个待重构视频帧对应的重构视频帧；基于所有重构视频帧，合成目标用户对应的重构视频。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述装置和模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中，模块相互之间的耦合可以是电性、机械或其它形式的耦合。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
本申请提供的方案，从目标用户的训练视频中提取语音特征、表情参数和头部参数，头部参数用于表征目标用户的头部姿态信息与头部位置信息，将语音特征、表情参数和头部参数进行合并，得到训练视频的条件输入，进而基于条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练，得到视频生成模型。由此，通过在条件输入中引入头部姿态信息与头部位置信息，视频生成模型可以在考虑头部运动的情况下赋予重构人像面部表情，使得重构人像具有高分辨率；并且，根据头部姿态信息与头部位置信息可以隐式估算出肩膀的运动状态，从而使得生成的重构人像在保持头部运动与肩部运动之间的协调性之外，还能保证重构人像具有头部和肩部的完整性。
其次,该视频生成模型为基于图像重建损失和嘴部强调损失训练得到,其中,图像重建损失由单个神经辐射场根据条件输入生成的预测对象颜色值和真实对象颜色值确定,嘴部强调损失由单个神经辐射场根据条件输入生成的预测嘴部颜色值和真实嘴部颜色值确定,如此,当根据视频生成模型对目标用户的待重构视频进行对象重构时,得到的重构视频可以与待重构视频的嘴部运动具有同步性,进而提高重构视频显示的真实性。
如图10所示，本申请实施例还提供一种计算机设备600，该计算机设备600包括处理器610、存储器620、电源630和输入单元640，存储器620存储有计算机程序，计算机程序被处理器610调用时，可执行上述实施例提供的各种方法步骤。本领域技术人员可以理解，图中示出的计算机设备的结构并不构成对计算机设备的限定，可以包括比图示更多或更少的部件，或者组合某些部件，或者采用不同的部件布置。其中：
处理器610可以包括一个或多个处理核。处理器610利用各种接口和线路连接整个计算机设备内的各个部分，通过运行或执行存储在存储器620内的指令、程序、指令集或程序集，以及调用存储在存储器620内的数据，执行计算机设备的各种功能和处理数据，从而对计算机设备进行整体控制。可选地，处理器610可以采用数字信号处理（Digital Signal Processing，DSP）、现场可编程门阵列（Field-Programmable Gate Array，FPGA）、可编程逻辑阵列（Programmable Logic Array，PLA）中的至少一种硬件形式来实现。处理器610可集成中央处理器（Central Processing Unit，CPU）、图像处理器（Graphics Processing Unit，GPU）和调制解调器等中的一种或几种的组合。其中，CPU主要处理操作系统、用户界面和应用程序等；GPU用于负责显示内容的渲染和绘制；调制解调器用于处理无线通信。可以理解的是，上述调制解调器也可以不集成到处理器610中，而单独通过一块通信芯片进行实现。
尽管未示出，计算机设备600还可以包括显示单元等，在此不再赘述。具体在本实施例中，计算机设备中的处理器610会按照如下的指令，将一个或一个以上的计算机程序的进程对应的可执行文件加载到存储器620中，并由处理器610来运行存储在存储器620中的计算机程序，从而实现前述实施例提供的各种方法步骤。
如图11所示,本申请实施例还提供一种计算机可读存储介质700,该计算机可读存储介质700中存储有计算机程序710,计算机程序710可被处理器调用于执行本申请实施例提供的各种方法步骤。
计算机可读存储介质可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。可选地,计算机可读存储介质包括非易失性计算机可读存储介质(Non-Transitory Computer-Readable Storage Medium)。计算机可读存储介质700具有执行上述实施例中任何方法步骤的计算机程序的存储空间。这些计算机程序可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。计算机程序能够以适当形式进行压缩。
根据本申请的一个方面,提供了一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序被存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机程序,处理器执行该计算机程序,使得该计算机设备执行上述实施例提供的各种方法步骤。
以上仅是本申请的较佳实施例而已，并非对本申请作任何形式上的限制。虽然本申请已以较佳实施例揭示如上，然而并非用以限定本申请，任何本领域技术人员，在不脱离本申请技术方案范围的情况下，可以利用上述揭示的技术内容做出些许更动，或修饰为等同变化的等效实施例；凡是未脱离本申请技术方案内容，依据本申请的技术实质对以上实施例所作的任何简单修改、等同变化与修饰，均仍属于本申请技术方案的范围内。

Claims (15)

  1. 一种视频生成模型的训练方法,所述方法由计算机设备执行,所述方法包括:
    获取目标用户的训练视频;
    从所述训练视频中提取所述目标用户的语音特征、所述目标用户的表情参数和所述目标用户的头部参数,所述头部参数用于表征所述目标用户的头部姿态信息与头部位置信息;
    将所述目标用户的语音特征、所述目标用户的表情参数和所述目标用户的头部参数进行合并,得到所述训练视频的条件输入;
    基于所述条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型;
    其中,所述视频生成模型为基于总损失训练得到,所述总损失包括图像重建损失,所述图像重建损失是由预测对象颜色值和真实对象颜色值确定的,所述预测对象颜色值是单个神经辐射场根据所述条件输入、所述三维坐标和所述视角方向生成的;所述视频生成模型用于对所述目标用户的待重构视频进行对象重构,得到所述目标用户对应的重构视频。
  2. 根据权利要求1所述的方法,所述从所述训练视频中提取所述目标用户的语音特征、所述目标用户的表情参数和所述目标用户的头部参数,包括:
    对所述目标用户的训练视频进行语音特征提取,得到所述目标用户的语音特征;
    对所述目标用户的训练视频进行三维人脸重构,得到所述目标用户的三维脸型的脸型表示,并基于所述脸型表示确定所述目标用户的表情参数;
    对所述目标用户的三维脸型进行变换映射,得到所述三维脸型对应的旋转矩阵和平移向量;
    基于所述旋转矩阵确定所述头部姿态信息以及基于所述平移向量确定所述头部位置信息,并根据所述头部姿态信息和所述头部位置信息得到所述目标用户的头部参数。
  3. 根据权利要求1或2所述的方法,所述总损失包括嘴部强调损失,所述嘴部强调损失是由预测嘴部颜色值和真实嘴部颜色值确定的,所述预测嘴部颜色值是单个神经辐射场根据所述条件输入、所述三维坐标和所述视角方向生成的。
  4. 根据权利要求3所述的方法,在所述基于所述条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练之前,所述方法还包括:
    获取相机射线上空间采样点的三维坐标和视角方向,所述相机射线为相机在对场景进行成像时发出的光线,且所述相机射线对应所述训练视频的视频帧上的像素点;
    所述基于所述条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,包括:
    对所述语音特征和所述表情参数分别进行时间平滑处理,得到平滑语音特征和平滑表情参数;
    将所述三维坐标、所述视角方向、所述平滑语音特征、所述平滑表情参数以及所述头部参数输入至所述单个神经辐射场,计算得到所述空间采样点对应的预测颜色值和体积密度;
    针对所述训练视频的视频帧，基于所述预测颜色值和所述体积密度，确定所述视频帧的全部图像区域对应的图像重建损失，以及基于所述预测颜色值和所述体积密度，确定所述视频帧的嘴部图像区域对应的嘴部强调损失；
    结合所述图像重建损失和所述嘴部强调损失构建所述总损失,并利用所述总损失对所述单个神经辐射场进行网络训练。
  5. 根据权利要求4所述的方法,所述基于所述预测颜色值和所述体积密度,确定所述视频帧的全部图像区域对应的图像重建损失,包括:
    基于所述预测颜色值和所述体积密度,对在所述全部图像区域内的相机射线进行颜色积分,预测所述全部图像区域内每个相机射线对应的预测对象颜色值;
    基于所述全部图像区域内每个相机射线对应的预测对象颜色值和对应的真实对象颜色值,确定所述全部图像区域对应的图像重建损失。
  6. 根据权利要求5所述的方法,所述基于所述预测颜色值和所述体积密度,对在所述视频帧的全部图像区域内的相机射线进行颜色积分,预测所述全部图像区域内每个相机射线对应的预测对象颜色值,包括:
    获取所述全部图像区域内每个相机射线上空间采样点对应的累计透明度,所述累计透明度为在第一积分区间上基于相机射线的体积密度进行积分生成的;
    基于所述累计透明度、所述预测颜色值和所述体积密度的乘积,确定被积函数;
    在第二积分区间上对所述被积函数进行颜色积分,预测所述全部图像区域内每个相机射线对应的预测对象颜色值;
    其中,所述第一积分区间为相机射线从近端边界到空间采样点的采样距离,所述第二积分区间为相机射线从近端边界到远端边界的采样距离。
  7. 根据权利要求4至6任一项所述的方法,所述基于所述预测颜色值和所述体积密度,确定所述视频帧的嘴部图像区域对应的嘴部强调损失,包括:
    对所述视频帧进行图像语义分割,得到所述视频帧对应的嘴部图像区域;
    基于所述预测颜色值和所述体积密度,对在所述嘴部图像区域内的相机射线进行颜色积分,预测所述嘴部图像区域内每个相机射线对应的预测嘴部颜色值;
    基于所述嘴部图像区域内每个相机射线对应的预测嘴部颜色值和对应的真实嘴部颜色值,确定所述嘴部图像区域对应的嘴部强调损失。
  8. 根据权利要求4所述的方法,所述结合所述图像重建损失和所述嘴部强调损失构建所述总损失,并利用所述总损失对所述单个神经辐射场进行网络训练,包括:
    获取权重系数;
    基于所述图像重建损失、所述权重系数以及所述嘴部强调损失确定所述总损失;
    根据所述总损失对所述单个神经辐射场进行迭代训练,直至所述单个神经辐射场满足预设条件。
  9. 根据权利要求1所述的方法,所述获取目标用户的训练视频,包括:
    获取预设时长的初始视频,所述初始视频记录有所述目标用户说话的音频内容;
    根据预设分辨率和预设采样率对所述初始视频进行预处理,得到所述训练视频,所述预处理用于将所述初始视频中所述目标用户的人像确定在训练视频的视频帧的中心区域。
  10. 根据权利要求1所述的方法,所述方法还包括:
    获取所述目标用户的待重构视频；
    根据所述视频生成模型对所述待重构视频进行对象重构,得到所述目标用户对应的重构视频。
  11. 根据权利要求10所述的方法,所述根据所述视频生成模型对所述待重构视频进行对象重构,得到所述目标用户对应的重构视频,包括:
    从所述待重构视频中获取预设帧数的待重构视频帧;
    将每个所述待重构视频帧输入至所述视频生成模型,计算出每个所述待重构视频帧对应的重构视频帧;
    基于所有重构视频帧，合成所述目标用户对应的重构视频。
  12. 一种视频生成模型的训练装置,所述装置部署在计算机设备上,所述装置包括:
    条件获取模块,用于获取目标用户的训练视频;从所述训练视频中提取所述目标用户的语音特征、所述目标用户的表情参数和所述目标用户的头部参数,所述头部参数用于表征所述目标用户的头部姿态信息与头部位置信息;将所述目标用户的语音特征、所述目标用户的表情参数和所述目标用户的头部参数进行合并,得到所述训练视频的条件输入;
    网络训练模块,用于基于所述条件输入、三维坐标以及视角方向对预设的单个神经辐射场进行网络训练,得到视频生成模型;
    其中,所述视频生成模型为基于总损失训练得到,所述总损失包括图像重建损失,所述图像重建损失是由预测对象颜色值和真实对象颜色值确定的,所述预测对象颜色值是单个神经辐射场根据所述条件输入、所述三维坐标和所述视角方向生成的;所述视频生成模型用于对所述目标用户的待重构视频进行对象重构,得到所述目标用户对应的重构视频。
  13. 一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序可被处理器调用执行如权利要求1~11任一项所述的方法。
  14. 一种计算机设备,包括:
    存储器;
    一个或多个处理器,与所述存储器耦接;
    一个或多个计算机程序,其中,所述一个或多个计算机程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个计算机程序配置用于执行如权利要求1~11任一项所述的方法。
  15. 一种计算机程序产品,所述计算机程序产品包括计算机程序,所述计算机程序被存储在存储介质中;计算机设备的处理器从存储介质读取所述计算机程序,处理器执行所述计算机程序,使得所述计算机设备执行如权利要求1~11任一项所述的方法。
PCT/CN2023/118459 2022-10-13 2023-09-13 视频生成模型的训练方法、装置、存储介质及计算机设备 WO2024078243A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211255944.4A CN117036583A (zh) 2022-10-13 2022-10-13 视频生成方法、装置、存储介质及计算机设备
CN202211255944.4 2022-10-13

Publications (1)

Publication Number Publication Date
WO2024078243A1 true WO2024078243A1 (zh) 2024-04-18

Family

ID=88637780

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118459 WO2024078243A1 (zh) 2022-10-13 2023-09-13 视频生成模型的训练方法、装置、存储介质及计算机设备

Country Status (2)

Country Link
CN (1) CN117036583A (zh)
WO (1) WO2024078243A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593442B (zh) * 2023-11-28 2024-05-03 拓元(广州)智慧科技有限公司 一种基于多阶段细粒度渲染的人像生成方法
CN117478824B (zh) * 2023-12-27 2024-03-22 苏州元脑智能科技有限公司 会议视频生成方法、装置、电子设备及存储介质
CN117746192A (zh) * 2024-02-20 2024-03-22 荣耀终端有限公司 电子设备及其数据处理方法
CN117745597A (zh) * 2024-02-21 2024-03-22 荣耀终端有限公司 图像处理方法及相关装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192162A (zh) * 2021-04-22 2021-07-30 清华珠三角研究院 语音驱动图像的方法、系统、装置及存储介质
CN113269872A (zh) * 2021-06-01 2021-08-17 广东工业大学 基于三维人脸重构和视频关键帧优化的合成视频生成方法
CN113822969A (zh) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 训练神经辐射场模型和人脸生成方法、装置及服务器
US20220044463A1 (en) * 2019-08-29 2022-02-10 Tencent Technology (Shenzhen) Company Limited Speech-driven animation method and apparatus based on artificial intelligence
CN114202604A (zh) * 2021-11-30 2022-03-18 长城信息股份有限公司 一种语音驱动目标人视频生成方法、装置及存储介质
US11295501B1 (en) * 2020-11-04 2022-04-05 Tata Consultancy Services Limited Method and system for generating face animations from speech signal input
CN114782596A (zh) * 2022-02-28 2022-07-22 清华大学 语音驱动的人脸动画生成方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN117036583A (zh) 2023-11-10

Similar Documents

Publication Publication Date Title
US11818506B2 (en) Circumstances based 3D representations of participants of virtual 3D communications
WO2024078243A1 (zh) 视频生成模型的训练方法、装置、存储介质及计算机设备
US11856328B2 (en) Virtual 3D video conference environment generation
US11861936B2 (en) Face reenactment
US20210392175A1 (en) Sharing content during a virtual 3d video conference
US11765332B2 (en) Virtual 3D communications with participant viewpoint adjustment
US20230123005A1 (en) Real-time video dimensional transformations of video for presentation in mixed reality-based virtual spaces
US11790535B2 (en) Foreground and background segmentation related to a virtual three-dimensional (3D) video conference
US11870939B2 (en) Audio quality improvement related to a participant of a virtual three dimensional (3D) video conference
US11918412B2 (en) Generating a simulated image of a baby
US20230106330A1 (en) Method for creating a variable model of a face of a person
WO2024113779A1 (zh) 图像处理方法、装置及相关设备
US20230247180A1 (en) Updating a model of a participant of a three dimensional video conference call
JP2024518888A (ja) 仮想3d通信のための方法及びシステム
Zhu et al. Virtual avatar enhanced nonverbal communication from mobile phones to PCs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23876448

Country of ref document: EP

Kind code of ref document: A1