WO2024055379A1 - Video processing method and system based on a character avatar model, and related device - Google Patents

Video processing method and system based on a character avatar model, and related device

Info

Publication number
WO2024055379A1
Authority
WO
WIPO (PCT)
Prior art keywords
driving
image
video
facial
training
Prior art date
Application number
PCT/CN2022/124917
Other languages
English (en)
French (fr)
Inventor
李昱
曹成坤
邓杨
周昌印
余飞
Original Assignee
粤港澳大湾区数字经济研究院(福田)
杭州盖视科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 粤港澳大湾区数字经济研究院(福田) and 杭州盖视科技有限公司
Publication of WO2024055379A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects

Definitions

  • the present application relates to the field of video processing technology, and in particular to a video processing method, system and related equipment based on a character avatar model.
  • the user wants to implement face replacement based on video processing, for example, using the first user's expression in the video to drive the second user's face to make a corresponding expression.
  • the video is usually processed frame by frame, requiring the first user and the second user to record a video respectively.
  • for each frame of the video, the facial areas in the images of the first user and the second user are cropped and swapped.
  • the problem with the existing technology is that, after the facial areas in the images of the first user and the second user are cropped and swapped, the facial area in the image corresponding to the second user shows the first user's expression, but the corresponding facial features are in fact still those of the first user, so the goal of using the first user's expression to drive the second user's face to make the corresponding expression is not achieved.
  • the problem with the existing technology is that a video processing solution that only crops and swaps the facial area in each frame of the two users' videos cannot use the first user's expression to drive the second user's face to make the corresponding expression, which is not conducive to improving the video display effect or the effect of video face replacement.
  • the main purpose of this application is to provide a video processing method, system and related equipment based on a character avatar model, aiming to solve the problem that, in the existing technology, a video processing solution that only crops and swaps the facial area in each frame of two users' videos is not conducive to improving the video display effect.
  • the first aspect of this application provides a video processing method based on a character avatar model, wherein the above video processing method based on a character avatar model includes:
  • the driving video of the driving object is obtained by photographing the expression and posture of the driving object;
  • the driven object in the driven video performs the same expressions and gestures as the driving object in the driving video.
  • the reference image is used to provide the character avatar model with image texture details corresponding to the driven object, and the image texture details of the driven video and the reference image are the same.
  • the above reference image is an RGB image with a channel number of 3.
  • the above-mentioned method obtains multi-frame facial geometry rendering images corresponding to the above-mentioned driving object based on the above-mentioned driving video, including:
  • the three-dimensional facial mesh corresponding to each of the above-mentioned driving images is rendered to obtain a facial geometric rendering image corresponding to each of the above-mentioned driving images, wherein the above-mentioned facial geometric rendering image is a grayscale image.
  • the above method further includes:
  • the three-dimensional facial parameters are aligned based on the facial space position corresponding to the driven object in the character avatar model to update the three-dimensional facial parameters.
  • the above three-dimensional facial parameters include individual coefficients, expression coefficients and posture coefficients.
  • the above-mentioned obtaining of the time code corresponding to each of the above-mentioned facial geometric rendering images, and generating the driven video through the above-mentioned character avatar model according to the above-mentioned reference image, each of the above-mentioned facial geometric rendering images, and the time code corresponding to each of the above-mentioned facial geometric rendering images, includes:
  • Each set of data to be processed is input into the above-mentioned character avatar model in turn, and a driven image corresponding to each set of the above-mentioned data to be processed is obtained, wherein a set of the above-mentioned data to be processed consists of the above-mentioned reference image, one of the above-mentioned facial geometric rendering images, and the time code corresponding to that facial geometric rendering image.
  • in the above-mentioned driven image, the above-mentioned driven object performs the same expressions and gestures as in the corresponding facial geometric rendering image;
  • the driven images are sequentially connected according to corresponding time codes to generate the driven video.
  • the above time encoding is used to input time information for the above character avatar model.
  • the spatial dimensions of the above-mentioned temporal coding, the facial geometric rendering image corresponding to the above-mentioned temporal coding, and the above-mentioned reference image are the same.
  • the above character avatar model is pre-trained according to the following steps:
  • the reference image in the training data, the training facial geometric rendering image and the training time code corresponding to the training facial geometric rendering image are input into the deep neural network generator, and a training driven image for the above-mentioned reference image and the above-mentioned training facial geometric rendering image is generated through the above-mentioned deep neural network generator, wherein the training data includes a plurality of training image groups, and each training image group includes a reference image corresponding to the driven object, a training facial geometric rendering image corresponding to the driving object, the training time code corresponding to the training facial geometric rendering image, and the training driving image corresponding to the training facial geometric rendering image.
  • the model parameters of the deep neural network generator are adjusted according to the training driven image and the training driving image, and the above input step is repeated until the preset training conditions are met, so as to obtain the character avatar model.
  • the above-mentioned preset training conditions include convergence of the reconstruction loss between the above-mentioned training driven image and the above-mentioned training driving image.
  • the above training data is obtained by processing the collected image data through a preset data enhancement method, and the above preset data enhancement method includes spatial random cropping.
  • the above-mentioned reconstruction loss is obtained by calculating a multi-item joint image reconstruction loss function.
  • the above-mentioned multi-item joint image reconstruction loss function is used to combine at least two losses among L1 reconstruction loss, perceptual loss and GAN discriminator loss.
  • a second aspect of the present application provides a video processing system based on a character avatar model, wherein the above-mentioned video processing system based on a character avatar model includes:
  • the driving information acquisition module is used to obtain the driving video of the driving object, the authority verification information of the driving object, and the driven object corresponding to the driving object, wherein the driving video is obtained by photographing the expression and posture of the driving object;
  • an authority verification module, configured to obtain a character avatar model and a reference image corresponding to the driven object when the authority verification information of the driving object meets the authority verification conditions of the driven object;
  • the driving video processing module is used to obtain a multi-frame facial geometric rendering image corresponding to the driving object according to the driving video, wherein the facial geometric rendering image is used to reflect the expression and posture corresponding to the driving object;
  • the driven video generation module is used to obtain the time code corresponding to each of the above-mentioned facial geometric rendering images, and to generate a driven video through the above-mentioned character avatar model according to the above-mentioned reference image, each of the above-mentioned facial geometric rendering images, and the time code corresponding to each of the above-mentioned facial geometric rendering images, wherein the driven object in the driven video performs the same expressions and gestures as the driving object in the driving video.
  • a third aspect of the present application provides an intelligent terminal.
  • the intelligent terminal includes a memory, a processor, and a video processing program based on a character avatar model that is stored in the memory and can be run on the processor.
  • when the video processing program based on the character avatar model is executed by the processor, the steps of any one of the above video processing methods based on the character avatar model are implemented.
  • the driving video of the driving object, the permission verification information of the driving object and the driven object corresponding to the driving object are obtained, wherein the driving video is obtained by photographing the expression and posture of the driving object; when the permission verification information of the above-mentioned driving object meets the permission verification conditions of the above-mentioned driven object, the character avatar model and reference image corresponding to the above-mentioned driven object are obtained; multi-frame facial geometric rendering images corresponding to the above-mentioned driving object are obtained according to the above-mentioned driving video, wherein the above-mentioned facial geometric rendering images are used to reflect the expression and posture corresponding to the above-mentioned driving object; the time code corresponding to each of the above-mentioned facial geometric rendering images is obtained, and the driven video is generated through the above-mentioned character avatar model according to the above-mentioned reference image, each of the above-mentioned facial geometric rendering images, and the time code corresponding to each of the above-mentioned facial geometric rendering images, wherein the driven object in the driven video performs the same expressions and gestures as the driving object in the driving video.
  • the solution of this application does not simply crop and swap the facial areas in the images of different objects, but presets a character avatar model for the driven object.
  • after the driving video corresponding to the driving object is obtained and the permission verification is passed, the trained character avatar model and reference image of the corresponding driven object are obtained.
  • the facial geometric rendering image used to reflect the expression and posture of the driving object is obtained.
  • the facial geometric rendering image and the reference image are fused through the character avatar model to obtain the driven video.
  • the driven video is not obtained by simple image replacement of the facial area, but by fusing the expression and posture of the driving object with the actual texture of the driven object, so that the driven object performs the same expressions and gestures as the driving object in the driving video.
  • because the facial geometry rendering image only reflects expressions and postures, it does not reflect the actual texture of the face of the driving object.
  • the actual texture is provided only by the reference image of the driven object, so the actual texture corresponding to the driving object will not be incorrectly retained when the character avatar model performs image information fusion.
  • that is, the image texture details of the object displayed in the final driven video are the same as those of the driven object.
  • using the video of the driving object to drive the character avatar model is beneficial to improving the video display effect of the character avatar model. It is possible to use the expression of the driving object to drive the face of the driven object to make a corresponding expression, which is beneficial to improving the effect of video face replacement and improving the user experience.
  • Figure 1 is a schematic flow chart of a video processing method based on a character avatar model provided by an embodiment of the present application
  • Figure 2 is a schematic flowchart of a specific process for generating a driven video based on the role avatar model of user A provided by an embodiment of the present application;
  • Figure 3 is a schematic diagram of the component modules of a video processing system based on a character avatar model provided by an embodiment of the present application;
  • Figure 4 is a functional block diagram of the internal structure of an intelligent terminal provided by an embodiment of the present application.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to classifying to” depending on the context.
  • the phrase “if determined” or “if classified to [the described condition or event]” may be interpreted, depending on the context, to mean “once determined” or “in response to a determination” or “once classified to [the described condition or event]” or “in response to classification to [the described condition or event]”.
  • the user wants to implement face replacement based on video processing, such as using the first user's expression to drive the second user's face to make a corresponding expression in the video, thereby achieving entertainment effects.
  • the video is usually processed frame by frame, requiring the first user and the second user to record a video respectively.
  • for each frame of the video, the facial areas in the images of the first user and the second user are cropped and swapped.
  • the problem with the existing technology is that, after the facial areas in the images of the first user and the second user are cropped and swapped, the facial area in the image corresponding to the second user shows the first user's expression, but the corresponding facial features are in fact still those of the first user, so the goal of using the first user's expression to drive the second user's face to make the corresponding expression is not achieved.
  • the problem with the existing technology is that a video processing solution that only crops and swaps the facial area in each frame of the two users' videos is not conducive to improving the effect of video face replacement, and cannot use the first user's expression to drive the second user's face to make the corresponding expression.
  • portrait videos can be used, with deep learning and other methods, to learn to predict explicit attributes such as posture and camera position, or implicit feature representations; these attributes and feature representations can then be adjusted and manipulated to reconstruct the portrait image.
  • one solution (for example, FOMM) is based on unsupervised learning of the keypoint correspondence between the target portrait and the driving portrait, converts it into a dense optical flow field, uses the optical flow field to warp the face image, and finally generates the image through a generation network; however, this solution is based on 2D pixel mapping, the face does not have 3D consistency, and the dense optical flow field easily causes the background and the face to move together.
  • in another solution (for example, DVP), a large number of explicit attributes, including face correspondence maps, etc., are first estimated and then combined with image-to-image translation technology to recover a realistic portrait; however, although this solution uses a large number of explicit attributes, such as 3D face correspondence maps, the generated results contain artifacts and blur, and video processing is not smooth.
  • in another solution (for example, NerFace), neural radiance fields are used as the renderer to generate high-definition portraits, which can improve 3D consistency at large angles, but the rendering results still lose details and the rendering efficiency is low.
  • the solution of this application considers both cost and effect, and provides a solution that is more efficient, has better effects and requires less model training time, making video processing and rendering more efficient, with results that are higher definition and more realistic; it achieves the effect of using the expression of the driving object to drive the face of the driven object to make the corresponding expression, generates the corresponding driven video, and makes the generated driven video close to a real video.
  • the driving video of the driving object, the permission verification information of the driving object and the driven object corresponding to the driving object are obtained, wherein the driving video is obtained by photographing the expression and posture of the driving object; when the permission verification information of the above-mentioned driving object meets the permission verification conditions of the above-mentioned driven object, the character avatar model and reference image corresponding to the above-mentioned driven object are obtained; multi-frame facial geometric rendering images corresponding to the above-mentioned driving object are obtained according to the above-mentioned driving video, wherein the above-mentioned facial geometric rendering images are used to reflect the expression and posture corresponding to the above-mentioned driving object; the time code corresponding to each of the above-mentioned facial geometric rendering images is obtained, and the driven video is generated through the above-mentioned character avatar model according to the above-mentioned reference image, each of the above-mentioned facial geometric rendering images, and the time code corresponding to each of the above-mentioned facial geometric rendering images, wherein the driven object in the driven video performs the same expressions and gestures as the driving object in the driving video.
  • the solution of this application does not simply crop and swap the facial areas in the images of different objects, but presets a character avatar model for the driven object.
  • after the driving video corresponding to the driving object is obtained and the permission verification is passed, the trained character avatar model and reference image of the corresponding driven object are obtained.
  • the facial geometric rendering image used to reflect the expression and posture of the driving object is obtained.
  • the facial geometric rendering image and the reference image are fused through the character avatar model to obtain the driven video.
  • the driven video is not obtained by simple image replacement of the facial area, but by fusing the expression and posture of the driving object with the actual texture of the driven object, so that the driven object performs the same expressions and gestures as the driving object in the driving video.
  • because the facial geometry rendering image only reflects expressions and postures, it does not reflect the actual texture of the face of the driving object.
  • the actual texture is provided only by the reference image of the driven object, so the actual texture corresponding to the driving object will not be incorrectly retained when the character avatar model performs image information fusion.
  • that is, the image texture details of the object displayed in the final driven video are the same as those of the driven object.
  • using the video of the driving object to drive the character avatar model is beneficial to improving the video display effect of the character avatar model. It is possible to use the expression of the driving object to drive the face of the driven object to make a corresponding expression, which is beneficial to improving the effect of video face replacement and improving the user experience.
  • this embodiment of the present application provides a video processing method based on a character avatar model. Specifically, the above method includes the following steps:
  • Step S100 Obtain the driving video of the driving object, the authority verification information of the driving object, and the driven object corresponding to the driving object, wherein the driving video is obtained by photographing the expression and posture of the driving object.
  • the above-mentioned driving object is an object that needs to retain the corresponding expression and posture but not the corresponding facial details during video processing (for example, user B).
  • a character avatar model (for example, a digital portrait model of the driven object speaking in a specific scene) is preset for the driven object, where the driven object is an object that needs to retain the corresponding facial details during video processing (i.e., user A).
  • therefore, the video processing process in this embodiment is equivalent to driving, based on the expressions and gestures provided by user B, the character avatar model corresponding to user A to generate a corresponding driven video.
  • in the driven video, the image of user A makes the same expressions and gestures as user B in the driving video, thereby achieving, through video processing, the effect of using user B to drive the image of user A.
  • the above posture represents the head posture of the corresponding object
  • the expression represents the facial expression of the corresponding object.
  • the above-mentioned driving video can be obtained by shooting the driving object with a camera, a mobile phone or other equipment, and is specifically a video of a spoken word or a video with lip movements, so as to be modeled and matched with real lip movements.
  • corresponding character avatar models can be trained in advance for multiple other users, and the driving object determines the character avatar model that needs to be selected and used by specifying the corresponding driven object.
  • driving object and driven object can be animals, animated images, virtual characters or real people.
  • the driving object and the driven object can be the same or different.
  • real people are taken as an example for explanation. But it is not a specific limitation.
  • the images of the head area of the driving object and the driven object are processed according to the above-mentioned video processing method.
  • the above-mentioned character avatar model is also a model used to process the head-area avatar, but based on this solution, the above-mentioned character avatar model can also be used to process the entire character image in the video, including the head area and limbs.
  • Step S200 When the authority verification information of the driving object meets the authority verification conditions of the driven object, obtain the character avatar model and reference image corresponding to the driven object.
  • the above permission verification information is used to verify whether the driving object has permission to use the character avatar model and/or reference image corresponding to the driven object.
  • the character avatar model of the driven object is preset with permission verification conditions; only when the permission verification information of the driving object meets the permission verification conditions of the driven object can the character avatar model and reference image corresponding to the driven object be obtained. It should be noted that there are many ways to set the permission verification conditions and the corresponding permission verification information, such as password matching, authorization through permission tables, etc., which are not specifically limited here.
  • the reference image is used to provide the character avatar model with image texture details corresponding to the driven object, and the image texture details of the driven video and the reference image are the same.
  • the expressions and gestures are provided by the driving video, and the corresponding driven video is generated based on the image texture details of the face in the reference image.
  • the above-mentioned reference image is obtained by photographing the driven object, and can capture the image texture details corresponding to the background area in the scene where the driven object is located. Therefore, the corresponding background in the driven video is also the same as the reference image.
  • the image texture details may include details other than expressions and postures, such as the facial features, facial characteristics (such as wrinkles, glasses, etc.), and the image texture in the background area, which are not specifically limited here.
  • Step S300 Obtain multi-frame facial geometric rendering images corresponding to the driving object based on the driving video, where the facial geometric rendering images are used to reflect the expressions and postures corresponding to the driving object.
  • continuous multi-frame facial geometric rendering images corresponding to the driving object are sequentially obtained according to the above driving video.
  • each frame of the facial geometric rendering image is used to reflect the expression and posture corresponding to the driving object, but does not retain the image texture details of the driving object.
  • the above-mentioned obtaining of multi-frame facial geometric rendering images corresponding to the above-mentioned driving object based on the above-mentioned driving video includes: splitting the above-mentioned driving video to obtain multiple frames of driving images; separately extracting the three-dimensional facial parameters corresponding to each of the above-mentioned driving images; obtaining the three-dimensional facial mesh corresponding to each of the above-mentioned driving images according to the three-dimensional facial parameters corresponding to each of the above-mentioned driving images; and rendering the three-dimensional facial mesh corresponding to each of the above-mentioned driving images to obtain the facial geometric rendering image corresponding to each of the above-mentioned driving images, wherein the above-mentioned facial geometric rendering image is a grayscale image.
  • the above-mentioned three-dimensional facial parameters include individual coefficients, expression coefficients and posture coefficients. Further, in order to improve the accuracy of the video processing process and to improve the realism of the finally obtained driven video and the vividness of the expressions in it, in this embodiment, after the three-dimensional facial parameters corresponding to each of the above-mentioned driving images are respectively obtained, the above-mentioned method also includes: aligning the above-mentioned three-dimensional facial parameters based on the facial space position corresponding to the above-mentioned driven object in the above-mentioned character avatar model to update the above-mentioned three-dimensional facial parameters; that is, the facial geometric rendering image is obtained based on the updated three-dimensional facial parameters.
  • the above-mentioned driving video is split into image frames in order to obtain multi-frame driving images, and then 3D face parameter estimation is performed on each frame of the driving image to obtain the corresponding three-dimensional facial parameters, including individual coefficients, expression coefficients and posture coefficients.
  • the corresponding three-dimensional facial parameters are extracted through a pre-trained parameter extraction model.
  • the parameter extraction model is trained to output the corresponding three-dimensional facial parameters according to the input face image.
  • the parameter extraction model can be a Pre-trained neural network model.
  • the above-mentioned expression coefficients are used to reflect the corresponding expression characteristics of the driving object, such as grinning, crying, etc.; the above-mentioned posture coefficients are used to reflect the corresponding head postures of the driving object, such as turning the head left and right, nodding up and down, shaking the head, etc.; the above-mentioned individual coefficients are used to reflect the facial characteristics of the driving object, such as face shape. Different users have different face shapes, so the individual coefficients of different driving objects are also different. Combining the above three types of three-dimensional facial parameters can make the expressions in the generated driven video more accurate.
  • the individual coefficients, expression coefficients and posture coefficients are converted into the face space of the character avatar model of the driven object (ie, face posture correction is performed), thereby improving the generation effect of the driven video.
  • the head postures of the driving object and the driven object need to be aligned, that is, the spatial size and spatial position of the head need to be roughly aligned.
  • the three-dimensional facial parameters corresponding to the driven object are stored in the character avatar model corresponding to the driven object, and can be used to reflect the facial space position corresponding to the driven object.
  • the goal is to align the mean and variance of each coefficient in the converted three-dimensional facial parameters of the driving object with the three-dimensional facial parameters of the driven object.
  • the three-dimensional face mesh corresponding to each driving image is calculated through a preset 3D face model (such as BFM or FLAME; the 3D face model is represented by the function f), where the above-mentioned three-dimensional face mesh represents the facial geometric information in the driving image.
  • the three-dimensional face mesh is then rendered by a preset renderer (such as PyTorch3D; the renderer is represented by Render) to obtain the facial geometric rendering image, where the above-mentioned facial geometric rendering image is a 1-channel grayscale image.
  • one driving video corresponds to multiple frames of driving images, which in turn correspond to multiple frames of facial geometric rendering images (i.e., a collection of facial geometric rendering images can be obtained).
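  • The patent gives no reference code for this step; the following is a minimal Python sketch of the per-frame pipeline described above. The three callables passed in (`estimate_params`, `face_model_f`, `render_geometry`) are illustrative stand-ins for the components named in the text (a pre-trained parameter extraction model, a 3DMM such as BFM or FLAME, and a rasterizer such as PyTorch3D) and are assumptions, not APIs from the patent.

```python
import cv2
import numpy as np

def driving_video_to_geometry_renders(video_path, estimate_params, face_model_f,
                                      render_geometry, H=512, W=512):
    """Split the driving video into frames and produce one 1-channel facial
    geometry rendering image M_t per driving image I_t.

    Assumed callables (hypothetical, standing in for components named in the text):
      estimate_params(frame)      -> (alpha, beta, theta)  individual/expression/posture coefficients
      face_model_f(a, b, th)      -> mesh                  a 3D face model such as BFM or FLAME
      render_geometry(mesh, H, W) -> (H, W) float array    a renderer such as PyTorch3D
    """
    cap = cv2.VideoCapture(video_path)
    renders = []
    while True:
        ok, frame = cap.read()                        # driving image I_t (BGR, uint8)
        if not ok:
            break
        alpha, beta, theta = estimate_params(frame)   # 3D face parameter estimation
        mesh = face_model_f(alpha, beta, theta)       # face geometry only, no texture
        m_t = render_geometry(mesh, H, W)             # grayscale facial geometry render
        renders.append(np.asarray(m_t, dtype=np.float32)[..., None])  # shape (H, W, 1)
    cap.release()
    return renders
```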
  • Step S400 Obtain the time code corresponding to each of the above-mentioned facial geometric rendering images, and generate a driven video through the above-mentioned character avatar model based on the above-mentioned reference image, each of the above-mentioned facial geometric rendering images, and the time code corresponding to each of the above-mentioned facial geometric rendering images.
  • the driven object in the driven video performs the same expressions and gestures as the driving object in the driving video.
  • the above reference image is an RGB image with a channel number of 3, and the reference image can be any image containing the face of the driven object.
  • the reference image used when using the character avatar model is the same as when training the character avatar model, and can be any frame in the training video containing the driven object used when training the above character avatar model.
  • the above reference images are used to provide textures for characters and backgrounds, allowing the character avatar model (i.e. a trained neural network generator, such as the UNet model) to recover more details.
  • the above-mentioned obtaining of the time code corresponding to each of the above-mentioned facial geometric rendering images, and generating the driven video through the above-mentioned character avatar model based on the above-mentioned reference image, each of the above-mentioned facial geometric rendering images and the time code corresponding to each of the above-mentioned facial geometric rendering images, includes: obtaining the time code corresponding to each of the above-mentioned facial geometric rendering images according to a preset time-encoding calculation formula; inputting each set of data to be processed into the above-mentioned character avatar model in turn to obtain the driven image corresponding to each set of the above-mentioned data to be processed, wherein a set of the above-mentioned data to be processed consists of the above-mentioned reference image, one of the above-mentioned facial geometric rendering images and the time code corresponding to that facial geometric rendering image, and in the above-mentioned driven image the above-mentioned driven object performs the same expressions and gestures as in the corresponding facial geometric rendering image; and connecting each of the above-mentioned driven images in sequence according to the corresponding time codes to generate the above-mentioned driven video.
  • the above-mentioned time encoding is used to input time information into the above-mentioned character avatar model to improve the temporal stability when generating the driven image.
  • the above-mentioned time encoding is a time information encoding with a channel number of 2N. Its spatial dimension is the same as the driving image and the facial geometry rendering image, and the values in each channel are the same.
  • the preset time-encoding calculation formula is: TPE_t = (sin(2^0·π·t), cos(2^0·π·t), …, sin(2^(N-1)·π·t), cos(2^(N-1)·π·t))  (1)
  • TPE_t represents the time encoding corresponding to the facial geometry rendering image numbered t
  • the facial geometry rendering image numbered t corresponds to the driving image numbered t.
  • N is a preset constant value that can be set and adjusted according to the actual situation (for example, set to 3).
  • the number of time encoding channels is 2N because there are two sets of sin and cos encoding.
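  • As a concrete reading of formula (1), the sketch below computes TPE_t as an H×W×2N array whose spatial size matches the render and reference image and whose channels are spatially constant, with N configurable (the text mentions N = 3 as an example). The function name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def temporal_positional_encoding(t, H, W, N=3):
    """Formula (1): TPE_t = (sin(2^0*pi*t), cos(2^0*pi*t), ..., sin(2^(N-1)*pi*t), cos(2^(N-1)*pi*t)),
    broadcast to spatial size H x W so it can be concatenated with the images."""
    channels = []
    for k in range(N):
        phase = (2 ** k) * np.pi * t
        channels.append(np.sin(phase))   # sin component for frequency 2^k
        channels.append(np.cos(phase))   # cos component for frequency 2^k
    tpe = np.array(channels, dtype=np.float32)            # shape (2N,)
    return np.broadcast_to(tpe, (H, W, 2 * N)).copy()     # every channel is spatially constant

# Example: encoding for the frame numbered t = 7 at 512x512 resolution
tpe_7 = temporal_positional_encoding(7, 512, 512)         # shape (512, 512, 6) when N = 3
```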
  • each group of data to be processed can be formed in sequence.
  • a group of data to be processed includes a reference image, a frame of facial geometry rendering image and a time code.
  • Each group of data to be processed can be input into the above-mentioned character avatar model in turn.
  • the driven images of each frame can be obtained in sequence, and the driven video can finally be obtained by combination.
  • the character subject in the above driven video is the driven object, and the character subject performs the expressions and gestures made by the driving object in the driving video.
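  • The sketch below strings these pieces together for inference: each group of data to be processed (reference image, one geometry render, its time code) is concatenated channel-wise and passed through the trained generator, and the resulting frames are collected in time-code order. The channel-wise concatenation and the use of PyTorch are assumptions consistent with the image-to-image generator described later, not code from the patent; it reuses the `temporal_positional_encoding` helper from the sketch above.

```python
import torch

@torch.no_grad()
def generate_driven_video(g, ref_img, geometry_renders, N=3):
    """g: trained character avatar model (image-to-image generator).
    ref_img: (3, H, W) float tensor, the reference image of the driven object.
    geometry_renders: list of (H, W, 1) arrays from the driving video, in temporal order."""
    _, H, W = ref_img.shape
    frames = []
    for t, m in enumerate(geometry_renders):
        m_t = torch.as_tensor(m, dtype=torch.float32).permute(2, 0, 1)      # (1, H, W)
        tpe = torch.as_tensor(
            temporal_positional_encoding(t, H, W, N)).permute(2, 0, 1)      # (2N, H, W)
        x = torch.cat([ref_img, m_t, tpe], dim=0).unsqueeze(0)              # (1, 3+1+2N, H, W)
        frames.append(g(x).squeeze(0))                                      # driven image I'_t
    return torch.stack(frames)                                              # (T, 3, H, W), encode as video
```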
  • the above-mentioned character avatar model is pre-trained according to the following steps:
  • the reference image in the training data, the training facial geometric rendering image and the training time code corresponding to the training facial geometric rendering image are input into the deep neural network generator, and a training driven image for the above-mentioned reference image and the above-mentioned training facial geometric rendering image is generated through the above-mentioned deep neural network generator, wherein the training data includes a plurality of training image groups, and each training image group includes a reference image corresponding to the driven object, a training facial geometric rendering image corresponding to the driving object, the training time code corresponding to the training facial geometric rendering image, and the training driving image corresponding to the training facial geometric rendering image.
  • the model parameters of the character avatar model are adjusted, and the step of inputting the reference image in the training data, the training facial geometry rendering image and the training time code corresponding to the training facial geometry rendering image into the character avatar model is continued until the preset training conditions are met, so as to obtain a trained deep neural network generator, and use the trained deep neural network generator as the character avatar model;
  • the above-mentioned preset training conditions include reconstruction loss convergence between the above-mentioned training driven image and the above-mentioned training driving image.
  • the above training data is obtained by processing the collected image data through a preset data enhancement method, and the above preset data enhancement method includes spatial random cropping.
  • the above-mentioned collected image data is the image data collected directly during training
  • the training data is the data obtained by performing data enhancement operations on the collected image data.
  • the above-mentioned reconstruction loss is calculated by a multi-term joint image reconstruction loss function.
  • the above-mentioned multi-term joint image reconstruction loss function is used to combine at least two losses among L1 reconstruction loss, perceptual loss and GAN discriminator loss.
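  • One way to realize such a multi-term joint reconstruction loss is sketched below: an L1 term, a perceptual term over frozen feature maps, and a non-saturating GAN generator term, combined with weights. The weights and the optional `disc` and `vgg_features` callables are illustrative assumptions; the patent only states that at least two of the three losses are combined.

```python
import torch
import torch.nn.functional as F

def joint_reconstruction_loss(pred, target, disc=None, vgg_features=None,
                              w_l1=1.0, w_perc=1.0, w_gan=0.1):
    """pred: generated training driven image I'_t, target: training driving image I_t,
    both (B, 3, H, W). disc / vgg_features are optional callables (at least two terms used)."""
    loss = w_l1 * F.l1_loss(pred, target)                        # L1 reconstruction loss
    if vgg_features is not None:                                 # perceptual loss on frozen features
        loss = loss + w_perc * F.l1_loss(vgg_features(pred), vgg_features(target))
    if disc is not None:                                         # GAN generator loss
        logits_fake = disc(pred)                                 # discriminator score for generated image
        loss = loss + w_gan * F.binary_cross_entropy_with_logits(
            logits_fake, torch.ones_like(logits_fake))
    return loss
```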
  • the training process of the above-mentioned deep neural network generator and the process of video processing based on the character avatar model are also described in detail based on a specific application scenario.
  • in the training process and in the video processing process, the data used correspond to each other or are the same.
  • for example, the training facial geometric rendering image corresponds to the facial geometric rendering image, and the difference in their names is only used to distinguish the data used in the training process from the data used in the video processing process using the model; their acquisition methods and processing methods can be used as references for each other.
  • the training facial geometric rendering image used in the training process corresponds to the driven object.
  • the training facial geometric rendering image is obtained by rendering the training three-dimensional face mesh during the training process.
  • the training three-dimensional face mesh can be obtained from the training three-dimensional facial parameters.
  • the training three-dimensional facial parameters can be obtained from the training driving images.
  • the training driving images can be obtained from the training driving video captured of user A.
  • specifically, user A is first photographed to obtain a video of user A speaking (i.e., the training driving video), which is then split into multiple frames of training driving images in sequence.
  • the obtained training driving images are recorded as I_t, and 3D face parameter estimation is then performed on each frame of the training driving image I_t to obtain the training three-dimensional facial parameters, including the individual coefficient α_t, the expression coefficient β_t and the posture coefficient θ_t.
  • the corresponding training three-dimensional face mesh is calculated based on the preset 3D face model (recorded as f), and is then rendered by the preset renderer Render to obtain a 1-channel training facial geometric rendering image M_t ∈ R^(H×W×1), where H and W respectively represent the height and width of the training facial geometric rendering image (or of the corresponding training driving image).
  • the rendering process is as shown in the following formula (2): M_t = Render(f(α_t, β_t, θ_t))  (2)
  • Render represents the processing process of the renderer, and the function f represents the processing process of the 3D face model.
  • the purpose is to train a deep neural network generator that restores the original training driving image I_t (which contains a human face) from the training facial geometric rendering image M_t.
  • a reference image and the time encoding TPE_t are also introduced.
  • the reference image (recorded as I_ref) is a 3-channel RGB image, that is, I_ref ∈ R^(H×W×3). Specifically, for the same character avatar model, during the training of the deep neural network generator and during the video processing based on the character avatar model, the image frames are split in the same way, the reference images used are the same, and the time-encoding setting method is also the same, so the time encoding during the training process can be set with reference to the above formula (1), which will not be described again here.
  • the spatial dimensions of the time encoding TPE_t are consistent with those of the training facial geometric rendering image M_t and the reference image I_ref, and TPE_t ∈ R^(H×W×2N).
  • the function of the above-mentioned character avatar model (denoted as g) is to generate an image with a face (i.e., the training driven image) I'_t ∈ R^(H×W×3) when given M_t, TPE_t and I_ref, as shown in the following formula (3): I'_t = g(M_t, TPE_t, I_ref)  (3)
  • I'_t represents the training driven image corresponding to the t-th frame of the training driving image, and g represents the processing process of the character avatar model.
  • the character avatar model (i.e., the neural network generator) used in this embodiment is an image-to-image (input and output are both images, and the spatial size remains unchanged) convolutional neural network (such as UNet).
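  • To make the input/output layout of formula (3) concrete, here is a toy UNet-style image-to-image network: the reference image (3 channels), the geometry render (1 channel) and the time code (2N channels) are concatenated on the channel axis, and the spatial size is preserved end to end. This tiny two-level network is only an illustration of the channel bookkeeping; the patent names UNet as an example but gives no architecture details.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal image-to-image generator in the spirit of formula (3):
    input  = concat(reference image 3 ch, geometry render 1 ch, time code 2N ch),
    output = 3-channel driven image with the same spatial size.
    A production model would be much deeper; this only shows the channel layout."""
    def __init__(self, n_time=3):
        super().__init__()
        c_in = 3 + 1 + 2 * n_time
        self.enc1 = nn.Sequential(nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        self.out = nn.Conv2d(128, 3, 3, padding=1)   # 128 = 64 (skip) + 64 (decoded)

    def forward(self, x):
        e1 = self.enc1(x)                            # (B, 64, H, W)
        d = self.up(self.down(e1))                   # (B, 64, H, W) after down/up sampling
        return torch.sigmoid(self.out(torch.cat([e1, d], dim=1)))  # (B, 3, H, W) in [0, 1]
```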
  • the neural network generator g is trained so that the predicted training driven image I'_t can reconstruct the corresponding training driving image I_t in the training video; specifically, the model parameters of the character avatar model g are iteratively optimized and updated with the goal of minimizing the reconstruction loss between the training driven image I'_t and the training driving image I_t, until the reconstruction loss converges to a minimum.
  • the above preset training conditions may also include that the number of iterations reaches the iteration number threshold.
  • FIG. 2 is a schematic flowchart of a specific process for generating a driven video based on the character avatar model of user A provided by an embodiment of the present application.
  • as shown in Figure 2, 3D face parameters (including individual coefficients, expression coefficients and posture coefficients) are extracted from each frame of the driving image of user B and are then converted into the space corresponding to user A; the goal of the conversion is to align the mean and variance of the converted coefficients, which can be expressed by the following formula (4): (α'_t, β'_t, θ'_t) = T(α_t, β_t, θ_t)  (4)
  • T represents the alignment conversion process, and α_t, β_t and θ_t respectively represent the individual coefficient, expression coefficient and posture coefficient corresponding to user B before conversion.
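  • A minimal sketch of the mean-and-variance alignment T described around formula (4): each coefficient of user B is standardized with user B's statistics and re-scaled with user A's statistics, which are assumed here to be stored with user A's character avatar model. The statistic names and the per-dimension treatment are illustrative assumptions.

```python
import numpy as np

def align_coefficients(coeff_b, mean_b, std_b, mean_a, std_a, eps=1e-8):
    """Alignment conversion T: map user B's coefficient vector (individual, expression
    or posture) into user A's coefficient space so that the mean and variance of the
    converted coefficients match user A's statistics."""
    coeff_b = np.asarray(coeff_b, dtype=np.float32)
    return (coeff_b - mean_b) / (std_b + eps) * std_a + mean_a

# Example with hypothetical per-dimension statistics gathered from the two videos
beta_b = np.array([0.3, -1.2, 0.7], dtype=np.float32)
beta_aligned = align_coefficients(beta_b,
                                  mean_b=np.array([0.1, -0.5, 0.4]),
                                  std_b=np.array([0.6, 0.9, 0.5]),
                                  mean_a=np.array([0.0, 0.2, -0.1]),
                                  std_a=np.array([0.4, 0.7, 0.3]))
```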
  • a low-cost and highly realistic character avatar model is provided.
  • for user A's character avatar model, user A only needs to use everyday shooting equipment (such as a mobile phone) to shoot the corresponding training driving video (such as a speech video of about 2 minutes) in a scene (scene 1), which then becomes the material for character avatar model training.
  • data augmentation and expansion techniques can be used to obtain multiple sets of training images.
  • based on the above training driving video, the digital avatar model is trained for about 4 hours on the training platform, which yields a digital avatar model that can support arbitrary head movements and expressions of user A in the same scene (scene 1).
  • user B can drive the above-mentioned digital avatar model by recording a driving video in any scene (scene 2) and generate a driven video of user A speaking in scene 1, where user A's posture and expression in the driven video are the same as user B's in the driving video.
  • this implementation is combined with time coding to ensure that the generated driven video is real and natural, has high fluency and stability in the time domain, and can achieve generation results similar to real-shot videos.
  • during training, data enhancement methods including spatial random cropping can be used, and the neural network generator can be trained by optimizing a multi-term joint image reconstruction loss function (such as L1 reconstruction loss, perceptual loss and GAN discriminator loss); training is performed on an NVIDIA A100-SXM4-40GB GPU, the batch size is 20, and the input and output image resolution is 512×512.
  • data enhancement refers to adding spatial random cropping during the training process to enhance data diversity.
  • the video frames shot by user A are spatially randomly cropped and then input into the neural network generator.
  • calculating the loss refers to calculating the loss between the driven image generated by the neural network generator (i.e., the character avatar model) and its corresponding original driving image.
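  • The spatial random cropping mentioned here has to be applied consistently to every tensor in a training group (driving image, geometry render, time code), otherwise the pixel alignment that the reconstruction loss relies on is lost. The sketch below shows one way to do that with a shared crop window; the crop size and names are assumptions.

```python
import random

def paired_random_crop(tensors, crop_h, crop_w):
    """Spatial random cropping for data augmentation: the same crop window is applied
    to every (C, H, W) tensor in the group so the tensors stay pixel-aligned."""
    _, H, W = tensors[0].shape
    top = random.randint(0, H - crop_h)
    left = random.randint(0, W - crop_w)
    return [t[:, top:top + crop_h, left:left + crop_w] for t in tensors]

# Example: crop a training driving image, its geometry render and its time code together
# i_t, m_t, tpe_t = paired_random_crop([i_t, m_t, tpe_t], 448, 448)
```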
  • the video processing method based on the character avatar model provided in this embodiment has high efficiency in the process of model training and rendering of new videos.
  • the DVP solution requires an average of 42 hours of training and an average of 0.2 seconds per frame when rendering a video; the NerFace solution requires an average of 55 hours of training and an average of 6 seconds per frame when rendering a video; in this embodiment, training takes an average of 4 hours and rendering takes an average of 0.03 seconds per frame. It can be seen that the solution in this embodiment is conducive to improving training and processing efficiency.
  • image quality can be measured with SSIM (Structural Similarity Index) and PSNR (Peak Signal-to-Noise Ratio).
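  • SSIM and PSNR are the usual full-reference metrics for comparing a generated frame against the corresponding real frame; the patent lists them but does not show how they are computed. A sketch using scikit-image, assuming uint8 RGB frames of the same size:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(generated, real):
    """generated, real: (H, W, 3) uint8 RGB frames of the same size."""
    psnr = peak_signal_noise_ratio(real, generated, data_range=255)
    ssim = structural_similarity(real, generated, channel_axis=-1, data_range=255)
    return psnr, ssim
```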
  • the character avatar model for the driven object is preset.
  • the character avatar model and reference image of the corresponding driven object are obtained.
  • the facial geometric rendering image used to reflect the expression and posture of the driving object is obtained.
  • the facial geometric rendering image and the reference image are fused through the character avatar model of the driven object to obtain the driven video.
  • the driven video is not obtained by simple image replacement of the facial area, but by fusing the expression and posture of the driving object with the actual texture of the driven object, so that the driven object performs the same expressions and gestures as the driving object in the driving video.
  • because the facial geometry rendering image only reflects expressions and postures, it does not reflect the actual texture of the face of the driving object.
  • the actual texture is provided only by the reference image of the driven object, so the actual texture corresponding to the driving object will not be incorrectly retained when the character avatar model performs image information fusion.
  • that is, the image texture details of the object displayed in the final driven video are the same as those of the driven object.
  • using the video of the driving object to drive the character avatar model is beneficial to improving the video display effect of the character avatar model. It is possible to use the expression of the driving object to drive the face of the driven object to make a corresponding expression, which is beneficial to improving the effect of video face replacement and improving the user experience.
  • embodiments of the present application also provide a video processing system based on the character avatar model.
  • the above video processing system based on the character avatar model includes:
  • the driving information acquisition module 510 is used to obtain the driving video of the driving object, the permission verification information of the driving object, and the driven object corresponding to the driving object, wherein the driving video is obtained by photographing the expression and posture of the driving object;
  • the authority verification module 520 is configured to obtain the role avatar model and reference image corresponding to the driven object when the authority verification information of the driving object meets the authority verification conditions of the driven object;
  • the driving video processing module 530 is used to obtain a multi-frame facial geometric rendering image corresponding to the driving object according to the driving video, wherein the facial geometric rendering image is used to reflect the expression and posture corresponding to the driving object;
  • the driven video generation module 540 is used to obtain the time code corresponding to each of the above-mentioned facial geometric rendering images, and to generate a driven video through the above-mentioned character avatar model based on the above-mentioned reference image, each of the above-mentioned facial geometric rendering images and the time code corresponding to each of the above-mentioned facial geometric rendering images, wherein the driven object in the driven video performs the same expressions and gestures as the driving object in the driving video.
  • the specific division of the modules of the video processing system based on the character avatar model is not unique and is not specifically limited here.
  • the above-mentioned intelligent terminal includes a processor, a memory, a network interface and a display screen connected through a system bus.
  • the processor of the smart terminal is used to provide computing and control capabilities.
  • the memory of the smart terminal includes non-volatile storage media and internal memory.
  • the non-volatile storage medium stores an operating system and a video processing program based on the character avatar model.
  • the internal memory provides an environment for the operation of the operating system and the video processing program based on the character avatar model in the non-volatile storage medium.
  • the network interface of the smart terminal is used to communicate with external terminals through network connections. When the video processing program based on the role avatar model is executed by the processor, the steps of any of the above video processing methods based on the role avatar model are implemented.
  • the display screen of the smart terminal may be a liquid crystal display screen or an electronic ink display screen.
  • in one embodiment, a smart terminal is provided, which includes a memory, a processor, and a video processing program based on a character avatar model that is stored in the memory and can be run on the processor.
  • when the above-mentioned video processing program based on the character avatar model is executed by the above-mentioned processor, the steps of any video processing method based on the character avatar model provided by the embodiments of the present application are implemented.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a video processing program based on the character avatar model.
  • when the video processing program based on the character avatar model is executed by the processor, the steps of any video processing method based on the character avatar model provided in the embodiments of the present application are implemented.
  • sequence number of each step in the above embodiment does not mean the order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.
  • dividing the device into modules means dividing the internal structure of the above device into different functional units or modules to complete all or part of the functions described above.
  • Each functional unit and module in the embodiment can be integrated into one processing unit, or each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above-mentioned integrated unit can be hardware-based. It can also be implemented in the form of software functional units.
  • the specific names of each functional unit and module are only for the convenience of distinguishing each other and are not used to limit the scope of protection of the present application.
  • the disclosed system/terminal device and method can be implemented in other ways.
  • the system/terminal equipment embodiments described above are only illustrative.
  • the division of the above modules or units is only a logical function division; in actual implementation, there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the above-mentioned integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the present application can implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program.
  • the computer program can be stored in a computer-readable storage medium, and when executed by the processor, the steps of each of the above method embodiments can be implemented.
  • the above-mentioned computer program includes computer program code, and the above-mentioned computer program code may be in the form of source code, object code, executable file or some intermediate form, etc.
  • the above-mentioned computer-readable media may include: any entity or device capable of carrying the above-mentioned computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals, software distribution media, etc. It should be noted that the content contained in the above computer-readable storage media can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application discloses a video processing method and system based on a character avatar model, and a related device. The method includes: obtaining a driving video of a driving object, permission verification information of the driving object, and a corresponding driven object; when the permission verification information of the driving object meets the permission verification conditions of the driven object, obtaining a character avatar model and a reference image of the driven object; obtaining, according to the driving video, multi-frame facial geometric rendering images corresponding to the driving object; obtaining the time code corresponding to each facial geometric rendering image, and generating a driven video through the character avatar model according to the reference image, each facial geometric rendering image and the time code corresponding to each facial geometric rendering image, wherein in the driven video the driven object performs the same expressions and gestures as the driving object in the driving video. In the present application, the video of the driving object is used to drive the character avatar model, which is conducive to improving the video display effect of the character avatar model.

Description

Video processing method and system based on a character avatar model, and related device
Technical Field
The present application relates to the field of video processing technology, and in particular to a video processing method and system based on a character avatar model, and a related device.
Background Art
With the development of science and technology, and especially of video processing technology, users' requirements for video processing have gradually increased. For example, a user may want to implement face replacement based on video processing, for example using a first user's expression in a video to drive a second user's face to make a corresponding expression.
In the existing technology, the video is usually processed frame by frame, requiring the first user and the second user to each record a video; for each frame of the video, the facial areas in the images of the first user and the second user are cropped and swapped. The problem with the existing technology is that, after the facial areas in the images of the first user and the second user are cropped and swapped, although the facial area in the image corresponding to the second user shows the first user's expression, the corresponding facial features are in fact still those of the first user, so the goal of using the first user's expression to drive the second user's face to make the corresponding expression is not achieved.
The problem with the existing technology is that a video processing solution that only crops and swaps the facial area in each frame of the two users' videos cannot use the first user's expression to drive the second user's face to make the corresponding expression, which is not conducive to improving the video display effect or the effect of video face replacement.
Therefore, the existing technology still needs to be improved and developed.
Summary of the Invention
The main purpose of the present application is to provide a video processing method and system based on a character avatar model, and a related device, aiming to solve the problem in the existing technology that a video processing solution that only crops and swaps the facial area in each frame of two users' videos is not conducive to improving the video display effect.
In order to achieve the above purpose, a first aspect of the present application provides a video processing method based on a character avatar model, wherein the above video processing method based on a character avatar model includes:
obtaining a driving video of a driving object, permission verification information of the above driving object, and a driven object corresponding to the above driving object, wherein the above driving video is obtained by photographing the expression and posture of the above driving object;
when the permission verification information of the above driving object meets the permission verification conditions of the above driven object, obtaining a character avatar model and a reference image corresponding to the above driven object;
obtaining, according to the above driving video, multi-frame facial geometric rendering images corresponding to the above driving object, wherein the above facial geometric rendering images are used to reflect the expression and posture corresponding to the above driving object;
obtaining a time code corresponding to each of the above facial geometric rendering images, and generating a driven video through the above character avatar model according to the above reference image, each of the above facial geometric rendering images and the time code corresponding to each of the above facial geometric rendering images, wherein in the above driven video the above driven object performs the same expressions and gestures as the driving object in the above driving video.
Optionally, the above reference image is used to provide the above character avatar model with image texture details corresponding to the above driven object, and the image texture details of the above driven video and the above reference image are the same.
Optionally, the above reference image is a 3-channel RGB image.
Optionally, the above obtaining, according to the above driving video, of multi-frame facial geometric rendering images corresponding to the above driving object includes:
splitting the above driving video to obtain multiple frames of driving images;
separately extracting the three-dimensional facial parameters corresponding to each of the above driving images;
obtaining the three-dimensional facial mesh corresponding to each of the above driving images according to the three-dimensional facial parameters corresponding to each of the above driving images;
rendering the three-dimensional facial mesh corresponding to each of the above driving images to obtain the facial geometric rendering image corresponding to each of the above driving images, wherein the above facial geometric rendering image is a grayscale image.
Optionally, before the above obtaining of the three-dimensional facial mesh corresponding to each of the above driving images according to the three-dimensional facial parameters corresponding to each of the above driving images, the above method further includes:
aligning the above three-dimensional facial parameters based on the facial space position corresponding to the above driven object in the above character avatar model, so as to update the above three-dimensional facial parameters.
Optionally, the above three-dimensional facial parameters include individual coefficients, expression coefficients and posture coefficients.
Optionally, the above obtaining of the time code corresponding to each of the above facial geometric rendering images, and generating the driven video through the above character avatar model according to the above reference image, each of the above facial geometric rendering images and the time code corresponding to each of the above facial geometric rendering images, includes:
obtaining the time code corresponding to each of the above facial geometric rendering images according to a preset time-encoding calculation formula;
inputting each set of data to be processed into the above character avatar model in turn, and obtaining a driven image corresponding to each set of the above data to be processed, wherein one set of the above data to be processed consists of the above reference image, one of the above facial geometric rendering images and the time code corresponding to that facial geometric rendering image, and in the above driven image the above driven object performs the same expressions and gestures as in the corresponding facial geometric rendering image;
connecting the above driven images in sequence according to the corresponding time codes to generate the above driven video.
Optionally, the above time code is used to input time information into the above character avatar model, and the above time-encoding calculation formula is: TPE_t = (sin(2^0·π·t), cos(2^0·π·t), …, sin(2^(N-1)·π·t), cos(2^(N-1)·π·t)), where TPE_t represents the time code corresponding to the facial geometric rendering image numbered t, and N is a preset constant.
Optionally, the spatial dimensions of the above time code, the facial geometric rendering image corresponding to the above time code and the above reference image are the same.
Optionally, the above character avatar model is pre-trained according to the following steps:
inputting the reference image in the training data, a training facial geometric rendering image and the training time code corresponding to the training facial geometric rendering image into a deep neural network generator, and generating, through the above deep neural network generator, a training driven image for the above reference image and the above training facial geometric rendering image, wherein the above training data includes a plurality of training image groups, and each training image group includes a reference image corresponding to the above driven object, a training facial geometric rendering image corresponding to the above driving object, the training time code corresponding to that training facial geometric rendering image, and the training driving image corresponding to that training facial geometric rendering image;
adjusting the model parameters of the above deep neural network generator according to the above training driven image and the above training driving image, and continuing to perform the above step of inputting the reference image in the training data, the training facial geometric rendering image and the training time code corresponding to the training facial geometric rendering image into the above character avatar model, until a preset training condition is met, so as to obtain the above character avatar model.
Optionally, the above preset training condition includes convergence of the reconstruction loss between the above training driven image and the above training driving image.
Optionally, the above training data is obtained by processing collected image data through a preset data enhancement method, and the above preset data enhancement method includes spatial random cropping.
Optionally, the above reconstruction loss is calculated through a multi-term joint image reconstruction loss function, and the above multi-term joint image reconstruction loss function is used to combine at least two of L1 reconstruction loss, perceptual loss and GAN discriminator loss.
A second aspect of the present application provides a video processing system based on a character avatar model, the system including:
a driving information acquisition module, configured to acquire a driving video of a driving object, permission verification information of the driving object, and a driven object corresponding to the driving object, where the driving video is obtained by filming the expressions and poses of the driving object;
a permission verification module, configured to acquire a character avatar model and a reference image corresponding to the driven object when the permission verification information of the driving object satisfies the permission verification condition of the driven object;
a driving video processing module, configured to obtain, from the driving video, multiple frames of facial geometry rendering images corresponding to the driving object, where the facial geometry rendering images represent the expressions and poses of the driving object; and
a driven video generation module, configured to obtain the time encoding corresponding to each facial geometry rendering image and to generate a driven video through the character avatar model according to the reference image, each facial geometry rendering image and its corresponding time encoding, where in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
A third aspect of the present application provides an intelligent terminal, the intelligent terminal including a memory, a processor, and a character-avatar-model-based video processing program stored in the memory and executable on the processor, where the video processing program, when executed by the processor, implements the steps of any one of the above video processing methods based on a character avatar model.
As can be seen from the above, in the solution of the present application, a driving video of a driving object, permission verification information of the driving object, and the driven object corresponding to the driving object are acquired, where the driving video is obtained by filming the expressions and poses of the driving object; when the permission verification information of the driving object satisfies the permission verification condition of the driven object, the character avatar model and a reference image corresponding to the driven object are acquired; multiple frames of facial geometry rendering images corresponding to the driving object are obtained from the driving video, where the facial geometry rendering images represent the expressions and poses of the driving object; and the time encoding corresponding to each facial geometry rendering image is obtained, and a driven video is generated through the character avatar model according to the reference image, each facial geometry rendering image and its corresponding time encoding, where in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
Compared with the prior art, the solution of the present application does not merely crop and swap facial regions in images of different objects. Instead, a character avatar model is set up in advance for the driven object. After the driving video corresponding to the driving object is acquired and permission verification is passed, the trained character avatar model and reference image of the corresponding driven object are acquired. Facial geometry rendering images that represent the expressions and poses of the driving object are then obtained from the driving video and, together with the time encodings, are fused with the reference image by the character avatar model to obtain the driven video.
The driven video is not obtained by simply replacing the image of a facial region; it is obtained by fusing the expressions and poses of the driving object with the actual texture of the driven object, so that the driven object performs the same expressions and poses as the driving object in the driving video. Because the facial geometry rendering images represent only expressions and poses and do not carry the actual texture of the driving object's face, and the actual texture is provided solely by the reference image of the driven object, the character avatar model does not mistakenly retain the driving object's texture when fusing the image information; that is, the image texture details of the object shown in the final driven video are the same as those of the driven object. Driving the character avatar model with the video of the driving object in this way helps improve the video presentation effect of the character avatar model, makes it possible to drive the driven object's face to make the corresponding expression with the driving object's expression, and helps improve the effect of video face replacement and the user experience.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a video processing method based on a character avatar model provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of generating a driven video based on the character avatar model of user A, provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the modules of a video processing system based on a character avatar model provided by an embodiment of the present application;
Fig. 4 is a schematic block diagram of the internal structure of an intelligent terminal provided by an embodiment of the present application.
Detailed Description of the Embodiments
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so as to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in the specification of the present application is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in the specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to being classified as". Similarly, the phrase "if it is determined" or "if classified as [the described condition or event]" may be interpreted, depending on the context, as "once determined", "in response to determining", "once classified as [the described condition or event]" or "in response to being classified as [the described condition or event]".
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Many specific details are set forth in the following description to facilitate a full understanding of the present application, but the present application can also be implemented in other ways different from those described here, and those skilled in the art can make similar generalizations without departing from the essence of the present application; therefore, the present application is not limited by the specific embodiments disclosed below.
With the development of science and technology, and of video processing technology in particular, users' requirements for video processing keep increasing. For example, a user may want to perform face replacement through video processing, for instance using a first user's expression in a video to drive a second user's face to make the corresponding expression, so as to achieve an entertainment effect.
In the prior art, the video is usually processed frame by frame: the first user and the second user are each required to record a video, and for every frame of those videos the facial regions in the images of the first user and the second user are cropped and swapped. The problem with this approach is that, after the facial regions are cropped and swapped, although the expression in the facial region of the resulting image corresponding to the second user is the first user's expression, the facial features in that region are in fact still the first user's, so the purpose of using the first user's expression to drive the second user's face to make the corresponding expression is not achieved.
Thus, a video processing scheme that only crops and swaps the facial region in every frame of the two users' videos is not conducive to improving the effect of video face replacement, and cannot drive the second user's face with the first user's expression.
In one application scenario, based on 3D reconstruction, animation and CG rendering technologies, a static person is scanned and reconstructed in 3D with multiple cameras, then bound to the joints of a driving model to achieve motion driving and similar effects, and finally the 2D image is re-rendered through rendering techniques such as relighting and PBR. However, this scheme requires a large amount of additional equipment (such as a multi-camera array) to capture the details of the person in advance for high-precision restoration of geometry and texture, as well as manual 3D modeling adjustments, so the production cost is high.
In another application scenario, portrait videos can be used with deep learning to predict explicit attributes, such as pose and camera position, or implicit feature representations, which can then be adjusted and manipulated to recover a portrait image. For example, one scheme (such as FOMM) learns the keypoint correspondence between the target portrait and the driving portrait in an unsupervised manner, converts it into a dense optical flow field, warps the face image with the optical flow field, and finally generates the image through a generative network; however, this scheme is based on 2D pixel mapping, the face has no 3D consistency, and the dense optical flow field tends to make the background move together with the face. In another scheme (such as DVP), a large number of explicit attributes, including face correspondence maps and the like, are first estimated and combined with image-to-image translation technology to recover a realistic portrait; but because this scheme relies on many explicit attributes such as 3D face correspondence maps, the generated results contain artifacts and blur, and the resulting videos are not smooth. In yet another scheme (such as NerFace), a neural radiance field is used as the renderer to generate high-definition portraits, which improves 3D consistency at large angles, but the rendering results still lose detail and the rendering efficiency is low.
In order to solve at least one of the above problems, the solution of the present application considers both cost and effect, and provides a scheme with higher efficiency, better results and a shorter required model training time, so that the results of video processing and rendering are clearer and more photorealistic; the driving object's expression drives the driven object's face to make the corresponding expression, a corresponding driven video is generated, and the generated driven video is close to a real video.
Specifically, in the solution of the present application, a driving video of a driving object, permission verification information of the driving object, and the driven object corresponding to the driving object are acquired, where the driving video is obtained by filming the expressions and poses of the driving object; when the permission verification information of the driving object satisfies the permission verification condition of the driven object, the character avatar model and a reference image corresponding to the driven object are acquired; multiple frames of facial geometry rendering images corresponding to the driving object are obtained from the driving video, where the facial geometry rendering images represent the expressions and poses of the driving object; and the time encoding corresponding to each facial geometry rendering image is obtained, and a driven video is generated through the character avatar model according to the reference image, each facial geometry rendering image and its corresponding time encoding, where in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
Compared with the prior art, the solution of the present application does not merely crop and swap facial regions in images of different objects. Instead, a character avatar model is set up in advance for the driven object. After the driving video corresponding to the driving object is acquired and permission verification is passed, the trained character avatar model and reference image of the corresponding driven object are acquired. Facial geometry rendering images that represent the expressions and poses of the driving object are then obtained from the driving video and, together with the time encodings, are fused with the reference image by the character avatar model to obtain the driven video.
The driven video is not obtained by simply replacing the image of a facial region; it is obtained by fusing the expressions and poses of the driving object with the actual texture of the driven object, so that the driven object performs the same expressions and poses as the driving object in the driving video. Because the facial geometry rendering images represent only expressions and poses and do not carry the actual texture of the driving object's face, and the actual texture is provided solely by the reference image of the driven object, the character avatar model does not mistakenly retain the driving object's texture when fusing the image information; that is, the image texture details of the object shown in the final driven video are the same as those of the driven object. Driving the character avatar model with the video of the driving object in this way helps improve the video presentation effect of the character avatar model, makes it possible to drive the driven object's face to make the corresponding expression with the driving object's expression, and helps improve the effect of video face replacement and the user experience.
Exemplary Method
As shown in Fig. 1, an embodiment of the present application provides a video processing method based on a character avatar model. Specifically, the method includes the following steps.
Step S100: acquire a driving video of a driving object, permission verification information of the driving object, and a driven object corresponding to the driving object, where the driving video is obtained by filming the expressions and poses of the driving object.
The driving object is the object (for example, user B) whose expressions and poses need to be retained during video processing but whose facial details are not retained. In this embodiment, a character avatar model is trained in advance for the driven object (for example, user A), such as a digital portrait model of the driven object speaking in a specific scene, or a digital portrait model trained from a video of the driven object speaking in a specific scene. The driven object is the object whose facial details need to be retained during video processing (that is, user A). Therefore, the video processing in this embodiment amounts to driving user A's character avatar model with the expressions and poses provided by user B, so that the model generates a corresponding driven video in which user A's figure makes the same expressions and poses as in user B's driving video, thereby achieving the effect of driving user A's figure with user B through video processing. The pose refers to the head pose of the corresponding object, and the expression refers to the facial expression of the corresponding object. The driving video can be obtained by filming the driving object with a camera, mobile phone or other device, and is specifically a video of the driving object speaking or a video containing lip movements, so that modeling and matching can be performed with real lip movements.
In one embodiment, corresponding character avatar models may be trained in advance for multiple other users, and the driving object determines which character avatar model to select and use by specifying the corresponding driven object.
It should be noted that the driving object and the driven object may be animals, animated figures, virtual characters or real people, and they may be the same or different; in this embodiment, real people are used as an example for description, but this is not a specific limitation.
Further, in this embodiment, the images of the head regions of the driving object and the driven object are processed according to the above video processing method, and correspondingly, the character avatar model is also a model for processing head-region images; however, based on this solution, the character avatar model can also be used to process the entire figure in the video, including the head region and the limbs.
Step S200: when the permission verification information of the driving object satisfies the permission verification condition of the driven object, acquire the character avatar model and reference image corresponding to the driven object.
The permission verification information is used to verify whether the driving object has permission to use the character avatar model and/or reference image corresponding to the driven object. Specifically, in order to protect the privacy and security of the driven object and to prevent any arbitrary user from using the driven object's character avatar model to generate videos featuring the driven object, in this embodiment a permission verification condition is set in advance for the driven object's character avatar model; only when the permission verification information of the driving object satisfies the permission verification condition of the driven object can the character avatar model and reference image corresponding to the driven object be acquired. It should be noted that the permission verification condition and the corresponding permission verification information can be set in various ways, such as password matching or authorization through a permission table, which are not specifically limited here. A minimal sketch of such a check is given after this paragraph.
In this embodiment, the reference image provides the character avatar model with the image texture details corresponding to the driven object, and the driven video has the same image texture details as the reference image. Specifically, the driving video provides the expressions and poses, which are combined with the facial image texture details in the reference image to generate the corresponding driven video. The reference image is obtained by photographing the driven object and may also capture the image texture details of the background region of the scene in which the driven object is located, so that the background in the driven video is also the same as in the reference image. It should be noted that image texture details may include details other than expression and pose, such as the facial features, facial characteristics (e.g., wrinkles, glasses) and the image texture of the background region, which are not specifically limited here.
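For illustration only, the following is a minimal sketch of a permission-table check as one possible realization of the verification condition; the function and variable names are assumptions and not part of the original disclosure, and a password-matching policy could be substituted.

```python
# Hypothetical permission check: the table maps each driven object to the set of
# driving users authorized to use its character avatar model and reference image.
def is_authorized(driving_user_id, driven_object_id, permission_table):
    """Return True if the driving user may drive the driven object's avatar model."""
    allowed = permission_table.get(driven_object_id, set())
    return driving_user_id in allowed

# Usage (illustrative): user A authorizes user B.
# permission_table = {"user_A": {"user_B"}}
# is_authorized("user_B", "user_A", permission_table)  -> True
```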
Step S300: obtain, from the driving video, multiple frames of facial geometry rendering images corresponding to the driving object, where the facial geometry rendering images represent the expressions and poses of the driving object.
In this embodiment, consecutive frames of facial geometry rendering images corresponding to the driving object are obtained in turn from the driving video, where each frame of the facial geometry rendering image represents the expression and pose of the driving object but does not retain the image texture details of the corresponding frame of the driving video.
Specifically, in this embodiment, obtaining the multiple frames of facial geometry rendering images corresponding to the driving object from the driving video includes: splitting the driving video into multiple frames of driving images; extracting the three-dimensional facial parameters corresponding to each driving image; obtaining the three-dimensional facial mesh corresponding to each driving image from its three-dimensional facial parameters; and rendering the three-dimensional facial mesh corresponding to each driving image to obtain the facial geometry rendering image corresponding to that driving image, where the facial geometry rendering image is a grayscale image.
The three-dimensional facial parameters include an individual coefficient, an expression coefficient and a pose coefficient. Further, in order to improve the accuracy of video processing as well as the realism of the final driven video and the liveliness of its expressions, in this embodiment, before the three-dimensional facial mesh corresponding to each driving image is obtained from its three-dimensional facial parameters, the method further includes: aligning the three-dimensional facial parameters based on the facial spatial position corresponding to the driven object in the character avatar model, so as to update the three-dimensional facial parameters. That is, the facial rendering images are obtained from the updated three-dimensional facial parameters.
Specifically, the driving video is split in order into image frames to obtain multiple frames of driving images, and 3D face parameter estimation is then performed on each frame of the driving image to obtain the corresponding three-dimensional facial parameters, including the individual coefficient, expression coefficient and pose coefficient. In one application scenario, the three-dimensional facial parameters are extracted by a pre-trained parameter extraction model, which is trained to output the corresponding three-dimensional facial parameters from an input face image and may be a pre-trained neural network model. The expression coefficient represents the expression features of the driving object, such as grinning or pouting and crying; the pose coefficient represents the head pose of the driving object, such as turning the head left or right, nodding, or shaking the head; and the individual coefficient represents the facial characteristics of the driving object, such as face shape, which differs between users, so different driving objects have different individual coefficients. Combining these three kinds of three-dimensional facial parameters makes the expressions in the generated driven video more accurate.
Further, the individual coefficient, expression coefficient and pose coefficient are transformed into the facial space of the driven object's character avatar model (that is, face pose correction is performed), thereby improving the quality of the generated driven video. During the alignment (or transformation) of the three-dimensional facial parameters, the head poses of the driving object and the driven object are aligned, that is, the spatial size and spatial position of the head need to be roughly aligned. In one application scenario, the character avatar model of the driven object stores the three-dimensional facial parameters of the driven object, which can be used to represent the facial spatial position corresponding to the driven object; during the alignment of the driving object's three-dimensional facial parameters, the goal is to align the mean and variance of each coefficient of the transformed driving object's parameters with those of the driven object's parameters.
After the three-dimensional facial parameters (i.e., the individual coefficient, expression coefficient and pose coefficient) corresponding to each driving image are obtained, the three-dimensional facial mesh corresponding to each driving image is computed through a preset 3D face model (for example, BFM or FLAME; the 3D face model is denoted by the function f), where the three-dimensional facial mesh represents the facial geometry in that driving image. Further, a preset renderer (for example, PyTorch3D; the renderer is denoted Render) is used to render each three-dimensional facial mesh to obtain the facial geometry rendering image corresponding to each driving image. In this embodiment, the facial geometry rendering image is a 1-channel grayscale image, and since one driving video corresponds to multiple frames of driving images, it likewise corresponds to multiple frames of facial geometry rendering images (i.e., a set of facial geometry rendering images is obtained).
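For illustration only, the following is a minimal sketch of step S300 in Python. It assumes a pre-trained parameter extractor `estimate_3dmm(frame)`, a BFM/FLAME-style face model callable `face_model(...)` and a mesh `renderer(...)` are available; these names and the optional `align_fn` hook are placeholders, not part of the original disclosure.

```python
import cv2

def video_to_frames(path):
    """Split the driving video into per-frame images (step: video -> driving images)."""
    cap, frames = cv2.VideoCapture(path), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def render_face_geometry(frames, estimate_3dmm, face_model, renderer, align_fn=None):
    """For each driving frame: estimate the individual, expression and pose coefficients,
    optionally align them into the driven object's face space, rebuild the 3D facial mesh
    with the face model f, and render it as a 1-channel grayscale geometry image M_t."""
    geometry_maps = []
    for frame in frames:
        beta, psi, theta = estimate_3dmm(frame)       # 3D facial parameters
        if align_fn is not None:                      # optional alignment (see formula (4))
            beta, psi, theta = align_fn(beta, psi, theta)
        verts, faces = face_model(beta, psi, theta)   # three-dimensional facial mesh
        geometry_maps.append(renderer(verts, faces))  # H x W x 1 grayscale rendering
    return geometry_maps
```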
Step S400: obtain the time encoding corresponding to each facial geometry rendering image, and generate the driven video through the character avatar model according to the reference image, each facial geometry rendering image and its corresponding time encoding, where in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
In this embodiment, after the facial geometry rendering images are obtained, time encodings are added, and the character avatar model then combines them with the reference image to predict and generate the driven object with the current pose and expression, thereby generating the corresponding driven video. Specifically, the reference image is a 3-channel RGB image and can be any image containing the face of the driven object. In this embodiment, the reference image used when applying the character avatar model is the same as the one used when training it, and it may be any frame of the training video containing the driven object that was used to train the model. The reference image provides the texture of the person and the background, enabling the character avatar model (i.e., a trained neural network generator, such as a UNet model) to recover more detail.
It should be noted that, for a given character avatar model, once the reference image has been selected, the same reference image is used globally and is not changed.
Further, obtaining the time encoding corresponding to each facial geometry rendering image and generating the driven video through the character avatar model according to the reference image, each facial geometry rendering image and its corresponding time encoding includes: obtaining the time encoding corresponding to each facial geometry rendering image according to a preset time encoding calculation formula; inputting each group of data to be processed into the character avatar model in turn to obtain the driven image corresponding to each group, where one group of data to be processed consists of the reference image, one facial geometry rendering image and the time encoding corresponding to that facial geometry rendering image, and in the driven image the driven object performs the same expression and pose as in the corresponding facial geometry rendering image; and concatenating the driven images in order of their corresponding time encodings to generate the driven video.
The time encoding is used to feed time information into the character avatar model, so as to improve the temporal stability of the generated driven images. In this embodiment, the time encoding is a 2N-channel encoding of time information whose spatial dimensions are the same as those of the driving image and the facial geometry rendering image, and the values within each channel are identical.
Specifically, the preset time encoding calculation formula is shown in formula (1):
TPE_t = (sin(2^0·πt), cos(2^0·πt), …, sin(2^(N-1)·πt), cos(2^(N-1)·πt))   (1)
where TPE_t denotes the time encoding corresponding to the facial geometry rendering image numbered t, and the facial geometry rendering image numbered t corresponds to the driving image numbered t. In this embodiment, the driving video is split into multiple frames of driving images according to the time position of each frame, and the time position of each driving image is used as its number (i.e., its label), so t can take values starting from 0 (t = 0, 1, 2, 3, 4 …). N is a preset constant that can be set and adjusted according to the actual situation (for example, set to 3), and the number of channels of the time encoding is 2N because there are two groups of encodings, sin and cos.
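For illustration only, the following is a minimal sketch of formula (1); the default values of N, H and W are illustrative choices, and the broadcast to H x W x 2N reflects the statement that every channel holds a single constant value.

```python
import numpy as np

def time_positional_encoding(t, N=3, H=512, W=512):
    """Compute TPE_t as a 2N-channel map with the same spatial size as M_t."""
    values = []
    for k in range(N):
        phase = (2.0 ** k) * np.pi * t
        values.extend([np.sin(phase), np.cos(phase)])
    tpe = np.array(values, dtype=np.float32)            # shape (2N,)
    return np.broadcast_to(tpe, (H, W, 2 * N)).copy()   # constant within each channel
```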
After the time encodings are obtained, the groups of data to be processed can be assembled in turn; one group of data to be processed includes the reference image, one frame of the facial geometry rendering image and one time encoding. The groups are input into the character avatar model one by one to obtain the driven images frame by frame, which are finally combined into the driven video. The subject in the driven video is the driven object, and that subject performs the expressions and poses made by the driving object in the driving video. A sketch of this frame-by-frame generation is given below.
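For illustration only, the following is a minimal CPU-only sketch of this step. It assumes `avatar_model` is the trained image-to-image generator g, `ref_image` the fixed reference image and `geometry_maps` the grayscale renderings M_t (all float arrays in [0, 1] of the same spatial size); feeding the three inputs as concatenated channels and writing frames with OpenCV's VideoWriter are assumptions for the sketch, not details fixed by the original disclosure.

```python
import cv2
import numpy as np
import torch

def generate_driven_video(avatar_model, ref_image, geometry_maps, out_path, fps=25, N=3):
    h, w = ref_image.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    avatar_model.eval()
    with torch.no_grad():
        for t, m_t in enumerate(geometry_maps):
            tpe_t = time_positional_encoding(t, N=N, H=h, W=w)   # formula (1), defined above
            # One group of data to be processed: reference image (3 ch), geometry map (1 ch)
            # and time encoding (2N ch), matching I'_t = g(M_t, TPE_t, I_ref).
            x = np.concatenate([ref_image, m_t, tpe_t], axis=-1).astype(np.float32)
            x = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)  # 1 x (4 + 2N) x H x W
            driven = avatar_model(x)[0].permute(1, 2, 0).clamp(0, 1).numpy()
            writer.write((driven * 255).astype(np.uint8))          # frames in time order
    writer.release()
```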
Specifically, the character avatar model is trained in advance according to the following steps:
a reference image, a training facial geometry rendering image and the training time encoding corresponding to that training facial geometry rendering image are taken from the training data and input into a deep neural network generator, and the deep neural network generator generates a training driven image for the reference image and the training facial geometry rendering image, where the training data includes multiple training image groups, and each training image group includes a reference image corresponding to the driven object, a training facial geometry rendering image corresponding to the driving object, the training time encoding corresponding to that training facial geometry rendering image, and the training driving image corresponding to that training facial geometry rendering image;
the model parameters of the deep neural network generator are adjusted according to the training driven image and the training driving image, and the step of inputting the reference image, the training facial geometry rendering image and the corresponding training time encoding from the training data is performed again, until a preset training condition is met, so as to obtain a trained deep neural network generator, which is used as the character avatar model;
where the preset training condition includes convergence of the reconstruction loss between the training driven image and the training driving image.
Specifically, the training data is obtained by processing collected image data with a preset data augmentation method, the preset data augmentation method including random spatial cropping. The collected image data is the image data directly collected for training, and the training data is the data obtained after applying the data augmentation operations to the collected image data. The reconstruction loss is calculated by a multi-term joint image reconstruction loss function, which combines at least two of an L1 reconstruction loss, a perceptual loss and a GAN discriminator loss.
In this embodiment, the training process of the deep neural network generator and the video processing process based on the character avatar model are further described with a specific application scenario. It should be noted that the data used during training and during use of the model correspond to each other or are the same; for example, the training facial geometry rendering images correspond to the facial geometry rendering images, the difference in name only distinguishing whether the data is used during training or during video processing with the model, and the way either is obtained or processed can be referred to for the other.
Specifically, what is trained in this embodiment is the character avatar model corresponding to the driven object (i.e., user A), so the training facial geometry rendering images used in training correspond to the driven object. The training facial geometry rendering images are obtained from the training three-dimensional facial meshes, which are obtained from the training three-dimensional facial parameters, which are obtained from the training driving images, which are obtained from the filmed training driving video of user A.
Specifically, user A is first filmed to obtain a video of user A speaking (i.e., the training driving video), which is then split in order into multiple frames of training driving images. In this embodiment, the way the video is divided into frames is the same during training and during use of the character avatar model, so during training the time position of each frame is likewise labeled t (t = 0, 1, 2, 3, 4 …). The obtained training driving image is denoted I_t, and 3D face parameter estimation is performed on each frame I_t to obtain the training three-dimensional facial parameters, including the individual coefficient β_t, the expression coefficient (here denoted ψ_t, the original symbol being available only as an embedded image in the publication) and the pose coefficient θ_t. The corresponding training three-dimensional facial mesh is then computed with the preset 3D face model (denoted f), and the preset renderer Render renders it into a 1-channel training facial geometry rendering image M_t ∈ R^(H×W×1), where H and W are the height and width of the training facial geometry rendering image (or of the corresponding training driving image). The rendering process is shown in formula (2):
M_t = Render(f(β_t, ψ_t, θ_t))   (2)
where Render denotes the processing of the renderer and the function f denotes the processing of the 3D face model.
It should be noted that the goal of the training stage of the character avatar model is to train a deep neural network generator that recovers the original training driving image I_t, containing the face, from the training facial geometry rendering image M_t. In addition, a reference image (here denoted I_ref) and the time encoding TPE_t are introduced. The reference image I_ref is a 3-channel RGB image, i.e., I_ref ∈ R^(H×W×3).
Specifically, for the same character avatar model, the way image frames are divided, the reference image used and the way the time encoding is set are the same during training of the deep neural network generator and during video processing based on the character avatar model, so the time encoding during training can be set according to formula (1) above and is not repeated here. In this embodiment, the spatial dimensions of the time encoding TPE_t are the same as those of the training facial geometry rendering image M_t and the reference image I_ref, and TPE_t ∈ R^(H×W×2N).
In this embodiment, the function of the character avatar model (denoted g) is to generate an image containing the face (i.e., the training driven image) I'_t ∈ R^(H×W×3) given M_t, TPE_t and I_ref, as shown in formula (3):
I'_t = g(M_t, TPE_t, I_ref)   (3)
where I'_t denotes the training driven image corresponding to the t-th frame of the training driving images, and g denotes the processing of the character avatar model. The character avatar model (i.e., the neural network generator) used in this embodiment is an image-to-image convolutional neural network (such as a UNet) whose input and output are both images of unchanged spatial size. During training, the neural network generator g is trained so that the predicted training driven image I'_t can reconstruct the corresponding training driving image I_t of the training video; that is, the model parameters of the character avatar model g are iteratively optimized and updated with the goal of minimizing the reconstruction loss between the training driven image I'_t and the training driving image I_t, until the reconstruction loss converges to a minimum. It should be noted that the preset training condition may also include the number of iterations reaching an iteration threshold.
After the trained character avatar model corresponding to the driven object (i.e., user A) is obtained, when the driving object (i.e., user B) wants to drive the driven object to perform corresponding actions and expressions, video of the driving object is captured to obtain the driving video, and video processing is performed to generate the corresponding driven video. Fig. 2 is a schematic flowchart of generating a driven video based on user A's character avatar model, provided by an embodiment of the present application. As shown in Fig. 2, in this embodiment, the 3D face parameters (including the individual coefficient, expression coefficient and pose coefficient) are extracted from each frame of user B's driving images and then transformed into the space corresponding to user A, the goal of the transformation being to align the mean and variance of the transformed coefficients, as shown in formula (4) (the superscripted notation is used here because the original symbols are available only as embedded images in the publication):
(β_t^(A), ψ_t^(A), θ_t^(A)) = T(β_t^(B), ψ_t^(B), θ_t^(B))   (4)
where β_t^(A), ψ_t^(A) and θ_t^(A) denote the individual coefficient, expression coefficient and pose coefficient obtained after transformation into user A's space, T denotes the alignment transformation, and β_t^(B), ψ_t^(B) and θ_t^(B) denote the individual coefficient, expression coefficient and pose coefficient of user B before transformation.
Using β_t^(A), ψ_t^(A) and θ_t^(A), the facial geometry rendering image corresponding to user B is rendered; then, combining the time encoding and the reference image, the character avatar model predicts the driven image of user A with the current pose and expression to obtain the corresponding driven image. The processing can refer to formula (3) above and its specific steps, which are not repeated here.
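For illustration only, the following is a minimal sketch of a mean/variance alignment consistent with formula (4); it assumes the per-coefficient statistics of driving user B and driven user A are available as numpy arrays, and the exact transform used in the original is not disclosed beyond matching the means and variances.

```python
import numpy as np

def align_coefficients(coeff_b, mean_b, std_b, mean_a, std_a, eps=1e-8):
    """Map one coefficient vector of user B into user A's face space so that the
    transformed coefficients take on user A's mean and variance."""
    return (coeff_b - mean_b) / (std_b + eps) * std_a + mean_a

# Usage (illustrative): apply the same transform separately to the individual,
# expression and pose coefficients before rendering the geometry image.
# beta_a = align_coefficients(beta_b, beta_b_mean, beta_b_std, beta_a_mean, beta_a_std)
```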
In this way, this embodiment provides a low-cost and highly photorealistic character avatar model. For user A's character avatar model, user A only needs to use an everyday capture device (such as a mobile phone) to film a corresponding training driving video in one scene (scene 1), for example a speech video about 2 minutes long, and this becomes the material for training the character avatar model. Moreover, from one training driving video, multiple training image groups can be obtained through data augmentation and expansion techniques. In one specific application scenario, training the digital avatar model on a training platform with the above training driving video for about 4 hours yields a digital avatar model that supports arbitrary head movements and expressions of user A in the same scene (scene 1). In subsequent use, user B can drive the digital avatar model by recording a driving video in any scene (scene 2), generating a driven video of user A speaking in scene 1 in which user A's poses and expressions are the same as user B's in the driving video. Meanwhile, this embodiment incorporates the time encoding to ensure that the generated driven video is realistic and natural, with high temporal smoothness and stability, achieving results close to a really filmed video.
Specifically, during training of the character avatar model, data augmentation methods including random spatial cropping can be used, and the neural network generator can be trained by optimizing a multi-term joint image reconstruction loss function (for example, L1 reconstruction loss, perceptual loss and GAN discriminator loss). Training is performed on an NVIDIA A100-SXM4-40GB GPU with a batch size of 20 and an input/output image resolution of 512*512. Here, data augmentation means adding random spatial cropping during training to increase data diversity. For example, the video frames filmed of user A are randomly cropped in space and then input into the neural network generator, the multi-term joint image reconstruction loss function is computed, and the parameters of the neural network generator are adjusted to obtain a neural network generator that meets the training expectations. Computing the loss means computing the loss between the image of the driven object generated by the neural network generator (i.e., the character avatar model) and the corresponding original driving image.
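For illustration only, the following is a minimal sketch of the random spatial cropping augmentation; the crop size is an illustrative choice, and applying one identical crop to every image in a training group is an assumption made so that the geometry map, reference image and target frame stay spatially aligned.

```python
import numpy as np

def random_spatial_crop(images, crop_h=448, crop_w=448):
    """Apply the same random crop to every H x W x C array in one training group."""
    h, w = images[0].shape[:2]
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    return [img[top:top + crop_h, left:left + crop_w] for img in images]
```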
It should be noted that the video processing method based on a character avatar model provided in this embodiment is highly efficient both in model training and in rendering new videos. In one application scenario, when stably converged models are trained under the same conditions (the same duration of training material, image resolution of 512*512), the DVP scheme requires on average 42 hours of training and 0.2 seconds per frame when rendering video, and the NerFace scheme requires on average 55 hours of training and 6 seconds per frame, whereas this embodiment requires on average 4 hours of training and 0.03 seconds per frame, showing that the solution of this embodiment helps improve training and processing efficiency.
Further, the quality of the generated driven video can be evaluated with the Structural Similarity Index (SSIM) and the Peak Signal-to-Noise Ratio (PSNR) as reconstruction metrics; the higher these two metrics, the smaller the gap between the generated driven video and the real driving video, and the higher the quality of the generated driven video. The driven videos generated by the solution of this embodiment have an SSIM greater than 95% and a PSNR greater than 26.85, showing that the quality of the generated driven videos is high.
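For illustration only, the following is a minimal sketch of such an SSIM/PSNR evaluation, assuming scikit-image is available and that frames of the generated driven video are paired with the corresponding ground-truth frames as uint8 RGB arrays; averaging per-frame scores is an illustrative choice.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frames(generated, reference):
    """Return the mean SSIM and mean PSNR over paired frames."""
    ssim_vals, psnr_vals = [], []
    for gen, ref in zip(generated, reference):
        ssim_vals.append(structural_similarity(gen, ref, channel_axis=-1))
        psnr_vals.append(peak_signal_noise_ratio(ref, gen, data_range=255))
    return float(np.mean(ssim_vals)), float(np.mean(psnr_vals))
```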
As can be seen from the above, in the video processing method based on a character avatar model provided by the embodiments of the present application, facial regions in images of different objects are not merely cropped and swapped; instead, a character avatar model is set up in advance for the driven object, and after the driving video corresponding to the driving object is acquired and permission verification is passed, the character avatar model and reference image of the corresponding driven object are acquired. Facial geometry rendering images that represent the expressions and poses of the driving object are then obtained from the driving video and, together with the time encodings, are fused with the reference image by the driven object's character avatar model to obtain the driven video.
The driven video is not obtained by simply replacing the image of a facial region; it is obtained by fusing the expressions and poses of the driving object with the actual texture of the driven object, so that the driven object performs the same expressions and poses as the driving object in the driving video. Because the facial geometry rendering images represent only expressions and poses and do not carry the actual texture of the driving object's face, and the actual texture is provided solely by the reference image of the driven object, the character avatar model does not mistakenly retain the driving object's texture when fusing the image information; that is, the image texture details of the object shown in the final driven video are the same as those of the driven object. Driving the character avatar model with the video of the driving object in this way helps improve the video presentation effect of the character avatar model, makes it possible to drive the driven object's face to make the corresponding expression with the driving object's expression, and helps improve the effect of video face replacement and the user experience.
Exemplary Device
As shown in Fig. 3, corresponding to the above video processing method based on a character avatar model, an embodiment of the present application further provides a video processing system based on a character avatar model, the system including:
a driving information acquisition module 510, configured to acquire a driving video of a driving object, permission verification information of the driving object, and a driven object corresponding to the driving object, where the driving video is obtained by filming the expressions and poses of the driving object;
a permission verification module 520, configured to acquire a character avatar model and a reference image corresponding to the driven object when the permission verification information of the driving object satisfies the permission verification condition of the driven object;
a driving video processing module 530, configured to obtain, from the driving video, multiple frames of facial geometry rendering images corresponding to the driving object, where the facial geometry rendering images represent the expressions and poses of the driving object; and
a driven video generation module 540, configured to obtain the time encoding corresponding to each facial geometry rendering image and to generate a driven video through the character avatar model according to the reference image, each facial geometry rendering image and its corresponding time encoding, where in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
It should be noted that, for the specific structure and implementation of the above video processing system based on a character avatar model and its modules or units, reference may be made to the corresponding descriptions in the above method embodiments, which are not repeated here.
It should be noted that the division of the modules of the above video processing system based on a character avatar model is not unique and is not specifically limited here.
Based on the above embodiments, the present application further provides an intelligent terminal, whose schematic block diagram may be as shown in Fig. 4. The intelligent terminal includes a processor, a memory, a network interface and a display screen connected through a system bus. The processor of the intelligent terminal provides computing and control capabilities. The memory of the intelligent terminal includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a character-avatar-model-based video processing program. The internal memory provides an environment for running the operating system and the video processing program in the non-volatile storage medium. The network interface of the intelligent terminal is used to communicate with external terminals through a network connection. When executed by the processor, the character-avatar-model-based video processing program implements the steps of any one of the above video processing methods based on a character avatar model. The display screen of the intelligent terminal may be a liquid crystal display or an electronic ink display.
Those skilled in the art can understand that the schematic block diagram shown in Fig. 4 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the intelligent terminal to which the solution of the present application is applied; a specific intelligent terminal may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, an intelligent terminal is provided, which includes a memory, a processor, and a character-avatar-model-based video processing program stored in the memory and executable on the processor, where the video processing program, when executed by the processor, implements the steps of any one of the video processing methods based on a character avatar model provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium on which a character-avatar-model-based video processing program is stored, where the video processing program, when executed by a processor, implements the steps of any one of the video processing methods based on a character avatar model provided by the embodiments of the present application.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above functional units and modules is used as an example; in practical applications, the above functions may be assigned to different functional units and modules as needed, that is, the internal structure of the above apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another and are not used to limit the scope of protection of the present application. For the specific working process of the units and modules in the above apparatus, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis; for parts that are not detailed or described in one embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided by the present application, it should be understood that the disclosed system/terminal device and method may be implemented in other ways. For example, the system/terminal device embodiments described above are only illustrative; for instance, the division of the modules or units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source-code form, object-code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction.
The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application and shall all be included within the scope of protection of the present application.

Claims (15)

  1. A video processing method based on a character avatar model, wherein the video processing method based on a character avatar model comprises:
    acquiring a driving video of a driving object, permission verification information of the driving object, and a driven object corresponding to the driving object, wherein the driving video is obtained by filming the expressions and poses of the driving object;
    when the permission verification information of the driving object satisfies the permission verification condition of the driven object, acquiring a character avatar model and a reference image corresponding to the driven object;
    obtaining, from the driving video, multiple frames of facial geometry rendering images corresponding to the driving object, wherein the facial geometry rendering images are used to represent the expressions and poses of the driving object; and
    obtaining a time encoding corresponding to each of the facial geometry rendering images, and generating a driven video through the character avatar model according to the reference image, each of the facial geometry rendering images and the time encoding corresponding to each of the facial geometry rendering images, wherein in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
  2. The video processing method based on a character avatar model according to claim 1, wherein the reference image is used to provide the character avatar model with the image texture details corresponding to the driven object, and the driven video has the same image texture details as the reference image.
  3. The video processing method based on a character avatar model according to claim 1, wherein the reference image is an RGB image with 3 channels.
  4. The video processing method based on a character avatar model according to claim 1, wherein obtaining, from the driving video, the multiple frames of facial geometry rendering images corresponding to the driving object comprises:
    splitting the driving video to obtain multiple frames of driving images;
    extracting the three-dimensional facial parameters corresponding to each of the driving images;
    obtaining, according to the three-dimensional facial parameters corresponding to each of the driving images, the three-dimensional facial mesh corresponding to that driving image; and
    rendering the three-dimensional facial mesh corresponding to each of the driving images to obtain the facial geometry rendering image corresponding to that driving image, wherein the facial geometry rendering image is a grayscale image.
  5. The video processing method based on a character avatar model according to claim 4, wherein before obtaining, according to the three-dimensional facial parameters corresponding to each of the driving images, the three-dimensional facial mesh corresponding to that driving image, the method further comprises:
    aligning the three-dimensional facial parameters based on the facial spatial position corresponding to the driven object in the character avatar model, so as to update the three-dimensional facial parameters.
  6. The video processing method based on a character avatar model according to claim 4, wherein the three-dimensional facial parameters comprise an individual coefficient, an expression coefficient and a pose coefficient.
  7. The video processing method based on a character avatar model according to claim 1, wherein obtaining the time encoding corresponding to each of the facial geometry rendering images, and generating the driven video through the character avatar model according to the reference image, each of the facial geometry rendering images and the time encoding corresponding to each of the facial geometry rendering images, comprises:
    obtaining the time encoding corresponding to each of the facial geometry rendering images according to a preset time encoding calculation formula;
    inputting each group of data to be processed into the character avatar model in turn to obtain the driven image corresponding to each group of data to be processed, wherein one group of data to be processed consists of the reference image, one facial geometry rendering image and the time encoding corresponding to that facial geometry rendering image, and in the driven image the driven object performs the same expression and pose as in the corresponding facial geometry rendering image; and
    concatenating the driven images in order according to their corresponding time encodings to generate the driven video.
  8. The video processing method based on a character avatar model according to claim 7, wherein the time encoding is used to input time information into the character avatar model, and the time encoding calculation formula is: TPE_t = (sin(2^0·πt), cos(2^0·πt), …, sin(2^(N-1)·πt), cos(2^(N-1)·πt)), where TPE_t denotes the time encoding corresponding to the facial geometry rendering image numbered t, and N is a preset constant.
  9. The video processing method based on a character avatar model according to claim 8, wherein the time encoding, the facial geometry rendering image corresponding to the time encoding, and the reference image have the same spatial dimensions.
  10. The video processing method based on a character avatar model according to claim 1, wherein the character avatar model is trained in advance according to the following steps:
    inputting a reference image, a training facial geometry rendering image and the training time encoding corresponding to that training facial geometry rendering image from training data into a deep neural network generator, and generating, through the deep neural network generator, a training driven image for the reference image and the training facial geometry rendering image, wherein the training data comprises multiple training image groups, and each training image group comprises a reference image corresponding to the driven object, a training facial geometry rendering image corresponding to the driving object, the training time encoding corresponding to that training facial geometry rendering image, and the training driving image corresponding to that training facial geometry rendering image; and
    adjusting the model parameters of the deep neural network generator according to the training driven image and the training driving image, and continuing to perform the step of inputting the reference image, the training facial geometry rendering image and the training time encoding corresponding to that training facial geometry rendering image from the training data into the deep neural network generator, until a preset training condition is met, so as to obtain the character avatar model.
  11. The video processing method based on a character avatar model according to claim 10, wherein the preset training condition comprises convergence of the reconstruction loss between the training driven image and the training driving image.
  12. The video processing method based on a character avatar model according to claim 10, wherein the training data is obtained by processing collected image data with a preset data augmentation method, and the preset data augmentation method comprises random spatial cropping.
  13. The video processing method based on a character avatar model according to claim 11, wherein the reconstruction loss is calculated by a multi-term joint image reconstruction loss function, and the multi-term joint image reconstruction loss function is used to combine at least two of an L1 reconstruction loss, a perceptual loss and a GAN discriminator loss.
  14. A video processing system based on a character avatar model, wherein the video processing system based on a character avatar model comprises:
    a driving information acquisition module, configured to acquire a driving video of a driving object, permission verification information of the driving object, and a driven object corresponding to the driving object, wherein the driving video is obtained by filming the expressions and poses of the driving object;
    a permission verification module, configured to acquire a character avatar model and a reference image corresponding to the driven object when the permission verification information of the driving object satisfies the permission verification condition of the driven object;
    a driving video processing module, configured to obtain, from the driving video, multiple frames of facial geometry rendering images corresponding to the driving object, wherein the facial geometry rendering images are used to represent the expressions and poses of the driving object; and
    a driven video generation module, configured to obtain the time encoding corresponding to each of the facial geometry rendering images, and to generate a driven video through the character avatar model according to the reference image, each of the facial geometry rendering images and the time encoding corresponding to each of the facial geometry rendering images, wherein in the driven video the driven object performs the same expressions and poses as the driving object in the driving video.
  15. An intelligent terminal, wherein the intelligent terminal comprises a memory, a processor, and a video processing program based on a character avatar model which is stored in the memory and executable on the processor, and the video processing program based on a character avatar model, when executed by the processor, implements the steps of the video processing method based on a character avatar model according to any one of claims 1-13.
PCT/CN2022/124917 2022-09-16 2022-10-12 Video processing method and system based on character avatar model, and related device WO2024055379A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211130893.2A CN115643349A (zh) 2022-09-16 2022-09-16 Video processing method and system based on character avatar model, and related device
CN202211130893.2 2022-09-16

Publications (1)

Publication Number Publication Date
WO2024055379A1 true WO2024055379A1 (zh) 2024-03-21

Family

ID=84942019

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124917 WO2024055379A1 (zh) 2022-09-16 2022-10-12 Video processing method and system based on character avatar model, and related device

Country Status (2)

Country Link
CN (1) CN115643349A (zh)
WO (1) WO2024055379A1 (zh)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020057209A1 (zh) * 2018-09-21 2020-03-26 Tencent Technology (Shenzhen) Company Limited Information display method and apparatus for virtual object, device, and storage medium
CN114845065A (zh) * 2019-09-30 2022-08-02 Shenzhen SenseTime Technology Co., Ltd. Video image processing method and apparatus, electronic device, and storage medium
CN111368137A (zh) * 2020-02-12 2020-07-03 Baidu Online Network Technology (Beijing) Co., Ltd. Video generation method and apparatus, electronic device, and readable storage medium
CN113269872A (zh) * 2021-06-01 2021-08-17 Guangdong University of Technology Synthetic video generation method based on three-dimensional face reconstruction and video keyframe optimization
CN114255496A (zh) * 2021-11-30 2022-03-29 Beijing Dajia Internet Information Technology Co., Ltd. Video generation method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN115643349A (zh) 2023-01-24

Similar Documents

Publication Publication Date Title
CN106682632B (zh) Method and apparatus for processing face images
CN113269872A (zh) Synthetic video generation method based on three-dimensional face reconstruction and video keyframe optimization
CN113628327B (zh) Head three-dimensional reconstruction method and device
CN116109798B (zh) Image data processing method, apparatus, device and medium
KR102353556B1 (ko) Apparatus for generating an avatar reproducing expressions and poses based on a user's face
CN109191366B (zh) Multi-view human image synthesis method and apparatus based on human pose
CN112233212A (zh) Portrait editing and synthesis
CN116310076A (zh) Three-dimensional reconstruction method, apparatus, device and storage medium based on neural radiance fields
CN116583878A (zh) Method and system for personalized 3D head model deformation
JP2016085579A (ja) Image processing apparatus and method for an interactive device, and the interactive device
CN115914505B (zh) Video generation method and system based on a voice-driven digital human model
CN111008927A (zh) Face replacement method, storage medium and terminal device
CN115239857B (zh) Image generation method and electronic device
CN116997933A (zh) Method and system for constructing facial position maps
Elgharib et al. Egocentric videoconferencing
CN111754622B (zh) Facial three-dimensional image generation method and related device
US20230230304A1 (en) Volumetric capture and mesh-tracking based machine learning 4d face/body deformation training
CN111640172A (zh) Pose transfer method based on generative adversarial networks
KR20230110787A (ko) Methods and systems for forming personalized 3D head and face models
CN115512014A (zh) Method for training an expression-driven generation model, expression driving method and apparatus
CN115984447A (zh) Image rendering method, apparatus, device and medium
CN115393480A (zh) Speaker synthesis method, apparatus and storage medium based on dynamic neural textures
CN111028318A (zh) Virtual face synthesis method, system, apparatus and storage medium
CN116863069A (zh) Three-dimensional light field face content generation method, electronic device and storage medium
CN116863044A (zh) Face model generation method and apparatus, electronic device, and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22958569

Country of ref document: EP

Kind code of ref document: A1