WO2023221684A1 - Digital human generation method and apparatus, and storage medium - Google Patents

Digital human generation method and apparatus, and storage medium

Info

Publication number
WO2023221684A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
video
image
information
training
Prior art date
Application number
PCT/CN2023/087271
Other languages
French (fr)
Chinese (zh)
Inventor
王林芳
张炜
石凡
张琪
申童
左佳伟
梅涛
Original Assignee
京东科技控股股份有限公司
Priority date
Filing date
Publication date
Application filed by 京东科技控股股份有限公司
Publication of WO2023221684A1


Classifications

    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06T 3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/40 Scenes; scene-specific elements in video content

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a digital human generation method and device and a storage medium.
  • Some embodiments of the present disclosure propose a digital human generation method, including:
  • a second video is output based on each frame image in the processed first video.
  • the first video is obtained by preprocessing the original video, and the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • the resolution adjustment includes:
  • if the resolution of the original video is higher than the required preset resolution, the original video is downsampled according to the preset resolution to obtain the first video with the preset resolution;
  • if the resolution of the original video is lower than the required preset resolution, a super-resolution model is used to process the original video to obtain the first video with the preset resolution; the super-resolution model is used to increase the resolution of the input video to the preset resolution.
  • the super-resolution model is obtained by training a neural network.
  • the first video frame from the high-definition video is downsampled according to the preset resolution to obtain the second video frame.
  • the second video frame is used as the input of the neural network
  • the first video frame is used as the supervision information of the output of the neural network
  • the neural network is trained to obtain a super-resolution model.
  • the frame rate adjustment includes:
  • if the frame rate of the original video is higher than the required preset frame rate, frames are extracted from the original video based on the ratio information between the frame rate of the original video and the preset frame rate to obtain the first video with the preset frame rate;
  • if the frame rate of the original video is lower than the required preset frame rate, the video frame interpolation model is used to interpolate the original video up to a first frame rate, where the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate; frames are then extracted from the interpolated video based on the ratio information between the first frame rate and the preset frame rate to obtain the first video with the preset frame rate.
  • the video frame interpolation model is used to generate a transition frame between any two frame images.
  • the video frame interpolation model is obtained by training a neural network.
  • three consecutive frames in the training video frame sequence are regarded as a triplet; the first frame and the third frame in the triplet are used as the input of the neural network, the second frame in the triplet is used as the supervision information of the output of the neural network, and the neural network is trained to obtain a video frame interpolation model.
  • the input of the neural network includes: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame.
  • editing the characters in each frame of the first video according to the character customization information corresponding to the interaction scene includes one or more of the following:
  • editing the characters in each frame of the first video based on the character customization information corresponding to the interaction scene includes: determining character image adjustment parameters based on the character image adjustments made by the user in some video frames in the first video, and editing the characters in the remaining video frames in the first video according to the character image adjustment parameters.
  • editing the characters in the remaining video frames in the first video according to the character adjustment parameters includes:
  • according to the target part indicated by the character image adjustment parameters, the target part of the character in the remaining video frames in the first video is located through key point detection;
  • according to the amplitude or position information in the character image adjustment parameters, the located target part is adjusted in amplitude or position through graphics transformation.
  • the character expression customization information includes preset classification information corresponding to the target expression, and the character expressions in each frame image in the first video are edited according to the character expression customization information corresponding to the interaction scene.
  • the fused image corresponding to each frame of image is generated, and all the fused images form a second video in which the facial expression is the target expression.
  • obtaining the feature information of each frame of the image in the first video, the feature information of key facial points, and the classification information of the original expression includes:
  • the characteristic information of each frame of image is input into the expression classification model to obtain the classification information of the original expression of each frame of image.
  • fusing the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
  • the feature information of the facial key points of each frame image multiplied by the first weight obtained by training, the feature information of each frame image multiplied by the second weight obtained by training, and the classification information of the fused expression corresponding to each frame image are spliced.
  • generating the fused image corresponding to each frame of image based on the feature information of the fused image corresponding to each frame of image includes:
  • the facial feature extraction model includes a convolution layer
  • the decoder includes a deconvolution layer
  • the training method of the expression generation model includes:
  • Each frame image of the first training video is input into the first generator to obtain the feature information of each frame image of the first training video, the feature information of the facial key points, and the classification information of the original expression; the first generator fuses the feature information of each frame image of the first training video, the feature information of the facial key points, the classification information of the original expression, and the preset classification information corresponding to the target expression to obtain the feature information of each frame fused image corresponding to the first training video; according to the feature information of each frame fused image corresponding to the first training video, each frame fused image corresponding to the first training video output by the first generator is obtained;
  • Each frame image of the second training video is input into the second generator to obtain the feature information of each frame image of the second training video, the feature information of the facial key points, and the classification information of the target expression; the second generator fuses the feature information of each frame image of the second training video, the feature information of the facial key points, the classification information of the target expression, and the preset classification information corresponding to the original expression to obtain the feature information of each frame fused image corresponding to the second training video; according to the feature information of each frame fused image corresponding to the second training video, each frame fused image corresponding to the second training video output by the second generator is obtained;
  • the first generator and the second generator are trained according to the adversarial loss and the cycle consistency loss; after training is completed, the first generator is used as the expression generation model.
  • the method further includes: determining a pixel-to-pixel loss based on the pixel difference between each two adjacent frames of fused images corresponding to the first training video and the pixel difference between each two adjacent frames of fused images corresponding to the second training video;
  • training the first generator and the second generator according to the adversarial loss and the cycle-consistent loss includes:
  • the first generator and the second generator are trained based on the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss.
  • determining the adversarial loss based on each frame fused image corresponding to the first training video and each frame fused image corresponding to the second training video includes: inputting each frame fused image corresponding to the first training video into the first discriminator to obtain the first discrimination result of each frame fused image corresponding to the first training video;
  • the first adversarial loss is determined based on the first discrimination result of each frame of the fused image corresponding to the first training video
  • the second adversarial loss is determined based on the second discrimination result of each frame of the fused image corresponding to the second training video.
  • inputting the fused images of each frame corresponding to the first training video into the first discriminator, and obtaining the first discrimination result of the fused image of each frame corresponding to the first training video includes:
  • inputting the fused images of each frame corresponding to the second training video into the second discriminator to obtain the second discrimination result of the fused images of each frame corresponding to the second training video includes:
  • the cycle consistency loss is determined using the following method:
  • Each frame fused image corresponding to the first training video is input into the second generator to generate a reconstructed image of each frame of the first training video, and each frame fused image corresponding to the second training video is input into the first generator to generate a reconstructed image of each frame of the second training video;
  • the pixel-to-pixel loss is determined using the following method:
  • the first loss and the second loss are summed to obtain the pixel-to-pixel loss.
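As a non-authoritative illustration of the pixel-to-pixel loss described above, the following sketch (assuming PyTorch tensors of shape (T, C, H, W)) sums the mean absolute differences between adjacent fused frames of the two training videos; the tensor names and the mean reduction are assumptions, not details from the disclosure.

```python
import torch

def pixel_to_pixel_loss(fused_a: torch.Tensor, fused_b: torch.Tensor) -> torch.Tensor:
    """fused_a / fused_b: fused frames for the first / second training video, shape (T, C, H, W)."""
    # First loss: pixel difference between each two adjacent fused frames of the first training video.
    loss_a = (fused_a[1:] - fused_a[:-1]).abs().mean()
    # Second loss: the same quantity for the fused frames of the second training video.
    loss_b = (fused_b[1:] - fused_b[:-1]).abs().mean()
    # The two losses are summed to obtain the pixel-to-pixel loss.
    return loss_a + loss_b
```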
  • obtaining the feature information of each frame image of the first training video, the feature information of the facial key points, and the classification information of the original expression includes: inputting each frame image of the first training video into the third facial feature extraction model in the first generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into the first facial key point detection model in the first generator to obtain the coordinate information of the facial key points of each frame image; using principal component analysis to reduce the dimensionality of the coordinate information of all facial key points to obtain first information of a preset dimension as the feature information of the facial key points of each frame image of the first training video; and inputting the feature information of each frame image of the first training video into the third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video;
  • Obtaining the feature information of each frame image of the second training video, the feature information of the facial key points, and the classification information of the target expression includes: inputting each frame image of the second training video into the fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into the second facial key point detection model in the second generator to obtain the coordinate information of the facial key points of each frame image; using principal component analysis to reduce the dimensionality of the coordinate information of all facial key points to obtain second information of a preset dimension as the feature information of the facial key points of each frame image of the second training video; and inputting the feature information of each frame image of the second training video into the fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
  • fusing the feature information of each frame image of the first training video, the feature information of the facial key points, the classification information of the original expression, and the preset classification information corresponding to the target expression includes: adding and averaging the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression to obtain the classification information of the fused expression corresponding to each frame image of the first training video; and splicing the feature information of the facial key points of each frame image of the first training video multiplied by the first weight to be trained, the feature information of each frame image of the first training video multiplied by the second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video;
  • fusing the feature information of each frame image of the second training video, the feature information of the facial key points, the classification information of the target expression, and the preset classification information corresponding to the original expression includes: adding and averaging the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression to obtain the classification information of the fused expression corresponding to each frame image of the second training video; and splicing the feature information of the facial key points of each frame image of the second training video multiplied by the third weight to be trained, the feature information of each frame image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video.
  • training the first generator and the second generator based on the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss includes: weighting and summing the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss to obtain a total loss, and training the first generator and the second generator according to the total loss.
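A minimal sketch of the weighted summation, assuming PyTorch tensors; the weight values are illustrative hyperparameters and are not specified in the disclosure.

```python
import torch

def total_loss(adv_loss: torch.Tensor, cycle_loss: torch.Tensor, pixel_loss: torch.Tensor,
               w_adv: float = 1.0, w_cyc: float = 10.0, w_pix: float = 1.0) -> torch.Tensor:
    # Weighted sum of the three losses; the weights are assumed hyperparameters.
    return w_adv * adv_loss + w_cyc * cycle_loss + w_pix * pixel_loss
```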
  • editing the character actions in each frame of the image in the first video based on the character action customization information corresponding to the interaction scene includes:
  • the feature information of each second human body key point and its neighborhood is input into the image generation model, and the target key frame of the character during the second action is output.
  • the method for obtaining the image generation model includes: using a training video frame and the human body key points of the character in the training video frame as a pair of training data, and using the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of the image generation network
  • the training video frames in the training data are used as the supervision information of the output of the image generation network
  • the image generation network is trained to obtain the image generation model.
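For illustration only, a training loop of this kind might look as follows in PyTorch; the data loader, the feature encoding of the key points and their neighborhoods, and the L1 supervision are assumptions rather than details from the disclosure.

```python
import torch
import torch.nn.functional as F

def train_image_generator(model, optimizer, loader, epochs=10):
    """Minimal training-loop sketch under assumed data shapes: the loader yields
    (features, frame) pairs, where `features` encodes the human body key points and
    the image features of their neighborhoods, and `frame` is the training video frame."""
    model.train()
    for _ in range(epochs):
        for features, frame in loader:
            pred = model(features)          # generated character image
            loss = F.l1_loss(pred, frame)   # the training frame supervises the network output
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```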
  • the first human body key points include the human body outline feature points of the character during the first action
  • the second human body key points include the human body outline feature points of the character during the second action.
  • Some embodiments of the present disclosure provide a digital human generation device, including: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the digital human generation method described in any of the above embodiments.
  • Some embodiments of the present disclosure provide a digital human generation device, including:
  • an acquisition unit configured to acquire the first video
  • the customization unit is configured to edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene;
  • the output unit is configured to output the second video according to each frame image in the processed first video.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the steps of the digital human generation method described in various embodiments are implemented.
  • Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
  • FIG. 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
  • Figure 2 shows a schematic diagram of video preprocessing according to some embodiments of the present disclosure.
  • Figure 3A shows a schematic flowchart of an expression generation method according to some embodiments of the present disclosure.
  • Figure 3B shows a schematic diagram of an expression generation method according to other embodiments of the present disclosure.
  • Figure 3C shows a schematic flowchart of a training method for an expression generation model according to some embodiments of the present disclosure.
  • Figure 3D shows a schematic diagram of a training method of an expression generation model according to some embodiments of the present disclosure.
  • FIG. 4A shows a schematic diagram of the human body outline feature points of the character during the first action according to some embodiments of the present disclosure.
  • FIG. 4B shows a schematic diagram of the human body outline feature points of the character during the second action according to some embodiments of the present disclosure.
  • FIG. 4C shows a schematic diagram of multiple key points and multiple key connections on a character according to some embodiments of the present disclosure.
  • Figure 5 shows a schematic structural diagram of a digital human generation device according to some embodiments of the present disclosure.
  • Figure 6 shows a schematic structural diagram of a digital human generation device according to other embodiments of the present disclosure.
  • Embodiments of the present disclosure edit the characters in the video based on the character customization information corresponding to the interaction scene, and generate a digital human video that matches the interaction scene through character editing.
  • Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
  • the digital human generation method of this embodiment includes the following steps.
  • step S110 the first video is obtained.
  • the first video may be a recorded original video, or may be obtained by preprocessing the original video.
  • the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • step S120 edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene.
  • editing the characters in each frame image in the first video according to the character customization information corresponding to the interaction scene includes one or more of the following: editing the character images in each frame image in the first video according to the character image customization information corresponding to the interaction scene to generate a digital human image that matches the interaction scene; editing the character expressions in each frame image in the first video according to the character expression customization information corresponding to the interaction scene to generate digital human expressions that match the interaction scene; and editing the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene to generate digital human actions that match the interaction scene.
  • step S130 a second video is output based on each frame image in the processed first video.
  • each frame image in the processed first video is combined to form a second video
  • the second video is a digital human video matching the interaction scene.
  • the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing.
  • a digital human image, digital human expression, etc. that match the interaction scene are generated.
  • FIG. 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
  • the digital human generation method of this embodiment includes the following steps.
  • step S210 logic control is customized.
  • Customized logic control is used to control whether custom logic such as video preprocessing, image customization, expression customization, action customization, etc. is executed and the order of execution.
  • the edited content of each part, such as video preprocessing, image customization, expression customization, and action customization, is relatively independent, with no strong dependence between the parts. Therefore, the execution order of the parts can be exchanged while still achieving the basic effect of generating a digital human video that matches the interaction scene. However, there is still a certain mutual influence between the parts; following the execution sequence of S220 to S250 in this embodiment minimizes this mutual influence and yields a better final character presentation.
  • step S220 video preprocessing.
  • Video preprocessing is to preprocess the recorded original video to obtain the first video.
  • the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • the preprocessing is performed sequentially in the order of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • in this order, the effect of video preprocessing is better and the visual information of the original video is retained to the greatest extent, ensuring that the preprocessed video does not suffer from quality problems such as blurring and distortion, and that frame rate adjustment and resolution adjustment have the least impact on the subsequent digital human customization process.
  • the resolution adjustment includes: if the resolution of the original video is higher than the required preset resolution, downsampling the original video according to the preset resolution to obtain the first video with the preset resolution; if the resolution of the original video is lower than the required preset resolution, using a super-resolution model to process the original video to obtain the first video with the preset resolution, where the super-resolution model is used to increase the resolution of the input video to the preset resolution; and if the resolution of the original video is equal to the required preset resolution, skipping the resolution adjustment step.
  • the resolution of the preprocessed first video can be maintained consistent, and the impact of the differentiated resolution of the original video on the digital human customization effect can be reduced.
  • the super-resolution model is, for example, obtained by training a neural network.
  • the first video frame from the high-definition video is downsampled according to the preset resolution to obtain the second video frame; the second video frame is used as the input of the neural network, the first video frame is used as the supervision information of the output of the neural network, and the neural network is trained to obtain the super-resolution model.
  • the gap information between the video frame output by the neural network and the first video frame is used as the loss function, and the parameters of the neural network are iteratively updated according to the loss determined by the loss function until the loss meets certain conditions and the training is completed.
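A hedged PyTorch sketch of one such training step is shown below; the bicubic downsampling and the L1 loss are assumed concrete choices, since the text only speaks of downsampling and of the gap between the network output and the first video frame.

```python
import torch
import torch.nn.functional as F

def sr_training_step(model, optimizer, hd_frame, low_res_size):
    """One illustrative training step for the super-resolution network.

    hd_frame: a first video frame taken from the high-definition video, shape (N, C, H, W).
    low_res_size: (height, width) used for downsampling, assumed to be derived from the preset resolution.
    """
    # Downsample the HD frame to obtain the second video frame (the network input).
    lr_frame = F.interpolate(hd_frame, size=low_res_size, mode="bicubic", align_corners=False)
    # The network restores the resolution; the original HD frame supervises the output,
    # and the gap between output and HD frame drives the parameter update.
    output = model(lr_frame)
    loss = F.l1_loss(output, hd_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```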
  • neural networks are a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow methods, generative adversarial networks, etc.
  • the first video with resolutions such as 480p/720p/1080p can be obtained from the original video of any resolution.
  • 360p/480p/720p/1080p is a video display format
  • P means progressive scan.
  • the picture resolution of 1080p is 1920 × 1080.
  • inter-frame smoothing is used here to ensure that textures and characters remain smooth during video playback, without jagged edges or moiré patterns that would affect the viewing experience.
  • the inter-frame smoothing process may, for example, adopt an average smoothing process.
  • the image information of three consecutive frames is averaged, and the average is used as the image information of the middle frame among the three consecutive frames.
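A minimal NumPy sketch of this three-frame average smoothing, assuming the frames are stacked along the first axis:

```python
import numpy as np

def smooth_frames(frames: np.ndarray) -> np.ndarray:
    """frames: array of shape (T, H, W, C); the first and last frames are kept unchanged."""
    f = frames.astype(np.float32)
    smoothed = f.copy()
    # Average the image information of three consecutive frames and use it as the middle frame.
    smoothed[1:-1] = (f[:-2] + f[1:-1] + f[2:]) / 3.0
    return smoothed.astype(frames.dtype)
```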
  • the frame rate adjustment includes: if the frame rate of the original video is higher than the required preset frame rate, extracting frames from the original video based on the ratio between the frame rate of the original video and the preset frame rate to obtain the first video with the preset frame rate; if the frame rate of the original video is lower than the required preset frame rate, using the video frame interpolation model to interpolate the original video up to a first frame rate, where the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and then extracting frames from the interpolated video according to the ratio between the first frame rate and the preset frame rate to obtain the first video with the preset frame rate, the video frame interpolation model being used to generate a transition frame between any two frame images; and if the frame rate of the original video is equal to the required preset frame rate, skipping the frame rate adjustment step.
  • the frame rate of the preprocessed first video can be maintained consistent, and the impact of the differentiated frame rate of the original video on the digital human customization effect can be reduced.
  • the frame interpolation operation can also effectively solve the jump problem between two actions; for example, when a digital human finishes action A and then moves to action B, a video played without frame interpolation makes the character's movements appear to jump, which is not realistic enough.
  • frame interpolation is used to insert several transition frames between the key frames of the two actions, so that when the interpolated video is played, the transition between the character's movements feels natural and more realistic.
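The frame-rate logic can be sketched as follows (integer frame rates assumed; the function and its return values are illustrative only):

```python
from math import lcm

def frame_rate_plan(src_fps: int, preset_fps: int):
    """Decide how to reach the preset frame rate."""
    if src_fps == preset_fps:
        return "keep"
    if src_fps > preset_fps:
        # Extract frames based on the ratio between the source and preset frame rates.
        return ("extract", src_fps / preset_fps)
    # Otherwise interpolate up to the least common multiple of the two rates (the first
    # frame rate), then extract frames according to the ratio between the first frame
    # rate and the preset frame rate.
    first_fps = lcm(src_fps, preset_fps)
    return ("interpolate_then_extract", first_fps, first_fps // preset_fps)
```

For example, frame_rate_plan(24, 30) would interpolate the video up to 120 fps (the least common multiple of 24 and 30) and then keep every fourth frame.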
  • the video frame insertion model is, for example, obtained by training a neural network.
  • three consecutive frames in the training video frame sequence are regarded as a triplet; the first frame and the third frame in the triplet are used as the input of the neural network, the second frame in the triplet is used as the supervision information of the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
  • the gap between the video frame output by the neural network from the first and third frames of the input triplet and the second frame of the triplet is used as the loss function, and the parameters of the neural network are iteratively updated according to the loss determined by the loss function until the loss meets certain conditions and training is completed.
  • the trained neural network is used as the video frame interpolation model and can generate transition frames between any two frame images.
  • neural networks are a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow methods, generative adversarial networks, etc.
  • the input to the neural network includes, for example: visual feature information and depth information of the first frame and the third frame, as well as optical flow information and deformation information between the first frame and the third frame.
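For illustration, one triplet training step could look like the following PyTorch sketch; the auxiliary tensor packing the visual features, depth, optical flow and deformation information, and the L1 loss, are assumptions.

```python
import torch
import torch.nn.functional as F

def interpolation_training_step(model, optimizer, triplet, aux):
    """triplet: (frame1, frame2, frame3) for three consecutive frames; aux stands in for
    the precomputed visual features, depth, optical flow and deformation information."""
    frame1, frame2, frame3 = triplet
    # The first and third frames (plus auxiliary information) are the network input;
    # the middle frame is the supervision signal for the generated transition frame.
    pred_middle = model(frame1, frame3, aux)
    loss = F.l1_loss(pred_middle, frame2)   # gap between prediction and the second frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```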
  • step S230 image customization.
  • the characters in each frame of the image in the first video are edited to meet the user's needs for digital human beauty and body beautification.
  • image customization includes, for example, skin smoothing, face slimming, eye enlargement, facial feature position adjustment, and body proportion adjustment such as waist slimming and leg lengthening, among other beauty and body beautification operations.
  • the character adjustment parameters are determined based on the character adjustments made by the user in some video frames in the first video, and the character images in the remaining video frames in the first video are adjusted according to the character adjustment parameters.
  • The "some video frames" above may be, for example, one or several key frames in the first video.
  • editing the characters in the remaining video frames in the first video according to the character image adjustment parameters includes: according to the target part indicated by the character image adjustment parameters, locating the target part of the character in the remaining video frames in the first video, such as the facial features or the human body, through key point detection; and according to the amplitude or position information in the character image adjustment parameters, adjusting the amplitude or position of the located target part through graphics transformation.
  • Taking eye enlargement as an example, the face is first located through face detection technology, the eyes of the character in the remaining video frames are then located through key point detection technology, and the eyes are enlarged through graphics transformation according to the amplitude information specified by the user, for example the adjustment amplitude of the distance between the upper and lower eyelids, so as to achieve the big-eye beauty effect for the characters in all frames of the video.
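The disclosure only states that the located part is adjusted "through graphics transformation"; as one possible concrete interpretation, the sketch below enlarges an eye by locally rescaling the patch around its key points with OpenCV. It is an assumed simplification, not the patented transformation.

```python
import cv2
import numpy as np

def enlarge_eye(image: np.ndarray, eye_points: np.ndarray, scale: float = 1.15) -> np.ndarray:
    """Crude local-scaling sketch of a 'big eyes' adjustment.

    eye_points: (N, 2) eye key points found by key-point detection, assumed well inside the frame.
    scale: adjustment amplitude taken from the user's edit on the key frame.
    """
    h_img, w_img = image.shape[:2]
    x, y, w, h = cv2.boundingRect(eye_points.astype(np.int32))
    pad = int(0.4 * max(w, h))                      # small neighborhood around the eye for blending
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    x1, y1 = min(x + w + pad, w_img), min(y + h + pad, h_img)
    patch = image[y0:y1, x0:x1]
    # Scale the patch up and crop it back to the original window, keeping it centered,
    # which visually enlarges the eye while leaving the rest of the frame untouched.
    big = cv2.resize(patch, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
    dy = (big.shape[0] - patch.shape[0]) // 2
    dx = (big.shape[1] - patch.shape[1]) // 2
    out = image.copy()
    out[y0:y1, x0:x1] = big[dy:dy + patch.shape[0], dx:dx + patch.shape[1]]
    return out
```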
  • step S240 expression customization.
  • Expression customization refers to an expression generation method that edits the character expressions in each frame of the image in the first video based on the character expression customization information corresponding to the interaction scene, such as the preset classification information corresponding to the target expression, so as to control the digital human's facial expressions in the interaction scene. One expression state of the digital human can be transferred to another target expression state while ensuring that only the digital human's facial expression changes and the mouth shape, head movements, and so on are not affected. Therefore, when the digital human expresses the corresponding language content, the expression can change accordingly with the language content.
  • Figure 3A is a flow chart of some embodiments of the expression generation method of the present disclosure. As shown in Figure 3A, the method in this embodiment includes: steps S310 to S330.
  • step S310 the characteristic information of each frame of the image in the first video, the characteristic information of the key points of the face, and the classification information of the original expression are obtained.
  • the facial expressions in the first video are the original expressions. That is, the human facial expression in each frame image in the first video is mainly the original expression, and the original expression is, for example, a calm expression.
  • each frame image in the first video is input into the facial feature extraction model to obtain the output feature information of each frame image; the feature information of each frame image is input into the facial key point detection model to obtain the coordinate information of the facial key points of each frame image; and Principal Components Analysis (PCA) is used to reduce the dimensionality of the coordinate information of all facial key points to obtain information of a preset dimension, which is used as the feature information of the facial key points;
  • the feature information of each frame of image is input into the expression classification model to obtain the classification information of the original expression of each frame of image.
  • the overall expression generation model includes an encoder and a decoder.
  • the encoder can include a facial feature extraction model, a facial key point detection model and an expression classification model.
  • the facial feature extraction model connects the facial key point detection model and the expression classification model.
  • the facial feature extraction model can use existing models, such as VGG-19, ResNet, Transformer and other deep learning models with feature extraction functions.
  • the part before VGG-19 block 5 can be used as a facial feature extraction model.
  • the facial key point detection model and the expression classification model can also use existing models, such as an MLP (multi-layer perceptron); specifically, a 3-layer MLP can be used.
  • the feature information of each frame of the image in the first video is, for example, the Feature Map output by the facial feature extraction model.
  • the key points include, for example, 68 key points such as the chin, eyebrow center, and mouth corners; each key point is expressed by the horizontal-axis and vertical-axis coordinates of its location.
  • PCA is used to reduce the dimensionality of the coordinate information of all facial key points to obtain information of a preset dimension (for example, 6 dimensions, which can achieve the best effect), used as the feature information of the facial key points.
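A minimal sketch of this dimensionality reduction with scikit-learn, assuming the 68 (x, y) key points of all frames are stacked into one array:

```python
import numpy as np
from sklearn.decomposition import PCA

def keypoint_features(all_keypoints: np.ndarray, n_dims: int = 6) -> np.ndarray:
    """all_keypoints: shape (num_frames, 68, 2) -> returns (num_frames, n_dims) key-point features."""
    flat = all_keypoints.reshape(len(all_keypoints), -1)     # (num_frames, 136) coordinate vector per frame
    return PCA(n_components=n_dims).fit_transform(flat)      # reduce to the preset dimension
```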
  • the expression classification model can output the classification of several expressions such as neutral, happy, sad, etc., which can be represented by one-hot encoded vectors.
  • the classification information of the original expression may be the one-hot encoding of the classification of the original expression in each frame of the image in the first video obtained through the expression classification model.
  • step S320 the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the feature information of the fused image corresponding to each frame of image.
  • the classification information of the original expression of each frame of image and the preset classification information corresponding to the target expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame of image;
  • the feature information of the facial key points of each frame image multiplied by the first weight obtained by training, the feature information of each frame image multiplied by the second weight obtained by training, and the classification information of the fused expression corresponding to each frame image are spliced.
  • the target expression is different from the original expression, for example, a smile expression
  • the preset classification information corresponding to the target expression is, for example, a preset one-hot code of the target expression.
  • the preset classification information does not need to be obtained through the model, and can be directly encoded using the preset encoding rules (one-hot).
  • a calm expression is coded as 1000 and a smiling expression is coded as 0100.
  • the aforementioned classification information of the original expression is obtained through the expression classification model. This classification information can be different from the preset classification information corresponding to the original expression.
  • for example, the original expression is a calm expression and its preset one-hot code is 1000, but the encoding obtained by the expression classification model may be 0.8 0.2 0 0.
  • the encoder can also include a feature fusion model, which inputs the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression into the feature fusion model for fusion.
  • the parameters that need to be trained in the feature fusion model include the first weight and the second weight. For each frame image, the first weight obtained by training is multiplied by the feature information of the facial key points of the image to obtain the first feature vector, and the second weight obtained by training is multiplied by the feature information of the image to obtain the second feature vector; the first feature vector, the second feature vector, and the classification information of the fused expression corresponding to the image are spliced to obtain the feature information of the fused image corresponding to the image.
  • the first weight and the second weight can unify the value ranges of the three types of information.
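The fusion step can be sketched as a small PyTorch module with the two trainable weights; flattening the image feature map before splicing is an assumption made to keep the example short.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the feature-fusion step with the two trainable scalar weights."""

    def __init__(self):
        super().__init__()
        self.w_keypoints = nn.Parameter(torch.ones(1))   # first weight (trained)
        self.w_image = nn.Parameter(torch.ones(1))       # second weight (trained)

    def forward(self, image_feat, keypoint_feat, orig_cls, target_cls):
        # Average the classification information of the original expression with the
        # preset classification information of the target expression.
        fused_cls = (orig_cls + target_cls) / 2.0
        # Weight the two kinds of features and splice everything together.
        return torch.cat([self.w_keypoints * keypoint_feat,
                          self.w_image * image_feat.flatten(1),
                          fused_cls], dim=1)
```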
  • step S330 a fused image corresponding to each frame of image is generated based on the feature information of the fused image corresponding to each frame of image, and all the fused images are combined to form a second video in which the facial expression is the target expression.
  • the feature information of the fused image corresponding to each frame of image is input to the decoder, and the generated fused image corresponding to each frame of image is output.
  • the facial feature extraction model includes convolutional layers
  • the decoder includes deconvolutional layers that can generate images based on features.
  • the decoder is, for example, block 5 of VGG-19, which replaces the last convolutional layer with a deconvolutional layer.
  • the fused image is an image whose facial expression is the target expression, and the fused images of each frame form a second video.
  • a feature map is obtained after feature extraction. Face key point detection and expression classification are performed based on the feature map.
  • PCA is performed on the coordinate information of the key points obtained by facial key point detection, and the dimensionality is reduced to a preset dimension to obtain the key point features.
  • the classification information of the original expression is one-hot encoded and fused with the preset classification information corresponding to the target expression to obtain the expression classification vector (the classification information of the fused expression). Then, the feature map of the face, the expression classification vector and the key point features are fused to obtain the feature information of the fused image, and the feature information of the fused image is decoded to obtain the face image of the target expression.
  • the solution of the above embodiment extracts the feature information of each frame image in the first video, the feature information of the facial key points, and the classification information of the original expression, fuses the extracted information with the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame image, and then generates the fused image corresponding to each frame image based on that feature information; all the fused images can form a second video in which the facial expression is the target expression.
  • feature information of key points on the human face is extracted and used for feature fusion to make the expressions in the fused image more realistic and smooth.
  • the target expression is generated directly and is compatible with the character's facial movements and mouth shape in the original image, without affecting the character's mouth shape, head movements, or the clarity of the original image, making the generated video stable, clear, and smooth.
  • Figure 3C is a flow chart of some embodiments of the training method of the expression generation model of the present disclosure.
  • the expression generation model can output a second video in which the facial expression is the target expression based on the input first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression.
  • the method in this embodiment includes: steps S410 to S450.
  • step S410 a training pair consisting of each frame image of the first training video and each frame image of the second training video is obtained.
  • the first training video is a video in which the facial expression is the original expression
  • the second training video is a video in which the facial expression is the target expression.
  • Each frame image of the first training video does not need to correspond frame by frame to each frame image of the second training video; the classification information of the original expression and the classification information of the target expression are labeled.
  • each frame image of the first training video is input into the first generator to obtain the feature information of each frame image of the first training video, the feature information of the facial key points, and the classification information of the original expression; the feature information of each frame image of the first training video, the feature information of the facial key points, the classification information of the original expression, and the preset classification information corresponding to the target expression are fused to obtain the feature information of each frame fused image corresponding to the first training video; and according to this feature information, each frame fused image corresponding to the first training video output by the first generator is obtained.
  • each frame image of the first training video is input into the third facial feature extraction model in the first generator to obtain the output feature information of each frame image; the feature information of each frame image is input into the first facial key point detection model in the first generator to obtain the coordinate information of the facial key points of each frame image; principal component analysis is used to reduce the dimensionality of the coordinate information of all facial key points to obtain first information of a preset dimension as the feature information of the facial key points of each frame image of the first training video; and the feature information of each frame image of the first training video is input into the third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video.
  • the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame image of the first training video; the feature information of the facial key points of each frame image of the first training video multiplied by the first weight to be trained, the feature information of each frame image of the first training video multiplied by the second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video are spliced to obtain the feature information of each frame fused image corresponding to the first training video.
  • the first generator includes a first feature fusion model, and the first weight and the second weight are parameters to be trained in the first feature fusion model.
  • the first generator includes a first encoder and a first decoder.
  • the first encoder includes a third facial feature extraction model, a first facial key point detection model, a third expression classification model, and a first feature fusion model; the feature information of each frame fused image corresponding to the first training video is input into the first decoder to obtain the generated frame fused images corresponding to the first training video.
  • each frame image of the second training video is input into the second generator to obtain the feature information of each frame image of the second training video, the feature information of the facial key points, and the classification information of the target expression; the feature information of each frame image of the second training video, the feature information of the facial key points, the classification information of the target expression, and the preset classification information corresponding to the original expression are fused to obtain the feature information of each frame fused image corresponding to the second training video; and according to this feature information, each frame fused image corresponding to the second training video output by the second generator is obtained.
  • the second generator is structurally identical or similar to the first generator, and the training goal of the second generator is to generate a video with the same expression as the first training video based on the second training video.
  • each frame image of the second training video is input into the fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; the feature information of each frame image is input into the second facial key point detection model in the second generator to obtain the coordinate information of the facial key points of each frame image; principal component analysis is used to reduce the dimensionality of the coordinate information of all facial key points to obtain second information of a preset dimension as the feature information of the facial key points of each frame image of the second training video; and the feature information of each frame image of the second training video is input into the fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
  • the feature information of the face key points in each frame image of the second training video has the same dimension as the feature information of the face key points in each frame image of the first training video, for example, 6 dimensions.
  • the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame image of the second training video; the feature information of the facial key points of each frame image of the second training video multiplied by the third weight to be trained, the feature information of each frame image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video are spliced to obtain the feature information of each frame fused image corresponding to the second training video.
  • the preset classification information corresponding to the original expression does not need to be obtained through the model, and can be directly encoded using the preset encoding rules.
  • the second generator includes a second feature fusion model, and the third weight and the fourth weight are parameters to be trained in the second feature fusion model.
  • the second generator includes a second encoder and a second decoder.
  • the second encoder includes a fourth facial feature extraction model, a second facial key point detection model, a fourth expression classification model, and a second feature fusion model; the feature information of each frame fused image corresponding to the second training video is input into the second decoder to obtain the generated frame fused images corresponding to the second training video.
  • step S440 the adversarial loss and the cycle-consistent loss are determined based on the fused images of each frame corresponding to the first training video and the fused images of each frame corresponding to the second training video.
  • End-to-end training based on generative adversarial learning and cross-domain transfer learning can improve the accuracy of the model and improve training efficiency.
  • the adversarial loss is determined using the following method: the fused images of each frame corresponding to the first training video are input into the first discriminator to obtain the first discrimination result of each frame fused image corresponding to the first training video; the fused images of each frame corresponding to the second training video are input into the second discriminator to obtain the second discrimination result of each frame fused image corresponding to the second training video; the first adversarial loss is determined based on the first discrimination result of each frame fused image corresponding to the first training video, and the second adversarial loss is determined based on the second discrimination result of each frame fused image corresponding to the second training video.
  • Each fused frame image corresponding to the first training video is input into the first facial feature extraction model in the first discriminator to obtain the output feature information of each fused frame image corresponding to the first training video; the feature information of each fused frame image corresponding to the first training video is input into the first expression classification model in the first discriminator to obtain the expression classification information of each fused frame image corresponding to the first training video as the first discrimination result. Each fused frame image corresponding to the second training video is input into the second facial feature extraction model in the second discriminator to obtain the output feature information of each fused frame image corresponding to the second training video; the feature information of each fused frame image corresponding to the second training video is input into the second expression classification model in the second discriminator to obtain the expression classification information of each fused frame image corresponding to the second training video as the second discrimination result.
  • the overall model includes two sets of generators and discriminators.
  • the structures of the first discriminator and the second discriminator are the same or similar, and both include facial feature extraction models and expression classification models.
  • the structures of the first facial feature extraction model, the second facial feature extraction model, the third facial feature extraction model and the fourth facial feature extraction model are the same or similar.
  • the structures of the first expression classification model, the second expression classification model, the third expression classification model and the fourth expression classification model are the same or similar.
  • the first generator G is used to realize X ⁇ Y, and is trained to make G(X) as close as possible to Y.
  • the first discriminator D Y is used to determine whether the fused images of each frame corresponding to the first training video are true or false.
  • the first adversarial loss can be expressed by the following formula:
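  • The formula itself is not reproduced in this text; assuming the standard adversarial form implied by the description of G and D Y, it would read:

```latex
\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
  = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log D_Y(y)\right]
  + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log\left(1 - D_Y(G(x))\right)\right]
```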
  • the second generator F is used to realize Y ⁇ X, and is trained to make F(Y) as close as possible to X.
  • the second discriminator D X is used to determine whether the fused frame images corresponding to the second training video are true or false. The second adversarial loss can be expressed by the following formula:
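  • Again the formula is not reproduced here; the symmetric counterpart, under the same assumption, would be:

```latex
\mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D_X(x)\right]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log\left(1 - D_X(F(y))\right)\right]
```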
  • The cycle-consistent loss is determined using the following method: the fused frame images corresponding to the first training video are input into the second generator to generate reconstructed images of each frame of the first training video, and the fused frame images corresponding to the second training video are input into the first generator to generate reconstructed images of each frame of the second training video; the cycle-consistent loss is then determined based on the difference between the reconstructed images of each frame of the first training video and the frame images of the first training video, and the difference between the reconstructed images of each frame of the second training video and the frame images of the second training video.
  • The images generated by the first generator are input into the second generator to obtain the reconstructed images of each frame of the first training video; the reconstructed images of each frame of the first training video generated by the second generator should be as consistent as possible with the frame images of the first training video, that is, F(G(x)) ≈ x. Likewise, the images generated by the second generator are input into the first generator to obtain the reconstructed images of each frame of the second training video, which should be as consistent as possible with the frame images of the second training video, that is, G(F(y)) ≈ y.
  • The difference between the reconstructed images of each frame of the first training video and the frame images of the first training video can be determined as follows: for each reconstructed frame of the first training video and the corresponding frame image of the first training video, determine the distance (such as the Euclidean distance) between the representation vectors of each pair of pixels at the same position in the reconstructed image and the corresponding image, and sum all the distances.
  • The difference between the reconstructed images of each frame of the second training video and the frame images of the second training video can be determined in the same way: for each reconstructed frame of the second training video and the corresponding frame image of the second training video, determine the distance (such as the Euclidean distance) between the representation vectors of each pair of pixels at the same position in the reconstructed image and the corresponding image, and sum all the distances.
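  • Formalizing the per-pixel Euclidean-distance reading above (an assumed notation, since the original formula is not reproduced here), with x_i and y_j denoting frames and p indexing pixel positions, the cycle-consistent loss can be written as:

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F)
  = \sum_{i}\sum_{p}\bigl\lVert F(G(x_i))[p] - x_i[p] \bigr\rVert_2
  + \sum_{j}\sum_{p}\bigl\lVert G(F(y_j))[p] - y_j[p] \bigr\rVert_2
```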
  • In step S450, the first generator and the second generator are trained according to the adversarial loss and the cycle-consistent loss.
  • the first adversarial loss, the second adversarial loss and the cycle-consistent loss can be weighted and summed to obtain the total loss, and the first generator and the second generator are trained based on the total loss.
  • where L cyc (G, F) represents the cycle-consistent loss and λ is a weight, which can be obtained through training.
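  • The weighted sum referred to above is not reproduced in the extracted text; a plausible form, with a single trainable weight λ on the cycle-consistent term, is:

```latex
\mathcal{L}(G, F, D_X, D_Y)
  = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
  + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G, F)
```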
  • In some embodiments, a loss based on the pixel difference between every two adjacent frames of the video is added during training.
  • The pixel-to-pixel loss is determined based on the pixel difference between every two adjacent fused frame images corresponding to the first training video and the pixel difference between every two adjacent fused frame images corresponding to the second training video. The first generator and the second generator are then trained based on the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss.
  • Specifically, for each position in every two adjacent fused frame images corresponding to the first training video, the distance between the representation vectors of the two pixels at that position is determined, and the distances for all positions are summed to obtain a first loss; for each position in every two adjacent fused frame images corresponding to the second training video, the distance between the representation vectors of the two pixels at that position is determined, and the distances for all positions are summed to obtain a second loss; the first loss and the second loss are summed to obtain the pixel-to-pixel loss. The pixel-to-pixel loss keeps the generated video from changing too much between adjacent frames.
  • the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss are weighted and summed to obtain a total loss; the first generator and the second generator are trained according to the total loss.
  • where λ1, λ2 and λ3 are weights that can be obtained through training, L P2P (G(x i ), G(x i+1 )) represents the first loss, and L P2P (F(y j ), F(y j+1 )) represents the second loss.
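  • Combining the three weights with the pixel-to-pixel terms named above, the total loss presumably takes a form along these lines (a reconstruction, not copied from the original):

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_1\bigl(\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)\bigr)
  + \lambda_2\,\mathcal{L}_{\mathrm{cyc}}(G, F)
  + \lambda_3\Bigl(\textstyle\sum_i L_{\mathrm{P2P}}\bigl(G(x_i), G(x_{i+1})\bigr)
  + \sum_j L_{\mathrm{P2P}}\bigl(F(y_j), F(y_{j+1})\bigr)\Bigr)
```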
  • In some embodiments, each part of the model can be pre-trained before end-to-end training. For example, a large amount of open-source face recognition data is first used to pre-train a face recognition model, and the feature map output before the classification part is taken as the facial feature extraction model (the choice of model for this part is not unique; taking VGG-19 as an example, the part before block 5 outputs an 8×8×512-dimensional feature map).
  • the facial feature extraction model and its parameters are then fixed, and the network is divided into two branches.
  • the two branches are the facial key point detection model and the expression classification model.
  • the two branches are fine-tuned using a facial key point detection dataset and an expression classification dataset respectively; the fine-tuning only trains the parameters of these two branch structures.
  • the choice of the face key point detection model is not unique; any convolutional-network-based model that can obtain accurate key points can be integrated into the present solution;
  • the expression classification model is a single-label classification task based on the convolutional network model. After pre-training, an end-to-end training process can be performed based on the foregoing embodiments. This can improve training efficiency.
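  • A sketch of the VGG-19 truncation mentioned above, shown with torchvision's ImageNet weights as a stand-in for the face-recognition pre-training described in the text (the exact layer index and the 128×128 input size are assumptions; under these assumptions the truncation yields the 8×8×512 feature map cited above):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a VGG-19 backbone (ImageNet weights here as a stand-in for the
# face-recognition pre-training that the disclosure actually uses).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# vgg.features[:28] keeps blocks 1-4 (up to and including the 4th max-pool),
# i.e. the part "before block 5"; its parameters are then frozen.
face_feature_extractor = nn.Sequential(*list(vgg.features.children())[:28])
for p in face_feature_extractor.parameters():
    p.requires_grad = False

x = torch.randn(1, 3, 128, 128)   # assumed input size
feat = face_feature_extractor(x)
print(feat.shape)                 # torch.Size([1, 512, 8, 8])
```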
  • the method of the above embodiment uses adversarial loss, cycle consistent loss, and pixel loss between two adjacent frames of the video to train the overall model, which can improve the accuracy of the model, and the end-to-end training process can improve efficiency and save computing resources.
  • the disclosed solution is suitable for editing facial expressions in videos.
  • This disclosure adopts a dedicated deep learning model that integrates expression recognition, key point detection and other techniques; through data training it learns the movement rules of facial key points under different expressions, and the output facial expression state is finally controlled by inputting the classification information of the target expression into the model.
  • Because the expression exists as a style state, it can be well superimposed when the character speaks or performs actions such as tilting the head or blinking, making the final output facial action video of the character natural and consistent.
  • the output result can have the same resolution and detail level as the input image, and the output result can still be stable, clear, and flawless at 1080p or even 2k resolution.
  • In step S250, action customization is performed.
  • Action customization refers to editing and processing the character actions in each frame of the image in the first video based on the character action customization information corresponding to the interactive scene, so as to realize the editing and control of digital human actions in the interactive scene.
  • Editing the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene includes: adjusting the first human body key points of the character in the original first key frame of the first video during the first action to obtain the second human body key points of the character during the second action, which serve as the character action customization information; using a feature extraction model, such as a convolution kernel model, to extract the feature information of the neighborhood of each second human body key point from the original first key frame; and inputting each second human body key point and the feature information of its neighborhood into the image generation model to output the target first key frame of the character during the second action.
  • the first human body key points include the character's human body outline feature points during the first action, such as the 14 pairs of white dots shown in Figure 4A.
  • the second human body key points include the character's human body outline feature points during the second action, such as the 14 pairs of white dots shown in Figure 4B.
  • Using human body outline feature points to edit character movements, as opposed to using human skeleton feature points, makes the generated character movements more accurate and less prone to deformation, distortion and the like, improving the quality of the generated images.
  • Extracting the human body contour feature points of the character during the first action includes, for example: using a semantic segmentation network model to extract the contour line of the character; using a target detection network model to extract multiple key points on the character, such as the black circle points shown in Figure 4C; connecting the multiple key points according to the structural information of the character to determine multiple key connection lines, such as the white straight lines shown in Figure 4C; and determining the pairs of human body contour feature points during the first action of the character according to the intersection points of the perpendiculars of the key connection lines with the contour line.
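  • The following sketch illustrates one way such contour feature point pairs could be computed from a person segmentation mask and a pair of detected key points, by marching outward along the perpendicular of their connection line until the mask is left; the function and its inputs are hypothetical and only approximate the intersection computation described above.

```python
import numpy as np

def contour_point_pair(mask, p1, p2, step=1.0, max_steps=500):
    """Given a binary person mask (H, W) and two key points p1, p2 (x, y) on a
    key connection line, return the pair of contour feature points where the
    perpendicular through the midpoint of p1-p2 meets the contour."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mid = (p1 + p2) / 2.0
    d = p2 - p1
    # unit vector perpendicular to the key connection line
    perp = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-8)

    def march(direction):
        pt = mid.copy()
        for _ in range(max_steps):
            nxt = pt + step * direction
            y, x = int(round(nxt[1])), int(round(nxt[0]))
            if (y < 0 or y >= mask.shape[0] or x < 0 or x >= mask.shape[1]
                    or mask[y, x] == 0):
                return pt  # last point still inside the mask ~ contour point
            pt = nxt
        return pt

    return march(perp), march(-perp)

# usage sketch: a toy rectangular "person" mask and a vertical limb line
mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:80, 40:60] = 1
left, right = contour_point_pair(mask, (50, 30), (50, 70))
print(left, right)  # approximately (40, 50) and (59, 50)
```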
  • The method of obtaining the image generation model includes: using a training video frame and the human body key points of the character in the training video frame as a pair of training data, using the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of the image generation network, using the training video frame in the training data as supervision information for the output of the image generation network, and training the image generation network to obtain the image generation model.
  • The gap between the video frame output by the image generation network for the input data and the training video frame is used as the loss function, and the parameters of the image generation network are iteratively updated according to the loss determined by the loss function until the loss satisfies a preset condition and training is complete.
  • At that point, the video frames output by the image generation network are very close to the training video frames, and the trained image generation network is used as the image generation model.
  • image generation networks are a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow methods, generative adversarial networks, etc. If the image generation network is a generative adversarial network, the total loss function also includes the discriminant loss function of the image discriminant network.
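  • As a minimal sketch of the training procedure described above (the generator module, the dataset interface and the L1 reading of the "gap" are assumptions; a discriminator loss term would be added for a generative adversarial network):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_image_generator(net: nn.Module, dataset, epochs=10, lr=1e-4):
    """dataset yields (keypoint_neighborhood_features, target_frame) pairs:
    the key points plus their neighborhood features as input, and the
    training video frame as supervision."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    recon_loss = nn.L1Loss()   # gap between generated and training frame

    for _ in range(epochs):
        for kp_feats, target_frame in loader:
            generated = net(kp_feats)
            loss = recon_loss(generated, target_frame)
            # for a generative adversarial network, a discriminator loss
            # term would be added to `loss` here
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net   # trained network used as the image generation model
```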
  • In step S260, rendering and output are performed.
  • The character image is modeled using the material results processed in steps S220 to S250.
  • Different rendering technologies can be selected according to the application scenario and combined with artificial intelligence technologies such as intelligent dialogue, speech recognition, speech synthesis, and action interaction, so that a complete digital human video (i.e. the second video) capable of interacting in the scene is output.
  • In the above embodiments, the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing;
  • for example, a digital human image, digital human expressions, digital human actions and the like that match the interaction scene are generated.
  • According to the method of the disclosed embodiments, recording one set of character videos can quickly produce multiple sets of videos with different character styles for different scenes. Moreover, professional engineers are not required to operate the system; users can adjust the character's image, expressions, actions and the like according to the needs of the scene.
  • Figure 5 shows a schematic structural diagram of a digital human generation device according to some embodiments of the present disclosure.
  • the digital human generation device 500 of this embodiment includes units 510 to 530.
  • the acquisition unit 510 is configured to acquire the first video. For details, see step S220.
  • the customization unit 520 is configured to edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene. For details, see steps S230-250.
  • the customization unit 520 includes, for example, an image customization unit 521, an expression customization unit 522, an action customization unit 523, and the like.
  • the image customization unit 521 is configured to edit the character images in each frame image in the first video according to the character image customization information corresponding to the interaction scene.
  • the expression customization unit 522 is configured to edit the character expressions in each frame image in the first video according to the character expression customization information corresponding to the interaction scene.
  • the action customization unit 523 is configured to edit the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene. For details, see step S250.
  • the output unit 530 is configured to output the second video according to each frame image in the processed first video. For details, see step S260.
  • Figure 6 shows a schematic structural diagram of a digital human generation device according to other embodiments of the present disclosure.
  • the digital human generation device 600 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610.
  • the processor 620 is configured to execute the digital human generation method of any of the foregoing embodiments based on instructions stored in the memory 610.
  • the memory 610 may include, for example, system memory, fixed non-volatile storage media, etc.
  • System memory stores, for example, operating systems, applications, boot loaders, and other programs.
  • the processor 620 can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates or transistors, or other discrete hardware components.
  • the device 600 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected through a bus 660, for example.
  • the input and output interface 630 provides a connection interface for input and output devices such as a monitor, mouse, keyboard, and touch screen.
  • Network interface 640 provides a connection interface for various networked devices.
  • the storage interface 650 provides a connection interface for external storage devices such as SD cards and USB disks.
  • Bus 660 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the digital human generation method in any of the foregoing embodiments.
  • Embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more non-transitory computer-readable storage media (including, but not limited to, disk memory, CD-ROM, optical storage, etc.) containing computer program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.


Abstract

The present disclosure relates to the technical field of computers. Provided are a digital human generation method and apparatus, and a storage medium. The method comprises: acquiring a first video; according to character customization information corresponding to an interaction scene, performing editing processing on characters in each frame of image in the first video; and outputting a second video according to each frame of image in the processed first video. Editing processing is performed on characters in a video according to character customization information corresponding to an interaction scene, and a digital human video that matches the interaction scene is generated by means of character editing.

Description

Digital human generation method and device and storage medium
Cross-reference to related applications
This application is based on, and claims priority to, the CN application with application number 202210541984.9 filed on May 18, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a digital human generation method and device and a storage medium.
Background
Driven by the wave of new technologies such as artificial intelligence and virtual reality, the performance of digital humans has improved in all respects. Digital humans represented by virtual anchors, virtual employees and the like have successfully entered the public eye and are flourishing in diverse forms in many fields such as film and television, games, media, culture and tourism, and finance.
The customization of digital human images strives for authenticity and personalization. Under the requirement of photographic-level hyper-realism, every detail of the digital human image will attract the attention of users. This places high demands on models when recording image material. However, a model is not a robot, and cannot perfectly match the timing and action positioning of the interaction scene in which the image will be used.
Summary of the invention
Some embodiments of the present disclosure propose a digital human generation method, including:
acquiring a first video;
editing the characters in each frame image in the first video according to character customization information corresponding to an interaction scene; and
outputting a second video according to each frame image in the processed first video.
In some embodiments, the first video is obtained by preprocessing an original video, and the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
In some embodiments, the resolution adjustment includes:
if the resolution of the original video is higher than a required preset resolution, downsampling the original video according to the preset resolution to obtain the first video with the preset resolution;
if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain the first video with the preset resolution, where the super-resolution model is used to increase the resolution of the input video to the preset resolution.
In some embodiments, the super-resolution model is obtained by training a neural network. During training, a first video frame from a high-definition video is downsampled according to the preset resolution to obtain a second video frame; the second video frame is used as the input of the neural network, the first video frame is used as supervision information for the output of the neural network, and the neural network is trained to obtain the super-resolution model.
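As a minimal illustration of this training setup (the downsampling factor, the L1 choice of loss and the network `net` are assumptions for the sketch, not details from the disclosure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sr_training_step(net: nn.Module, hd_frame: torch.Tensor, scale: int = 2):
    """One super-resolution training step: the high-definition frame is
    downsampled to form the network input (the 'second video frame'), and
    the original frame supervises the network output."""
    h, w = hd_frame.shape[-2:]
    lr_frame = F.interpolate(hd_frame, size=(h // scale, w // scale),
                             mode="bilinear", align_corners=False)
    restored = net(lr_frame)                 # upscaled back to (h, w)
    return F.l1_loss(restored, hd_frame)     # loss against the original frame
```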
In some embodiments, the frame rate adjustment includes:
if the frame rate of the original video is higher than a required preset frame rate, extracting frames from the original video according to the ratio between the frame rate of the original video and the preset frame rate to obtain the first video with the preset frame rate;
if the frame rate of the original video is lower than the required preset frame rate, interpolating the original video up to a first frame rate with a video frame interpolation model, where the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and then extracting frames from the interpolated video according to the ratio between the first frame rate and the preset frame rate to obtain the first video with the preset frame rate, where the video frame interpolation model is used to generate a transition frame between any two frame images.
In some embodiments, the video frame interpolation model is obtained by training a neural network. During training, three consecutive frames in a training video frame sequence are taken as a triplet; the first and third frames of the triplet are used as the input of the neural network, the second frame of the triplet is used as supervision information for the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
In some embodiments, the input of the neural network includes: visual feature information and depth information of the first and third frames, and optical flow information and deformation information between the first and third frames.
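As an illustration of the least-common-multiple logic above, a sketch in plain Python; the `interpolate_fn` callable stands in for the video frame interpolation model and is hypothetical:

```python
from math import lcm

def adjust_frame_rate(frames, src_fps, target_fps, interpolate_fn):
    """Sketch of the frame-rate adjustment described above.
    interpolate_fn(a, b, t) returns a transition frame between frames
    a and b at position 0 < t < 1."""
    if src_fps >= target_fps:
        # downsample by keeping every (src_fps / target_fps)-th frame
        step = src_fps / target_fps
        return [frames[int(i * step)] for i in range(int(len(frames) / step))]

    # interpolate up to the least common multiple of the two rates, then decimate
    first_rate = lcm(src_fps, target_fps)
    factor = first_rate // src_fps          # frames per original gap
    dense = []
    for a, b in zip(frames[:-1], frames[1:]):
        dense.append(a)
        dense.extend(interpolate_fn(a, b, k / factor) for k in range(1, factor))
    dense.append(frames[-1])
    step = first_rate // target_fps
    return dense[::step]
```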
In some embodiments, editing the characters in each frame image in the first video according to the character customization information corresponding to the interaction scene includes one or more of the following:
editing the character images in each frame image in the first video according to character image customization information corresponding to the interaction scene;
editing the character expressions in each frame image in the first video according to character expression customization information corresponding to the interaction scene;
editing the character actions in each frame image in the first video according to character action customization information corresponding to the interaction scene.
In some embodiments, editing the character images in each frame image in the first video according to the character image customization information corresponding to the interaction scene includes: determining character image adjustment parameters according to the character image adjustments made by the user on some of the video frames in the first video, and editing the character images in the remaining video frames in the first video according to the character image adjustment parameters.
In some embodiments, editing the character images in the remaining video frames in the first video according to the character image adjustment parameters includes:
locating, through key point detection, the target part of the character in the remaining video frames in the first video according to the target part of the character image adjustment in the character image adjustment parameters;
adjusting the amplitude or position of the located target part through graphics transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
In some embodiments, the character expression customization information includes preset classification information corresponding to a target expression, and editing the character expressions in each frame image in the first video according to the character expression customization information corresponding to the interaction scene includes:
obtaining feature information of each frame image in the first video, feature information of face key points and classification information of the original expression;
fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain feature information of a fused image corresponding to each frame image;
generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image, where all the fused images form a second video in which the facial expression is the target expression.
In some embodiments, obtaining the feature information of each frame image in the first video, the feature information of the face key points and the classification information of the original expression includes:
inputting each frame image in the first video into a facial feature extraction model to obtain the output feature information of each frame image;
inputting the feature information of each frame image into a face key point detection model to obtain coordinate information of the face key points of each frame image, and reducing the dimensionality of the coordinate information of all face key points by principal component analysis to obtain information of a preset dimension as the feature information of the face key points;
inputting the feature information of each frame image into an expression classification model to obtain the classification information of the original expression of each frame image.
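As an illustration of the principal component analysis step above, a sketch using scikit-learn; the number of detected key points (68) and the number of frames are assumptions, while the 6-dimensional output matches the example dimension mentioned earlier in this description:

```python
import numpy as np
from sklearn.decomposition import PCA

# assume 68 detected face key points per frame, each with (x, y) coordinates
frames_keypoints = np.random.rand(500, 68, 2)               # 500 frames, for illustration
flat = frames_keypoints.reshape(len(frames_keypoints), -1)  # (500, 136)

pca = PCA(n_components=6)                    # preset dimension, e.g. 6
keypoint_features = pca.fit_transform(flat)  # (500, 6) per-frame key point features
print(keypoint_features.shape)
```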
In some embodiments, fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
adding and averaging the classification information of the original expression of each frame image and the preset classification information corresponding to the target expression to obtain classification information of a fused expression corresponding to each frame image;
concatenating the feature information of the face key points of each frame image multiplied by a first weight obtained through training, the feature information of each frame image multiplied by a second weight obtained through training, and the classification information of the fused expression corresponding to each frame image.
In some embodiments, generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image includes:
inputting the feature information of the fused image corresponding to each frame image into a decoder, which outputs the generated fused image corresponding to each frame image;
where the facial feature extraction model includes convolutional layers and the decoder includes deconvolution layers.
In some embodiments, the first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression are input into an expression generation model, which outputs the second video in which the facial expression is the target expression. The training method of the expression generation model includes:
obtaining training pairs composed of the frame images of a first training video and the frame images of a second training video;
inputting each frame image of the first training video into a first generator; obtaining feature information of each frame image of the first training video, feature information of face key points and classification information of the original expression; fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain feature information of each fused frame image corresponding to the first training video; and obtaining, according to the feature information of each fused frame image corresponding to the first training video, each fused frame image corresponding to the first training video output by the first generator;
inputting each frame image of the second training video into a second generator; obtaining feature information of each frame image of the second training video, feature information of face key points and classification information of the target expression; fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression to obtain feature information of each fused frame image corresponding to the second training video; and obtaining, according to the feature information of each fused frame image corresponding to the second training video, each fused frame image corresponding to the second training video output by the second generator;
determining an adversarial loss and a cycle-consistent loss according to each fused frame image corresponding to the first training video and each fused frame image corresponding to the second training video;
training the first generator and the second generator according to the adversarial loss and the cycle-consistent loss, the trained first generator being used as the expression generation model.
In some embodiments, the method further includes: determining a pixel-to-pixel loss according to the pixel difference between every two adjacent fused frame images corresponding to the first training video and the pixel difference between every two adjacent fused frame images corresponding to the second training video;
where training the first generator and the second generator according to the adversarial loss and the cycle-consistent loss includes:
training the first generator and the second generator according to the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss.
In some embodiments, determining the adversarial loss according to each fused frame image corresponding to the first training video and each fused frame image corresponding to the second training video includes: inputting each fused frame image corresponding to the first training video into a first discriminator to obtain a first discrimination result of each fused frame image corresponding to the first training video;
inputting each fused frame image corresponding to the second training video into a second discriminator to obtain a second discrimination result of each fused frame image corresponding to the second training video;
determining a first adversarial loss according to the first discrimination result of each fused frame image corresponding to the first training video, and determining a second adversarial loss according to the second discrimination result of each fused frame image corresponding to the second training video.
In some embodiments, inputting each fused frame image corresponding to the first training video into the first discriminator to obtain the first discrimination result of each fused frame image corresponding to the first training video includes:
inputting each fused frame image corresponding to the first training video into a first facial feature extraction model in the first discriminator to obtain the output feature information of each fused frame image corresponding to the first training video;
inputting the feature information of each fused frame image corresponding to the first training video into a first expression classification model in the first discriminator to obtain the expression classification information of each fused frame image corresponding to the first training video as the first discrimination result;
and inputting each fused frame image corresponding to the second training video into the second discriminator to obtain the second discrimination result of each fused frame image corresponding to the second training video includes:
inputting each fused frame image corresponding to the second training video into a second facial feature extraction model in the second discriminator to obtain the output feature information of each fused frame image corresponding to the second training video;
inputting the feature information of each fused frame image corresponding to the second training video into a second expression classification model in the second discriminator to obtain the expression classification information of each fused frame image corresponding to the second training video as the second discrimination result.
In some embodiments, the cycle-consistent loss is determined using the following method:
inputting each fused frame image corresponding to the first training video into the second generator to generate reconstructed images of each frame of the first training video, and inputting each fused frame image corresponding to the second training video into the first generator to generate reconstructed images of each frame of the second training video;
determining the cycle-consistent loss according to the difference between the reconstructed images of each frame of the first training video and the frame images of the first training video, and the difference between the reconstructed images of each frame of the second training video and the frame images of the second training video.
In some embodiments, the pixel-to-pixel loss is determined using the following method:
for each position in every two adjacent fused frame images corresponding to the first training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frame images, and summing the distances for all positions to obtain a first loss;
for each position in every two adjacent fused frame images corresponding to the second training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frame images, and summing the distances for all positions to obtain a second loss;
summing the first loss and the second loss to obtain the pixel-to-pixel loss.
In some embodiments, obtaining the feature information of each frame image of the first training video, the feature information of the face key points and the classification information of the original expression includes: inputting each frame image of the first training video into a third facial feature extraction model in the first generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all face key points by principal component analysis to obtain first information of a preset dimension as the feature information of the face key points of each frame image of the first training video; and inputting the feature information of each frame image of the first training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video;
and obtaining the feature information of each frame image of the second training video, the feature information of the face key points and the classification information of the target expression includes: inputting each frame image of the second training video into a fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all face key points by principal component analysis to obtain second information of a preset dimension as the feature information of the face key points of each frame image of the second training video; and inputting the feature information of each frame image of the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
In some embodiments, fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes: adding and averaging the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression to obtain the classification information of the fused expression corresponding to each frame image of the first training video; and concatenating the feature information of the face key points of each frame image of the first training video multiplied by a first weight to be trained, the feature information of each frame image of the first training video multiplied by a second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video;
and fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression includes: adding and averaging the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression to obtain the classification information of the fused expression corresponding to each frame image of the second training video; and concatenating the feature information of the face key points of each frame image of the second training video multiplied by a third weight to be trained, the feature information of each frame image of the second training video multiplied by a fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video.
In some embodiments, training the first generator and the second generator according to the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss includes: weighting and summing the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss to obtain a total loss, and training the first generator and the second generator according to the total loss.
In some embodiments, editing the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene includes:
adjusting the first human body key points of the character in an original first key frame in the first video during a first action to obtain second human body key points of the character during a second action, as the character action customization information;
extracting feature information of the neighborhood of each second human body key point from the original first key frame;
inputting each second human body key point and the feature information of its neighborhood into an image generation model, which outputs a target first key frame of the character during the second action.
In some embodiments, the method of obtaining the image generation model includes: using a training video frame and the human body key points of the character in the training video frame as a pair of training data, using the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of an image generation network, using the training video frame in the training data as supervision information for the output of the image generation network, and training the image generation network to obtain the image generation model.
In some embodiments, the first human body key points include human body outline feature points of the character during the first action, and the second human body key points include human body outline feature points of the character during the second action.
Some embodiments of the present disclosure provide a digital human generation device, including: a memory; and a processor coupled to the memory, the processor being configured to execute the digital human generation method of any of the embodiments based on instructions stored in the memory.
Some embodiments of the present disclosure provide a digital human generation device, including:
an acquisition unit configured to acquire a first video;
a customization unit configured to edit the characters in each frame image in the first video according to character customization information corresponding to an interaction scene; and
an output unit configured to output a second video according to each frame image in the processed first video.
Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the digital human generation method of any of the embodiments.
Description of the drawings
The drawings required for the description of the embodiments or the related art are briefly introduced below. The present disclosure can be understood more clearly from the following detailed description taken with reference to the drawings.
Obviously, the drawings in the following description are merely some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
Figure 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
Figure 2 shows a schematic diagram of video preprocessing according to some embodiments of the present disclosure.
Figure 3A shows a schematic flowchart of an expression generation method according to some embodiments of the present disclosure.
Figure 3B shows a schematic diagram of an expression generation method according to other embodiments of the present disclosure.
Figure 3C shows a schematic flowchart of a training method of an expression generation model according to some embodiments of the present disclosure.
Figure 3D shows a schematic diagram of a training method of an expression generation model according to some embodiments of the present disclosure.
Figure 4A shows a schematic diagram of the human body outline feature points of a character during a first action according to some embodiments of the present disclosure.
Figure 4B shows a schematic diagram of the human body outline feature points of a character during a second action according to some embodiments of the present disclosure.
Figure 4C shows a schematic diagram of multiple key points and multiple key connection lines on a character according to some embodiments of the present disclosure.
Figure 5 shows a schematic structural diagram of a digital human generation device according to some embodiments of the present disclosure.
Figure 6 shows a schematic structural diagram of a digital human generation device according to other embodiments of the present disclosure.
具体实施方式Detailed ways
下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述。The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure.
Unless otherwise specified, terms such as "first" and "second" in this disclosure are used only to distinguish different objects and do not imply any meaning of magnitude or temporal order.
本公开实施例根据交互场景相应的人物定制信息对视频中的人物进行编辑处理,通过人物编辑生成与交互场景匹配的数字人视频。Embodiments of the present disclosure edit the characters in the video based on the character customization information corresponding to the interaction scene, and generate a digital human video that matches the interaction scene through character editing.
图1A示出本公开一些实施例的数字人生成方法的流程示意图。Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
如图1A所示,该实施例的数字人生成方法包括以下步骤。As shown in Figure 1A, the digital human generation method of this embodiment includes the following steps.
在步骤S110,获取第一视频。In step S110, the first video is obtained.
第一视频例如可以是录制的原视频,也可以是由原视频经过预处理得到的,所述预处理包括分辨率调整、帧间平滑处理、帧率调整中的一项或多项。For example, the first video may be a recorded original video, or may be obtained by preprocessing the original video. The preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
在步骤S120,根据交互场景相应的人物定制信息,对第一视频中的各帧图像中的人物进行编辑处理。In step S120, edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene.
所述根据交互场景相应的人物定制信息,对第一视频中的各帧图像中的人物进行编辑处理包括以下中的一项或多项:根据交互场景相应的人物形象定制信息,对第一视频中的各帧图像中的人物形象进行编辑处理,生成与交互场景匹配的数字人形象;根据交互场景相应的人物表情定制信息,对第一视频中的各帧图像中的人物表情进行编辑处理,生成与交互场景匹配的数字人表情;根据交互场景相应的人物动作定制信息,对第一视频中的各帧图像中的人物动作进行编辑处理,生成与交互场景匹配的数字人动作。The editing process of the characters in each frame image in the first video according to the character customization information corresponding to the interaction scene includes one or more of the following: according to the character customization information corresponding to the interaction scene, editing the first video Edit and process the characters in each frame of the image in the first video to generate a digital human image that matches the interaction scene; edit and process the characters in each frame of the image in the first video based on the character expression customization information corresponding to the interaction scene. Generate digital human expressions that match the interaction scene; edit the character movements in each frame of the image in the first video according to the customized information of character movements corresponding to the interaction scene, and generate digital human movements that match the interaction scene.
在步骤S130,根据处理后的第一视频中的各帧图像,输出第二视频。In step S130, a second video is output based on each frame image in the processed first video.
即,处理后的第一视频中的各帧图像组合形成第二视频,第二视频是与交互场景匹配的数字人视频。That is, each frame image in the processed first video is combined to form a second video, and the second video is a digital human video matching the interaction scene.
上述实施例,根据交互场景相应的人物定制信息对视频中的人物进行编辑处理,通过人物编辑生成与交互场景匹配的数字人视频,例如,生成与交互场景匹配的数字人形象、数字人表情、数字人动作等。In the above embodiment, the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing. For example, a digital human image, digital human expression, etc. that match the interaction scene are generated. Digital human actions and more.
图1B示出本公开另一些实施例的数字人生成方法的流程示意图。FIG. 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
如图1B所示,该实施例的数字人生成方法包括以下步骤。As shown in Figure 1B, the digital human generation method of this embodiment includes the following steps.
在步骤S210,定制逻辑控制。In step S210, logic control is customized.
定制逻辑控制用来对视频预处理、形象定制、表情定制、动作定制等定制逻辑是否执行、执行顺序等进行控制。Customized logic control is used to control whether custom logic such as video preprocessing, image customization, expression customization, action customization, etc. is executed and the order of execution.
The content edited by video preprocessing, appearance customization, expression customization, and action customization is independent, with no strong dependency among the parts. Their execution order can therefore be swapped while still achieving the basic effect of generating a digital human video that matches the interaction scene. Nevertheless, the parts do influence one another to some degree; executing them in the order of S220 to S250 of this embodiment minimizes this mutual influence and yields the best final presentation of the character.
在步骤S220,视频预处理。In step S220, video preprocessing.
视频预处理是对录制的原视频进行预处理得到第一视频,所述预处理包括分辨率调整、帧间平滑处理、帧率调整中的一项或多项。Video preprocessing is to preprocess the recorded original video to obtain the first video. The preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
In some embodiments, as shown in Figure 2, the preprocessing is performed in the order of resolution adjustment, inter-frame smoothing, and frame rate adjustment. This ordering gives better preprocessing results: it preserves the visual information of the original video to the greatest extent, ensures that the preprocessed video is free of quality problems such as blurring or distortion, and minimizes the impact of the frame rate and resolution adjustments on the subsequent digital human customization flow.
The resolution adjustment includes: if the resolution of the original video is higher than the required preset resolution, downsampling the original video to the preset resolution to obtain the first video at the preset resolution; if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain the first video at the preset resolution, where the super-resolution model is used to raise the resolution of the input video to the preset resolution; and if the resolution of the original video already equals the required preset resolution, the resolution adjustment step can be skipped.
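A minimal Python sketch of this branching, assuming OpenCV (cv2) for resampling; super_resolve is a caller-supplied function standing in for the trained super-resolution model and is not specified by this document.

import cv2

def adjust_resolution(frames, target_hw, super_resolve=None):
    """Return frames whose resolution equals target_hw = (height, width)."""
    th, tw = target_hw
    out = []
    for frame in frames:
        h, w = frame.shape[:2]
        if (h, w) == (th, tw):
            out.append(frame)  # already at the preset resolution: skip adjustment
        elif h > th and w > tw:
            # higher than required: downsample to the preset resolution
            out.append(cv2.resize(frame, (tw, th), interpolation=cv2.INTER_AREA))
        else:
            # lower than required: let the super-resolution model raise it
            out.append(super_resolve(frame, (th, tw)))
    return out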
通过分辨率调整,可以使得预处理后的第一视频在分辨率方面保持一致性,降低原视频差异化分辨率对数字人定制效果的影响。Through resolution adjustment, the resolution of the preprocessed first video can be maintained consistent, and the impact of the differentiated resolution of the original video on the digital human customization effect can be reduced.
The super-resolution model is obtained, for example, by training a neural network. During training, a first video frame taken from a high-definition video is downsampled to the preset resolution to obtain a second video frame; the second video frame serves as the input of the neural network and the first video frame serves as the supervision for its output. The gap between the frame output by the network and the first video frame is used as the loss function, and the network parameters are updated iteratively according to this loss until the loss satisfies a given condition, at which point training is complete and the output frame is very close to the first video frame; the trained network is then used as the super-resolution model. Here, "neural network" covers a broad class of models, including but not limited to convolutional neural networks, optical-flow-based recurrent networks, and generative adversarial networks.
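A minimal PyTorch training-loop sketch of this scheme; the framework, network architecture, batch size, bicubic downsampling, and L1 loss are illustrative assumptions, since the text only specifies downsampled inputs, high-definition supervision, and a gap-based loss.

import torch
import torch.nn.functional as F

def train_super_resolution(model, hd_frames, low_hw, optimizer, steps=1000):
    """hd_frames: tensor (N, C, H, W) of frames taken from high-definition video."""
    for _ in range(steps):
        idx = torch.randint(0, hd_frames.size(0), (8,))
        target = hd_frames[idx]                                   # first video frame (supervision)
        inp = F.interpolate(target, size=low_hw, mode="bicubic",  # second video frame: downsampled copy
                            align_corners=False)
        pred = model(inp)                                         # network upscales back to H x W
        loss = F.l1_loss(pred, target)                            # gap between output and the HD frame
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model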
For example, key frames of a high-definition (1080p) video are downsampled to obtain second video frames at lower resolutions (such as 360p/480p/720p), and a super-resolution model is trained with the method above; using this model, a first video at 480p/720p/1080p can be obtained from an original video of any resolution. Here, 360p/480p/720p/1080p are video display formats in which "p" denotes progressive scan; for example, the 1080p picture resolution is 1920 by 1080.
分辨率调整后,由超分辨模型生成或降采样得到的帧序列中,两帧之间的纹理信息可能存在一定的差距,故而在此采用通过帧间平滑处理,以保证视频播放时纹理、人物边缘等处不会有锯齿或摩尔纹的产生,避免造成视觉上的影响。After the resolution is adjusted, in the frame sequence generated by the super-resolution model or obtained by downsampling, there may be a certain gap in the texture information between the two frames. Therefore, inter-frame smoothing is used here to ensure that the texture and characters are smooth during video playback. There will be no jagged or moiré patterns on the edges to avoid visual impact.
帧间平滑处理例如可以采用平均值的平滑处理方式。例如,连续三帧的图像信息取平均值,将该平均值作为该连续三帧中的中间帧的图像信息。The inter-frame smoothing process may, for example, adopt an average smoothing process. For example, the image information of three consecutive frames is averaged, and the average is used as the image information of the middle frame among the three consecutive frames.
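A short NumPy sketch of this three-frame averaging, assuming frames are stored as an (N, H, W, C) uint8 array.

import numpy as np

def smooth_frames(frames):
    """Replace each middle frame of a consecutive triple by the average of the three frames."""
    frames = np.asarray(frames, dtype=np.float32)   # (N, H, W, C)
    smoothed = frames.copy()
    for i in range(1, len(frames) - 1):
        smoothed[i] = (frames[i - 1] + frames[i] + frames[i + 1]) / 3.0
    return smoothed.astype(np.uint8)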
The frame rate adjustment includes: if the frame rate of the original video is higher than the required preset frame rate, decimating the original video according to the ratio between its frame rate and the preset frame rate to obtain the first video at the preset frame rate; if the frame rate of the original video is lower than the required preset frame rate, using a video frame-interpolation model to interpolate the original video up to a first frame rate, where the first frame rate is the least common multiple of the original frame rate before interpolation and the preset frame rate, and then decimating the interpolated video according to the ratio between the first frame rate and the preset frame rate to obtain the first video at the preset frame rate, the frame-interpolation model being used to generate transition frames between any two frames; and if the frame rate of the original video already equals the required preset frame rate, the frame rate adjustment step can be skipped.
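A Python sketch of this branching; interpolate_to_fps is a hypothetical wrapper around the frame-interpolation model, and math.lcm computes the least common multiple.

import math

def adjust_frame_rate(frames, src_fps, target_fps, interpolate_to_fps=None):
    if src_fps == target_fps:
        return list(frames)                         # nothing to do
    if src_fps > target_fps:
        # higher than required: keep roughly every (src_fps / target_fps)-th frame
        step = src_fps / target_fps
        return [frames[round(i * step)] for i in range(int(len(frames) / step))]
    # lower than required: interpolate up to lcm(src_fps, target_fps), then decimate
    lcm_fps = math.lcm(src_fps, target_fps)
    dense = interpolate_to_fps(frames, src_fps, lcm_fps)
    step = lcm_fps // target_fps
    return dense[::step]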
通过帧率调整,可以使得预处理后的第一视频在帧率方面保持一致性,降低原视频差异化帧率对数字人定制效果的影响。并且,插帧操作还可以有效解决两动作间的跳变问题。例如,数字人做完动作A做动作B,未经过插帧处理的视频播放时会使用户感觉到人物动作跳变,不够真实,本实施例通过插帧会在两动作的关键帧之间插入若干过渡帧,使得插帧处理后的视频播放时使用户感觉人物动作过渡自然,比较真实。Through frame rate adjustment, the frame rate of the preprocessed first video can be maintained consistent, and the impact of the differentiated frame rate of the original video on the digital human customization effect can be reduced. Moreover, the frame insertion operation can also effectively solve the jump problem between two actions. For example, after a digital person performs action A and then moves to action B, the user will feel that the character's movements jump when playing a video without frame insertion processing, which is not realistic enough. In this embodiment, frame insertion is used to insert between the key frames of the two actions. Several transition frames make the user feel that the transition of character movements is natural and more realistic when the video after frame insertion is played.
The video frame-interpolation model is obtained, for example, by training a neural network. During training, every three consecutive frames of a training frame sequence form a triplet; the first and third frames of the triplet are the input of the neural network, and the second frame is the supervision for its output. The gap between the frame the network produces from the first and third frames and the actual second frame of the triplet is used as the loss function, and the network parameters are updated iteratively according to this loss until the loss satisfies a given condition, at which point training is complete and the output frame is very close to the second frame of the triplet. The trained network is used as the frame-interpolation model and can generate a transition frame between any two images. Here, "neural network" again covers a broad class of models, including but not limited to convolutional neural networks, optical-flow-based recurrent networks, and generative adversarial networks.
其中,神经网络的输入例如包括:第一帧和第三帧的视觉特征信息和深度信息,以及第一帧和第三帧之间的光流信息和形变信息。通过这四部分信息的融合,所推理出的两帧之间应插入的过渡帧能够使视频过渡更加顺畅。The input to the neural network includes, for example: visual feature information and depth information of the first frame and the third frame, as well as optical flow information and deformation information between the first frame and the third frame. Through the fusion of these four parts of information, the inferred transition frame that should be inserted between two frames can make the video transition smoother.
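A minimal PyTorch sketch of the triplet training described above; how the model internally computes and fuses the four cues (visual features, depth, optical flow, deformation) is left to the model and is not specified here.

import torch
import torch.nn.functional as F

def train_interpolation(model, video, optimizer, epochs=10):
    """video: tensor (N, C, H, W) holding an ordered training frame sequence."""
    for _ in range(epochs):
        for i in range(video.size(0) - 2):
            first, middle, third = video[i], video[i + 1], video[i + 2]
            pred = model(first.unsqueeze(0), third.unsqueeze(0))   # inferred transition frame
            loss = F.l1_loss(pred, middle.unsqueeze(0))            # gap to the real middle frame
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model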
在步骤S230,形象定制。 In step S230, image customization.
根据交互场景相应的人物形象定制信息,对第一视频中的各帧图像中的人物形象进行编辑处理,满足用户对数字人美颜美体的需要。其中,形象定制例如包括磨皮、瘦脸、大眼、五官位置调整、身体比例调整,如瘦身,腿部拉长等美颜美体操作。According to the character customization information corresponding to the interactive scene, the characters in each frame of the image in the first video are edited to meet the user's needs for digital human beauty and body beautification. Among them, image customization includes, for example, skin resurfacing, face slimming, eye enlargement, facial feature position adjustment, body proportion adjustment, such as slimming down, leg lengthening and other beauty and body beautification operations.
In some embodiments, character-appearance adjustment parameters are determined from the adjustments the user makes on some of the video frames of the first video, and the characters in the remaining video frames of the first video are edited according to those parameters. The "some of the video frames" may be, for example, one or a few key frames of the first video. In this way the appearance customization of the digital human over the whole video is completed with only a small amount of editing work, which improves customization efficiency and reduces customization cost.
Editing the characters in the remaining video frames of the first video according to the appearance adjustment parameters includes: locating, by key-point detection, the target part indicated by the adjustment parameters (for example, the facial features or the body) in each of the remaining video frames; and adjusting the magnitude or position of the located target part through a graphics transformation according to the magnitude information or position information contained in the adjustment parameters.
For example, if the user enlarges the character's eyes in some key frames, the face is first detected by face detection, the eyes of the character in the remaining video frames are then located by key-point detection, and the magnitude of the user's enlargement, for example the increase in the distance between the upper and lower eyelids, is applied to the eyes in the remaining frames through a graphics transformation, so that the big-eye beautification effect is achieved for the character in every frame of the video.
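A sketch of how such a key-frame edit might be propagated; detect_landmarks and warp_eye_region are hypothetical helpers standing in for the key-point detection and graphics transformation steps.

def eye_scale_from_keyframe(original_lms, edited_lms):
    # ratio of upper/lower eyelid distance after vs. before the user's edit
    before = abs(original_lms["upper_eyelid"][1] - original_lms["lower_eyelid"][1])
    after = abs(edited_lms["upper_eyelid"][1] - edited_lms["lower_eyelid"][1])
    return after / before

def propagate_eye_edit(frames, scale, detect_landmarks, warp_eye_region):
    out = []
    for frame in frames:
        lms = detect_landmarks(frame)                    # locate the target part in this frame
        out.append(warp_eye_region(frame, lms, scale))   # apply the same magnitude of change
    return out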
在步骤S240,表情定制。In step S240, expression customization.
Expression customization is an expression generation method that edits the facial expression of the character in each frame of the first video according to the expression customization information corresponding to the interaction scene, for example the preset classification information corresponding to a target expression. It enables control of the digital human's facial expression in the interaction scene: one expression state of the digital human can be transferred to another, target expression state while only the facial expression changes and the mouth shape while speaking, head movements, and so on are unaffected. As a result, when the digital human utters the corresponding spoken content, its expression can change accordingly with that content.
图3A为本公开表情生成方法一些实施例的流程图。如图3A所示,该实施例的方法包括:步骤S310~S330。Figure 3A is a flow chart of some embodiments of the expression generation method of the present disclosure. As shown in Figure 3A, the method in this embodiment includes: steps S310 to S330.
在步骤S310中,获取第一视频中每帧图像的特征信息、人脸关键点的特征信息和原表情的分类信息。In step S310, the characteristic information of each frame of the image in the first video, the characteristic information of the key points of the face, and the classification information of the original expression are obtained.
第一视频中的人脸表情为原表情。即,第一视频中各帧图像中人脸表情主要为原表情,原表情例如是平静表情。The facial expressions in the first video are the original expressions. That is, the human facial expression in each frame image in the first video is mainly the original expression, and the original expression is, for example, a calm expression.
In some embodiments, each frame of the first video is input into a face feature extraction model to obtain the feature information of that frame; the feature information of each frame is input into a face key-point detection model to obtain the coordinate information of the face key points of that frame; principal component analysis (PCA) is applied to the coordinate information of all the face key points to reduce it to a preset dimension, which is used as the feature information of the face key points; and the feature information of each frame is input into an expression classification model to obtain the classification information of the original expression of that frame.
The overall expression generation model includes an encoder and a decoder. The encoder may include the face feature extraction model, the face key-point detection model, and the expression classification model, with the face feature extraction model feeding both the key-point detection model and the expression classification model. The face feature extraction model may reuse an existing model, for example a deep learning model with feature extraction capability such as VGG-19, ResNet, or a Transformer; the part of VGG-19 before block 5 can be used as the face feature extraction model. The face key-point detection model and the expression classification model may also reuse existing models such as an MLP (multi-layer perceptron), specifically a 3-layer MLP. Once trained, the expression generation model is used to generate expressions; the training process is described in detail below.
The feature information of each frame of the first video is, for example, the feature map output by the face feature extraction model. The key points include, for example, 68 points such as the chin, the area between the eyebrows, and the corners of the mouth, each represented by the horizontal and vertical coordinates of its position. After the coordinate information of the key points is obtained from the face key-point detection model, PCA is applied to the coordinates of all the key points to reduce redundancy and improve efficiency, yielding information of a preset dimension (for example 6 dimensions, which gives the best results) as the feature information of the face key points. The expression classification model can output a classification over several expressions such as neutral, happy, and sad, which can be represented as a one-hot encoded vector. The classification information of the original expression can be the one-hot style encoding, obtained from the expression classification model, of the original expression in each frame of the first video.
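A compact PyTorch sketch of an encoder in this spirit; the VGG-19 truncation point, the pooling, the head widths, and the use of scikit-learn PCA are illustrative assumptions rather than requirements of the text.

import torch
import torch.nn as nn
import torchvision
from sklearn.decomposition import PCA

class ExpressionEncoder(nn.Module):
    def __init__(self, n_classes=4, n_landmarks=68):
        super().__init__()
        vgg = torchvision.models.vgg19()
        self.backbone = vgg.features[:27]               # feature map before block 5 (assumed split)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        feat_dim = 512 * 4 * 4
        self.landmark_head = nn.Sequential(             # 3-layer MLP -> 68 (x, y) coordinates
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 2))
        self.expr_head = nn.Sequential(                 # 3-layer MLP -> expression class scores
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, x):
        fmap = self.backbone(x)
        flat = self.pool(fmap).flatten(1)
        return fmap, self.landmark_head(flat), self.expr_head(flat).softmax(dim=-1)

# PCA is fitted once on landmark vectors gathered from the video, then reused:
# pca = PCA(n_components=6).fit(all_landmarks)   # all_landmarks: (num_frames, 136)
# landmark_feat = pca.transform(landmarks)       # 6-dimensional key-point feature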
在步骤S320中,将每帧图像的特征信息、人脸关键点的特征信息、原表情的分类信息与目标表情对应的预设分类信息进行融合,得到每帧图像对应的融合图像的特征信息。In step S320, the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the feature information of the fused image corresponding to each frame of image.
在一些实施例中,将每帧图像的原表情的分类信息与目标表情对应的预设分类信息进行加和取平均,得到每帧图像对应的融合表情的分类信息;将与训练得到的第一权重相乘后的每帧图像的人脸关键点的特征信息,与训练得到的第二权重相乘后的每帧图像的特征信息,以及每帧图像对应的融合表情的分类信息进行拼接。In some embodiments, the classification information of the original expression of each frame of image and the preset classification information corresponding to the target expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame of image; The feature information of the face key points of each frame of image multiplied by the weights, the feature information of each frame of the image multiplied by the second weight obtained by training, and the classification information of the fused expression corresponding to each frame of the image are spliced.
The target expression differs from the original expression and is, for example, a smiling expression; the preset classification information corresponding to the target expression is, for example, a preset one-hot code of the target expression. The preset classification information does not need to be produced by a model and can be encoded directly with the preset (one-hot) encoding rule; for example, the calm expression is encoded as [1, 0, 0, 0] and the smiling expression as [0, 1, 0, 0]. The classification information of the original expression mentioned above is obtained from the expression classification model and may differ from the preset classification information of the original expression: for example, the original expression is a calm expression with preset one-hot code [1, 0, 0, 0], whereas the code produced by the expression classification model may be [0.8, 0.2, 0, 0].
编码器还可以包括特征融合模型,将每帧图像的特征信息、人脸关键点的特征信息、原表情的分类信息与目标表情对应的预设分类信息输入特征融合模型进行融合。特征融合模型中需要训练的参数包括第一权重和第二权重。针对每帧图像,训练得到的第一权重与该图像的人脸关键点的特征信息相乘,得到第一特征向量,训练得到的第二权重与该图像的特征信息相乘,得到第二特征向量,将第一特征向量、第二特征向量与该图像对应的融合表情的分类信息进行拼接,得到该图像对应的融合图像的特征信息。第一权重和第二权重可以使三种信息的值域统一。The encoder can also include a feature fusion model, which inputs the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression into the feature fusion model for fusion. The parameters that need to be trained in the feature fusion model include the first weight and the second weight. For each frame of image, the first weight obtained by training is multiplied by the feature information of the facial key points of the image to obtain the first feature vector, and the second weight obtained by training is multiplied by the feature information of the image to obtain the second feature. vector, splicing the first feature vector, the second feature vector and the classification information of the fused expression corresponding to the image to obtain the feature information of the fused image corresponding to the image. The first weight and the second weight can unify the value ranges of the three types of information.
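A small sketch of the fusion step under the shapes suggested above; w1 and w2 play the role of the trainable first and second weights.

import torch

def fuse(feat_map, landmark_feat, pred_expr, target_onehot, w1, w2):
    fused_expr = (pred_expr + target_onehot) / 2.0      # fused expression classification
    img_feat = feat_map.flatten(1)                      # (B, C*H*W) image feature vector
    # concatenate weighted key-point feature, weighted image feature, fused class vector
    return torch.cat([w1 * landmark_feat, w2 * img_feat, fused_expr], dim=1)

# Example shapes: feat_map (B, 512, 8, 8), landmark_feat (B, 6),
# pred_expr / target_onehot (B, 4); w1 and w2 are learnable scalars (nn.Parameter).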
在步骤S330中,根据每帧图像对应的融合图像的特征信息,生成每帧图像对应的融合图像,所有融合图像组合形成人脸表情是目标表情的第二视频。In step S330, a fused image corresponding to each frame of image is generated based on the feature information of the fused image corresponding to each frame of image, and all the fused images are combined to form a second video in which the facial expression is the target expression.
在一些实施例中,将每帧图像对应的融合图像的特征信息输入解码器,输出生成的每帧图像对应的融合图像。人脸特征提取模型包括卷积层,解码器包括反卷积层,可以基于特征生成图像。解码器例如为VGG-19的block 5,将最后一层卷积层替换为反卷积层。融合图像即为人脸表情是目标表情的图像,各帧融合图像形成第二视频。In some embodiments, the feature information of the fused image corresponding to each frame of image is input to the decoder, and the generated fused image corresponding to each frame of image is output. The facial feature extraction model includes convolutional layers, and the decoder includes deconvolutional layers that can generate images based on features. The decoder is, for example, block 5 of VGG-19, which replaces the last convolutional layer with a deconvolutional layer. The fused image is an image whose facial expression is the target expression, and the fused images of each frame form a second video.
下面结合图3B描述本公开的一些应用例。Some application examples of the present disclosure are described below with reference to FIG. 3B.
如图3B所示,第一视频中的一帧图像,进行特征提取后得到特征图,根据特征图分别进行人脸关键点检测和表情分类,人脸关键点检测得到的各个关键点的特征信息进行PCA,降维为预设维度的信息作为关键点特征,原表情的分类信息进行one-hot编码与目标表情对应的预设分类信息进行融合,得到表情分类向量(融合表情的分类信息),进而将人脸的特征图,表情分类向量和关键点特征进行融合,得到融合图像的特征信息,将融合图像的特征信息进行特征解码,得到目标表情的人脸图像。As shown in Figure 3B, for a frame of the image in the first video, a feature map is obtained after feature extraction. Face key point detection and expression classification are performed based on the feature map. The feature information of each key point obtained by face key point detection is PCA is performed, and the dimensionality is reduced to the information of preset dimensions as key point features. The classification information of the original expression is one-hot encoded and fused with the preset classification information corresponding to the target expression to obtain the expression classification vector (the classification information of the fused expression). Then, the feature map of the face, the expression classification vector and the key point features are fused to obtain the feature information of the fused image, and the feature information of the fused image is decoded to obtain the face image of the target expression.
The solution of the above embodiments extracts, for each frame of the first video, the feature information of the image, the feature information of the face key points, and the classification information of the original expression, fuses the extracted information with the preset classification information corresponding to the target expression to obtain the feature information of the fused image for that frame, and then generates the fused image for each frame from that feature information; all the fused images form the second video in which the facial expression is the target expression. Extracting the face key-point features and using them in the fusion makes the expressions in the fused images more realistic and fluid, and fusing in the preset classification information of the target expression directly realizes generation of the target expression. The method is compatible with the facial movements and mouth shape of the character in the original images, does not affect the character's mouth shape, head movements, and so on, and does not reduce the sharpness of the original images, so the generated video is stable, clear, and smooth.
图3C为本公开表情生成模型的训练方法一些实施例的流程图。表情生成模型能够根据输入的人脸表情是原表情的第一视频和目标表情对应的预设分类信息,输出得到人脸表情是目标表情的第二视频。Figure 3C is a flow chart of some embodiments of the training method of the expression generation model of the present disclosure. The expression generation model can output a second video in which the facial expression is the target expression based on the input first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression.
如图3C所示,该实施例的方法包括:步骤S410~S450。As shown in Figure 3C, the method in this embodiment includes: steps S410 to S450.
在步骤S410中,获取由第一训练视频的各帧图像与第二训练视频的各帧图像组成的训练对。In step S410, a training pair consisting of each frame image of the first training video and each frame image of the second training video is obtained.
第一训练视频为人脸表情为原表情的视频,第二训练视频为人脸表情为目标表情的视频,第一训练视频的各帧图像与第二训练视频的各帧图像并不需要一一对应。对原表情的分类信息和目标表情的分类信息进行标注。The first training video is a video in which the facial expression is the original expression, and the second training video is a video in which the facial expression is the target expression. Each frame image of the first training video does not need to correspond to each frame image of the second training video. Label the classification information of the original expression and the classification information of the target expression.
Using a large number of videos of people speaking with different expressions as training data, deep learning is used to perform cross-domain transfer learning (domain transfer learning) to learn a first generator that converts one expression state into another, and the expression generation result is then fused with the digital human as a whole.
在步骤S420中,将第一训练视频的各帧图像输入第一生成器,获取第一训练视频的各帧图像的特征信息、人脸关键点的特征信息和原表情的分类信息,将第一训练视频的各帧图像的特征信息、人脸关键点的特征信息、原表情的分类信息和目标表情对应的预设分类信息进行融合,得到第一训练视频对应的各帧融合图像的特征信息,根据第一训练视频对应的各帧融合图像的特征信息,得到第一生成器输出的第一训练视频对应的各帧融合图像。In step S420, each frame image of the first training video is input into the first generator, the feature information of each frame image of the first training video, the feature information of the facial key points and the classification information of the original expression are obtained, and the first The characteristic information of each frame of the training video, the characteristic information of the key points of the face, the classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the characteristic information of each frame of the fused image corresponding to the first training video, According to the feature information of each frame fusion image corresponding to the first training video, each frame fusion image corresponding to the first training video output by the first generator is obtained.
第一生成器训练完成后作为表情生成模型使用。在一些实施例中,将第一训练视频中各帧图像输入第一生成器中的第三人脸特征提取模型,得到输出的各帧图像的特征信息;将各帧图像的特征信息输入第一生成器中第一人脸关键点检测模型,得到各帧图像的人脸关键点的坐标信息;采用主成分分析法对所有人脸关键点的坐标信息进行降维,得到预设维度的第一信息,作为第一训练视频的各帧图像的人脸关键点的特征信息;将第一训练视频中各帧图像的特征信息输入第一生成器中的第三表情分类模型,得到第一训练视频中各帧图像的原表情的分类信息。After the first generator is trained, it is used as an expression generation model. In some embodiments, each frame image in the first training video is input into the third facial feature extraction model in the first generator to obtain the output feature information of each frame image; the feature information of each frame image is input into the first The first facial key point detection model in the generator obtains the coordinate information of the facial key points in each frame image; the principal component analysis method is used to reduce the dimensionality of the coordinate information of all facial key points, and the first facial key point of the preset dimension is obtained. information, as the feature information of the facial key points of each frame image in the first training video; input the feature information of each frame image in the first training video into the third expression classification model in the first generator to obtain the first training video Classification information of the original expression of each frame of image.
将人脸关键点的坐标信息进行主成分分析(PCA),关键点坐标信息降至6维(6维是通过大量实验得到的最好效果)。PCA不涉及训练参数(PCA的特征提取以及前后特征维度对应关系不随训练改变,梯度反向传递时,仅通过初始PCA得到的特征对应关系,向前面的参数传递梯度即可)。 The coordinate information of the key points on the face is subjected to principal component analysis (PCA), and the coordinate information of the key points is reduced to 6 dimensions (6 dimensions is the best result obtained through a large number of experiments). PCA does not involve training parameters (the feature extraction of PCA and the correspondence between front and rear feature dimensions do not change with training. When the gradient is transferred in reverse, only the feature correspondence obtained by the initial PCA is used to transfer the gradient to the previous parameters).
在一些实施例中,将第一训练视频的各帧图像的原表情的分类信息与目标表情对应的预设分类信息进行加和取平均,得到第一训练视频的各帧图像对应的融合表情的分类信息;将与待训练的第一权重相乘后的第一训练视频的各帧图像的人脸关键点的特征信息,与待训练的第二权重相乘后的第一训练视频的各帧图像的特征信息,以及第一训练视频的各帧图像对应的融合表情的分类信息进行拼接,得到第一训练视频对应的各帧融合图像的特征信息。In some embodiments, the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression are added and averaged to obtain the fusion expression corresponding to each frame image of the first training video. Classification information; the feature information of the face key points of each frame of the first training video multiplied by the first weight to be trained, and each frame of the first training video multiplied by the second weight to be trained The feature information of the image and the classification information of the fused expression corresponding to each frame of the first training video are spliced to obtain the feature information of each frame of the fused image corresponding to the first training video.
第一生成器中包括第一特征融合模型,第一权重和第二权重为第一特征融合模型中待训练的参数。上述特征提取和特征融合的过程可以参考前述实施例。The first generator includes a first feature fusion model, and the first weight and the second weight are parameters to be trained in the first feature fusion model. For the above feature extraction and feature fusion processes, reference can be made to the foregoing embodiments.
第一生成器包括第一编码器和第一解码器,第一编码器包括:第三人脸特征提取模型,第一人脸关键点检测模型,第三表情分类模型,第一特征融合模型,将第一训练视频对应的各帧融合图像的特征信息输入第一解码器得到生成的第一训练视频对应的各帧融合图像。The first generator includes a first encoder and a first decoder. The first encoder includes: a third facial feature extraction model, a first facial key point detection model, a third expression classification model, and a first feature fusion model, The characteristic information of each frame of the fused image corresponding to the first training video is input into the first decoder to obtain the generated each frame of the fused image corresponding to the first training video.
在步骤S430中,将第二训练视频各帧图像输入第二生成器,获取第二训练视频的各帧图像的特征信息、人脸关键点的特征信息和目标表情的分类信息,将第二训练视频的各帧图像的特征信息、人脸关键点的特征信息、目标表情的分类信息和原表情对应的预设分类信息进行融合,得到第二训练视频对应的各帧融合图像的特征信息,根据第二训练视频对应的各帧融合图像的特征信息,得到第二生成器输出的第二训练视频对应的各帧融合图像。In step S430, each frame image of the second training video is input into the second generator, the feature information of each frame image of the second training video, the feature information of the face key points, and the classification information of the target expression are obtained, and the second training video is The characteristic information of each frame image of the video, the characteristic information of the key points of the face, the classification information of the target expression and the preset classification information corresponding to the original expression are fused to obtain the characteristic information of each frame of the fused image corresponding to the second training video. According to The feature information of each frame of the fused image corresponding to the second training video is obtained to obtain the fused image of each frame corresponding to the second training video output by the second generator.
第二生成器与第一生成器在结构上是相同或相似的,第二生成器的训练目标是基于第二训练视频,生成与第一训练视频表情相同的视频。The second generator is structurally identical or similar to the first generator, and the training goal of the second generator is to generate a video with the same expression as the first training video based on the second training video.
在一些实施例中,将第二训练视频中各帧图像输入第二生成器中的第四人脸特征提取模型,得到输出的各帧图像的特征信息;将各帧图像的特征信息输入第二生成器中第二人脸关键点检测模型,得到各帧图像的人脸关键点的坐标信息;采用主成分分析法对所有人脸关键点的坐标信息进行降维,得到预设维度的第二信息,作为第二训练视频的各帧图像的人脸关键点的特征信息。将第二训练视频中各帧图像的特征信息输入第二生成器中的第四表情分类模型,得到第二训练视频中各帧图像的目标表情的分类信息。In some embodiments, input each frame image in the second training video into the fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; input the feature information of each frame image into the second The second face key point detection model in the generator obtains the coordinate information of the face key points of each frame image; the principal component analysis method is used to reduce the dimensionality of the coordinate information of all face key points, and the second face key point of the preset dimension is obtained. Information, as the feature information of the face key points of each frame image of the second training video. The characteristic information of each frame image in the second training video is input into the fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image in the second training video.
第二训练视频的各帧图像的人脸关键点的特征信息与第一训练视频的各帧图像的人脸关键点的特征信息的维度是相同的,例如,6维。The feature information of the face key points in each frame image of the second training video has the same dimension as the feature information of the face key points in each frame image of the first training video, for example, 6 dimensions.
In some embodiments, the classification information of the target expression of each frame of the second training video and the preset classification information corresponding to the original expression are added and averaged to obtain the classification information of the fused expression for each frame of the second training video; the face key-point feature information of each frame of the second training video multiplied by a third weight to be trained, the feature information of each frame of the second training video multiplied by a fourth weight to be trained, and the classification information of the fused expression for each frame of the second training video are concatenated to obtain the feature information of each fused frame corresponding to the second training video.
The preset classification information corresponding to the original expression does not need to be produced by a model; it can be encoded directly with the preset encoding rule. The second generator includes a second feature fusion model, and the third weight and the fourth weight are the parameters of the second feature fusion model to be trained. The feature extraction and feature fusion processes above can refer to the foregoing embodiments and are not repeated here.
第二生成器包括第二编码器和第二解码器,第二编码器包括:第四人脸特征提取模型,第二人脸关键点检测模型,第四表情分类模型,第二特征融合模型,将第二训练视频对应的各帧融合图像的特征信息输入第二解码器得到生成的第二训练视频对应的各帧融合图像。The second generator includes a second encoder and a second decoder. The second encoder includes: a fourth facial feature extraction model, a second facial key point detection model, a fourth expression classification model, and a second feature fusion model, The feature information of each frame of the fused image corresponding to the second training video is input into the second decoder to obtain the generated each frame of the fused image corresponding to the second training video.
在步骤S440中,根据第一训练视频对应的各帧融合图像、第二训练视频对应的各帧融合图像,确定对抗损失和循环一致损失。In step S440, the adversarial loss and the cycle-consistent loss are determined based on the fused images of each frame corresponding to the first training video and the fused images of each frame corresponding to the second training video.
基于生成对抗学习和跨域迁移学习进行端到端的训练,能够提高模型的准确度,并且提高训练效率。End-to-end training based on generative adversarial learning and cross-domain transfer learning can improve the accuracy of the model and improve training efficiency.
In some embodiments, the adversarial losses are determined as follows: the fused frames corresponding to the first training video are input into a first discriminator to obtain a first discrimination result for each of those fused frames; the fused frames corresponding to the second training video are input into a second discriminator to obtain a second discrimination result for each of those fused frames; a first adversarial loss is determined from the first discrimination results of the fused frames corresponding to the first training video, and a second adversarial loss is determined from the second discrimination results of the fused frames corresponding to the second training video.
进一步,在一些实施例中,将第一训练视频对应的各帧融合图像输入第一判别器中第一人脸特征提取模型,得到输出的第一训练视频对应的各帧融合图像的特征信息;将第一训练视频对应的各帧融合图像的特征信息输入第一判别器中的第一表情分类模型,得到第一训练视频对应的各帧融合图像的表情的分类信息,作为第一判别结果;将第二训练视频对应的各帧融合图像输入第二判别器中第二人脸特征提取模型,得到输出的第二训练视频对应的各帧融合图像的特征信息;将第二训练视频对应的各帧融合图像的特征信息输入第二判别器中的第二表情分类模型,得到第二训练视频对应的各帧融合图像的表情的分类信息,作为第二判别结果。 Further, in some embodiments, each frame of the fused image corresponding to the first training video is input into the first face feature extraction model in the first discriminator, and the feature information of each frame of the fused image corresponding to the output first training video is obtained; Input the feature information of each frame of the fused image corresponding to the first training video into the first expression classification model in the first discriminator, and obtain the classification information of the expression of each frame of the fused image corresponding to the first training video as the first discrimination result; Input each frame of the fused image corresponding to the second training video into the second facial feature extraction model in the second discriminator to obtain the feature information of each frame of the fused image corresponding to the output second training video; input each frame of the fused image corresponding to the second training video. The feature information of the frame fusion image is input into the second expression classification model in the second discriminator, and the expression classification information of each frame fusion image corresponding to the second training video is obtained as the second discrimination result.
During training the overall model comprises two generator-discriminator pairs. The first discriminator and the second discriminator have the same or similar structures, each including a face feature extraction model and an expression classification model. The first and second face feature extraction models have the same or similar structure as the third and fourth face feature extraction models, and the first and second expression classification models have the same or similar structure as the third and fourth expression classification models.
For example, the data of the first video is denoted X = {x_i} and the data of the second video is denoted Y = {y_i}. The first generator G realizes X→Y and is trained so that G(X) is as close as possible to Y; the first discriminator D_Y judges whether the fused frames corresponding to the first training video are real or fake. The first adversarial loss can be expressed in the standard GAN form:

L_GAN(G, D_Y, X, Y) = E_{y~p_data(y)}[log D_Y(y)] + E_{x~p_data(x)}[log(1 - D_Y(G(x)))]    (1)
The second generator F realizes Y→X and is trained so that F(Y) is as close as possible to X; the second discriminator D_X judges whether the fused frames corresponding to the second training video are real or fake. The second adversarial loss can be expressed in the same form:

L_GAN(F, D_X, Y, X) = E_{x~p_data(x)}[log D_X(x)] + E_{y~p_data(y)}[log(1 - D_X(F(y)))]    (2)
In some embodiments, the cycle consistency losses are determined as follows: the fused frames corresponding to the first training video are input into the second generator to generate reconstructed frames of the first training video, and the fused frames corresponding to the second training video are input into the first generator to generate reconstructed frames of the second training video; the cycle consistency loss is then determined from the difference between the reconstructed frames of the first training video and the original frames of the first training video, and the difference between the reconstructed frames of the second training video and the original frames of the second training video.
为了进一步提高模型的准确率,将第一生成器生成的图像输入第二生成器,得到第一训练视频的各帧重构图像,期望第二生成器生成的第一训练视频的各帧重构图像与第一训练视频的各帧图像尽量一致,即F(G(x))≈x。将第二生成器生成的图像输入第一生成器,得到第二训练视频的各帧重构图像,期望第一生成器生成的第二训练视频的各帧重构图像与第二训练视频的各帧图像尽量一致,即G(F(y))≈y。In order to further improve the accuracy of the model, the images generated by the first generator are input into the second generator to obtain reconstructed images of each frame of the first training video. It is expected that the reconstructed images of each frame of the first training video generated by the second generator The image should be as consistent as possible with each frame of the first training video, that is, F(G(x))≈x. Input the image generated by the second generator into the first generator to obtain the reconstructed image of each frame of the second training video. It is expected that the reconstructed image of each frame of the second training video generated by the first generator is consistent with each frame of the second training video. The frame images should be as consistent as possible, that is, G(F(y))≈y.
The difference between the reconstructed frames of the first training video and the frames of the first training video can be determined as follows: for each reconstructed frame of the first training video and the frame of the first training video corresponding to it, compute the distance (for example the Euclidean distance) between the representation vectors of the pixels at each identical position in the two images, and sum all of these distances. The difference between the reconstructed frames of the second training video and the frames of the second training video can be determined in the same way: for each reconstructed frame of the second training video and the corresponding frame of the second training video, compute the distance (for example the Euclidean distance) between the representation vectors of the pixels at each identical position, and sum all of these distances.
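A PyTorch sketch of this cycle-consistency term, using a per-pixel Euclidean distance summed over all positions as described; F_gen denotes the second generator F.

import torch

def pixelwise_distance_sum(a, b):
    # Euclidean distance between the channel vectors of corresponding pixels, summed
    return torch.linalg.vector_norm(a - b, dim=1).sum()

def cycle_consistency_loss(G, F_gen, real_x, real_y):
    recon_x = F_gen(G(real_x))   # x -> G -> F should come back to x
    recon_y = G(F_gen(real_y))   # y -> F -> G should come back to y
    return pixelwise_distance_sum(recon_x, real_x) + pixelwise_distance_sum(recon_y, real_y)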
在步骤S450中,根据对抗损失和循环一致损失,对第一生成器和第二生成器进行训练。In step S450, the first generator and the second generator are trained according to the adversarial loss and the cycle-consistent loss.
The first adversarial loss, the second adversarial loss, and the cycle consistency loss can be weighted and summed to obtain a total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss can be determined by the following formula:

L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ·L_cyc(G, F)    (3)

where L_cyc(G, F) denotes the cycle consistency loss and λ is a weight that can be obtained through training.
To further improve the accuracy of the model and keep the output video stable and continuous, a loss term based on the pixel difference between adjacent video frames is added during training. In some embodiments, a pixel-to-pixel loss is determined from the pixel differences between every two adjacent fused frames corresponding to the first training video and the pixel differences between every two adjacent fused frames corresponding to the second training video, and the first generator and the second generator are trained according to the adversarial losses, the cycle consistency loss, and the pixel-to-pixel loss.
Further, in some embodiments, for each position in every pair of adjacent fused frames corresponding to the first training video, the distance between the representation vectors of the two pixels at that position is computed and the distances over all positions are summed to obtain a first loss; for each position in every pair of adjacent fused frames corresponding to the second training video, the distance between the representation vectors of the two pixels at that position is computed and the distances over all positions are summed to obtain a second loss; the first loss and the second loss are added to obtain the pixel-to-pixel loss. The pixel-to-pixel loss keeps adjacent frames of the generated video from changing too much.
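A PyTorch sketch of the pixel-to-pixel term for one generator's output sequence; applying the same function to both generators' outputs gives the first and second losses.

import torch

def p2p_loss(generated_frames):
    """generated_frames: tensor (T, C, H, W) of consecutive generator outputs."""
    diff = generated_frames[1:] - generated_frames[:-1]    # adjacent-frame differences
    return torch.linalg.vector_norm(diff, dim=1).sum()     # per-pixel distance, summed over positions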
In some embodiments, the adversarial losses, the cycle consistency loss, and the pixel-to-pixel loss are weighted and summed to obtain the total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss can be determined by the following formula:

L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ_1·L_cyc(G, F) + λ_2·L_P2P(G(x_i), G(x_{i+1})) + λ_3·L_P2P(F(y_j), F(y_{j+1}))    (4)

where λ_1, λ_2, and λ_3 are weights that can be obtained through training, L_P2P(G(x_i), G(x_{i+1})) denotes the first loss, and L_P2P(F(y_j), F(y_{j+1})) denotes the second loss.
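Read together, formulas (1) through (4) assemble into a single scalar objective; a trivial sketch, assuming each term has already been computed as above:

def total_loss(adv_g, adv_f, cyc, p2p_g, p2p_f, lambda1, lambda2, lambda3):
    # formula (4): adversarial terms plus weighted cycle-consistency and pixel-to-pixel terms
    return adv_g + adv_f + lambda1 * cyc + lambda2 * p2p_g + lambda3 * p2p_f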
As shown in Figure 3D, each part of the model can be pre-trained before the end-to-end training. For example, a face recognition model is first pre-trained on a large amount of open-source face recognition data, and the part of it before the output feature map is taken as the face feature extraction model (this choice is not unique; taking VGG-19 as an example, the part before block 5 can be selected, which outputs an 8×8×512-dimensional feature map). The face feature extraction model and its parameters are then fixed, and the network splits into two branches, the face key-point detection model and the expression classification model, which are fine-tuned on a face key-point detection dataset and expression classification data respectively, training only the parameters of these two branches. The face key-point detection model is not unique: any convolutional-network-based model that yields accurate key points can be plugged into this scheme, and the expression classification model is a single-label classification task based on a convolutional network. After pre-training, the end-to-end training process of the foregoing embodiments can be carried out, which improves training efficiency.
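A PyTorch sketch of this freeze-and-fine-tune step, assuming an encoder with the backbone/pool/head structure sketched earlier; the optimizer and learning rate are illustrative assumptions.

import torch

def finetune_head(encoder, head, loader, loss_fn, lr=1e-4, epochs=5):
    for p in encoder.backbone.parameters():
        p.requires_grad = False                       # face feature extractor stays fixed
    opt = torch.optim.Adam(head.parameters(), lr=lr)  # only this branch is updated
    for _ in range(epochs):
        for images, targets in loader:
            with torch.no_grad():
                flat = encoder.pool(encoder.backbone(images)).flatten(1)
            loss = loss_fn(head(flat), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head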
The method of the above embodiments trains the overall model with the adversarial loss, the cycle-consistency loss and the pixel loss between adjacent frames of the video, which can improve the accuracy of the model; in addition, the end-to-end training process can improve efficiency and save computing resources.
The solution of the present disclosure is suitable for editing facial expressions in video. By adopting a dedicated deep learning model that integrates expression recognition, key point detection and related techniques, and by training on data, the present disclosure learns how facial key points move under different expressions, and ultimately controls the facial expression state output by the model by feeding the classification information of the target expression into the model. Because the expression exists as a style state, it can be superimposed smoothly when the character speaks or performs actions such as tilting the head or blinking, so that the finally output video of the character's facial movements looks natural and consistent. The output can have the same resolution and level of detail as the input image, and remains stable, clear and artifact-free even at 1080p or 2K resolution.
In step S250, action customization is performed.
Action customization refers to editing the character's actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene, so as to realize editing and control of the digital human's actions in the interaction scene.
In some embodiments, editing the character's actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene includes: adjusting the first human body key points of the character in an original first key frame of the first video during a first action to obtain second human body key points of the character during a second action, which serve as the character action customization information; extracting feature information of the neighborhood of each second human body key point from the original first key frame, for example with a feature extraction model such as a convolution kernel model; and inputting each second human body key point and the feature information of its neighborhood into an image generation model, which outputs a target first key frame of the character during the second action.
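A hedged sketch of this action-customization step is shown below; the patch size, the additive key point adjustment, and the `image_generator` interface are assumptions introduced for illustration, not details taken from the disclosure.

```python
import numpy as np

def extract_neighborhood_features(frame, keypoints, patch_size=16):
    """Crop a small patch around each adjusted (second) body key point.

    frame: np.ndarray of shape (H, W, 3), the original first key frame.
    keypoints: np.ndarray of shape (K, 2) holding (x, y) coordinates.
    Returns an array of shape (K, patch_size, patch_size, 3).
    """
    h, w = frame.shape[:2]
    half = patch_size // 2
    patches = []
    for x, y in keypoints.astype(int):
        x0, x1 = np.clip([x - half, x + half], 0, w)
        y0, y1 = np.clip([y - half, y + half], 0, h)
        patch = np.zeros((patch_size, patch_size, 3), dtype=frame.dtype)
        patch[: y1 - y0, : x1 - x0] = frame[y0:y1, x0:x1]
        patches.append(patch)
    return np.stack(patches)

def customize_action(frame, first_keypoints, adjustment, image_generator):
    # The second key points are obtained by adjusting the first key points.
    second_keypoints = first_keypoints + adjustment
    features = extract_neighborhood_features(frame, second_keypoints)
    # image_generator is the trained image generation model; it maps the key
    # points plus neighborhood features to the target first key frame.
    return image_generator(second_keypoints, features)
```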
The first human body key points include human body contour feature points of the character during the first action, such as the 14 pairs of white dots shown in Figure 4A; the second human body key points include human body contour feature points of the character during the second action, such as the 14 pairs of white dots shown in Figure 4B.
Compared with editing character actions using human skeleton feature points, editing character actions using human body contour feature points produces more accurate character actions that are less prone to deformation and distortion, improving the quality of the generated images.
Before the human body contour feature points of the character during the first action are adjusted, they are first extracted. Extracting the human body contour feature points of the character during the first action includes, for example: extracting the contour line of the character using a semantic segmentation network model; extracting a plurality of key points on the character using an object detection network model, such as the black dots shown in Figure 4C; connecting the plurality of key points according to the structural information of the character to determine a plurality of key connecting lines, such as the white straight lines shown in Figure 4C; and determining, from the intersections of the perpendiculars of the plurality of key connecting lines with the contour line, a plurality of paired human body contour feature points of the character during the first action.
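The sketch below illustrates one plausible way to realize the perpendicular-intersection construction for a single key connecting line: candidate contour points lying near the perpendicular through the line's midpoint are split into the two sides of the line, and the closest point on each side is returned as a pair. The tolerance, the use of the midpoint, and the nearest-point selection are assumptions made for illustration.

```python
import numpy as np

def contour_feature_points(contour, joint_a, joint_b, tol=1.5):
    """Paired contour feature points for one key connecting line.

    contour: np.ndarray of shape (N, 2), boundary points of the person mask
             produced by a semantic segmentation model.
    joint_a, joint_b: endpoints (x, y) of a key connecting line obtained by
             joining detected key points according to the body structure.
    Returns the two contour points closest to the perpendicular through the
    midpoint of the connecting line, one on each side, or None if not found.
    """
    a, b = np.asarray(joint_a, float), np.asarray(joint_b, float)
    mid = (a + b) / 2.0
    direction = (b - a) / (np.linalg.norm(b - a) + 1e-8)
    # Signed offset of each contour point along the connecting line, measured
    # from the midpoint: offsets near zero lie close to the perpendicular.
    along = (contour - mid) @ direction
    near_perp = contour[np.abs(along) < tol]
    if len(near_perp) < 2:
        return None
    # Split the candidates into the two sides of the connecting line and keep
    # the closest point on each side as the paired contour feature points.
    normal = np.array([-direction[1], direction[0]])
    side = (near_perp - mid) @ normal
    left, right = near_perp[side > 0], near_perp[side < 0]
    if len(left) == 0 or len(right) == 0:
        return None
    pick = lambda pts: pts[np.argmin(np.linalg.norm(pts - mid, axis=1))]
    return pick(left), pick(right)
```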
The method for obtaining the image generation model includes: taking a training video frame and the human body key points of the character in the training video frame as a pair of training data; taking the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of an image generation network; taking the training video frame in the training data as supervision information for the output of the image generation network; and training the image generation network to obtain the image generation model. Specifically, the gap between the video frame output by the image generation network based on the input data and the training video frame is used as the loss function, and the parameters of the image generation network are updated iteratively according to the loss determined by the loss function until the loss satisfies a certain condition; training is then complete, the video frames output by the image generation network are very close to the training video frames, and the trained image generation network is used as the image generation model. The image generation network can be any of a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow, and generative adversarial networks. If the image generation network is a generative adversarial network, the total loss function also includes the discrimination loss function of an image discrimination network.
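A minimal supervised training loop consistent with this description is sketched below, assuming PyTorch, an L1 gap as the loss function, and a dataloader yielding (key points, neighborhood features, training frame) triples; a GAN variant would add a discriminator and its discrimination loss.

```python
import torch
import torch.nn as nn

def train_image_generator(generator, dataloader, epochs=10, lr=2e-4, tol=1e-3):
    """Train the image generation network on paired training data.

    `generator` is any network mapping (key points, neighborhood features) to
    a frame; the training video frame supervises its output. The L1 gap is an
    assumed choice of loss; the stopping tolerance is likewise illustrative.
    """
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
    for epoch in range(epochs):
        epoch_loss = 0.0
        for keypoints, features, target_frame in dataloader:
            optimizer.zero_grad()
            output_frame = generator(keypoints, features)
            loss = criterion(output_frame, target_frame)  # gap to training frame
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Stop once the loss satisfies the condition (here: small average gap).
        if epoch_loss / max(len(dataloader), 1) < tol:
            break
    return generator
```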
In step S260, rendering and output are performed.
The character image is modeled using the material results processed in steps S220 to S250. Different rendering technologies can be selected according to the application scenario and combined with artificial intelligence technologies such as intelligent dialogue, speech recognition, speech synthesis and action interaction, to form a complete digital human video that can interact with the scene (i.e., the second video), which is then output.
In the above embodiments, the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing, for example, a digital human image, digital human expressions and digital human actions matching the interaction scene. With the method of the embodiments of the present disclosure, recording a single set of character videos makes it possible to quickly produce multiple sets of videos with different character styles for different scenes. Moreover, no professional engineer is required; users can adjust the character's image, expressions, actions and so on themselves according to the needs of the scene.
Figure 5 shows a schematic structural diagram of a digital human generation apparatus according to some embodiments of the present disclosure. As shown in Figure 5, the digital human generation apparatus 500 of this embodiment includes units 510 to 530.
The acquisition unit 510 is configured to acquire the first video; see step S220 for details.
The customization unit 520 is configured to edit the characters in each frame image of the first video according to the character customization information corresponding to the interaction scene; see steps S230 to S250 for details.
The customization unit 520 includes, for example, an image customization unit 521, an expression customization unit 522 and an action customization unit 523. The image customization unit 521 is configured to edit the character image in each frame image of the first video according to the character image customization information corresponding to the interaction scene; see step S230 for details. The expression customization unit 522 is configured to edit the character expressions in each frame image of the first video according to the character expression customization information corresponding to the interaction scene; see step S240 for details. The action customization unit 523 is configured to edit the character actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene; see step S250 for details.
The output unit 530 is configured to output the second video according to each frame image in the processed first video; see step S260 for details.
Figure 6 shows a schematic structural diagram of a digital human generation apparatus according to other embodiments of the present disclosure. As shown in Figure 6, the digital human generation apparatus 600 of this embodiment includes a memory 610 and a processor 620 coupled to the memory 610. The processor 620 is configured to execute the digital human generation method in any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, applications, a boot loader and other programs.
The processor 620 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, or as discrete hardware components such as discrete gates or transistors.
The apparatus 600 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640 and 650, as well as the memory 610 and the processor 620, may be connected via a bus 660, for example. The input/output interface 630 provides a connection interface for input/output devices such as a display, mouse, keyboard and touch screen. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB drives. The bus 660 may use any of a variety of bus structures, including but not limited to an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus and a Peripheral Component Interconnect (PCI) bus.
Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the digital human generation method in any of the foregoing embodiments are implemented.
Those skilled in the art will appreciate that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more non-transitory computer-readable storage media (including but not limited to disk memory, CD-ROM and optical storage) containing computer program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Claims (30)

  1. A digital human generation method, comprising:
    acquiring a first video;
    editing the characters in each frame image of the first video according to character customization information corresponding to an interaction scene;
    outputting a second video according to each frame image in the processed first video.
  2. The method according to claim 1, wherein the first video is obtained by preprocessing an original video, and the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  3. The method according to claim 2, wherein the resolution adjustment includes:
    if the resolution of the original video is higher than a required preset resolution, downsampling the original video to the preset resolution to obtain the first video at the preset resolution;
    if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain the first video at the preset resolution, the super-resolution model being used to raise the resolution of an input video to the preset resolution.
  4. The method according to claim 3, wherein the super-resolution model is obtained by training a neural network; during training, a first video frame from a high-definition video is downsampled to the preset resolution to obtain a second video frame, the second video frame is used as the input of the neural network, the first video frame is used as supervision information for the output of the neural network, and the neural network is trained to obtain the super-resolution model.
  5. The method according to claim 2, wherein the frame rate adjustment includes:
    if the frame rate of the original video is higher than a required preset frame rate, extracting frames from the original video according to the ratio between the frame rate of the original video and the preset frame rate to obtain the first video at the preset frame rate;
    if the frame rate of the original video is lower than the required preset frame rate, interpolating frames into the original video up to a first frame rate using a video frame interpolation model, the first frame rate being the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and extracting frames from the interpolated original video according to the ratio between the first frame rate and the preset frame rate to obtain the first video at the preset frame rate, the video frame interpolation model being used to generate a transition frame between any two frame images.
  6. The method according to claim 5, wherein the video frame interpolation model is obtained by training a neural network; during training, three consecutive frames of a training video frame sequence are taken as a triplet, the first frame and the third frame of the triplet are used as the input of the neural network, the second frame of the triplet is used as supervision information for the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
  7. The method according to claim 6, wherein the input of the neural network includes: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame.
  8. The method according to claim 1, wherein editing the characters in each frame image of the first video according to the character customization information corresponding to the interaction scene includes one or more of the following:
    editing the character image in each frame image of the first video according to character image customization information corresponding to the interaction scene;
    editing the character expressions in each frame image of the first video according to character expression customization information corresponding to the interaction scene;
    editing the character actions in each frame image of the first video according to character action customization information corresponding to the interaction scene.
  9. The method according to claim 8, wherein editing the character image in each frame image of the first video according to the character image customization information corresponding to the interaction scene includes:
    determining character image adjustment parameters according to character image adjustments made by a user in some video frames of the first video, and editing the character image in the remaining video frames of the first video according to the character image adjustment parameters.
  10. The method according to claim 9, wherein editing the character image in the remaining video frames of the first video according to the character image adjustment parameters includes:
    locating, by key point detection, the target part of the character in the remaining video frames of the first video according to the target part of the character image adjustment in the character image adjustment parameters;
    adjusting the amplitude or position of the located target part through graphics transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
  11. The method according to claim 8, wherein
    the character expression customization information includes preset classification information corresponding to a target expression, and
    editing the character expressions in each frame image of the first video according to the character expression customization information corresponding to the interaction scene includes:
    acquiring the feature information of each frame image in the first video, the feature information of face key points, and the classification information of the original expression;
    fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression, to obtain the feature information of a fused image corresponding to each frame image;
    generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image, all the fused images forming a second video in which the facial expression is the target expression.
  12. The method according to claim 11, wherein acquiring the feature information of each frame image in the first video, the feature information of the face key points and the classification information of the original expression includes:
    inputting each frame image in the first video into a face feature extraction model to obtain the output feature information of each frame image;
    inputting the feature information of each frame image into a face key point detection model to obtain the coordinate information of the face key points of each frame image, and reducing the dimensionality of the coordinate information of all the face key points by principal component analysis to obtain information of a preset dimension as the feature information of the face key points;
    inputting the feature information of each frame image into an expression classification model to obtain the classification information of the original expression of each frame image.
  13. The method according to claim 11, wherein fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
    adding and averaging the classification information of the original expression of each frame image and the preset classification information corresponding to the target expression, to obtain the classification information of the fused expression corresponding to each frame image;
    concatenating the feature information of the face key points of each frame image multiplied by a first weight obtained through training, the feature information of each frame image multiplied by a second weight obtained through training, and the classification information of the fused expression corresponding to each frame image.
  14. The method according to claim 12, wherein generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image includes:
    inputting the feature information of the fused image corresponding to each frame image into a decoder, and outputting the generated fused image corresponding to each frame image;
    wherein the face feature extraction model includes a convolution layer, and the decoder includes a deconvolution layer.
  15. The method according to claim 11, wherein
    the first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression are input into an expression generation model, which outputs the second video in which the facial expression is the target expression;
    the training method of the expression generation model includes:
    acquiring training pairs consisting of each frame image of a first training video and each frame image of a second training video;
    inputting each frame image of the first training video into a first generator; acquiring the feature information of each frame image of the first training video, the feature information of face key points and the classification information of the original expression; fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression, to obtain the feature information of each fused frame corresponding to the first training video; and obtaining, according to the feature information of each fused frame corresponding to the first training video, each fused frame corresponding to the first training video output by the first generator;
    inputting each frame image of the second training video into a second generator; acquiring the feature information of each frame image of the second training video, the feature information of face key points and the classification information of the target expression; fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression, to obtain the feature information of each fused frame corresponding to the second training video; and obtaining, according to the feature information of each fused frame corresponding to the second training video, each fused frame corresponding to the second training video output by the second generator;
    determining an adversarial loss and a cycle-consistency loss according to each fused frame corresponding to the first training video and each fused frame corresponding to the second training video;
    training the first generator and the second generator according to the adversarial loss and the cycle-consistency loss, the first generator being used as the expression generation model after training is completed.
  16. The method according to claim 15, further comprising:
    determining a pixel-to-pixel loss according to the pixel differences between each pair of adjacent fused frames corresponding to the first training video and the pixel differences between each pair of adjacent fused frames corresponding to the second training video;
    wherein training the first generator and the second generator according to the adversarial loss and the cycle-consistency loss includes:
    training the first generator and the second generator according to the adversarial loss, the cycle-consistency loss and the pixel-to-pixel loss.
  17. The method according to claim 15 or 16, wherein determining the adversarial loss according to each fused frame corresponding to the first training video and each fused frame corresponding to the second training video includes:
    inputting each fused frame corresponding to the first training video into a first discriminator to obtain a first discrimination result of each fused frame corresponding to the first training video;
    inputting each fused frame corresponding to the second training video into a second discriminator to obtain a second discrimination result of each fused frame corresponding to the second training video;
    determining a first adversarial loss according to the first discrimination result of each fused frame corresponding to the first training video, and determining a second adversarial loss according to the second discrimination result of each fused frame corresponding to the second training video.
  18. The method according to claim 17, wherein inputting each fused frame corresponding to the first training video into the first discriminator to obtain the first discrimination result of each fused frame corresponding to the first training video includes:
    inputting each fused frame corresponding to the first training video into a first face feature extraction model in the first discriminator, to obtain the output feature information of each fused frame corresponding to the first training video;
    inputting the feature information of each fused frame corresponding to the first training video into a first expression classification model in the first discriminator, to obtain the classification information of the expression of each fused frame corresponding to the first training video as the first discrimination result;
    and inputting each fused frame corresponding to the second training video into the second discriminator to obtain the second discrimination result of each fused frame corresponding to the second training video includes:
    inputting each fused frame corresponding to the second training video into a second face feature extraction model in the second discriminator, to obtain the output feature information of each fused frame corresponding to the second training video;
    inputting the feature information of each fused frame corresponding to the second training video into a second expression classification model in the second discriminator, to obtain the classification information of the expression of each fused frame corresponding to the second training video as the second discrimination result.
  19. The method according to claim 15 or 16, wherein the cycle-consistency loss is determined as follows:
    inputting each fused frame corresponding to the first training video into the second generator to generate reconstructed images of each frame of the first training video, and inputting each fused frame corresponding to the second training video into the first generator to generate reconstructed images of each frame of the second training video;
    determining the cycle-consistency loss according to the differences between the reconstructed images of each frame of the first training video and each frame image of the first training video, and the differences between the reconstructed images of each frame of the second training video and each frame image of the second training video.
  20. The method according to claim 16, wherein the pixel-to-pixel loss is determined as follows:
    for each position in each pair of adjacent fused frames corresponding to the first training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frames, and summing the distances corresponding to all positions to obtain a first loss;
    for each position in each pair of adjacent fused frames corresponding to the second training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frames, and summing the distances corresponding to all positions to obtain a second loss;
    summing the first loss and the second loss to obtain the pixel-to-pixel loss.
  21. The method according to claim 15, wherein acquiring the feature information of each frame image of the first training video, the feature information of the face key points and the classification information of the original expression includes:
    inputting each frame image of the first training video into a third face feature extraction model in the first generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all the face key points by principal component analysis to obtain first information of a preset dimension as the feature information of the face key points of each frame image of the first training video; and inputting the feature information of each frame image of the first training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video;
    and acquiring the feature information of each frame image of the second training video, the feature information of the face key points and the classification information of the target expression includes:
    inputting each frame image of the second training video into a fourth face feature extraction model in the second generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all the face key points by principal component analysis to obtain second information of a preset dimension as the feature information of the face key points of each frame image of the second training video; and inputting the feature information of each frame image of the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
  22. The method according to claim 15, wherein fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
    adding and averaging the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression, to obtain the classification information of the fused expression corresponding to each frame image of the first training video; and concatenating the feature information of the face key points of each frame image of the first training video multiplied by a first weight to be trained, the feature information of each frame image of the first training video multiplied by a second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video;
    and fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression includes:
    adding and averaging the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression, to obtain the classification information of the fused expression corresponding to each frame image of the second training video; and concatenating the feature information of the face key points of each frame image of the second training video multiplied by a third weight to be trained, the feature information of each frame image of the second training video multiplied by a fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video.
  23. The method according to claim 16, wherein training the first generator and the second generator according to the adversarial loss, the cycle-consistency loss and the pixel-to-pixel loss includes:
    performing a weighted sum of the adversarial loss, the cycle-consistency loss and the pixel-to-pixel loss to obtain a total loss;
    training the first generator and the second generator according to the total loss.
  24. The method according to claim 8, wherein
    editing the character actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene includes:
    adjusting first human body key points of the character in an original first key frame of the first video during a first action, to obtain second human body key points of the character during a second action as the character action customization information;
    extracting feature information of the neighborhood of each second human body key point from the original first key frame;
    inputting each second human body key point and the feature information of its neighborhood into an image generation model, and outputting a target first key frame of the character during the second action.
  25. The method according to claim 24, wherein the method for obtaining the image generation model includes:
    taking a training video frame and the human body key points of the character in the training video frame as a pair of training data, taking the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of an image generation network, taking the training video frame in the training data as supervision information for the output of the image generation network, and training the image generation network to obtain the image generation model.
  26. The method according to claim 24, wherein the first human body key points include human body contour feature points of the character during the first action, and the second human body key points include human body contour feature points of the character during the second action.
  27. A digital human generation apparatus, comprising:
    a memory; and a processor coupled to the memory, the processor being configured to execute the digital human generation method according to any one of claims 1-26 based on instructions stored in the memory.
  28. A digital human generation apparatus, comprising:
    an acquisition unit configured to acquire a first video;
    a customization unit configured to edit the characters in each frame image of the first video according to character customization information corresponding to an interaction scene;
    an output unit configured to output a second video according to each frame image in the processed first video.
  29. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the digital human generation method according to any one of claims 1-26.
  30. A computer program, comprising:
    instructions which, when executed by a processor, cause the processor to execute the digital human generation method according to any one of claims 1-26.
PCT/CN2023/087271 2022-05-18 2023-04-10 Digital human generation method and apparatus, and storage medium WO2023221684A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210541984.9 2022-05-18
CN202210541984.9A CN114863533A (en) 2022-05-18 2022-05-18 Digital human generation method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2023221684A1 true WO2023221684A1 (en) 2023-11-23

Family

ID=82639735

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087271 WO2023221684A1 (en) 2022-05-18 2023-04-10 Digital human generation method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN114863533A (en)
WO (1) WO2023221684A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863533A (en) * 2022-05-18 2022-08-05 京东科技控股股份有限公司 Digital human generation method and device and storage medium
CN115665507B (en) * 2022-12-26 2023-03-21 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349081A (en) * 2019-06-17 2019-10-18 达闼科技(北京)有限公司 Generation method, device, storage medium and the electronic equipment of image
US20210374391A1 (en) * 2020-05-28 2021-12-02 Science House LLC Systems, methods, and apparatus for enhanced cameras
CN113920230A (en) * 2021-09-15 2022-01-11 上海浦东发展银行股份有限公司 Character image video generation method and device, computer equipment and storage medium
CN114863533A (en) * 2022-05-18 2022-08-05 京东科技控股股份有限公司 Digital human generation method and device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576267A (en) * 2024-01-16 2024-02-20 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video
CN117576267B (en) * 2024-01-16 2024-04-12 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video

Also Published As

Publication number Publication date
CN114863533A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
WO2023221684A1 (en) Digital human generation method and apparatus, and storage medium
CN111489287A (en) Image conversion method, image conversion device, computer equipment and storage medium
Zhou et al. An image-based visual speech animation system
JP2024500896A (en) Methods, systems and methods for generating 3D head deformation models
Zhang et al. Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
CN115100334B (en) Image edge tracing and image animation method, device and storage medium
JP2024503794A (en) Method, system and computer program for extracting color from two-dimensional (2D) facial images
CN115393480A (en) Speaker synthesis method, device and storage medium based on dynamic nerve texture
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
CN113395569A (en) Video generation method and device
Tous Pictonaut: movie cartoonization using 3D human pose estimation and GANs
CN115035219A (en) Expression generation method and device and expression generation model training method and device
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications
CN115578298A (en) Depth portrait video synthesis method based on content perception
Sun et al. Generation of virtual digital human for customer service industry
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Lin et al. Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head
Eisert et al. Hybrid human modeling: making volumetric video animatable
CN117152843B (en) Digital person action control method and system
US20240169701A1 (en) Affordance-based reposing of an object in a scene
CN118175324A (en) Multidimensional generation framework for video generation
Zhang et al. REFA: Real-time Egocentric Facial Animations for Virtual Reality
KR20230163907A (en) Systen and method for constructing converting model for cartoonizing image into character image, and image converting method using the converting model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806640

Country of ref document: EP

Kind code of ref document: A1