CN114863533A - Digital human generation method and device and storage medium - Google Patents


Info

Publication number
CN114863533A
CN114863533A (application number CN202210541984.9A)
Authority
CN
China
Prior art keywords
frame
image
video
information
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210541984.9A
Other languages
Chinese (zh)
Inventor
王林芳
张炜
石凡
张琪
申童
左佳伟
梅涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Holding Co Ltd
Original Assignee
Jingdong Technology Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Holding Co Ltd filed Critical Jingdong Technology Holding Co Ltd
Priority to CN202210541984.9A
Publication of CN114863533A
Priority to PCT/CN2023/087271 (published as WO2023221684A1)

Classifications

    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06T 3/4046: Scaling the whole image or part thereof using neural networks
    • G06T 3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/40: Scenes; scene-specific elements in video content

Abstract

The disclosure provides a digital person generation method and device and a storage medium, and relates to the field of computer technology. The method comprises: acquiring a first video; editing the characters in each frame of image in the first video according to character customization information corresponding to an interactive scene; and outputting a second video according to each frame of image in the processed first video. By editing the characters in the video according to the character customization information corresponding to the interactive scene, a digital human video matched with the interactive scene is generated through character editing.

Description

Digital human generation method and device and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a digital person, and a storage medium.
Background
Driven by new technology waves such as artificial intelligence and virtual reality, digital humans have improved in every respect. Digital humans represented by virtual anchors and virtual staff have successfully entered the public view and have shone in fields such as film, gaming, media, tourism and finance.
Customization of the digital human image strives for authenticity and personalization. Under photographic-level, super-realistic requirements, every detail of the digital human image matters to the user. This places high demands on the human model when the image material is recorded. However, the model is not a robot and cannot achieve a perfect match, in timing and in motion positioning, with the interactive scene in which the digital human image will be used.
Disclosure of Invention
Embodiments of the present disclosure edit the characters in a video according to character customization information corresponding to an interactive scene, and generate, through character editing, a digital human video matched with the interactive scene.
Some embodiments of the present disclosure provide a method for generating a digital person, including:
acquiring a first video;
editing characters in each frame of image in the first video according to character customization information corresponding to the interactive scene;
and outputting the second video according to each frame of image in the processed first video.
In some embodiments, the first video is obtained by preprocessing an original video, where the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
In some embodiments, the resolution adjustment comprises:
if the resolution of the original video is higher than the required preset resolution, performing down-sampling on the original video according to the preset resolution to obtain a first video with the preset resolution;
and if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain a first video with the preset resolution, wherein the super-resolution model is used for increasing the resolution of the input video to the preset resolution.
In some embodiments, the super-resolution model is obtained by training a neural network, and in the training process, a first video frame from a high-definition video is down-sampled according to a preset resolution to obtain a second video frame, the second video frame is used as an input of the neural network, the first video frame is used as supervision information of the output of the neural network, and the neural network is trained to obtain the super-resolution model.
In some embodiments, the frame rate adjustment comprises:
if the frame rate of the original video is higher than the required preset frame rate, performing frame extraction on the original video according to the ratio information of the frame rate of the original video and the preset frame rate to obtain a first video of the preset frame rate;
if the frame rate of the original video is lower than the required preset frame rate, interpolating the original video to a first frame rate by using a video frame interpolation model, wherein the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and performing frame extraction on the interpolated video according to the ratio information of the first frame rate and the preset frame rate to obtain a first video with the preset frame rate, the video frame interpolation model being used for generating a transition frame between any two frames of images.
In some embodiments, the video frame interpolation model is obtained by training a neural network; in the training process, three consecutive frames in a training video frame sequence are used as a triplet, the first frame and the third frame in the triplet are used as the input of the neural network, the second frame in the triplet is used as supervision information for the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
In some embodiments, the inputs to the neural network include: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame.
In some embodiments, the editing the person in each image of the first video according to the person customization information corresponding to the interactive scene includes one or more of:
editing the figure image in each frame of image in the first video according to the figure image customization information corresponding to the interactive scene;
editing the character expression in each frame of image in the first video according to character expression customization information corresponding to the interactive scene;
and editing the character actions in each frame of image in the first video according to the character action customization information corresponding to the interactive scene.
In some embodiments, the editing the character image in each frame of image in the first video according to the character image customization information corresponding to the interactive scene includes: determining character image adjustment parameters according to the user's adjustment of the character image in a part of the video frames of the first video, and editing the character image in the remaining video frames of the first video according to the character image adjustment parameters.
In some embodiments, the editing the character in the rest video frames in the first video according to the character adjustment parameter includes:
detecting and positioning the target parts of the figures in the rest video frames in the first video through key points according to the target parts of the figure image adjustment in the figure image adjustment parameters;
and adjusting the amplitude or position of the positioned target part through graphic transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
In some embodiments, the character expression customization information includes preset classification information corresponding to a target expression, and the editing processing of the character expression in each frame of image in the first video according to the character expression customization information corresponding to the interactive scene includes:
acquiring feature information of each frame of image in a first video, feature information of face key points and classification information of original expressions;
fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image;
and generating a fusion image corresponding to each frame of image according to the feature information of the fusion image corresponding to each frame of image, wherein all the fusion images form a second video with the facial expression being the target expression.
In some embodiments, the obtaining feature information of each frame of image in the first video, feature information of a face key point, and classification information of an original expression includes:
inputting each frame of image in the first video into a human face feature extraction model to obtain feature information of each frame of output image;
inputting the feature information of each frame of image into a face key point detection model to obtain coordinate information of the face key points of each frame of image, and performing dimensionality reduction on the coordinate information of all the face key points by adopting a principal component analysis method to obtain preset dimensionality information serving as the feature information of the face key points;
and inputting the characteristic information of each frame of image into an expression classification model to obtain the classification information of the original expression of each frame of image.
In some embodiments, the fusing the feature information of each frame of image, the feature information of the key points of the face, the classification information of the original expression, and the preset classification information corresponding to the target expression includes:
adding and averaging the classification information of the original expression of each frame of image and preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image;
and splicing the feature information of the face key points of each frame of image multiplied by the trained first weight, the feature information of each frame of image multiplied by the trained second weight and the classification information of the fusion expression corresponding to each frame of image.
In some embodiments, the generating a fused image corresponding to each frame of image according to the feature information of the fused image corresponding to each frame of image includes:
inputting the feature information of the fused image corresponding to each frame of image into a decoder, and outputting the generated fused image corresponding to each frame of image;
the face feature extraction model comprises a convolution layer, and the decoder comprises a deconvolution layer.
In some embodiments, a first video with a facial expression of an original expression and preset classification information corresponding to a target expression are input into an expression generation model, and a second video with the facial expression of the target expression is output; the training method of the expression generation model comprises the following steps:
acquiring a training pair consisting of each frame image of the first training video and each frame image of the second training video;
inputting each frame image of the first training video into a first generator, acquiring feature information of each frame image of the first training video, feature information of a face key point and classification information of an original expression, fusing the feature information of each frame image of the first training video, the feature information of the face key point, the classification information of the original expression and preset classification information corresponding to a target expression to obtain feature information of each frame fused image corresponding to the first training video, and obtaining each frame fused image corresponding to the first training video output by the first generator according to the feature information of each frame fused image corresponding to the first training video;
inputting each frame of image of the second training video into a second generator, acquiring feature information of each frame of image of the second training video, feature information of a face key point and classification information of a target expression, fusing the feature information of each frame of image of the second training video, the feature information of the face key point, the classification information of the target expression and preset classification information corresponding to an original expression to obtain feature information of each frame of fused image corresponding to the second training video, and obtaining each frame of fused image corresponding to the second training video output by the second generator according to the feature information of each frame of fused image corresponding to the second training video;
determining an adversarial loss and a cycle consistency loss according to the fused images of the frames corresponding to the first training video and the fused images of the frames corresponding to the second training video;
and training the first generator and the second generator according to the adversarial loss and the cycle consistency loss, and using the first generator as the expression generation model after the training of the first generator is finished.
In some embodiments, further comprising: determining pixel-to-pixel loss according to the pixel difference between every two adjacent frames of fused images corresponding to the first training video and the pixel difference between every two adjacent frames of fused images corresponding to the second training video;
wherein training the first generator and the second generator according to the adversarial loss and the cycle consistency loss comprises:
training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss.
In some embodiments, the determining an adversarial loss according to the fused images of the frames corresponding to the first training video and the fused images of the frames corresponding to the second training video includes: inputting each frame of fused image corresponding to the first training video into a first discriminator to obtain a first discrimination result of each frame of fused image corresponding to the first training video;
inputting each frame of fused image corresponding to the second training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the second training video;
and determining a first adversarial loss according to the first discrimination result of each frame of fused image corresponding to the first training video, and determining a second adversarial loss according to the second discrimination result of each frame of fused image corresponding to the second training video.
In some embodiments, inputting the frames of fused images corresponding to the first training video into a first discriminator to obtain a first discrimination result of the frames of fused images corresponding to the first training video includes:
inputting each frame of fused image corresponding to the first training video into a first face feature extraction model in the first discriminator to obtain feature information of each frame of fused image corresponding to the output first training video;
inputting the feature information of each frame of fused image corresponding to the first training video into a first expression classification model in the first discriminator to obtain the classification information of the expression of each frame of fused image corresponding to the first training video as a first discrimination result;
inputting each frame of fused image corresponding to the second training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the second training video comprises:
inputting each frame of fused image corresponding to the second training video into a second face feature extraction model in the second discriminator to obtain feature information of each frame of fused image corresponding to the output second training video;
and inputting the feature information of each frame of fused image corresponding to the second training video into a second expression classification model in the second discriminator to obtain the classification information of the expression of each frame of fused image corresponding to the second training video as a second discrimination result.
In some embodiments, the cycle consistency loss is determined using the following method:
inputting each frame fusion image corresponding to the first training video into the second generator to generate each frame reconstruction image of the first training video, and inputting each frame fusion image corresponding to the second training video into the first generator to generate each frame reconstruction image of the second training video;
and determining the cycle consistency loss according to the difference between each frame of reconstructed image of the first training video and each frame of image of the first training video and the difference between each frame of reconstructed image of the second training video and each frame of image of the second training video.
In some embodiments, the pixel-to-pixel loss is determined using the following method:
determining the distance between the expression vectors of two pixels at each position in each two adjacent frames of fused images corresponding to the first training video, and summing the distances corresponding to all the positions to obtain a first loss;
determining the distance between the expression vectors of the two pixels at each position in every two adjacent frames of fused images corresponding to the second training video, and summing the distances corresponding to all the positions to obtain a second loss;
and adding the first loss and the second loss to obtain the pixel-to-pixel loss.
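The following is a minimal PyTorch sketch of such a pixel-to-pixel loss, provided for illustration only; the tensor layout (frame sequences of shape (T, C, H, W)) and the use of an L2 distance between the per-pixel vectors are assumptions, since the disclosure only speaks of a distance between expression vectors.

```python
import torch

def pixel_to_pixel_loss(fused_a: torch.Tensor, fused_b: torch.Tensor) -> torch.Tensor:
    """Temporal smoothness loss over two sequences of fused frames.

    fused_a / fused_b: tensors of shape (T, C, H, W) holding the fused
    frames generated for the first / second training video.
    """
    def sequence_loss(frames: torch.Tensor) -> torch.Tensor:
        # Difference between every pair of adjacent fused frames.
        diff = frames[1:] - frames[:-1]                 # (T-1, C, H, W)
        # L2 distance between the per-pixel vectors, summed over all positions.
        return diff.pow(2).sum(dim=1).sqrt().sum()

    first_loss = sequence_loss(fused_a)    # first training video
    second_loss = sequence_loss(fused_b)   # second training video
    return first_loss + second_loss        # pixel-to-pixel loss
```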
In some embodiments, the obtaining feature information of each frame of image of the first training video, feature information of a face key point, and classification information of an original expression includes: inputting each frame of image in the first training video into a third face feature extraction model in the first generator to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimension of the coordinate information of all face key points by adopting a principal component analysis method to obtain first information of a preset dimension, wherein the first information is used as feature information of the face key points of each frame of image of the first training video; and inputting the feature information of each frame of image in the first training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the first training video;
the acquiring the feature information of each frame of image of the second training video, the feature information of the face key point and the classification information of the target expression comprises: inputting each frame of image in the second training video into a fourth face feature extraction model in the second generator to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimension of the coordinate information of all face key points by adopting a principal component analysis method to obtain second information of a preset dimension, wherein the second information is used as the feature information of the face key points of each frame of image of the second training video; and inputting the feature information of each frame of image in the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the second training video.
In some embodiments, the fusing the feature information of each frame of image of the first training video, the feature information of the face key point, the classification information of the original expression, and the preset classification information corresponding to the target expression includes: adding and averaging the classification information of the original expression of each frame of image of the first training video and the preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image of the first training video; splicing the feature information of the face key points of each frame of image of the first training video multiplied by the first weight to be trained, the feature information of each frame of image of the first training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the first training video;
the fusing the feature information of each frame of image of the second training video, the feature information of the face key point, the classification information of the target expression and the preset classification information corresponding to the original expression comprises: adding and averaging the classification information of the target expression of each frame of image of the second training video and the preset classification information corresponding to the original expression to obtain the classification information of the fusion expression corresponding to each frame of image of the second training video; and splicing the feature information of the face key points of each frame of image of the second training video multiplied by the third weight to be trained, the feature information of each frame of image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the second training video.
In some embodiments, said training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss comprises: weighting and summing the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss to obtain a total loss; and training the first generator and the second generator according to the total loss.
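The sketch below illustrates, under assumed generator and discriminator interfaces, how the weighted total loss could drive one optimization step over both generators. The loss weights, the cross-entropy adversarial term and the L1 cycle term are illustrative choices and are not taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def train_generators_step(G1, G2, D1, D2, frames_a, frames_b,
                          cls_orig, cls_target, opt_g,
                          w_adv=1.0, w_cycle=10.0, w_p2p=1.0):
    """One generator update. Assumptions: G1 maps original-expression
    frames towards the target expression and G2 does the reverse; the
    discriminators return expression-classification logits used as the
    discrimination result; cls_orig / cls_target are one-hot tensors of
    shape (T, num_classes); frames_* have shape (T, C, H, W)."""
    fused_a = G1(frames_a, cls_target)   # fused frames of the first training video
    fused_b = G2(frames_b, cls_orig)     # fused frames of the second training video

    # Adversarial losses: fused frames should be classified as the expression
    # they are supposed to show.
    adv = (F.cross_entropy(D1(fused_a), cls_target.argmax(dim=1)) +
           F.cross_entropy(D2(fused_b), cls_orig.argmax(dim=1)))

    # Cycle consistency: mapping back should reconstruct the original frames.
    recon_a = G2(fused_a, cls_orig)
    recon_b = G1(fused_b, cls_target)
    cycle = (recon_a - frames_a).abs().mean() + (recon_b - frames_b).abs().mean()

    # Pixel-to-pixel loss between adjacent fused frames (as in the earlier sketch).
    p2p = ((fused_a[1:] - fused_a[:-1]).pow(2).sum(dim=1).sqrt().sum() +
           (fused_b[1:] - fused_b[:-1]).pow(2).sum(dim=1).sqrt().sum())

    loss = w_adv * adv + w_cycle * cycle + w_p2p * p2p   # weighted total loss
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
```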
In some embodiments, the editing the character motion in each frame of image in the first video according to the character motion customization information corresponding to the interactive scene includes:
adjusting first human body key points of a character in an original first key frame in the first video during a first action to obtain second human body key points of the character during a second action as character action customization information;
extracting feature information of each second human body key point neighborhood from the original first key frame;
and inputting the characteristic information of each second human body key point and the neighborhood thereof into an image generation model, and outputting a target first key frame of the character in the second action.
In some embodiments, the method for obtaining the image generation model includes: the method comprises the steps of taking a training video frame and human key points of figures in the training video frame as a pair of training data, taking the human key points in the training data and feature information of neighborhoods of the human key points in the training video frame as input of an image generation network, taking the training video frame in the training data as output supervision information of the image generation network, and training the image generation network to obtain an image generation model.
In some embodiments, the first human keypoints comprise human silhouette feature points of the person at the time of the first action, and the second human keypoints comprise human silhouette feature points of the person at the time of the second action.
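A sketch of the key-point-neighborhood feature extraction described above is shown below; the patch size, the padding strategy and the interface of the image generation model are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def extract_keypoint_neighborhoods(frame: torch.Tensor,
                                   keypoints: torch.Tensor,
                                   radius: int = 16) -> torch.Tensor:
    """Crop a square patch around every adjusted human key point.

    frame:     (C, H, W) original first key frame
    keypoints: (K, 2) integer pixel coordinates (x, y) of the second human
               key points, i.e. the person's contour points after the
               action adjustment
    radius:    half patch size (an illustrative choice)
    Returns a tensor of shape (K, C, 2*radius, 2*radius).
    """
    c, h, w = frame.shape
    patches = []
    for x, y in keypoints.tolist():
        x0, x1 = max(0, x - radius), min(w, x + radius)
        y0, y1 = max(0, y - radius), min(h, y + radius)
        patch = frame[:, y0:y1, x0:x1]
        # Pad to a fixed size so all patches can be stacked.
        pad = (0, 2 * radius - (x1 - x0), 0, 2 * radius - (y1 - y0))
        patches.append(F.pad(patch, pad))
    return torch.stack(patches)

# Usage sketch (`image_generator` is the trained image generation model;
# its exact interface is an assumption):
# patches = extract_keypoint_neighborhoods(original_key_frame, adjusted_keypoints)
# target_key_frame = image_generator(adjusted_keypoints, patches)
```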
Some embodiments of the present disclosure provide a digital person generation apparatus, including: a memory; and a processor coupled to the memory, the processor configured to perform the digital person generation methods of the various embodiments based on instructions stored in the memory.
Some embodiments of the present disclosure provide a digital person generation apparatus, including:
an acquisition unit configured to acquire a first video;
the customizing unit is configured to edit the characters in each frame of image in the first video according to the character customizing information corresponding to the interactive scene;
and the output unit is configured to output the second video according to each frame image in the processed first video.
Some embodiments of the present disclosure provide a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, performs the steps of the digital human generation method of the various embodiments.
Drawings
The drawings that are required to be used in the embodiments or the related art description will be briefly described below. The present disclosure can be understood more clearly from the following detailed description, which proceeds with reference to the accompanying drawings.
It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without undue inventive faculty.
Fig. 1A illustrates a flow diagram of a digital person generation method of some embodiments of the present disclosure.
FIG. 1B shows a flow diagram of a digital person generation method of further embodiments of the disclosure.
Fig. 2 illustrates a schematic diagram of video pre-processing of some embodiments of the present disclosure.
Fig. 3A illustrates a flow diagram of an expression generation method of some embodiments of the present disclosure.
Fig. 3B is a schematic diagram illustrating expression generation methods according to further embodiments of the present disclosure.
Fig. 3C illustrates a flow diagram of a training method of an expression generation model according to some embodiments of the disclosure.
Fig. 3D illustrates a schematic diagram of a training method of an expression generation model according to some embodiments of the present disclosure.
Fig. 4A illustrates a schematic diagram of human contour feature points of a character in a first action according to some embodiments of the present disclosure.
Fig. 4B illustrates a schematic diagram of human contour feature points of a character in a second action according to some embodiments of the present disclosure.
FIG. 4C illustrates a schematic diagram of a plurality of keypoints and a plurality of keypoints-connections on a character, according to some embodiments of the present disclosure.
Fig. 5 shows a schematic structural diagram of a digital human generating device according to some embodiments of the present disclosure.
Fig. 6 shows a schematic structural diagram of a digital human generating device according to further embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure.
Unless otherwise specified, "first", "second", and the like in the present disclosure are described to distinguish different objects, and are not intended to mean size, timing, or the like.
Fig. 1A illustrates a flow diagram of a digital person generation method of some embodiments of the present disclosure.
As shown in fig. 1A, the digital person generation method of this embodiment includes the following steps.
In step S110, a first video is acquired.
The first video may be, for example, a recorded original video, or may be obtained by performing preprocessing on the original video, where the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
In step S120, the person in each image of the first video is edited according to the person customization information corresponding to the interactive scene.
The editing processing of the characters in the images of the frames in the first video according to the character customization information corresponding to the interactive scene comprises one or more of the following steps: editing the figure images in each frame of image in the first video according to the figure image customization information corresponding to the interactive scene to generate digital figure images matched with the interactive scene; editing the character expression in each frame of image in the first video according to character expression customization information corresponding to the interactive scene to generate a digital character expression matched with the interactive scene; and editing the character actions in each frame of image in the first video according to the character action customization information corresponding to the interactive scene to generate the digital character actions matched with the interactive scene.
In step S130, a second video is output based on each frame image in the processed first video.
That is, the frames of images in the processed first video are combined to form a second video, which is a digital human video matched with the interactive scene.
In the above embodiment, the characters in the video are edited according to the character customization information corresponding to the interactive scene, and the digital human video matching the interactive scene is generated through character editing, for example, a digital human image, a digital human expression, a digital human action, and the like matching the interactive scene are generated.
FIG. 1B shows a flow diagram of a digital person generation method of further embodiments of the disclosure.
As shown in fig. 1B, the digital person generation method of this embodiment includes the following steps.
In step S210, customization logic control is performed.
The customization logic control controls whether customization steps such as video preprocessing, image customization, expression customization and action customization are executed, their execution order, and so on.
The contents edited by the respective parts, such as video preprocessing, image customization, expression customization and action customization, are independent and have no strong dependency on one another, so the execution order of the parts can be changed while still achieving the basic effect of generating a digital human video matched with the interactive scene. However, the parts do influence one another to some extent; following the execution order of S220 to S250 in this embodiment minimizes this mutual influence and presents the final character image most effectively.
In step S220, the video is pre-processed.
The video preprocessing is to preprocess the recorded original video to obtain a first video, wherein the preprocessing comprises one or more of resolution adjustment, interframe smoothing processing and frame rate adjustment.
In some embodiments, as shown in fig. 2, the preprocessing is performed in sequence according to the order of resolution adjustment, inter-frame smoothing processing, and frame rate adjustment, so that the video preprocessing effect is better, the visual information of the original video can be retained to the greatest extent, the preprocessed video is guaranteed not to have quality problems such as blurring and distortion, and the influence of the frame rate adjustment and the resolution adjustment on the subsequent digital person customization process is minimized.
The resolution adjustment includes: if the resolution of the original video is higher than the required preset resolution, down-sampling the original video according to the preset resolution to obtain a first video with the preset resolution; if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain a first video with the preset resolution, wherein the super-resolution model is used for increasing the resolution of the input video to the preset resolution; if the resolution of the original video is equal to the required preset resolution, the resolution adjustment step can be skipped.
Through resolution adjustment, the consistency of the preprocessed first video in the aspect of resolution can be kept, and the influence of the differentiated resolution of the original video on the digital human customization effect is reduced.
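A minimal Python (PyTorch) sketch of this resolution-adjustment dispatch follows, for illustration only; the bilinear down-sampling and the super-resolution model interface are assumptions.

```python
import torch
import torch.nn.functional as F

def adjust_resolution(frames: torch.Tensor, target_hw: tuple,
                      super_resolution_model: torch.nn.Module) -> torch.Tensor:
    """Bring a video to the preset resolution.

    frames:    (T, C, H, W) frames of the original video
    target_hw: (H_preset, W_preset) preset resolution
    """
    h, w = frames.shape[-2:]
    th, tw = target_hw
    if (h, w) == (th, tw):
        return frames                               # resolution already matches: skip
    if h > th and w > tw:
        # Higher than required: down-sample to the preset resolution.
        return F.interpolate(frames, size=(th, tw), mode="bilinear",
                             align_corners=False)
    # Lower than required: raise the resolution with the super-resolution model.
    with torch.no_grad():
        return super_resolution_model(frames)
```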
The super-resolution model is obtained by training a neural network. For example, in the training process, a first video frame from a high-definition video is down-sampled according to a preset resolution to obtain a second video frame; the second video frame is used as the input of the neural network, the first video frame is used as supervision information for the output of the neural network, and the neural network is trained to obtain the super-resolution model. The difference information between the video frame output by the neural network and the first video frame is used as the loss function, and the parameters of the neural network are iteratively updated according to the loss determined by the loss function until the loss meets a certain condition; training is then finished, at which point the video frame output by the neural network is very close to the first video frame, and the trained neural network is used as the super-resolution model. The neural network here denotes a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow, generative adversarial networks, and so on.
For example, a key frame of a high-definition video (1080p) is down-sampled to obtain a second video frame with a lower resolution (such as 360p/480p/720p), and a super-resolution model is obtained according to the above training method; using the super-resolution model, a first video with a resolution of 480p/720p/1080p or the like can be obtained from an original video of arbitrary resolution. Here, 360p/480p/720p/1080p are video display formats, where p denotes progressive scanning; for example, the 1080p picture resolution is 1920 × 1080.
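The training procedure described above can be sketched as follows; the degraded input size, the bilinear down-sampling and the L1 loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sr_training_step(model, optimizer, first_frames, low_hw=(360, 640)):
    """One training step of the super-resolution network.

    first_frames: (B, C, H, W) first video frames taken from a
                  high-definition video; they act as the supervision
                  information for the network output.
    low_hw:       resolution of the degraded input (second video frame);
                  the concrete size is an assumption.
    """
    # Down-sample the high-definition frames to obtain the second video frames.
    second_frames = F.interpolate(first_frames, size=low_hw, mode="bilinear",
                                  align_corners=False)
    restored = model(second_frames)               # output at the preset resolution
    loss = F.l1_loss(restored, first_frames)      # difference to the first video frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```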
After the resolution is adjusted, the texture information between two frames may differ somewhat in a frame sequence generated by the super-resolution model or obtained by down-sampling. Inter-frame smoothing is therefore adopted to ensure that textures, character edges and the like do not show aliasing (jagged edges) or moiré patterns during video playback, avoiding a visual impact.
The inter-frame smoothing may be performed by, for example, averaging. For example, the image information of three consecutive frames is averaged, and the average is used as the image information of the intermediate frame in the three consecutive frames.
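A minimal NumPy sketch of this three-frame averaging is given below; leaving the first and last frames unchanged is an assumption.

```python
import numpy as np

def smooth_frames(frames: np.ndarray) -> np.ndarray:
    """Average every three consecutive frames and use the mean as the
    image information of the middle frame. `frames` has shape (T, H, W, C)."""
    smoothed = frames.astype(np.float32).copy()
    for t in range(1, len(frames) - 1):
        smoothed[t] = frames[t - 1:t + 2].mean(axis=0)   # mean of frames t-1, t, t+1
    return smoothed.astype(frames.dtype)
```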
The frame rate adjustment comprises: if the frame rate of the original video is higher than the required preset frame rate, performing frame extraction on the original video according to the ratio information of the frame rate of the original video and the preset frame rate to obtain a first video with the preset frame rate; if the frame rate of the original video is lower than the required preset frame rate, interpolating the original video to a first frame rate by using a video frame interpolation model, wherein the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and performing frame extraction on the interpolated video according to the ratio information of the first frame rate and the preset frame rate to obtain a first video with the preset frame rate, the video frame interpolation model being used for generating a transition frame between any two frames of images; if the frame rate of the original video is equal to the required preset frame rate, the frame rate adjustment step can be skipped.
Through frame rate adjustment, the preprocessed first video stays consistent in frame rate, and the influence of differing original-video frame rates on the digital human customization effect is reduced. Moreover, the frame interpolation operation can also effectively solve the problem of jumping between two actions. For example, when the digital human finishes action A and then performs action B, a video without frame interpolation makes the character's motion appear to jump during playback, which looks unnatural.
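The ratio / least-common-multiple logic can be sketched as follows (math.lcm requires Python 3.9+); the return format is illustrative only.

```python
import math

def frame_rate_plan(original_fps: int, preset_fps: int):
    """Decide how to reach the preset frame rate from the original one."""
    if original_fps == preset_fps:
        return {"op": "keep"}
    if original_fps > preset_fps:
        # Frame extraction: keep frames according to the ratio of the two rates.
        return {"op": "extract", "step": original_fps / preset_fps}
    # Lower than required: interpolate up to the least common multiple (the
    # first frame rate), then extract frames down to the preset frame rate.
    first_rate = math.lcm(original_fps, preset_fps)
    return {"op": "interpolate_then_extract",
            "interp_factor": first_rate // original_fps,   # transition frames per gap
            "extract_step": first_rate // preset_fps}
```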
The video frame interpolation model is obtained by training a neural network. For example, in the training process, three consecutive frames in a training video frame sequence are used as a triplet; the first frame and the third frame in the triplet are used as the input of the neural network, the second frame in the triplet is used as supervision information for the output of the neural network, and the neural network is trained to obtain the video frame interpolation model. The difference information between the video frame output by the neural network for the input first and third frames and the second frame in the triplet is used as the loss function, and the parameters of the neural network are iteratively updated according to the loss determined by the loss function until the loss meets a certain condition; training is then finished, at which point the video frame output by the neural network is very close to the second frame in the triplet. The trained neural network is used as the video frame interpolation model and can generate a transition frame between any two frames of images. The neural network here denotes a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow, generative adversarial networks, and so on.
The inputs of the neural network include, for example: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame. Through the fusion of the four parts of information, the deduced transition frame to be inserted between the two frames can enable the video transition to be smoother.
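A sketch of the triplet construction and one training step follows; the auxiliary estimators producing the visual features, depth, optical flow and deformation information, the model signature and the L1 loss are assumptions.

```python
import torch
import torch.nn.functional as F

def make_triplets(frame_seq: torch.Tensor):
    """Yield (first, second, third) triplets from a training frame sequence
    of shape (T, C, H, W): the first and third frames are the network input
    and the second frame supervises the output."""
    for t in range(len(frame_seq) - 2):
        yield frame_seq[t], frame_seq[t + 1], frame_seq[t + 2]

def interpolation_step(model, optimizer, first, second, third,
                       visual_feats, depth, flow, warp):
    """One training step; visual_feats/depth/flow/warp are the four fused
    input components, assumed to come from separate pre-existing estimators."""
    predicted = model(first, third, visual_feats, depth, flow, warp)
    loss = F.l1_loss(predicted, second)   # difference to the middle (second) frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```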
In step S230, the character is customized.
The character image in each frame of image in the first video is edited according to the character image customization information corresponding to the interactive scene, so as to meet the user's requirements on the appearance and figure of the digital human. Image customization includes, for example, skin smoothing, face slimming, eye enlargement and facial position adjustment, as well as body-proportion adjustments such as body slimming, leg lengthening and other body beautification operations.
In some embodiments, character image adjustment parameters are determined according to the user's adjustment of the character image in a part of the video frames of the first video, and the character image in the remaining video frames of the first video is edited according to the character image adjustment parameters. A "part of the video frames" may be, for example, one or several key frames of the first video. In this way, image customization of the digital human across the whole video can be completed with only a small amount of editing work, improving customization efficiency and reducing customization cost.
Editing the character image in the remaining video frames of the first video according to the character image adjustment parameters includes: detecting and locating, through key points, the target part of the character in the remaining video frames of the first video according to the target part to be adjusted in the character image adjustment parameters, the target part being, for example, a facial feature or a body part; and adjusting the amplitude or position of the located target part through a graphical transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
For example, if the user enlarges the character's eyes in some key frames, the face is first detected by a face detection technique, the character's eyes in the remaining video frames are then located by a key point detection technique, and the size of the eyes in the remaining video frames is then adjusted by a graphical transformation according to the amplitude information of the user's eye enlargement, for example the distance between the upper and lower eyelids, so as to achieve the effect of beautified eyes in all frames of the video.
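A toy sketch of applying such an adjustment to a remaining frame is given below; the radial magnification used as the "graphical transformation", the interpretation of the amplitude as a magnification strength, and the externally supplied eye key points are all illustrative assumptions.

```python
import numpy as np

def enlarge_eyes(frame: np.ndarray, eye_centers: np.ndarray,
                 amplitude: float, radius: int = 40) -> np.ndarray:
    """Magnify the eye regions of one frame.

    frame:       (H, W, C) image of a remaining video frame
    eye_centers: (2, 2) integer (x, y) centres of the located eyes
    amplitude:   magnification strength in (0, 1) derived from the key-frame
                 adjustment (illustrative)
    radius:      influence radius in pixels (illustrative)
    """
    h, w = frame.shape[:2]
    out = frame.copy()
    yy, xx = np.mgrid[0:h, 0:w]
    for cx, cy in eye_centers:
        dist = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
        mask = dist < radius
        # Pull each pixel's source location towards the eye centre,
        # which magnifies the region.
        scale = 1.0 - amplitude * (1.0 - dist[mask] / radius)
        src_x = (cx + (xx[mask] - cx) * scale).astype(int).clip(0, w - 1)
        src_y = (cy + (yy[mask] - cy) * scale).astype(int).clip(0, h - 1)
        out[yy[mask], xx[mask]] = frame[src_y, src_x]
    return out
```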
In step S240, the expression is customized.
Expression customization edits the character expression in each frame of image in the first video according to character expression customization information corresponding to the interactive scene, for example preset classification information corresponding to a target expression. This expression generation method realizes control of the digital human's facial expression in the interactive scene: one expression state of the digital human can be transferred to another, target expression state, while ensuring that only the facial expression changes and the digital human's speaking mouth shape, head movement and the like are not affected. In this way, when the digital human expresses the corresponding language content, the expression can change accordingly along with that content.
Fig. 3A is a flow chart of some embodiments of the expression generation method of the present disclosure. As shown in fig. 3A, the method of this embodiment includes: steps S310 to S330.
In step S310, feature information of each frame of image in the first video, feature information of a face key point, and classification information of an original expression are obtained.
The facial expression in the first video is the original expression. That is, the facial expression in each frame of image in the first video is mainly the original expression, and the original expression is, for example, a calm expression.
In some embodiments, each frame of image in the first video is input into a face feature extraction model, and feature information of each frame of output image is obtained; inputting the characteristic information of each frame of image into a face key point detection model to obtain the coordinate information of the face key points of each frame of image; reducing the dimensions of the coordinate information of all face key points by adopting a Principal Component Analysis (PCA) method to obtain information of preset dimensions as the feature information of the face key points; and inputting the characteristic information of each frame of image into the expression classification model to obtain the classification information of the original expression of each frame of image.
The overall expression generation model comprises an encoder and a decoder. The encoder may comprise a face feature extraction model, a face key point detection model and an expression classification model, the face feature extraction model being connected to the face key point detection model and the expression classification model. The face feature extraction model may adopt an existing model, for example a deep learning model with a feature extraction function such as VGG-19, ResNet or Transformer; the part of VGG-19 before block 5 can be used as the face feature extraction model. The face key point detection model and the expression classification model may also adopt existing models, such as an MLP (multi-layer perceptron), specifically a 3-layer MLP. Expressions are generated after the training of the expression generation model is finished; the training process is described in detail later.
The feature information of each frame of image in the first video is, for example, a feature map output by the face feature extraction model. The key points include, for example, 68 key points such as the chin, the eyebrow centers and the mouth corners, each key point being represented by the horizontal and vertical coordinates of its position. After the coordinate information of each key point is obtained through the face key point detection model, in order to reduce redundant information and improve efficiency, the coordinate information of all face key points is reduced in dimension through PCA to obtain information of a preset dimension (for example 6 dimensions, which achieves the best effect) as the feature information of the face key points. The expression classification model can output classifications of expressions such as neutral, happy and sad, which can be represented by one-hot coded vectors. The classification information of the original expression may be the one-hot code of the classification of the original expression in each frame of image in the first video, obtained through the expression classification model.
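A schematic PyTorch encoder along these lines is sketched below; the small CNN standing in for the VGG-19-style feature extractor, the layer sizes of the two 3-layer MLP heads and the scikit-learn PCA call are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

class ExpressionEncoder(nn.Module):
    """Face feature extraction, key point detection and expression
    classification, as described above (all layer sizes are illustrative)."""
    def __init__(self, num_keypoints=68, num_expressions=4):
        super().__init__()
        self.backbone = nn.Sequential(               # face feature extraction model
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        feat_dim = 64 * 8 * 8
        self.keypoint_head = nn.Sequential(           # face key point detection model (3-layer MLP)
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_keypoints * 2),        # (x, y) per key point
        )
        self.expression_head = nn.Sequential(         # expression classification model (3-layer MLP)
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, num_expressions),
        )

    def forward(self, images):
        feats = self.backbone(images).flatten(1)      # feature information of each frame
        keypoints = self.keypoint_head(feats)         # key point coordinate information
        expr_logits = self.expression_head(feats)     # expression classification information
        return feats, keypoints, expr_logits

# The key point coordinates of all frames are then reduced to the preset
# dimension (6 in the disclosure) with PCA:
# pca = PCA(n_components=6).fit(all_keypoints)        # all_keypoints: (N_frames, 136)
# keypoint_features = pca.transform(all_keypoints)
```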
In step S320, the feature information of each frame of image, the feature information of the key points of the face, the classification information of the original expression, and the preset classification information corresponding to the target expression are fused to obtain the feature information of the fused image corresponding to each frame of image.
In some embodiments, the classification information of the original expression of each frame of image is added and averaged with the preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image; and splicing the feature information of the face key points of each frame of image multiplied by the first weight obtained by training, the feature information of each frame of image multiplied by the second weight obtained by training and the classification information of the fusion expression corresponding to each frame of image.
The target expression is different from the original expression and is, for example, a smiling expression; the preset classification information corresponding to the target expression is, for example, a preset one-hot code of the target expression. The preset classification information does not need to be obtained through a model; it is coded directly according to the preset coding rule (one-hot). For example, the calm expression is coded as 1000 and the smiling expression as 0100. The classification information of the original expression mentioned above is obtained through the expression classification model and may differ from the preset classification information corresponding to the original expression; for example, the original expression is a calm expression whose preset one-hot code is 1000, but the vector output by the expression classification model may be (0.8, 0.2, 0, 0).
The encoder may also comprise a feature fusion model; the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression are input into the feature fusion model for fusion. The parameters to be trained in the feature fusion model comprise a first weight and a second weight. For each frame of image, the trained first weight is multiplied by the feature information of the face key points of the image to obtain a first feature vector, the trained second weight is multiplied by the feature information of the image to obtain a second feature vector, and the first feature vector, the second feature vector and the classification information of the fused expression corresponding to the image are concatenated to obtain the feature information of the fused image corresponding to the image. The first weight and the second weight serve to unify the value ranges of the three kinds of information.
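A minimal sketch of this fusion step, with the first and second weights as the trainable parameters (the scalar form of the weights is an assumption):

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse image features, key point features and expression classification
    information, as described above."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.ones(1))   # first weight (key point features)
        self.w2 = nn.Parameter(torch.ones(1))   # second weight (image features)

    def forward(self, image_feats, keypoint_feats, orig_expr_cls, target_expr_onehot):
        # Classification information of the fused expression: element-wise average.
        fused_expr = (orig_expr_cls + target_expr_onehot) / 2.0
        # Concatenate weighted key point features, weighted image features
        # and the fused expression classification.
        return torch.cat([self.w1 * keypoint_feats,
                          self.w2 * image_feats,
                          fused_expr], dim=1)
```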
In step S330, a fused image corresponding to each frame of image is generated according to the feature information of the fused image corresponding to each frame of image, and all the fused images are combined to form a second video with a facial expression being a target expression.
In some embodiments, the feature information of the fused image corresponding to each frame of image is input into a decoder, and the generated fused image corresponding to each frame of image is output. The face feature extraction model includes convolutional layers and the decoder includes deconvolution layers, so an image can be generated from the features. The decoder is, for example, block 5 of VGG-19 with the last convolutional layer replaced by a deconvolution layer. The fused image is an image whose facial expression is the target expression, and the fused frames form the second video.
Some application examples of the present disclosure are described below in conjunction with fig. 3B.
As shown in fig. 3B, feature extraction is performed on a frame of image in the first video to obtain a feature map, and face key point detection and expression classification are performed on the feature map respectively. PCA is applied to the coordinate information of the key points obtained by the face key point detection, and the information reduced to the preset dimension is used as the key point features. The one-hot classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the expression classification vector (the classification information of the fused expression). The feature map of the face, the expression classification vector and the key point features are then fused to obtain the feature information of the fused image, and feature decoding is performed on the feature information of the fused image to obtain a face image with the target expression.
The scheme of this embodiment extracts the feature information of each frame of image in the first video, the feature information of the face key points and the classification information of the original expression, fuses the extracted information with the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame, and then generates the fused image corresponding to each frame from that feature information; all the fused images form the second video, in which the facial expression is the target expression. Because the feature information of the face key points is extracted and used in the feature fusion, the expression in the fused image is more realistic and smooth; the target expression is generated directly through fusion with its preset classification information and is compatible with the facial motion and mouth shape of the character in the original image, so the mouth shape, head movement and the like of the character are not affected, the sharpness of the original image is preserved, and the generated video is stable, clear and smooth.
Fig. 3C is a flow chart of some embodiments of a training method of the expression generation model of the present disclosure. The expression generation model can output and obtain a second video with the facial expression being the target expression according to the input first video with the facial expression being the original expression and the preset classification information corresponding to the target expression.
As shown in fig. 3C, the method of this embodiment includes: steps S410 to S450.
In step S410, a training pair consisting of each frame image of the first training video and each frame image of the second training video is acquired.
The first training video is a video in which the facial expression is the original expression, and the second training video is a video in which the facial expression is the target expression; the frames of the first training video do not need to correspond one-to-one with the frames of the second training video. The classification information of the original expression and the classification information of the target expression are labeled.
A large number of videos of people speaking with different expressions are used as training data, domain transfer learning is performed through deep learning, a first generator that converts one expression state into another expression state is learned, and the expression generation result is fused with the whole digital human.
In step S420, each frame of image of the first training video is input into the first generator, feature information of each frame of image of the first training video, feature information of a face key point, and classification information of an original expression are obtained, the feature information of each frame of image of the first training video, the feature information of the face key point, the classification information of the original expression, and preset classification information corresponding to a target expression are fused to obtain feature information of each frame of fused image corresponding to the first training video, and each frame of fused image corresponding to the first training video output by the first generator is obtained according to the feature information of each frame of fused image corresponding to the first training video.
The first generator is used as the expression generation model after training is completed. In some embodiments, each frame of image in the first training video is input into a third face feature extraction model in the first generator to obtain the output feature information of each frame of image; the feature information of each frame of image is input into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame of image; the coordinate information of all face key points is reduced in dimension by principal component analysis to obtain first information of a preset dimension, which is used as the feature information of the face key points of each frame of image of the first training video; and the feature information of each frame of image in the first training video is input into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the first training video.
Principal Component Analysis (PCA) is carried out on the coordinate information of the face key points, reducing it to 6 dimensions (a large number of experiments showed that 6 dimensions give the best effect). The PCA involves no trainable parameters: the PCA projection and the correspondence between input and output feature dimensions do not change during training, and during back-propagation only the fixed projection obtained by the initial PCA is used to pass gradients to the preceding parameters.
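A minimal sketch of this dimension reduction, assuming scikit-learn's PCA and 68 two-dimensional key points per frame (both illustrative assumptions), is:

```python
# Sketch of reducing flattened face key-point coordinates to 6 dimensions by PCA.
import numpy as np
from sklearn.decomposition import PCA

n_frames, n_keypoints = 200, 68                     # illustrative sizes
coords = np.random.rand(n_frames, n_keypoints * 2)  # flattened key-point coordinates

pca = PCA(n_components=6)
keypoint_features = pca.fit_transform(coords)       # shape: (200, 6)

# The projection matrix is fixed after fitting; during back-propagation only this
# fixed linear mapping is used to pass gradients to the preceding parameters.
print(keypoint_features.shape, pca.components_.shape)  # (200, 6) (6, 136)
```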
In some embodiments, the classification information of the original expression of each frame of image of the first training video is added and averaged with the preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image of the first training video; and splicing the feature information of the face key points of each frame of image of the first training video multiplied by the first weight to be trained, the feature information of each frame of image of the first training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the first training video to obtain the feature information of each frame of fusion image corresponding to the first training video.
The first generator comprises a first feature fusion model, and the first weight and the second weight are parameters to be trained in the first feature fusion model. The above-described processes of feature extraction and feature fusion may refer to the foregoing embodiments.
The first generator includes a first encoder and a first decoder. The first encoder includes the third face feature extraction model, the first face key point detection model, the third expression classification model and the first feature fusion model. The feature information of each frame of fused image corresponding to the first training video is input into the first decoder to obtain each generated frame of fused image corresponding to the first training video.
In step S430, each frame of image of the second training video is input into the second generator, the feature information of each frame of image of the second training video, the feature information of the face key point, and the classification information of the target expression are obtained, the feature information of each frame of image of the second training video, the feature information of the face key point, the classification information of the target expression, and the preset classification information corresponding to the original expression are fused to obtain the feature information of each frame of fused image corresponding to the second training video, and each frame of fused image corresponding to the second training video output by the second generator is obtained according to the feature information of each frame of fused image corresponding to the second training video.
The second generator is identical or similar in structure to the first generator, and the training target of the second generator is to generate a video having the same expression as the first training video based on the second training video.
In some embodiments, each frame of image in the second training video is input into a fourth face feature extraction model in the second generator, and feature information of each output frame of image is obtained; inputting the feature information of each frame of image into a second face key point detection model in a second generator to obtain the coordinate information of the face key points of each frame of image; and reducing the dimension of the coordinate information of all the face key points by adopting a principal component analysis method to obtain second information of a preset dimension, wherein the second information is used as the feature information of the face key points of each frame of image of the second training video. And inputting the feature information of each frame of image in the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the second training video.
The feature information of the face key points of each frame of image of the second training video has the same dimension as that of the first training video, for example, 6 dimensions.
In some embodiments, the classification information of the target expression of each frame of image of the second training video is added and averaged with the preset classification information corresponding to the original expression to obtain the classification information of the fusion expression corresponding to each frame of image of the second training video; and splicing the feature information of the face key points of each frame of image of the second training video multiplied by the third weight to be trained, the feature information of each frame of image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the second training video to obtain the feature information of each frame of fusion image corresponding to the second training video.
The preset classification information corresponding to the original expression does not need to be obtained through a model; it is directly coded according to the preset coding rule. The second generator comprises a second feature fusion model, and the third weight and the fourth weight are parameters to be trained in the second feature fusion model. The above processes of feature extraction and feature fusion may refer to the foregoing embodiments and are not described again.
The second generator includes a second encoder and a second decoder. The second encoder includes the fourth face feature extraction model, the second face key point detection model, the fourth expression classification model and the second feature fusion model. The feature information of each frame of fused image corresponding to the second training video is input into the second decoder to obtain each generated frame of fused image corresponding to the second training video.
In step S440, the adversarial loss and the cycle consistency loss are determined according to the frames of fused images corresponding to the first training video and the frames of fused images corresponding to the second training video.
End-to-end training is performed based on generative adversarial learning and cross-domain transfer learning, which can improve both the accuracy of the model and the training efficiency.
In some embodiments, the adversarial loss is determined using the following method: each frame of fused image corresponding to the first training video is input into a first discriminator to obtain a first discrimination result for each such frame; each frame of fused image corresponding to the second training video is input into a second discriminator to obtain a second discrimination result for each such frame; a first adversarial loss is determined according to the first discrimination results, and a second adversarial loss is determined according to the second discrimination results.
Further, in some embodiments, the frames of fused images corresponding to the first training video are input into a first face feature extraction model in the first discriminator to obtain their feature information; this feature information is input into a first expression classification model in the first discriminator, and the resulting classification information of the expression of each frame of fused image corresponding to the first training video is used as the first discrimination result. Similarly, the frames of fused images corresponding to the second training video are input into a second face feature extraction model in the second discriminator to obtain their feature information, which is input into a second expression classification model in the second discriminator, and the resulting classification information of the expression of each frame of fused image corresponding to the second training video is used as the second discrimination result.
The overall model comprises two sets of generators and discriminators in the training process. The first discriminator and the second discriminator have the same or similar structures and both comprise a face feature extraction model and an expression classification model. The first facial feature extraction model and the second facial feature extraction model are the same as or similar to the third facial feature extraction model and the fourth facial feature extraction model in structure, and the first expression classification model and the second expression classification model are the same as or similar to the third expression classification model and the fourth expression classification model in structure.
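A rough sketch of one such discriminator, using a torchvision VGG-19 backbone as the face feature extraction model and a linear expression classification head (both illustrative assumptions), is:

```python
# Sketch of a discriminator of the kind described above: a convolutional face
# feature extraction model followed by an expression classification head.
import torch
import torch.nn as nn
from torchvision.models import vgg19


class ExpressionDiscriminator(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.face_features = vgg19(weights=None).features     # face feature extraction model
        self.expression_classifier = nn.Sequential(            # expression classification model
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, num_classes))

    def forward(self, fused_frame):
        # The classification of the expression serves as the discrimination result.
        return self.expression_classifier(self.face_features(fused_frame))


d_y = ExpressionDiscriminator()
print(d_y(torch.randn(1, 3, 256, 256)).shape)   # torch.Size([1, 7])
```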
For example, let X = {x_i} denote the data of the first video and Y = {y_i} denote the data of the second video. The first generator G realizes X → Y and is trained so that G(x) is as close to Y as possible; the first discriminator D_Y is used to judge whether each frame of fused image corresponding to the first training video is real. The first adversarial loss can be expressed by the following equation:
L_GAN(G, D_Y, X, Y) = E_{y~p_data(y)}[log D_Y(y)] + E_{x~p_data(x)}[log(1 - D_Y(G(x)))]   (1)
The second generator F realizes Y → X and is trained so that F(y) is as close to X as possible; the second discriminator D_X is used to judge whether each frame of fused image corresponding to the second training video is real. The second adversarial loss can be expressed by the following equation:
L_GAN(F, D_X, Y, X) = E_{x~p_data(x)}[log D_X(x)] + E_{y~p_data(y)}[log(1 - D_X(F(y)))]   (2)
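A minimal sketch of these adversarial losses in standard GAN form is given below; for simplicity each discriminator is assumed to return a single real/fake score, which abstracts away the expression-classification judgment described above, and the toy discriminators and random tensors are placeholders.

```python
# Sketch of the adversarial losses corresponding to equations (1) and (2).
import torch

def adversarial_loss(discriminator, real, fake):
    # E[log D(real)] + E[log(1 - D(fake))]
    eps = 1e-8
    return (torch.log(torch.sigmoid(discriminator(real)) + eps).mean()
            + torch.log(1.0 - torch.sigmoid(discriminator(fake)) + eps).mean())

d_y = lambda img: img.mean(dim=(1, 2, 3))   # stand-in for discriminator D_Y
d_x = lambda img: img.mean(dim=(1, 2, 3))   # stand-in for discriminator D_X
x, y = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)          # real frames
g_x, f_y = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)      # G(x), F(y)
loss_gan_g = adversarial_loss(d_y, y, g_x)  # corresponds to equation (1)
loss_gan_f = adversarial_loss(d_x, x, f_y)  # corresponds to equation (2)
```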
In some embodiments, the cycle consistency loss is determined using the following method: each frame of fused image corresponding to the first training video is input into the second generator to generate the reconstructed frames of the first training video, and each frame of fused image corresponding to the second training video is input into the first generator to generate the reconstructed frames of the second training video; the cycle consistency loss is then determined according to the difference between the reconstructed frames of the first training video and the original frames of the first training video, and the difference between the reconstructed frames of the second training video and the original frames of the second training video.
In order to further improve the accuracy of the model, the images generated by the first generator are input into the second generator to obtain the reconstructed frames of the first training video, and it is desirable that these reconstructed frames are as consistent as possible with the original frames of the first training video, that is, F(G(x)) ≈ x. Likewise, the images generated by the second generator are input into the first generator to obtain the reconstructed frames of the second training video, and it is desirable that these reconstructed frames are as consistent as possible with the original frames of the second training video, that is, G(F(y)) ≈ y.
The difference between each reconstructed frame of the first training video and the corresponding original frame can be determined as follows: for the reconstructed frame and its corresponding original frame, the distance (e.g., Euclidean distance) between the representation vectors of the pixels at each same position is determined, and all the distances are summed. The difference between each reconstructed frame of the second training video and the corresponding original frame is determined in the same way.
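A minimal sketch of the cycle consistency loss is given below; an L1 distance is used as a simple stand-in for the per-pixel distance described above, and all tensors are placeholders.

```python
# Sketch of the cycle consistency loss: the reconstructed frames F(G(x)) and
# G(F(y)) should stay close to the original frames x and y.
import torch

def cycle_consistency_loss(x, x_reconstructed, y, y_reconstructed):
    return (x_reconstructed - x).abs().sum() + (y_reconstructed - y).abs().sum()

x = torch.rand(4, 3, 64, 64)      # frames of the first training video
y = torch.rand(4, 3, 64, 64)      # frames of the second training video
x_rec = torch.rand_like(x)        # F(G(x)), reconstructed frames of the first video
y_rec = torch.rand_like(y)        # G(F(y)), reconstructed frames of the second video
loss_cyc = cycle_consistency_loss(x, x_rec, y, y_rec)
```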
In step S450, the first generator and the second generator are trained according to the adversarial loss and the cycle consistency loss.
The first adversarial loss, the second adversarial loss and the cycle consistency loss can be weighted and summed to obtain a total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss may be determined using the following equation:
L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F)   (3)
where L_cyc(G, F) represents the cycle consistency loss, and λ is a weight that can be obtained by training.
In order to further improve the accuracy of the model and ensure the stability and continuity of the output video, a loss based on the pixel difference between adjacent video frames is added during training. In some embodiments, a pixel-to-pixel loss is determined based on the pixel differences between every two adjacent frames of fused images corresponding to the first training video and the pixel differences between every two adjacent frames of fused images corresponding to the second training video, and the first generator and the second generator are trained based on the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss.
Further, in some embodiments, for each position in every two adjacent frames of fused images corresponding to the first training video, the distance between the representation vectors of the two pixels at that position in the two adjacent frames is determined, and the distances for all positions are summed to obtain a first loss; the same is done for every two adjacent frames of fused images corresponding to the second training video to obtain a second loss; and the first loss and the second loss are added to obtain the pixel-to-pixel loss. The pixel-to-pixel loss prevents two adjacent frames of the generated video from changing too much.
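A minimal sketch of this pixel-to-pixel loss is given below, assuming the generated frames of one video are stacked along the first axis in temporal order and that a pixel's representation vector is its channel vector.

```python
# Sketch of the pixel-to-pixel loss between adjacent generated frames.
import torch

def pixel_to_pixel_loss(frames):
    # frames: (T, C, H, W) generated frames in temporal order
    diff = frames[1:] - frames[:-1]                    # every two adjacent frames
    # Euclidean distance between the channel vectors at each position, summed.
    return torch.sqrt((diff ** 2).sum(dim=1) + 1e-12).sum()

fused_frames_x = torch.rand(8, 3, 64, 64)   # G(x_i), i = 1..T (first loss input)
fused_frames_y = torch.rand(8, 3, 64, 64)   # F(y_j), j = 1..T (second loss input)
loss_p2p = pixel_to_pixel_loss(fused_frames_x) + pixel_to_pixel_loss(fused_frames_y)
```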
In some embodiments, the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss are weighted and summed to yield a total loss, and the first generator and the second generator are trained on the total loss. For example, the total loss may be determined using the following equation:
L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ_1 L_cyc(G, F) + λ_2 L_P2P(G(x_i), G(x_{i+1})) + λ_3 L_P2P(F(y_j), F(y_{j+1}))   (4)
where λ_1, λ_2 and λ_3 are weights that can be obtained by training, L_P2P(G(x_i), G(x_{i+1})) represents the first loss, and L_P2P(F(y_j), F(y_{j+1})) represents the second loss.
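A short sketch of combining the terms of equation (4) into a single scalar for back-propagation is given below; the default weight values are illustrative, not taken from the disclosure.

```python
# Sketch of the weighted total loss of equation (4).
import torch

def total_loss(loss_gan_g, loss_gan_f, loss_cyc, loss_p2p_x, loss_p2p_y,
               lambda_1=10.0, lambda_2=1.0, lambda_3=1.0):
    # Weighted sum corresponding to equation (4).
    return (loss_gan_g + loss_gan_f
            + lambda_1 * loss_cyc
            + lambda_2 * loss_p2p_x
            + lambda_3 * loss_p2p_y)

# With the individual losses computed as in the previous sketches, the generators
# are updated by back-propagating this single scalar:
dummy_terms = [torch.tensor(1.0, requires_grad=True) for _ in range(5)]
loss = total_loss(*dummy_terms)
loss.backward()
```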
As shown in fig. 3D, before end-to-end training, the models of each part may be pre-trained. For example, a large amount of open-source face recognition data is used to pre-train a face recognition model, and the part before the output feature map is taken as the face feature extraction model (this choice is not unique; for VGG-19, for example, the part before block5 can be selected, outputting a feature map with dimensions of 8 × 8 × 512). The face feature extraction model and its parameters are then fixed, and the network behind it is split into two branches, a face key point detection model and an expression classification model, which are fine-tuned separately on a face key point detection data set and expression classification data to train the parameters of these two models. The face key point detection model is not unique; any convolutional-network-based model that yields accurate key points can be used. The expression classification model is a single-label classification task based on a convolutional network. After pre-training, the end-to-end training process of the previous embodiments can be performed, which improves training efficiency.
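A rough sketch of this pre-training scheme is given below, assuming a torchvision VGG-19 backbone, 68 face key points and 7 expression classes; all of these choices are illustrative assumptions.

```python
# Sketch of pre-training: a frozen convolutional face feature extractor with two
# branches (face key point detection and expression classification) on top.
import torch
import torch.nn as nn
from torchvision.models import vgg19

backbone = vgg19(weights=None).features        # convolutional face feature extractor
for p in backbone.parameters():
    p.requires_grad = False                    # fix the extractor and its parameters

keypoint_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                              nn.Linear(512, 68 * 2))   # (x, y) for 68 key points
expression_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                nn.Linear(512, 7))      # single-label classification

features = backbone(torch.randn(1, 3, 256, 256))        # 1 x 512 x 8 x 8 feature map
print(keypoint_head(features).shape, expression_head(features).shape)
```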
The method of this embodiment trains the whole model using the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss between adjacent video frames, which improves the accuracy of the model, increases the efficiency of the end-to-end training process and saves computing resources.
The scheme of the disclosure is suitable for editing facial expressions in video. The disclosure adopts a dedicated deep learning model that fuses techniques such as expression recognition and key point detection. Through training on data, the model learns how facial key points move under different expressions, and the facial expression state output by the model is controlled by inputting the classification information of the target expression. The expression exists as a style state that can be superimposed well while the person speaks or makes actions such as tilting the head or blinking, so that the finally output facial action video of the person is natural and harmonious. The output result can have the same resolution and level of detail as the input image and remains stable, clear and free of flaws even at 1080p or 2K resolution.
In step S250, the action is customized.
And the action customization refers to editing and processing the actions of the figures in each frame of image in the first video according to the figure action customization information corresponding to the interactive scene, so as to realize the editing and control of the actions of the digital figures in the interactive scene.
In some embodiments, editing the character motion in each frame of image in the first video according to the character action customization information corresponding to the interactive scene includes: adjusting first human body key points of a character in an original first key frame in the first video during a first action to obtain second human body key points of the character during a second action as the character action customization information; extracting feature information of the neighborhood of each second human body key point from the original first key frame by using a feature extraction model, such as a convolutional model; and inputting each second human body key point and the feature information of its neighborhood into an image generation model, which outputs a target first key frame of the character in the second action.
The first human body key points comprise human body contour feature points of the character in the first action, such as 14 pairs of white dots shown in fig. 4A, and the second human body key points comprise human body contour feature points of the character in the second action, such as 14 pairs of white dots shown in fig. 4B.
Editing character actions with human body contour feature points, compared with editing them with human skeleton feature points, produces more accurate actions, is less prone to deformation and distortion, and improves the quality of the generated images.
Before adjusting the human body contour feature points of the person in the first action, these contour feature points are extracted, for example as follows: the contour line of the person is extracted using a semantic segmentation network model; a plurality of key points on the person, such as the black dots shown in fig. 4C, are extracted using a target detection network model; the key points are connected according to the structure information of the person to determine a plurality of key connecting lines, such as the white straight lines shown in fig. 4C; and a plurality of paired human body contour feature points of the person in the first action are determined from the intersections of the perpendiculars of the key connecting lines with the contour line, as sketched below.
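The following numpy sketch illustrates one way such paired contour feature points could be located, given a contour polygon and one key connecting line; the geometric construction is an assumption consistent with the description above, not the exact procedure of this disclosure.

```python
# Sketch: for one key connecting line (segment between two body key points), take
# the contour points nearest to the perpendicular through its midpoint, one on
# each side of the segment, as a pair of body contour feature points.
import numpy as np

def contour_feature_pair(contour, p1, p2):
    mid = (p1 + p2) / 2.0
    direction = (p2 - p1) / (np.linalg.norm(p2 - p1) + 1e-8)
    rel = contour - mid
    along = rel @ direction                                     # offset along the segment axis
    side = direction[0] * rel[:, 1] - direction[1] * rel[:, 0]  # which side of the segment
    dist_to_perpendicular = np.abs(along)
    left = contour[side > 0][np.argmin(dist_to_perpendicular[side > 0])]
    right = contour[side < 0][np.argmin(dist_to_perpendicular[side < 0])]
    return left, right

# Toy example: a circular contour and one vertical key connecting line through it.
t = np.linspace(0.0, 2.0 * np.pi, 200)
contour = np.stack([100 + 80 * np.cos(t), 100 + 80 * np.sin(t)], axis=1)
p1, p2 = np.array([100.0, 60.0]), np.array([100.0, 140.0])
print(contour_feature_pair(contour, p1, p2))   # roughly (20, 100) and (180, 100)
```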
The image generation model is obtained as follows: a training video frame and the human body key points of the person in that frame are taken as a pair of training data; the human body key points in the training data and the feature information of their neighborhoods in the training video frame are used as the input of an image generation network, the training video frame itself is used as supervision information for the output of the image generation network, and the image generation network is trained to obtain the image generation model. The difference between the video frame output by the image generation network for the input data and the training video frame is used as the loss function, and the parameters of the image generation network are updated iteratively according to this loss until it satisfies a certain condition; at that point the output of the network is very close to the training video frame, training is complete, and the trained network is used as the image generation model. The image generation network can be any of a broad class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow, generative adversarial networks and so on. If the image generation network is a generative adversarial network, the overall loss function also includes the discrimination loss of the image discrimination network.
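A minimal PyTorch sketch of this training procedure is given below; the toy generator, the rasterization of key points into input channels, and the L1 loss are illustrative assumptions.

```python
# Sketch of training the image generation network: key-point input, frame as
# supervision, difference-based loss, iterative parameter updates.
import torch
import torch.nn as nn

generator = nn.Sequential(                       # toy image generation network
    nn.Conv2d(28, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

keypoint_maps = torch.rand(8, 28, 128, 128)      # 14 point pairs -> 28 input channels
target_frames = torch.rand(8, 3, 128, 128)       # training video frames (supervision)

for _ in range(10):                              # a few illustrative steps
    pred = generator(keypoint_maps)
    loss = (pred - target_frames).abs().mean()   # difference-based loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```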
In step S260, the output is rendered.
The character image is modeled using the material obtained in steps S220-S250; different rendering technologies are selected according to the application scene and combined with artificial intelligence technologies such as intelligent dialogue, speech recognition, speech synthesis and action interaction, so as to compose and output a complete digital human video (namely, the second video) capable of interacting in the scene.
In the above embodiment, the characters in the video are edited according to the character customization information corresponding to the interactive scene, and the digital human video matching the interactive scene is generated through character editing, for example, a digital human image, a digital human expression, a digital human action, and the like matching the interactive scene are generated. According to the method disclosed by the embodiment of the disclosure, a set of character image videos is recorded, and a plurality of sets of videos with different character image styles in different scenes can be rapidly produced. In addition, a professional engineer is not required to access the system, and the user can automatically adjust the image, expression, action and the like of the character according to the scene requirement.
Fig. 5 shows a schematic structural diagram of a digital human generating device according to some embodiments of the present disclosure. As shown in FIG. 5, the digital human generation apparatus 500 of this embodiment includes units 510-530.
The obtaining unit 510 is configured to obtain the first video, which may specifically refer to step S220.
The customizing unit 520 is configured to edit the person in each frame of image in the first video according to the person customizing information corresponding to the interactive scene, which may be specifically referred to in steps S230 to 250.
The customizing unit 520 includes, for example, an avatar customizing unit 521, an expression customizing unit 522, an action customizing unit 523, and the like. The avatar customizing unit 521 is configured to edit the avatar in each frame of image in the first video according to the avatar customizing information corresponding to the interactive scene, as shown in step S230. The expression customizing unit 522 is configured to edit the human expression in each frame of image in the first video according to the human expression customizing information corresponding to the interactive scene, which may be specifically referred to in step S240. The action customizing unit 523 is configured to edit the character actions in each frame of image in the first video according to the character action customizing information corresponding to the interactive scene, which may be specifically referred to in step S250.
An output unit 530 configured to output the second video according to each frame image in the processed first video, see step S260.
Fig. 6 shows a schematic structural diagram of a digital human generating device according to further embodiments of the present disclosure. As shown in fig. 6, the digital person generation apparatus 600 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610, the processor 620 configured to perform the digital human generation method of any of the foregoing embodiments based on instructions stored in the memory 610.
Memory 610 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), and other programs.
The Processor 620 may be implemented as discrete hardware components such as a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), other Programmable logic devices, discrete gates, or transistors.
The apparatus 600 may also include an input-output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networking devices. The storage interface 650 provides a connection interface for external storage devices such as an SD card and a USB disk. The bus 660 may use any of a variety of bus architectures, including but not limited to an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, and a Peripheral Component Interconnect (PCI) bus.
Some embodiments of the present disclosure propose a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the digital human generation method in any of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more non-transitory computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (29)

1. A method for digital human generation, comprising:
acquiring a first video;
editing characters in each frame of image in the first video according to character customization information corresponding to the interactive scene;
and outputting the second video according to each frame of image in the processed first video.
2. The method of claim 1, wherein the first video is obtained from an original video through a pre-process, and the pre-process comprises one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
3. The method of claim 2, wherein the resolution adjustment comprises:
if the resolution of the original video is higher than the required preset resolution, performing down-sampling on the original video according to the preset resolution to obtain a first video with the preset resolution;
and if the resolution ratio of the original video is lower than the required preset resolution ratio, processing the original video by using a super-resolution model to obtain a first video with the preset resolution ratio, wherein the super-resolution model is used for increasing the resolution ratio of the input video to the preset resolution ratio.
4. The method according to claim 3, wherein the super-resolution model is obtained by training a neural network, and in the training process, a first video frame from the high-definition video is down-sampled according to a preset resolution to obtain a second video frame, the second video frame is used as an input of the neural network, the first video frame is used as supervision information of an output of the neural network, and the neural network is trained to obtain the super-resolution model.
5. The method of claim 2, wherein the frame rate adjustment comprises:
if the frame rate of the original video is higher than the required preset frame rate, performing frame extraction on the original video according to the ratio information of the frame rate of the original video and the preset frame rate to obtain a first video of the preset frame rate;
if the frame rate of the original video is lower than the required preset frame rate, inserting the original video to a first frame rate by using a video frame insertion model, wherein the first frame rate is the least common multiple of the frame rate before the original video is inserted into the frame and the preset frame rate, extracting the frame of the original video after the frame insertion according to the proportion information of the first frame rate and the preset frame rate to obtain a first video with the preset frame rate, and the video frame insertion model is used for generating a transition frame between any two frames of images.
6. The method according to claim 5, wherein the video frame interpolation model is obtained by training a neural network, and in the training process, continuous three frames in the training video frame sequence are used as triples, a first frame and a third frame in the triples are used as input of the neural network, a second frame in the triples is used as supervision information of output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
7. The method of claim 6, wherein the inputs to the neural network comprise: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame.
8. The method of claim 1, wherein the editing the character in each image of the first video according to the character customization information corresponding to the interactive scene comprises one or more of the following:
editing the figure image in each frame of image in the first video according to the figure image customization information corresponding to the interactive scene;
editing the character expression in each frame of image in the first video according to character expression customization information corresponding to the interactive scene;
and editing the character actions in each frame of image in the first video according to the character action customization information corresponding to the interactive scene.
9. The method of claim 8, wherein the editing the character in each frame of image in the first video according to the character customization information corresponding to the interactive scene comprises:
determining character image adjusting parameters according to character image adjustment of a part of video frames in a first video of a user, and editing character images in the rest video frames in the first video according to the character image adjusting parameters.
10. The method of claim 9, wherein the editing the character in the rest of the video frames of the first video according to the character adjustment parameter comprises:
detecting and positioning the target parts of the figures in the rest video frames in the first video through key points according to the target parts of the figure image adjustment in the figure image adjustment parameters;
and adjusting the amplitude or position of the positioned target part through graphic transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
11. The method of claim 8,
the character expression customizing information comprises preset classification information corresponding to the target expression,
the editing processing of the character expression in each frame of image in the first video according to the character expression customization information corresponding to the interactive scene comprises the following steps:
acquiring feature information of each frame of image in a first video, feature information of face key points and classification information of original expressions;
fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame of image;
and generating a fusion image corresponding to each frame of image according to the feature information of the fusion image corresponding to each frame of image, wherein all the fusion images form a second video with the facial expression being the target expression.
12. The method of claim 11, wherein the obtaining of the feature information of each frame of image in the first video, the feature information of the face key point, and the classification information of the original expression comprises:
inputting each frame of image in the first video into a human face feature extraction model to obtain feature information of each frame of output image;
inputting the feature information of each frame of image into a face key point detection model to obtain coordinate information of the face key points of each frame of image, and performing dimensionality reduction on the coordinate information of all the face key points by adopting a principal component analysis method to obtain information of a preset dimensionality as the feature information of the face key points;
and inputting the characteristic information of each frame of image into an expression classification model to obtain the classification information of the original expression of each frame of image.
13. The method of claim 11, wherein the fusing the feature information of each frame of image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression comprises:
adding and averaging the classification information of the original expression of each frame of image and preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image;
and splicing the feature information of the face key points of each frame of image multiplied by the trained first weight, the feature information of each frame of image multiplied by the trained second weight and the classification information of the fusion expression corresponding to each frame of image.
14. The method according to claim 12, wherein the generating the fused image corresponding to each frame of image according to the feature information of the fused image corresponding to each frame of image comprises:
inputting the feature information of the fused image corresponding to each frame of image into a decoder, and outputting the generated fused image corresponding to each frame of image;
the face feature extraction model comprises a convolution layer, and the decoder comprises a deconvolution layer.
15. The method of claim 11,
inputting a first video with a facial expression as an original expression and preset classification information corresponding to a target expression into an expression generation model, and outputting to obtain a second video with the facial expression as the target expression;
the training method of the expression generation model comprises the following steps:
acquiring a training pair consisting of each frame image of the first training video and each frame image of the second training video;
inputting each frame image of the first training video into a first generator, acquiring feature information of each frame image of the first training video, feature information of a face key point and classification information of an original expression, fusing the feature information of each frame image of the first training video, the feature information of the face key point, the classification information of the original expression and preset classification information corresponding to a target expression to obtain feature information of each frame fused image corresponding to the first training video, and obtaining each frame fused image corresponding to the first training video output by the first generator according to the feature information of each frame fused image corresponding to the first training video;
inputting each frame of image of the second training video into a second generator, acquiring feature information of each frame of image of the second training video, feature information of a face key point and classification information of a target expression, fusing the feature information of each frame of image of the second training video, the feature information of the face key point, the classification information of the target expression and preset classification information corresponding to an original expression to obtain feature information of each frame of fused image corresponding to the second training video, and obtaining each frame of fused image corresponding to the second training video output by the second generator according to the feature information of each frame of fused image corresponding to the second training video;
determining the adversarial loss and the cycle consistency loss according to the fused image of each frame corresponding to the first training video and the fused image of each frame corresponding to the second training video;
and training the first generator and the second generator according to the adversarial loss and the cycle consistency loss, and using the first generator as the expression generation model after the training of the first generator is finished.
16. The method of claim 15, further comprising:
determining pixel-to-pixel loss according to the pixel difference between every two adjacent frames of fused images corresponding to the first training video and the pixel difference between every two adjacent frames of fused images corresponding to the second training video;
wherein training the first generator and the second generator according to the adversarial loss and the cycle consistency loss comprises:
training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss.
17. The method according to claim 15 or 16, wherein the determining the adversarial loss according to the fused image of each frame corresponding to the first training video and the fused image of each frame corresponding to the second training video comprises:
inputting each frame of fused image corresponding to the first training video into a first discriminator to obtain a first discrimination result of each frame of fused image corresponding to the first training video;
inputting each frame of fused image corresponding to the second training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the second training video;
and determining a first adversarial loss according to the first discrimination result of the fused image of each frame corresponding to the first training video, and determining a second adversarial loss according to the second discrimination result of the fused image of each frame corresponding to the second training video.
18. The method according to claim 17, wherein inputting the fused image of each frame corresponding to the first training video into a first discriminator to obtain a first discrimination result of the fused image of each frame corresponding to the first training video comprises:
inputting each frame of fused image corresponding to the first training video into a first face feature extraction model in the first discriminator to obtain feature information of each frame of fused image corresponding to the output first training video;
inputting the feature information of each frame of fused image corresponding to the first training video into a first expression classification model in the first discriminator to obtain the classification information of the expression of each frame of fused image corresponding to the first training video as a first discrimination result;
inputting each frame of fused image corresponding to the second training video into a second discriminator to obtain a second discrimination result of each frame of fused image corresponding to the second training video comprises:
inputting each frame of fused image corresponding to the second training video into a second face feature extraction model in the second discriminator to obtain feature information of each frame of fused image corresponding to the output second training video;
and inputting the feature information of each frame of fused image corresponding to the second training video into a second expression classification model in the second discriminator to obtain the classification information of the expression of each frame of fused image corresponding to the second training video as the second discrimination result.
19. The method of claim 15 or 16, wherein the cycle consistency loss is determined using the following method:
inputting each frame fusion image corresponding to the first training video into the second generator to generate each frame reconstruction image of the first training video, and inputting each frame fusion image corresponding to the second training video into the first generator to generate each frame reconstruction image of the second training video;
and determining the cycle consistency loss according to the difference between each frame of reconstructed image of the first training video and each frame of image of the first training video and the difference between each frame of reconstructed image of the second training video and each frame of image of the second training video.
20. The method of claim 16, wherein the pixel-to-pixel loss is determined by:
determining the distance between the representation vectors of the two pixels at each position in every two adjacent frames of fused images corresponding to the first training video, and summing the distances corresponding to all the positions to obtain a first loss;
determining the distance between the representation vectors of the two pixels at each position in every two adjacent frames of fused images corresponding to the second training video, and summing the distances corresponding to all the positions to obtain a second loss;
and adding the first loss and the second loss to obtain the pixel-to-pixel loss.
21. The method of claim 15, wherein the obtaining of the feature information of each frame of image of the first training video, the feature information of the face key point, and the classification information of the original expression comprises:
inputting each frame of image in the first training video into a third facial feature extraction model in the first generator to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimension of the coordinate information of all face key points by adopting a principal component analysis method to obtain first information of a preset dimension, wherein the first information is used as feature information of the face key points of each frame of image of the first training video; inputting the feature information of each frame of image in the first training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame of image in the first training video;
the acquiring the feature information of each frame of image of the second training video, the feature information of the face key point and the classification information of the target expression comprises:
inputting each frame of image in the second training video into a fourth face feature extraction model in the second generator to obtain the output feature information of each frame of image; inputting the feature information of each frame of image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame of image; reducing the dimension of the coordinate information of all face key points by adopting a principal component analysis method to obtain second information of a preset dimension, wherein the second information is used as the feature information of the face key points of each frame of image of the second training video; and inputting the feature information of each frame of image in the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame of image in the second training video.
22. The method of claim 15, wherein the fusing the feature information of each frame of image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression comprises:
adding and averaging the classification information of the original expression of each frame of image of the first training video and the preset classification information corresponding to the target expression to obtain the classification information of the fusion expression corresponding to each frame of image of the first training video; splicing the feature information of the face key points of each frame of image of the first training video multiplied by the first weight to be trained, the feature information of each frame of image of the first training video multiplied by the second weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the first training video;
the fusing the feature information of each frame of image of the second training video, the feature information of the face key point, the classification information of the target expression and the preset classification information corresponding to the original expression comprises:
adding and averaging the classification information of the target expression of each frame of image of the second training video and the preset classification information corresponding to the original expression to obtain the classification information of the fusion expression corresponding to each frame of image of the second training video; and splicing the feature information of the face key points of each frame of image of the second training video multiplied by the third weight to be trained, the feature information of each frame of image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fusion expression corresponding to each frame of image of the second training video.
23. The method of claim 16, wherein the training the first generator and the second generator according to the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss comprises:
weighting and summing the adversarial loss, the cycle consistency loss and the pixel-to-pixel loss to obtain a total loss;
training the first generator and the second generator according to the total loss.
24. The method of claim 8,
the editing processing of the character actions in each frame of image in the first video according to the character action customization information corresponding to the interactive scene comprises the following steps:
adjusting first human body key points of a character in an original first key frame in the first video during a first action to obtain second human body key points of the character during a second action as character action customization information;
extracting feature information of each second human body key point neighborhood from the original first key frame;
and inputting the characteristic information of each second human body key point and the neighborhood thereof into the image generation model, and outputting a target first key frame of the character in the second action.
25. The method of claim 24, wherein the obtaining of the image generation model comprises:
the method comprises the steps of taking a training video frame and human key points of figures in the training video frame as a pair of training data, taking the human key points in the training data and feature information of neighborhoods of the human key points in the training video frame as input of an image generation network, taking the training video frame in the training data as output supervision information of the image generation network, and training the image generation network to obtain an image generation model.
26. The method of claim 24, wherein the first human body key points comprise body silhouette feature points of the character during the first action, and the second human body key points comprise body silhouette feature points of the character during the second action.
27. A digital person generation apparatus comprising:
a memory; and a processor coupled to the memory, the processor configured to perform the digital person generation method of any of claims 1-26 based on instructions stored in the memory.
28. A digital person generation apparatus, comprising:
an acquisition unit configured to acquire a first video;
a customization unit configured to edit the characters in each frame of the first video according to the character customization information corresponding to the interactive scene; and
an output unit configured to output a second video according to each frame of the processed first video.
29. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the digital person generation method of any one of claims 1 to 26.
CN202210541984.9A 2022-05-18 2022-05-18 Digital human generation method and device and storage medium Pending CN114863533A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210541984.9A CN114863533A (en) 2022-05-18 2022-05-18 Digital human generation method and device and storage medium
PCT/CN2023/087271 WO2023221684A1 (en) 2022-05-18 2023-04-10 Digital human generation method and apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210541984.9A CN114863533A (en) 2022-05-18 2022-05-18 Digital human generation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN114863533A (en) 2022-08-05

Family

ID=82639735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210541984.9A Pending CN114863533A (en) 2022-05-18 2022-05-18 Digital human generation method and device and storage medium

Country Status (2)

Country Link
CN (1) CN114863533A (en)
WO (1) WO2023221684A1 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349081B (en) * 2019-06-17 2023-04-07 达闼科技(北京)有限公司 Image generation method and device, storage medium and electronic equipment
US11804039B2 (en) * 2020-05-28 2023-10-31 Science House LLC Systems, methods, and apparatus for enhanced cameras
CN113920230A (en) * 2021-09-15 2022-01-11 上海浦东发展银行股份有限公司 Character image video generation method and device, computer equipment and storage medium
CN114863533A (en) * 2022-05-18 2022-08-05 京东科技控股股份有限公司 Digital human generation method and device and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023221684A1 (en) * 2022-05-18 2023-11-23 京东科技控股股份有限公司 Digital human generation method and apparatus, and storage medium
CN115665507A (en) * 2022-12-26 2023-01-31 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar
CN115665507B (en) * 2022-12-26 2023-03-21 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar
CN117576267A (en) * 2024-01-16 2024-02-20 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video
CN117576267B (en) * 2024-01-16 2024-04-12 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video

Also Published As

Publication number Publication date
WO2023221684A1 (en) 2023-11-23

Similar Documents

Publication Publication Date Title
Chuang et al. Mood swings: expressive speech animation
Reed et al. Learning what and where to draw
US5802220A (en) Apparatus and method for tracking facial motion through a sequence of images
Chen et al. Puppeteergan: Arbitrary portrait animation with semantic-aware appearance transformation
CN114863533A (en) Digital human generation method and device and storage medium
Yu et al. A video, text, and speech-driven realistic 3-D virtual head for human–machine interface
US20220398797A1 (en) Enhanced system for generation of facial models and animation
US11562536B2 (en) Methods and systems for personalized 3D head model deformation
Zhou et al. An image-based visual speech animation system
JP7462120B2 (en) Method, system and computer program for extracting color from two-dimensional (2D) facial images
CN110796593A (en) Image processing method, device, medium and electronic equipment based on artificial intelligence
US11587288B2 (en) Methods and systems for constructing facial position map
US11887232B2 (en) Enhanced system for generation of facial models and animation
US20220398795A1 (en) Enhanced system for generation of facial models and animation
CN113850168A (en) Fusion method, device and equipment of face pictures and storage medium
CN113228163A (en) Real-time text and audio based face reproduction
CN115457169A (en) Voice-driven human face animation generation method and system
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
CN115393480A (en) Speaker synthesis method, device and storage medium based on dynamic nerve texture
KR20230110787A (en) Methods and systems for forming personalized 3D head and face models
Paier et al. Example-based facial animation of virtual reality avatars using auto-regressive neural networks
CN115035219A (en) Expression generation method and device and expression generation model training method and device
CN114724209A (en) Model training method, image generation method, device, equipment and medium
CN109657589B (en) Human interaction action-based experiencer action generation method
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination