WO2023221684A1 - Digital human generation method and apparatus, and storage medium - Google Patents

Digital human generation method and apparatus, and storage medium

Info

Publication number
WO2023221684A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
video
image
information
training
Prior art date
Application number
PCT/CN2023/087271
Other languages
French (fr)
Chinese (zh)
Inventor
王林芳
张炜
石凡
张琪
申童
左佳伟
梅涛
Original Assignee
京东科技控股股份有限公司
Priority date
Filing date
Publication date
Application filed by 京东科技控股股份有限公司
Publication of WO2023221684A1


Classifications

    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06T 3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting, using neural networks
    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/40 Scenes; scene-specific elements in video content

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a digital human generation method and device and a storage medium.
  • Some embodiments of the present disclosure propose a digital human generation method, including:
  • a second video is output based on each frame image in the processed first video.
  • the first video is obtained by preprocessing the original video, and the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • the resolution adjustment includes:
  • if the resolution of the original video is higher than the required preset resolution, the original video is downsampled according to the preset resolution to obtain the first video with the preset resolution;
  • if the resolution of the original video is lower than the required preset resolution, a super-resolution model is used to process the original video to obtain the first video with the preset resolution; the super-resolution model is used to increase the resolution of the input video to the preset resolution.
  • the super-resolution model is obtained by training a neural network.
  • the first video frame from the high-definition video is downsampled according to the preset resolution to obtain the second video frame.
  • the second video frame is used as the input of the neural network
  • the first video frame is used as the supervision information of the output of the neural network
  • the neural network is trained to obtain a super-resolution model.
  • the frame rate adjustment includes:
  • if the frame rate of the original video is higher than the required preset frame rate, frames are extracted from the original video based on the ratio information between the frame rate of the original video and the preset frame rate to obtain the first video with the preset frame rate;
  • if the frame rate of the original video is lower than the required preset frame rate, the video frame interpolation model is used to interpolate the original video up to a first frame rate, where the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate; frames are then extracted from the interpolated video based on the ratio information between the first frame rate and the preset frame rate to obtain the first video with the preset frame rate.
  • the video frame interpolation model is used to generate a transition frame between any two frame images.
  • the video frame interpolation model is obtained by training a neural network.
  • three consecutive frames in the training video frame sequence are regarded as a triplet; the first frame and the third frame in the triplet are used as the input of the neural network, the second frame in the triplet is used as the supervision information of the output of the neural network, and the neural network is trained to obtain a video frame interpolation model.
  • the input of the neural network includes: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame.
  • editing the characters in each frame of the first video according to the character customization information corresponding to the interaction scene includes one or more of the following:
  • editing the characters in each frame of the first video based on the character customization information corresponding to the interaction scene includes: determining character image adjustment parameters based on the character image adjustments made by the user in some video frames in the first video, and editing the characters in the remaining video frames in the first video according to the character image adjustment parameters.
  • editing the characters in the remaining video frames in the first video according to the character adjustment parameters includes:
  • according to the target part indicated by the character image adjustment parameters, the target part of the character in the remaining video frames in the first video is located through key point detection;
  • according to the amplitude or position information in the character image adjustment parameters, the located target part is adjusted in amplitude or position through graphics transformation.
  • the character expression customization information includes preset classification information corresponding to the target expression, and the character expressions in each frame image in the first video are edited according to the character expression customization information corresponding to the interaction scene.
  • the fused image corresponding to each frame of image is generated, and all the fused images form a second video in which the facial expression is the target expression.
  • obtaining the feature information of each frame of the image in the first video, the feature information of key facial points, and the classification information of the original expression includes:
  • the characteristic information of each frame of image is input into the expression classification model to obtain the classification information of the original expression of each frame of image.
  • fusing the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
  • the feature information of the facial key points of each frame image multiplied by the first weight obtained by training, the feature information of each frame image multiplied by the second weight obtained by training, and the classification information of the fused expression corresponding to each frame image are spliced.
  • generating the fused image corresponding to each frame of image based on the feature information of the fused image corresponding to each frame of image includes:
  • the facial feature extraction model includes a convolution layer
  • the decoder includes a deconvolution layer
  • the training method of the expression generation model includes:
  • Each frame image of the first training video is input into the first generator to obtain the feature information of each frame image of the first training video, the feature information of the facial key points, and the classification information of the original expression; the first generator fuses the feature information of each frame image of the first training video, the feature information of the facial key points, the classification information of the original expression, and the preset classification information corresponding to the target expression to obtain the feature information of each frame fused image corresponding to the first training video; according to the feature information of each frame fused image corresponding to the first training video, each frame fused image corresponding to the first training video output by the first generator is obtained;
  • Each frame image of the second training video is input into the second generator to obtain the feature information of each frame image of the second training video, the feature information of the facial key points, and the classification information of the target expression; the second generator fuses the feature information of each frame image of the second training video, the feature information of the facial key points, the classification information of the target expression, and the preset classification information corresponding to the original expression to obtain the feature information of each frame fused image corresponding to the second training video; according to the feature information of each frame fused image corresponding to the second training video, each frame fused image corresponding to the second training video output by the second generator is obtained;
  • the first generator and the second generator are trained according to the adversarial loss and the cycle consistency loss; after training is completed, the first generator is used as the expression generation model.
  • the method further includes: determining a pixel-to-pixel loss based on the pixel difference between each two adjacent frames of fused images corresponding to the first training video and the pixel difference between each two adjacent frames of fused images corresponding to the second training video;
  • training the first generator and the second generator according to the adversarial loss and the cycle-consistent loss includes:
  • the first generator and the second generator are trained based on the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss.
  • determining the adversarial loss based on each frame fused image corresponding to the first training video and each frame fused image corresponding to the second training video includes: inputting each frame fused image corresponding to the first training video into the first discriminator to obtain the first discrimination result of each frame fused image corresponding to the first training video;
  • the first adversarial loss is determined based on the first discrimination result of each frame of the fused image corresponding to the first training video
  • the second adversarial loss is determined based on the second discrimination result of each frame of the fused image corresponding to the second training video.
  • inputting the fused images of each frame corresponding to the first training video into the first discriminator, and obtaining the first discrimination result of the fused image of each frame corresponding to the first training video includes:
  • inputting the fused images of each frame corresponding to the second training video into the second discriminator to obtain the second discrimination result of the fused images of each frame corresponding to the second training video includes:
  • the cycle consistency loss is determined using the following method:
  • Each frame fused image corresponding to the first training video is input into the second generator to generate a reconstructed image of each frame of the first training video, and each frame fused image corresponding to the second training video is input into the first generator to generate a reconstructed image of each frame of the second training video;
  • the pixel-to-pixel loss is determined using the following method:
  • the first loss and the second loss are summed to obtain the pixel-to-pixel loss.
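As a non-authoritative illustration of the pixel-to-pixel loss described above, the following sketch (assuming PyTorch tensors of shape (T, C, H, W)) sums the mean absolute differences between adjacent fused frames of the two training videos; the tensor names and the mean reduction are assumptions, not details from the disclosure.

```python
import torch

def pixel_to_pixel_loss(fused_a: torch.Tensor, fused_b: torch.Tensor) -> torch.Tensor:
    """fused_a / fused_b: fused frames for the first / second training video, shape (T, C, H, W)."""
    # First loss: pixel difference between each two adjacent fused frames of the first training video.
    loss_a = (fused_a[1:] - fused_a[:-1]).abs().mean()
    # Second loss: the same quantity for the fused frames of the second training video.
    loss_b = (fused_b[1:] - fused_b[:-1]).abs().mean()
    # The two losses are summed to obtain the pixel-to-pixel loss.
    return loss_a + loss_b
```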
  • obtaining the feature information of each frame image of the first training video, the feature information of the facial key points, and the classification information of the original expression includes: inputting each frame image of the first training video into the third facial feature extraction model in the first generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into the first facial key point detection model in the first generator to obtain the coordinate information of the facial key points of each frame image; using principal component analysis to reduce the dimensionality of the coordinate information of all facial key points to obtain first information of a preset dimension as the feature information of the facial key points of each frame image of the first training video; and inputting the feature information of each frame image of the first training video into the third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video;
  • Obtaining the feature information of each frame image of the second training video, the feature information of the facial key points, and the classification information of the target expression includes: inputting each frame image of the second training video into the fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into the second facial key point detection model in the second generator to obtain the coordinate information of the facial key points of each frame image; using principal component analysis to reduce the dimensionality of the coordinate information of all facial key points to obtain second information of a preset dimension as the feature information of the facial key points of each frame image of the second training video; and inputting the feature information of each frame image of the second training video into the fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
  • fusing the feature information of each frame image of the first training video, the feature information of the facial key points, the classification information of the original expression, and the preset classification information corresponding to the target expression includes: adding and averaging the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression to obtain the classification information of the fused expression corresponding to each frame image of the first training video; and splicing the feature information of the facial key points of each frame image of the first training video multiplied by the first weight to be trained, the feature information of each frame image of the first training video multiplied by the second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video;
  • fusing the feature information of each frame image of the second training video, the feature information of the facial key points, the classification information of the target expression, and the preset classification information corresponding to the original expression includes: adding and averaging the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression to obtain the classification information of the fused expression corresponding to each frame image of the second training video; and splicing the feature information of the facial key points of each frame image of the second training video multiplied by the third weight to be trained, the feature information of each frame image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video.
  • training the first generator and the second generator based on the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss includes: weighting and summing the adversarial loss, the cycle consistency loss, and the pixel-to-pixel loss to obtain a total loss, and training the first generator and the second generator according to the total loss.
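A minimal sketch of the weighted summation, assuming PyTorch tensors; the weight values are illustrative hyperparameters and are not specified in the disclosure.

```python
import torch

def total_loss(adv_loss: torch.Tensor, cycle_loss: torch.Tensor, pixel_loss: torch.Tensor,
               w_adv: float = 1.0, w_cyc: float = 10.0, w_pix: float = 1.0) -> torch.Tensor:
    # Weighted sum of the three losses; the weights are assumed hyperparameters.
    return w_adv * adv_loss + w_cyc * cycle_loss + w_pix * pixel_loss
```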
  • editing the character actions in each frame of the image in the first video based on the character action customization information corresponding to the interaction scene includes:
  • the feature information of each second human body key point and its neighborhood is input into the image generation model, and the target key frame of the character during the second action is output.
  • the method for obtaining the image generation model includes: using a training video frame and the human body key points of the character in the training video frame as a pair of training data, and using the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of the image generation network
  • the training video frames in the training data are used as the supervision information of the output of the image generation network
  • the image generation network is trained to obtain the image generation model.
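For illustration only, a training loop of this kind might look as follows in PyTorch; the data loader, the feature encoding of the key points and their neighborhoods, and the L1 supervision are assumptions rather than details from the disclosure.

```python
import torch
import torch.nn.functional as F

def train_image_generator(model, optimizer, loader, epochs=10):
    """Minimal training-loop sketch under assumed data shapes: the loader yields
    (features, frame) pairs, where `features` encodes the human body key points and
    the image features of their neighborhoods, and `frame` is the training video frame."""
    model.train()
    for _ in range(epochs):
        for features, frame in loader:
            pred = model(features)          # generated character image
            loss = F.l1_loss(pred, frame)   # the training frame supervises the network output
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```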
  • the first human body key points include the human body outline feature points of the character during the first action
  • the second human body key points include the human body outline feature points of the character during the second action.
  • Some embodiments of the present disclosure provide a digital human generation device, including: a memory; and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the digital human generation method described in any of the above embodiments.
  • Some embodiments of the present disclosure provide a digital human generation device, including:
  • an acquisition unit configured to acquire the first video
  • the customization unit is configured to edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene;
  • the output unit is configured to output the second video according to each frame image in the processed first video.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the steps of the digital human generation method described in various embodiments are implemented.
  • Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
  • FIG. 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
  • Figure 2 shows a schematic diagram of video preprocessing according to some embodiments of the present disclosure.
  • Figure 3A shows a schematic flowchart of an expression generation method according to some embodiments of the present disclosure.
  • Figure 3B shows a schematic diagram of an expression generation method according to other embodiments of the present disclosure.
  • Figure 3C shows a schematic flowchart of a training method for an expression generation model according to some embodiments of the present disclosure.
  • Figure 3D shows a schematic diagram of a training method of an expression generation model according to some embodiments of the present disclosure.
  • FIG. 4A shows a schematic diagram of the human body outline feature points of the character during the first action according to some embodiments of the present disclosure.
  • FIG. 4B shows a schematic diagram of the human body outline feature points of the character during the second action according to some embodiments of the present disclosure.
  • FIG. 4C shows a schematic diagram of multiple key points and multiple key connections on a character according to some embodiments of the present disclosure.
  • Figure 5 shows a schematic structural diagram of a digital human generation device according to some embodiments of the present disclosure.
  • Figure 6 shows a schematic structural diagram of a digital human generation device according to other embodiments of the present disclosure.
  • Embodiments of the present disclosure edit the characters in the video based on the character customization information corresponding to the interaction scene, and generate a digital human video that matches the interaction scene through character editing.
  • Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
  • the digital human generation method of this embodiment includes the following steps.
  • step S110 the first video is obtained.
  • the first video may be a recorded original video, or may be obtained by preprocessing the original video.
  • the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • step S120 edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene.
  • editing the characters in each frame image in the first video according to the character customization information corresponding to the interaction scene includes one or more of the following: editing the character images in each frame image in the first video according to the character image customization information corresponding to the interaction scene to generate a digital human image that matches the interaction scene; editing the character expressions in each frame image in the first video according to the character expression customization information corresponding to the interaction scene to generate digital human expressions that match the interaction scene; and editing the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene to generate digital human actions that match the interaction scene.
  • step S130 a second video is output based on each frame image in the processed first video.
  • each frame image in the processed first video is combined to form a second video
  • the second video is a digital human video matching the interaction scene.
  • the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing.
  • a digital human image, digital human expression, etc. that match the interaction scene are generated.
  • FIG. 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
  • the digital human generation method of this embodiment includes the following steps.
  • step S210 logic control is customized.
  • Customized logic control is used to control whether custom logic such as video preprocessing, image customization, expression customization, action customization, etc. is executed and the order of execution.
  • the edited content of each part, such as video preprocessing, image customization, expression customization, and action customization, is relatively independent, with no strong dependence between the parts. Therefore, the execution order of the parts can be exchanged while still achieving the basic effect of generating a digital human video that matches the interaction scene. However, there is still a certain mutual influence between the parts; following the execution sequence of S220 to S250 in this embodiment minimizes this mutual influence and yields a better final character presentation.
  • step S220 video preprocessing.
  • Video preprocessing is to preprocess the recorded original video to obtain the first video.
  • the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • the preprocessing is performed sequentially in the order of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  • in this order, the effect of video preprocessing is better and the visual information of the original video is retained to the greatest extent, ensuring that the preprocessed video does not suffer from quality problems such as blurring and distortion, and that frame rate adjustment and resolution adjustment have the least impact on the subsequent digital human customization process.
  • the resolution adjustment includes: if the resolution of the original video is higher than the required preset resolution, downsampling the original video according to the preset resolution to obtain the first video with the preset resolution; if the resolution of the original video is lower than the required preset resolution, using a super-resolution model to process the original video to obtain the first video with the preset resolution, where the super-resolution model is used to increase the resolution of the input video to the preset resolution; and if the resolution of the original video is equal to the required preset resolution, skipping the resolution adjustment step.
  • the resolution of the preprocessed first video can be maintained consistent, and the impact of the differentiated resolution of the original video on the digital human customization effect can be reduced.
  • the super-resolution model is, for example, obtained by training a neural network.
  • the first video frame from the high-definition video is downsampled according to the preset resolution to obtain the second video frame; the second video frame is used as the input of the neural network, the first video frame is used as the supervision information of the output of the neural network, and the neural network is trained to obtain the super-resolution model.
  • the gap information between the video frame output by the neural network and the first video frame is used as the loss function, and the parameters of the neural network are iteratively updated according to the loss determined by the loss function until the loss meets certain conditions and the training is completed.
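A hedged PyTorch sketch of one such training step is shown below; the bicubic downsampling and the L1 loss are assumed concrete choices, since the text only speaks of downsampling and of the gap between the network output and the first video frame.

```python
import torch
import torch.nn.functional as F

def sr_training_step(model, optimizer, hd_frame, low_res_size):
    """One illustrative training step for the super-resolution network.

    hd_frame: a first video frame taken from the high-definition video, shape (N, C, H, W).
    low_res_size: (height, width) used for downsampling, assumed to be derived from the preset resolution.
    """
    # Downsample the HD frame to obtain the second video frame (the network input).
    lr_frame = F.interpolate(hd_frame, size=low_res_size, mode="bicubic", align_corners=False)
    # The network restores the resolution; the original HD frame supervises the output,
    # and the gap between output and HD frame drives the parameter update.
    output = model(lr_frame)
    loss = F.l1_loss(output, hd_frame)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```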
  • neural networks are a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow methods, generative adversarial networks, etc.
  • the first video with resolutions such as 480p/720p/1080p can be obtained from the original video of any resolution.
  • 360p/480p/720p/1080p is a video display format
  • P means progressive scan.
  • the picture resolution of 1080p is 1920 × 1080.
  • inter-frame smoothing is used here to ensure that textures and characters remain smooth during video playback, without jagged edges or moiré patterns that would affect the viewing experience.
  • the inter-frame smoothing process may, for example, adopt an average smoothing process.
  • the image information of three consecutive frames is averaged, and the average is used as the image information of the middle frame among the three consecutive frames.
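A minimal NumPy sketch of this three-frame average smoothing, assuming the frames are stacked along the first axis:

```python
import numpy as np

def smooth_frames(frames: np.ndarray) -> np.ndarray:
    """frames: array of shape (T, H, W, C); the first and last frames are kept unchanged."""
    f = frames.astype(np.float32)
    smoothed = f.copy()
    # Average the image information of three consecutive frames and use it as the middle frame.
    smoothed[1:-1] = (f[:-2] + f[1:-1] + f[2:]) / 3.0
    return smoothed.astype(frames.dtype)
```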
  • the frame rate adjustment includes: if the frame rate of the original video is higher than the required preset frame rate, extracting frames from the original video based on the ratio between the frame rate of the original video and the preset frame rate to obtain the first video with the preset frame rate; if the frame rate of the original video is lower than the required preset frame rate, using the video frame interpolation model to interpolate the original video up to a first frame rate, where the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and then extracting frames from the interpolated video according to the ratio between the first frame rate and the preset frame rate to obtain the first video with the preset frame rate, the video frame interpolation model being used to generate a transition frame between any two frame images; and if the frame rate of the original video is equal to the required preset frame rate, skipping the frame rate adjustment step.
  • the frame rate of the preprocessed first video can be maintained consistent, and the impact of the differentiated frame rate of the original video on the digital human customization effect can be reduced.
  • the frame interpolation operation can also effectively solve the jump problem between two actions; for example, when a digital human finishes action A and then moves to action B, a video played without frame interpolation makes the character's movements appear to jump, which is not realistic enough.
  • frame interpolation is used to insert several transition frames between the key frames of the two actions, so that when the interpolated video is played, the transition between the character's movements feels natural and more realistic.
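The frame-rate logic can be sketched as follows (integer frame rates assumed; the function and its return values are illustrative only):

```python
from math import lcm

def frame_rate_plan(src_fps: int, preset_fps: int):
    """Decide how to reach the preset frame rate."""
    if src_fps == preset_fps:
        return "keep"
    if src_fps > preset_fps:
        # Extract frames based on the ratio between the source and preset frame rates.
        return ("extract", src_fps / preset_fps)
    # Otherwise interpolate up to the least common multiple of the two rates (the first
    # frame rate), then extract frames according to the ratio between the first frame
    # rate and the preset frame rate.
    first_fps = lcm(src_fps, preset_fps)
    return ("interpolate_then_extract", first_fps, first_fps // preset_fps)
```

For example, frame_rate_plan(24, 30) would interpolate the video up to 120 fps (the least common multiple of 24 and 30) and then keep every fourth frame.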
  • the video frame insertion model is, for example, obtained by training a neural network.
  • three consecutive frames in the training video frame sequence are regarded as a triplet; the first frame and the third frame in the triplet are used as the input of the neural network, the second frame in the triplet is used as the supervision information of the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
  • the gap between the video frame output by the neural network from the first and third frames of the input triplet and the second frame of the triplet is used as the loss function, and the parameters of the neural network are iteratively updated according to the loss determined by the loss function until the loss meets certain conditions and training is completed.
  • the trained neural network is used as the video frame interpolation model and can generate transition frames between any two frame images.
  • neural networks are a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow methods, generative adversarial networks, etc.
  • the input to the neural network includes, for example: visual feature information and depth information of the first frame and the third frame, as well as optical flow information and deformation information between the first frame and the third frame.
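For illustration, one triplet training step could look like the following PyTorch sketch; the auxiliary tensor packing the visual features, depth, optical flow and deformation information, and the L1 loss, are assumptions.

```python
import torch
import torch.nn.functional as F

def interpolation_training_step(model, optimizer, triplet, aux):
    """triplet: (frame1, frame2, frame3) for three consecutive frames; aux stands in for
    the precomputed visual features, depth, optical flow and deformation information."""
    frame1, frame2, frame3 = triplet
    # The first and third frames (plus auxiliary information) are the network input;
    # the middle frame is the supervision signal for the generated transition frame.
    pred_middle = model(frame1, frame3, aux)
    loss = F.l1_loss(pred_middle, frame2)   # gap between prediction and the second frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```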
  • step S230 image customization.
  • the characters in each frame of the image in the first video are edited to meet the user's needs for digital human beauty and body beautification.
  • image customization includes, for example, skin smoothing, face slimming, eye enlargement, facial feature position adjustment, and body proportion adjustment such as waist slimming and leg lengthening, among other beauty and body beautification operations.
  • the character adjustment parameters are determined based on the character adjustments made by the user in some video frames in the first video, and the character images in the remaining video frames in the first video are adjusted according to the character adjustment parameters.
  • The "some video frames" above may be, for example, one or several key frames in the first video.
  • editing the characters in the remaining video frames in the first video according to the character image adjustment parameters includes: according to the target part indicated by the character image adjustment parameters, locating the target part of the character in the remaining video frames in the first video, such as the facial features or the human body, through key point detection; and according to the amplitude or position information in the character image adjustment parameters, adjusting the amplitude or position of the located target part through graphics transformation.
  • Taking eye enlargement as an example, the face is first located through face detection technology, the eyes of the character in the remaining video frames are then located through key point detection technology, and the eyes are enlarged through graphics transformation according to the amplitude information specified by the user, for example the adjustment amplitude of the distance between the upper and lower eyelids, so as to achieve the big-eye beauty effect for the characters in all frames of the video.
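The disclosure only states that the located part is adjusted "through graphics transformation"; as one possible concrete interpretation, the sketch below enlarges an eye by locally rescaling the patch around its key points with OpenCV. It is an assumed simplification, not the patented transformation.

```python
import cv2
import numpy as np

def enlarge_eye(image: np.ndarray, eye_points: np.ndarray, scale: float = 1.15) -> np.ndarray:
    """Crude local-scaling sketch of a 'big eyes' adjustment.

    eye_points: (N, 2) eye key points found by key-point detection, assumed well inside the frame.
    scale: adjustment amplitude taken from the user's edit on the key frame.
    """
    h_img, w_img = image.shape[:2]
    x, y, w, h = cv2.boundingRect(eye_points.astype(np.int32))
    pad = int(0.4 * max(w, h))                      # small neighborhood around the eye for blending
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    x1, y1 = min(x + w + pad, w_img), min(y + h + pad, h_img)
    patch = image[y0:y1, x0:x1]
    # Scale the patch up and crop it back to the original window, keeping it centered,
    # which visually enlarges the eye while leaving the rest of the frame untouched.
    big = cv2.resize(patch, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
    dy = (big.shape[0] - patch.shape[0]) // 2
    dx = (big.shape[1] - patch.shape[1]) // 2
    out = image.copy()
    out[y0:y1, x0:x1] = big[dy:dy + patch.shape[0], dx:dx + patch.shape[1]]
    return out
```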
  • step S240 expression customization.
  • Expression customization refers to an expression generation method that edits the character expressions in each frame of the image in the first video based on the character expression customization information corresponding to the interaction scene, such as the preset classification information corresponding to the target expression, so as to control the digital human's facial expressions in the interaction scene. One expression state of the digital human can be transferred to another target expression state while ensuring that only the digital human's facial expression changes and the mouth shape, head movements, and so on are not affected. Therefore, when the digital human expresses the corresponding language content, the expression can change accordingly with the language content.
  • Figure 3A is a flow chart of some embodiments of the expression generation method of the present disclosure. As shown in Figure 3A, the method in this embodiment includes: steps S310 to S330.
  • step S310 the characteristic information of each frame of the image in the first video, the characteristic information of the key points of the face, and the classification information of the original expression are obtained.
  • the facial expressions in the first video are the original expressions. That is, the human facial expression in each frame image in the first video is mainly the original expression, and the original expression is, for example, a calm expression.
  • each frame image in the first video is input into the facial feature extraction model to obtain the output feature information of each frame image; the feature information of each frame image is input into the facial key point detection model to obtain the coordinate information of the facial key points of each frame image; and Principal Components Analysis (PCA) is used to reduce the dimensionality of the coordinate information of all facial key points to obtain information of a preset dimension, which is used as the feature information of the facial key points;
  • the feature information of each frame of image is input into the expression classification model to obtain the classification information of the original expression of each frame of image.
  • the overall expression generation model includes an encoder and a decoder.
  • the encoder can include a facial feature extraction model, a facial key point detection model and an expression classification model.
  • the facial feature extraction model connects the facial key point detection model and the expression classification model.
  • the facial feature extraction model can use existing models, such as VGG-19, ResNet, Transformer and other deep learning models with feature extraction functions.
  • the part before VGG-19 block 5 can be used as a facial feature extraction model.
  • the facial key point detection model and the expression classification model can also use existing models, such as an MLP (multi-layer perceptron); specifically, a 3-layer MLP can be used.
  • the feature information of each frame of the image in the first video is, for example, the Feature Map output by the facial feature extraction model.
  • the key points include, for example, 68 key points such as the chin, eyebrow center, and mouth corners; each key point is expressed by the horizontal-axis and vertical-axis coordinates of its location.
  • PCA is used to reduce the dimensionality of the coordinate information of all facial key points to obtain information of a preset dimension (for example, 6 dimensions, which can achieve the best effect), used as the feature information of the facial key points.
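A minimal sketch of this dimensionality reduction with scikit-learn, assuming the 68 (x, y) key points of all frames are stacked into one array:

```python
import numpy as np
from sklearn.decomposition import PCA

def keypoint_features(all_keypoints: np.ndarray, n_dims: int = 6) -> np.ndarray:
    """all_keypoints: shape (num_frames, 68, 2) -> returns (num_frames, n_dims) key-point features."""
    flat = all_keypoints.reshape(len(all_keypoints), -1)     # (num_frames, 136) coordinate vector per frame
    return PCA(n_components=n_dims).fit_transform(flat)      # reduce to the preset dimension
```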
  • the expression classification model can output the classification of several expressions such as neutral, happy, sad, etc., which can be represented by one-hot encoded vectors.
  • the classification information of the original expression may be the one-hot encoding of the classification of the original expression in each frame of the image in the first video obtained through the expression classification model.
  • step S320 the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the feature information of the fused image corresponding to each frame of image.
  • the classification information of the original expression of each frame of image and the preset classification information corresponding to the target expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame of image;
  • the feature information of the facial key points of each frame image multiplied by the first weight obtained by training, the feature information of each frame image multiplied by the second weight obtained by training, and the classification information of the fused expression corresponding to each frame image are spliced.
  • the target expression is different from the original expression, for example, a smile expression
  • the preset classification information corresponding to the target expression is, for example, a preset one-hot code of the target expression.
  • the preset classification information does not need to be obtained through the model, and can be directly encoded using the preset encoding rules (one-hot).
  • a calm expression is coded as 1000 and a smiling expression is coded as 0100.
  • the aforementioned classification information of the original expression is obtained through the expression classification model. This classification information can be different from the preset classification information corresponding to the original expression.
  • for example, the original expression is a calm expression and its preset one-hot code is 1000, but the encoding obtained by the expression classification model may be 0.8 0.2 0 0.
  • the encoder can also include a feature fusion model, which inputs the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression into the feature fusion model for fusion.
  • the parameters that need to be trained in the feature fusion model include the first weight and the second weight. For each frame image, the first weight obtained by training is multiplied by the feature information of the facial key points of the image to obtain the first feature vector, and the second weight obtained by training is multiplied by the feature information of the image to obtain the second feature vector; the first feature vector, the second feature vector, and the classification information of the fused expression corresponding to the image are spliced to obtain the feature information of the fused image corresponding to the image.
  • the first weight and the second weight can unify the value ranges of the three types of information.
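The fusion step can be sketched as a small PyTorch module with the two trainable weights; flattening the image feature map before splicing is an assumption made to keep the example short.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Sketch of the feature-fusion step with the two trainable scalar weights."""

    def __init__(self):
        super().__init__()
        self.w_keypoints = nn.Parameter(torch.ones(1))   # first weight (trained)
        self.w_image = nn.Parameter(torch.ones(1))       # second weight (trained)

    def forward(self, image_feat, keypoint_feat, orig_cls, target_cls):
        # Average the classification information of the original expression with the
        # preset classification information of the target expression.
        fused_cls = (orig_cls + target_cls) / 2.0
        # Weight the two kinds of features and splice everything together.
        return torch.cat([self.w_keypoints * keypoint_feat,
                          self.w_image * image_feat.flatten(1),
                          fused_cls], dim=1)
```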
  • step S330 a fused image corresponding to each frame of image is generated based on the feature information of the fused image corresponding to each frame of image, and all the fused images are combined to form a second video in which the facial expression is the target expression.
  • the feature information of the fused image corresponding to each frame of image is input to the decoder, and the generated fused image corresponding to each frame of image is output.
  • the facial feature extraction model includes convolutional layers
  • the decoder includes deconvolutional layers that can generate images based on features.
  • the decoder is, for example, block 5 of VGG-19, which replaces the last convolutional layer with a deconvolutional layer.
  • the fused image is an image whose facial expression is the target expression, and the fused images of each frame form a second video.
  • a feature map is obtained after feature extraction. Face key point detection and expression classification are performed based on the feature map.
  • PCA is performed on the coordinate information of the key points obtained by facial key point detection, and the dimensionality is reduced to a preset dimension to obtain the key point features.
  • the classification information of the original expression is one-hot encoded and fused with the preset classification information corresponding to the target expression to obtain the expression classification vector (the classification information of the fused expression). Then, the feature map of the face, the expression classification vector and the key point features are fused to obtain the feature information of the fused image, and the feature information of the fused image is decoded to obtain the face image of the target expression.
  • the solution of the above embodiment extracts the feature information of each frame image in the first video, the feature information of the facial key points, and the classification information of the original expression, fuses the extracted information with the preset classification information corresponding to the target expression to obtain the feature information of the fused image corresponding to each frame image, and then generates the fused image corresponding to each frame image based on that feature information; all the fused images can form a second video in which the facial expression is the target expression.
  • feature information of key points on the human face is extracted and used for feature fusion to make the expressions in the fused image more realistic and smooth.
  • the target expression is generated directly and is compatible with the character's facial movements and mouth shape in the original image, without affecting the character's mouth shape, head movements, or the clarity of the original image, making the generated video stable, clear, and smooth.
  • Figure 3C is a flow chart of some embodiments of the training method of the expression generation model of the present disclosure.
  • the expression generation model can output a second video in which the facial expression is the target expression based on the input first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression.
  • the method in this embodiment includes: steps S410 to S450.
  • step S410 a training pair consisting of each frame image of the first training video and each frame image of the second training video is obtained.
  • the first training video is a video in which the facial expression is the original expression
  • the second training video is a video in which the facial expression is the target expression.
  • Each frame image of the first training video does not need to correspond frame by frame to each frame image of the second training video; the classification information of the original expression and the classification information of the target expression are labeled.
  • each frame image of the first training video is input into the first generator to obtain the feature information of each frame image of the first training video, the feature information of the facial key points, and the classification information of the original expression; the feature information of each frame image of the first training video, the feature information of the facial key points, the classification information of the original expression, and the preset classification information corresponding to the target expression are fused to obtain the feature information of each frame fused image corresponding to the first training video; and according to this feature information, each frame fused image corresponding to the first training video output by the first generator is obtained.
  • each frame image of the first training video is input into the third facial feature extraction model in the first generator to obtain the output feature information of each frame image; the feature information of each frame image is input into the first facial key point detection model in the first generator to obtain the coordinate information of the facial key points of each frame image; principal component analysis is used to reduce the dimensionality of the coordinate information of all facial key points to obtain first information of a preset dimension as the feature information of the facial key points of each frame image of the first training video; and the feature information of each frame image of the first training video is input into the third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video.
  • the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame image of the first training video; the feature information of the facial key points of each frame image of the first training video multiplied by the first weight to be trained, the feature information of each frame image of the first training video multiplied by the second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video are spliced to obtain the feature information of each frame fused image corresponding to the first training video.
  • the first generator includes a first feature fusion model, and the first weight and the second weight are parameters to be trained in the first feature fusion model.
  • the first generator includes a first encoder and a first decoder.
  • the first encoder includes a third facial feature extraction model, a first facial key point detection model, a third expression classification model, and a first feature fusion model; the feature information of each frame fused image corresponding to the first training video is input into the first decoder to obtain the generated frame fused images corresponding to the first training video.
  • each frame image of the second training video is input into the second generator to obtain the feature information of each frame image of the second training video, the feature information of the facial key points, and the classification information of the target expression; the feature information of each frame image of the second training video, the feature information of the facial key points, the classification information of the target expression, and the preset classification information corresponding to the original expression are fused to obtain the feature information of each frame fused image corresponding to the second training video; and according to this feature information, each frame fused image corresponding to the second training video output by the second generator is obtained.
  • the second generator is structurally identical or similar to the first generator, and the training goal of the second generator is to generate a video with the same expression as the first training video based on the second training video.
  • each frame image of the second training video is input into the fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; the feature information of each frame image is input into the second facial key point detection model in the second generator to obtain the coordinate information of the facial key points of each frame image; principal component analysis is used to reduce the dimensionality of the coordinate information of all facial key points to obtain second information of a preset dimension as the feature information of the facial key points of each frame image of the second training video; and the feature information of each frame image of the second training video is input into the fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
  • the feature information of the face key points in each frame image of the second training video has the same dimension as the feature information of the face key points in each frame image of the first training video, for example, 6 dimensions.
  • the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame image of the second training video; the feature information of the facial key points of each frame image of the second training video multiplied by the third weight to be trained, the feature information of each frame image of the second training video multiplied by the fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video are spliced to obtain the feature information of each frame fused image corresponding to the second training video.
  • the preset classification information corresponding to the original expression does not need to be obtained through the model, and can be directly encoded using the preset encoding rules.
  • the second generator includes a second feature fusion model, and the third weight and the fourth weight are parameters to be trained in the second feature fusion model.
  • the second generator includes a second encoder and a second decoder.
  • the second encoder includes a fourth facial feature extraction model, a second facial key point detection model, a fourth expression classification model, and a second feature fusion model; the feature information of each frame fused image corresponding to the second training video is input into the second decoder to obtain the generated frame fused images corresponding to the second training video.
  • step S440 the adversarial loss and the cycle-consistent loss are determined based on the fused images of each frame corresponding to the first training video and the fused images of each frame corresponding to the second training video.
  • End-to-end training based on generative adversarial learning and cross-domain transfer learning can improve the accuracy of the model and improve training efficiency.
  • the adversarial loss is determined using the following method: the fused images of each frame corresponding to the first training video are input into the first discriminator to obtain the first discrimination result of each frame fused image corresponding to the first training video; the fused images of each frame corresponding to the second training video are input into the second discriminator to obtain the second discrimination result of each frame fused image corresponding to the second training video; the first adversarial loss is determined based on the first discrimination result of each frame fused image corresponding to the first training video, and the second adversarial loss is determined based on the second discrimination result of each frame fused image corresponding to the second training video.
  • Each fused frame image corresponding to the first training video is input into the first facial feature extraction model in the first discriminator to obtain the output feature information of each fused frame image corresponding to the first training video; the feature information of each fused frame image corresponding to the first training video is input into the first expression classification model in the first discriminator to obtain the expression classification information of each fused frame image corresponding to the first training video as the first discrimination result. Each fused frame image corresponding to the second training video is input into the second facial feature extraction model in the second discriminator to obtain the output feature information of each fused frame image corresponding to the second training video; the feature information of each fused frame image corresponding to the second training video is input into the second expression classification model in the second discriminator to obtain the expression classification information of each fused frame image corresponding to the second training video as the second discrimination result.
  • the overall model includes two sets of generators and discriminators.
  • the structures of the first discriminator and the second discriminator are the same or similar, and both include facial feature extraction models and expression classification models.
  • the structures of the first facial feature extraction model, the second facial feature extraction model, the third facial feature extraction model and the fourth facial feature extraction model are the same or similar.
  • the structures of the first expression classification model, the second expression classification model, the third expression classification model and the fourth expression classification model are the same or similar.
  • the first generator G is used to realize X ⁇ Y, and is trained to make G(X) as close as possible to Y.
  • the first discriminator D Y is used to determine whether the fused images of each frame corresponding to the first training video are true or false.
  • the first adversarial loss can be expressed by the following formula:
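  • The formula itself is not reproduced in this text; assuming the standard adversarial form implied by the description of G and D Y, it would read:

```latex
\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
  = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log D_Y(y)\right]
  + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log\left(1 - D_Y(G(x))\right)\right]
```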
  • the second generator F is used to realize Y ⁇ X, and is trained to make F(Y) as close as possible to X.
  • the second discriminator D X is used to determine whether the fused frame images corresponding to the second training video are true or false. The second adversarial loss can be expressed by the following formula:
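  • Again the formula is not reproduced here; the symmetric counterpart, under the same assumption, would be:

```latex
\mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D_X(x)\right]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log\left(1 - D_X(F(y))\right)\right]
```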
  • The cycle-consistent loss is determined using the following method: the fused frame images corresponding to the first training video are input into the second generator to generate reconstructed images of each frame of the first training video, and the fused frame images corresponding to the second training video are input into the first generator to generate reconstructed images of each frame of the second training video; the cycle-consistent loss is then determined based on the difference between the reconstructed images of each frame of the first training video and the frame images of the first training video, and the difference between the reconstructed images of each frame of the second training video and the frame images of the second training video.
  • The images generated by the first generator are input into the second generator to obtain the reconstructed images of each frame of the first training video; the reconstructed images of each frame of the first training video generated by the second generator should be as consistent as possible with the frame images of the first training video, that is, F(G(x)) ≈ x. Likewise, the images generated by the second generator are input into the first generator to obtain the reconstructed images of each frame of the second training video, which should be as consistent as possible with the frame images of the second training video, that is, G(F(y)) ≈ y.
  • The difference between the reconstructed images of each frame of the first training video and the frame images of the first training video can be determined as follows: for each reconstructed frame of the first training video and the corresponding frame image of the first training video, determine the distance (such as the Euclidean distance) between the representation vectors of each pair of pixels at the same position in the reconstructed image and the corresponding image, and sum all the distances.
  • The difference between the reconstructed images of each frame of the second training video and the frame images of the second training video can be determined in the same way: for each reconstructed frame of the second training video and the corresponding frame image of the second training video, determine the distance (such as the Euclidean distance) between the representation vectors of each pair of pixels at the same position in the reconstructed image and the corresponding image, and sum all the distances.
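  • Formalizing the per-pixel Euclidean-distance reading above (an assumed notation, since the original formula is not reproduced here), with x_i and y_j denoting frames and p indexing pixel positions, the cycle-consistent loss can be written as:

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F)
  = \sum_{i}\sum_{p}\bigl\lVert F(G(x_i))[p] - x_i[p] \bigr\rVert_2
  + \sum_{j}\sum_{p}\bigl\lVert G(F(y_j))[p] - y_j[p] \bigr\rVert_2
```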
  • In step S450, the first generator and the second generator are trained according to the adversarial loss and the cycle-consistent loss.
  • the first adversarial loss, the second adversarial loss and the cycle-consistent loss can be weighted and summed to obtain the total loss, and the first generator and the second generator are trained based on the total loss.
  • where L cyc (G, F) represents the cycle-consistent loss and λ is a weight, which can be obtained through training.
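  • The weighted sum referred to above is not reproduced in the extracted text; a plausible form, with a single trainable weight λ on the cycle-consistent term, is:

```latex
\mathcal{L}(G, F, D_X, D_Y)
  = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)
  + \lambda\,\mathcal{L}_{\mathrm{cyc}}(G, F)
```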
  • In some embodiments, a loss based on the pixel difference between every two adjacent frames of the video is added during training.
  • The pixel-to-pixel loss is determined based on the pixel difference between every two adjacent fused frame images corresponding to the first training video and the pixel difference between every two adjacent fused frame images corresponding to the second training video. The first generator and the second generator are then trained based on the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss.
  • Specifically, for each position in every two adjacent fused frame images corresponding to the first training video, the distance between the representation vectors of the two pixels at that position is determined, and the distances for all positions are summed to obtain a first loss; for each position in every two adjacent fused frame images corresponding to the second training video, the distance between the representation vectors of the two pixels at that position is determined, and the distances for all positions are summed to obtain a second loss; the first loss and the second loss are summed to obtain the pixel-to-pixel loss. The pixel-to-pixel loss keeps the generated video from changing too much between adjacent frames.
  • the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss are weighted and summed to obtain a total loss; the first generator and the second generator are trained according to the total loss.
  • where λ1, λ2 and λ3 are weights that can be obtained through training, L P2P (G(x i ), G(x i+1 )) represents the first loss, and L P2P (F(y j ), F(y j+1 )) represents the second loss.
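  • Combining the three weights with the pixel-to-pixel terms named above, the total loss presumably takes a form along these lines (a reconstruction, not copied from the original):

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_1\bigl(\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)\bigr)
  + \lambda_2\,\mathcal{L}_{\mathrm{cyc}}(G, F)
  + \lambda_3\Bigl(\textstyle\sum_i L_{\mathrm{P2P}}\bigl(G(x_i), G(x_{i+1})\bigr)
  + \sum_j L_{\mathrm{P2P}}\bigl(F(y_j), F(y_{j+1})\bigr)\Bigr)
```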
  • In some embodiments, each part of the model can be pre-trained before end-to-end training. For example, a large amount of open-source face recognition data is first used to pre-train a face recognition model, and the feature map output before the classification part is taken as the facial feature extraction model (the choice of model for this part is not unique; taking VGG-19 as an example, the part before block 5 outputs an 8×8×512-dimensional feature map).
  • the facial feature extraction model and its parameters are then fixed, and the network is divided into two branches.
  • the two branches are the facial key point detection model and the expression classification model.
  • the two branches are fine-tuned using a facial key point detection dataset and an expression classification dataset respectively; the fine-tuning only trains the parameters of these two branch structures.
  • the choice of the face key point detection model is not unique; any convolutional-network-based model that can obtain accurate key points can be integrated into the present solution;
  • the expression classification model is a single-label classification task based on the convolutional network model. After pre-training, an end-to-end training process can be performed based on the foregoing embodiments. This can improve training efficiency.
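  • A sketch of the VGG-19 truncation mentioned above, shown with torchvision's ImageNet weights as a stand-in for the face-recognition pre-training described in the text (the exact layer index and the 128×128 input size are assumptions; under these assumptions the truncation yields the 8×8×512 feature map cited above):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a VGG-19 backbone (ImageNet weights here as a stand-in for the
# face-recognition pre-training that the disclosure actually uses).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)

# vgg.features[:28] keeps blocks 1-4 (up to and including the 4th max-pool),
# i.e. the part "before block 5"; its parameters are then frozen.
face_feature_extractor = nn.Sequential(*list(vgg.features.children())[:28])
for p in face_feature_extractor.parameters():
    p.requires_grad = False

x = torch.randn(1, 3, 128, 128)   # assumed input size
feat = face_feature_extractor(x)
print(feat.shape)                 # torch.Size([1, 512, 8, 8])
```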
  • the method of the above embodiment uses adversarial loss, cycle consistent loss, and pixel loss between two adjacent frames of the video to train the overall model, which can improve the accuracy of the model, and the end-to-end training process can improve efficiency and save computing resources.
  • the disclosed solution is suitable for editing facial expressions in videos.
  • This disclosure adopts a dedicated deep learning model that integrates expression recognition, key point detection and other techniques; through data training it learns the movement rules of facial key points under different expressions, and the output facial expression state is finally controlled by inputting the classification information of the target expression into the model.
  • Because the expression exists as a style state, it can be well superimposed when the character speaks or performs actions such as tilting the head or blinking, making the final output facial action video of the character natural and consistent.
  • the output result can have the same resolution and detail level as the input image, and the output result can still be stable, clear, and flawless at 1080p or even 2k resolution.
  • In step S250, action customization is performed.
  • Action customization refers to editing and processing the character actions in each frame of the image in the first video based on the character action customization information corresponding to the interactive scene, so as to realize the editing and control of digital human actions in the interactive scene.
  • Editing the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene includes: adjusting the first human body key points of the character in the original first key frame of the first video during the first action to obtain the second human body key points of the character during the second action, which serve as the character action customization information; using a feature extraction model, such as a convolution kernel model, to extract the feature information of the neighborhood of each second human body key point from the original first key frame; and inputting each second human body key point and the feature information of its neighborhood into the image generation model to output the target first key frame of the character during the second action.
  • the first human body key points include the character's human body outline feature points during the first action, such as the 14 pairs of white dots shown in Figure 4A.
  • the second human body key points include the character's human body outline feature points during the second action, such as the 14 pairs of white dots shown in Figure 4B.
  • Using human body outline feature points to edit character movements, as opposed to using human skeleton feature points, makes the generated character movements more accurate and less prone to deformation, distortion and the like, improving the quality of the generated images.
  • Extracting the human body contour feature points of the character during the first action includes, for example: using a semantic segmentation network model to extract the contour line of the character; using a target detection network model to extract multiple key points on the character, such as the black circle points shown in Figure 4C; connecting the multiple key points according to the structural information of the character to determine multiple key connection lines, such as the white straight lines shown in Figure 4C; and determining the pairs of human body contour feature points during the first action of the character according to the intersection points of the perpendiculars of the key connection lines with the contour line.
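  • The following sketch illustrates one way such contour feature point pairs could be computed from a person segmentation mask and a pair of detected key points, by marching outward along the perpendicular of their connection line until the mask is left; the function and its inputs are hypothetical and only approximate the intersection computation described above.

```python
import numpy as np

def contour_point_pair(mask, p1, p2, step=1.0, max_steps=500):
    """Given a binary person mask (H, W) and two key points p1, p2 (x, y) on a
    key connection line, return the pair of contour feature points where the
    perpendicular through the midpoint of p1-p2 meets the contour."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mid = (p1 + p2) / 2.0
    d = p2 - p1
    # unit vector perpendicular to the key connection line
    perp = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-8)

    def march(direction):
        pt = mid.copy()
        for _ in range(max_steps):
            nxt = pt + step * direction
            y, x = int(round(nxt[1])), int(round(nxt[0]))
            if (y < 0 or y >= mask.shape[0] or x < 0 or x >= mask.shape[1]
                    or mask[y, x] == 0):
                return pt  # last point still inside the mask ~ contour point
            pt = nxt
        return pt

    return march(perp), march(-perp)

# usage sketch: a toy rectangular "person" mask and a vertical limb line
mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:80, 40:60] = 1
left, right = contour_point_pair(mask, (50, 30), (50, 70))
print(left, right)  # approximately (40, 50) and (59, 50)
```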
  • The method of obtaining the image generation model includes: using a training video frame and the human body key points of the character in the training video frame as a pair of training data, using the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of the image generation network, using the training video frame in the training data as supervision information for the output of the image generation network, and training the image generation network to obtain the image generation model.
  • The gap between the video frame output by the image generation network for the input data and the training video frame is used as the loss function, and the parameters of the image generation network are iteratively updated according to the loss determined by the loss function until the loss satisfies a preset condition and training is complete.
  • At that point, the video frames output by the image generation network are very close to the training video frames, and the trained image generation network is used as the image generation model.
  • image generation networks are a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow methods, generative adversarial networks, etc. If the image generation network is a generative adversarial network, the total loss function also includes the discriminant loss function of the image discriminant network.
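  • As a minimal sketch of the training procedure described above (the generator module, the dataset interface and the L1 reading of the "gap" are assumptions; a discriminator loss term would be added for a generative adversarial network):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_image_generator(net: nn.Module, dataset, epochs=10, lr=1e-4):
    """dataset yields (keypoint_neighborhood_features, target_frame) pairs:
    the key points plus their neighborhood features as input, and the
    training video frame as supervision."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    recon_loss = nn.L1Loss()   # gap between generated and training frame

    for _ in range(epochs):
        for kp_feats, target_frame in loader:
            generated = net(kp_feats)
            loss = recon_loss(generated, target_frame)
            # for a generative adversarial network, a discriminator loss
            # term would be added to `loss` here
            opt.zero_grad()
            loss.backward()
            opt.step()
    return net   # trained network used as the image generation model
```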
  • In step S260, rendering and output are performed.
  • The character image is modeled using the material results processed in steps S220 to S250.
  • Different rendering technologies can be selected according to the application scenario and combined with artificial intelligence technologies such as intelligent dialogue, speech recognition, speech synthesis, and action interaction, so that a complete digital human video (i.e. the second video) capable of interacting in the scene is output.
  • In the above embodiments, the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing;
  • for example, a digital human image, digital human expressions, digital human actions and the like that match the interaction scene are generated.
  • According to the method of the disclosed embodiments, recording one set of character videos can quickly produce multiple sets of videos with different character styles for different scenes. Moreover, professional engineers are not required to operate the system; users can adjust the character's image, expressions, actions and the like according to the needs of the scene.
  • Figure 5 shows a schematic structural diagram of a digital human generation device according to some embodiments of the present disclosure.
  • the digital human generation device 500 of this embodiment includes units 510 to 530.
  • the acquisition unit 510 is configured to acquire the first video. For details, see step S220.
  • the customization unit 520 is configured to edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene. For details, see steps S230-250.
  • the customization unit 520 includes, for example, an image customization unit 521, an expression customization unit 522, an action customization unit 523, and the like.
  • the image customization unit 521 is configured to edit the character images in each frame image in the first video according to the character image customization information corresponding to the interaction scene.
  • the expression customization unit 522 is configured to edit the character expressions in each frame image in the first video according to the character expression customization information corresponding to the interaction scene.
  • the action customization unit 523 is configured to edit the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene. For details, see step S250.
  • the output unit 530 is configured to output the second video according to each frame image in the processed first video. For details, see step S260.
  • Figure 6 shows a schematic structural diagram of a digital human generation device according to other embodiments of the present disclosure.
  • the digital human generation device 600 of this embodiment includes: a memory 610 and a processor 620 coupled to the memory 610.
  • the processor 620 is configured to execute the digital human generation method of any of the foregoing embodiments based on instructions stored in the memory 610.
  • the memory 610 may include, for example, system memory, fixed non-volatile storage media, etc.
  • System memory stores, for example, operating systems, applications, boot loaders, and other programs.
  • the processor 620 can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gates or transistors, or other discrete hardware components.
  • the device 600 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650, the memory 610 and the processor 620 may be connected through a bus 660, for example.
  • the input and output interface 630 provides a connection interface for input and output devices such as a monitor, mouse, keyboard, and touch screen.
  • Network interface 640 provides a connection interface for various networked devices.
  • the storage interface 650 provides a connection interface for external storage devices such as SD cards and USB disks.
  • Bus 660 may use any of a variety of bus structures. For example, bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the digital human generation method in any of the foregoing embodiments.
  • Embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more non-transitory computer-readable storage media (including, but not limited to, disk memory, CD-ROM, optical storage, etc.) containing computer program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.


Abstract

The present disclosure relates to the technical field of computers. Provided are a digital human generation method and apparatus, and a storage medium. The method comprises: acquiring a first video; according to character customization information corresponding to an interaction scene, performing editing processing on characters in each frame of image in the first video; and outputting a second video according to each frame of image in the processed first video. Editing processing is performed on characters in a video according to character customization information corresponding to an interaction scene, and a digital human video that matches the interaction scene is generated by means of character editing.

Description

Digital human generation method and device and storage medium
Cross-reference to related applications
This application is based on, and claims priority to, the CN application with application number 202210541984.9 filed on May 18, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a digital human generation method and device and a storage medium.
Background
Driven by the wave of new technologies such as artificial intelligence and virtual reality, the performance of digital humans has improved in all respects. Digital humans represented by virtual anchors, virtual employees and the like have successfully entered the public eye and are flourishing in diverse forms in many fields such as film and television, games, media, culture and tourism, and finance.
The customization of digital human images strives for authenticity and personalization. Under the requirement of photographic-level hyper-realism, every detail of the digital human image will attract the attention of users. This places high demands on models when recording image material. However, a model is not a robot, and cannot perfectly match the timing and action positioning of the interaction scene in which the image will be used.
Summary of the invention
Some embodiments of the present disclosure propose a digital human generation method, including:
acquiring a first video;
editing the characters in each frame image in the first video according to character customization information corresponding to an interaction scene; and
outputting a second video according to each frame image in the processed first video.
In some embodiments, the first video is obtained by preprocessing an original video, and the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
In some embodiments, the resolution adjustment includes:
if the resolution of the original video is higher than a required preset resolution, downsampling the original video according to the preset resolution to obtain the first video with the preset resolution;
if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain the first video with the preset resolution, where the super-resolution model is used to increase the resolution of the input video to the preset resolution.
In some embodiments, the super-resolution model is obtained by training a neural network. During training, a first video frame from a high-definition video is downsampled according to the preset resolution to obtain a second video frame; the second video frame is used as the input of the neural network, the first video frame is used as supervision information for the output of the neural network, and the neural network is trained to obtain the super-resolution model.
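As a minimal illustration of this training setup (the downsampling factor, the L1 choice of loss and the network `net` are assumptions for the sketch, not details from the disclosure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sr_training_step(net: nn.Module, hd_frame: torch.Tensor, scale: int = 2):
    """One super-resolution training step: the high-definition frame is
    downsampled to form the network input (the 'second video frame'), and
    the original frame supervises the network output."""
    h, w = hd_frame.shape[-2:]
    lr_frame = F.interpolate(hd_frame, size=(h // scale, w // scale),
                             mode="bilinear", align_corners=False)
    restored = net(lr_frame)                 # upscaled back to (h, w)
    return F.l1_loss(restored, hd_frame)     # loss against the original frame
```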
In some embodiments, the frame rate adjustment includes:
if the frame rate of the original video is higher than a required preset frame rate, extracting frames from the original video according to the ratio between the frame rate of the original video and the preset frame rate to obtain the first video with the preset frame rate;
if the frame rate of the original video is lower than the required preset frame rate, interpolating the original video up to a first frame rate with a video frame interpolation model, where the first frame rate is the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and then extracting frames from the interpolated video according to the ratio between the first frame rate and the preset frame rate to obtain the first video with the preset frame rate, where the video frame interpolation model is used to generate a transition frame between any two frame images.
In some embodiments, the video frame interpolation model is obtained by training a neural network. During training, three consecutive frames in a training video frame sequence are taken as a triplet; the first and third frames of the triplet are used as the input of the neural network, the second frame of the triplet is used as supervision information for the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
In some embodiments, the input of the neural network includes: visual feature information and depth information of the first and third frames, and optical flow information and deformation information between the first and third frames.
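As an illustration of the least-common-multiple logic above, a sketch in plain Python; the `interpolate_fn` callable stands in for the video frame interpolation model and is hypothetical:

```python
from math import lcm

def adjust_frame_rate(frames, src_fps, target_fps, interpolate_fn):
    """Sketch of the frame-rate adjustment described above.
    interpolate_fn(a, b, t) returns a transition frame between frames
    a and b at position 0 < t < 1."""
    if src_fps >= target_fps:
        # downsample by keeping every (src_fps / target_fps)-th frame
        step = src_fps / target_fps
        return [frames[int(i * step)] for i in range(int(len(frames) / step))]

    # interpolate up to the least common multiple of the two rates, then decimate
    first_rate = lcm(src_fps, target_fps)
    factor = first_rate // src_fps          # frames per original gap
    dense = []
    for a, b in zip(frames[:-1], frames[1:]):
        dense.append(a)
        dense.extend(interpolate_fn(a, b, k / factor) for k in range(1, factor))
    dense.append(frames[-1])
    step = first_rate // target_fps
    return dense[::step]
```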
In some embodiments, editing the characters in each frame image in the first video according to the character customization information corresponding to the interaction scene includes one or more of the following:
editing the character images in each frame image in the first video according to character image customization information corresponding to the interaction scene;
editing the character expressions in each frame image in the first video according to character expression customization information corresponding to the interaction scene;
editing the character actions in each frame image in the first video according to character action customization information corresponding to the interaction scene.
In some embodiments, editing the character images in each frame image in the first video according to the character image customization information corresponding to the interaction scene includes: determining character image adjustment parameters according to the character image adjustments made by the user on some of the video frames in the first video, and editing the character images in the remaining video frames in the first video according to the character image adjustment parameters.
In some embodiments, editing the character images in the remaining video frames in the first video according to the character image adjustment parameters includes:
locating, through key point detection, the target part of the character in the remaining video frames in the first video according to the target part of the character image adjustment in the character image adjustment parameters;
adjusting the amplitude or position of the located target part through graphics transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
In some embodiments, the character expression customization information includes preset classification information corresponding to a target expression, and editing the character expressions in each frame image in the first video according to the character expression customization information corresponding to the interaction scene includes:
obtaining feature information of each frame image in the first video, feature information of face key points and classification information of the original expression;
fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain feature information of a fused image corresponding to each frame image;
generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image, where all the fused images form a second video in which the facial expression is the target expression.
In some embodiments, obtaining the feature information of each frame image in the first video, the feature information of the face key points and the classification information of the original expression includes:
inputting each frame image in the first video into a facial feature extraction model to obtain the output feature information of each frame image;
inputting the feature information of each frame image into a face key point detection model to obtain coordinate information of the face key points of each frame image, and reducing the dimensionality of the coordinate information of all face key points by principal component analysis to obtain information of a preset dimension as the feature information of the face key points;
inputting the feature information of each frame image into an expression classification model to obtain the classification information of the original expression of each frame image.
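As an illustration of the principal component analysis step above, a sketch using scikit-learn; the number of detected key points (68) and the number of frames are assumptions, while the 6-dimensional output matches the example dimension mentioned earlier in this description:

```python
import numpy as np
from sklearn.decomposition import PCA

# assume 68 detected face key points per frame, each with (x, y) coordinates
frames_keypoints = np.random.rand(500, 68, 2)               # 500 frames, for illustration
flat = frames_keypoints.reshape(len(frames_keypoints), -1)  # (500, 136)

pca = PCA(n_components=6)                    # preset dimension, e.g. 6
keypoint_features = pca.fit_transform(flat)  # (500, 6) per-frame key point features
print(keypoint_features.shape)
```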
In some embodiments, fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
adding and averaging the classification information of the original expression of each frame image and the preset classification information corresponding to the target expression to obtain classification information of a fused expression corresponding to each frame image;
concatenating the feature information of the face key points of each frame image multiplied by a first weight obtained through training, the feature information of each frame image multiplied by a second weight obtained through training, and the classification information of the fused expression corresponding to each frame image.
In some embodiments, generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image includes:
inputting the feature information of the fused image corresponding to each frame image into a decoder, which outputs the generated fused image corresponding to each frame image;
where the facial feature extraction model includes convolutional layers and the decoder includes deconvolution layers.
In some embodiments, the first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression are input into an expression generation model, which outputs the second video in which the facial expression is the target expression. The training method of the expression generation model includes:
obtaining training pairs composed of the frame images of a first training video and the frame images of a second training video;
inputting each frame image of the first training video into a first generator; obtaining feature information of each frame image of the first training video, feature information of face key points and classification information of the original expression; fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression to obtain feature information of each fused frame image corresponding to the first training video; and obtaining, according to the feature information of each fused frame image corresponding to the first training video, each fused frame image corresponding to the first training video output by the first generator;
inputting each frame image of the second training video into a second generator; obtaining feature information of each frame image of the second training video, feature information of face key points and classification information of the target expression; fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression to obtain feature information of each fused frame image corresponding to the second training video; and obtaining, according to the feature information of each fused frame image corresponding to the second training video, each fused frame image corresponding to the second training video output by the second generator;
determining an adversarial loss and a cycle-consistent loss according to each fused frame image corresponding to the first training video and each fused frame image corresponding to the second training video;
training the first generator and the second generator according to the adversarial loss and the cycle-consistent loss, the trained first generator being used as the expression generation model.
In some embodiments, the method further includes: determining a pixel-to-pixel loss according to the pixel difference between every two adjacent fused frame images corresponding to the first training video and the pixel difference between every two adjacent fused frame images corresponding to the second training video;
where training the first generator and the second generator according to the adversarial loss and the cycle-consistent loss includes:
training the first generator and the second generator according to the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss.
In some embodiments, determining the adversarial loss according to each fused frame image corresponding to the first training video and each fused frame image corresponding to the second training video includes: inputting each fused frame image corresponding to the first training video into a first discriminator to obtain a first discrimination result of each fused frame image corresponding to the first training video;
inputting each fused frame image corresponding to the second training video into a second discriminator to obtain a second discrimination result of each fused frame image corresponding to the second training video;
determining a first adversarial loss according to the first discrimination result of each fused frame image corresponding to the first training video, and determining a second adversarial loss according to the second discrimination result of each fused frame image corresponding to the second training video.
In some embodiments, inputting each fused frame image corresponding to the first training video into the first discriminator to obtain the first discrimination result of each fused frame image corresponding to the first training video includes:
inputting each fused frame image corresponding to the first training video into a first facial feature extraction model in the first discriminator to obtain the output feature information of each fused frame image corresponding to the first training video;
inputting the feature information of each fused frame image corresponding to the first training video into a first expression classification model in the first discriminator to obtain the expression classification information of each fused frame image corresponding to the first training video as the first discrimination result;
and inputting each fused frame image corresponding to the second training video into the second discriminator to obtain the second discrimination result of each fused frame image corresponding to the second training video includes:
inputting each fused frame image corresponding to the second training video into a second facial feature extraction model in the second discriminator to obtain the output feature information of each fused frame image corresponding to the second training video;
inputting the feature information of each fused frame image corresponding to the second training video into a second expression classification model in the second discriminator to obtain the expression classification information of each fused frame image corresponding to the second training video as the second discrimination result.
In some embodiments, the cycle-consistent loss is determined using the following method:
inputting each fused frame image corresponding to the first training video into the second generator to generate reconstructed images of each frame of the first training video, and inputting each fused frame image corresponding to the second training video into the first generator to generate reconstructed images of each frame of the second training video;
determining the cycle-consistent loss according to the difference between the reconstructed images of each frame of the first training video and the frame images of the first training video, and the difference between the reconstructed images of each frame of the second training video and the frame images of the second training video.
In some embodiments, the pixel-to-pixel loss is determined using the following method:
for each position in every two adjacent fused frame images corresponding to the first training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frame images, and summing the distances for all positions to obtain a first loss;
for each position in every two adjacent fused frame images corresponding to the second training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frame images, and summing the distances for all positions to obtain a second loss;
summing the first loss and the second loss to obtain the pixel-to-pixel loss.
In some embodiments, obtaining the feature information of each frame image of the first training video, the feature information of the face key points and the classification information of the original expression includes: inputting each frame image of the first training video into a third facial feature extraction model in the first generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all face key points by principal component analysis to obtain first information of a preset dimension as the feature information of the face key points of each frame image of the first training video; and inputting the feature information of each frame image of the first training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video;
and obtaining the feature information of each frame image of the second training video, the feature information of the face key points and the classification information of the target expression includes: inputting each frame image of the second training video into a fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all face key points by principal component analysis to obtain second information of a preset dimension as the feature information of the face key points of each frame image of the second training video; and inputting the feature information of each frame image of the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
In some embodiments, fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes: adding and averaging the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression to obtain the classification information of the fused expression corresponding to each frame image of the first training video; and concatenating the feature information of the face key points of each frame image of the first training video multiplied by a first weight to be trained, the feature information of each frame image of the first training video multiplied by a second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video;
and fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression includes: adding and averaging the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression to obtain the classification information of the fused expression corresponding to each frame image of the second training video; and concatenating the feature information of the face key points of each frame image of the second training video multiplied by a third weight to be trained, the feature information of each frame image of the second training video multiplied by a fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video.
In some embodiments, training the first generator and the second generator according to the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss includes: weighting and summing the adversarial loss, the cycle-consistent loss and the pixel-to-pixel loss to obtain a total loss, and training the first generator and the second generator according to the total loss.
In some embodiments, editing the character actions in each frame image in the first video according to the character action customization information corresponding to the interaction scene includes:
adjusting the first human body key points of the character in an original first key frame in the first video during a first action to obtain second human body key points of the character during a second action, as the character action customization information;
extracting feature information of the neighborhood of each second human body key point from the original first key frame;
inputting each second human body key point and the feature information of its neighborhood into an image generation model, which outputs a target first key frame of the character during the second action.
In some embodiments, the method of obtaining the image generation model includes: using a training video frame and the human body key points of the character in the training video frame as a pair of training data, using the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of an image generation network, using the training video frame in the training data as supervision information for the output of the image generation network, and training the image generation network to obtain the image generation model.
In some embodiments, the first human body key points include human body outline feature points of the character during the first action, and the second human body key points include human body outline feature points of the character during the second action.
Some embodiments of the present disclosure provide a digital human generation device, including: a memory; and a processor coupled to the memory, the processor being configured to execute the digital human generation method of any of the embodiments based on instructions stored in the memory.
Some embodiments of the present disclosure provide a digital human generation device, including:
an acquisition unit configured to acquire a first video;
a customization unit configured to edit the characters in each frame image in the first video according to character customization information corresponding to an interaction scene; and
an output unit configured to output a second video according to each frame image in the processed first video.
Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the digital human generation method of any of the embodiments.
Description of the drawings
The drawings required for the description of the embodiments or the related art are briefly introduced below. The present disclosure can be understood more clearly from the following detailed description taken with reference to the drawings.
Obviously, the drawings in the following description are merely some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
Figure 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
Figure 2 shows a schematic diagram of video preprocessing according to some embodiments of the present disclosure.
Figure 3A shows a schematic flowchart of an expression generation method according to some embodiments of the present disclosure.
Figure 3B shows a schematic diagram of an expression generation method according to other embodiments of the present disclosure.
Figure 3C shows a schematic flowchart of a training method of an expression generation model according to some embodiments of the present disclosure.
Figure 3D shows a schematic diagram of a training method of an expression generation model according to some embodiments of the present disclosure.
Figure 4A shows a schematic diagram of the human body outline feature points of a character during a first action according to some embodiments of the present disclosure.
Figure 4B shows a schematic diagram of the human body outline feature points of a character during a second action according to some embodiments of the present disclosure.
Figure 4C shows a schematic diagram of multiple key points and multiple key connection lines on a character according to some embodiments of the present disclosure.
Figure 5 shows a schematic structural diagram of a digital human generation device according to some embodiments of the present disclosure.
Figure 6 shows a schematic structural diagram of a digital human generation device according to other embodiments of the present disclosure.
具体实施方式Detailed ways
下面将结合本公开实施例中的附图,对本公开实施例中的技术方案进行清楚、完整地描述。The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure.
Unless otherwise specified, terms such as "first" and "second" in this disclosure are used only to distinguish different objects and do not imply any meaning of magnitude or temporal order.
本公开实施例根据交互场景相应的人物定制信息对视频中的人物进行编辑处理,通过人物编辑生成与交互场景匹配的数字人视频。Embodiments of the present disclosure edit the characters in the video based on the character customization information corresponding to the interaction scene, and generate a digital human video that matches the interaction scene through character editing.
图1A示出本公开一些实施例的数字人生成方法的流程示意图。Figure 1A shows a schematic flowchart of a digital human generation method according to some embodiments of the present disclosure.
如图1A所示,该实施例的数字人生成方法包括以下步骤。As shown in Figure 1A, the digital human generation method of this embodiment includes the following steps.
在步骤S110,获取第一视频。In step S110, the first video is obtained.
第一视频例如可以是录制的原视频,也可以是由原视频经过预处理得到的,所述预处理包括分辨率调整、帧间平滑处理、帧率调整中的一项或多项。For example, the first video may be a recorded original video, or may be obtained by preprocessing the original video. The preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
在步骤S120,根据交互场景相应的人物定制信息,对第一视频中的各帧图像中的人物进行编辑处理。In step S120, edit the characters in each frame of the image in the first video according to the character customization information corresponding to the interaction scene.
所述根据交互场景相应的人物定制信息,对第一视频中的各帧图像中的人物进行编辑处理包括以下中的一项或多项:根据交互场景相应的人物形象定制信息,对第一视频中的各帧图像中的人物形象进行编辑处理,生成与交互场景匹配的数字人形象;根据交互场景相应的人物表情定制信息,对第一视频中的各帧图像中的人物表情进行编辑处理,生成与交互场景匹配的数字人表情;根据交互场景相应的人物动作定制信息,对第一视频中的各帧图像中的人物动作进行编辑处理,生成与交互场景匹配的数字人动作。The editing process of the characters in each frame image in the first video according to the character customization information corresponding to the interaction scene includes one or more of the following: according to the character customization information corresponding to the interaction scene, editing the first video Edit and process the characters in each frame of the image in the first video to generate a digital human image that matches the interaction scene; edit and process the characters in each frame of the image in the first video based on the character expression customization information corresponding to the interaction scene. Generate digital human expressions that match the interaction scene; edit the character movements in each frame of the image in the first video according to the customized information of character movements corresponding to the interaction scene, and generate digital human movements that match the interaction scene.
在步骤S130,根据处理后的第一视频中的各帧图像,输出第二视频。In step S130, a second video is output based on each frame image in the processed first video.
即,处理后的第一视频中的各帧图像组合形成第二视频,第二视频是与交互场景匹配的数字人视频。That is, each frame image in the processed first video is combined to form a second video, and the second video is a digital human video matching the interaction scene.
上述实施例,根据交互场景相应的人物定制信息对视频中的人物进行编辑处理,通过人物编辑生成与交互场景匹配的数字人视频,例如,生成与交互场景匹配的数字人形象、数字人表情、数字人动作等。In the above embodiment, the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing. For example, a digital human image, digital human expression, etc. that match the interaction scene are generated. Digital human actions and more.
图1B示出本公开另一些实施例的数字人生成方法的流程示意图。FIG. 1B shows a schematic flowchart of a digital human generation method according to other embodiments of the present disclosure.
如图1B所示,该实施例的数字人生成方法包括以下步骤。As shown in Figure 1B, the digital human generation method of this embodiment includes the following steps.
在步骤S210,定制逻辑控制。In step S210, logic control is customized.
定制逻辑控制用来对视频预处理、形象定制、表情定制、动作定制等定制逻辑是否执行、执行顺序等进行控制。Customized logic control is used to control whether custom logic such as video preprocessing, image customization, expression customization, action customization, etc. is executed and the order of execution.
The content edited by video preprocessing, appearance customization, expression customization, and action customization is independent, with no strong dependency among the parts. Their execution order can therefore be swapped while still achieving the basic effect of generating a digital human video that matches the interaction scene. Nevertheless, the parts do influence one another to some degree; executing them in the order of S220 to S250 of this embodiment minimizes this mutual influence and yields the best final presentation of the character.
在步骤S220,视频预处理。In step S220, video preprocessing.
视频预处理是对录制的原视频进行预处理得到第一视频,所述预处理包括分辨率调整、帧间平滑处理、帧率调整中的一项或多项。Video preprocessing is to preprocess the recorded original video to obtain the first video. The preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
In some embodiments, as shown in Figure 2, the preprocessing is performed in the order of resolution adjustment, inter-frame smoothing, and frame rate adjustment. This ordering gives better preprocessing results: it preserves the visual information of the original video to the greatest extent, ensures that the preprocessed video is free of quality problems such as blurring or distortion, and minimizes the impact of the frame rate and resolution adjustments on the subsequent digital human customization flow.
The resolution adjustment includes: if the resolution of the original video is higher than the required preset resolution, downsampling the original video to the preset resolution to obtain the first video at the preset resolution; if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain the first video at the preset resolution, where the super-resolution model is used to raise the resolution of the input video to the preset resolution; and if the resolution of the original video already equals the required preset resolution, the resolution adjustment step can be skipped.
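A minimal Python sketch of this branching, assuming OpenCV (cv2) for resampling; super_resolve is a caller-supplied function standing in for the trained super-resolution model and is not specified by this document.

import cv2

def adjust_resolution(frames, target_hw, super_resolve=None):
    """Return frames whose resolution equals target_hw = (height, width)."""
    th, tw = target_hw
    out = []
    for frame in frames:
        h, w = frame.shape[:2]
        if (h, w) == (th, tw):
            out.append(frame)  # already at the preset resolution: skip adjustment
        elif h > th and w > tw:
            # higher than required: downsample to the preset resolution
            out.append(cv2.resize(frame, (tw, th), interpolation=cv2.INTER_AREA))
        else:
            # lower than required: let the super-resolution model raise it
            out.append(super_resolve(frame, (th, tw)))
    return out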
通过分辨率调整,可以使得预处理后的第一视频在分辨率方面保持一致性,降低原视频差异化分辨率对数字人定制效果的影响。Through resolution adjustment, the resolution of the preprocessed first video can be maintained consistent, and the impact of the differentiated resolution of the original video on the digital human customization effect can be reduced.
The super-resolution model is obtained, for example, by training a neural network. During training, a first video frame taken from a high-definition video is downsampled to the preset resolution to obtain a second video frame; the second video frame serves as the input of the neural network and the first video frame serves as the supervision for its output. The gap between the frame output by the network and the first video frame is used as the loss function, and the network parameters are updated iteratively according to this loss until the loss satisfies a given condition, at which point training is complete and the output frame is very close to the first video frame; the trained network is then used as the super-resolution model. Here, "neural network" covers a broad class of models, including but not limited to convolutional neural networks, optical-flow-based recurrent networks, and generative adversarial networks.
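A minimal PyTorch training-loop sketch of this scheme; the framework, network architecture, batch size, bicubic downsampling, and L1 loss are illustrative assumptions, since the text only specifies downsampled inputs, high-definition supervision, and a gap-based loss.

import torch
import torch.nn.functional as F

def train_super_resolution(model, hd_frames, low_hw, optimizer, steps=1000):
    """hd_frames: tensor (N, C, H, W) of frames taken from high-definition video."""
    for _ in range(steps):
        idx = torch.randint(0, hd_frames.size(0), (8,))
        target = hd_frames[idx]                                   # first video frame (supervision)
        inp = F.interpolate(target, size=low_hw, mode="bicubic",  # second video frame: downsampled copy
                            align_corners=False)
        pred = model(inp)                                         # network upscales back to H x W
        loss = F.l1_loss(pred, target)                            # gap between output and the HD frame
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model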
For example, key frames of a high-definition (1080p) video are downsampled to obtain second video frames at lower resolutions (such as 360p/480p/720p), and a super-resolution model is trained with the method above; using this model, a first video at 480p/720p/1080p can be obtained from an original video of any resolution. Here, 360p/480p/720p/1080p are video display formats in which "p" denotes progressive scan; for example, the 1080p picture resolution is 1920 by 1080.
分辨率调整后,由超分辨模型生成或降采样得到的帧序列中,两帧之间的纹理信息可能存在一定的差距,故而在此采用通过帧间平滑处理,以保证视频播放时纹理、人物边缘等处不会有锯齿或摩尔纹的产生,避免造成视觉上的影响。After the resolution is adjusted, in the frame sequence generated by the super-resolution model or obtained by downsampling, there may be a certain gap in the texture information between the two frames. Therefore, inter-frame smoothing is used here to ensure that the texture and characters are smooth during video playback. There will be no jagged or moiré patterns on the edges to avoid visual impact.
帧间平滑处理例如可以采用平均值的平滑处理方式。例如,连续三帧的图像信息取平均值,将该平均值作为该连续三帧中的中间帧的图像信息。The inter-frame smoothing process may, for example, adopt an average smoothing process. For example, the image information of three consecutive frames is averaged, and the average is used as the image information of the middle frame among the three consecutive frames.
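A short NumPy sketch of this three-frame averaging, assuming frames are stored as an (N, H, W, C) uint8 array.

import numpy as np

def smooth_frames(frames):
    """Replace each middle frame of a consecutive triple by the average of the three frames."""
    frames = np.asarray(frames, dtype=np.float32)   # (N, H, W, C)
    smoothed = frames.copy()
    for i in range(1, len(frames) - 1):
        smoothed[i] = (frames[i - 1] + frames[i] + frames[i + 1]) / 3.0
    return smoothed.astype(np.uint8)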
The frame rate adjustment includes: if the frame rate of the original video is higher than the required preset frame rate, decimating the original video according to the ratio between its frame rate and the preset frame rate to obtain the first video at the preset frame rate; if the frame rate of the original video is lower than the required preset frame rate, using a video frame-interpolation model to interpolate the original video up to a first frame rate, where the first frame rate is the least common multiple of the original frame rate before interpolation and the preset frame rate, and then decimating the interpolated video according to the ratio between the first frame rate and the preset frame rate to obtain the first video at the preset frame rate, the frame-interpolation model being used to generate transition frames between any two frames; and if the frame rate of the original video already equals the required preset frame rate, the frame rate adjustment step can be skipped.
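A Python sketch of this branching; interpolate_to_fps is a hypothetical wrapper around the frame-interpolation model, and math.lcm computes the least common multiple.

import math

def adjust_frame_rate(frames, src_fps, target_fps, interpolate_to_fps=None):
    if src_fps == target_fps:
        return list(frames)                         # nothing to do
    if src_fps > target_fps:
        # higher than required: keep roughly every (src_fps / target_fps)-th frame
        step = src_fps / target_fps
        return [frames[round(i * step)] for i in range(int(len(frames) / step))]
    # lower than required: interpolate up to lcm(src_fps, target_fps), then decimate
    lcm_fps = math.lcm(src_fps, target_fps)
    dense = interpolate_to_fps(frames, src_fps, lcm_fps)
    step = lcm_fps // target_fps
    return dense[::step]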
通过帧率调整,可以使得预处理后的第一视频在帧率方面保持一致性,降低原视频差异化帧率对数字人定制效果的影响。并且,插帧操作还可以有效解决两动作间的跳变问题。例如,数字人做完动作A做动作B,未经过插帧处理的视频播放时会使用户感觉到人物动作跳变,不够真实,本实施例通过插帧会在两动作的关键帧之间插入若干过渡帧,使得插帧处理后的视频播放时使用户感觉人物动作过渡自然,比较真实。Through frame rate adjustment, the frame rate of the preprocessed first video can be maintained consistent, and the impact of the differentiated frame rate of the original video on the digital human customization effect can be reduced. Moreover, the frame insertion operation can also effectively solve the jump problem between two actions. For example, after a digital person performs action A and then moves to action B, the user will feel that the character's movements jump when playing a video without frame insertion processing, which is not realistic enough. In this embodiment, frame insertion is used to insert between the key frames of the two actions. Several transition frames make the user feel that the transition of character movements is natural and more realistic when the video after frame insertion is played.
The video frame-interpolation model is obtained, for example, by training a neural network. During training, every three consecutive frames of a training frame sequence form a triplet; the first and third frames of the triplet are the input of the neural network, and the second frame is the supervision for its output. The gap between the frame the network produces from the first and third frames and the actual second frame of the triplet is used as the loss function, and the network parameters are updated iteratively according to this loss until the loss satisfies a given condition, at which point training is complete and the output frame is very close to the second frame of the triplet. The trained network is used as the frame-interpolation model and can generate a transition frame between any two images. Here, "neural network" again covers a broad class of models, including but not limited to convolutional neural networks, optical-flow-based recurrent networks, and generative adversarial networks.
其中,神经网络的输入例如包括:第一帧和第三帧的视觉特征信息和深度信息,以及第一帧和第三帧之间的光流信息和形变信息。通过这四部分信息的融合,所推理出的两帧之间应插入的过渡帧能够使视频过渡更加顺畅。The input to the neural network includes, for example: visual feature information and depth information of the first frame and the third frame, as well as optical flow information and deformation information between the first frame and the third frame. Through the fusion of these four parts of information, the inferred transition frame that should be inserted between two frames can make the video transition smoother.
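A minimal PyTorch sketch of the triplet training described above; how the model internally computes and fuses the four cues (visual features, depth, optical flow, deformation) is left to the model and is not specified here.

import torch
import torch.nn.functional as F

def train_interpolation(model, video, optimizer, epochs=10):
    """video: tensor (N, C, H, W) holding an ordered training frame sequence."""
    for _ in range(epochs):
        for i in range(video.size(0) - 2):
            first, middle, third = video[i], video[i + 1], video[i + 2]
            pred = model(first.unsqueeze(0), third.unsqueeze(0))   # inferred transition frame
            loss = F.l1_loss(pred, middle.unsqueeze(0))            # gap to the real middle frame
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model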
在步骤S230,形象定制。 In step S230, image customization.
根据交互场景相应的人物形象定制信息,对第一视频中的各帧图像中的人物形象进行编辑处理,满足用户对数字人美颜美体的需要。其中,形象定制例如包括磨皮、瘦脸、大眼、五官位置调整、身体比例调整,如瘦身,腿部拉长等美颜美体操作。According to the character customization information corresponding to the interactive scene, the characters in each frame of the image in the first video are edited to meet the user's needs for digital human beauty and body beautification. Among them, image customization includes, for example, skin resurfacing, face slimming, eye enlargement, facial feature position adjustment, body proportion adjustment, such as slimming down, leg lengthening and other beauty and body beautification operations.
In some embodiments, character-appearance adjustment parameters are determined from the adjustments the user makes on some of the video frames of the first video, and the characters in the remaining video frames of the first video are edited according to those parameters. The "some of the video frames" may be, for example, one or a few key frames of the first video. In this way the appearance customization of the digital human over the whole video is completed with only a small amount of editing work, which improves customization efficiency and reduces customization cost.
Editing the characters in the remaining video frames of the first video according to the appearance adjustment parameters includes: locating, by key-point detection, the target part indicated by the adjustment parameters (for example, the facial features or the body) in each of the remaining video frames; and adjusting the magnitude or position of the located target part through a graphics transformation according to the magnitude information or position information contained in the adjustment parameters.
For example, if the user enlarges the character's eyes in some key frames, the face is first detected by face detection, the eyes of the character in the remaining video frames are then located by key-point detection, and the magnitude of the user's enlargement, for example the increase in the distance between the upper and lower eyelids, is applied to the eyes in the remaining frames through a graphics transformation, so that the big-eye beautification effect is achieved for the character in every frame of the video.
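A sketch of how such a key-frame edit might be propagated; detect_landmarks and warp_eye_region are hypothetical helpers standing in for the key-point detection and graphics transformation steps.

def eye_scale_from_keyframe(original_lms, edited_lms):
    # ratio of upper/lower eyelid distance after vs. before the user's edit
    before = abs(original_lms["upper_eyelid"][1] - original_lms["lower_eyelid"][1])
    after = abs(edited_lms["upper_eyelid"][1] - edited_lms["lower_eyelid"][1])
    return after / before

def propagate_eye_edit(frames, scale, detect_landmarks, warp_eye_region):
    out = []
    for frame in frames:
        lms = detect_landmarks(frame)                    # locate the target part in this frame
        out.append(warp_eye_region(frame, lms, scale))   # apply the same magnitude of change
    return out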
在步骤S240,表情定制。In step S240, expression customization.
Expression customization is an expression generation method that edits the facial expression of the character in each frame of the first video according to the expression customization information corresponding to the interaction scene, for example the preset classification information corresponding to a target expression. It enables control of the digital human's facial expression in the interaction scene: one expression state of the digital human can be transferred to another, target expression state while only the facial expression changes and the mouth shape while speaking, head movements, and so on are unaffected. As a result, when the digital human utters the corresponding spoken content, its expression can change accordingly with that content.
图3A为本公开表情生成方法一些实施例的流程图。如图3A所示,该实施例的方法包括:步骤S310~S330。Figure 3A is a flow chart of some embodiments of the expression generation method of the present disclosure. As shown in Figure 3A, the method in this embodiment includes: steps S310 to S330.
在步骤S310中,获取第一视频中每帧图像的特征信息、人脸关键点的特征信息和原表情的分类信息。In step S310, the characteristic information of each frame of the image in the first video, the characteristic information of the key points of the face, and the classification information of the original expression are obtained.
第一视频中的人脸表情为原表情。即,第一视频中各帧图像中人脸表情主要为原表情,原表情例如是平静表情。The facial expressions in the first video are the original expressions. That is, the human facial expression in each frame image in the first video is mainly the original expression, and the original expression is, for example, a calm expression.
In some embodiments, each frame of the first video is input into a face feature extraction model to obtain the feature information of that frame; the feature information of each frame is input into a face key-point detection model to obtain the coordinate information of the face key points of that frame; principal component analysis (PCA) is applied to the coordinate information of all the face key points to reduce it to a preset dimension, which is used as the feature information of the face key points; and the feature information of each frame is input into an expression classification model to obtain the classification information of the original expression of that frame.
The overall expression generation model includes an encoder and a decoder. The encoder may include the face feature extraction model, the face key-point detection model, and the expression classification model, with the face feature extraction model feeding both the key-point detection model and the expression classification model. The face feature extraction model may reuse an existing model, for example a deep learning model with feature extraction capability such as VGG-19, ResNet, or a Transformer; the part of VGG-19 before block 5 can be used as the face feature extraction model. The face key-point detection model and the expression classification model may also reuse existing models such as an MLP (multi-layer perceptron), specifically a 3-layer MLP. Once trained, the expression generation model is used to generate expressions; the training process is described in detail below.
The feature information of each frame of the first video is, for example, the feature map output by the face feature extraction model. The key points include, for example, 68 points such as the chin, the area between the eyebrows, and the corners of the mouth, each represented by the horizontal and vertical coordinates of its position. After the coordinate information of the key points is obtained from the face key-point detection model, PCA is applied to the coordinates of all the key points to reduce redundancy and improve efficiency, yielding information of a preset dimension (for example 6 dimensions, which gives the best results) as the feature information of the face key points. The expression classification model can output a classification over several expressions such as neutral, happy, and sad, which can be represented as a one-hot encoded vector. The classification information of the original expression can be the one-hot style encoding, obtained from the expression classification model, of the original expression in each frame of the first video.
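A compact PyTorch sketch of an encoder in this spirit; the VGG-19 truncation point, the pooling, the head widths, and the use of scikit-learn PCA are illustrative assumptions rather than requirements of the text.

import torch
import torch.nn as nn
import torchvision
from sklearn.decomposition import PCA

class ExpressionEncoder(nn.Module):
    def __init__(self, n_classes=4, n_landmarks=68):
        super().__init__()
        vgg = torchvision.models.vgg19()
        self.backbone = vgg.features[:27]               # feature map before block 5 (assumed split)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        feat_dim = 512 * 4 * 4
        self.landmark_head = nn.Sequential(             # 3-layer MLP -> 68 (x, y) coordinates
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_landmarks * 2))
        self.expr_head = nn.Sequential(                 # 3-layer MLP -> expression class scores
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, x):
        fmap = self.backbone(x)
        flat = self.pool(fmap).flatten(1)
        return fmap, self.landmark_head(flat), self.expr_head(flat).softmax(dim=-1)

# PCA is fitted once on landmark vectors gathered from the video, then reused:
# pca = PCA(n_components=6).fit(all_landmarks)   # all_landmarks: (num_frames, 136)
# landmark_feat = pca.transform(landmarks)       # 6-dimensional key-point feature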
在步骤S320中,将每帧图像的特征信息、人脸关键点的特征信息、原表情的分类信息与目标表情对应的预设分类信息进行融合,得到每帧图像对应的融合图像的特征信息。In step S320, the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the feature information of the fused image corresponding to each frame of image.
在一些实施例中,将每帧图像的原表情的分类信息与目标表情对应的预设分类信息进行加和取平均,得到每帧图像对应的融合表情的分类信息;将与训练得到的第一权重相乘后的每帧图像的人脸关键点的特征信息,与训练得到的第二权重相乘后的每帧图像的特征信息,以及每帧图像对应的融合表情的分类信息进行拼接。In some embodiments, the classification information of the original expression of each frame of image and the preset classification information corresponding to the target expression are added and averaged to obtain the classification information of the fused expression corresponding to each frame of image; The feature information of the face key points of each frame of image multiplied by the weights, the feature information of each frame of the image multiplied by the second weight obtained by training, and the classification information of the fused expression corresponding to each frame of the image are spliced.
The target expression differs from the original expression and is, for example, a smiling expression; the preset classification information corresponding to the target expression is, for example, a preset one-hot code of the target expression. The preset classification information does not need to be produced by a model and can be encoded directly with the preset (one-hot) encoding rule; for example, the calm expression is encoded as [1, 0, 0, 0] and the smiling expression as [0, 1, 0, 0]. The classification information of the original expression mentioned above is obtained from the expression classification model and may differ from the preset classification information of the original expression: for example, the original expression is a calm expression with preset one-hot code [1, 0, 0, 0], whereas the code produced by the expression classification model may be [0.8, 0.2, 0, 0].
编码器还可以包括特征融合模型,将每帧图像的特征信息、人脸关键点的特征信息、原表情的分类信息与目标表情对应的预设分类信息输入特征融合模型进行融合。特征融合模型中需要训练的参数包括第一权重和第二权重。针对每帧图像,训练得到的第一权重与该图像的人脸关键点的特征信息相乘,得到第一特征向量,训练得到的第二权重与该图像的特征信息相乘,得到第二特征向量,将第一特征向量、第二特征向量与该图像对应的融合表情的分类信息进行拼接,得到该图像对应的融合图像的特征信息。第一权重和第二权重可以使三种信息的值域统一。The encoder can also include a feature fusion model, which inputs the feature information of each frame of image, the feature information of facial key points, the classification information of the original expression and the preset classification information corresponding to the target expression into the feature fusion model for fusion. The parameters that need to be trained in the feature fusion model include the first weight and the second weight. For each frame of image, the first weight obtained by training is multiplied by the feature information of the facial key points of the image to obtain the first feature vector, and the second weight obtained by training is multiplied by the feature information of the image to obtain the second feature. vector, splicing the first feature vector, the second feature vector and the classification information of the fused expression corresponding to the image to obtain the feature information of the fused image corresponding to the image. The first weight and the second weight can unify the value ranges of the three types of information.
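A small sketch of the fusion step under the shapes suggested above; w1 and w2 play the role of the trainable first and second weights.

import torch

def fuse(feat_map, landmark_feat, pred_expr, target_onehot, w1, w2):
    fused_expr = (pred_expr + target_onehot) / 2.0      # fused expression classification
    img_feat = feat_map.flatten(1)                      # (B, C*H*W) image feature vector
    # concatenate weighted key-point feature, weighted image feature, fused class vector
    return torch.cat([w1 * landmark_feat, w2 * img_feat, fused_expr], dim=1)

# Example shapes: feat_map (B, 512, 8, 8), landmark_feat (B, 6),
# pred_expr / target_onehot (B, 4); w1 and w2 are learnable scalars (nn.Parameter).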
在步骤S330中,根据每帧图像对应的融合图像的特征信息,生成每帧图像对应的融合图像,所有融合图像组合形成人脸表情是目标表情的第二视频。In step S330, a fused image corresponding to each frame of image is generated based on the feature information of the fused image corresponding to each frame of image, and all the fused images are combined to form a second video in which the facial expression is the target expression.
在一些实施例中,将每帧图像对应的融合图像的特征信息输入解码器,输出生成的每帧图像对应的融合图像。人脸特征提取模型包括卷积层,解码器包括反卷积层,可以基于特征生成图像。解码器例如为VGG-19的block 5,将最后一层卷积层替换为反卷积层。融合图像即为人脸表情是目标表情的图像,各帧融合图像形成第二视频。In some embodiments, the feature information of the fused image corresponding to each frame of image is input to the decoder, and the generated fused image corresponding to each frame of image is output. The facial feature extraction model includes convolutional layers, and the decoder includes deconvolutional layers that can generate images based on features. The decoder is, for example, block 5 of VGG-19, which replaces the last convolutional layer with a deconvolutional layer. The fused image is an image whose facial expression is the target expression, and the fused images of each frame form a second video.
下面结合图3B描述本公开的一些应用例。Some application examples of the present disclosure are described below with reference to FIG. 3B.
如图3B所示,第一视频中的一帧图像,进行特征提取后得到特征图,根据特征图分别进行人脸关键点检测和表情分类,人脸关键点检测得到的各个关键点的特征信息进行PCA,降维为预设维度的信息作为关键点特征,原表情的分类信息进行one-hot编码与目标表情对应的预设分类信息进行融合,得到表情分类向量(融合表情的分类信息),进而将人脸的特征图,表情分类向量和关键点特征进行融合,得到融合图像的特征信息,将融合图像的特征信息进行特征解码,得到目标表情的人脸图像。As shown in Figure 3B, for a frame of the image in the first video, a feature map is obtained after feature extraction. Face key point detection and expression classification are performed based on the feature map. The feature information of each key point obtained by face key point detection is PCA is performed, and the dimensionality is reduced to the information of preset dimensions as key point features. The classification information of the original expression is one-hot encoded and fused with the preset classification information corresponding to the target expression to obtain the expression classification vector (the classification information of the fused expression). Then, the feature map of the face, the expression classification vector and the key point features are fused to obtain the feature information of the fused image, and the feature information of the fused image is decoded to obtain the face image of the target expression.
The solution of the above embodiments extracts, for each frame of the first video, the feature information of the image, the feature information of the face key points, and the classification information of the original expression, fuses the extracted information with the preset classification information corresponding to the target expression to obtain the feature information of the fused image for that frame, and then generates the fused image for each frame from that feature information; all the fused images form the second video in which the facial expression is the target expression. Extracting the face key-point features and using them in the fusion makes the expressions in the fused images more realistic and fluid, and fusing in the preset classification information of the target expression directly realizes generation of the target expression. The method is compatible with the facial movements and mouth shape of the character in the original images, does not affect the character's mouth shape, head movements, and so on, and does not reduce the sharpness of the original images, so the generated video is stable, clear, and smooth.
图3C为本公开表情生成模型的训练方法一些实施例的流程图。表情生成模型能够根据输入的人脸表情是原表情的第一视频和目标表情对应的预设分类信息,输出得到人脸表情是目标表情的第二视频。Figure 3C is a flow chart of some embodiments of the training method of the expression generation model of the present disclosure. The expression generation model can output a second video in which the facial expression is the target expression based on the input first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression.
如图3C所示,该实施例的方法包括:步骤S410~S450。As shown in Figure 3C, the method in this embodiment includes: steps S410 to S450.
在步骤S410中,获取由第一训练视频的各帧图像与第二训练视频的各帧图像组成的训练对。In step S410, a training pair consisting of each frame image of the first training video and each frame image of the second training video is obtained.
第一训练视频为人脸表情为原表情的视频,第二训练视频为人脸表情为目标表情的视频,第一训练视频的各帧图像与第二训练视频的各帧图像并不需要一一对应。对原表情的分类信息和目标表情的分类信息进行标注。The first training video is a video in which the facial expression is the original expression, and the second training video is a video in which the facial expression is the target expression. Each frame image of the first training video does not need to correspond to each frame image of the second training video. Label the classification information of the original expression and the classification information of the target expression.
Using a large number of videos of people speaking with different expressions as training data, deep learning is used to perform cross-domain transfer learning (domain transfer learning) to learn a first generator that converts one expression state into another, and the expression generation result is then fused with the digital human as a whole.
在步骤S420中,将第一训练视频的各帧图像输入第一生成器,获取第一训练视频的各帧图像的特征信息、人脸关键点的特征信息和原表情的分类信息,将第一训练视频的各帧图像的特征信息、人脸关键点的特征信息、原表情的分类信息和目标表情对应的预设分类信息进行融合,得到第一训练视频对应的各帧融合图像的特征信息,根据第一训练视频对应的各帧融合图像的特征信息,得到第一生成器输出的第一训练视频对应的各帧融合图像。In step S420, each frame image of the first training video is input into the first generator, the feature information of each frame image of the first training video, the feature information of the facial key points and the classification information of the original expression are obtained, and the first The characteristic information of each frame of the training video, the characteristic information of the key points of the face, the classification information of the original expression and the preset classification information corresponding to the target expression are fused to obtain the characteristic information of each frame of the fused image corresponding to the first training video, According to the feature information of each frame fusion image corresponding to the first training video, each frame fusion image corresponding to the first training video output by the first generator is obtained.
第一生成器训练完成后作为表情生成模型使用。在一些实施例中,将第一训练视频中各帧图像输入第一生成器中的第三人脸特征提取模型,得到输出的各帧图像的特征信息;将各帧图像的特征信息输入第一生成器中第一人脸关键点检测模型,得到各帧图像的人脸关键点的坐标信息;采用主成分分析法对所有人脸关键点的坐标信息进行降维,得到预设维度的第一信息,作为第一训练视频的各帧图像的人脸关键点的特征信息;将第一训练视频中各帧图像的特征信息输入第一生成器中的第三表情分类模型,得到第一训练视频中各帧图像的原表情的分类信息。After the first generator is trained, it is used as an expression generation model. In some embodiments, each frame image in the first training video is input into the third facial feature extraction model in the first generator to obtain the output feature information of each frame image; the feature information of each frame image is input into the first The first facial key point detection model in the generator obtains the coordinate information of the facial key points in each frame image; the principal component analysis method is used to reduce the dimensionality of the coordinate information of all facial key points, and the first facial key point of the preset dimension is obtained. information, as the feature information of the facial key points of each frame image in the first training video; input the feature information of each frame image in the first training video into the third expression classification model in the first generator to obtain the first training video Classification information of the original expression of each frame of image.
将人脸关键点的坐标信息进行主成分分析(PCA),关键点坐标信息降至6维(6维是通过大量实验得到的最好效果)。PCA不涉及训练参数(PCA的特征提取以及前后特征维度对应关系不随训练改变,梯度反向传递时,仅通过初始PCA得到的特征对应关系,向前面的参数传递梯度即可)。 The coordinate information of the key points on the face is subjected to principal component analysis (PCA), and the coordinate information of the key points is reduced to 6 dimensions (6 dimensions is the best result obtained through a large number of experiments). PCA does not involve training parameters (the feature extraction of PCA and the correspondence between front and rear feature dimensions do not change with training. When the gradient is transferred in reverse, only the feature correspondence obtained by the initial PCA is used to transfer the gradient to the previous parameters).
在一些实施例中,将第一训练视频的各帧图像的原表情的分类信息与目标表情对应的预设分类信息进行加和取平均,得到第一训练视频的各帧图像对应的融合表情的分类信息;将与待训练的第一权重相乘后的第一训练视频的各帧图像的人脸关键点的特征信息,与待训练的第二权重相乘后的第一训练视频的各帧图像的特征信息,以及第一训练视频的各帧图像对应的融合表情的分类信息进行拼接,得到第一训练视频对应的各帧融合图像的特征信息。In some embodiments, the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression are added and averaged to obtain the fusion expression corresponding to each frame image of the first training video. Classification information; the feature information of the face key points of each frame of the first training video multiplied by the first weight to be trained, and each frame of the first training video multiplied by the second weight to be trained The feature information of the image and the classification information of the fused expression corresponding to each frame of the first training video are spliced to obtain the feature information of each frame of the fused image corresponding to the first training video.
第一生成器中包括第一特征融合模型,第一权重和第二权重为第一特征融合模型中待训练的参数。上述特征提取和特征融合的过程可以参考前述实施例。The first generator includes a first feature fusion model, and the first weight and the second weight are parameters to be trained in the first feature fusion model. For the above feature extraction and feature fusion processes, reference can be made to the foregoing embodiments.
第一生成器包括第一编码器和第一解码器,第一编码器包括:第三人脸特征提取模型,第一人脸关键点检测模型,第三表情分类模型,第一特征融合模型,将第一训练视频对应的各帧融合图像的特征信息输入第一解码器得到生成的第一训练视频对应的各帧融合图像。The first generator includes a first encoder and a first decoder. The first encoder includes: a third facial feature extraction model, a first facial key point detection model, a third expression classification model, and a first feature fusion model, The characteristic information of each frame of the fused image corresponding to the first training video is input into the first decoder to obtain the generated each frame of the fused image corresponding to the first training video.
在步骤S430中,将第二训练视频各帧图像输入第二生成器,获取第二训练视频的各帧图像的特征信息、人脸关键点的特征信息和目标表情的分类信息,将第二训练视频的各帧图像的特征信息、人脸关键点的特征信息、目标表情的分类信息和原表情对应的预设分类信息进行融合,得到第二训练视频对应的各帧融合图像的特征信息,根据第二训练视频对应的各帧融合图像的特征信息,得到第二生成器输出的第二训练视频对应的各帧融合图像。In step S430, each frame image of the second training video is input into the second generator, the feature information of each frame image of the second training video, the feature information of the face key points, and the classification information of the target expression are obtained, and the second training video is The characteristic information of each frame image of the video, the characteristic information of the key points of the face, the classification information of the target expression and the preset classification information corresponding to the original expression are fused to obtain the characteristic information of each frame of the fused image corresponding to the second training video. According to The feature information of each frame of the fused image corresponding to the second training video is obtained to obtain the fused image of each frame corresponding to the second training video output by the second generator.
第二生成器与第一生成器在结构上是相同或相似的,第二生成器的训练目标是基于第二训练视频,生成与第一训练视频表情相同的视频。The second generator is structurally identical or similar to the first generator, and the training goal of the second generator is to generate a video with the same expression as the first training video based on the second training video.
在一些实施例中,将第二训练视频中各帧图像输入第二生成器中的第四人脸特征提取模型,得到输出的各帧图像的特征信息;将各帧图像的特征信息输入第二生成器中第二人脸关键点检测模型,得到各帧图像的人脸关键点的坐标信息;采用主成分分析法对所有人脸关键点的坐标信息进行降维,得到预设维度的第二信息,作为第二训练视频的各帧图像的人脸关键点的特征信息。将第二训练视频中各帧图像的特征信息输入第二生成器中的第四表情分类模型,得到第二训练视频中各帧图像的目标表情的分类信息。In some embodiments, input each frame image in the second training video into the fourth facial feature extraction model in the second generator to obtain the output feature information of each frame image; input the feature information of each frame image into the second The second face key point detection model in the generator obtains the coordinate information of the face key points of each frame image; the principal component analysis method is used to reduce the dimensionality of the coordinate information of all face key points, and the second face key point of the preset dimension is obtained. Information, as the feature information of the face key points of each frame image of the second training video. The characteristic information of each frame image in the second training video is input into the fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image in the second training video.
第二训练视频的各帧图像的人脸关键点的特征信息与第一训练视频的各帧图像的人脸关键点的特征信息的维度是相同的,例如,6维。The feature information of the face key points in each frame image of the second training video has the same dimension as the feature information of the face key points in each frame image of the first training video, for example, 6 dimensions.
In some embodiments, the classification information of the target expression of each frame of the second training video and the preset classification information corresponding to the original expression are added and averaged to obtain the classification information of the fused expression for each frame of the second training video; the face key-point feature information of each frame of the second training video multiplied by a third weight to be trained, the feature information of each frame of the second training video multiplied by a fourth weight to be trained, and the classification information of the fused expression for each frame of the second training video are concatenated to obtain the feature information of each fused frame corresponding to the second training video.
The preset classification information corresponding to the original expression does not need to be produced by a model; it can be encoded directly with the preset encoding rule. The second generator includes a second feature fusion model, and the third weight and the fourth weight are the parameters of the second feature fusion model to be trained. The feature extraction and feature fusion processes above can refer to the foregoing embodiments and are not repeated here.
第二生成器包括第二编码器和第二解码器,第二编码器包括:第四人脸特征提取模型,第二人脸关键点检测模型,第四表情分类模型,第二特征融合模型,将第二训练视频对应的各帧融合图像的特征信息输入第二解码器得到生成的第二训练视频对应的各帧融合图像。The second generator includes a second encoder and a second decoder. The second encoder includes: a fourth facial feature extraction model, a second facial key point detection model, a fourth expression classification model, and a second feature fusion model, The feature information of each frame of the fused image corresponding to the second training video is input into the second decoder to obtain the generated each frame of the fused image corresponding to the second training video.
在步骤S440中,根据第一训练视频对应的各帧融合图像、第二训练视频对应的各帧融合图像,确定对抗损失和循环一致损失。In step S440, the adversarial loss and the cycle-consistent loss are determined based on the fused images of each frame corresponding to the first training video and the fused images of each frame corresponding to the second training video.
基于生成对抗学习和跨域迁移学习进行端到端的训练,能够提高模型的准确度,并且提高训练效率。End-to-end training based on generative adversarial learning and cross-domain transfer learning can improve the accuracy of the model and improve training efficiency.
In some embodiments, the adversarial losses are determined as follows: the fused frames corresponding to the first training video are input into a first discriminator to obtain a first discrimination result for each of those fused frames; the fused frames corresponding to the second training video are input into a second discriminator to obtain a second discrimination result for each of those fused frames; a first adversarial loss is determined from the first discrimination results of the fused frames corresponding to the first training video, and a second adversarial loss is determined from the second discrimination results of the fused frames corresponding to the second training video.
进一步,在一些实施例中,将第一训练视频对应的各帧融合图像输入第一判别器中第一人脸特征提取模型,得到输出的第一训练视频对应的各帧融合图像的特征信息;将第一训练视频对应的各帧融合图像的特征信息输入第一判别器中的第一表情分类模型,得到第一训练视频对应的各帧融合图像的表情的分类信息,作为第一判别结果;将第二训练视频对应的各帧融合图像输入第二判别器中第二人脸特征提取模型,得到输出的第二训练视频对应的各帧融合图像的特征信息;将第二训练视频对应的各帧融合图像的特征信息输入第二判别器中的第二表情分类模型,得到第二训练视频对应的各帧融合图像的表情的分类信息,作为第二判别结果。 Further, in some embodiments, each frame of the fused image corresponding to the first training video is input into the first face feature extraction model in the first discriminator, and the feature information of each frame of the fused image corresponding to the output first training video is obtained; Input the feature information of each frame of the fused image corresponding to the first training video into the first expression classification model in the first discriminator, and obtain the classification information of the expression of each frame of the fused image corresponding to the first training video as the first discrimination result; Input each frame of the fused image corresponding to the second training video into the second facial feature extraction model in the second discriminator to obtain the feature information of each frame of the fused image corresponding to the output second training video; input each frame of the fused image corresponding to the second training video. The feature information of the frame fusion image is input into the second expression classification model in the second discriminator, and the expression classification information of each frame fusion image corresponding to the second training video is obtained as the second discrimination result.
During training the overall model comprises two generator-discriminator pairs. The first discriminator and the second discriminator have the same or similar structures, each including a face feature extraction model and an expression classification model. The first and second face feature extraction models have the same or similar structure as the third and fourth face feature extraction models, and the first and second expression classification models have the same or similar structure as the third and fourth expression classification models.
For example, the data of the first video is denoted X = {x_i} and the data of the second video is denoted Y = {y_i}. The first generator G realizes X→Y and is trained so that G(X) is as close as possible to Y; the first discriminator D_Y judges whether the fused frames corresponding to the first training video are real or fake. The first adversarial loss can be expressed in the standard GAN form:

L_GAN(G, D_Y, X, Y) = E_{y~p_data(y)}[log D_Y(y)] + E_{x~p_data(x)}[log(1 - D_Y(G(x)))]    (1)
The second generator F realizes Y→X and is trained so that F(Y) is as close as possible to X; the second discriminator D_X judges whether the fused frames corresponding to the second training video are real or fake. The second adversarial loss can be expressed in the same form:

L_GAN(F, D_X, Y, X) = E_{x~p_data(x)}[log D_X(x)] + E_{y~p_data(y)}[log(1 - D_X(F(y)))]    (2)
In some embodiments, the cycle consistency losses are determined as follows: the fused frames corresponding to the first training video are input into the second generator to generate reconstructed frames of the first training video, and the fused frames corresponding to the second training video are input into the first generator to generate reconstructed frames of the second training video; the cycle consistency loss is then determined from the difference between the reconstructed frames of the first training video and the original frames of the first training video, and the difference between the reconstructed frames of the second training video and the original frames of the second training video.
为了进一步提高模型的准确率,将第一生成器生成的图像输入第二生成器,得到第一训练视频的各帧重构图像,期望第二生成器生成的第一训练视频的各帧重构图像与第一训练视频的各帧图像尽量一致,即F(G(x))≈x。将第二生成器生成的图像输入第一生成器,得到第二训练视频的各帧重构图像,期望第一生成器生成的第二训练视频的各帧重构图像与第二训练视频的各帧图像尽量一致,即G(F(y))≈y。In order to further improve the accuracy of the model, the images generated by the first generator are input into the second generator to obtain reconstructed images of each frame of the first training video. It is expected that the reconstructed images of each frame of the first training video generated by the second generator The image should be as consistent as possible with each frame of the first training video, that is, F(G(x))≈x. Input the image generated by the second generator into the first generator to obtain the reconstructed image of each frame of the second training video. It is expected that the reconstructed image of each frame of the second training video generated by the first generator is consistent with each frame of the second training video. The frame images should be as consistent as possible, that is, G(F(y))≈y.
The difference between the reconstructed frames of the first training video and the frames of the first training video can be determined as follows: for each reconstructed frame of the first training video and the frame of the first training video corresponding to it, compute the distance (for example the Euclidean distance) between the representation vectors of the pixels at each identical position in the two images, and sum all of these distances. The difference between the reconstructed frames of the second training video and the frames of the second training video can be determined in the same way: for each reconstructed frame of the second training video and the corresponding frame of the second training video, compute the distance (for example the Euclidean distance) between the representation vectors of the pixels at each identical position, and sum all of these distances.
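A PyTorch sketch of this cycle-consistency term, using a per-pixel Euclidean distance summed over all positions as described; F_gen denotes the second generator F.

import torch

def pixelwise_distance_sum(a, b):
    # Euclidean distance between the channel vectors of corresponding pixels, summed
    return torch.linalg.vector_norm(a - b, dim=1).sum()

def cycle_consistency_loss(G, F_gen, real_x, real_y):
    recon_x = F_gen(G(real_x))   # x -> G -> F should come back to x
    recon_y = G(F_gen(real_y))   # y -> F -> G should come back to y
    return pixelwise_distance_sum(recon_x, real_x) + pixelwise_distance_sum(recon_y, real_y)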
在步骤S450中,根据对抗损失和循环一致损失,对第一生成器和第二生成器进行训练。In step S450, the first generator and the second generator are trained according to the adversarial loss and the cycle-consistent loss.
The first adversarial loss, the second adversarial loss, and the cycle consistency loss can be weighted and summed to obtain a total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss can be determined by the following formula:

L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ·L_cyc(G, F)    (3)

where L_cyc(G, F) denotes the cycle consistency loss and λ is a weight that can be obtained through training.
To further improve the accuracy of the model and keep the output video stable and continuous, a loss term based on the pixel difference between adjacent video frames is added during training. In some embodiments, a pixel-to-pixel loss is determined from the pixel differences between every two adjacent fused frames corresponding to the first training video and the pixel differences between every two adjacent fused frames corresponding to the second training video, and the first generator and the second generator are trained according to the adversarial losses, the cycle consistency loss, and the pixel-to-pixel loss.
Further, in some embodiments, for each position in every pair of adjacent fused frames corresponding to the first training video, the distance between the representation vectors of the two pixels at that position is computed and the distances over all positions are summed to obtain a first loss; for each position in every pair of adjacent fused frames corresponding to the second training video, the distance between the representation vectors of the two pixels at that position is computed and the distances over all positions are summed to obtain a second loss; the first loss and the second loss are added to obtain the pixel-to-pixel loss. The pixel-to-pixel loss keeps adjacent frames of the generated video from changing too much.
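A PyTorch sketch of the pixel-to-pixel term for one generator's output sequence; applying the same function to both generators' outputs gives the first and second losses.

import torch

def p2p_loss(generated_frames):
    """generated_frames: tensor (T, C, H, W) of consecutive generator outputs."""
    diff = generated_frames[1:] - generated_frames[:-1]    # adjacent-frame differences
    return torch.linalg.vector_norm(diff, dim=1).sum()     # per-pixel distance, summed over positions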
In some embodiments, the adversarial losses, the cycle consistency loss, and the pixel-to-pixel loss are weighted and summed to obtain the total loss, and the first generator and the second generator are trained according to the total loss. For example, the total loss can be determined by the following formula:

L = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ_1·L_cyc(G, F) + λ_2·L_P2P(G(x_i), G(x_{i+1})) + λ_3·L_P2P(F(y_j), F(y_{j+1}))    (4)

where λ_1, λ_2, and λ_3 are weights that can be obtained through training, L_P2P(G(x_i), G(x_{i+1})) denotes the first loss, and L_P2P(F(y_j), F(y_{j+1})) denotes the second loss.
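Read together, formulas (1) through (4) assemble into a single scalar objective; a trivial sketch, assuming each term has already been computed as above:

def total_loss(adv_g, adv_f, cyc, p2p_g, p2p_f, lambda1, lambda2, lambda3):
    # formula (4): adversarial terms plus weighted cycle-consistency and pixel-to-pixel terms
    return adv_g + adv_f + lambda1 * cyc + lambda2 * p2p_g + lambda3 * p2p_f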
As shown in Figure 3D, each part of the model can be pre-trained before the end-to-end training. For example, a face recognition model is first pre-trained on a large amount of open-source face recognition data, and the part of it before the output feature map is taken as the face feature extraction model (this choice is not unique; taking VGG-19 as an example, the part before block 5 can be selected, which outputs an 8×8×512-dimensional feature map). The face feature extraction model and its parameters are then fixed, and the network splits into two branches, the face key-point detection model and the expression classification model, which are fine-tuned on a face key-point detection dataset and expression classification data respectively, training only the parameters of these two branches. The face key-point detection model is not unique: any convolutional-network-based model that yields accurate key points can be plugged into this scheme, and the expression classification model is a single-label classification task based on a convolutional network. After pre-training, the end-to-end training process of the foregoing embodiments can be carried out, which improves training efficiency.
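A PyTorch sketch of this freeze-and-fine-tune step, assuming an encoder with the backbone/pool/head structure sketched earlier; the optimizer and learning rate are illustrative assumptions.

import torch

def finetune_head(encoder, head, loader, loss_fn, lr=1e-4, epochs=5):
    for p in encoder.backbone.parameters():
        p.requires_grad = False                       # face feature extractor stays fixed
    opt = torch.optim.Adam(head.parameters(), lr=lr)  # only this branch is updated
    for _ in range(epochs):
        for images, targets in loader:
            with torch.no_grad():
                flat = encoder.pool(encoder.backbone(images)).flatten(1)
            loss = loss_fn(head(flat), targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head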
The method of the above embodiments trains the overall model with the adversarial loss, the cycle-consistency loss and the pixel loss between adjacent frames of the video, which can improve the accuracy of the model; in addition, the end-to-end training process can improve efficiency and save computing resources.
The solution of the present disclosure is suitable for editing facial expressions in video. By adopting a dedicated deep learning model that integrates expression recognition, key point detection and related techniques, and by training on data, the present disclosure learns how facial key points move under different expressions, and ultimately controls the facial expression state output by the model by feeding the classification information of the target expression into the model. Because the expression exists as a style state, it can be superimposed smoothly when the character speaks or performs actions such as tilting the head or blinking, so that the finally output video of the character's facial movements looks natural and consistent. The output can have the same resolution and level of detail as the input image, and remains stable, clear and artifact-free even at 1080p or 2K resolution.
In step S250, action customization is performed.
Action customization refers to editing the character's actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene, so as to realize editing and control of the digital human's actions in the interaction scene.
In some embodiments, editing the character's actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene includes: adjusting the first human body key points of the character in an original first key frame of the first video during a first action to obtain second human body key points of the character during a second action, which serve as the character action customization information; extracting feature information of the neighborhood of each second human body key point from the original first key frame, for example with a feature extraction model such as a convolution kernel model; and inputting each second human body key point and the feature information of its neighborhood into an image generation model, which outputs a target first key frame of the character during the second action.
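A hedged sketch of this action-customization step is shown below; the patch size, the additive key point adjustment, and the `image_generator` interface are assumptions introduced for illustration, not details taken from the disclosure.

```python
import numpy as np

def extract_neighborhood_features(frame, keypoints, patch_size=16):
    """Crop a small patch around each adjusted (second) body key point.

    frame: np.ndarray of shape (H, W, 3), the original first key frame.
    keypoints: np.ndarray of shape (K, 2) holding (x, y) coordinates.
    Returns an array of shape (K, patch_size, patch_size, 3).
    """
    h, w = frame.shape[:2]
    half = patch_size // 2
    patches = []
    for x, y in keypoints.astype(int):
        x0, x1 = np.clip([x - half, x + half], 0, w)
        y0, y1 = np.clip([y - half, y + half], 0, h)
        patch = np.zeros((patch_size, patch_size, 3), dtype=frame.dtype)
        patch[: y1 - y0, : x1 - x0] = frame[y0:y1, x0:x1]
        patches.append(patch)
    return np.stack(patches)

def customize_action(frame, first_keypoints, adjustment, image_generator):
    # The second key points are obtained by adjusting the first key points.
    second_keypoints = first_keypoints + adjustment
    features = extract_neighborhood_features(frame, second_keypoints)
    # image_generator is the trained image generation model; it maps the key
    # points plus neighborhood features to the target first key frame.
    return image_generator(second_keypoints, features)
```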
The first human body key points include human body contour feature points of the character during the first action, such as the 14 pairs of white dots shown in Figure 4A; the second human body key points include human body contour feature points of the character during the second action, such as the 14 pairs of white dots shown in Figure 4B.
Compared with editing character actions using human skeleton feature points, editing character actions using human body contour feature points produces more accurate character actions that are less prone to deformation and distortion, improving the quality of the generated images.
Before the human body contour feature points of the character during the first action are adjusted, they are first extracted. Extracting the human body contour feature points of the character during the first action includes, for example: extracting the contour line of the character using a semantic segmentation network model; extracting a plurality of key points on the character using an object detection network model, such as the black dots shown in Figure 4C; connecting the plurality of key points according to the structural information of the character to determine a plurality of key connecting lines, such as the white straight lines shown in Figure 4C; and determining, from the intersections of the perpendiculars of the plurality of key connecting lines with the contour line, a plurality of paired human body contour feature points of the character during the first action.
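The sketch below illustrates one plausible way to realize the perpendicular-intersection construction for a single key connecting line: candidate contour points lying near the perpendicular through the line's midpoint are split into the two sides of the line, and the closest point on each side is returned as a pair. The tolerance, the use of the midpoint, and the nearest-point selection are assumptions made for illustration.

```python
import numpy as np

def contour_feature_points(contour, joint_a, joint_b, tol=1.5):
    """Paired contour feature points for one key connecting line.

    contour: np.ndarray of shape (N, 2), boundary points of the person mask
             produced by a semantic segmentation model.
    joint_a, joint_b: endpoints (x, y) of a key connecting line obtained by
             joining detected key points according to the body structure.
    Returns the two contour points closest to the perpendicular through the
    midpoint of the connecting line, one on each side, or None if not found.
    """
    a, b = np.asarray(joint_a, float), np.asarray(joint_b, float)
    mid = (a + b) / 2.0
    direction = (b - a) / (np.linalg.norm(b - a) + 1e-8)
    # Signed offset of each contour point along the connecting line, measured
    # from the midpoint: offsets near zero lie close to the perpendicular.
    along = (contour - mid) @ direction
    near_perp = contour[np.abs(along) < tol]
    if len(near_perp) < 2:
        return None
    # Split the candidates into the two sides of the connecting line and keep
    # the closest point on each side as the paired contour feature points.
    normal = np.array([-direction[1], direction[0]])
    side = (near_perp - mid) @ normal
    left, right = near_perp[side > 0], near_perp[side < 0]
    if len(left) == 0 or len(right) == 0:
        return None
    pick = lambda pts: pts[np.argmin(np.linalg.norm(pts - mid, axis=1))]
    return pick(left), pick(right)
```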
The method for obtaining the image generation model includes: taking a training video frame and the human body key points of the character in the training video frame as a pair of training data; taking the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of an image generation network; taking the training video frame in the training data as supervision information for the output of the image generation network; and training the image generation network to obtain the image generation model. Specifically, the gap between the video frame output by the image generation network based on the input data and the training video frame is used as the loss function, and the parameters of the image generation network are updated iteratively according to the loss determined by the loss function until the loss satisfies a certain condition; training is then complete, the video frames output by the image generation network are very close to the training video frames, and the trained image generation network is used as the image generation model. The image generation network can be any of a large class of models, including but not limited to convolutional neural networks, recurrent networks based on optical flow, and generative adversarial networks. If the image generation network is a generative adversarial network, the total loss function also includes the discrimination loss function of an image discrimination network.
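A minimal supervised training loop consistent with this description is sketched below, assuming PyTorch, an L1 gap as the loss function, and a dataloader yielding (key points, neighborhood features, training frame) triples; a GAN variant would add a discriminator and its discrimination loss.

```python
import torch
import torch.nn as nn

def train_image_generator(generator, dataloader, epochs=10, lr=2e-4, tol=1e-3):
    """Train the image generation network on paired training data.

    `generator` is any network mapping (key points, neighborhood features) to
    a frame; the training video frame supervises its output. The L1 gap is an
    assumed choice of loss; the stopping tolerance is likewise illustrative.
    """
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(generator.parameters(), lr=lr)
    for epoch in range(epochs):
        epoch_loss = 0.0
        for keypoints, features, target_frame in dataloader:
            optimizer.zero_grad()
            output_frame = generator(keypoints, features)
            loss = criterion(output_frame, target_frame)  # gap to training frame
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # Stop once the loss satisfies the condition (here: small average gap).
        if epoch_loss / max(len(dataloader), 1) < tol:
            break
    return generator
```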
In step S260, rendering and output are performed.
The character image is modeled using the material results processed in steps S220 to S250. Different rendering technologies can be selected according to the application scenario and combined with artificial intelligence technologies such as intelligent dialogue, speech recognition, speech synthesis and action interaction, to form a complete digital human video that can interact with the scene (i.e., the second video), which is then output.
In the above embodiments, the characters in the video are edited according to the character customization information corresponding to the interaction scene, and a digital human video matching the interaction scene is generated through character editing, for example, a digital human image, digital human expressions and digital human actions matching the interaction scene. With the method of the embodiments of the present disclosure, recording a single set of character videos makes it possible to quickly produce multiple sets of videos with different character styles for different scenes. Moreover, no professional engineer is required; users can adjust the character's image, expressions, actions and so on themselves according to the needs of the scene.
Figure 5 shows a schematic structural diagram of a digital human generation apparatus according to some embodiments of the present disclosure. As shown in Figure 5, the digital human generation apparatus 500 of this embodiment includes units 510 to 530.
The acquisition unit 510 is configured to acquire the first video; see step S220 for details.
The customization unit 520 is configured to edit the characters in each frame image of the first video according to the character customization information corresponding to the interaction scene; see steps S230 to S250 for details.
The customization unit 520 includes, for example, an image customization unit 521, an expression customization unit 522 and an action customization unit 523. The image customization unit 521 is configured to edit the character image in each frame image of the first video according to the character image customization information corresponding to the interaction scene; see step S230 for details. The expression customization unit 522 is configured to edit the character expressions in each frame image of the first video according to the character expression customization information corresponding to the interaction scene; see step S240 for details. The action customization unit 523 is configured to edit the character actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene; see step S250 for details.
The output unit 530 is configured to output the second video according to each frame image in the processed first video; see step S260 for details.
Figure 6 shows a schematic structural diagram of a digital human generation apparatus according to other embodiments of the present disclosure. As shown in Figure 6, the digital human generation apparatus 600 of this embodiment includes a memory 610 and a processor 620 coupled to the memory 610. The processor 620 is configured to execute the digital human generation method in any of the foregoing embodiments based on instructions stored in the memory 610.
The memory 610 may include, for example, system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, applications, a boot loader and other programs.
The processor 620 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, or as discrete hardware components such as discrete gates or transistors.
The apparatus 600 may also include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640 and 650, as well as the memory 610 and the processor 620, may be connected via a bus 660, for example. The input/output interface 630 provides a connection interface for input/output devices such as a display, mouse, keyboard and touch screen. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB drives. The bus 660 may use any of a variety of bus structures, including but not limited to an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus and a Peripheral Component Interconnect (PCI) bus.
Some embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the digital human generation method in any of the foregoing embodiments are implemented.
Those skilled in the art will appreciate that embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more non-transitory computer-readable storage media (including but not limited to disk memory, CD-ROM and optical storage) containing computer program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Claims (30)

  1. A digital human generation method, comprising:
    acquiring a first video;
    editing the characters in each frame image of the first video according to character customization information corresponding to an interaction scene;
    outputting a second video according to each frame image in the processed first video.
  2. The method according to claim 1, wherein the first video is obtained by preprocessing an original video, and the preprocessing includes one or more of resolution adjustment, inter-frame smoothing, and frame rate adjustment.
  3. The method according to claim 2, wherein the resolution adjustment includes:
    if the resolution of the original video is higher than a required preset resolution, downsampling the original video to the preset resolution to obtain the first video at the preset resolution;
    if the resolution of the original video is lower than the required preset resolution, processing the original video with a super-resolution model to obtain the first video at the preset resolution, the super-resolution model being used to raise the resolution of an input video to the preset resolution.
  4. The method according to claim 3, wherein the super-resolution model is obtained by training a neural network; during training, a first video frame from a high-definition video is downsampled to the preset resolution to obtain a second video frame, the second video frame is used as the input of the neural network, the first video frame is used as supervision information for the output of the neural network, and the neural network is trained to obtain the super-resolution model.
  5. The method according to claim 2, wherein the frame rate adjustment includes:
    if the frame rate of the original video is higher than a required preset frame rate, extracting frames from the original video according to the ratio between the frame rate of the original video and the preset frame rate to obtain the first video at the preset frame rate;
    if the frame rate of the original video is lower than the required preset frame rate, interpolating frames into the original video up to a first frame rate using a video frame interpolation model, the first frame rate being the least common multiple of the frame rate of the original video before interpolation and the preset frame rate, and extracting frames from the interpolated original video according to the ratio between the first frame rate and the preset frame rate to obtain the first video at the preset frame rate, the video frame interpolation model being used to generate a transition frame between any two frame images.
  6. The method according to claim 5, wherein the video frame interpolation model is obtained by training a neural network; during training, three consecutive frames of a training video frame sequence are taken as a triplet, the first frame and the third frame of the triplet are used as the input of the neural network, the second frame of the triplet is used as supervision information for the output of the neural network, and the neural network is trained to obtain the video frame interpolation model.
  7. The method according to claim 6, wherein the input of the neural network includes: visual feature information and depth information of the first frame and the third frame, and optical flow information and deformation information between the first frame and the third frame.
  8. The method according to claim 1, wherein editing the characters in each frame image of the first video according to the character customization information corresponding to the interaction scene includes one or more of the following:
    editing the character image in each frame image of the first video according to character image customization information corresponding to the interaction scene;
    editing the character expressions in each frame image of the first video according to character expression customization information corresponding to the interaction scene;
    editing the character actions in each frame image of the first video according to character action customization information corresponding to the interaction scene.
  9. The method according to claim 8, wherein editing the character image in each frame image of the first video according to the character image customization information corresponding to the interaction scene includes:
    determining character image adjustment parameters according to character image adjustments made by a user in some video frames of the first video, and editing the character image in the remaining video frames of the first video according to the character image adjustment parameters.
  10. The method according to claim 9, wherein editing the character image in the remaining video frames of the first video according to the character image adjustment parameters includes:
    locating, by key point detection, the target part of the character in the remaining video frames of the first video according to the target part of the character image adjustment in the character image adjustment parameters;
    adjusting the amplitude or position of the located target part through graphics transformation according to the amplitude information or position information of the character image adjustment in the character image adjustment parameters.
  11. The method according to claim 8, wherein
    the character expression customization information includes preset classification information corresponding to a target expression, and
    editing the character expressions in each frame image of the first video according to the character expression customization information corresponding to the interaction scene includes:
    acquiring the feature information of each frame image in the first video, the feature information of face key points, and the classification information of the original expression;
    fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression, to obtain the feature information of a fused image corresponding to each frame image;
    generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image, all the fused images forming a second video in which the facial expression is the target expression.
  12. The method according to claim 11, wherein acquiring the feature information of each frame image in the first video, the feature information of the face key points and the classification information of the original expression includes:
    inputting each frame image in the first video into a face feature extraction model to obtain the output feature information of each frame image;
    inputting the feature information of each frame image into a face key point detection model to obtain the coordinate information of the face key points of each frame image, and reducing the dimensionality of the coordinate information of all the face key points by principal component analysis to obtain information of a preset dimension as the feature information of the face key points;
    inputting the feature information of each frame image into an expression classification model to obtain the classification information of the original expression of each frame image.
  13. The method according to claim 11, wherein fusing the feature information of each frame image, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
    adding and averaging the classification information of the original expression of each frame image and the preset classification information corresponding to the target expression, to obtain the classification information of the fused expression corresponding to each frame image;
    concatenating the feature information of the face key points of each frame image multiplied by a first weight obtained through training, the feature information of each frame image multiplied by a second weight obtained through training, and the classification information of the fused expression corresponding to each frame image.
  14. The method according to claim 12, wherein generating the fused image corresponding to each frame image according to the feature information of the fused image corresponding to each frame image includes:
    inputting the feature information of the fused image corresponding to each frame image into a decoder, and outputting the generated fused image corresponding to each frame image;
    wherein the face feature extraction model includes a convolution layer, and the decoder includes a deconvolution layer.
  15. The method according to claim 11, wherein
    the first video in which the facial expression is the original expression and the preset classification information corresponding to the target expression are input into an expression generation model, which outputs the second video in which the facial expression is the target expression;
    the training method of the expression generation model includes:
    acquiring training pairs consisting of each frame image of a first training video and each frame image of a second training video;
    inputting each frame image of the first training video into a first generator; acquiring the feature information of each frame image of the first training video, the feature information of face key points and the classification information of the original expression; fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression, to obtain the feature information of each fused frame corresponding to the first training video; and obtaining, according to the feature information of each fused frame corresponding to the first training video, each fused frame corresponding to the first training video output by the first generator;
    inputting each frame image of the second training video into a second generator; acquiring the feature information of each frame image of the second training video, the feature information of face key points and the classification information of the target expression; fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression, to obtain the feature information of each fused frame corresponding to the second training video; and obtaining, according to the feature information of each fused frame corresponding to the second training video, each fused frame corresponding to the second training video output by the second generator;
    determining an adversarial loss and a cycle-consistency loss according to each fused frame corresponding to the first training video and each fused frame corresponding to the second training video;
    training the first generator and the second generator according to the adversarial loss and the cycle-consistency loss, the first generator being used as the expression generation model after training is completed.
  16. The method according to claim 15, further comprising:
    determining a pixel-to-pixel loss according to the pixel differences between each pair of adjacent fused frames corresponding to the first training video and the pixel differences between each pair of adjacent fused frames corresponding to the second training video;
    wherein training the first generator and the second generator according to the adversarial loss and the cycle-consistency loss includes:
    training the first generator and the second generator according to the adversarial loss, the cycle-consistency loss and the pixel-to-pixel loss.
  17. The method according to claim 15 or 16, wherein determining the adversarial loss according to each fused frame corresponding to the first training video and each fused frame corresponding to the second training video includes:
    inputting each fused frame corresponding to the first training video into a first discriminator to obtain a first discrimination result of each fused frame corresponding to the first training video;
    inputting each fused frame corresponding to the second training video into a second discriminator to obtain a second discrimination result of each fused frame corresponding to the second training video;
    determining a first adversarial loss according to the first discrimination result of each fused frame corresponding to the first training video, and determining a second adversarial loss according to the second discrimination result of each fused frame corresponding to the second training video.
  18. The method according to claim 17, wherein inputting each fused frame corresponding to the first training video into the first discriminator to obtain the first discrimination result of each fused frame corresponding to the first training video includes:
    inputting each fused frame corresponding to the first training video into a first face feature extraction model in the first discriminator, to obtain the output feature information of each fused frame corresponding to the first training video;
    inputting the feature information of each fused frame corresponding to the first training video into a first expression classification model in the first discriminator, to obtain the classification information of the expression of each fused frame corresponding to the first training video as the first discrimination result;
    and inputting each fused frame corresponding to the second training video into the second discriminator to obtain the second discrimination result of each fused frame corresponding to the second training video includes:
    inputting each fused frame corresponding to the second training video into a second face feature extraction model in the second discriminator, to obtain the output feature information of each fused frame corresponding to the second training video;
    inputting the feature information of each fused frame corresponding to the second training video into a second expression classification model in the second discriminator, to obtain the classification information of the expression of each fused frame corresponding to the second training video as the second discrimination result.
  19. The method according to claim 15 or 16, wherein the cycle-consistency loss is determined as follows:
    inputting each fused frame corresponding to the first training video into the second generator to generate reconstructed images of each frame of the first training video, and inputting each fused frame corresponding to the second training video into the first generator to generate reconstructed images of each frame of the second training video;
    determining the cycle-consistency loss according to the differences between the reconstructed images of each frame of the first training video and each frame image of the first training video, and the differences between the reconstructed images of each frame of the second training video and each frame image of the second training video.
  20. The method according to claim 16, wherein the pixel-to-pixel loss is determined as follows:
    for each position in each pair of adjacent fused frames corresponding to the first training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frames, and summing the distances corresponding to all positions to obtain a first loss;
    for each position in each pair of adjacent fused frames corresponding to the second training video, determining the distance between the representation vectors of the two pixels at that position in the two adjacent fused frames, and summing the distances corresponding to all positions to obtain a second loss;
    summing the first loss and the second loss to obtain the pixel-to-pixel loss.
  21. The method according to claim 15, wherein acquiring the feature information of each frame image of the first training video, the feature information of the face key points and the classification information of the original expression includes:
    inputting each frame image of the first training video into a third face feature extraction model in the first generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a first face key point detection model in the first generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all the face key points by principal component analysis to obtain first information of a preset dimension as the feature information of the face key points of each frame image of the first training video; and inputting the feature information of each frame image of the first training video into a third expression classification model in the first generator to obtain the classification information of the original expression of each frame image of the first training video;
    and acquiring the feature information of each frame image of the second training video, the feature information of the face key points and the classification information of the target expression includes:
    inputting each frame image of the second training video into a fourth face feature extraction model in the second generator to obtain the output feature information of each frame image; inputting the feature information of each frame image into a second face key point detection model in the second generator to obtain the coordinate information of the face key points of each frame image; reducing the dimensionality of the coordinate information of all the face key points by principal component analysis to obtain second information of a preset dimension as the feature information of the face key points of each frame image of the second training video; and inputting the feature information of each frame image of the second training video into a fourth expression classification model in the second generator to obtain the classification information of the target expression of each frame image of the second training video.
  22. The method according to claim 15, wherein fusing the feature information of each frame image of the first training video, the feature information of the face key points, the classification information of the original expression and the preset classification information corresponding to the target expression includes:
    adding and averaging the classification information of the original expression of each frame image of the first training video and the preset classification information corresponding to the target expression, to obtain the classification information of the fused expression corresponding to each frame image of the first training video; and concatenating the feature information of the face key points of each frame image of the first training video multiplied by a first weight to be trained, the feature information of each frame image of the first training video multiplied by a second weight to be trained, and the classification information of the fused expression corresponding to each frame image of the first training video;
    and fusing the feature information of each frame image of the second training video, the feature information of the face key points, the classification information of the target expression and the preset classification information corresponding to the original expression includes:
    adding and averaging the classification information of the target expression of each frame image of the second training video and the preset classification information corresponding to the original expression, to obtain the classification information of the fused expression corresponding to each frame image of the second training video; and concatenating the feature information of the face key points of each frame image of the second training video multiplied by a third weight to be trained, the feature information of each frame image of the second training video multiplied by a fourth weight to be trained, and the classification information of the fused expression corresponding to each frame image of the second training video.
  23. The method according to claim 16, wherein training the first generator and the second generator according to the adversarial loss, the cycle-consistency loss and the pixel-to-pixel loss includes:
    performing a weighted sum of the adversarial loss, the cycle-consistency loss and the pixel-to-pixel loss to obtain a total loss;
    training the first generator and the second generator according to the total loss.
  24. The method according to claim 8, wherein
    editing the character actions in each frame image of the first video according to the character action customization information corresponding to the interaction scene includes:
    adjusting first human body key points of the character in an original first key frame of the first video during a first action, to obtain second human body key points of the character during a second action as the character action customization information;
    extracting feature information of the neighborhood of each second human body key point from the original first key frame;
    inputting each second human body key point and the feature information of its neighborhood into an image generation model, and outputting a target first key frame of the character during the second action.
  25. The method according to claim 24, wherein the method for obtaining the image generation model includes:
    taking a training video frame and the human body key points of the character in the training video frame as a pair of training data, taking the human body key points in the training data and the feature information of their neighborhoods in the training video frame as the input of an image generation network, taking the training video frame in the training data as supervision information for the output of the image generation network, and training the image generation network to obtain the image generation model.
  26. The method according to claim 24, wherein the first human body key points include human body contour feature points of the character during the first action, and the second human body key points include human body contour feature points of the character during the second action.
  27. A digital human generation apparatus, comprising:
    a memory; and a processor coupled to the memory, the processor being configured to execute the digital human generation method according to any one of claims 1-26 based on instructions stored in the memory.
  28. A digital human generation apparatus, comprising:
    an acquisition unit configured to acquire a first video;
    a customization unit configured to edit the characters in each frame image of the first video according to character customization information corresponding to an interaction scene;
    an output unit configured to output a second video according to each frame image in the processed first video.
  29. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the digital human generation method according to any one of claims 1-26.
  30. A computer program, comprising:
    instructions which, when executed by a processor, cause the processor to execute the digital human generation method according to any one of claims 1-26.
PCT/CN2023/087271 2022-05-18 2023-04-10 Digital human generation method and apparatus, and storage medium WO2023221684A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210541984.9 2022-05-18
CN202210541984.9A CN114863533A (en) 2022-05-18 2022-05-18 Digital human generation method and device and storage medium

Publications (1)

Publication Number Publication Date
WO2023221684A1 true WO2023221684A1 (en) 2023-11-23

Family

ID=82639735

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087271 WO2023221684A1 (en) 2022-05-18 2023-04-10 Digital human generation method and apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN114863533A (en)
WO (1) WO2023221684A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863533A (en) * 2022-05-18 2022-08-05 京东科技控股股份有限公司 Digital human generation method and device and storage medium
CN115665507B (en) * 2022-12-26 2023-03-21 海马云(天津)信息技术有限公司 Method, apparatus, medium, and device for generating video stream data including avatar


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349081A (en) * 2019-06-17 2019-10-18 达闼科技(北京)有限公司 Generation method, device, storage medium and the electronic equipment of image
US20210374391A1 (en) * 2020-05-28 2021-12-02 Science House LLC Systems, methods, and apparatus for enhanced cameras
CN113920230A (en) * 2021-09-15 2022-01-11 上海浦东发展银行股份有限公司 Character image video generation method and device, computer equipment and storage medium
CN114863533A (en) * 2022-05-18 2022-08-05 京东科技控股股份有限公司 Digital human generation method and device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576267A (en) * 2024-01-16 2024-02-20 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video
CN117576267B (en) * 2024-01-16 2024-04-12 广州光点信息科技股份有限公司 Digital person generation method based on LLM and ANN and application of digital person generation method in cloud video

Also Published As

Publication number Publication date
CN114863533A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
WO2023221684A1 (en) Digital human generation method and apparatus, and storage medium
CN111489287A (en) Image conversion method, image conversion device, computer equipment and storage medium
Zhou et al. An image-based visual speech animation system
JP2024500896A (en) Methods, systems and methods for generating 3D head deformation models
Zhang et al. Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video
CN112819933A (en) Data processing method and device, electronic equipment and storage medium
CN115100334B (en) Image edge tracing and image animation method, device and storage medium
JP2024503794A (en) Method, system and computer program for extracting color from two-dimensional (2D) facial images
CN115393480A (en) Speaker synthesis method, device and storage medium based on dynamic nerve texture
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
CN113395569A (en) Video generation method and device
Tous Pictonaut: movie cartoonization using 3D human pose estimation and GANs
CN115035219A (en) Expression generation method and device and expression generation model training method and device
Gowda et al. From pixels to portraits: A comprehensive survey of talking head generation techniques and applications
CN115578298A (en) Depth portrait video synthesis method based on content perception
Sun et al. Generation of virtual digital human for customer service industry
Wang et al. Expression-aware neural radiance fields for high-fidelity talking portrait synthesis
Wang et al. Flow2Flow: Audio-visual cross-modality generation for talking face videos with rhythmic head
Lin et al. Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head
Eisert et al. Hybrid human modeling: making volumetric video animatable
CN117152843B (en) Digital person action control method and system
US20240169701A1 (en) Affordance-based reposing of an object in a scene
CN118175324A (en) Multidimensional generation framework for video generation
Zhang et al. REFA: Real-time Egocentric Facial Animations for Virtual Reality
KR20230163907A (en) Systen and method for constructing converting model for cartoonizing image into character image, and image converting method using the converting model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806640

Country of ref document: EP

Kind code of ref document: A1