WO2021034443A1 - Human motion transfer for dancing video synthesis - Google Patents

Human motion transfer for dancing video synthesis

Info

Publication number
WO2021034443A1
Authority
WO
WIPO (PCT)
Prior art keywords
subject
body configuration
source
pose
computer
Prior art date
Application number
PCT/US2020/043253
Other languages
French (fr)
Inventor
Alexei A. EFROS
Tinghui Zhou
Shiry Sara GINOSAR
Caroline CHAN
Original Assignee
The Regents Of The University Of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Regents Of The University Of California
Publication of WO2021034443A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Definitions

  • the minimum foot position is found by clustering the y foot coordinates which are less than (i.e., spatially above) the median ankle position and about the same distance from the median as the maximum ankle position. The clustering is described by the set [0106] $\{\, y \;:\; y < \mathrm{med} \ \text{and}\ \big|\,(\mathrm{med} - y) - (\mathrm{max} - \mathrm{med})\,\big| < \alpha \,\}$ [0107] where med is the median ankle position, max is the maximum ankle position, and α is a scalar tolerance. [0108] Once the minimum and maximum ankle positions of each subject are found, a linear mapping between the minimum and maximum ankle positions of each video (i.e., minimum of source mapped to minimum of target, and likewise for the maximum ankle positions) may be carried out.
  • the transformation may be characterized in terms of a scale and a translation in the y direction, which are calculated for each frame.
  • the translation may be calculated according to the average of the left and right ankle y coordinates and its relative position between the maximum and minimum ankle positions in the source frame. The new transformed foot position is then the coordinate between the maximum and minimum ankle positions in the target video with the same relative (interpolated) distance.
  • Given an average ankle position $a_{source}$ in a source frame, the translation b may be calculated for that frame as the offset that maps $a_{source}$ to the correspondingly interpolated position in the target video: [0110] $b = t_{min} + \frac{a_{source} - s_{min}}{s_{max} - s_{min}}\,(t_{max} - t_{min}) - a_{source}$ [0111] where $t_{min}$ and $t_{max}$ are the minimum and maximum ankle positions in the target video, and $s_{min}$ and $s_{max}$ are the minimum and maximum ankle positions in the source video.
  • embodiments may include clustering the heights around the minimum ankle position and the maximum ankle position and finding the maximum height for each cluster for each video.
  • Embodiments may include calling these maximum heights $t_{close}$ for the maximum of the cluster near the target person's maximum ankle position, $t_{far}$ for the maximum of the cluster near the target person's minimum ankle position, and $s_{close}$ and $s_{far}$ respectively for the source.
  • the close ratio may be obtained by taking the ratio between the target’s close height and the source’s close height, and similarly for the far ratio.
  • aspects of the disclosure may operate on particularly created hardware, firmware, digital signal processors, or on a specially programmed computer including a processor operating according to programmed instructions.
  • controller or processor as used herein are intended to include microprocessors, microcomputers, Application Specific Integrated Circuits (ASICs), and dedicated hardware controllers.
  • One or more aspects of the disclosure may be embodied in computer-usable data and computer-executable instructions, such as in one or more program modules, executed by one or more computers (including monitoring modules), or other devices.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device.
  • the computer executable instructions may be stored on a computer readable storage medium such as a hard disk, optical disk, removable storage media, solid state memory, Random Access Memory (RAM), etc.
  • Computer-readable media means any media that can be accessed by a computing device.
  • computer-readable media may comprise computer storage media and communication media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A computer-implemented method can include a training phase and a transfer phase. During the training phase, a body configuration detector can receive a plurality of video frames from an original video of a target subject, each video frame including the target subject in a certain body configuration. The body configuration detector can also determine, for each of the plurality of frames, a corresponding digital representation for the target subject in the body configuration. Using corresponding (x,y) image pairs, the detector can learn a mapping that synthesizes the body configuration of the target subject with the digital representation. During the transfer phase, the body configuration detector can extract body configuration information corresponding to a source subject from a source frame to generate pose joints for the source subject and apply the mapping to the digital representation to generate an image of the target subject in a body configuration corresponding to the body configuration of the source subject.

Description

HUMAN MOTION TRANSFER FOR DANCING VIDEO SYNTHESIS CROSS REFERENCE TO RELATED APPLICATION [0001] This application claims priority to, and the benefit of, U.S. Provisional Application No. 62/889,788, filed August 21, 2019, which is incorporated herein by reference in its entirety. GOVERNMENT SUPPORT [0002] This invention was made with government support under Grant Number 1633310 awarded by the National Science Foundation. The government has certain rights in the invention. BACKGROUND [0003] Over the last two decades, there have been extensive studies dedicated to motion transfer or retargeting. Early methods typically focused on creating new content by manipulating existing video footage of subject users. For example, Video Rewrite creates videos of a subject user saying a phrase that the subject did not originally utter by finding frames where the subject user's mouth position matches the desired speech. Another approach uses optical flow as a descriptor to match different subjects performing similar actions, allowing "Do as I do" and "Do as I say" retargeting. [0004] Other approaches transfer motion in three dimensions (3D) for graphics and animation purposes. Since the retargeting problem was first proposed between animated characters, solutions have included the introduction of inverse kinematic solvers to the problem and retargeting between significantly different skeletons. More recent approaches apply deep learning techniques to retarget motion without supervised data. To mitigate the lack of supervised data, some approaches propose an elaborate multi-view system to calibrate a personalized kinematic model, obtain 3D joint estimations, and render images of a human subject performing new motions. [0005] Recent studies of motion in video have been able to learn to distinguish movement from appearance and consequently synthesize novel motions in video. Some approaches employ unsupervised adversarial training to learn this separation and generate videos of subjects performing novel motions or facial expressions. This theme is continued through subsequent work which transfers facial expressions from a source subject in a video onto a target person given in a static image. [0006] Modern approaches have shown success in generating detailed images of human subjects in novel poses. Furthermore, recent methods can synthesize such images for temporally coherent video and future prediction. Certain frameworks learn mappings between different videos and demonstrate motion transfer between faces and from poses to body, respectively. [0007] Since the recent emergence of Generative Adversarial Networks (GANs) for approximating generative models, GANs have been used for many purposes including image generation, especially because they can produce high quality images with sharp details. These advances have led to the use of conditional GANs, in which the generated output is conditioned on a structured input. In addition to specific applications or mappings, some studies have employed adversarial training to learn arbitrary image-to-image translations. Over the past few years, several frameworks, which often (but not always) use GANs, have been developed to solve such mappings. [0008] There remains a need for improved systems and methods for human motion transfer for video synthesis.
SUMMARY [0009] Implementations of the disclosed technology are generally directed to "do as I do" motion transfer techniques in which, given a source video of a person dancing or otherwise performing at least one movement, that performance can be transferred to a novel (e.g., amateur) target after only a short time (e.g., a few minutes) of the target subject performing standard moves, which may be posed as a per-frame image-to-image translation with spatio-temporal smoothing. Using pose detections as an intermediate representation between source and target, a mapping can be learned from pose images to a target subject's appearance. [0010] The disclosed approaches are generally designed for video subjects which can be found online or captured in person, although embodiments may include learning to synthesize novel motions rather than manipulating existing frames. [0011] Unlike conventional approaches, the disclosed techniques generally include motion transfer between two-dimensional (2D) video subjects where there is a lack of 3D information. As such, the disclosed approaches advantageously avoid both source-target data calibration and lifting into 3D space. [0012] Certain implementations apply a representation of motion (e.g., posed stick figures) to different target subjects to generate new motions while specializing in synthesizing detailed dance movements. [0013] Implementations of the disclosed technology generally account for video generation while preserving important details such as facial features. [0014] Certain implementations include learning a mapping from pose to target subject due to advances in image generation and substantial work on general image mapping frameworks. Due to the disclosed approaches toward motion transfer, implementations generally provide the ability to choose from and adopt such frameworks for certain purposes. BRIEF DESCRIPTION OF THE DRAWINGS [0015] FIGURE 1 illustrates multiple examples 102-112 of motion transfer from a source onto two target subjects in accordance with certain implementations of the disclosed technology. [0016] FIGURE 2 illustrates an example 200 of a correspondence between a posed stick figure 202 and a target person frame 204 in accordance with certain implementations of the disclosed technology. [0017] FIGURE 3 illustrates an example of a training phase 300 of a computer-implemented method in accordance with certain implementations of the disclosed technology. [0018] FIGURE 4 illustrates an example of a transfer phase 400 of a computer-implemented method in accordance with certain implementations of the disclosed technology. [0019] FIGURE 5 illustrates an example of a training phase 502 and a transfer phase 504 in accordance with certain implementations of the disclosed technology. [0020] FIGURE 6 illustrates an example of temporal smoothing 600 in accordance with certain implementations of the disclosed technology. [0021] FIGURE 7 illustrates an example of a face GAN setup 700 in accordance with certain implementations of the disclosed technology. [0022] FIGURE 8 illustrates multiple examples 802-816 of transfer results in accordance with certain implementations of the disclosed technology. [0023] FIGURE 9 illustrates multiple examples 902-906 of comparisons of synthesis results for different models in accordance with certain implementations of the disclosed technology.
[0024] FIGURE 10 illustrates multiple examples 1002-1006 of face image comparisons from different models in accordance with certain implementations of the disclosed technology. DETAILED DESCRIPTION [0025] Implementations of the disclosed technology are generally directed to systems and methods for transferring motion between human subjects in different videos. Given two videos – one of a target person whose appearance is to be synthesized, and the other of a source subject whose motion is to be imposed onto the target person – motion can be transferred between these subjects via an end-to-end pixel-based pipeline. This is in contrast to approaches over the last two decades which employ nearest neighbor search or retarget motion in 3D. With the disclosed framework, a variety of videos can be created, enabling untrained amateurs to spin and twirl like ballerinas, perform martial arts kicks, or dance as vibrantly as pop stars. [0026] In order to transfer motion between two video subjects in a frame-by-frame manner, a mapping is generally learned between images of the two individuals. Certain implementations may include discovering an image-to-image translation between the source and target sets, but there are no corresponding pairs of images of the two subjects performing the same motions to supervise learning this translation directly. Even if both subjects perform the same routine, it is still unlikely to have an exact frame-to-frame body-pose correspondence due to body shape and stylistic differences unique to each subject. [0027] FIGURE 1 illustrates multiple examples 102-112 of motion transfer from a source onto two target subjects in accordance with certain implementations of the disclosed technology. Each of the examples 102-112 shows a video frame of a Source Subject applied to each of two Target Subjects such that the pose of each Target Subject mirrors the pose of the Source Subject. [0028] Keypoint-based poses, which inherently encode body position but not appearance, can serve as an intermediate representation between any two subjects. Poses preserve motion signatures over time while abstracting away as much subject identity as possible. Thus, an intermediate representation may be designed to be posed stick figures such as the example 200 illustrated by FIGURE 2. The example 200 indicates a correspondence between a posed stick figure 202 and a target person frame 204 in accordance with certain implementations of the disclosed technology. From the target video, pose detections may be obtained for each frame yielding a set of (pose stick figure, target person image) corresponding pairs. With this aligned data, an image-to-image translation model can be learned between posed stick figures and images of the target person in a supervised way. [0029] As such, the model can be trained to produce personalized videos of a specific target subject. Then, to transfer motion from source to target, the posed stick figures can be inputted into the trained model to obtain images of the target subject in the same pose as the source. Two components can be added to improve the quality of the results: (1) to encourage the temporal smoothness of the generated videos, the prediction at each frame can be conditioned on that of the previous time step; and (2) to increase facial realism in the results, a specialized GAN trained to generate the target person's face can be included.
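As a concrete illustration of the training-data construction just described, the Python sketch below renders a posed stick figure from detected keypoints and pairs it with the corresponding target frame. It is a minimal sketch rather than the patented implementation: the joint connectivity, the drawing style, and the `detect_pose` callable are hypothetical placeholders standing in for whatever pretrained pose detector and rendering convention are actually used.

```python
from typing import Callable, List, Optional, Sequence, Tuple

import numpy as np
from PIL import Image, ImageDraw

# Hypothetical joint connectivity for a simple stick figure; a real pose
# detector (the detector "P" of the disclosure) defines its own keypoint order.
LIMBS: Sequence[Tuple[int, int]] = [
    (0, 1), (1, 2), (2, 3),      # head -> neck -> right arm
    (1, 4), (4, 5),              # neck -> left arm
    (1, 6), (6, 7), (7, 8),      # spine -> right leg
    (6, 9), (9, 10),             # spine -> left leg
]

def draw_pose_stick_figure(keypoints: np.ndarray,
                           size: Tuple[int, int]) -> Image.Image:
    """Plot keypoints and draw lines between connected joints (cf. FIGURE 2)."""
    canvas = Image.new("RGB", size, color=(0, 0, 0))
    draw = ImageDraw.Draw(canvas)
    for a, b in LIMBS:
        if np.all(np.isfinite(keypoints[[a, b]])):   # skip missed detections
            draw.line([tuple(keypoints[a]), tuple(keypoints[b])],
                      fill=(255, 255, 255), width=4)
    for x, y in keypoints:
        if np.isfinite(x) and np.isfinite(y):
            draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill=(255, 0, 0))
    return canvas

def build_training_pairs(frames: List[Image.Image],
                         detect_pose: Callable[[Image.Image], Optional[np.ndarray]]
                         ) -> List[Tuple[Image.Image, Image.Image]]:
    """Yield (pose stick figure x, target frame y) pairs for supervised training."""
    pairs = []
    for frame in frames:
        keypoints = detect_pose(frame)    # (num_joints, 2) x, y coordinates
        if keypoints is None:
            continue                      # drop frames with failed detections
        pairs.append((draw_pose_stick_figure(keypoints, frame.size), frame))
    return pairs
```

The resulting aligned pairs are exactly the supervision described above for learning the pose-to-image mapping of the target subject.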
[0030] The disclosed methods may produce videos where motion is transferred between a variety of video subjects without the need for expensive 3D or motion capture data. Contributions can include a learning-based pipeline for human motion transfer between videos, and the quality of the results which demonstrate complex motion transfer in realistic and detailed videos. [0031] Given a video of a source person and another of a target person, a goal may be to generate a new video of the target person enacting the same motions as the source. To accomplish this task, the pipeline may be divided into three stages: pose detection, global pose normalization, and mapping from normalized posed stick figures to the target subject. In the pose detection stage, a pretrained pose detector may be used to create posed stick figures given frames from the source video. The global pose normalization stage accounts for differences between the source and target body shapes and locations within frame. Finally, a system may be designed to learn the mapping from the normalized pose stick figures to images of the target person with adversarial training. [0032] FIGURE 3 illustrates an example of a training phase 300 of a computer-implemented method in accordance with certain implementations of the disclosed technology. At 302, a body configuration detector receives multiple video frames from an original video of a target subject, each video frame including the target subject in a certain body configuration. [0033] At 304, the body configuration detector determines, for each of the frames, a corresponding digital representation for the target subject in the body configuration. [0034] At 306, corresponding (x,y) image pairs are used to learn a mapping that synthesizes the body configuration of the target subject with the digital representation. [0035] In the example 300, the body configuration detector may also obtain multiple joint estimates for the source subject. Alternatively or in addition thereto, a discriminator may be used to distinguish between “real” image pairs and “fake” image pairs. [0036] FIGURE 4 illustrates an example of a transfer phase 400 of a computer-implemented method in accordance with certain implementations of the disclosed technology. At 402, the body configuration detector receives multiple video frames from an original video of a source subject. [0037] At 404, the body configuration detector extracts body configuration information corresponding to a source subject from a source frame to generate multiple pose joints for the source subject. [0038] At 406, the mapping is applied to the digital representation to generate an image of the target subject in a body configuration corresponding to the body configuration of the source subject. [0039] The example 400 may also include applying a normalizing process to the pose joints to transform the pose joints into joints for the source subject to create a posed stick figure. Such normalizing process may advantageously result in a better match between the body configuration of the source subject and the target subject. [0040] In certain implementations, the body configuration detector may draw a representation of the posed stick figure by plotting a plurality of keypoints and drawing lines between connected joints. [0041] In certain implementations, the example 400 may further include transforming pose keypoints of the source subject so that they appear in accordance with the body shape and proportion of the target subject. 
Such transforming may include analyzing the ankle position for each pose of each subject and using a linear mapping between the closest and furthest ankle positions. Alternatively or in addition thereto, the transforming may include calculating the scale and translation for each frame based on its corresponding body configuration detection. [0042] In certain implementations, the source subject is a first person and the target subject is a second person. Also, there may be multiple target subjects, each target subject being a different person. [0043] A system in accordance with the disclosed technology may include a video device configured to capture an original video of a target subject, a body configuration detector configured to receive from the video device multiple video frames from the original video of the target subject, each video frame including the target subject in a certain body configuration, wherein the body configuration detector is further configured to determine, for each of the plurality of frames, a corresponding digital representation for the target subject in the body configuration. The system may also include a training module configured to use corresponding (x,y) image pairs to learn a mapping that synthesizes the body configuration of the target subject with the digital representation. [0044] In certain implementations, the body configuration detector may be further configured to extract pose information corresponding to a source subject from a source frame to generate a plurality of pose joints for the source subject. In such embodiments, a transfer module may be configured to apply the mapping to the digital representation to generate an image of the target subject in a body configuration corresponding to the body configuration of the source subject. The transfer module may be further configured to transform pose keypoints of the source subject so that they appear in accordance with the body shape and proportion of the target subject. [0045] In certain implementations, a normalizing module may be configured to apply a normalizing process to the pose joints to transform the pose joints into joints for the source subject to create a posed stick figure. Such a normalizing process may advantageously result in a better match between the body configuration of the source subject and the target subject. [0046] In certain implementations, the system may include a discriminator configured to distinguish between "real" image pairs and "fake" image pairs. [0047] FIGURE 5 illustrates an example of a training phase 502 and a transfer phase 504 in accordance with certain implementations of the disclosed technology. Given frame y from an original target video, a pose detector P may be used to obtain a corresponding pose stick figure x = P(y). During the training phase 502, corresponding (x,y) pairs may be used to learn a mapping G which synthesizes images of the target person given a pose stick figure x. Through adversarial training with a discriminator D and a perceptual reconstruction loss using a pretrained VGGNet, the generated output G(x) may be optimized to resemble the ground truth target subject frame y. D attempts to distinguish between "real" image pairs (i.e., (pose stick figure x, ground truth image y)) and "fake" image pairs (i.e., (pose stick figure x, model output G(x))). [0048] During the transfer phase 504 in the example, the pose detector P extracts pose information from a source frame y′, yielding pose stick figure x′.
However, in the video the source subject likely appears bigger or smaller, and stands in a different position, than the subject in the target video. In order for the source pose to better align with the filming setup of the target, a global pose normalization Norm may be applied to transform the source's original pose x′ to be more consistent with the poses in the target video x. The normalized posed stick figure x may then be passed into the trained model G to obtain an image G(x) of the target person which corresponds with the original image of the source y′. [0049] POSE ESTIMATION AND NORMALIZATION [0050] Pose Estimation [0051] In order to create images which encode body position, a pretrained pose detector P which accurately estimates x,y joint coordinates may be used. A representation of the resulting posed stick figure may be drawn by plotting the keypoints and drawing lines between connected joints as illustrated by FIGURE 2. During training, posed stick figures of the target person may be inputted to the generator G. For transfer, P may obtain joint estimates for the source subject which are then normalized as discussed below to better match the poses of the target subject seen in training. The normalized pose coordinates may be used to create input pose stick figures for the generator G. [0052] Global Pose Normalization [0053] In different videos, subjects may have different limb proportions or stand closer to or further from the camera than one another. Therefore, when transferring motion between two subjects, it may be necessary to transform the pose keypoints of the source person so that they appear in accordance with the target person's body shape and proportion as in the Transfer section 504 of FIGURE 5. This transformation may be found by analyzing the heights and ankle positions for poses of each subject and using a linear mapping between the closest and furthest ankle positions in both videos. After gathering these statistics, the scale and translation may be calculated for each frame based on its corresponding pose detection. [0054] ADVERSARIAL TRAINING OF IMAGE-TO-IMAGE TRANSLATION [0055] The adversarial training setup of pix2pixHD may be modified to: (1) produce temporally coherent video frames; and (2) synthesize realistic face images. [0056] pix2pixHD Framework [0057] Disclosed methods may be based on the objective presented in pix2pixHD. In the original conditional GAN setup, the generator network G is engaged in a minimax game against multi-scale discriminators D = (D_1, D_2, D_3). The generator's task is to synthesize realistic images in order to fool the discriminator, which must discern "real" (i.e., ground truth) images from the "fake" images produced by the generator. These two networks are trained simultaneously and drive each other to improve, as the generator must learn to synthesize more realistic images to deceive the discriminator, which in turn learns differences between generator outputs and ground truth data. The original pix2pixHD objective takes the following form: [0058]
$\min_G \Big( \big( \max_{D} L_{GAN}(G, D) \big) + \lambda_{FM}\, L_{FM}(G, D) + \lambda_{VGG}\, L_{VGG}(G(x), y) \Big)$
[0059] where $L_{GAN}(G, D)$ is the adversarial loss: [0060]
$L_{GAN}(G, D) = \mathbb{E}_{(x,y)}[\log D(x, y)] + \mathbb{E}_{x}[\log(1 - D(x, G(x)))]$
[0061] where $L_{FM}(G, D)$ is the discriminator feature-matching loss, and $L_{VGG}(G(x), y)$ is the perceptual reconstruction loss which compares pretrained VGGNet features at different layers of the network. [0062] Temporal Smoothing [0063] To create video sequences, the single image generation setup may be modified to enforce temporal coherence between adjacent frames. FIGURE 6 illustrates an example of temporal smoothing 600 in accordance with certain implementations of the disclosed technology. Instead of generating individual frames, two consecutive frames are predicted, where the first output $G(x_{t-1})$ is conditioned on its corresponding pose stick figure $x_{t-1}$ and a zero image z (a placeholder, since there is no previously generated frame at time t − 2). The second output $G(x_t)$ is conditioned on its corresponding pose stick figure $x_t$ and the first output $G(x_{t-1})$. Consequently, the discriminator is now tasked with determining both the difference in realism and the temporal coherence between the "fake" sequence $(x_{t-1}, x_t, G(x_{t-1}), G(x_t))$ and the "real" sequence $(x_{t-1}, x_t, y_{t-1}, y_t)$. The temporal smoothing changes are reflected in the updated GAN objective: [0064] $L_{smooth}(G, D) = \mathbb{E}_{(x,y)}[\log D(x_{t-1}, x_t, y_{t-1}, y_t)] + \mathbb{E}_{x}[\log(1 - D(x_{t-1}, x_t, G(x_{t-1}), G(x_t)))]$ [0065] Face GAN [0066] In certain implementations, a specialized GAN setup designed to add more detail and realism to the face region may be added. FIGURE 7 illustrates an example of a face GAN setup 700 in accordance with certain implementations of the disclosed technology. It may be shown that the face GAN produces convincing facial features and improves upon the results of the full image GAN in ablation studies, for example. [0067] After generating the full image of the scene with the main generator G, the method may input a smaller section of the image centered around the face, $G(x)_F$, and the input pose stick figure sectioned in the same fashion, $x_F$, to another generator $G_f$ which outputs a residual $r = G_f(x_F, G(x)_F)$. The final output may be the addition of the residual to the original face region, $r + G(x)_F$, and this change may be reflected in the relevant region of the full image. A discriminator $D_f$ may then attempt to discern the "real" face pairs $(x_F, y_F)$ (face region of the input pose stick figure, face region of the ground truth target person image) from the "fake" face pairs $(x_F, r + G(x)_F)$, similarly to the original pix2pix objective: [0068]
$L_{face}(G_f, D_f) = \mathbb{E}_{(x_F, y_F)}[\log D_f(x_F, y_F)] + \mathbb{E}_{x_F}[\log(1 - D_f(x_F, r + G(x)_F))]$
[0069] where $x_F$ is the face region of the original pose stick figure x and $y_F$ is the face region of the ground truth target person image y. Similarly to the full image, the technique may include adding a perceptual reconstruction loss comparing the final face $r + G(x)_F$ to the ground truth target person's face $y_F$. [0070] Full Objective [0071] Implementations may employ training in stages where the full image GAN is optimized separately from the specialized face GAN. Certain embodiments may include first training the main generator and discriminator (G, D), during which the full objective is: [0072]
$\min_G \Big( \big( \max_{D} L_{smooth}(G, D) \big) + \lambda_{FM}\, L_{FM}(G, D) + \lambda_{VGG}\, L_{VGG}(G(x), y) \Big)$
[0073] After this stage, the full image generator and discriminator weights are frozen and the face GAN may be optimized with the full objective: [0074]
$\min_{G_f} \Big( \big( \max_{D_f} L_{face}(G_f, D_f) \big) + \lambda_{VGG}\, L_{VGG}(r + G(x)_F,\, y_F) \Big)$
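Before turning to the example implementation, the following PyTorch sketch illustrates the temporal smoothing setup described above: the first frame is generated conditioned on a zero image, the second frame is conditioned on the first output, and the discriminator sees the $(x_{t-1}, x_t, \cdot, \cdot)$ tuples from $L_{smooth}$. It is a simplified, single-discriminator, cross-entropy variant written for illustration only; the names `generator` and `discriminator`, the channel-wise concatenation of the conditioning inputs, and the loss form are assumptions, and the full objective would additionally include the feature-matching and perceptual terms and the multi-scale discriminators discussed above.

```python
import torch
import torch.nn.functional as F

def temporal_smoothing_losses(generator, discriminator,
                              x_prev, x_cur, y_prev, y_cur):
    """One training step's GAN losses under the temporal smoothing setup.

    x_prev, x_cur : pose stick figure images at t-1 and t   (N, C, H, W)
    y_prev, y_cur : ground truth target frames at t-1 and t (N, C, H, W)
    """
    zero_image = torch.zeros_like(y_prev)  # placeholder: nothing generated at t-2
    g_prev = generator(torch.cat([x_prev, zero_image], dim=1))  # G(x_{t-1})
    g_cur = generator(torch.cat([x_cur, g_prev], dim=1))        # G(x_t) conditioned on G(x_{t-1})

    real_tuple = torch.cat([x_prev, x_cur, y_prev, y_cur], dim=1)
    fake_tuple = torch.cat([x_prev, x_cur, g_prev, g_cur], dim=1)

    real_logits = discriminator(real_tuple)
    fake_logits = discriminator(fake_tuple.detach())

    # L_smooth(G, D): D should label real tuples 1 and fake tuples 0 ...
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    # ... while G tries to make the fake tuple look real.
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_tuple),
                                                torch.ones_like(real_logits))
    return d_loss, g_loss, g_prev, g_cur
```

In a training loop, d_loss would update the discriminator and g_loss (plus the reconstruction terms) the generator, with the face GAN optimized only after (G, D) are frozen, as described above.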
[0075] EXAMPLE IMPLEMENTATION [0076] Data Collection [0077] Certain implementations may collect source and target videos in slightly different manners. To learn the appearance of the target subject in many poses, it is important that the target video captures a sufficient range of motion and sharp frames with minimal blur. To ensure the quality of the frames, a target subject was filmed for around 20 minutes of real-time footage at 120 frames per second, which is possible with some modern cell phone cameras. Since the pose representation does not encode information about clothes, the target subjects wore tight clothing with minimal wrinkling. [0078] In contrast to some of the preparation required for filming a target subject, source videos do not require the same (though still reasonable) quality, as only decent pose detections are needed from the source video. Without such limitations, many high quality videos of a subject performing a dance are abundant online. [0079] In the example, pre-smoothing pose keypoints were immensely helpful in reducing jittering in the outputs. For videos with a high framerate (e.g., 120 fps), the keypoints were Gaussian smoothed over time, and median smoothing was used for videos with lower framerates. [0080] FIGURE 8 illustrates multiple examples 802-816 of transfer results in accordance with certain implementations of the disclosed technology. Each of the examples 802-816 includes a sequence of video frames for a source subject, the resulting stick figure representation, and the resulting video frames of a target subject in the pose represented by the stick figure generated from the source subject. [0081] Network Architecture [0082] Architectures were adapted from various models for different stages of the pipeline. To extract pose keypoints for the body, face, and hands, the example included architectures provided by a state-of-the-art pose detector. [0083] For the image translation stage of the pipeline, certain architectures were adapted. To create 128×128 face image residuals, the full capability of the entire pix2pixHD generator is not needed and, therefore, face residuals were predicted using the global generator of pix2pixHD. Similarly, a single 70×70 PatchGAN discriminator was used for the face discriminator. Alternative implementations may use the LSGAN objective during training similarly to pix2pixHD for both the full image and face GANs. [0084] EXPERIMENTS [0085] The effects of modifications to the pix2pixHD baseline were explored and the quality of the results on the collected dataset was evaluated. With no ground truth data for retargeting between two different video subjects, the reconstruction of the target person (i.e., the source person is the target person) was analyzed with validation data. An ablation study was conducted on the inclusion of the temporal smoothing setup and face GAN compared to a pix2pixHD baseline. [0086] To assess the quality of individual frames, both Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) were measured. Without ground truth flows for the data, qualitative analysis was relied on to evaluate the temporal coherence of the output videos. [0087] In addition, the pose detector P was run on the outputs of each system, and these reconstructed keypoints were compared to the pose detections of the original input video. If all body parts are synthesized correctly, then the reconstructed pose should be close to the input pose on which the output was conditioned.
[0088] For a pose distance metric between two poses p, p', each with n joints p_1, ..., p_n and p'_1, ..., p'_n, the L2 distances between the corresponding joints p_k = (x_k, y_k) and p'_k = (x'_k, y'_k) may be summed and normalized by the number of keypoints:
[0089]
d(p, p') = \frac{1}{n} \sum_{k=1}^{n} \big\| (x_k, y_k) - (x'_k, y'_k) \big\|_2 \qquad \text{(Equation 7)}
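The following is a minimal sketch of the pose distance in Equation 7, under the assumption that each pose is stored as a NumPy array of shape (n, 2) and that missed detections are marked with NaN, mirroring the filtering described in paragraph [0090] below. The function name is illustrative only.

```python
import numpy as np
from typing import Optional

def pose_distance(p: np.ndarray, p_prime: np.ndarray) -> Optional[float]:
    """Mean L2 distance between corresponding joints of two (n, 2) poses.

    Returns None when either pose has a missed detection (NaN), so such
    frames can be skipped during evaluation.
    """
    if np.isnan(p).any() or np.isnan(p_prime).any():
        return None  # skip poses with missing joints
    return float(np.linalg.norm(p - p_prime, axis=1).mean())
```

In an evaluation loop, this distance would be averaged only over frames where both the input and reconstructed poses have all joints detected.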
[0090] To avoid dealing with missing detections (i.e., without viewing the original image of the subject, it can be hard to discern whether a "missed" detection is due to noise or occlusion), only poses where all joints are detected were compared.
[0091]-[0094] [Tables 1-4, referenced in the ablation study below, appear here in the original document: mean image similarity (SSIM, LPIPS) for the body region and for the face region, mean pose distance, and counts of missed detections for the pix2pixHD, T.S., and T.S. + Face models.]
[0095] ABLATION STUDY
[0096] The results of an ablation study are presented in Tables 1 to 4 above. Results were compared from the pix2pixHD baseline (pix2pixHD), a version of the model with just the temporal smoothing setup (T.S.), and the full model with both the temporal smoothing setup and the face GAN (T.S. + Face).
[0097] Table 1 contains mean image similarity measurements for a region around the body. FIGURE 9 illustrates multiple examples 902-906 of comparisons of synthesis results for different models in accordance with certain implementations of the disclosed technology. Both SSIM and LPIPS scores are similar for all model variations. Qualitatively, the pix2pixHD baseline already reasonably synthesizes the target person, as reflected by the similarity measurements. Scores on full images are even more similar between the ablations, as all ablations have no difficulty generating the static background. Table 2 shows mean scores for the face region (for the face GAN, this is the region for which the face residual is generated). Again, scores are generally favorable for all ablations, although the full model with both the temporal smoothing and face GAN setups obtains the best scores, with the biggest discrepancy in the face region.
[0098] Table 3 shows the mean pose distance, using the method described in Equation 7, for each ablation. The pose metric was run on particular sets of keypoints (body, face, hands) to determine the regions which incur the most error. Adding the temporal smoothing setup does not seem to decrease the reconstructed pose distances significantly; however, including the face GAN adds substantial improvements overall, especially for the face and hand keypoints.
[0099] In Table 4, the number of missed detections (i.e., joints detected on ground truth frames but not on outputs) is counted for various regions and for the whole pose, as the pose metric does not accurately reflect missed detections. With the addition of the model parts, the number of missed detections generally decreases, especially for the face keypoints.
[0100] Qualitative Assessment
[0101] Although the ablation study scores for the temporal smoothing setup are generally comparable to, or an improvement over, the pix2pixHD baseline, significant differences occur in video results, where the temporal smoothing setup exhibits more frame-to-frame coherence than the pix2pixHD baseline. Qualitatively, the temporal smoothing setup helps with smooth motion, color consistency across frames, and also in individual frame synthesis.
[0102] Consistent with the ablation study, adding a specialized facial generator and discriminator adds considerable detail and encourages synthesizing realistic body parts. The face synthesis with and without the face GAN is compared in the examples 1002-1006 of the face image comparison from different models 1000 and in the video results.
[0103] Overall, the disclosed model is able to create reasonable and arbitrarily long videos of a target person dancing, given body movements to follow through an input video of another subject dancing.
[0104] Global Pose Normalization Details
[0105] To find a transformation, in terms of scale and translation, between a source pose and a target pose, embodiments may include finding the minimum and maximum ankle positions in image coordinates of each subject while they are on the ground (i.e., feet raised in the air are not considered). These coordinates represent the farthest and closest distances to the camera, respectively. The maximum ankle position is the y foot coordinate closest to the bottom of the image. The minimum foot position is found by clustering the y foot coordinates which are less than (or spatially above) the median ankle position and about the same distance from the median ankle position as the maximum ankle position's distance to the median ankle position. (A sketch of the full normalization appears after paragraph [0113] below.)
The clustering is as described by the set
[0106]
S = \{\, y : y < med \ \wedge\ \big|\, |med - y| - |med - max| \,\big| < \alpha \,\}
[0107] where med is the median foot position, max is the maximum ankle position, and α is a scalar.
[0108] Once the minimum and maximum ankle positions of each subject are found, a linear mapping between the minimum and maximum ankle positions of each video (i.e., minimum of source mapped to minimum of target, and likewise for the maximum ankle positions) may be carried out. The transformation may be characterized in terms of a scale and a translation in the y direction, which are calculated for each frame.
[0109] The translation may be calculated according to the average of the left and right ankle y coordinates and its relative distance between the maximum and minimum ankle positions in the source frame. The new transformed foot position is then the coordinate between the maximum and minimum ankle positions in the target video with the same relative/interpolated distance. Given an average ankle position a_source in the source frame, the translation b may be calculated for that frame according to the following equation:
[0110]
b = t_{min} + \frac{a_{source} - s_{min}}{s_{max} - s_{min}}\,(t_{max} - t_{min}) - a_{source}
[0111] where t_min and t_max are the minimum and maximum ankle positions in the target video, and s_min and s_max are the minimum and maximum ankle positions in the source video.
[0112] To calculate the scale, embodiments may include clustering the heights around the minimum ankle position and the maximum ankle position and finding the maximum height for each cluster for each video. Embodiments may include calling these maximum heights t_close for the maximum of the cluster near the target person's maximum ankle position, t_far for the maximum of the cluster near the target person's minimum ankle position, and s_close and s_far for the corresponding source heights. The close ratio may be obtained by taking the ratio between the target's close height and the source's close height, and similarly for the far ratio. Given an average ankle position a_source, the scale for this frame may be interpolated between these two ratios in the same way as the translation is interpolated, as described in the following equation:
[0113]
scale = \frac{t_{far}}{s_{far}} + \frac{a_{source} - s_{min}}{s_{max} - s_{min}}\left(\frac{t_{close}}{s_{close}} - \frac{t_{far}}{s_{far}}\right)
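The following is a minimal sketch of the per-frame normalization described in paragraphs [0105]-[0113], assuming the minimum and maximum ankle positions and cluster heights have already been estimated as described above. The function name, argument names, and the way the scale is anchored when applying the transformation are illustrative assumptions, not part of the disclosed implementation.

```python
from typing import Tuple

def normalize_source_pose(a_source: float,
                          s_min: float, s_max: float,
                          t_min: float, t_max: float,
                          s_close: float, s_far: float,
                          t_close: float, t_far: float) -> Tuple[float, float]:
    """Per-frame translation b and scale mapping a source pose toward the
    target subject's frame, following the equations in [0110] and [0113]."""
    # Fraction of the way the source ankle lies between its far (min) and
    # close (max) positions in the source video.
    frac = (a_source - s_min) / (s_max - s_min)
    # Translation: interpolated target ankle position minus the source position.
    b = t_min + frac * (t_max - t_min) - a_source
    # Scale: interpolate between the far and close height ratios.
    scale = (t_far / s_far) + frac * (t_close / s_close - t_far / s_far)
    return b, scale

# A transformed y keypoint could then be computed as, for example:
#   y_new = scale * (y - a_source) + a_source + b
# (anchoring the scale at the average ankle position is an assumption made
# here for illustration).
```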
[0114] Aspects of the disclosure may operate on particularly created hardware, firmware, digital signal processors, or on a specially programmed computer including a processor operating according to programmed instructions. The terms controller or processor as used herein are intended to include microprocessors, microcomputers, Application Specific Integrated Circuits (ASICs), and dedicated hardware controllers. One or more aspects of the disclosure may be embodied in computer-usable data and computer-executable instructions, such as in one or more program modules, executed by one or more computers (including monitoring modules), or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device. The computer-executable instructions may be stored on a computer-readable storage medium such as a hard disk, optical disk, removable storage media, solid state memory, Random Access Memory (RAM), etc.
[0115] As will be appreciated by one of skill in the art, the functionality of the program modules may be combined or distributed as desired in various aspects. In addition, the functionality may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, FPGAs, and the like. Particular data structures may be used to more effectively implement one or more aspects of the disclosure, and such data structures are contemplated within the scope of computer-executable instructions and computer-usable data described herein.
[0116] The disclosed aspects may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed aspects may also be implemented as instructions carried by or stored on one or more computer-readable storage media, which may be read and executed by one or more processors. Such instructions may be referred to as a computer program product. Computer-readable media, as discussed herein, means any media that can be accessed by a computing device. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media.
[0117] Having described and illustrated the principles of the invention with reference to illustrated embodiments, it will be recognized that the illustrated embodiments may be modified in arrangement and detail without departing from such principles, and may be combined in any desired manner. And although the foregoing discussion has focused on particular embodiments, other configurations are contemplated.
[0118] Consequently, in view of the wide variety of permutations to the embodiments that are described herein, this detailed description and accompanying material are intended to be illustrative only, and should not be taken as limiting the scope of the invention. What is claimed as the invention, therefore, is all such modifications as may come within the scope and spirit of the following claims and equivalents thereto.

Claims

We claim: 1. A computer-implemented method, comprising: during a training phase: a body configuration detector receiving a plurality of video frames from an original video of a target subject, each video frame including the target subject in a certain body configuration; the body configuration detector determining, for each of the plurality of frames, a corresponding digital representation for the target subject in the body configuration; and using corresponding (x,y) image pairs to learn a mapping that synthesizes the body configuration of the target subject with the digital representation; and during a transfer phase: the body configuration detector extracting body configuration information corresponding to a source subject from a source frame to generate a plurality of pose joints for the source subject; and applying the mapping to the digital representation to generate an image of the target subject in a body configuration corresponding to the body configuration of the source subject.
2. The computer-implemented method of claim 1, further comprising, during the transfer phase, applying a normalizing process to the plurality of pose joints to transform the pose joints into joints for the source subject to create a posed stick figure.
3. The computer-implemented method of claim 2, wherein the normalizing process results in a better match between the body configuration of the source subject and the target subject.
4. The computer-implemented method of claim 2, further comprising the body configuration detector drawing a representation of the posed stick figure by plotting a plurality of keypoints and drawing lines between connected joints.
5. The computer-implemented method of claim 1, further comprising the body configuration detector obtaining a plurality of joint estimates for the source subject.
6. The computer-implemented method of claim 1, further comprising transforming pose keypoints of the source subject so that they appear in accordance with the body shape and proportion of the target subject.
7. The computer-implemented method of claim 6, wherein the transforming includes analyzing the ankle position for each pose of each subject and using a linear mapping between the closest and furthest ankle positions.
8. The computer-implemented method of claim 6, wherein the transforming includes calculating the scale and translation for each frame based on its corresponding body configuration detection.
9. The computer-implemented method of claim 1, wherein the source subject is a first person and the target subject is a second person.
10. The computer-implemented method of claim 1, further comprising a discriminator distinguishing between “real” image pairs and “fake” image pairs.
11. One or more tangible, non-transitory computer-readable media including instructions that, when executed by a processor, cause the processor to perform the computer-implemented method of claim 1.
12. One or more tangible, non-transitory computer-readable media including instructions that, when executed by a processor, cause the processor to perform the computer-implemented method of claim 6.
13. A system, comprising: a video device configured to capture an original video of a target subject; a body configuration detector configured to receive from the video device a plurality of video frames from the original video of the target subject, each video frame including the target subject in a certain body configuration, wherein the body configuration detector is further configured to determine, for each of the plurality of frames, a corresponding digital representation for the target subject in the body configuration; and a training module configured to use corresponding (x,y) image pairs to learn a mapping that synthesizes the body configuration of the target subject with the digital representation.
14. The system of claim 13, wherein the body configuration detector is further configured to extract pose information corresponding to a source subject from a source frame to generate a plurality of pose joints for the source subject.
15. The system of claim 14, further comprising a transfer module configured to apply the mapping to the digital representation to generate an image of the target subject in a body configuration corresponding to the body configuration of the source subject.
16. The system of claim 15, further comprising a normalizing module configured to apply a normalizing process to the plurality of pose joints to transform the pose joints into joints for the source subject to create a posed stick figure.
17. The system of claim 16, wherein the normalizing process results in a better match between the body configuration of the source subject and the target subject.
18. The system of claim 13, further comprising a discriminator configured to distinguish between “real” image pairs and “fake” image pairs.
19. The system of claim 15, wherein the transfer module is further configured to transform pose keypoints of the source subject so that they appear in accordance with the body shape and proportion of the target subject.
20. The system of claim 13, wherein the source subject is a first person and the target subject is a second person.
PCT/US2020/043253 2019-08-21 2020-07-23 Human motion transfer for dancing video synthesis WO2021034443A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962889788P 2019-08-21 2019-08-21
US62/889,788 2019-08-21

Publications (1)

Publication Number Publication Date
WO2021034443A1 true WO2021034443A1 (en) 2021-02-25

Family

ID=74660032

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/043253 WO2021034443A1 (en) 2019-08-21 2020-07-23 Human motion transfer for dancing video synthesis

Country Status (1)

Country Link
WO (1) WO2021034443A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140219550A1 (en) * 2011-05-13 2014-08-07 Liberovision Ag Silhouette-based pose estimation
US20120327194A1 (en) * 2011-06-21 2012-12-27 Takaaki Shiratori Motion capture from body mounted cameras
US20190116322A1 (en) * 2017-10-13 2019-04-18 Fyusion, Inc. Skeleton-based effects and background replacement

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870313A (en) * 2021-10-18 2021-12-31 南京硅基智能科技有限公司 Action migration method
CN113870315A (en) * 2021-10-18 2021-12-31 南京硅基智能科技有限公司 Training method of action migration model and action migration method
CN113870315B (en) * 2021-10-18 2023-08-25 南京硅基智能科技有限公司 Multi-algorithm integration-based action migration model training method and action migration method
CN113870313B (en) * 2021-10-18 2023-11-14 南京硅基智能科技有限公司 Action migration method
CN114821811A (en) * 2022-06-21 2022-07-29 平安科技(深圳)有限公司 Method and device for generating person composite image, computer device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20854748

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20854748

Country of ref document: EP

Kind code of ref document: A1