EP3912160A1 - Systems and methods for template-based generation of personalized videos - Google Patents

Systems and methods for template-based generation of personalized videos

Info

Publication number
EP3912160A1
Authority
EP
European Patent Office
Prior art keywords
image
sequence
frame
face
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20707900.5A
Other languages
German (de)
French (fr)
Inventor
Victor Shaburov
Pavel Savchenkov
Alexander MASHRABOV
Dmitry Matov
Sofia SAVINOVA
Alexey PCHELNIKOV
Roman GOLOBKOV
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Snap Inc
Original Assignee
Snap Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/251,472 external-priority patent/US11049310B2/en
Priority claimed from US16/251,436 external-priority patent/US10789453B2/en
Priority claimed from US16/434,185 external-priority patent/US10839586B1/en
Priority claimed from US16/551,756 external-priority patent/US10776981B1/en
Priority claimed from US16/594,690 external-priority patent/US11089238B2/en
Priority claimed from US16/594,771 external-priority patent/US11394888B2/en
Priority claimed from US16/661,122 external-priority patent/US11308677B2/en
Priority claimed from US16/661,086 external-priority patent/US11288880B2/en
Application filed by Snap Inc
Publication of EP3912160A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/109Font handling; Temporal or kinetic typography
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/50Business processes related to the communications industry
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11BINFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B27/00Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B27/02Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B27/031Electronic editing of digitised analogue information signals, e.g. audio or video signals

Definitions

  • This disclosure generally relates to digital image processing. More particularly, this disclosure relates to methods and systems for template-based generation of personalized videos.
  • Sharing media, such as stickers and emojis, has become a standard option in messaging applications (also referred to herein as messengers).
  • the messengers provide users with an option for generating and sending images and short videos to other users via a communication chat.
  • Certain existing messengers allow users to modify the short videos prior to transmission.
  • modifications of the short videos provided by the existing messengers are limited to visualization effects, filters, and texts.
  • the users of the current messengers cannot perform complex editing, such as, for example, replacing one face with another face.
  • Such editing of the videos is not provided by current messengers and requires sophisticated third-party video editing software.
  • a system for template-based generation of personalized videos may include at least one processor and a memory storing processor-executable codes.
  • the at least one processor may be configured to receive, by a computing device, video configuration data.
  • the video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images.
  • Each of the facial landmark parameters may correspond to a facial expression.
  • the at least one processor may be configured to receive, by the computer device, an image of a source face.
  • the at least one processor may be configured to generate, by the computing device, an output video.
  • the generation of the output video may include modifying a frame image of the sequence of frame images.
  • the image of the source face may be modified based on facial landmark parameters corresponding to the frame image to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters.
  • the further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.
  • a method for template-based generation of personalized videos may commence with receiving, by a computing device, video configuration data.
  • the video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images.
  • Each of the facial landmark parameters may correspond to a facial expression.
  • the method may continue with receiving, by the computer device, an image of a source face.
  • the method may further include generating, by the computing device, an output video.
  • the generation of the output video may include modifying a frame image of the sequence of frame images.
  • the image of the source face may be modified to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters.
  • the modification of the image may be performed based on facial landmark parameters corresponding to the frame image.
  • the further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.
  • a non-transitory processor-readable medium is also provided which stores processor-readable instructions. When executed by a processor, the instructions cause the processor to implement the above-mentioned method for template-based generation of personalized videos.
  • FIG. 1 is a block diagram showing an example environment wherein systems and methods for template-based generation of personalized videos can be implemented.
  • FIG. 2 is a block diagram showing an example embodiment of a computing device for implementing methods for template-based generation of personalized videos.
  • FIG. 3 is a flow chart showing a process for template-based generation of personalized videos, according to some example embodiments of the disclosure.
  • FIG. 4 is a flow chart showing functionality of a system for template-based generation of the personalized videos, according to some example embodiments of the disclosure.
  • FIG. 5 is a flow chart showing a process of generation of live action videos for use in the generation of video templates, according to some example embodiments.
  • FIG. 6 shows frames of example live action videos for generating video templates, according to some example embodiments.
  • FIG. 7 shows an original image of a face and an image of the face with normalized illumination, according to an example embodiment.
  • FIG. 8 shows a segmented head image, the head image with facial landmarks, and a facial mask, according to an example embodiment.
  • FIG. 9 shows a frame featuring a user face, a skin mask, and a result of recoloring the skin mask, according to an example embodiment.
  • FIG. 10 shows a facial image of a face synchronization actor, an image of the face synchronization actor's facial landmarks, an image of a user's facial landmarks, and an image of the user's face with the facial expression of the face synchronization actor, according to an example embodiment.
  • FIG. 11 shows a segmented face image, a hair mask, a hair mask warped to a target image, and the hair mask applied to the target image, according to an example embodiment.
  • FIG. 12 shows an original image of an eye, an image with reconstructed sclera of the eye, an image with reconstructed iris, and an image with reconstructed moved iris, according to an example embodiment.
  • FIGs. 13-14 show frames of example personalized video generated based on video templates, according to some example embodiments.
  • FIG. 15 is a flow chart showing a method for template-based generation of personalized videos, in accordance with an example embodiment.
  • FIG. 16 shows an example computer system that can be used to implement methods described herein.
  • This disclosure relates to methods and systems for template-based generation of personalized videos.
  • the embodiments provided in this disclosure solve at least some issues of known art.
  • the present disclosure can be designed to work on mobile devices, such as smartphones, tablet computers, or mobile phones, in real-time, although the embodiments can be extended to approaches involving a web service or a cloud-based resource.
  • Methods described herein can be implemented by software running on a computer system and/or by hardware utilizing either a combination of microprocessors or other specifically designed application-specific integrated circuits (ASICs), programmable logic devices, or any combinations thereof.
  • the methods described herein can be implemented by a series of computer-executable instructions residing on a non-transitory storage medium such as a disk drive or computer-readable medium.
  • a personalized video may be generated in the form of an audiovisual media (e.g., a video, an animation, or any other type of media) that features a face of a user or faces of multiple users.
  • the personalized videos can be generated based on pre-generated video templates.
  • a video template may include video configuration data.
  • the video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images.
  • Each of the facial landmark parameters may correspond to a facial expression.
  • the frame images can be generated based on an animation video or a live action video.
  • the facial landmark parameters can be generated based on another live action video featuring a face of an actor (also referred to as a face synchronization (facesync) actor, described in more detail below), an animation video, an audio file, text, or manually.
  • the video configuration file may also include a sequence of skin masks.
  • the skin masks may define a skin area of a body of an actor featured in the frame images or a skin area of 2D/3D animation of a body.
  • the skin mask and the facial landmark parameters can be generated based on two different live action videos capturing different actors (referred to herein as an actor and facesync actor, respectively).
  • the video configuration data may further include a sequence of mouth region images and a sequence of eye parameters.
  • the eye parameters may define positions of an iris in a sclera of a facesync actor featured in the frame images.
  • the video configuration data may include a sequence of head parameters defining a rotation and a turn of a head, a position, a scale, and other parameters of the head.
  • a user may keep his head still and look directly at the camera when taking an image; therefore, the scale and rotation of the head may be adjusted manually.
  • the head parameters can be transferred from a different actor (also referred to herein as a facesync actor).
  • as used herein, a facesync actor is a person whose facial landmark parameters are used; an actor is another person whose body is used in a video template and whose skin may be recolored; and a user is a person who takes an image of his/her face to generate a personalized video.
  • the personalized video includes the face of the user modified to have facial expressions of the facesync actor and includes the body of the actor taken from the video template and recolored to match the color of the face of the user.
  • the video configuration data may include a sequence of animated object images.
  • the video configuration data may include a soundtrack and/or voice.
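  • For illustration only, the video configuration data described above can be represented as per-frame records bundled with a soundtrack. The sketch below shows one possible in-memory layout in Python; all field names are assumptions made for this example rather than a format defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


# Illustrative field names; the actual template format is not specified here.
@dataclass
class FrameConfig:
    frame_image: np.ndarray                    # background frame (H x W x 3)
    face_area: Tuple[int, int, int, int]       # (x, y, width, height) of the face area
    facial_landmarks: np.ndarray               # (N, 2) landmark coordinates encoding one expression
    skin_mask: Optional[np.ndarray] = None     # grayscale mask of the actor's uncovered skin
    mouth_region: Optional[np.ndarray] = None  # pre-rendered mouth image for this frame
    eye_params: Optional[np.ndarray] = None    # iris position inside the sclera
    head_params: Optional[dict] = None         # rotation, turn, scale, position of the head
    animated_objects: List[np.ndarray] = field(default_factory=list)


@dataclass
class VideoConfigurationData:
    frames: List[FrameConfig]
    soundtrack_path: Optional[str] = None      # soundtrack and/or voice-over
```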
  • the pre-generated video templates can be stored remotely in a cloud-based computing resource and can be downloadable by a user of a computing device (such as a smartphone).
  • the user of the computing device may capture, by the computing device, an image of a face or select an image of the face from a camera roll, from a prepared collection of images, or via a web link.
  • the image may include an animal instead of a face of a person or may be in the form of a drawn picture.
  • the computing device may further generate a personalized video.
  • an example method for template-based generation of personalized videos may include receiving, by a computing device, video configuration data.
  • the video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images.
  • Each of the facial landmark parameters may correspond to a facial expression of a facesync actor.
  • the method may continue with receiving an image of a source face and generating, by the computing device, an output video.
  • the generation of the output video may include modifying a frame image of the sequence of frame images.
  • the modification of the frame image may include modifying the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters and inserting the further image into the frame image at a position determined by face area parameters corresponding to the frame image.
  • the source face may be modified, e.g., by changing color, making eyes bigger, and so forth.
  • the image of the source face may be modified based on facial landmark parameters corresponding to the frame image.
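  • As a minimal sketch of the insertion step, the re-rendered source face can be resized to the face area and composited into the frame at the position given by the face area parameters. The helper below uses OpenCV, assumes a rectangular face area lying inside the frame, and optionally blends with an alpha mask; a production pipeline would blend more carefully.

```python
from typing import Optional

import cv2
import numpy as np


def insert_face(frame: np.ndarray,
                face_image: np.ndarray,
                face_area: tuple,
                alpha_mask: Optional[np.ndarray] = None) -> np.ndarray:
    """Paste a re-rendered face into a frame at the rectangle given by the face area parameters."""
    x, y, w, h = face_area                      # assumes the rectangle lies inside the frame
    face = cv2.resize(face_image, (w, h))
    out = frame.copy()
    if alpha_mask is None:
        out[y:y + h, x:x + w] = face
    else:
        alpha = cv2.resize(alpha_mask, (w, h)).astype(np.float32)[..., None] / 255.0
        region = out[y:y + h, x:x + w].astype(np.float32)
        out[y:y + h, x:x + w] = (alpha * face + (1.0 - alpha) * region).astype(np.uint8)
    return out
```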
  • FIG. 1 shows an example environment 100, in which a system and a method for template-based generation of personalized videos can be implemented.
  • the environment 100 may include a computing device 105, a user 102, a computing device 110, a user 104, a network 120, and a messenger services system 130.
  • the computing device 105 and computing device 110 can refer to a mobile device such as a mobile phone, smartphone, or tablet computer.
  • the computing device 110 can refer to a personal computer, laptop computer, netbook, set top box, television device, multimedia device, personal digital assistant, game console, entertainment system, infotainment system, vehicle computer, or any other computing device.
  • the computing device 105 and the computing device 110 can be communicatively connected to the messenger services system 130 via the network 120.
  • the messenger services system 130 can be implemented as a cloud-based computing resource(s).
  • the messenger services system 130 can include computing resource(s) (hardware and software) available at a remote location and accessible over a network (e.g., the Internet).
  • the cloud-based computing resource(s) can be shared by multiple users and can be dynamically re-allocated based on demand.
  • the cloud-based computing resources can include one or more server farms/clusters including a collection of computer servers which can be co-located with network switches and/or routers.
  • the network 120 may include any wired, wireless, or optical networks including, for example, the Internet, intranet, local area network (LAN), Personal Area Network (PAN), Wide Area Network (WAN), Virtual Private Network (VPN), cellular phone networks (e.g., a Global System for Mobile (GSM) communications network), and so forth.
  • the computing device 105 can be configured to enable a communication chat between the user 102 and the user 104 of the computing device 110.
  • the user 102 and the user 104 may exchange text messages and videos.
  • the videos may include personalized videos.
  • the personalized videos can be generated based on pre-generated video templates stored in the computing device 105 or the computing device 110.
  • the pre-generated video templates can be stored in the messenger services system 130 and downloaded to the computing device 105 or the computing device 110 on demand.
  • the messenger services system 130 may include a system 140 for pre-processing videos.
  • the system 140 may generate video templates based on animation videos or live action videos.
  • the messenger services system 130 may include a video templates database 145 for storing the video templates.
  • the video templates can be downloaded to the computing device 105 or the computing device 110.
  • the messenger services system 130 may be also configured to store user profiles 135.
  • the user profiles 135 may include images of the face of the user 102, images of the face of the user 104, and images of faces of other persons.
  • the images of the faces can be downloaded to the computing device 105 or the computing device 110 on demand and based on permissions. Additionally, the images of the face of the user 102 can be generated using the computing device 105 and stored in a local memory of the computing device 105.
  • the images of the faces can be generated based on other images stored in the computing device 105.
  • the images of the faces can be further used by the computing device 105 to generate personalized videos based on the pre-generated video templates. Similarly, the computing device 110 may be used to generate images of the face of the user 104.
  • the images of the face of the user 104 can be used to generate personalized videos on the computing device 110.
  • the images of the face of user 102 and images of the face of the user 104 can be mutually used to generate personalized videos on the computing device 105 or the computing device 110.
  • FIG. 2 is a block diagram showing an example embodiment of a computing device 105 (or computing device 110) for implementing methods for personalized videos.
  • the computing device 110 includes both hardware components and software components.
  • the computing device 110 includes a camera 205 or any other image-capturing device or scanner to acquire digital images.
  • the computing device 110 can further include a processor module 210 and a storage module 215 for storing software components and processor-readable (machine-readable) instructions or codes, which when performed by the processor module 210 cause the computing device 105 to perform at least some steps of methods for template-based generation of personalized videos as described herein.
  • the computing device 105 may include a graphical display system 230 and a communication module 240. In other embodiments, the computing device 105 may include additional or different components.
  • the computing device 105 can include fewer components that perform functions similar or equivalent to those depicted in FIG. 2.
  • the computing device 110 can further include a messenger 220 for enabling communication chats with another computing device (such as the computing device 110) and a system 250 for template-based generation of personalized videos.
  • the system 250 is described in more detail below with reference to FIG. 4.
  • the messenger 220 and the system 250 may be implemented as software components and processor-readable (machine-readable) instructions or codes stored in the memory storage 215, which when performed by the processor module 210 cause the computing device 105 to perform at least some steps of methods for providing communication chats and generation of personalized videos as described herein.
  • the system 250 for template-based generation of personalized videos can be integrated in the messenger 220.
  • a user interface of the messenger 220 and the system 250 for template-based personalized videos can be provided via the graphical display system 230.
  • the communication chats can be enabled via the communication module 240 and the network 120.
  • the communication module 240 may include a GSM module, a WiFi module, a BluetoothTM module, and so forth.
  • FIG. 3 is a flow chart showing steps of a process 300 for template-based generation of personalized videos, according to some example embodiments of the disclosure.
  • the process 300 may include production 305, post-production 310, resources preparation 315, skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, eyes animation 335, and deploy 340.
  • the resource preparation 315 can be performed by the system 140 for pre-processing videos in the messenger services system 130 (shown in FIG. 1).
  • the resource preparation 315 results in generating video templates that may include video configuration data.
  • the skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, eyes animation 335, and deploy 340 can be performed by the system 250 for template-based generation of personalized videos in computing device 105 (shown in FIG. 2).
  • the system 250 may receive an image of the user's face and video configuration data and generate a personalized video featuring the user's face.
  • the skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, eyes animation 335, and deploy 340 can be also performed by the system 140 for pre-processing videos in messenger services system 130.
  • the system 140 can receive test images of user faces and a video configuration file.
  • the system 140 may generate test personalized videos featuring the user faces.
  • the test personalized videos can be reviewed by an operator. Based on a result of the review, the video configuration file can be stored in the video templates database 145 and can then be downloaded to the computing device 105 or computing device 110.
  • the production 305 may include idea and scenario creation, pre-production during which a location, props, actors, costumes and effects are identified, and production itself, which can require one or more recording sessions.
  • the recording may be performed by recording a scene/actor on a chroma key background, also referred to herein as a green screen or chroma key screen.
  • the actors may wear chroma key face masks (e.g., balaclavas) with tracking marks that cover the face of the actors, but leave the neck and the bottom of the chin open.
  • the idea and scenario creation are shown in detail in FIG. 5.
  • the steps of pre-production and subsequent production 305 are optional. Instead of recording an actor, 2D or 3D animation may be created or third-party footage/images may be used. Furthermore, an original background of the image of the user may be used.
  • FIG. 5 is a block diagram showing a process 500 of generating live action videos.
  • the live action videos can be further used to generate video templates for generation of personalized video.
  • the process 500 may include generating an idea at step 505 and creating a scenario at step 510.
  • the process 500 may continue with pre-production at step 515, which is followed by production 305.
  • the production 305 may include recording using a chroma key screen 525 or at a real life location 530.
  • FIG. 6 shows frames of example live action videos for generating video templates.
  • Frames for video 605 and video 615 are recorded at a real life location 530.
  • Frames for video 610, video 620, and video 625 are recorded using a chroma key screen 525.
  • the actors may wear chroma key face masks 630 with tracking marks that cover the face of the actors.
  • the post-production 310 may include video editing or animation, visual effects, clean-up, sound design and voice over recording.
  • the resources prepared for further deployment may include the following components: a background footage without a head of an actor (i.e., a cleaned-up background where the head of the actor is removed); a footage with an actor on a black background (only for recorded personalized videos); a foreground sequence of frames; an example footage with a generic head and soundtrack; coordinates for head position, rotation, and scale; animated elements that are attached to the head (optional); soundtracks with and without a voice-over; a voice-over in a separate file (optional); and so forth. All of these components are optional and may be rendered in different formats. The number and configuration of the components depend on the format of the personalized video. For example, a voice-over is not needed for customized personalized videos, background footage and head coordinates are not needed if the original background from a picture of the user is used, and so forth. In an example embodiment, the area where the face needs to be located may be indicated (e.g., manually) instead of preparing a file with coordinates.
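  • The deployable resources listed above can be grouped into a single template descriptor. The sketch below shows a hypothetical layout as a Python dictionary; the keys and file names are invented for illustration, and every entry may be omitted depending on the format of the personalized video.

```python
# Hypothetical template descriptor; keys and file names are illustrative only.
video_template = {
    "background_footage": "background_no_head.mp4",    # cleaned-up background, head removed
    "actor_footage": "actor_black_background.mp4",     # actor on a black background (recorded videos only)
    "foreground_frames": "foreground_%04d.png",         # foreground sequence of frames
    "example_footage": "example_generic_head.mp4",      # example footage with a generic head and soundtrack
    "head_coordinates": "head_position_rotation_scale.json",
    "attached_animated_elements": None,                  # optional elements attached to the head
    "soundtrack_with_voice": "soundtrack_voice.aac",
    "soundtrack_without_voice": "soundtrack_instrumental.aac",
    "voice_over": None,                                  # optional separate voice-over file
}
```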
  • the skin recoloring 320 makes it possible to match the color of the skin of the actor in the personalized video to the color of a face in an image of the user.
  • skin masks that indicate specifically which parts of the background have to be recolored may be prepared. It may be preferable to have a separate mask for each body part of the actor (neck, left and right hands, etc.).
  • the skin recoloring 320 may include facial image illumination normalization.
  • FIG. 7 shows an original image 705 of a face and an image 710 of the face with normalized illumination, according to an example embodiment. Shadows or highlights caused by uneven illumination affect color distribution and may lead to a skin tone that is too dark or too light after recoloring. To avoid this, shadows and highlights in the face of the user may be detected and removed.
  • the facial image illumination normalization process includes the following steps. An image of a face of the user may be transformed using a deep convolutional neural network. The network may receive an original image 705 in the form of a portrait image taken under arbitrary illumination and change the illumination of the original image 705 to make the original image 705 evenly illuminated while keeping the subject in original image 705 the same.
  • the input of the facial image illumination normalization process includes the original image 705 in the form of the image of the face of the user and facial landmarks.
  • the output of the facial image illumination normalization process includes the image 710 of the face with normalized illumination.
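  • The normalization itself is performed by a deep convolutional neural network as described above. As a rough classical stand-in (not the network used in this disclosure), low-frequency shading can be estimated with a large Gaussian blur of the luminance channel and divided out, which flattens much of the uneven illumination while keeping the subject:

```python
import cv2
import numpy as np


def normalize_illumination(image_bgr: np.ndarray, blur_sigma: float = 35.0) -> np.ndarray:
    """Rough illumination flattening: divide luminance by its low-frequency estimate."""
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    luminance = lab[..., 0]
    shading = cv2.GaussianBlur(luminance, (0, 0), blur_sigma)   # low-frequency illumination estimate
    flattened = luminance * (np.mean(shading) / np.maximum(shading, 1e-3))
    lab[..., 0] = np.clip(flattened, 0, 255)
    return cv2.cvtColor(lab.astype(np.uint8), cv2.COLOR_LAB2BGR)
```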
  • the skin recoloring 320 may include mask creation and body statistics.
  • There may be only one mask for the whole skin or separate masks for individual body parts. Also, different masks can be created for different scenes in the video (e.g., due to a significant illumination change). Masks may be created semi-automatically, e.g., by such technologies as keying, with some human guidance. Prepared masks may be merged into video assets and then used in the recoloring. Also, to avoid unnecessary computation, color statistics may be calculated for each mask in advance.
  • Statistics may include mean value, median value, standard deviation, and some percentiles for each color channel.
  • Statistics can be computed in the Red, Green, Blue (RGB) color space as well as in other color spaces (the Hue, Saturation, Value (HSV) color space, the CIELAB color space (also known as CIE L*a*b* or abbreviated as "LAB" color space), etc.).
  • the input of the mask creation process may include grayscale masks for body parts of an actor with uncovered skin in the form of videos or image sequences.
  • the output of the mask creation process may include masks compressed and merged to videos and color statistics per each mask.
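  • A sketch of the precomputed per-mask statistics is shown below: for the pixels selected by a grayscale mask, it gathers the mean, median, standard deviation, and a few percentiles per channel (here in RGB; the same can be done in HSV or CIELAB). The mask threshold and percentile choices are assumptions made for this example.

```python
import numpy as np


def mask_color_statistics(image_rgb: np.ndarray, mask: np.ndarray, threshold: int = 127) -> dict:
    """Per-channel color statistics over pixels selected by a grayscale mask."""
    selected = image_rgb[mask > threshold]          # shape (num_pixels, 3); assumes a non-empty mask
    stats = {}
    for channel, name in enumerate(("r", "g", "b")):
        values = selected[:, channel].astype(np.float32)
        stats[name] = {
            "mean": float(values.mean()),
            "median": float(np.median(values)),
            "std": float(values.std()),
            "percentiles": {p: float(np.percentile(values, p)) for p in (5, 25, 75, 95)},
        }
    return stats
```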
  • the skin recoloring 320 may further include facial statistics computation.
  • FIG. 8 shows a segmented head image 805, the segmented head image 805 with facial landmarks 810, and a facial mask 815, according to an example embodiment.
  • the facial mask 815 of the user may be created. Regions such as eyes, mouth, hair, or accessories (like glasses) may not be included in the facial mask 815.
  • the segmented head image 805 of the user and the facial mask may be used to compute the statistics for facial skin of the user.
  • the input of the facial statistics computation may include the segmented head image 805 of the user, facial landmarks 810, and facial segmentation
  • the output of the facial statistics computation may include color statistics for the facial skin of the user.
  • the skin recoloring 320 may further include skin-tone matching and recoloring.
  • FIG. 9 shows a frame 905 featuring a user face, a skin mask 910, and a result 915 of recoloring the skin mask 910, according to an example embodiment.
  • the skin-tone matching and recoloring may be performed using statistics that describe color distributions in the actor's skin and the user's skin, and recoloring of a background frame may be performed in real-time on a computing device. For each color channel, distribution matching may be performed and values of background pixels may be modified in order to make the distribution of transformed values close to the distribution of facial values. Distribution matching may be performed either under the assumption that the color distribution is normal or by applying other distribution-matching techniques.
  • the input of the skin-tone matching and recoloring process may include a background frame, actor skin masks for the frame, actor body skin color statistics for each mask, and user facial skin color statistics.
  • the output may include the background frame with all body parts with uncovered skin recolored.
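  • Under the assumption that both color distributions are approximately normal, distribution matching reduces to a per-channel shift and scale: each masked background pixel is standardized against the actor's statistics and re-expressed with the user's facial statistics. The sketch below illustrates that simplified case, reusing the statistics format from the previous sketch; it is not the exact procedure of the disclosure.

```python
import numpy as np


def recolor_skin(frame: np.ndarray,
                 skin_mask: np.ndarray,
                 actor_stats: dict,
                 user_stats: dict) -> np.ndarray:
    """Match masked skin pixels to the user's facial color distribution (normal assumption)."""
    out = frame.astype(np.float32)                  # frame in RGB order, matching the statistics above
    selected = skin_mask > 127
    for channel, name in enumerate(("r", "g", "b")):
        a_mean, a_std = actor_stats[name]["mean"], max(actor_stats[name]["std"], 1e-3)
        u_mean, u_std = user_stats[name]["mean"], user_stats[name]["std"]
        values = out[..., channel]
        values[selected] = (values[selected] - a_mean) / a_std * u_std + u_mean
    return np.clip(out, 0, 255).astype(np.uint8)
```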
  • a version of the personalized video that has the closest skin tone to the skin tone of the image of the user may be used.
  • a predetermined lookup table (LUT) may be used to adjust the color of the face to the illumination of a scene.
  • the LUT may also be used to change the color of the face, for example, to make the face green.
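  • A lookup table maps each input intensity to an output intensity per channel, and OpenCV's cv2.LUT applies such a table directly. The gamma-style tables below only illustrate the mechanism (for instance, biasing the green channel to tint the face); they are not a LUT shipped with any template.

```python
import cv2
import numpy as np


def apply_channel_luts(image_bgr: np.ndarray, gamma_bgr=(1.0, 0.8, 1.0)) -> np.ndarray:
    """Apply a separate 256-entry lookup table to each channel (illustrative gamma curves)."""
    channels = []
    for channel, gamma in zip(cv2.split(image_bgr), gamma_bgr):
        lut = np.array([((i / 255.0) ** gamma) * 255 for i in range(256)], dtype=np.uint8)
        channels.append(cv2.LUT(channel, lut))     # gamma < 1 brightens this channel
    return cv2.merge(channels)
```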
  • the lip synchronization and facial reenactment 325 may result in photorealistic face animation.
  • FIG. 10 shows an example process of the lip synchronization and facial reenactment 325.
  • FIG. 10 shows an image 1005 of a facesync actor face, an image 1010 of the facesync actor facial landmarks, an image 1015 of user facial landmarks, and an image 1020 of the user's face with the facial expression of the facesync actor, according to an example embodiment.
  • the steps of lip synchronization and facial reenactment 325 may include recording a facesync actor and pre-processing a source video/image to obtain the image 1005 of a facesync actor face. Then, the facial landmarks may be extracted as shown by the image 1010 of the facesync actor facial landmarks. This step also may include gaze tracking of the facesync actor.
  • previously prepared animated 2D or 3D face and mouth region models may be used instead of recording a facesync actor.
  • the animated 2D or 3D face and mouth region models may be generated by machine learning techniques.
  • fine tuning of the facial landmarks may be performed.
  • the fine tuning of the facial landmarks is performed manually. These steps can be performed in a cloud when preparing the video configuration file. In some example embodiments, these steps may be performed during the resource preparation 315.
  • the user's facial landmarks may be extracted as shown by the image 1015 of the user's facial landmarks.
  • the next step of the synchronization and facial reenactment 325 may include animation of the target image with extracted landmarks to obtain the image 1020 of the user's face with the facial expression of the facesync actor. This step may be performed on a computing device based on an image of a face of the user.
  • the method of animation is described in detail in U.S. patent application No. 16/251,472, the disclosure of which is incorporated herein by reference in its entirety.
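  • Landmark extraction itself can be performed with an off-the-shelf detector. The sketch below uses dlib's 68-point shape predictor purely as an example; the disclosure does not prescribe a particular detector, and the model file path is an assumption.

```python
import dlib
import numpy as np

# Assumes the standard dlib 68-point model file has been downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")


def extract_landmarks(image_rgb: np.ndarray) -> np.ndarray:
    """Return (68, 2) landmark coordinates for the first detected face, or an empty array."""
    faces = detector(image_rgb, 1)
    if not faces:
        return np.empty((0, 2), dtype=np.int32)
    shape = predictor(image_rgb, faces[0])
    return np.array([(point.x, point.y) for point in shape.parts()], dtype=np.int32)
```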
  • the lip synchronization and facial reenactment 325 can also be enriched with Artificial Intelligence-made head turns.
  • a 3D model of the user's head may be created.
  • the step of lip synchronization and facial reenactment 325 may be omitted.
  • the hair animation 330 may be performed to animate hair of the user. For example, if the user has hair, the hair may be animated when the user moves or rotates his head.
  • the hair animation 330 is shown in FIG. 11.
  • FIG. 11 shows a segmented face image 1105, a hair mask 1110, a hair mask warped to the face image 1115, and the hair mask applied to the face image 1120, according to an example embodiment.
  • the hair animation 330 may include one or more of the following steps: classifying the hair type, modifying appearance of the hair, modifying a hair style, making the hair longer, changing the color of the hair, cutting and animating the hair, and so forth. As shown in FIG.
  • a face image in the form of a segmented face image 1105 may be obtained. Then, a hair mask 1110 may be applied to the segmented face image 1105. The image 1115 shows the hair mask 1110 warped to the face image. The image 1120 shows the hair mask 1110 applied to the face image.
  • the hair animation 330 is described in detail in the US patent application No. 16/551,756, the disclosure of which is incorporated herein by reference in its entirety.
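  • One simple way to warp a hair mask toward a target face, assumed here purely for illustration, is to estimate a similarity transform between corresponding facial landmarks and apply it to the mask with OpenCV; the referenced application describes the full hair animation approach.

```python
import cv2
import numpy as np


def warp_hair_mask(hair_mask: np.ndarray,
                   source_landmarks: np.ndarray,
                   target_landmarks: np.ndarray,
                   target_size: tuple) -> np.ndarray:
    """Warp a hair mask from the source face to the target face with a similarity transform."""
    matrix, _ = cv2.estimateAffinePartial2D(
        source_landmarks.astype(np.float32), target_landmarks.astype(np.float32))
    height, width = target_size
    return cv2.warpAffine(hair_mask, matrix, (width, height), flags=cv2.INTER_LINEAR)
```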
  • the eyes animation 335 may allow making the facial expressions of the user more realistic.
  • the eyes animation 335 is shown in detail in FIG. 12.
  • the process of eyes animation 335 may consist of the following steps: reconstruction of an eye region of the user face, gaze movement step, and eye blinking step.
  • the eye region is segmented into parts: eyeball, iris, pupil, eyelashes, and eyelid. If some part of the eye region (e.g., iris or eyelid) is not fully visible, the full texture of this part may be synthesized.
  • a 3D morphable model of an eye may be fitted, and a 3D-shape of the eye may be obtained together with the texture of the eye.
  • FIG. 12 shows an original image 1205 of an eye, an image 1210 with the reconstructed sclera of the eye, and an image 1215 with the reconstructed iris.
  • the gaze movement step includes tracking a gaze direction and pupil position in a video of a facesync actor. This data may be manually edited if the eye movements of the facesync actor are not rich enough. Gaze movements may then be transferred to the eye region of the user by synthesizing a new eye image with a transformed eye shape and the same position of the iris as that of the facesync actor.
  • FIG. 12 shows an image 1220 with the reconstructed moved iris.
  • the visible part of the eye of the user may be determined by tracking the eyes of the facesync actor.
  • a changed appearance of eyelids and eyelashes may be generated based on the reconstruction of eye region.
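  • As a toy sketch of the gaze movement step, assuming the sclera and iris have already been reconstructed as separate layers (as in images 1210 and 1215), the facesync actor's iris position can be mapped into the user's eye region and the iris layer composited at the new position. The names and the simple translation model below are illustrative assumptions, not the procedure of the disclosure.

```python
import numpy as np


def transfer_gaze_offset(actor_iris_center: np.ndarray,
                         actor_eye_box: tuple,
                         user_eye_box: tuple) -> np.ndarray:
    """Map the facesync actor's iris position into the user's eye region via normalized coordinates."""
    ax, ay, aw, ah = actor_eye_box
    ux, uy, uw, uh = user_eye_box
    normalized = (actor_iris_center - np.array([ax, ay])) / np.array([aw, ah])
    return np.array([ux, uy]) + normalized * np.array([uw, uh])


def move_iris(eye_image: np.ndarray, iris_layer: np.ndarray,
              iris_mask: np.ndarray, shift: tuple) -> np.ndarray:
    """Composite a pre-reconstructed iris layer at a new position inside the eye image."""
    dy, dx = int(shift[0]), int(shift[1])
    shifted_iris = np.roll(np.roll(iris_layer, dy, axis=0), dx, axis=1)
    shifted_mask = np.roll(np.roll(iris_mask, dy, axis=0), dx, axis=1) > 127
    out = eye_image.copy()
    out[shifted_mask] = shifted_iris[shifted_mask]
    return out
```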
  • the steps of the eyes animation 335 may be done either explicitly (as described) or implicitly if face reenactment is done using generative adversarial networks (GAN).
  • the neural network may implicitly capture all the necessary information from the image of the user face and the source video.
  • the user face may be photorealistically animated and automatically inserted in footage templates.
  • the files from the previous steps may be used as data for a configuration file. Examples of personalized videos with a predefined set of user faces can be generated for initial review. After the issues that were identified during the review are eliminated, the personalized video may be deployed.
  • the configuration file may also include a component that allows the text parameters for customized personalized videos to be indicated.
  • a customized personalized video is a type of a personalized video that allows users to add any text the users want on top of the final video. The generating of personalized videos with customized text messages is described in more detail in U.S. patent application No. 16/661,122 dated October 23, 2019, titled "SYSTEMS AND METHODS FOR
  • the generation of the personalized videos may further include the steps of generating significant head turns of a user's head; body animation and changing clothes; facial augmentations such as hairstyle changing, beautification, adding accessories, and the like; changing the scene illumination; synthesizing a voice that may read/sing the text that the user has typed or changing the voice-over tone to match the voice of the user; gender switching; construction of a background and a foreground depending on the user input; and so forth.
  • FIG. 4 is a schematic showing functionality 400 of the system 250 for template-based generation of the personalized videos, according to some example embodiments.
  • the system 250 may receive an image of a source face shown as a user face image 405 and a video template including video configuration data 410.
  • the video configuration data 410 may include data sequences 420.
  • the video configuration data 410 may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images.
  • Each of the facial landmark parameters may correspond to a facial expression.
  • the sequence of frame images may be generated based on an animation video or based on a live action video.
  • the sequence of facial landmark parameters may be generated based on a live action video featuring a face of a facesync actor.
  • the video configuration data 410 may further include a skin mask, eye parameters, a mouth region image, head parameters, animated object images, preset text parameters, and so forth.
  • the video configuration data may include a sequence of skin masks defining a skin area of a body of at least one actor featured in the frame images.
  • the video configuration data 410 may further include a sequence of mouth region images. Each of the mouth region images may correspond to at least one of the frame images.
  • the video configuration data 410 may include a sequence of eye parameters defining positions of an iris in a sclera of a facesync actor featured in the frame images and/or a sequence of head parameters defining a rotation, a turn, a scale, and other parameters of a head.
  • the video configuration data 410 may further include a sequence of animated object images. Each of the animated object images may correspond to at least one of the frame images.
  • the video configuration data 410 may further include a soundtrack 450.
  • the system 250 may determine, based on the user face image 405, user data 435.
  • the user data may include user facial landmarks, a user face mask, user color data, a user hair mask, and so forth.
  • the system 250 may generate, based on the user data 435 and the data sequences 420, frames 445 of an output video shown as a personalized video 440.
  • the system 250 may further add the soundtrack to the personalized video 440.
  • the personalized video 440 may be generated by modifying a frame image of the sequence of frame images.
  • the modifying of the frame image may include modifying the user face image 405 to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters.
  • the modification may be performed based on facial landmark parameters corresponding to the frame image.
  • the further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.
  • the generation of the output video may further include determining color data associated with the source face and, based on the color data, recoloring the skin area in the frame image.
  • the generation of the output video may further include determining a hair mask based on the source face image, generating a hair image based on the hair mask and head parameters corresponding to the frame image, and inserting the hair image into the frame image. Additionally, the generation of the output video may include inserting, into the frame image, an animated object image corresponding to the frame image.
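  • Putting the pieces together, the per-frame generation described above can be sketched as a loop over the configuration sequences. In the outline below, reenact_face stands in for the facial reenactment step and is hypothetical, while recolor_skin and insert_face refer to the earlier sketches; this is an illustration under those assumptions, not the exact implementation of the system 250.

```python
def generate_personalized_video(config, user_face, actor_stats, user_stats, reenact_face):
    """Build output frames from video configuration data and an image of the source face."""
    output_frames = []
    for frame_cfg in config.frames:
        frame = frame_cfg.frame_image.copy()
        # Recolor the actor's uncovered skin to match the user's facial skin tone.
        if frame_cfg.skin_mask is not None:
            frame = recolor_skin(frame, frame_cfg.skin_mask, actor_stats, user_stats)
        # Re-render the source face with the expression encoded by this frame's landmarks.
        reenacted_face = reenact_face(user_face, frame_cfg.facial_landmarks)
        # Insert the re-rendered face at the position given by the face area parameters.
        frame = insert_face(frame, reenacted_face, frame_cfg.face_area)
        output_frames.append(frame)
    return output_frames
```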
  • FIGs. 13-14 show frames of example personalized videos generated based on video templates, according to some example embodiments.
  • FIG. 13 shows a filmed personalized video 1305 with an actor, in which the recoloring was performed.
  • FIG. 13 further shows a personalized video 1310 created based on a stock video obtained from a third party.
  • a user face 1320 is inserted into the stock video.
  • FIG. 13 further shows a personalized video 1315, which is a 2D animation with a user head 1325 added on top of the 2D animation.
  • FIG. 14 shows a personalized video 1405, which is a 3D animation with a user face 1415 inserted into the 3D animation.
  • FIG. 14 further shows a personalized video 1410 with effects, animated elements 1420, and, optionally, text added on top of the image of the user face.
  • FIG. 15 is a flow chart showing a method 1500 for template-based generation of personalized videos, according to some example embodiments of the disclosure.
  • the method 1500 can be performed by the computing device 105.
  • the method 1500 may commence with receiving video configuration data at step 1505.
  • the video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images.
  • Each of the facial landmark parameters may correspond to a facial expression.
  • the sequence of frame images may be generated based on an animation video or based on a live action video.
  • the sequence of facial landmark parameters may be generated based on a live action video featuring a face of a facesync actor.
  • the video configuration data may include one or more of the following: a sequence of skin masks defining a skin area of a body of at least one actor featured in the frame images, a sequence of mouth region images where each of the mouth region images corresponds to at least one of the frame images, a sequence of eye parameters defining positions of an iris in a sclera of a facesync actor featured in the frame images, a sequence of head parameters defining a rotation, a scale, a turn, and other parameters of a head, a sequence of animated object images, wherein each of the animated object images corresponds to at least one of the frame images, and so forth.
  • the method 1500 may continue with receiving an image of a source face at step 1510.
  • the method 1500 may further include generating an output video at step 1515.
  • the generation of the output video may include modifying a frame image of the sequence of frame images.
  • the frame image may be modified by modifying the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters.
  • the image of the source face may be modified based on facial landmark parameters corresponding to the frame image.
  • the further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.
  • the generation of the output video may further optionally include one or more of the following steps: determining color data associated with the source face and recoloring the skin area in the frame image based on the color data, inserting a mouth region corresponding to the frame image into the frame image, generating an image of eyes region based on the eye parameters corresponding to the frame, inserting the image of the eyes region in the frame image, determining a hair mask based on the source face image and generating a hair image based on the hair mask and head parameters corresponding to the frame image, inserting the hair image into the frame image, and inserting an animated object image corresponding to the frame image into the frame image.
  • FIG. 16 illustrates an example computing system 1600 that can be used to implement methods described herein.
  • the computing system 1600 can be implemented in the contexts of the likes of computing devices 105 and 110, the messenger services system 130, the messenger 220, and the system 250 for template-based generation of personalized videos.
  • the computing system 1600 may include one or more processors 1610 and memory 1620.
  • Memory 1620 stores, in part, instructions and data for execution by processor 1610.
  • Memory 1620 can store the executable code when the system 1600 is in operation.
  • the system 1600 may further include an optional mass storage device 1630, optional portable storage medium drive(s) 1640, one or more optional output devices 1650, one or more optional input devices 1660, an optional network interface 1670, and one or more optional peripheral devices 1680.
  • the computing system 1600 can also include one or more software components 1695 (e.g., ones that can implement the method for template-based generation of personalized videos as described herein).
  • The components shown in FIG. 16 are depicted as being connected via a single bus 1690.
  • the components may be connected through one or more data transport means or data network.
  • the processor 1610 and memory 1620 may be connected via a local microprocessor bus, and the mass storage device 1630, peripheral device(s) 1680, portable storage device 1640, and network interface 1670 may be connected via one or more input/output (I/O) buses.
  • the mass storage device 1630 which may be implemented with a magnetic disk drive, solid-state disk drive, or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by the processor 1610.
  • Mass storage device 1630 can store the system software (e.g., software components 1695) for implementing embodiments described herein.
  • Portable storage medium drive(s) 1640 operates in conjunction with a portable non-volatile storage medium, such as a compact disk (CD), or digital video disc (DVD), to input and output data and code to and from the computing system 1600.
  • the system software (e.g., software components 1695) for implementing embodiments described herein may be stored on such a portable medium and input to the computing system 1600 via the portable storage medium drive(s) 1640.
  • the optional input devices 1660 provide a portion of a user interface.
  • the input devices 1660 may include an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, a stylus, or cursor direction keys.
  • the input devices 1660 can also include a camera or scanner.
  • the system 1600 as shown in FIG. 16 includes optional output devices 1650. Suitable output devices include speakers, printers, network interfaces, and monitors.
  • the network interface 1670 can be utilized to communicate with external devices, external computing devices, servers, and networked systems via one or more communications networks such as one or more wired, wireless, or optical networks including, for example, the Internet, intranet, LAN, WAN, cellular phone networks, Bluetooth radio, and an IEEE 802.11-based radio frequency network, among others.
  • the network interface 1670 may be a network interface card, such as an Ethernet card, optical transceiver, radio frequency transceiver, or any other type of device that can send and receive information.
  • the optional peripherals 1680 may include any type of computer support device to add additional functionality to the computer system.
  • the components contained in the computing system 1600 are intended to represent a broad category of computer components.
  • the computing system 1600 can be a server, personal computer, hand-held computing device, telephone, mobile computing device, workstation, minicomputer, mainframe computer, network node, or any other computing device.
  • the computing system 1600 can also include different bus configurations, networked platforms, multi-processor platforms, and so forth.
  • Various operating systems (OS) can be used including UNIX, Linux, Windows, Macintosh OS, Palm OS, and other suitable operating systems.
  • Some of the above-described functions may be composed of instructions that are stored on storage media (e.g., computer-readable medium or processor- readable medium).
  • the instructions may be retrieved and executed by the processor.
  • Some examples of storage media are memory devices, tapes, disks, and the like.
  • the instructions are operational when executed by the processor to direct the processor to operate in accord with the invention.
  • Those skilled in the art are familiar with instructions, processor(s), and storage media.
  • Non-volatile media include, for example, optical or magnetic disks, such as a fixed disk.
  • Volatile media include dynamic memory, such as system random access memory (RAM).
  • Transmission media include coaxial cables, copper wire, and fiber optics, among others, including the wires that include one embodiment of a bus. Transmission media can also take the form of acoustic or light waves, such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-read-only memory (ROM) disk, DVD, any other optical medium, any other physical medium with patterns of marks or holes, a RAM, a PROM, an EPROM, an EEPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
  • Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
  • a bus carries the data to system RAM, from which a processor retrieves and executes the instructions.
  • the instructions received by the system processor can optionally be stored on a fixed disk either before or after execution by a processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Business, Economics & Management (AREA)
  • Processing Or Creating Images (AREA)
  • Information Transfer Between Computers (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Operations Research (AREA)

Abstract

Disclosed are systems and methods for template-based generation of personalized videos. An example method may commence with receiving video configuration data including a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. The method may continue with receiving an image of a source face. The method may further include generating an output video. The generation of the output video may include modifying a frame image of the sequence of frame images. Specifically, the image of the source face may be modified to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters. The further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.

Description

SYSTEMS AND METHODS FOR TEMPLATE-BASED GENERATION
OF PERSONALIZED VIDEOS
TECHNICAL FIELD
[0001] This disclosure generally relates to digital image processing. More particularly, this disclosure relates to methods and systems for template-based generation of personalized videos.
BACKGROUND
[0002] Sharing media, such as stickers and emojis, has become a standard option in messaging applications (also referred herein to as messengers). Currently, some of the messengers provide users with an option for generating and sending images and short videos to other users via a communication chat. Certain existing messengers allow users to modify the short videos prior to transmission. However, the
modifications of the short videos provided by the existing messengers are limited to visualization effects, filters, and texts. The users of the current messengers cannot perform complex editing, such as, for example, replacing one face with another face. Such editing of the videos is not provided by current messengers and requires sophisticated third-party video editing software.
SUMMARY
[0003] This section is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0004] According to one embodiment of the disclosure, a system for template-based generation of personalized videos is disclosed. The system may include at least one processor and a memory storing processor-executable codes. The at least one processor may be configured to receive, by a computing device, video configuration data. The video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. Each of the facial landmark parameters may correspond to a facial expression. The at least one processor may be configured to receive, by the computer device, an image of a source face. The at least one processor may be configured to generate, by the computing device, an output video. The generation of the output video may include modifying a frame image of the sequence of frame images. Specifically, the image of the source face may be modified based on facial landmark parameters corresponding to the frame image to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters. The further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.
[0005] According to one example embodiment, a method for template-based generation of personalized videos is disclosed. The method may commence with receiving, by a computing device, video configuration data. The video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. Each of the facial landmark parameters may correspond to a facial expression. The method may continue with receiving, by the computing device, an image of a source face. The method may further include generating, by the computing device, an output video. The generation of the output video may include modifying a frame image of the sequence of frame images. Specifically, the image of the source face may be modified to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters. The modification of the image may be performed based on facial landmark parameters corresponding to the frame image. The further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image.
[0006] According to yet another aspect of the disclosure, there is provided a non-transitory processor-readable medium, which stores processor-readable instructions. When the processor-readable instructions are executed by a processor, they cause the processor to implement the above-mentioned method for template-based generation of personalized videos.
[0007] Additional objects, advantages, and novel features of the examples will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following description and the
accompanying drawings or may be learned by production or operation of the examples. The objects and advantages of the concepts may be realized and attained by means of the methodologies, instrumentalities and combinations particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
[0009] FIG. 1 is a block diagram showing an example environment wherein systems and methods for template-based generation of personalized videos can be implemented.
[0010] FIG. 2 is a block diagram showing an example embodiment of a computing device for implementing methods for template-based generation of personalized videos.
[0011] FIG. 3 is a flow chart showing a process for template-based generation of personalized videos, according to some example embodiments of the disclosure.
[0012] FIG. 4 is a flow chart showing functionality of a system for template-based generation of the personalized videos, according to some example embodiments of the disclosure.
[0013] FIG. 5 is a flow chart showing a process of generation of live action videos for use in the generation of video templates, according to some example embodiments.
[0014] FIG. 6 shows frames of example live action videos for generating video templates, according to some example embodiments.
[0015] FIG. 7 shows an original image of a face and an image of the face with normalized illumination, according to an example embodiment.
[0016] FIG. 8 shows a segmented head image, the head image with facial landmarks, and a facial mask, according to an example embodiment.
[0017] FIG. 9 shows a frame featuring a user face, a skin mask, and a result of recoloring the skin mask, according to an example embodiment.
[0018] FIG. 10 shows an image of a face of a face synchronization actor, an image of the face synchronization actor's facial landmarks, an image of a user's facial landmarks, and an image of the user's face with the facial expression of the face synchronization actor, according to an example embodiment.
[0019] FIG. 11 shows a segmented face image, a hair mask, a hair mask warped to a target image, and the hair mask applied to the target image, according to an example embodiment.
[0020] FIG. 12 shows an original image of an eye, an image with reconstructed sclera of the eye, an image with reconstructed iris, and an image with reconstructed moved iris, according to an example embodiment.
[0021] FIGs. 13-14 show frames of example personalized video generated based on video templates, according to some example embodiments.
[0022] FIG. 15 is a flow chart showing a method for template-based generation of personalized videos, in accordance with an example embodiment.
[0023] FIG. 16 shows an example computer system that can be used to
implement methods for template-based generation of personalized videos.
DETAILED DESCRIPTION
[0024] The following detailed description of embodiments includes references to the accompanying drawings, which form a part of the detailed description. Approaches described in this section are not prior art to the claims and are not admitted prior art by inclusion in this section. The drawings show illustrations in accordance with example embodiments. These example embodiments, which are also referred to herein as "examples," are described in enough detail to enable those skilled in the art to practice the present subject matter. The embodiments can be combined, other embodiments can be utilized, or structural, logical and operational changes can be made without departing from the scope of what is claimed. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope is defined by the appended claims and their equivalents.
[0025] For purposes of this patent document, the terms "or" and "and" shall mean "and/or" unless stated otherwise or clearly intended otherwise by the context of their use. The term "a" shall mean "one or more" unless stated otherwise or where the use of "one or more" is clearly inappropriate. The terms "comprise," "comprising," "include," and "including" are interchangeable and not intended to be limiting. For example, the term "including" shall be interpreted to mean "including, but not limited to."
[0026] This disclosure relates to methods and systems for template-based generation of personalized videos. The embodiments provided in this disclosure solve at least some issues of known art. The present disclosure can be designed to work on mobile devices, such as smartphones, tablet computers, or mobile phones, in real-time, although the embodiments can be extended to approaches involving a web service or a cloud-based resource. Methods described herein can be implemented by software running on a computer system and/or by hardware utilizing either a combination of microprocessors or other specifically designed application-specific integrated circuits (ASICs), programmable logic devices, or any combinations thereof. In particular, the methods described herein can be implemented by a series of computer-executable instructions residing on a non-transitory storage medium such as a disk drive or computer-readable medium.
[0027] Some embodiments of the disclosure may allow generating personalized videos in real time on a user computing device, such as a smartphone. A personalized video may be generated in the form of an audiovisual media (e.g., a video, an animation, or any other type of media) that features a face of a user or faces of multiple users. The personalized videos can be generated based on pre-generated video templates. A video template may include video configuration data. The video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. Each of the facial landmark parameters may correspond to a facial expression. The frame images can be generated based on an animation video or a live action video. The facial landmark parameters can be generated based on another live action video featuring a face of an actor (also called a face synchronization (facesync) actor, as described in more detail below), an animation video, an audio file, text, or manually.
[0028] The video configuration file may also include a sequence of skin masks.
The skin masks may define a skin area of a body of an actor featured in the frame images or a skin area of 2D/3D animation of a body. In an example embodiment, the skin mask and the facial landmark parameters can be generated based on two different live action videos capturing different actors (referred to herein as an actor and a facesync actor, respectively). The video configuration data may further include a sequence of mouth region images and a sequence of eye parameters. The eye parameters may define positions of an iris in a sclera of a facesync actor featured in the frame images. The video configuration data may include a sequence of head parameters defining a rotation and a turn of a head, a position, a scale, and other parameters of the head. A user may keep his head still when taking an image and look directly at the camera; therefore, the scale and rotations of the head may be adjusted manually. The head parameters can be transferred from a different actor (also referred to herein as a facesync actor). As used herein, a facesync actor is a person whose facial landmark parameters are being used, an actor is another person whose body is being used in a video template and whose skin may be recolored, and a user is a person who takes an image of his/her face to generate a personalized video. Thus, in some embodiments, the personalized video includes the face of the user modified to have facial expressions of the facesync actor and includes the body of the actor taken from the video template and recolored to match the color of the face of the user. The video configuration data may also include a sequence of animated object images. Optionally, the video configuration data may include a soundtrack and/or voice.
[0029] The pre-generated video templates can be stored remotely in a cloud-based computing resource and can be downloadable by a user of a computing device (such as a smartphone). The user of the computing device may capture, by the computing device, an image of a face or select an image of the face from a camera roll, from a prepared collection of images, or via a web link. In some embodiments, the image may include an animal instead of a face of a person or may be in the form of a drawn picture. Based on the image of the face and one of the pre-generated video templates, the computing device may further generate a personalized video. The user may send the personalized video, via a communication chat, to another user of another computing device, share it on social media, download it to a local storage of a computing device, or upload it to a cloud storage or a video sharing service.
[0030] According to one embodiment of the disclosure, an example method for template-based generation of personalized videos may include receiving, by a computing device, video configuration data. The video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. Each of the facial landmark parameters may correspond to a facial expression of a facesync actor. The method may continue with receiving an image of a source face and generating, by the computing device, an output video. The generation of the output video may include modifying a frame image of the sequence of frame images. The modification of the frame image may include modifying the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters and inserting the further image into the frame image at a position determined by face area parameters corresponding to the frame image. Additionally, the source face may be modified, e.g., by changing color, making eyes bigger, and so forth. The image of the source face may be modified based on facial landmark parameters corresponding to the frame image.
[0031] Referring now to the drawings, example embodiments are described. The drawings are schematic illustrations of idealized example embodiments. Thus, the example embodiments discussed herein should not be understood as limited to the particular illustrations presented herein; rather, these example embodiments can include deviations and differ from the illustrations presented herein as shall be evident to those skilled in the art.
[0032] FIG. 1 shows an example environment 100, in which a system and a method for template-based generation of personalized videos can be implemented. The environment 100 may include a computing device 105, a user 102, a computing device 110, a user 104, a network 120, and a messenger services system 130. The computing device 105 and computing device 110 can refer to a mobile device such as a mobile phone, smartphone, or tablet computer. In further embodiments, the computing device 110 can refer to a personal computer, laptop computer, netbook, set top box, television device, multimedia device, personal digital assistant, game console, entertainment system, infotainment system, vehicle computer, or any other computing device.
[0033] The computing device 105 and the computing device 110 can be communicatively connected to the messenger services system 130 via the network 120. The messenger services system 130 can be implemented as cloud-based computing resource(s). The messenger services system 130 can include computing resource(s) (hardware and software) available at a remote location and accessible over a network (e.g., the Internet). The cloud-based computing resource(s) can be shared by multiple users and can be dynamically re-allocated based on demand. The cloud-based computing resources can include one or more server farms/clusters including a collection of computer servers which can be co-located with network switches and/or routers.
[0034] The network 120 may include any wired, wireless, or optical networks including, for example, the Internet, intranet, local area network (LAN), Personal Area Network (PAN), Wide Area Network (WAN), Virtual Private Network (VPN), cellular phone networks (e.g., a Global System for Mobile (GSM) communications network), and so forth.
[0035] In some embodiments of the disclosure, the computing device 105 can be configured to enable a communication chat between the user 102 and the user 104 of the computing device 110. During the communication chat, the user 102 and the user 104 may exchange text messages and videos. The videos may include personalized videos. The personalized videos can be generated based on pre-generated video templates stored in the computing device 105 or the computing device 110. In some embodiments, the pre-generated video templates can be stored in the messenger services system 130 and downloaded to the computing device 105 or the computing device 110 on demand.
[0036] The messenger services system 130 may include a system 140 for pre-processing videos. The system 140 may generate video templates based on animation videos or live action videos. The messenger services system 130 may include a video templates database 145 for storing the video templates. The video templates can be downloaded to the computing device 105 or the computing device 110.
[0037] The messenger services system 130 may also be configured to store user profiles 135. The user profiles 135 may include images of the face of the user 102, images of the face of the user 104, and images of faces of other persons. The images of the faces can be downloaded to the computing device 105 or the computing device 110 on demand and based on permissions. Additionally, the images of the face of the user 102 can be generated using the computing device 105 and stored in a local memory of the computing device 105. The images of the faces can be generated based on other images stored in the computing device 105. The images of the faces can be further used by the computing device 105 to generate personalized videos based on the pre-generated video templates. Similarly, the computing device 110 may be used to generate images of the face of the user 104. The images of the face of the user 104 can be used to generate personalized videos on the computing device 110. In further embodiments, the images of the face of the user 102 and the images of the face of the user 104 can be mutually used to generate personalized videos on the computing device 105 or the computing device 110.
[0038] FIG. 2 is a block diagram showing an example embodiment of a computing device 105 (or computing device 110) for implementing methods for template-based generation of personalized videos. In the example shown in FIG. 2, the computing device 105 includes both hardware components and software components. Particularly, the computing device 105 includes a camera 205 or any other image-capturing device or scanner to acquire digital images. The computing device 105 can further include a processor module 210 and a storage module 215 for storing software components and processor-readable (machine-readable) instructions or codes, which when performed by the processor module 210 cause the computing device 105 to perform at least some steps of methods for template-based generation of personalized videos as described herein. The computing device 105 may include a graphical display system 230 and a communication module 240. In other embodiments, the computing device 105 may include additional or different components. Moreover, the computing device 105 can include fewer components that perform functions similar or equivalent to those depicted in FIG. 2.
[0039] The computing device 105 can further include a messenger 220 for enabling communication chats with another computing device (such as the computing device 110) and a system 250 for template-based generation of personalized videos. The system 250 is described in more detail below with reference to FIG. 4. The messenger 220 and the system 250 may be implemented as software components and processor-readable (machine-readable) instructions or codes stored in the storage module 215, which when performed by the processor module 210 cause the computing device 105 to perform at least some steps of methods for providing communication chats and generation of personalized videos as described herein.
[0040] In some embodiments, the system 250 for template-based generation of personalized videos can be integrated in the messenger 220. A user interface of the messenger 220 and the system 250 for template-based personalized videos can be provided via the graphical display system 230. The communication chats can be enabled via the communication module 240 and the network 120. The communication module 240 may include a GSM module, a WiFi module, a Bluetooth™ module, and so forth.
[0041] FIG. 3 is a flow chart showing steps of a process 300 for template-based generation of personalized videos, according to some example embodiments of the disclosure. The process 300 may include production 305, post-production 310, resources preparation 315, skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, eyes animation 335, and deploy 340. The resources preparation 315 can be performed by the system 140 for pre-processing videos in the messenger services system 130 (shown in FIG. 1). The resources preparation 315 results in generating video templates that may include video configuration data.
[0042] The skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, eyes animation 335, and deploy 340 can be performed by the system 250 for template-based generation of personalized videos in computing device 105 (shown in FIG. 2). The system 250 may receive an image of the user's face and video configuration data and generate a personalized video featuring the user's face.
[0043] The skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, eyes animation 335, and deploy 340 can also be performed by the system 140 for pre-processing videos in the messenger services system 130. The system 140 can receive test images of user faces and a video configuration file. The system 140 may generate test personalized videos featuring the user faces. The test personalized videos can be reviewed by an operator. Based on a result of the review, the video
configuration file can be stored in the video templates database 145 and can then be downloaded to the computing device 105 or computing device 110.
[0044] The production 305 may include idea and scenario creation, pre- production during which a location, props, actors, costumes and effects are identified, and production itself, which can require one or more recording sessions. In some example embodiments, the recording may be performed by recording a scene/actor on a chroma key background, also referred herein to as a green screen or chroma key screen. To allow the subsequent head tracking and resources clean-up, the actors may wear chroma key face masks (e.g., balaclavas) with tracking marks that cover the face of the actors, but leave the neck and the bottom of the chin open. The idea and scenario creation are shown in detail in FIG. 5.
[0045] In an example embodiment, the steps of pre-production and subsequent production 305 are optional. Instead of recording an actor, a 2D or 3D animation may be created, or third-party footage/images may be used. Furthermore, an original background of the image of the user may be used.
[0046] FIG. 5 is a block diagram showing a process 500 of generating live action videos. The live action videos can be further used to generate video templates for the generation of personalized videos. The process 500 may include generating an idea at step 505 and creating a scenario at step 510. The process 500 may continue with pre-production at step 515, which is followed by production 305. The production 305 may include recording using a chroma key screen 525 or at a real life location 530.
[0047] FIG. 6 shows frames of example live action videos for generating video templates. Frames for video 605 and video 615 are recorded at a real life location 530. Frames for video 610, video 620, and video 625 are recorded using a chroma key screen 525. The actors may wear chroma key face masks 630 with tracking marks that cover the face of the actors.
[0048] The post-production 310 may include video editing or animation, visual effects, clean-up, sound design, and voice-over recording.
[0049] During the resources preparation 315, the resources prepared for further deployment may include the following components: a background footage without a head of an actor (i.e., preparing a cleaned-up background where the head of the actor is removed); a footage with an actor on a black background (only for recorded
personalized videos); a foreground sequence of frames; an example footage with a generic head and soundtrack; coordinates for head position, rotation, and scale;
animated elements that are attached to the head (optional); soundtracks with and without a voice-over; a voice-over in a separate file (optional); and so forth. All of these components are optional and may be rendered in different formats. The number and configuration of the components depend on the format of the personalized video. For example, a voice-over is not needed for customized personalized videos, background footages and head coordinates are not needed if the original background from a picture of the user is used, and so forth. In an example embodiment, the area where the face needs to be located may be indicated (e.g., manually) instead of preparing a file with coordinates.
[0050] The skin recoloring 320 allows matching the color of the skin of the actor in the personalized video to the color of the face in the image of the user. To implement this step, skin masks that indicate specifically which part of the background has to be recolored may be prepared. It may be preferable to have a separate mask for each body part of the actor (neck, left and right hands, etc.).
[0051] The skin recoloring 320 may include facial image illumination
normalization. FIG. 7 shows an original image 705 of a face and an image 710 of the face with normalized illumination, according to an example embodiment. Shadows or highlights caused by uneven illumination affect color distribution and may lead to a skin tone that is too dark or too light after recoloring. To avoid this, shadows and highlights in the face of the user may be detected and removed. The facial image illumination normalization process includes the following steps. An image of a face of the user may be transformed using a deep convolutional neural network. The network may receive an original image 705 in the form of a portrait image taken under arbitrary illumination and change the illumination of the original image 705 to make the original image 705 evenly illuminated while keeping the subject in the original image 705 the same. Thus, the input of the facial image illumination normalization process includes the original image 705 in the form of the image of the face of the user and facial landmarks. The output of the facial image illumination normalization process includes the image 710 of the face with normalized illumination.
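For illustration only, the following minimal sketch approximates the illumination normalization described above with a simple classical operation (dividing the face by a low-frequency luminance estimate) rather than the deep convolutional network of the disclosure; the function name, parameter values, and the use of NumPy/SciPy are assumptions.

```python
# Illustrative stand-in for illumination normalization (not the disclosed CNN):
# divide the face by a blurred luminance estimate to flatten shadows and highlights.
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_illumination(face_rgb: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """face_rgb: float image in [0, 1], shape (H, W, 3); sigma: blur radius in pixels."""
    luminance = face_rgb.mean(axis=2, keepdims=True)                 # crude luminance estimate
    low_freq = gaussian_filter(luminance, sigma=(sigma, sigma, 0))   # slow illumination component
    gain = np.mean(low_freq) / np.maximum(low_freq, 1e-3)            # brighten shadows, dim highlights
    return np.clip(face_rgb * gain, 0.0, 1.0)
```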
[0052] The skin recoloring 320 may include mask creation and body statistics.
There may be only one mask for the whole skin or separate masks for individual body parts. Also, different masks can be created for different scenes in the video (e.g., due to significant illumination change). Masks may be created semi-automatically, e.g., by such technologies as keying, with some human guidance. Prepared masks may be merged into video assets and then used in the recoloring. Also, to avoid unnecessary
computations in real-time, color statistics may be calculated for each mask in advance. Statistics may include mean value, median value, standard deviation, and some percentiles for each color channel. Statistics can be computed in Red, Green, Blue (RGB) color space as well as in the other color spaces (Hue, Saturation, Value (HSV) color space, CIELAB color space (also known as CIE L*a*b* or abbreviated as "LAB" color space), etc.). The input of the mask creation process may include grayscale masks for body parts of an actor with uncovered skin in the form of videos or image sequences. The output of the mask creation process may include masks compressed and merged to videos and color statistics per each mask.
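As an illustration of precomputing per-mask color statistics, the sketch below gathers the mean, median, standard deviation, and selected percentiles per channel for the pixels selected by a grayscale skin mask. The function name and the choice of percentiles are assumptions, and the same statistics could equally be computed in HSV or CIELAB space.

```python
# Minimal sketch: per-mask color statistics computed in advance for one frame.
import numpy as np

def mask_color_stats(frame_rgb: np.ndarray, mask: np.ndarray, percentiles=(5, 25, 75, 95)):
    """frame_rgb: (H, W, 3) float image; mask: (H, W) grayscale mask in [0, 1]."""
    pixels = frame_rgb[mask > 0.5]          # (N, 3) skin pixels selected by the mask
    return {
        "mean": pixels.mean(axis=0),
        "median": np.median(pixels, axis=0),
        "std": pixels.std(axis=0),
        "percentiles": {p: np.percentile(pixels, p, axis=0) for p in percentiles},
    }
```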
[0053] The skin recoloring 320 may further include facial statistics computation.
FIG. 8 shows a segmented head image 805, the segmented head image 805 with facial landmarks 810, and a facial mask 815, according to an example embodiment. Based on segmentation of the head image of the user and facial landmarks, the facial mask 815 of the user may be created. Regions such as eyes, mouth, hair, or accessories (like glasses) may not be included in the facial mask 815. The segmented head image 805 of the user and the facial mask may be used to compute the statistics for the facial skin of the user. Thus, the input of the facial statistics computation may include the segmented head image 805 of the user, facial landmarks 810, and facial segmentation, and the output of the facial statistics computation may include color statistics for the facial skin of the user.
[0054] The skin recoloring 320 may further include skin-tone matching and recoloring. FIG. 9 shows a frame 905 featuring a user face, a skin mask 910, and a result 915 of recoloring the skin mask 910, according to an example embodiment. The skin-tone matching and recoloring may be performed using statistics that describe color distributions in the actor's skin and the user's skin, and recoloring of a background frame may be performed in real-time on a computing device. For each color channel, distribution matching may be performed and values of background pixels may be modified in order to make the distribution of transformed values close to the distribution of facial values. Distribution matching may be performed either under the assumption that the color distribution is normal or by applying techniques like
multidimensional probability density function transfer. Thus, the input of the skin-tone matching and recoloring process may include a background frame, actor skin masks for the frame, actor body skin color statistics for each mask, and user facial skin color statistics, and the output may include the background frame with all body parts with uncovered skin recolored.
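A minimal sketch of distribution matching under the normality assumption is shown below: each channel of the masked actor skin is shifted and scaled so that its mean and standard deviation match the user's facial skin statistics, then blended back through the mask. The helper name and argument layout are assumptions; the multidimensional probability density function transfer mentioned above is not shown.

```python
# Per-channel mean/std matching of actor skin to user facial skin, blended via the mask.
import numpy as np

def recolor_skin(frame: np.ndarray, mask: np.ndarray,
                 actor_mean, actor_std, user_mean, user_std) -> np.ndarray:
    """frame: (H, W, 3) float image in [0, 1]; mask: (H, W) weights in [0, 1]."""
    actor_std = np.maximum(np.asarray(actor_std, dtype=np.float32), 1e-4)
    # Affine transform per channel mapping the actor skin distribution onto the user's.
    recolored = (frame - np.asarray(actor_mean)) / actor_std * np.asarray(user_std) + np.asarray(user_mean)
    alpha = mask[..., None]                      # recolor only where the skin mask is active
    return np.clip(frame * (1.0 - alpha) + recolored * alpha, 0.0, 1.0)
```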
[0055] In some embodiments, to apply skin recoloring 320, several actors with different skin tones may be recorded and then a version of the personalized video that has the closest skin tone to the skin tone of the image of the user may be used.
[0056] In an example embodiment, instead of skin recoloring 320, a predetermined lookup table (LUT) may be used to adjust the color of the face to the illumination of a scene. The LUT may be also used to change the color of the face, for example, to make the face green.
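The sketch below illustrates applying a predetermined LUT; for brevity it uses an independent 256-entry table per channel rather than a full 3D color cube, and the example "warm" table values are purely hypothetical.

```python
# Minimal sketch of applying a predetermined per-channel lookup table to a face image.
import numpy as np

def apply_lut(face_rgb_u8: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """face_rgb_u8: (H, W, 3) uint8 image; lut: (256, 3) uint8 table, one column per channel."""
    out = np.empty_like(face_rgb_u8)
    for c in range(3):
        out[..., c] = lut[face_rgb_u8[..., c], c]   # remap each channel through its table
    return out

# Hypothetical example: a slightly warm LUT for a warmly lit scene.
ramp = np.arange(256, dtype=np.float32)
warm_lut = np.stack([np.clip(ramp * 1.05, 0, 255),
                     ramp,
                     np.clip(ramp * 0.95, 0, 255)], axis=1).astype(np.uint8)
```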
[0057] The lip synchronization and facial reenactment 325 may result in photorealistic face animation. FIG. 10 shows an example process of the lip
synchronization and facial reenactment 325. FIG. 10 shows an image 1005 of a facesync actor face, an image 1010 of the facesync actor facial landmarks, an image 1015 of user facial landmarks, and an image 1020 of the user's face with the facial expression of the facesync actor, according to an example embodiment. The steps of lip synchronization and facial reenactment 325 may include recording a facesync actor and pre-processing a source video/image to obtain the image 1005 of a facesync actor face. Then, the facial landmarks may be extracted as shown by the image 1010 of the facesync actor facial landmarks. This step also may include gaze tracking of the facesync actor. In some embodiments, instead of recording a facesync actor, previously prepared animated 2D or 3D face and mouth region models may be used. The animated 2D or 3D face and mouth region models may be generated by machine learning techniques.
[0058] Optionally, fine-tuning of the facial landmarks may be performed. In some example embodiments, the fine-tuning of the facial landmarks is performed manually. These steps can be performed in a cloud when preparing the video configuration file. In some example embodiments, these steps may be performed during the resources preparation 315. Then, the user's facial landmarks may be extracted as shown by the image 1015 of the user's facial landmarks. The next step of the lip synchronization and facial reenactment 325 may include animation of the target image with extracted landmarks to obtain the image 1020 of the user's face with the facial expression of the facesync actor. This step may be performed on a computing device based on an image of a face of the user. The method of animation is described in detail in U.S. patent application No. 16/251,472, the disclosure of which is incorporated herein by reference in its entirety. The lip synchronization and facial reenactment 325 can also be enriched with Artificial Intelligence-made head turns.
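The following greatly simplified sketch illustrates one way landmark-driven expression transfer could work: the facesync actor's per-frame landmark displacements (relative to a neutral frame) are rotated and scaled into the user's landmark space through a least-squares similarity alignment. The actual animation method referenced above is considerably more involved; all names here are illustrative.

```python
# Simplified expression transfer: map actor landmark displacements into user landmark space.
import numpy as np

def similarity_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares scale/rotation/translation aligning src landmarks to dst, both (N, 2)."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    u, s, vt = np.linalg.svd(src_c.T @ dst_c)
    rot = (u @ vt).T                                  # 2x2 rotation (reflection handling omitted)
    scale = s.sum() / (src_c ** 2).sum()
    trans = dst.mean(0) - scale * src.mean(0) @ rot.T
    return scale, rot, trans

def transfer_expression(actor_neutral, actor_frame, user_neutral):
    """Return user landmarks adopting the actor's expression for one frame (all arrays (N, 2))."""
    scale, rot, _ = similarity_transform(actor_neutral, user_neutral)
    delta = (actor_frame - actor_neutral) @ rot.T * scale   # actor displacement in user space
    return user_neutral + delta
```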
[0059] In some example embodiments, after the user takes an image, a 3D model of the user's head may be created. In this embodiment, the step of lip synchronization and facial reenactment 325 may be omitted.
[0060] The hair animation 330 may be performed to animate hair of the user. For example, if the user has hair, the hair may be animated when the user moves or rotates his head. The hair animation 330 is shown in FIG. 11. FIG. 11 shows a segmented face image 1105, a hair mask 1110, a hair mask warped to the face image 1115, and the hair mask applied to the face image 1120, according to an example embodiment. The hair animation 330 may include one or more of the following steps: classifying the hair type, modifying appearance of the hair, modifying a hair style, making the hair longer, changing the color of the hair, cutting and animating the hair, and so forth. As shown in FIG. 11, a face image in the form of a segmented face image 1105 may be obtained. Then, a hair mask 1110 may be applied to the segmented face image 1105. The image 1115 shows the hair mask 1110 warped to the face image. The image 1120 shows the hair mask 1110 applied to the face image. The hair animation 330 is described in detail in the US patent application No. 16/551,756, the disclosure of which is incorporated herein by reference in its entirety.
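As a rough illustration of warping a hair mask from the user image into a frame, the sketch below fits a 2D affine transform to corresponding landmark points and resamples the mask with it; the helper and its arguments are assumptions and do not reflect the details of the referenced hair animation method.

```python
# Warp a hair mask into frame coordinates via an affine fit to landmark correspondences.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_mask(mask: np.ndarray, dst_pts: np.ndarray, src_pts: np.ndarray,
              out_shape: tuple) -> np.ndarray:
    """Fit an affine map from frame (dst) landmarks to user-image (src) landmarks, then resample."""
    ones = np.ones((len(dst_pts), 1))
    A, *_ = np.linalg.lstsq(np.hstack([dst_pts, ones]), src_pts, rcond=None)   # (3, 2)
    rows, cols = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    dst_xy = np.stack([cols.ravel(), rows.ravel(), np.ones(rows.size)], axis=1)
    src_xy = dst_xy @ A                                    # where each output pixel samples the mask
    coords = np.stack([src_xy[:, 1], src_xy[:, 0]])        # (row, col) order for map_coordinates
    return map_coordinates(mask, coords, order=1, mode='constant').reshape(out_shape)
```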
[0061] The eyes animation 335 may allow making the facial expressions of the user more realistic. The eyes animation 335 is shown in detail in FIG. 12. The process of eyes animation 335 may consist of the following steps: reconstruction of an eye region of the user face, gaze movement step, and eye blinking step. During the reconstruction of the eye region, the eye region is segmented into parts: eyeball, iris, pupil, eyelashes, and eyelid. If some part of the eye region (e.g., iris or eyelid) is not fully visible, the full texture of this part may be synthesized. In some embodiments, a 3D morphable model of an eye may be fitted, and a 3D-shape of the eye may be obtained together with the texture of the eye. FIG. 12 shows an original image 1205 of an eye, an image 1210 with the reconstructed sclera of the eye, and an image 1215 with the reconstructed iris.
[0062] The gaze movement step includes tracking a gaze direction and pupil position in a video of a facesync actor. This data may be manually edited if the eye movements of the facesync actor are not rich enough. Gaze movements may then be transferred to the eye region of the user by synthesizing a new eye image with transformed eye shape and the same position of iris as that of the facesync actor. FIG.
12 shows an image 1220 with the reconstructed moved iris.
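A simplified sketch of the gaze movement step is shown below: the reconstructed iris layer is shifted inside the eye region according to the facesync actor's pupil offset and composited over the reconstructed sclera. Inputs and names are illustrative assumptions.

```python
# Shift the reconstructed iris by the facesync actor's pupil offset and composite it.
import numpy as np
from scipy.ndimage import shift as nd_shift

def move_iris(sclera: np.ndarray, iris_rgba: np.ndarray, pupil_offset: tuple) -> np.ndarray:
    """sclera: (H, W, 3) float; iris_rgba: (H, W, 4) iris texture with alpha; offset (dy, dx) in pixels."""
    moved = nd_shift(iris_rgba, shift=(pupil_offset[0], pupil_offset[1], 0), order=1)
    alpha = moved[..., 3:4]
    return sclera * (1.0 - alpha) + moved[..., :3] * alpha
```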
[0063] During the eye blinking step, the visible part of the eye of the user may be determined by tracking the eyes of the facesync actor. A changed appearance of eyelids and eyelashes may be generated based on the reconstruction of eye region.
[0064] The steps of the eyes animation 335 may be done either explicitly (as described) or implicitly if face reenactment is done using generative adversarial networks (GAN). In the latter case, the neural network may implicitly capture all the necessary information from the image of the user face and the source video.
[0065] During the deploy 340, the user face may be photorealistically animated and automatically inserted in footage templates. The files from the previous steps (resources preparation 315, skin recoloring 320, lip synchronization and facial reenactment 325, hair animation 330, and eyes animation 335) may be used as data for a configuration file. Examples of personalized videos with a predefined set of user faces can be generated for initial review. After the issues that were identified during the review are eliminated, the personalized video may be deployed.
[0066] The configuration file may also include a component that allows indicating the text parameters for customized personalized videos. A customized personalized video is a type of personalized video that allows users to add any text they want on top of the final video. The generation of personalized videos with customized text messages is described in more detail in U.S. patent application No. 16/661,122 dated October 23, 2019, titled "SYSTEMS AND METHODS FOR
GENERATING PERSONALIZED VIDEOS WITH CUSTOMIZED TEXT MESSAGES," the disclosure of which is incorporated herein in its entirety.
[0067] In an example embodiment, the generation of the personalized videos may further include the steps of generating significant head turns of a user's head; body animation and changing clothes; facial augmentations such as hairstyle changing, beautification, adding accessories, and the like; changing the scene illumination;
synthesizing the voice that may read/sing the text that the user has typed or changing the voice-over tone to match the voice of the user; gender switching; construction of a background and a foreground depending on the user input; and so forth.
[0068] FIG. 4 is a schematic showing functionality 400 of the system 250 for template-based generation of the personalized videos, according to some example embodiments. The system 250 may receive an image of a source face shown as a user face image 405 and a video template including video configuration data 410. The video configuration data 410 may include data sequences 420. For example, the video configuration data 410 may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. Each of the facial landmark parameters may correspond to a facial expression. The sequence of frame images may be generated based on an animation video or based on a live action video. The sequence of facial landmark parameters may be generated based on a live action video featuring a face of a facesync actor. The video configuration data 410 may further include a skin mask, eyes parameters, a mouth region image, head parameters, animated object images, preset text parameters, and so forth. The video configuration data may include a sequence of skin masks defining a skin area of a body of at least one actor featured in the frame images. In an example embodiment, the video configuration data 410 may further include a sequence of mouth region images. Each of the mouth region images may correspond to at least one of the frame images.
In a further example embodiment, the video configuration data 410 may include a sequence of eye parameters defining positions of an iris in a sclera of a facesync actor featured in the frame images and/or a sequence of head parameters defining a rotation, a turn, a scale, and other parameters of a head. In another example embodiment, the video configuration data 410 may further include a sequence of animated object images. Each of the animated object images may correspond to at least one of the frame images. The video configuration data 410 may further include a soundtrack 450.
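For illustration, the video configuration data 410 described above could be organized as a simple container such as the sketch below; the field names and types are assumptions and not part of the disclosure.

```python
# Hypothetical container for the video configuration data described above.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class VideoConfigurationData:
    frame_images: List[np.ndarray]                               # background/foreground frames
    face_area_params: List[Tuple[float, float, float, float]]   # x, y, width, height per frame
    facial_landmark_params: List[np.ndarray]                     # (N, 2) landmark set per frame
    skin_masks: List[np.ndarray] = field(default_factory=list)
    mouth_region_images: List[np.ndarray] = field(default_factory=list)
    eye_params: List[np.ndarray] = field(default_factory=list)   # iris position in the sclera
    head_params: List[np.ndarray] = field(default_factory=list)  # rotation, turn, scale, position
    animated_object_images: List[np.ndarray] = field(default_factory=list)
    soundtrack: Optional[bytes] = None
```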
[0069] The system 250 may determine, based on the user face image 405, user data 435. The user data may include user facial landmarks, a user face mask, user color data, a user hair mask, and so forth.
[0070] The system 250 may generate, based on the user data 435 and the data sequences 420, frames 445 of an output video shown as a personalized video 440. The system 250 may further add the soundtrack to the personalized video 440. The personalized video 440 may be generated by modifying a frame image of the sequence of frame images. The modifying of the frame image may include modifying the user face image 405 to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters. The modification may be performed based on facial landmark parameters corresponding to the frame image.
The further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image. In an example embodiment, the generation of the output video may further include determining color data associated with the source face and, based on the color data, recoloring the skin area in the frame image. Additionally, the generation of the output video may include inserting, into the frame image, a mouth region corresponding to the frame image. Further steps of the generation of the output video may include generating an image of an eyes region based on the eye parameters corresponding to the frame and inserting the image of the eyes region in the frame image. In an example embodiment, the generation of the output video may further include determining a hair mask based on the source face image, generating a hair image based on the hair mask and head parameters corresponding to the frame image, and inserting the hair image into the frame image. Additionally, the generation of the output video may include inserting, into the frame image, an animated object image corresponding to the frame image.
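The per-frame generation described above can be condensed, for illustration, into the loop sketched below, in which an already reenacted and recolored face image is simply scaled to the face area and alpha-blended into each frame; the separate recoloring, mouth, eye, and hair steps are assumed to be folded into the precomputed face image, and all names are illustrative.

```python
# Condensed per-frame composition loop; assumes the face box fits inside each frame.
import numpy as np
from scipy.ndimage import zoom

def compose_frame(frame: np.ndarray, face_rgba: np.ndarray, face_area) -> np.ndarray:
    """frame: (H, W, 3) float; face_rgba: (h, w, 4) float with alpha; face_area: (x, y, width, height)."""
    x, y, w, h = [int(round(v)) for v in face_area]
    sy, sx = h / face_rgba.shape[0], w / face_rgba.shape[1]
    face = zoom(face_rgba, (sy, sx, 1), order=1)                 # scale the face to the target box
    out = frame.copy()
    region = out[y:y + face.shape[0], x:x + face.shape[1]]
    alpha = face[..., 3:4]
    region[:] = region * (1.0 - alpha) + face[..., :3] * alpha   # alpha-blend the face into place
    return out

def generate_output_video(config, user_face_frames):
    """config: VideoConfigurationData-like object; user_face_frames: one reenacted face per frame."""
    return [compose_frame(f, face, area)
            for f, face, area in zip(config.frame_images, user_face_frames, config.face_area_params)]
```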
[0071] FIGs. 13-14 show frames of example personalized videos generated based on video templates, according to some example embodiments. FIG. 13 shows a filmed personalized video 1305 with an actor, in which the recoloring was performed. FIG. 13 further shows a personalized video 1310 created based on a stock video obtained from a third party. In the personalized video 1310, a user face 1320 is inserted into the stock video. FIG. 13 further shows a personalized video 1315, which is a 2D animation with a user head 1325 added on top of the 2D animation.
[0072] FIG. 14 shows a personalized video 1405, which is a 3D animation with a user face 1415 inserted into the 3D animation. FIG. 14 further shows a personalized video 1410 with effects, animated elements 1420, and, optionally, text added on top of the image of the user face.
[0073] FIG. 15 is a flow chart showing a method 1500 for template-based generation of personalized videos, according to some example embodiments of the disclosure. The method 1500 can be performed by the computing device 105. The method 1500 may commence with receiving video configuration data at step 1505. The video configuration data may include a sequence of frame images, a sequence of face area parameters defining positions of a face area in the frame images, and a sequence of facial landmark parameters defining positions of facial landmarks in the frame images. Each of the facial landmark parameters may correspond to a facial expression. In an example embodiment, the sequence of frame images may be generated based on an animation video or based on a live action video. The sequence of facial landmark parameters may be generated based on a live action video featuring a face of a facesync actor. The video configuration data may include one or more of the following: a sequence of skin masks defining a skin area of a body of at least one actor featured in the frame images, a sequence of mouth region images where each of the mouth region images corresponds to at least one of the frame images, a sequence of eye parameters defining positions of an iris in a sclera of a facesync actor featured in the frame images, a sequence of head parameters defining a rotation, a scale, a turn, and other parameters of a head, a sequence of animated object images, wherein each of the animated object images corresponds to at least one of the frame images, and so forth.
[0074] The method 1500 may continue with receiving an image of a source face at step 1510. The method 1500 may further include generating an output video at step 1515. Specifically, the generation of the output video may include modifying a frame image of the sequence of frame images. The frame image may be modified by modifying the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters. The image of the source face may be modified based on facial landmark parameters corresponding to the frame image. The further image may be inserted into the frame image at a position determined by face area parameters corresponding to the frame image. In an example embodiment, the generation of the output video may further optionally include one or more of the following steps: determining color data associated with the source face and recoloring the skin area in the frame image based on the color data, inserting a mouth region corresponding to the frame image into the frame image, generating an image of eyes region based on the eye parameters corresponding to the frame, inserting the image of the eyes region in the frame image, determining a hair mask based on the source face image and generating a hair image based on the hair mask and head parameters corresponding to the frame image, inserting the hair image into the frame image, and inserting an animated object image corresponding to the frame image into the frame image.
[0075] FIG. 16 illustrates an example computing system 1600 that can be used to implement methods described herein. The computing system 1600 can be implemented in the contexts of the likes of computing devices 105 and 110, the messenger services system 130, the messenger 220, and the system 250 for template-based generation of personalized videos.
[0076] As shown in FIG. 16, the hardware components of the computing system
1600 may include one or more processors 1610 and memory 1620. Memory 1620 stores, in part, instructions and data for execution by processor 1610. Memory 1620 can store the executable code when the system 1600 is in operation. The system 1600 may further include an optional mass storage device 1630, optional portable storage medium drive(s) 1640, one or more optional output devices 1650, one or more optional input devices 1660, an optional network interface 1670, and one or more optional peripheral devices 1680. The computing system 1600 can also include one or more software components 1695 (e.g., ones that can implement the method for template-based generation of personalized videos as described herein).
[0077] The components shown in FIG. 16 are depicted as being connected via a single bus 1690. The components may be connected through one or more data transport means or data network. The processor 1610 and memory 1620 may be connected via a local microprocessor bus, and the mass storage device 1630, peripheral device(s) 1680, portable storage device 1640, and network interface 1670 may be connected via one or more input/output (I/O) buses.
[0078] The mass storage device 1630, which may be implemented with a magnetic disk drive, solid-state disk drive, or an optical disk drive, is a non-volatile storage device for storing data and instructions for use by the processor 1610. Mass storage device 1630 can store the system software (e.g., software components 1695) for implementing embodiments described herein.
[0079] Portable storage medium drive(s) 1640 operates in conjunction with a portable non-volatile storage medium, such as a compact disk (CD), or digital video disc (DVD), to input and output data and code to and from the computing system 1600. The system software (e.g., software components 1695) for implementing embodiments described herein may be stored on such a portable medium and input to the computing system 1600 via the portable storage medium drive(s) 1640.
[0080] The optional input devices 1660 provide a portion of a user interface. The input devices 1660 may include an alphanumeric keypad, such as a keyboard, for inputting alphanumeric and other information, or a pointing device, such as a mouse, a trackball, a stylus, or cursor direction keys. The input devices 1660 can also include a camera or scanner. Additionally, the system 1600 as shown in FIG. 16 includes optional output devices 1650. Suitable output devices include speakers, printers, network interfaces, and monitors.
[0081] The network interface 1670 can be utilized to communicate with external devices, external computing devices, servers, and networked systems via one or more communications networks such as one or more wired, wireless, or optical networks including, for example, the Internet, intranet, LAN, WAN, cellular phone networks, Bluetooth radio, and an IEEE 802.11-based radio frequency network, among others. The network interface 1670 may be a network interface card, such as an Ethernet card, optical transceiver, radio frequency transceiver, or any other type of device that can send and receive information. The optional peripherals 1680 may include any type of computer support device to add additional functionality to the computer system.
[0082] The components contained in the computing system 1600 are intended to represent a broad category of computer components. Thus, the computing system 1600 can be a server, personal computer, hand-held computing device, telephone, mobile computing device, workstation, minicomputer, mainframe computer, network node, or any other computing device. The computing system 1600 can also include different bus configurations, networked platforms, multi-processor platforms, and so forth. Various operating systems (OS) can be used including UNIX, Linux, Windows, Macintosh OS, Palm OS, and other suitable operating systems.
[0083] Some of the above-described functions may be composed of instructions that are stored on storage media (e.g., computer-readable medium or processor-readable medium). The instructions may be retrieved and executed by the processor. Some examples of storage media are memory devices, tapes, disks, and the like. The instructions are operational when executed by the processor to direct the processor to operate in accord with the invention. Those skilled in the art are familiar with instructions, processor(s), and storage media.
[0084] It is noteworthy that any hardware platform suitable for performing the processing described herein is suitable for use with the invention. The terms
"computer-readable storage medium" and "computer-readable storage media" as used herein refer to any medium or media that participate in providing instructions to a processor for execution. Such media can take many forms, including, but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as a fixed disk. Volatile media include dynamic memory, such as system random access memory (RAM).
Transmission media include coaxial cables, copper wire, and fiber optics, among others, including the wires that include one embodiment of a bus. Transmission media can also take the form of acoustic or light waves, such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-read-only memory (ROM) disk,
DVD, any other optical medium, any other physical medium with patterns of marks or holes, a RAM, a PROM, an EPROM, an EEPROM, any other memory chip or cartridge, a carrier wave, or any other medium from which a computer can read.
[0085] Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution. A bus carries the data to system RAM, from which a processor retrieves and executes the instructions. The instructions received by the system processor can optionally be stored on a fixed disk either before or after execution by a processor.
[0086] Thus, the methods and systems for template-based generation of personalized videos have been described. Although embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes can be made to these example embodiments without departing from the broader spirit and scope of the present application. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims

CLAIMS
What is claimed is:
1. A method for template-based generation of personalized videos, the method comprising:
receiving, by a computing device, video configuration data including: a sequence of frame images;
a sequence of face area parameters defining positions of a face area in the frame images; and
a sequence of facial landmark parameters defining positions of facial landmarks in the frame images, wherein each of the facial landmark parameters corresponds to a facial expression;
receiving, by the computing device, an image of a source face; and generating, by the computing device, an output video, wherein the generating the output video includes modifying a frame image of the sequence of frame images by:
modifying, based on facial landmark parameters corresponding to the frame image, the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters; and
inserting the further image into the frame image at a position determined by face area parameters corresponding to the frame image.
2. The method of claim 1, wherein the sequence of frame images is generated based on one of the following: an animation video and a live action video.
3. The method of claim 1, wherein the sequence of facial landmark parameters is generated based on a live action video featuring a face of a face synchronization actor.
4. The method of claim 1, wherein:
the video configuration data include a sequence of skin masks defining a skin area of a body of at least one actor featured in the frame images or a skin area of 2D/3D animation of a further body; and
the generating the output video includes:
determining color data associated with the source face; and recoloring, based on the color data, the skin area in the frame image.
5. The method of claim 1, wherein:
the video configuration data further include a sequence of mouth region images, each of the mouth region images corresponding to at least one of the frame images; and
the generating the output video includes inserting, into the frame image, a mouth region corresponding to the frame image.
6. The method of claim 1, wherein:
the video configuration data further include a sequence of eye parameters defining positions of an iris in a sclera of a face synchronization actor featured in the frame images; and
the generating the output video includes: generating, based on the eye parameters corresponding to the frame, an image of an eyes region; and
inserting the image of the eyes region into the frame image.
7. The method of claim 1, wherein:
the video configuration data include a sequence of head parameters defining one or more of a rotation, a turn, a position, and a scale of a head.
8. The method of claim 1, wherein:
the generating the output video includes:
determining, based on the image of the source face, a hair mask; generating, based on the hair mask, a hair image; and
inserting the hair image into the frame image.
9. The method of claim 1, wherein:
the video configuration data include a sequence of animated object images, wherein each of the animated object images corresponds to at least one of the frame images; and
the generating the output video includes inserting, into the frame image, an animated object image corresponding to the frame image.
10. The method of claim 1, wherein:
the video configuration data include a soundtrack; and
the generating the output video further includes adding the soundtrack to the output video.
11. A system for template-based generation of personalized videos, the system comprising at least one processor and a memory storing processor-executable codes, wherein the at least one processor is configured to implement the following operations upon executing the processor-executable codes:
receiving, by a computing device, video configuration data including: a sequence of frame images;
a sequence of face area parameters defining positions of a face area in the frame images; and
a sequence of facial landmark parameters defining positions of facial landmarks in the frame images, wherein each of the facial landmark parameters corresponds to a facial expression;
receiving, by the computing device, an image of a source face; and generating, by the computing device, an output video, wherein the generating the output video includes modifying a frame image of the sequence of frame images by:
modifying, based on facial landmark parameters corresponding to the frame image, the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters; and
inserting the further image into the frame image at a position determined by face area parameters corresponding to the frame image.
12. The system of claim 11, wherein the sequence of frame images is generated based on one of the following: an animation video and a live action video.
13. The system of claim 11, wherein the sequence of facial landmark parameters is generated based on a live action video featuring a face of a face synchronization actor.
14. The system of claim 11, wherein:
the video configuration data include a sequence of skin masks defining a skin area of a body of at least one actor featured in the frame images or a skin area of 2D/3D animation of a further body; and
the generating the output video includes:
determining color data associated with the source face; and recoloring, based on the color data, the skin area in the frame image.
15. The system of claim 11, wherein:
the video configuration data further include a sequence of mouth region images, each of the mouth region images corresponding to at least one of the frame images; and
the generating the output video includes inserting, into the frame image, a mouth region corresponding to the frame image.
16. The system of claim 11, wherein:
the video configuration data further include a sequence of eye parameters defining positions of an iris in a sclera of a face synchronization actor featured in the frame images; and
the generating the output video includes: generating, based on the eye parameters corresponding to the frame image, an image of an eye region; and
inserting the image of the eye region into the frame image.
17. The system of claim 11, wherein:
the video configuration data include a sequence of head parameters defining one or more of a rotation, a turn, a position, and a scale of a head.
18. The system of claim 11, wherein:
the generating the output video includes:
determining, based on the source face image, a hair mask;
generating, based on the hair mask, a hair image; and
inserting the hair image into the frame image.
19. The system of claim 11, wherein:
the video configuration data include a sequence of animated object images, wherein each of the animated object images corresponds to at least one of the frame images; and
the generating the output video includes inserting, into the frame image, an animated object image corresponding to the frame image.
20. A non-transitory processor-readable medium having instructions stored thereon, which when executed by one or more processors, cause the one or more processors to implement a method for template-based generation of personalized videos, the method comprising:
receiving, by a computing device, video configuration data including: a sequence of frame images;
a sequence of face area parameters defining positions of a face area in the frame images; and
a sequence of facial landmark parameters defining positions of facial landmarks in the frame images, wherein each of the facial landmark parameters corresponds to a facial expression;
receiving, by the computing device, an image of a source face; and generating, by the computing device, an output video, wherein the generating the output video includes modifying a frame image of the sequence of frame images by:
modifying, based on facial landmark parameters corresponding to the frame image, the image of the source face to obtain a further image featuring the source face adopting a facial expression corresponding to the facial landmark parameters; and
inserting the further image into the frame image at a position determined by face area parameters corresponding to the frame image.
EP20707900.5A 2019-01-18 2020-01-18 Systems and methods for template-based generation of personalized videos Pending EP3912160A1 (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US16/251,472 US11049310B2 (en) 2019-01-18 2019-01-18 Photorealistic real-time portrait animation
US16/251,436 US10789453B2 (en) 2019-01-18 2019-01-18 Face reenactment
US16/434,185 US10839586B1 (en) 2019-06-07 2019-06-07 Single image-based real-time body animation
US16/551,756 US10776981B1 (en) 2019-06-07 2019-08-27 Entertaining mobile application for animating a single image of a human body and applying effects
US16/594,690 US11089238B2 (en) 2019-01-18 2019-10-07 Personalized videos featuring multiple persons
US16/594,771 US11394888B2 (en) 2019-01-18 2019-10-07 Personalized videos
US16/661,122 US11308677B2 (en) 2019-01-18 2019-10-23 Generating personalized videos with customized text messages
US16/661,086 US11288880B2 (en) 2019-01-18 2019-10-23 Template-based generation of personalized videos
PCT/US2020/014225 WO2020150692A1 (en) 2019-01-18 2020-01-18 Systems and methods for template-based generation of personalized videos

Publications (1)

Publication Number Publication Date
EP3912160A1 true EP3912160A1 (en) 2021-11-24

Family

ID=71613940

Family Applications (2)

Application Number Title Priority Date Filing Date
EP20707901.3A Pending EP3912136A1 (en) 2019-01-18 2020-01-18 Systems and methods for generating personalized videos with customized text messages
EP20707900.5A Pending EP3912160A1 (en) 2019-01-18 2020-01-18 Systems and methods for template-based generation of personalized videos

Family Applications Before (1)

Application Number Title Priority Date Filing Date
EP20707901.3A Pending EP3912136A1 (en) 2019-01-18 2020-01-18 Systems and methods for generating personalized videos with customized text messages

Country Status (4)

Country Link
EP (2) EP3912136A1 (en)
KR (5) KR20240050468A (en)
CN (2) CN113302694A (en)
WO (2) WO2020150692A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11477366B2 (en) * 2020-03-31 2022-10-18 Snap Inc. Selfie setup and stock videos creation
CN112215927B (en) * 2020-09-18 2023-06-23 腾讯科技(深圳)有限公司 Face video synthesis method, device, equipment and medium
CN112153475B (en) * 2020-09-25 2022-08-05 北京字跳网络技术有限公司 Method, apparatus, device and medium for generating text mode video
CN112866798B (en) * 2020-12-31 2023-05-05 北京字跳网络技术有限公司 Video generation method, device, equipment and storage medium
WO2022212503A1 (en) * 2021-03-31 2022-10-06 Snap Inc. Facial synthesis in augmented reality content for third party applications
KR102345729B1 (en) * 2021-04-08 2022-01-03 주식회사 닫닫닫 Method and apparatus for generating video
US11803996B2 (en) 2021-07-30 2023-10-31 Lemon Inc. Neural network architecture for face tracking

Family Cites Families (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6512522B1 (en) * 1999-04-15 2003-01-28 Avid Technology, Inc. Animation of three-dimensional characters along a path for motion video sequences
KR100606076B1 (en) * 2004-07-02 2006-07-28 삼성전자주식회사 Method for controlling image in wireless terminal
CN101563698A (en) * 2005-09-16 2009-10-21 富利克索尔股份有限公司 Personalizing a video
EP1941423A4 (en) * 2005-09-16 2010-06-30 Flixor Inc Personalizing a video
CN101005609B (en) * 2006-01-21 2010-11-03 腾讯科技(深圳)有限公司 Method and system for forming interaction video frequency image
US8265349B2 (en) * 2006-02-07 2012-09-11 Qualcomm Incorporated Intra-mode region-of-interest video object segmentation
US20090016617A1 (en) * 2007-07-13 2009-01-15 Samsung Electronics Co., Ltd. Sender dependent messaging viewer
CN100448271C (en) * 2007-08-10 2008-12-31 浙江大学 Video editing method based on panorama sketch split joint
JP5247356B2 (en) * 2008-10-29 2013-07-24 キヤノン株式会社 Information processing apparatus and control method thereof
CN102054287B (en) * 2009-11-09 2015-05-06 腾讯科技(深圳)有限公司 Facial animation video generating method and device
US8443285B2 (en) * 2010-08-24 2013-05-14 Apple Inc. Visual presentation composition
CN102426568A (en) * 2011-10-04 2012-04-25 上海量明科技发展有限公司 Instant messaging text information picture editing method, client and system
US9277198B2 (en) * 2012-01-31 2016-03-01 Newblue, Inc. Systems and methods for media personalization using templates
RU2627096C2 (en) * 2012-10-30 2017-08-03 Сергей Анатольевич Гевлич Methods for multimedia presentations prototypes manufacture, devices for multimedia presentations prototypes manufacture, methods for application of devices for multimedia presentations prototypes manufacture (versions)
CA2818052A1 (en) * 2013-03-15 2014-09-15 Keith S. Lerner Dynamic customizable personalized label
GB2525841A (en) * 2014-03-07 2015-11-11 Starbucks Hk Ltd Image modification
US9881002B1 (en) * 2014-03-19 2018-01-30 Amazon Technologies, Inc. Content localization
US9576175B2 (en) * 2014-05-16 2017-02-21 Verizon Patent And Licensing Inc. Generating emoticons based on an image of a face
WO2016070354A1 (en) * 2014-11-05 2016-05-12 Intel Corporation Avatar video apparatus and method
US20170004646A1 (en) * 2015-07-02 2017-01-05 Kelly Phillipps System, method and computer program product for video output from dynamic content
WO2017130158A1 (en) * 2016-01-27 2017-08-03 Vats Nitin Virtually trying cloths on realistic body model of user
EP3475920A4 (en) * 2016-06-23 2020-01-15 Loomai, Inc. Systems and methods for generating computer ready animation models of a human head from captured data images
CN107545210A (en) * 2016-06-27 2018-01-05 北京新岸线网络技术有限公司 A kind of method of video text extraction
CN106126709A (en) * 2016-06-30 2016-11-16 北京奇虎科技有限公司 Generate the method and device of chatting facial expression in real time
WO2018102880A1 (en) * 2016-12-09 2018-06-14 Frangos Marcus George Systems and methods for replacing faces in videos
US10636175B2 (en) * 2016-12-22 2020-04-28 Facebook, Inc. Dynamic mask application
US10446189B2 (en) * 2016-12-29 2019-10-15 Google Llc Video manipulation with face replacement
US10878612B2 (en) * 2017-04-04 2020-12-29 Intel Corporation Facial image replacement using 3-dimensional modelling techniques
KR20230144661A (en) * 2017-05-16 2023-10-16 애플 인크. Emoji recording and sending
CN107566892B (en) * 2017-09-18 2020-09-08 北京小米移动软件有限公司 Video file processing method and device and computer readable storage medium
CN107770626B (en) * 2017-11-06 2020-03-17 腾讯科技(深圳)有限公司 Video material processing method, video synthesizing device and storage medium
CN108133455A (en) * 2017-12-01 2018-06-08 天脉聚源(北京)科技有限公司 The display methods and device of user's name in poster image
CN108305309B (en) * 2018-04-13 2021-07-20 腾讯科技(成都)有限公司 Facial expression generation method and device based on three-dimensional animation
CN108965104A (en) * 2018-05-29 2018-12-07 深圳市零度智控科技有限公司 Merging sending method, device and the readable storage medium storing program for executing of graphic message
CN108924622B (en) * 2018-07-24 2022-04-22 腾讯科技(深圳)有限公司 Video processing method and device, storage medium and electronic device

Also Published As

Publication number Publication date
KR102658104B1 (en) 2024-04-17
WO2020150692A1 (en) 2020-07-23
KR20210119440A (en) 2021-10-05
CN113302659A (en) 2021-08-24
KR20240050468A (en) 2024-04-18
EP3912136A1 (en) 2021-11-24
KR20230173220A (en) 2023-12-26
KR20230173221A (en) 2023-12-26
KR20210119439A (en) 2021-10-05
WO2020150693A1 (en) 2020-07-23
CN113302694A (en) 2021-08-24
KR102616013B1 (en) 2023-12-21

Similar Documents

Publication Publication Date Title
US11694417B2 (en) Template-based generation of personalized videos
KR102658104B1 (en) Template-based personalized video creation system and method
US11861936B2 (en) Face reenactment
US20230282025A1 (en) Global configuration interface for default self-images
US11842433B2 (en) Generating personalized videos with customized text messages
US11558561B2 (en) Personalized videos featuring multiple persons
US20220385808A1 (en) Selfie setup and stock videos creation
US11895260B2 (en) Customizing modifiable videos of multimedia messaging application

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20210816

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
RIN1 Information on inventor provided before grant (corrected)

Inventor name: TKACHENKO, GRIGORIY

Inventor name: GOLOBKOV, ROMAN

Inventor name: PCHELNIKOV, ALEXEY

Inventor name: SAVINOVA, SOFIA

Inventor name: MATOV, DMITRY

Inventor name: MASHRABOV, ALEXANDER

Inventor name: SAVCHENKOV, PAVEL

Inventor name: SHABUROV, VICTOR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20230327

RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: SNAP INC.

RIN1 Information on inventor provided before grant (corrected)

Inventor name: TKACHENKO, GRIGORIY

Inventor name: GOLOBKOV, ROMAN

Inventor name: PCHELNIKOV, ALEXEY

Inventor name: SAVINOVA, SOFIA

Inventor name: MATOV, DMITRY

Inventor name: MASHRABOV, ALEXANDER

Inventor name: SAVCHENKOV, PAVEL

Inventor name: SHABUROV, VICTOR
