WO2018102880A1 - Systems and methods for replacing faces in videos - Google Patents

Systems and methods for replacing faces in videos

Info

Publication number
WO2018102880A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
facial image
video
facial
target
Prior art date
2016-12-09
Application number
PCT/AU2017/051353
Other languages
French (fr)
Inventor
Marcus George FRANGOS
Original Assignee
Frangos Marcus George
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-12-09
Filing date
2017-12-08
Publication date
2018-06-14
Priority claimed from AU2016905100A0
Application filed by Frangos Marcus George
Publication of WO2018102880A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/757 - Matching configurations of points or features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G06V 40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method of replacing faces in videos is disclosed. The method comprises providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model and receiving, from a user device, a user's facial image. The method further comprises automatically identifying facial landmarks in the user's facial image, automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture and automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively. The method also comprises displaying the composited video on the user device. A system for replacing faces in videos is also disclosed.

Description

SYSTEMS AND METHODS FOR REPLACING FACES IN VIDEOS
Field
[0001] The present invention relates to systems and methods for replacing facial images in videos.
Background
[0002] With increasingly more content being shared to social media, the ability to personalise and modify content has become an important differentiating factor.
[0003] For example, several popular existing mobile applications allow the user to personalise a photograph for entertainment purposes, by replacing a face in the photograph with the user's portrait.
[0004] It is much more challenging to implement a face replacement in a video, due to the changing position and angle of the target face throughout the video. Conventionally, such video editing would require substantial skill, time and processing power for frame-by-frame processing, which would inhibit implementation on a mobile device such as a smartphone.
[0005] In this context, there is a need for an improved system and method for replacing faces in videos.
Summary
[0006] According to the present invention, there is provided a method comprising:
providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model;
receiving, from a user device, a user's facial image;
automatically identifying facial landmarks in the user's facial image;
automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture;
automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively; and displaying the composited video on the user device.
[0007] Pre-processing of the video may comprise:
selecting a frame from the video;
detecting the target face in the selected frame;
mapping facial landmarks of the target face to a 3D target model;
automatically tracking the target face and fitting the 3D target model to the target face in remaining frames of the video.
[0008] The target face in the selected frame may be automatically detected by facial image recognition.
[0009] The steps of identifying the facial landmarks of the user's facial image, mapping the facial landmarks to the 3D subject model and compositing the user's facial image texture into each video frame may be performed by a processor of the user device.
[0010] The user's facial image may be obtained via a camera of the user device.
[0011] The method may further comprise displaying a head positioning guide on the user device while capturing the user's facial image.
[0012] The method may further comprise receiving front and profile facial images of the user and combining the front and profile facial images into a single facial image texture.
[0013] The user device may be a mobile device comprising a tablet or a smartphone.
[0014] The pre-processed video may comprise two or more different target faces appearing in multiple frames of the video, each target face being mapped to a 3D target model, wherein the user selects, via the user device, one of the target faces for compositing with the user's facial image.
[0015] The method may further comprise processing each composited video frame by texture blending, alpha blending, pixel intensity blending, luminescence blending, hue manipulation, applying blur filters, or a combination thereof.
[0016] In another aspect of the present invention, there is provided a system comprising:
a processor; and a non-transitory computer-readable medium coupled to the processor and having instructions stored thereon, which, when executed by the processor, cause the processor to perform operations comprising:
providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model;
receiving, from a user device, a user's facial image;
automatically identifying facial landmarks in the user's facial image; automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture;
automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively; and
displaying the composited video on the user device.
[0017] The pre-processed video may be stored on a server.
[0018] In another aspect of the present invention, there is provided a non-transitory computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:
providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model;
receiving, from a user device, a user's facial image;
automatically identifying facial landmarks in the user's facial image;
automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture;
automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively; and
displaying the composited video on the user device.
Brief Description of Drawings
[0019] Embodiments of the invention will now be described by way of example only with reference to the accompanying drawings, in which:
Figure 1 is a flowchart of a method for replacing faces in videos according to an embodiment;
Figure 2 is a block diagram illustrating the system and method for replacing faces in videos according to an embodiment;
Figures 3a to 3i are screenshots of the method and system implemented on a smartphone;
Figures 4a to 4d are screenshots illustrating the video processing method in more detail; and
Figure 5 illustrates a composited frame, in which the target face has been replaced with a user image.
Description of Embodiments
[0020] Figure 1 illustrates a method 10 for replacing faces in videos according to one embodiment. The method may be performed by one or more specially programmed computing devices. The method comprises three main components: video processing 30, user image processing (to a 3D model) 40, and video compositing 50. The method may involve receiving input from the user and displaying results to the user via a user device 2. The user device 2 may generally include a memory for storing instructions and data, and a processor for executing stored instructions. The memory may include both read-only and writable memory. For example, the user device 2 may be a mobile device such as a smartphone or tablet coupled to the one or more specially programmed computing devices through a data communication network, eg a local area network (LAN) or wide area network (WAN), eg the Internet, or a combination of networks, any of which may include wireless links.
[0021] The method 10 starts by providing a video that has been processed to map a target face 4 (eg an actor's or actress' face) to a 3D target model 8, in each frame of the video where the target face 4 appears.
[0022] Next, a user's facial image 11 is received via the user device 2. In some embodiments, the image 11 is obtained via a camera of the user device 2, eg an integrated smartphone camera, or a web camera connected to the user's device, etc. In other embodiments, the user may upload an image 11 or may select an image that has been previously uploaded and saved. In some embodiments, at least two user facial images are obtained, each from a different angle (preferably orthogonal to each other), such as front and profile views. Images from multiple angles allow for the generation of a combined facial image texture, which retains high image quality and fidelity when viewed from any angle.
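By way of illustration only (the specification does not prescribe a blending scheme; the weight ramp and helper below are assumptions), two facial images that have already been warped into a common texture layout could be combined along the following lines:

    import numpy as np

    def combine_textures(front_uv, profile_uv):
        """Blend front and profile textures of equal size, already warped into a
        common UV layout. Hypothetical helper: the front view is weighted most
        heavily near the centre of the texture, the profile view toward the edge."""
        h, w = front_uv.shape[:2]
        x = np.linspace(-1.0, 1.0, w, dtype=np.float32)
        front_weight = np.tile(np.clip(1.0 - np.abs(x), 0.0, 1.0), (h, 1))[..., None]
        blended = (front_weight * front_uv.astype(np.float32)
                   + (1.0 - front_weight) * profile_uv.astype(np.float32))
        return blended.astype(np.uint8)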
[0023] The method then automatically identifies a plurality of facial landmarks (not shown) in the user's facial image 11. These facial landmarks preferably correspond to naturally occurring facial features that are common to all or most people, for example corners of the eyes and eyebrows, tip of nose, corners of mouth, etc. In some embodiments, the facial landmarks in the user's facial image 11 may be automatically identified via face-fitting techniques. For example, cascade classifiers or stochastic methods, and an Active Shape Model (ASM) or Active Appearance Model (AAM) may be used to locate the facial features.
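For example, a minimal landmark-detection sketch (dlib and its publicly distributed 68-point shape predictor are not named in the specification and are assumed here purely for illustration) might be:

    import cv2
    import dlib

    # Assumed model file: dlib's publicly distributed 68-landmark shape predictor.
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def detect_landmarks(image_bgr):
        """Return (x, y) landmark coordinates for the first face found, or []."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)  # upsample once to help with small faces
        if not faces:
            return []
        shape = predictor(gray, faces[0])
        return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]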
[0024] In some embodiments, the method further involves displaying a head positioning guide 18 on the screen of the user device while the user's facial image is being captured by the camera of the user device. The positioning guide approximately aligns the facial landmarks on the user's face at specific locations and orientations, to assist with the step of detecting the user's facial landmarks.
[0025] Next, the method automatically maps the identified facial landmarks to a 3D subject mesh model, via mesh generation, face-fitting techniques, etc, to generate a user's facial image texture. In the embodiment where face-fitting is used for automatic facial feature detection, the fitted ASM or AAM may provide the starting point for fitting the image to a suitable face model.
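One way this step could be realised (illustrative only; the specification does not prescribe a particular solver or canonical mesh, and the function below is a hypothetical sketch) is to estimate the pose that aligns a canonical 3D face model with the detected 2D landmarks and then project the model back into the image to sample the facial image texture:

    import numpy as np
    import cv2

    def fit_subject_model(landmarks_2d, model_points_3d, image):
        """Estimate the pose of a canonical 3D face model from detected 2D landmarks.

        landmarks_2d    : Nx2 array of detected image landmarks
        model_points_3d : Nx3 array of the corresponding points on the canonical model
        Both arrays are assumed to be in corresponding order.
        """
        h, w = image.shape[:2]
        focal = float(w)  # crude focal-length assumption
        camera_matrix = np.array([[focal, 0, w / 2.0],
                                  [0, focal, h / 2.0],
                                  [0, 0, 1]], dtype=np.float64)
        dist_coeffs = np.zeros((4, 1))  # assume negligible lens distortion

        ok, rvec, tvec = cv2.solvePnP(
            model_points_3d.astype(np.float64),
            landmarks_2d.astype(np.float64),
            camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
        if not ok:
            raise RuntimeError("pose estimation failed")

        # Project the model points back into the image; with a full vertex set, the
        # image colours at the projected positions form the user's facial image texture.
        projected, _ = cv2.projectPoints(model_points_3d.astype(np.float64),
                                         rvec, tvec, camera_matrix, dist_coeffs)
        return rvec, tvec, projected.reshape(-1, 2)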
[0026] Next, the method automatically composites the user's facial image texture into each frame of the video by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model 8 and pixels of the target face 4 respectively. In some embodiments, the user's facial image texture is combined from multiple facial images of the user (eg front and profile views), such that the user's image texture is accurately composited into the video, regardless of the angle of the target face. Subsequent steps of texture blending, eg alpha, pixel intensity, luminescence blending, hue manipulation, applying blur filters, etc, may be implemented.
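One common realisation of this per-frame transfer (an assumption, not the only possibility) is to warp the user's texture triangle by triangle onto the target mesh coordinates in the frame and then blend the seam:

    import numpy as np
    import cv2

    def warp_triangle(src_img, dst_img, tri_src, tri_dst):
        """Affine-warp one texture triangle from src_img into dst_img (in place).

        tri_src / tri_dst: three (x, y) points in the source texture and in the
        target frame, taken from corresponding cells of the subject and target meshes.
        """
        r_src = cv2.boundingRect(np.float32([tri_src]))
        r_dst = cv2.boundingRect(np.float32([tri_dst]))
        src_local = [(p[0] - r_src[0], p[1] - r_src[1]) for p in tri_src]
        dst_local = [(p[0] - r_dst[0], p[1] - r_dst[1]) for p in tri_dst]

        patch = src_img[r_src[1]:r_src[1] + r_src[3], r_src[0]:r_src[0] + r_src[2]]
        m = cv2.getAffineTransform(np.float32(src_local), np.float32(dst_local))
        warped = cv2.warpAffine(patch, m, (r_dst[2], r_dst[3]),
                                flags=cv2.INTER_LINEAR,
                                borderMode=cv2.BORDER_REFLECT_101)

        mask = np.zeros((r_dst[3], r_dst[2], 3), dtype=np.float32)
        cv2.fillConvexPoly(mask, np.int32(dst_local), (1.0, 1.0, 1.0), cv2.LINE_AA, 0)
        roi = dst_img[r_dst[1]:r_dst[1] + r_dst[3], r_dst[0]:r_dst[0] + r_dst[2]]
        roi[:] = roi * (1.0 - mask) + warped * mask

    # The blending mentioned above (alpha blending, seamless cloning, blur filters,
    # etc.) can then soften the result, eg with
    # cv2.seamlessClone(warped_face, frame, face_mask, centre, cv2.NORMAL_CLONE).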
[0027] The method ends by displaying the composited video on the user device 2. The user may also share the composited video on social media platforms.
[0028] In preferred embodiments, the user image processing 40 and video compositing 50 components (ie the steps of identifying the facial landmarks of the user's face, mapping the facial landmarks of the user's face to the 3D subject model and compositing the user's facial image into each video frame) are all performed by the processor of the user device 2. In other embodiments, the user's facial image 11 is uploaded to a server, and these automatic processing steps may instead be performed by an external computing device, eg via cloud computing.
[0029] Figure 2 illustrates a system 100 for replacing faces in videos according to one embodiment, and the associated method steps that may be implemented by and/or on one or more specially programmed computing devices of the system 100. The system 100 may comprise one or more servers 20. A user may interact with the system 100 through the user device 2.
[0030] In some embodiments, the video processing component 30 is performed externally of the user device 2. That is, a provider/curator 22 may select, process and upload videos onto server 20. The user may then access the library of pre-processed videos via a mobile application running on the user device 2, as illustrated in Figure 3a.
[0031] Figures 4a to 4c illustrate exemplary steps of the video processing component 30 in more detail. First, a frame 14 of the video displaying the target face 4 is selected, and the target face 4 is detected. In some cases, the selection of the frame 14 may be automated, for example by facial image recognition of the or a target face 4. This may be implemented as face detection tool 32 of system 100.
[0032] Next, facial landmarks 6 of the target face 4 are identified. These facial landmarks are preferably the same landmarks detected in the user's facial image during user image processing 40. In some embodiments, this step is performed manually by the curator 22. In other embodiments, the facial landmarks may be automatically detected, for example via face-fitting techniques discussed above. However, it will be appreciated that because the target face 4 in the selected frame 14 could be in any orientation, it may be more challenging to apply face-fitting techniques which typically rely on the facial image being in a known orientation, eg a front or profile view. Accordingly, after automatic detection, the curator 22 may review the frame to ensure that the facial landmarks have been correctly identified. If not, as shown in Figures 4b to 4d, the curator may reposition the facial landmarks appropriately. The facial landmarks 6 are fitted to a 3D target model 8, via mesh generation, face-fitting techniques, etc, as described above. The 3D target model 8 has the same parameters as the 3D subject model, eg the same number of nodes, cell type, node number associated with a facial landmark, etc.
[0033] Next, the method automatically tracks the target face 4 and fits the 3D target model to the target face in remaining frames of the video. Automatic tracking may be performed by using the fitted model in the previous frame as the initialising conditions for the current frame, since there is typically minimal change and movement of the target face from frame to frame. After automatic tracking, the curator 22 may review the frames to ensure that the target face 4 has been fitted correctly across the entire video. Accordingly, the time-varying or frame-varying mesh coordinates of the 3D target model may be obtained.
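As one illustrative realisation of this initialisation (the specification only requires that the previous frame's fitted model seeds the current frame; the optical-flow step below is an assumption), the fitted landmark positions could be propagated frame to frame before re-fitting the model:

    import numpy as np
    import cv2

    def track_landmarks(prev_gray, curr_gray, prev_landmarks):
        """Propagate landmark positions to the next frame with Lucas-Kanade optical flow."""
        prev_pts = np.float32(prev_landmarks).reshape(-1, 1, 2)
        curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
            prev_gray, curr_gray, prev_pts, None, winSize=(21, 21), maxLevel=3)
        # Where tracking failed, fall back to the previous position; the curator
        # can correct any residual drift manually, as described above.
        curr_pts = np.where(status.reshape(-1, 1, 1) == 1, curr_pts, prev_pts)
        return curr_pts.reshape(-1, 2)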
[0034] The processed video and the associated 3D target model coordinates for each frame may then be uploaded to server 20 and may subsequently be accessed by the user device 2 for video compositing 50, as described above. Figure 5 is an exemplary composited frame 16 illustrating results from the face replacement method, in which the target face 4 shown in Figure 4 has been replaced with a user's image.
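The per-frame data accompanying a processed video could take a form along the following lines (a hypothetical layout for illustration; the specification does not define a storage format, and the field names are assumptions):

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class FrameModelData:
        frame_index: int
        # Fitted 3D target model vertices (x, y, z) for this frame, in the same
        # node order and count as the 3D subject model so that coordinates and
        # texture pixels can be transferred one-to-one during compositing.
        vertices: List[Tuple[float, float, float]]

    @dataclass
    class PreprocessedVideo:
        video_url: str                # location of the uploaded video on the server
        target_face_id: str           # which target face these coordinates belong to
        frames: List[FrameModelData]  # one entry per frame in which the face appears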
[0035] In some embodiments, the video may comprise two or more different target faces 4a, 4b appearing in multiple frames of the video. The video is processed to map each target face to separate 3D target models. The user may then select, via the user device 2, one of the target faces for compositing with the user's facial image 11, as illustrated in Figures 3b and 3c.
[0036] Figures 3a to 3i are example user interfaces that are displayed on the user device 2 to enable the user to replace a target face in a pre-processed video with the user's facial image. The resulting composited video may be stored on a video content platform for sharing with other users.
[0037] Embodiments of the present invention provide systems and methods that are useful for implementing a face replacement in a video.
[0038] For the purpose of this specification, the word "comprising" means "including but not limited to", and the word "comprises" has a corresponding meaning.
[0039] The above embodiments have been described by way of example only and modifications are possible within the scope of the claims that follow.

Claims

1. A method comprising:
providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model;
receiving, from a user device, a user's facial image;
automatically identifying facial landmarks in the user's facial image;
automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture;
automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively; and
displaying the composited video on the user device.
2. The method of claim 1, wherein pre-processing of the video comprises:
selecting a frame from the video;
detecting the target face in the selected frame;
mapping facial landmarks of the target face to a 3D target model;
automatically tracking the target face and fitting the 3D target model to the target face in remaining frames of the video.
3. The method of claim 2, wherein the target face in the selected frame is automatically detected by facial image recognition.
4. The method of any one of the preceding claims, wherein the steps of identifying the facial landmarks of the user's facial image, mapping the facial landmarks to the 3D subject model and compositing the user's facial image texture into each video frame are performed by a processor of the user device.
5. The method of any one of the preceding claims, wherein the user's facial image is obtained via a camera of the user device.
6. The method of claim 5, further comprising displaying a head positioning guide on the user device while capturing the user's facial image.
7. The method of any one of the preceding claims, comprising receiving front and profile facial images of the user.
8. The method of any one of the preceding claims, wherein the user device comprises a computer, a tablet or a smartphone.
9. The method of any one of the preceding claims, wherein the pre-processed video comprises two or more different target faces appearing in multiple frames of the video, each target face being mapped to a 3D target model, and
wherein the user selects, via the user device, one of the target faces for compositing with the user's facial image.
10. The method of any one of the preceding claims, further comprising processing each composited video frame by texture blending, alpha blending, pixel intensity blending, luminescence blending, hue manipulation, applying blur filters, or a combination thereof.
11. A system, comprising:
a processor; and
a non-transitory computer-readable medium coupled to the processor and having instructions stored thereon, which, when executed by the processor, cause the processor to perform operations comprising:
providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model;
receiving, from a user device, a user's facial image;
automatically identifying facial landmarks in the user's facial image; automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture;
automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively; and
displaying the composited video on the user device.
12. The system of claim 11, wherein the pre-processed video is stored on a server.
13. A non-transitory computer-readable medium having instructions stored thereon, which, when executed by a processor, cause the processor to perform operations comprising:
providing a pre-processed video wherein a target face appearing in multiple frames of the video has been mapped to a 3D target model;
receiving, from a user device, a user's facial image;
automatically identifying facial landmarks in the user's facial image;
automatically mapping the facial landmarks to a 3D subject model to generate a user's facial image texture;
automatically compositing the user's facial image texture into each video frame by transforming coordinates of the 3D subject model and corresponding pixels of the user's facial image texture to coordinates of the 3D target model and pixels of the target face respectively; and
displaying the composited video on the user device.
PCT/AU2017/051353 2016-12-09 2017-12-08 Systems and methods for replacing faces in videos WO2018102880A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2016905100A AU2016905100A0 (en) 2016-12-09 Systems and methods for replacing faces in videos
AU2016905100 2016-12-09

Publications (1)

Publication Number Publication Date
WO2018102880A1 true WO2018102880A1 (en) 2018-06-14

Family

ID=62490568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2017/051353 WO2018102880A1 (en) 2016-12-09 2017-12-08 Systems and methods for replacing faces in videos

Country Status (1)

Country Link
WO (1) WO2018102880A1 (en)

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHENG, Y. ET AL.: "3D-model-based face replacement in video", CONFERENCE PROCEEDINGS ARTICLE, January 2009 (2009-01-01), XP058201254, Retrieved from the Internet <URL:https://www.researchgate.net/publication/220720785> [retrieved on 20150107] *
DALE, K. ET AL.: "Video Face Replacement", YOUTUBE, 12 December 2011 (2011-12-12), Hong Kong, China. Proceedings of the 2011 SIGGRAPH Asia Conference, XP054978747, Retrieved from the Internet <URL:https://www.youtube.com/watch?v=rTvdvNNiCVI> *
GARRIDO, P. ET AL.: "Automatic Face Reenactment", YOUTUBE, 4 May 2016 (2016-05-04), The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4217 - 4224, XP054978748, Retrieved from the Internet <URL:https://www.youtube.com/watch?v=rGiFi4Kqk3s> *
MIN, F. ET AL.: "Automatic Face Replacement in Video Based on 2D Morphable Model", 2010 20TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 23 August 2010 (2010-08-23), Istanbul, Turkey, pages 2250 - 2253, XP031771081 *
NISWAR, A. ET AL.: "Face replacement in video from a single image", PROCEEDING SIGGRAPH ASIA 2012 POSTERS, 28 November 2012 (2012-11-28), Singapore, pages 1, XP058010100 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210314498A1 (en) * 2019-01-18 2021-10-07 Snap Inc. Personalized videos featuring multiple persons
KR102605077B1 (en) 2019-01-18 2023-11-23 스냅 아이엔씨 Methods and systems for compositing realistic head rotations and facial animation on mobile devices
WO2020150692A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Systems and methods for template-based generation of personalized videos
WO2020150686A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Systems and methods for face reenactment
WO2020150690A3 (en) * 2019-01-18 2020-09-10 Snap Inc. Systems and methods for providing personalized videos
US11089238B2 (en) 2019-01-18 2021-08-10 Snap Inc. Personalized videos featuring multiple persons
CN113287118A (en) * 2019-01-18 2021-08-20 斯纳普公司 System and method for face reproduction
CN113302694A (en) * 2019-01-18 2021-08-24 斯纳普公司 System and method for generating personalized video based on template
CN113330453A (en) * 2019-01-18 2021-08-31 斯纳普公司 System and method for providing personalized video for multiple persons
KR20210117304A (en) * 2019-01-18 2021-09-28 스냅 아이엔씨 Methods and systems for realistic head rotations and facial animation compositing on a mobile device
KR20210118428A (en) * 2019-01-18 2021-09-30 스냅 아이엔씨 Systems and methods for providing personalized video
KR20210119440A (en) * 2019-01-18 2021-10-05 스냅 아이엔씨 Systems and methods for creating personalized videos with custom text messages
KR102658104B1 (en) * 2019-01-18 2024-04-17 스냅 아이엔씨 Template-based personalized video creation system and method
WO2020150691A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Systems and methods for providing personalized videos featuring multiple persons
KR20210119439A (en) * 2019-01-18 2021-10-05 스냅 아이엔씨 Template-based personalized video creation system and method
KR102616013B1 (en) 2019-01-18 2023-12-21 스냅 아이엔씨 System and method for creating personalized video with customized text message
KR102546016B1 (en) 2019-01-18 2023-06-22 스냅 아이엔씨 Systems and methods for providing personalized video
US11288880B2 (en) 2019-01-18 2022-03-29 Snap Inc. Template-based generation of personalized videos
US20230049489A1 (en) * 2019-01-18 2023-02-16 Snap Inc. Personalized videos featuring multiple persons
US11394888B2 (en) 2019-01-18 2022-07-19 Snap Inc. Personalized videos
US11558561B2 (en) * 2019-01-18 2023-01-17 Snap Inc. Personalized videos featuring multiple persons
CN111291218A (en) * 2020-01-20 2020-06-16 北京百度网讯科技有限公司 Video fusion method and device, electronic equipment and readable storage medium
CN111291218B (en) * 2020-01-20 2023-09-08 北京百度网讯科技有限公司 Video fusion method, device, electronic equipment and readable storage medium
US11477366B2 (en) 2020-03-31 2022-10-18 Snap Inc. Selfie setup and stock videos creation
US11263260B2 (en) 2020-03-31 2022-03-01 Snap Inc. Searching and ranking modifiable videos in multimedia messaging application
WO2021202042A1 (en) * 2020-03-31 2021-10-07 Snap Inc. Searching and ranking modifiable videos in multimedia messaging application
WO2021202039A1 (en) * 2020-03-31 2021-10-07 Snap Inc. Selfie setup and stock videos creation
CN114666622A (en) * 2022-04-02 2022-06-24 北京字跳网络技术有限公司 Special effect video determination method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2018102880A1 (en) Systems and methods for replacing faces in videos
CN109922355B (en) Live virtual image broadcasting method, live virtual image broadcasting device and electronic equipment
CN108154518B (en) Image processing method and device, storage medium and electronic equipment
EP3457683B1 (en) Dynamic generation of image of a scene based on removal of undesired object present in the scene
US10573018B2 (en) Three dimensional scene reconstruction based on contextual analysis
US9699380B2 (en) Fusion of panoramic background images using color and depth data
US11176355B2 (en) Facial image processing method and apparatus, electronic device and computer readable storage medium
US10762649B2 (en) Methods and systems for providing selective disparity refinement
US20140085398A1 (en) Real-time automatic scene relighting in video conference sessions
US9052740B2 (en) Adaptive data path for computer-vision applications
DE202017105899U1 (en) Camera adjustment adjustment based on predicted environmental factors and tracking systems using them
US10580143B2 (en) High-fidelity 3D reconstruction using facial features lookup and skeletal poses in voxel models
EP3038056A1 (en) Method and system for processing video content
CN113973190A (en) Video virtual background image processing method and device and computer equipment
CN105701762B (en) Picture processing method and electronic equipment
WO2019084712A1 (en) Image processing method and apparatus, and terminal
US9524540B2 (en) Techniques for automatically correcting groups of images
CN111079535B (en) Human skeleton action recognition method and device and terminal
US9171357B2 (en) Method, apparatus and computer-readable recording medium for refocusing photographed image
CN111080546A (en) Picture processing method and device
KR20160062665A (en) Apparatus and method for analyzing motion
US10282633B2 (en) Cross-asset media analysis and processing
JP2017021430A (en) Panoramic video data processing device, processing method, and program
CN110298229B (en) Video image processing method and device
KR20150011714A (en) Device for determining orientation of picture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17877612

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17877612

Country of ref document: EP

Kind code of ref document: A1