WO2023069086A1 - System and method for dynamic portrait relighting - Google Patents

System and method for dynamic portrait relighting

Info

Publication number
WO2023069086A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial
input image
shadow
shading
data
Application number
PCT/US2021/055776
Other languages
French (fr)
Inventor
Celong LIU
Jiang Li
Lingyu Wang
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/055776 priority Critical patent/WO2023069086A1/en
Publication of WO2023069086A1 publication Critical patent/WO2023069086A1/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/50: Lighting effects
    • G06T 15/506: Illumination models
    • G06T 15/60: Shadow generation

Definitions

  • This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for rendering a new image from an existing image based on a lighting condition.
  • BACKGROUND [0002]
  • Existing relighting processes extract lighting information from a reference image and transfer it to a target image.
  • The target image relies on the lighting information of the reference image and cannot be relighted with any arbitrarily defined lighting.
  • The relighting problem has been formulated as a mass transport problem that is solved via a non-linear optimization process and takes an extended time to reach a solution.
  • The light information extracted from the reference image is usually used to represent environment lighting in the target image.
  • Such lighting information cannot be applied to create facial shadows that are normally caused by a point light or a directional light.
  • Additionally, training data often use a controlled light stage setup made of a densely sampled sphere of light, which limits lighting flexibility and causes the training data to have a similar relighting appearance. It would be beneficial to have an efficient image relighting method to render a new image from an existing image based on a lighting condition (particularly, to render a new portrait image based on an arbitrary lighting condition).
  • SUMMARY [0003] Various embodiments of this application are directed to generating relighting effects for input images (e.g., human portraits). This application expands the types of applicable lighting conditions compared with existing relighting processes, which have been limited to the lighting conditions of reference images.
  • Shadowing and shading on the face are physically correct and realistic. Particularly, relit areas include facial regions, body regions, and a background, while skin color is protected during relighting.
  • In some embodiments, a coarse three-dimensional (3D) face mesh is applied as a proxy model to generate face shadows and shading. Application of the coarse mesh is important to allow the algorithm to run in real time on a computer system having limited resources (e.g., a mobile phone).
  • In some embodiments, a deep learning based real-time style transfer is used to relight the input image, while skin color is protected from relighting.
  • In one aspect, a method is implemented at an electronic device for relighting an input image.
  • The method includes reconstructing a 3D face model from a face image of the input image based on a plurality of facial landmarks, rendering the 3D face model in a predefined lighting condition with a facial shadow and shading, blending the facial shadow and shading with the input image, and updating a non-facial portion of the input image with a style that matches the predefined lighting condition.
  • In some embodiments, blending the facial shadow and shading with the input image further includes expanding a distribution of the facial shadow and shading along a boundary of the facial shadow and shading to soften the boundary, and layering the expanded facial shadow and shading with the softened boundary with the input image according to a position of the 3D face model.
  • In some embodiments, updating the non-facial portion of the input image with the style that matches the predefined lighting condition further includes adjusting a color style of the non-facial portion of the input image according to the predefined lighting condition and using a style-transfer neural network model.
  • In another aspect, some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.
  • Figure 5 is a flow chart of a process of relighting an input image including a facial portion, in accordance with some embodiments.
  • Figure 6 is a structural diagram of a Cycle-generative adversarial network (Cycle-GAN), in accordance with some embodiments.
  • Figure 7A shows an input image and three output images to which the input image is converted with first lighting effects (e.g., leaf light, window light, contour light), in accordance with some embodiments.
  • Figure 7B shows an input image and two output images to which the input image is converted with second lighting effects (e.g., morning and dusk lights), in accordance with some embodiments.
  • Figure 7C shows an input image and two output images to which the input image is converted with arbitrary lighting effects, in accordance with some embodiments.
  • Figure 8 is a flowchart of a method for relighting an input image, in accordance with some embodiments.
  • Like reference numerals refer to corresponding parts throughout the several views of the drawings.
  • DETAILED DESCRIPTION [0020]
  • In some embodiments, the systems and methods disclosed herein use a lightweight face renderer to render the face shadow and shading by incorporating the 3D face model and the desired lighting. The shadow and shading are blended with the input image to achieve the face relighting effects.
  • In some embodiments, the systems and methods disclosed herein generate the shadow of the body when the input portrait contains the full body of the person and the designed lighting is a point light or a directional light.
  • In some embodiments, the systems and methods disclosed herein automatically detect the skin region in the input portrait. When relighting a person, the skin color is protected in order to avoid unrealistic color changes in the skin area.
  • In some embodiments, the systems and methods disclosed herein are used in a camera or a camera application on a mobile phone.
  • A user can change the lighting condition to his or her preferred style when a portrait photo or video is taken.
  • The user can also adjust the lighting in the preview mode and choose the best lighting before taking the photo or video.
  • With this technology, the use cases of the camera are enriched. For example, when a user finds a good scene but the lighting condition is poor, the user can still take a photo or video and then use this relighting technology to change the lighting to a better one. This feature has previously been missing from most existing mobile cameras and photo applications.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs are processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • The data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • The storage 106 may store video content for training a machine learning model (e.g., a deep learning network) and/or video content obtained by a user to which a trained machine learning model is applied to determine one or more actions associated with the video content.
  • The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • In some embodiments, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • In some embodiments, the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
  • The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communication links between these devices and computers connected together within the data processing environment 100.
  • The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • The one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • Data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • The client device 104C obtains the content data (e.g., captures image or video data via an internal camera) and processes the content data using the trained data processing models locally.
  • In some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A).
  • The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • The client device 104A itself implements little or no data processing on the content data prior to sending it to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • The trained data processing models are optionally stored in the server 102B or the storage 106.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • The data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • The client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • The client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • The client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
  • Memory 206 optionally includes one or more storage devices remotely located from the one or more processing units 202.
  • Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer-readable storage medium.
  • Memory 206 stores the following programs, modules, and data structures, or a subset or superset thereof:
      ◦ Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
      ◦ Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
      ◦ User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
      ◦ Input processing module 220 for detecting one or more user
  • In some embodiments, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • In some embodiments, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and the storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video data, visual data, audio data), in accordance with some embodiments.
  • The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • In some embodiments, both the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • The training data source 304 is optionally a server 102 or storage 106.
  • In some embodiments, both the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and a client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, and a data pre-processing module 308 is applied to process the training data 306 consistently with the type of the content data.
  • A video pre-processing module 308 is configured to process video training data 306 into a predefined image format, e.g., to group frames (e.g., video frames, visual frames) of the video content into video segments.
  • The data pre-processing module 308 may also extract a region of interest (ROI) in each frame or separate a frame into foreground and background components, and crop each frame to a predefined image size.
  • The model training engine 310 receives pre-processed training data provided by the data pre-processing module(s) 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • The loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • The modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations).
  • In some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled.
  • The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • The data processing module 228 includes one or more data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318.
  • The data pre-processing modules 314 pre-process the content data based on the type of the content data.
  • Functions of the data pre-processing modules 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is accepted by the inputs of the model-based processing module 316.
  • The content data include one or more of: video data, visual data (e.g., image data), audio data, textual data, and other types of data.
  • Each video is pre-processed to group frames in the video into video segments.
  • The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • The processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that is derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the neural network 400, in accordance with some embodiments.
  • The data processing model 240 is established based on the neural network 400.
  • A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • The neural network 400 includes a collection of nodes 420 that are connected by links 412.
  • Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • A weight w associated with each link 412 is applied to the node output.
  • The one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • The propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
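  • As an illustrative sketch (not code from this application), the propagation function described above can be written in a few lines of Python; the sigmoid activation and the example weights below are assumptions chosen for demonstration.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Propagation function of a single node: a non-linear activation
    applied to a linear weighted combination of the node inputs."""
    z = np.dot(weights, inputs) + bias        # linear weighted combination
    return 1.0 / (1.0 + np.exp(-z))           # sigmoid activation (one possible choice)

# A node with four inputs combined by weights w1..w4.
x = np.array([0.2, 0.5, -0.1, 0.8])
w = np.array([0.4, -0.3, 0.9, 0.1])
print(node_output(x, w))
```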
  • In some embodiments, the one or more layers include a single layer acting as both an input layer and an output layer.
  • In some embodiments, the one or more layers include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input layer 402 and the output layer 406.
  • A deep neural network has more than one hidden layer 404 between the input layer 402 and the output layer 406.
  • Each layer may be connected only with its immediately preceding and/or immediately following layer.
  • A layer 402 or 404B is a fully connected neural network layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for downsampling or pooling the nodes 420 between these two layers.
  • For example, max pooling uses the maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data).
  • The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • The one or more hidden layers of the CNN are convolutional layers that convolve with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • In some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, visual data and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • Each node 420 of the RNN has a time-varying real-valued activation.
  • Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • A generative neural network is trained by providing it with a large amount of data (e.g., millions of images, sentences, or sounds) and training it to generate data like the input data.
  • The generative neural networks have a significantly smaller number of parameters than the amount of input training data, so the generative neural networks are forced to find and efficiently internalize the essence of the data in order to generate data.
  • Two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
  • The training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set that is provided to the input layer 402.
  • The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • During forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • During backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
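  • As a minimal sketch of the forward/backward procedure described above (assuming a tiny single-layer linear model, a mean-squared-error loss, and plain gradient descent, none of which are mandated by this application):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))                 # training inputs provided to the input layer
y = rng.normal(size=(32, 1))                 # ground-truth outputs
W = rng.normal(scale=0.1, size=(4, 1))       # weights to be calibrated

lr = 0.05
for step in range(200):
    pred = X @ W                             # forward propagation: apply the weights
    loss = np.mean((pred - y) ** 2)          # measure the margin of error (loss)
    grad = 2.0 * X.T @ (pred - y) / len(X)   # backward propagation: gradient of the loss
    W -= lr * grad                           # adjust the weights to decrease the error
    if loss < 1e-3:                          # a simple convergence condition
        break
```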
  • FIG. 5 is a flow chart of a process 500 of relighting an input image including a facial portion, in accordance with some embodiments.
  • An operation of real-time 3D face reconstruction and tracking 510 is implemented.
  • A 3D face is used as a proxy model so that a 3D rendering process can be utilized to generate shadow and shading. Facial details (e.g., wrinkles) are not needed.
  • A shape, an expression, and an orientation of a 3D face model are obtained by a deep learning model from input images or frames of input video data.
  • The deep learning models/neural networks are shown in Figures 1, 3, 4A, and 4B.
  • The expression is represented by a change in the geometry of the facial portion, such as how much the mouth is opened.
  • A coarse mesh with the correct face shape is obtained.
  • A 3D parametric (multilinear) face model is fitted by aligning a plurality of landmarks (e.g., nose tip, eye corners) between the input images or frames of the input video data and the face model.
  • The multilinear face model separably parameterizes the geometric space variations of 3D face meshes associated with different attributes (e.g., identity, expression, and mouth shape).
  • Each of the different attributes is varied independently.
  • A multilinear face model is estimated from a Cartesian product of examples (identities × expressions × mouth shapes) using statistical analysis. Preprocessing of the geometric attribute data samples to secure the one-to-one relationship is needed before the Cartesian process. In some embodiments, the preprocessing includes minimizing cross-coupling artifacts and filling in any missing examples. For a series of input pictures or frames of the input video data, meshes are calculated for every key point of the face.
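  • The landmark-based fitting can be sketched as a small least-squares problem. The linear identity/expression bases, the weak-perspective projection, the 68-landmark layout, and the regularization weight below are illustrative assumptions, not the specific multilinear model of this application.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical linear face basis: vertices = MEAN + B_ID @ a_id + B_EXP @ a_exp,
# flattened as (3N,) vectors. LMK_IDX indexes the mesh vertices used as landmarks.
N, K_ID, K_EXP = 500, 10, 5
rng = np.random.default_rng(0)
MEAN = rng.normal(size=3 * N)
B_ID = rng.normal(scale=0.01, size=(3 * N, K_ID))
B_EXP = rng.normal(scale=0.01, size=(3 * N, K_EXP))
LMK_IDX = rng.choice(N, size=68, replace=False)        # e.g., nose tip, eye corners, ...
lmk_2d = rng.normal(size=(68, 2))                      # detected 2D landmarks (placeholder)

def residuals(params):
    s, tx, ty = params[:3]                             # weak-perspective scale + translation
    a_id, a_exp = params[3:3 + K_ID], params[3 + K_ID:]
    verts = (MEAN + B_ID @ a_id + B_EXP @ a_exp).reshape(N, 3)
    proj = s * verts[LMK_IDX, :2] + np.array([tx, ty]) # project the landmark vertices
    fit = (proj - lmk_2d).ravel()                      # alignment error to detected landmarks
    reg = 0.1 * np.concatenate([a_id, a_exp])          # keep coefficients plausible
    return np.concatenate([fit, reg])

x0 = np.zeros(3 + K_ID + K_EXP); x0[0] = 1.0
fit = least_squares(residuals, x0)
print("fitted scale:", fit.x[0])
```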
  • A facial shadow and shading generation 520 process is implemented after the real-time 3D face reconstruction and tracking 510. Shadows exist because light is blocked. Shadings are changes of the color of a surface and reflect how light is reflected on the surface. With the abovementioned reconstructed 3D face, the 3D face model is rendered with a rendering engine along with a desired lighting condition. A lightweight renderer is developed for this purpose.
  • Deferred shading is used to control the demand for computation resources required to render the lighting.
  • The renderer outputs a facial shadow map and a shading distribution in this operation 520.
  • A percentage closer soft shadow (PCSS) method is used to simulate shadows, and edges are softened based on different types of light sources.
  • The PCSS method is used for generating perceptually accurate soft shadows by replacing a typical shadow mapping pixel shader with a PCSS shader.
  • PCSS searches a shadow map and determines the search region based on a light source size and the distance of points that are being shadowed from the light source.
  • PCSS computes a variable kernel/region size based on the relative positions of the shadowed point, an approximation of the blockers within the search region, and the area light. When the edges are soft, the edges change gradually.
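  • A sketch of the PCSS filter-size estimate described above; the blocker search and penumbra formula follow the widely used PCSS formulation, while the uniform square search window and the constants are illustrative assumptions.

```python
import numpy as np

def pcss_filter_radius(shadow_map, uv, receiver_depth, light_size, search_radius=5):
    """Estimate a soft-shadow filter radius for one shaded point.

    shadow_map: 2D array of depths as seen from the light.
    uv: integer (row, col) of the point in the shadow map.
    receiver_depth: depth of the shaded point as seen from the light.
    """
    r, c = uv
    region = shadow_map[max(r - search_radius, 0): r + search_radius + 1,
                        max(c - search_radius, 0): c + search_radius + 1]
    blockers = region[region < receiver_depth]           # samples closer to the light
    if blockers.size == 0:
        return 0.0                                        # fully lit: no soft filtering needed
    d_blocker = blockers.mean()                           # average blocker depth
    # Penumbra width grows with the light size and the blocker-receiver separation.
    penumbra = (receiver_depth - d_blocker) * light_size / d_blocker
    return penumbra                                       # used as the PCF kernel radius
```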
  • A microfacet Disney bidirectional reflectance distribution function (BRDF) is used.
  • The microfacet BRDF model is a shading model established based on the Disney model, which compares many options for each term and uses a common set of input parameters.
  • The microfacet BRDF model evaluates the importance of each term compared with other efficient alternatives.
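  • For reference, a generic microfacet specular BRDF, of which the Disney model's specular lobe is one particular parameterization, has the following form; the Lambertian diffuse term shown here is a simplification and is not the Disney diffuse term.

```latex
f_r(\mathbf{l},\mathbf{v}) = \frac{\rho_d}{\pi}
  + \frac{D(\mathbf{h})\,F(\mathbf{v},\mathbf{h})\,G(\mathbf{l},\mathbf{v},\mathbf{h})}
         {4\,(\mathbf{n}\cdot\mathbf{l})\,(\mathbf{n}\cdot\mathbf{v})},
\qquad \mathbf{h} = \frac{\mathbf{l}+\mathbf{v}}{\lVert\mathbf{l}+\mathbf{v}\rVert}
```

  • Here D is the microfacet normal distribution, F the Fresnel term, G the geometric shadowing-masking term, n the surface normal, and l and v the light and view directions.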
  • A process of facial shadow and shading blending 530 is implemented after the facial shadow and shading generation 520 process. The facial shadow and shading are generated and blended with the input image.
  • The quotient image method is an image-based, class-based identification and re-rendering method.
  • The quotient image approach extracts, under general conditions and from a small set of sample images, an illumination-invariant quotient image for a novel object of a class from only a single input image.
  • The quotient image of two objects is defined by the ratio of their albedo (surface texture) functions.
  • The quotient image is illumination invariant.
  • The quotient image can still be recovered, analytically, by bootstrapping from a set of images. In some instances, once the quotient image is recovered, the entire image space of the first object under varying lighting conditions is generated from the quotient image and three images of the second object.
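  • In the standard quotient-image formulation (an assumed reference for the method named above), the quotient image of a novel object y against a bootstrap object a, and a relit image of y under a new light s, can be written as:

```latex
Q_y(u,v) = \frac{\rho_y(u,v)}{\rho_a(u,v)},
\qquad
I_y^{\,s}(u,v) = Q_y(u,v)\,\sum_{j=1}^{3} x_j\, a_j(u,v)
```

  • Here the rho terms are the albedo (surface texture) functions, the a_j are three images of the bootstrap object under linearly independent lighting, and the x_j are the coefficients of the new lighting in that basis.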
  • A process of style transfer and skin color protection 540 is implemented after the facial shadow and shading blending 530 process.
  • Style transfer combines the content of one image with the style of another image. In some instances, in a relighting task, changing only the facial region generates unpleasant results. A better process should update and transfer the parts of the body and background other than the facial region according to the new light source.
  • A deep style transfer neural network is trained to adjust the color style of those regions.
  • The deep learning models/neural networks used for deep style transfer are shown in Figures 1, 3, 4A, and 4B.
  • The training data is a collection of images, and each image has a lighting style label.
  • The lighting style labels are created manually or by software.
  • The lighting style of each image is collected.
  • The deep learning model can utilize all types of images.
  • A generative neural network is trained using the photo dataset.
  • In some embodiments, a Cycle-GAN is trained on the collection of images. In some instances of the deep learning process, a human face is not needed.
  • FIG. 6 is a structural diagram of an example Cycle-GAN 600, in accordance with some embodiments.
  • An adversarial loss is utilized in the Cycle-GAN 600 to learn the mapping such that the translated image cannot be distinguished from the target images.
  • Mapping functions are learned between two domains S (602) and T (604) given training samples.
  • Two mappings X (606): S → T and Y (608): T → S are included.
  • The learned mapping functions are cycle-consistent in the Cycle-GAN. For example, for each image, such as the image 610 from domain S, the image translation cycle should be able to bring it back to the original image, to achieve forward cycle consistency.
  • A cycle consistency loss is utilized as a way of using transitivity to supervise the training in some instances.
  • Through the mapping X (606), the sample 610 in the domain S becomes the sample 612 in the domain T, and through the mapping Y (608), the sample 612 in the domain T becomes the sample 616 in the domain S.
  • The difference between the samples 610 and 616 is represented as a cycle consistency loss 618.
  • Similarly, through the mapping Y (608), the sample 614 in the domain T becomes the sample 620 in the domain S, and through the mapping X (606), the sample 620 in the domain S becomes the sample 622 in the domain T.
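  • Using the notation of Figure 6 (mappings X: S → T and Y: T → S, with discriminators D_T and D_S), the standard Cycle-GAN objective combines the adversarial losses with the cycle consistency loss 618; the weight lambda is a tunable hyperparameter and is not specified by this application.

```latex
\mathcal{L}_{\mathrm{cyc}}(X,Y) =
\mathbb{E}_{s\sim S}\big[\lVert Y(X(s)) - s\rVert_1\big] +
\mathbb{E}_{t\sim T}\big[\lVert X(Y(t)) - t\rVert_1\big],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{GAN}}(X, D_T) + \mathcal{L}_{\mathrm{GAN}}(Y, D_S)
            + \lambda\,\mathcal{L}_{\mathrm{cyc}}(X,Y)
```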
  • The style-transferred images usually have similar color changes in the skin region and the non-skin region.
  • The skin region is segmented out of the input image based on the color distribution.
  • The segmented skin region receives a decayed or lessened effect of the transfer; for example, the skin region has less dramatic color changes compared with the other parts of the image.
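  • A minimal sketch of the skin color protection step, assuming a soft skin mask has already been obtained from the color-distribution segmentation; the decay factor of 0.3 is an illustrative choice, not a value specified by this application.

```python
import numpy as np

def protect_skin(original, stylized, skin_mask, decay=0.3):
    """Blend the style-transferred image back toward the original inside the skin region.

    original, stylized: float arrays of shape (H, W, 3) in [0, 1].
    skin_mask: float array of shape (H, W) in [0, 1], 1 inside the skin region.
    decay: fraction of the style change kept on the skin (smaller = more protection).
    """
    m = skin_mask[..., None]
    skin_result = original + decay * (stylized - original)    # lessened transfer on skin
    return m * skin_result + (1.0 - m) * stylized              # full transfer elsewhere
```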
  • An optional process of generating a body shadow and different lighting effects 550 is implemented after the style transfer and skin color protection 540 process.
  • A shadow of the human body cast on a nearby structure is desired for aesthetic purposes.
  • In some embodiments, the body shadow is achieved by a body mask projection.
  • The body shape is differentiated or segmented from the background of the picture.
  • The body shape is segmented from the background of the picture according to a 3D body model. With the designed light position or direction, the segmented body mask is projected onto the nearby structure, such as a wall or the ground. The projected body mask is used to simulate the body shadow on the wall, for example, by changing the intensity of the wall pixels that are within the projected body mask.
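  • A crude two-dimensional sketch of the body mask projection: the segmented body mask is shifted along an image-space approximation of the light direction and used to darken the background pixels it covers. The offset and the darkening factor are illustrative assumptions.

```python
import numpy as np

def cast_body_shadow(image, body_mask, light_dir=(40, 25), darken=0.55):
    """image: float (H, W, 3); body_mask: bool (H, W), True on the person.

    light_dir: (dy, dx) image-space offset approximating the light direction.
    """
    dy, dx = light_dir
    h, w = body_mask.shape
    shadow = np.zeros_like(body_mask)
    ys, xs = np.nonzero(body_mask)
    ys2, xs2 = np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)
    shadow[ys2, xs2] = True                    # projected body mask
    shadow &= ~body_mask                       # only darken the background, not the person
    out = image.copy()
    out[shadow] *= darken                      # lower the intensity inside the projected mask
    return out
```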
  • Figure 7A shows an input image 702 and three output images 704-708 to which the input image is converted with first lighting effects (e.g., leaf light, window light, contour light), in accordance with some embodiments.
  • The output images 704, 706, and 708 are rendered from the input image 702 under leaf light, window light, and contour light, respectively.
  • These are a set of lighting conditions with specific styles.
  • The leaf light simulates the lighting under a tree, and the window light simulates the lighting near a window.
  • The shadow is cast onto both the wall and the person.
  • A 3D parametric body model (e.g., a skinned multi-person linear model (SMPL)) is fitted to the input image 702 to estimate the correct shadow boundary given a light source.
  • The parameters of the 3D parametric body model are learned from data and include a human pose template, blend weights, and blend shapes, as well as a regression from vertices to joint locations.
  • SMPL is a realistic 3D human body model that is based on skinning and blend shapes and is learned from a large quantity of 3D body scans.
  • SMPL is a skinned vertex-based model trained on different people in different human poses.
  • The 3D body model is very coarse with simple shapes.
  • The 3D body model includes a face.
  • The image 708 with contour light highlights the person and darkens the background region in a seamless way.
  • Figure 7B shows an input image 710 and two output images 712 and 714 to which the input image 710 is converted with second lighting effects (e.g., morning and dusk lights), in accordance with some embodiments.
  • The output images 712 and 714 are rendered from the input image with morning light and dusk light, respectively. These two light effects are used to change the color style of the photo or to make the photo brighter.
  • The output image 712 with morning light has a white style and is bright.
  • The output image 714 with dusk light is a bit yellow and has lower brightness.
  • Figure 7C shows an input image 720 and three output images 722-726 to which the input image 720 is converted with arbitrary lighting effects, in accordance with some embodiments.
  • Each output image 722, 724, or 726 is associated with a respective arbitrary lighting condition defined by a unique combination of direction, color, intensity distribution, and other characteristics of light.
  • The user selects a light direction, color, intensity distribution, and other characteristics of the light to apply an arbitrary lighting effect to the input image 720.
  • A user selects specific lighting conditions to apply in a user interface of an application and/or a user device.
  • Figure 8 is a flowchart of a method 800 for relighting an input image, in accordance with some embodiments.
  • The method 800 is implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a mobile phone.
  • Method 800 is, in some embodiments, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or another instruction format that is interpreted by one or more processors. Some operations in the method 800 may be combined and/or the order of some operations may be changed.
  • The computer system reconstructs (810) a 3D face model from a face image of the input image based on a plurality of facial landmarks, and renders (820) the 3D face model in a predefined lighting condition with a facial shadow and shading.
  • The computer system blends (830) the facial shadow and shading with the input image, and updates (840) a non-facial portion of the input image with a style that matches the predefined lighting condition.
  • In some embodiments, the 3D face model includes a face shape, and the operation (810) of reconstructing the 3D face model from the input image further includes fitting the 3D face model by aligning a plurality of first facial feature landmarks of the input image and a plurality of second facial feature landmarks of the 3D face model.
  • Some exemplary facial feature landmarks include the nose tip, eye corners, etc.
  • In some embodiments, the 3D face model is reconstructed using a machine learning model that is trained in a supervised manner using one or more sample images, and each sample image is annotated with one or more ground-truth facial feature landmarks.
  • In some embodiments, the input image is followed by a second image in a video clip, and the 3D face model reconstructed from the input image includes a first 3D face model.
  • The process of relighting the input image further includes tracking a variation of the plurality of facial landmarks from the input image to the second image and reconstructing a second 3D face model for the second image based on the first 3D face model and the variation of the plurality of facial landmarks.
  • The second 3D face model is rendered in a corresponding predefined lighting condition.
  • In some embodiments, the facial shadow and shading are represented by a facial shadow map and a facial shading distribution, respectively, and the operation (820) of rendering the 3D face model in the predefined lighting condition with the facial shadow and shading further includes: generating the facial shadow map using a percentage closer soft shadow (PCSS) method; and generating the facial shading distribution using a Disney bidirectional reflectance distribution function (BRDF).
  • In some embodiments, the operation (830) of blending the facial shadow and shading with the input image includes softening the facial shadow map along a boundary of the 3D face model, softening the facial shading distribution along the boundary of the 3D face model, and blending the softened facial shadow map and facial shading distribution with the input image.
  • The blending is performed on a pixel basis.
  • In some embodiments, the operation (830) of blending the facial shadow and shading with the input image further includes expanding a distribution of the facial shadow and shading along a boundary of the facial shadow and shading to soften the boundary, and layering the expanded facial shadow and shading, with the softened boundary, with the input image according to a position of the 3D face model.
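  • A sketch of operation 830 under simple assumptions: the facial shadow map and shading distribution are Gaussian-blurred to expand and soften their boundaries, and then layered onto the input image as per-pixel multipliers at the position of the rendered 3D face. The blur radius and shadow strength below are illustrative, not values specified by this application.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blend_shadow_and_shading(image, shadow, shading, face_mask, sigma=4.0, strength=0.6):
    """image: float (H, W, 3) in [0, 1]; shadow in [0, 1] (1 = fully shadowed);
    shading: relative brightness; face_mask: float (H, W) alpha of the rendered 3D face."""
    soft_shadow = gaussian_filter(shadow, sigma)            # expand/soften the boundary
    soft_shading = gaussian_filter(shading, sigma)
    relit = image * soft_shading[..., None]                 # apply shading per pixel
    relit *= (1.0 - strength * soft_shadow)[..., None]      # darken shadowed pixels
    a = face_mask[..., None]                                # layer according to face position
    return a * relit + (1.0 - a) * image
```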
  • In some embodiments, the operation (840) of updating the non-facial portion of the input image with the style that matches the predefined lighting condition further includes adjusting a color style of the non-facial portion of the input image according to the predefined lighting condition using a style-transfer neural network model.
  • The style-transfer neural network is trained in a supervised manner to adjust color styles of non-facial portions from a set of training images under a plurality of predefined lighting conditions.
  • The non-facial portion of the input image includes at least one of other non-facial body parts and a background of the input image.
  • In some embodiments, the non-facial portion includes a skin region and a non-skin region, and the operation (840) of updating the non-facial portion of the input image with the style that matches the predefined lighting condition further includes: segmenting the skin region from the non-facial portion of the input image; and updating the skin region of the non-facial portion with a decayed effect of the style that matches the predefined lighting condition compared with the non-skin region of the non-facial portion.
  • Segmenting the skin region from the non-facial portion of the input image is based on a color distribution.
  • As a result, the skin region has a color closer to that of the input image.
  • In some embodiments, the process of relighting the input image further includes: in accordance with a determination that the input image includes a body and the predefined lighting condition utilizes a point light or a directional light, casting a body shadow on an adjacent structure.
  • The body is a half body or a full body.
  • The adjacent structure is a wall or a floor.
  • In some embodiments, casting the body shadow on the adjacent structure includes segmenting a body shape as a mask and projecting the mask onto the adjacent structure.
  • A 3D parametric body model is fitted to the input image to estimate the body shadow corresponding to the predefined lighting condition.
  • In some embodiments, the predefined lighting condition is one selected from the group consisting of morning light, dusk light, and arbitrary light.
  • Arbitrary light is defined by a direction, color, intensity distribution, etc.
  • The term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • The phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

Abstract

A method is implemented at an electronic device for relighting an input image. The electronic device reconstructs a 3D face model from a face image of the input image based on a plurality of facial landmarks, renders the 3D face model in a predefined lighting condition with a facial shadow and shading, blends the facial shadow and shading with the input image, and updates a non-facial portion of the input image with a style that matches the predefined lighting condition. In some embodiments, when the non-facial portion of the input image is updated, a color style of the non-facial portion of the input image is adjusted according to the predefined lighting condition and using a style-transfer neural network model. The predefined lighting condition corresponds to one of: leaf light, window light, contour light, morning light, dusk light, and arbitrary light.

Description

System and Method for Dynamic Portrait Relighting

TECHNICAL FIELD

[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for rendering a new image from an existing image based on a lighting condition.

BACKGROUND

[0002] Existing relighting processes extract lighting information from a reference image and transfer it to a target image. The target image relies on the lighting information of the reference image and cannot be relighted with any arbitrarily defined lighting. The relighting problem has been formulated as a mass transport problem that is solved via a non-linear optimization process and takes an extended time to reach a solution. The light information extracted from the reference image is usually used to represent environment lighting in the target image. Such lighting information cannot be applied to create facial shadows that are normally caused by a point light or a directional light. Additionally, training data often use a controlled light stage setup made of a densely sampled sphere of light, which limits lighting flexibility and causes the training data to have a similar relighting appearance. It would be beneficial to have an efficient image relighting method to render a new image from an existing image based on a lighting condition (particularly, to render a new portrait image based on an arbitrary lighting condition).

SUMMARY

[0003] Various embodiments of this application are directed to generating relighting effects for input images (e.g., human portraits). This application expands the types of applicable lighting conditions compared with existing relighting processes, which have been limited to the lighting conditions of reference images. Shadowing and shading on the face are physically correct and realistic. Particularly, relit areas include facial regions, body regions, and a background, while skin color is protected during relighting. In some embodiments, a coarse three-dimensional (3D) face mesh is applied as a proxy model to generate face shadows and shading. Application of the coarse mesh is important to allow the algorithm to run in real time on a computer system having limited resources (e.g., a mobile phone). In some embodiments, a deep learning based real-time style transfer is used to relight the input image, while skin color is protected from relighting.

[0004] In one aspect, a method is implemented at an electronic device for relighting an input image. The method includes reconstructing a 3D face model from a face image of the input image based on a plurality of facial landmarks, rendering the 3D face model in a predefined lighting condition with a facial shadow and shading, blending the facial shadow and shading with the input image, and updating a non-facial portion of the input image with a style that matches the predefined lighting condition. In some embodiments, blending the facial shadow and shading with the input image further includes expanding a distribution of the facial shadow and shading along a boundary of the facial shadow and shading to soften the boundary, and layering the expanded facial shadow and shading with the softened boundary with the input image according to a position of the 3D face model. In some embodiments, updating the non-facial portion of the input image with the style that matches the predefined lighting condition further includes adjusting a color style of the non-facial portion of the input image according to the predefined lighting condition and using a style-transfer neural network model.

[0005] In another aspect, some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0006] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0008] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0009] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.

[0010] Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0011] Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.

[0012] Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.

[0013] Figure 5 is a flow chart of a process of relighting an input image including a facial portion, in accordance with some embodiments.

[0014] Figure 6 is a structural diagram of a Cycle-generative adversarial network (Cycle-GAN), in accordance with some embodiments.

[0015] Figure 7A shows an input image and three output images to which the input image is converted with first lighting effects (e.g., leaf light, window light, contour light), in accordance with some embodiments.

[0016] Figure 7B shows an input image and two output images to which the input image is converted with second lighting effects (e.g., morning and dusk lights), in accordance with some embodiments.

[0017] Figure 7C shows an input image and two output images to which the input image is converted with arbitrary lighting effects, in accordance with some embodiments.

[0018] Figure 8 is a flowchart of a method for relighting an input image, in accordance with some embodiments.

[0019] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0020] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with image or video processing capabilities.

[0021] Systems and methods for relighting portrait photos or videos under different lighting conditions are disclosed herein.

[0022] In some embodiments, the systems and methods disclosed herein reconstruct a 3D face model from the input image(s) and track this 3D face in the following frames if the input is a video. In some embodiments, the systems and methods disclosed herein use a lightweight face renderer to render the face shadow and shading by incorporating the 3D face model and the desired lighting. The shadow and shading are blended with the input image to achieve the face relighting effects. In some embodiments, the systems and methods disclosed herein generate the shadow of the body when the input portrait contains the full body of the person and the designed lighting is a point light or a directional light. In some embodiments, the systems and methods disclosed herein automatically detect the skin region in the input portrait. When relighting a person, the skin color is protected in order to avoid unrealistic color changes in the skin area.

[0023] In some embodiments, the systems and methods disclosed herein are used in a camera or a camera application on a mobile phone. A user can change the lighting condition to his or her preferred style when a portrait photo or video is taken. The user can also adjust the lighting in the preview mode and choose the best lighting before taking the photo or video. With this technology, the use cases of the camera are enriched. For example, when a user finds a good scene but the lighting condition is poor, the user can still take a photo or video and then use this relighting technology to change the lighting to a better one. This feature has previously been missing from most existing mobile cameras and photo applications.

[0024] Before the embodiments of the present application are further described in detail, names and terms involved in the embodiments of the present application are described, and the names and terms involved in the embodiments of the present application have the following explanations. Specifically, in various embodiments of this application, facial shadow synthesis is implemented for a given portrait image to generate shadows on the face under specific lighting conditions. Skin segmentation is conducted to segment the skin area of a person in a portrait image. Human area segmentation is applied for segmenting the human area in a portrait image. For a given portrait video, a three-dimensional (3D) face model is aligned to the person in the video frame by frame. The color style of an image is transferred to another style. Two images are blended seamlessly.

[0025] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs are processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store video content for training a machine learning model (e.g., deep learning network) and/or video content obtained by a user to which a trained machine learning model is applied to determine one or more actions associated with the video content. [0026] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and providing the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104 to monitor the events occurring near the networked surveillance camera in the real time and remotely. [0027] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communication links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. 
A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages. [0028] Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C). The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequently to model training, the client device 104C obtains the content data (e.g., captures image or video data via an internal camera) and processes the content data using the training data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A). The server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application). The client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally. 
[0029] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice- command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104. [0030] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof: ● Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks; ● Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on; ● User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) 
at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.); ● Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction; ● Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account; ● One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices); ● Model training module 226 for receiving training data (.g., training data 238) and establishing a data processing model (e.g., data processing module 228) for processing content data (e.g., video data, visual data, audio data) to be collected or obtained by a client device 104; ● Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224; ● One or more databases 230 for storing at least data including one or more of: o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104; o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings; o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name; o Training data 238 for training one or more data processing models 240; o Data processing model(s) 240 for processing content data (e.g., video data, visual data, audio data) using deep learning techniques; and o Content data and results 242 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results 242 to be presented on client device 104. [0031] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively. 
[0032] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above. [0033] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video data, visual data, audio data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct form the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104. [0034] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, and a data pre-processing module 308 applied to process the training data 306 consistent with the type of the content data. For example, a video pre-processing module 308 is configured to process video training data 306 to a predefined image format, e.g., group frames (e.g., video frames, visual frames) of the video content into video segments. In another example, the data pre-processing module 308 may also extract a region of interest (ROI) in each frame or separate a frame into foreground and background components, and crop each frame to a predefined image size. The model training engine 310 receives pre-processed training data provided by the data pre- processing module(s) 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. 
The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data. [0035] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled. [0036] The data processing module 228 includes a data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318. The data pre- processing modules 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre- processing modules 308 and covert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video data, visual data (e.g., image data), audio data, textual data, and other types of data. For example, each video is pre-processed to group frames in the video into video segments. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre- processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that is derived from the processed content data. [0037] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. 
In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs. [0038] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input layer 402 and the output layer 406. A deep neural network has more than one hidden layer 404 between the input layer 402 and the output layer 406. In the neural network 400, each layer may be only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected neural network layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes. [0039] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis. [0040] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, visual data and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. 
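By way of illustration, the per-node computation described above (a linear weighted combination of the node inputs, an optional network bias term b, and a non-linear activation) can be sketched as follows. The function name, the tanh activation, and the example values are illustrative assumptions and not part of the disclosed system.

    import numpy as np

    def node_forward(inputs, weights, bias=0.0, activation=np.tanh):
        # Combine the node inputs with their link weights (e.g., w1, w2, w3, w4),
        # add the network bias term b, and apply the non-linear activation.
        z = np.dot(weights, inputs) + bias
        return activation(z)

    # Example: a node with four incoming links.
    x = np.array([0.2, -0.5, 1.0, 0.3])   # node inputs
    w = np.array([0.7, 0.1, -0.4, 0.9])   # link weights w1..w4
    y = node_forward(x, w, bias=0.05)     # node output sent over outgoing links

During training, backward propagation would adjust the weights and the bias to reduce the measured loss, as discussed below.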
[0041] Alternatively and additionally, in some embodiments, a generative neural network is applied in the data processing model 240 to process content data (particularly, visual data and audio data). A generative neural network is trained by providing it with a large amount of data (e.g., millions of images, sentences, or sounds), and the neural network is then trained to generate data like the input data. In some examples, the generative neural networks have a significantly smaller number of parameters than the amount of input training data, so the generative neural networks are forced to find and efficiently internalize the essence of the data in order to generate data. [0042] It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly. [0043] The training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the sets of weights for the different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the neural network 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer. [0044] Figure 5 is a flow chart of a process 500 of relighting an input image including a facial portion, in accordance with some embodiments. In some embodiments, an operation of real time 3D face reconstruction and tracking 510 is implemented. In some examples, a 3D face is used as a proxy model to utilize a 3D rendering process to generate shadow and shading. Facial details (e.g., wrinkles) are not needed. In some examples, a shape, an expression, and an orientation of a 3D face model are obtained by a deep learning model from input images or frames of input video data. In some embodiments, the deep learning models/neural networks are shown in Figures 1, 3, 4A and 4B. The expression is represented by a change of a geometry of the facial portion, such as how much a mouth is opened. A coarse mesh with the correct face shape is obtained. In some embodiments, a 3D parametric (multilinear) face model is fitted by aligning a plurality of landmarks (e.g., nose tip, eye corner, etc.) from input images or frames of the input video data and the face model. In some embodiments, the multilinear face model separably parameterizes the geometric space variations of 3D face meshes associated with different attributes (e.g., identity, expression, and mouth shape). Optionally, each of the different attributes is varied independently. A multilinear face model is estimated from a Cartesian product of examples (identities × expressions × mouth shapes) using statistical analysis. 
Preprocessing of the geometric attribute data samples to secure the one-to-one relationship is needed before the Cartesian process. In some embodiments, the preprocessing includes minimizing cross-coupling artifacts and filling in any missing examples. For a series of input pictures or frames of the input video data, meshes are calculated for every key point of the face. The difference between the landmarks and the corresponding key points on the 3D model meshes is then minimized. In some examples, for face tracking, the shape of the parametric face is fixed, and the expression is solved by the same landmark alignment optimization/fitting. [0045] In some embodiments, a facial shadow and shading generation 520 process is implemented after the real time 3D face reconstruction and tracking 510. Shadows exist where light is blocked. Shading refers to changes in the color of a surface and reflects how light is deflected on the surface. With the abovementioned reconstructed 3D face, the 3D face model is rendered with a rendering engine along with a desired lighting condition. A lightweight renderer is developed for this purpose. In some embodiments, deferred shading is used to control the demand for computation resources required to render the lighting. The renderer outputs a facial shadow map and a shading distribution in this operation 520. For shadow generation, a percentage closer soft shadow (PCSS) is used to simulate shadows, and edges are softened based on different types of light sources. In some examples, the PCSS method is used for generating perceptually accurate soft shadows by replacing a typical shadow mapping pixel shader with a PCSS shader. PCSS searches a shadow map and determines the search region based on a light source size and the distance of the points that are being shadowed from the light source. PCSS then computes a variable kernel/region size from the relative positions of the shadowed point, the approximated blockers in the search region, and the area light. When the edges are softened, they change gradually. For shading generation, a microfacet Disney bidirectional reflectance distribution function (BRDF) is used. The microfacet BRDF is a shading model established based on the Disney model; it compares many options for each term while using the same input parameters, and evaluates an importance level of each term compared with other efficient alternatives. [0046] In some embodiments, a process of facial shadow and shading blending 530 is implemented after the facial shadow and shading generation 520 process. The facial shadow and shading are generated and blended with the input image. In a quotient image, the shadow and shading distribution are expanded along a boundary of the facial portion to soften the boundary. The softened shadow and shading maps are directly blended with the input image. In some embodiments, the quotient image method is an image-based, class-based identification and re-rendering method. Under general conditions, the quotient image approach extracts, from a small set of sample images, an illumination-invariant quotient image for a novel object of a class given only a single input image of that object. In some examples, given two objects including a first object and a second object, the quotient image is defined by the ratio of their albedo (surface texture) functions. The quotient image is illumination invariant. 
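As an illustrative sketch of the PCSS-style soft shadowing and its variable filter size, the following code (an assumption for explanation, not the disclosed renderer) estimates an average blocker depth in a search window of a depth shadow map, derives a penumbra-dependent kernel size, and applies percentage-closer filtering; a production shader would also apply a depth bias and sample in light space.

    import numpy as np

    def pcss_visibility(shadow_map, sample, receiver_depth, light_size, search_radius=5):
        # shadow_map: HxW array of depths rendered from the light's point of view.
        # sample: integer (row, col) position of the shaded point in the shadow map.
        h, w = shadow_map.shape
        r, c = sample

        def window(radius):
            r0, r1 = max(0, r - radius), min(h, r + radius + 1)
            c0, c1 = max(0, c - radius), min(w, c + radius + 1)
            return shadow_map[r0:r1, c0:c1]

        # 1) Blocker search: average depth of texels closer to the light than the receiver.
        region = window(search_radius)
        blockers = region[region < receiver_depth]
        if blockers.size == 0:
            return 1.0                                   # nothing blocks the light
        avg_blocker = blockers.mean()

        # 2) Penumbra width grows with the light size and the blocker-receiver gap.
        penumbra = (receiver_depth - avg_blocker) * light_size / avg_blocker
        kernel = max(1, int(round(penumbra)))            # units are illustrative

        # 3) Percentage-closer filtering over the variable-size kernel.
        region = window(kernel)
        return float((region >= receiver_depth).mean())  # fraction of lit samples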
In the absence of the albedo functions, the quotient image can still be recovered, analytically, from bootstrapping a set of images. In some instances, the quotient image is recovered, and the entire image space under varying lighting conditions of the first object is generated by the quotient image and three images of the second object. [0047] In some embodiments, a process of style transfer and skin color protection 540 is implemented after the facial shadow and shading blending 530 process. In some examples, style transfer is combining the content of one image with the style of another image. In some instances, in a relighting task, only changing the facial region will generate unpleasant results. A better process should update and transfer the parts of the body and background other than the facial region according to the new light source. However, unlike the face, those regions do not have strong 3D geometry prior to the style transfer, and it is hard to change the appearance of them by 3D rendering. Therefore, a deep style transfer neural network is trained to adjust the color style of those regions. In some embodiments, the deep learning models/neural networks used for deep style transfer are shown in Figures 1, 3, 4A and 4B. The training data is a collection of images and each image has a lighting style label. The lighting style labels are created manually or by a software. In some examples, the lighting style of each image is collected. In some embodiments, the deep learning model can utilize all types of images. In some embodiments, a generative neural network is trained using the photo dataset. In some embodiments, a Cycle-GAN is used to train based on the collection of images. In some instances of the deep learning process, a human face is not needed. In generative adversarial networks (GAN), an adversarial loss is utilized that forces the generated images to be indistinguishable from the real images.
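For illustration only, the blending operation 530 described above (softening the rendered shadow and shading along the face boundary and layering them with the input image) might be approximated as in the following sketch. A plain multiplicative blend with Gaussian softening is used here as a simplified stand-in for the quotient-image formulation, and the function and parameter names are assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def blend_face_shadow_shading(image, shading, shadow, face_mask, sigma=3.0):
        # image: HxWx3 float array in [0, 1]; shading/shadow: HxW maps from the
        # renderer; face_mask: HxW array in {0, 1} marking the 3D face model's area.
        soft_mask = gaussian_filter(face_mask.astype(np.float32), sigma)
        soft_shadow = gaussian_filter(shadow.astype(np.float32), sigma)
        soft_shading = gaussian_filter(shading.astype(np.float32), sigma)

        # Relit face layer: the input modulated by the softened shading and shadow.
        relit = image * soft_shading[..., None] * soft_shadow[..., None]

        # Layer the relit face over the input according to the softened face boundary.
        out = soft_mask[..., None] * relit + (1.0 - soft_mask[..., None]) * image
        return np.clip(out, 0.0, 1.0)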
[0048] Figure 6 is a structural diagram of an example Cycle-GAN 600, in accordance with some embodiments. An adversarial loss is utilized in the Cycle-GAN 600 to learn the mapping such that the translated image cannot be distinguished from the target images. Mapping functions are learned between two domains S (602) and T (604) given training samples. Two mappings X (606): S → T and Y (608): T → S are included. In addition, the learned mapping functions are cycle-consistent in Cycle-GAN. For example, for each image, such as 610, from domain S, the image translation cycle should be able to bring the image back to the original image, to reach forward cycle consistency. Similarly, for each image, such as 614, from domain T, Y and X should also satisfy backward cycle consistency. A cycle consistency loss is utilized as a way of using transitivity to supervise the training in some instances. For example, after mapping X (606), the sample 610 in the domain S becomes the sample 612 in the domain T, and after mapping Y (608), the sample 612 in the domain T becomes sample 616 in the domain S. The difference between the samples 610 and 616 is represented as a cycle consistency loss 618. For example, after mapping Y (608), the sample 614 in the domain T becomes the sample 620 in the domain S, and after mapping X (606), the sample 620 in the domain S becomes sample 622 in the domain T. The difference between the samples 614 and 622 is represented as a cycle consistency loss 624. [0049] In some examples, the style-transferred images usually have similar color changes on the skin region and the non-skin region. In order to protect the skin’s color, the skin region is segmented out in the input image based on the color distribution. When performing the style transfer, the segmented skin region has a decayed or lessened effect of the transfer; for example, the skin region has less dramatic color changes compared with the other parts of the image. [0050] In some embodiments, an optional process of generating body shadow and different lighting effects 550 is implemented after the style transfer and skin color protection 540 process. In some examples, when using a point light or directional light as the new light source and the input image includes a full-body human, a shadow of the human body cast on a nearby structure, for example, the wall or floor, is desired for aesthetic purposes. This is achieved by a body mask projection. In some examples, the body shape is differentiated or segmented from the background of the picture. In some examples, the body shape is segmented from the background of the picture according to a 3D body model. With the designed light position or direction, the segmented body mask is projected onto the nearby structure, such as the wall or the ground. The projected body mask is used to simulate the body shadow on the wall, for example, by changing the intensity of pixels of the wall that are within the projected body mask. [0051] Figure 7A illustrates an input image 702 and three output images 704-708 to which the input image is converted with first lighting effects (e.g., leaf light, window light, contour light), in accordance with some embodiments. The output images 704, 706, and 708 are rendered from the input image 702 under leaf light, window light, and contour light, respectively. These effects form a set of specific styles of lighting conditions. For example, in the image 704, the leaf light simulates the lighting under a tree, and in the image 706, window light simulates the lighting near a window. 
The shadow is cast onto both the wall and the person. In some examples, a 3D parametric body model (e.g., a skinned multi-person linear model (SMPL)) is fitted to the input image 702 to estimate the correct shadow boundary given a light source. In some examples, the parameters of a 3D parametric body model are learned from data including a human pose template, blend weights, and blend shapes, with joint locations regressed from the mesh vertices. In some examples, SMPL is a realistic 3D human body model that is based on skinning and blend shapes and is learned from a large quantity of 3D body scans. SMPL is a skinned vertex-based model trained on data from different people in different human poses. In some embodiments, the 3D body model is very coarse with simple shapes. In some embodiments, the 3D body model includes a face. The image 708 with contour light highlights the people and darkens the background region in a seamless way. [0052] Figure 7B illustrates an input image 710 and two output images 712 and 714 to which the input image 710 is converted with second lighting effects (e.g., morning and dusk lights), in accordance with some embodiments. The output images 712 and 714 are rendered from the input image with morning light and dusk light, respectively. These two light effects are used to change the color style of the photo or to make the photo brighter. The output image 712 with morning light has a white style and is bright. The output image 714 with the dusk light is slightly yellow and has lower brightness. [0053] Figure 7C illustrates an input image 720 and three output images 722-726 to which the input image 720 is converted with arbitrary lighting effects, in accordance with some embodiments. Each output image 722, 724, or 726 is associated with a respective arbitrary lighting condition defined by a unique combination of direction, color, intensity distribution, and other characteristics of light. In some embodiments, the user selects the different light direction, color, intensity distribution, and other characteristics of the light to apply an arbitrary lighting effect to an input image 720. In some embodiments, a user selects specific lighting conditions to apply in a user interface of an application and/or a user device. [0054] Figure 8 is a flowchart of a method 800 for relighting an input image, in accordance with some embodiments. The method 800 is implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof). An example of the client device 104 is a mobile phone. Method 800 is, in some embodiments, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed. 
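Before turning to the flowchart of Figure 8, the body-mask projection described in paragraph [0050] can be illustrated with a deliberately simplified two-dimensional sketch: the segmented body mask is shifted along an image-space offset implied by the light direction, and the covered background pixels are darkened. A full implementation would project the mask geometrically onto the wall or floor according to the light position; the pixel shift, the fixed darkening factor, and all names below are illustrative assumptions.

    import numpy as np

    def cast_body_shadow(image, body_mask, offset, darkening=0.55):
        # image: HxWx3 float array; body_mask: HxW boolean segmentation of the body;
        # offset: (dy, dx) pixel shift implied by the point or directional light.
        body_mask = body_mask.astype(bool)
        h, w = body_mask.shape
        dy, dx = offset

        ys, xs = np.nonzero(body_mask)
        ys2, xs2 = ys + dy, xs + dx
        keep = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)

        projected = np.zeros_like(body_mask)
        projected[ys2[keep], xs2[keep]] = True
        projected &= ~body_mask        # darken the background only, not the person

        out = image.astype(np.float32).copy()
        out[projected] *= darkening    # lower the intensity inside the cast shadow
        return out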
[0055] The computer system reconstructs (810) a 3D face model from a face image of the input image based on a plurality of facial landmarks, and renders (820) the 3D face model in a predefined lighting condition with a facial shadow and shading. The computer system blends (830) the facial shadow and shading with the input image, and updates (840) a non-facial portion of the input image with a style that matches the predefined lighting condition. [0056] In some embodiments, the 3D face model includes a face shape, and the operation (810) of reconstructing the 3D face model from the input image further includes fitting the 3D face model by aligning a plurality of first facial feature landmarks of the input image and a plurality of second facial feature landmarks of the 3D face model. Some exemplary facial feature landmarks include nose tip, eye corner, etc. In some embodiments, the 3D face model is reconstructed using a machine learning model that is trained in a supervised manner using one or more sample images, and each sample image is annotated with one or more ground-truth facial feature landmarks. [0057] In some embodiments, the input image is followed by a first image in a video clip, and the 3D face model reconstructed from the input image includes a first 3D face model. In some embodiments, the process of relighting an input image further includes tracking a variation of the plurality of facial landmarks from the input image to the first image and reconstructing a second 3D face model for the second image based on the first 3D face model and the variation of the plurality of facial landmarks. The second 3D face model is rendered in a corresponding predefined lighting condition. [0058] In some embodiments, the facial shadow and shading are represented by a facial shadow map and a facial shading distribution, respectively, and the operation (820) of rendering the 3D face model in the predefined lighting condition with the facial shadow and shading further includes: generating the facial shadow map using a percentage closer soft shadow (PCSS) method; and generating the facial shading distribution using a Disney bidirectional reflectance distribution function (BRDF). [0059] In some embodiments, the operation (830) of blending the facial shadow and shading with the input image includes softening the facial shadow map along a boundary of the 3D face model, softening the facial shading distribution along the boundary of the 3D face model, and blending the softened facial shadow map and facial shading distribution with the input image. In some embodiments, the blending is on a pixel basis. [0060] In some embodiments, the operation (830) of blending the facial shadow and shading with the input image further includes expanding a distribution of the facial shadow and shading along a boundary of the facial shadow and shading to soften the boundary, and layering the expanded facial shadow and shading with the softened boundary with the input image according to a position of the 3D face model. [0061] In some embodiments, the operation (840) of updating the non-facial portion of the input image with the style that matches the predefined lighting condition further includes: adjusting a color style of the non-facial portion of the input image according to the predefined lighting condition using a style-transfer neural network model. 
In some embodiments, the style-transfer neural network is trained in a supervised manner to adjust color styles of non-facial portions from a set of training images under a plurality of predefined lighting conditions. [0062] In some embodiments, the non-facial portion of the input image includes at least one of other non-facial body parts and a background of the input image. [0063] In some embodiments, the non-facial portion includes a skin region and a non-skin region, and the operation (840) of updating the non-facial portion of the input image with the style that matches the predefined lighting condition further includes: segmenting the skin region from the non-facial portion of the input image; and updating the skin region of the non-facial portion with a decayed effect of the style that matches the predefined lighting condition compared with the non-skin region of the non-facial portion. In some embodiments, segmenting the skin region from the non-facial portion of the input image is based on a color distribution. In some embodiments, the skin region has a color closer to that of the input image. [0064] In some embodiments, the process of relighting an input image further includes: in accordance with a determination that the input image includes a body and the predefined lighting condition utilizes a point light or a directional light, casting a body shadow on an adjacent structure. In some embodiments, the body is a half body or full body. In some embodiments, the adjacent structure is a wall or a floor. [0065] In some embodiments, casting the body shadow on the adjacent structure includes segmenting a body shape as a mask, and projecting the mask on the adjacent structure. [0066] In some embodiments, in accordance with a determination that the predefined lighting condition is selected from the group consisting of leaf light, window light, and contour light, a 3D parametric body model is fitted to the input image to estimate the body shadow corresponding to the predefined lighting condition. [0067] In some embodiments, the predefined lighting condition is one selected from the group consisting of morning light, dusk light, and arbitrary light. In some embodiments, arbitrary light is defined by a direction, color, intensity, distribution, etc. [0068] It should be understood that the particular order in which the operations in Figure 8 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to relight an input image as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 1-7C are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here. 
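As a further illustration, the decayed style effect on the skin region described in paragraph [0063] can be expressed as a per-pixel blend in which the style-transferred result is applied at reduced strength inside the segmented skin mask, keeping the skin color closer to that of the input image. The blending weight and names below are assumptions for the sketch.

    import numpy as np

    def apply_style_with_skin_protection(image, styled, skin_mask, skin_strength=0.35):
        # image/styled: HxWx3 float arrays in [0, 1]; skin_mask: HxW array in {0, 1}
        # produced by the color-distribution-based skin segmentation.
        alpha = np.where(skin_mask.astype(bool), skin_strength, 1.0)[..., None]
        return alpha * styled + (1.0 - alpha) * image

    # Example usage: non-skin pixels take the full style, while skin pixels keep
    # 65% of the input color under the default skin_strength of 0.35.
    # relit = apply_style_with_skin_protection(input_image, styled_image, skin_mask)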
It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. [0070] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context. [0071] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art. [0072] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is: 1. A method of relighting an input image, comprising: reconstructing a 3D face model from a face image of the input image based on a plurality of facial landmarks; rendering the 3D face model in a predefined lighting condition with a facial shadow and shading; blending the facial shadow and shading with the input image; and updating a non-facial portion of the input image with a style that matches the predefined lighting condition.
2. The method of claim 1, wherein the 3D face model includes a face shape, and reconstructing the 3D face model from the input image further comprises: fitting the 3D face model by aligning a plurality of first facial feature landmarks of the input image and a plurality of second facial feature landmarks of the 3D face model.
3. The method of claim 2, wherein the 3D face model is reconstructed using a machine learning model that is trained in a supervised manner using one or more sample images, and each sample image is annotated with one or more ground-truth facial feature landmarks.
4. The method of claim 2, wherein the input image is followed by a first image in a video clip, and the 3D face model reconstructed from the input image includes a first 3D face model, the method further comprising: tracking a variation of the plurality of facial landmarks from the input image to the first image; and reconstructing a second 3D face model for the second image based on the first 3D face model and the variation of the plurality of facial landmarks.
5. The method of any of the preceding claims, wherein the facial shadow and shading are represented by a facial shadow map and a facial shading distribution, respectively, and rendering the 3D face model in the predefined lighting condition with the facial shadow and shading further comprises: generating the facial shadow map using a percentage closer soft shadow (PCSS) method; and generating the facial shading distribution using a Disney bidirectional reflectance distribution function (BRDF).
6. The method of claim 5, wherein blending the facial shadow and shading with the input image includes: softening the facial shadow map along a boundary of the 3D face model; softening the facial shading distribution along the boundary of the 3D face model; and blending the softened facial shadow map and facial shading distribution with the input image.
7. The method of any of claims 1-5, wherein blending the facial shadow and shading with the input image further comprises: expanding a distribution of the facial shadow and shading along a boundary of the facial shadow and shading to soften the boundary; and layering the expanded facial shadow and shading with the softened boundary with the input image according to a position of the 3D face model.
8. The method of any of the preceding claims, wherein updating the non-facial portion of the input image with the style that matches the predefined lighting condition further comprises: adjusting a color style of the non-facial portion of the input image according to the predefined lighting condition using a style-transfer neural network model.
9. The method of any of the preceding claims, wherein the non-facial portion of the input image includes at least one of other non-facial body parts and a background of the input image.
10. The method of any of the preceding claims, wherein the non-facial portion includes a skin region and a non-skin region, and updating the non-facial portion of the input image with the style that matches the predefined lighting condition further comprises: segmenting the skin region from the non-facial portion of the input image; and updating the skin region of the non-facial portion with a decayed effect of the style that matches the predefined lighting condition compared with other non-skin region of the non-facial portion.
11. The method of any of the preceding claims, further comprising: in accordance with a determination that the input image includes a body and the predefined lighting condition utilizes a point light or a directional light, casting a body shadow on an adjacent structure.
12. The method of claim 11, wherein casting the body shadow on the adjacent structure includes segmenting a body shape as a mask, and projecting the mask on the adjacent structure.
13. The method of claim 11, wherein, in accordance with a determination that the predefined lighting condition is selected from the group consisting of leaf light, window light, and contour light, a 3D parametric body model is fitted to the input image to estimate the body shadow corresponding to the predefined lighting condition.
14. The method of any of the preceding claims, wherein the predefined lighting condition is one selected from the group consisting of morning light, dusk light, and arbitrary light.
15. A computer system, comprising: one or more processors; and memory having instructions stored thereon, which, when executed by the one or more processors, cause the one or more processors to perform a method of any of claims 1-14.
16. A non-transitory computer-readable medium, having instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform a method of any of claims 1-14.
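The softening and blending recited in claims 5-7 can be pictured with a short sketch. The following Python/NumPy snippet is a hypothetical, simplified illustration only, not the claimed implementation: it assumes that a facial shadow map (for example, produced by a percentage closer soft shadow pass) and a facial shading distribution (for example, produced by a Disney BRDF evaluation) have already been rendered from the reconstructed 3D face model, and it uses a Gaussian boundary weight as the softening step and a multiplicative composite as the blending step. The function names, the sigma and shadow_strength parameters, and the synthetic inputs are assumptions introduced for illustration.

# Illustrative sketch only: softens a facial shadow map and shading layer near the
# face-mask boundary, then composites them onto the input image (claims 5-7 analogue).
import numpy as np
from scipy.ndimage import gaussian_filter

def soften_along_boundary(layer, face_mask, sigma=5.0):
    """Blur a shadow/shading layer, but only near the face-mask boundary."""
    blurred = gaussian_filter(layer, sigma=sigma)
    # Boundary weight: close to 1 near the mask edge, close to 0 well inside/outside.
    edge = gaussian_filter(face_mask.astype(np.float32), sigma=sigma)
    boundary = 4.0 * edge * (1.0 - edge)          # peaks where edge is about 0.5
    return boundary * blurred + (1.0 - boundary) * layer

def blend_relit_face(image, shadow_map, shading, face_mask, shadow_strength=0.6):
    """Composite softened facial shadow and shading onto the input image."""
    shadow_soft = soften_along_boundary(shadow_map, face_mask)
    shading_soft = soften_along_boundary(shading, face_mask)
    # Multiply shading and attenuated shadow into the facial region only.
    light = shading_soft * (1.0 - shadow_strength * (1.0 - shadow_soft))
    relit = image * light[..., None]
    mask = face_mask.astype(np.float32)[..., None]
    return np.clip(mask * relit + (1.0 - mask) * image, 0.0, 1.0)

if __name__ == "__main__":
    h, w = 256, 256
    rng = np.random.default_rng(0)
    image = rng.uniform(0.2, 0.9, size=(h, w, 3))   # stand-in portrait
    face_mask = np.zeros((h, w), dtype=bool)
    face_mask[64:192, 80:176] = True                # stand-in facial region
    shadow_map = np.ones((h, w))                    # 1.0 means fully lit
    shadow_map[100:160, 80:128] = 0.2               # synthetic cast-shadow patch
    shading = np.clip(rng.normal(0.8, 0.1, size=(h, w)), 0.0, 1.0)
    out = blend_relit_face(image, shadow_map, shading, face_mask)
    print(out.shape, out.min(), out.max())

In a full renderer, the shadow map and shading distribution would be produced per pixel by the PCSS and BRDF passes over the fitted face mesh; the boundary softening and compositing, however, would remain structurally similar to the sketch above.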
PCT/US2021/055776 2021-10-20 2021-10-20 System and method for dynamic portrait relighting WO2023069086A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/055776 WO2023069086A1 (en) 2021-10-20 2021-10-20 System and method for dynamic portrait relighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/055776 WO2023069086A1 (en) 2021-10-20 2021-10-20 System and method for dynamic portrait relighting

Publications (1)

Publication Number Publication Date
WO2023069086A1 (en) 2023-04-27

Family

ID=86058465

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/055776 WO2023069086A1 (en) 2021-10-20 2021-10-20 System and method for dynamic portrait relighting

Country Status (1)

Country Link
WO (1) WO2023069086A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090310828A1 (en) * 2007-10-12 2009-12-17 The University Of Houston System An automated method for human face modeling and relighting with application to face recognition
US20130066750A1 (en) * 2008-03-21 2013-03-14 Dressbot, Inc. System and method for collaborative shopping, business and entertainment
US20190122411A1 (en) * 2016-06-23 2019-04-25 LoomAi, Inc. Systems and Methods for Generating Computer Ready Animation Models of a Human Head from Captured Data Images
US20190014884A1 (en) * 2017-07-13 2019-01-17 Shiseido Americas Corporation Systems and Methods for Virtual Facial Makeup Removal and Simulation, Fast Facial Detection and Landmark Tracking, Reduction in Input Video Lag and Shaking, and a Method for Recommending Makeup
US20200286284A1 (en) * 2019-03-07 2020-09-10 Lucasfilm Entertainment Company Ltd. On-set facial performance capture and transfer to a three-dimensional computer-generated model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RUIJUN GAO; QING GUO; QIAN ZHANG; FELIX JUEFEI-XU; HONGKAI YU; WEI FENG: "Adversarial Relighting against Face Recognition", ARXIV.ORG, Cornell University Library, 201 Olin Library, Cornell University, Ithaca, NY 14853, 1 September 2021 (2021-09-01), XP091049976 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196981A (en) * 2023-09-08 2023-12-08 兰州交通大学 Bidirectional information flow method based on texture and structure reconciliation
CN117196981B (en) * 2023-09-08 2024-04-26 兰州交通大学 Bidirectional information flow method based on texture and structure reconciliation

Similar Documents

Publication Publication Date Title
US20210350504A1 (en) Aesthetics-guided image enhancement
US11783461B2 (en) Facilitating sketch to painting transformations
US11928592B2 (en) Visual sign language translation training device and method
US10657652B2 (en) Image matting using deep learning
US10936909B2 (en) Learning to estimate high-dynamic range outdoor lighting parameters
US20200387697A1 (en) Real-time gesture recognition method and apparatus
JP2016218999A (en) Method for training classifier to detect object represented in image of target environment
JP2022503647A (en) Cross-domain image conversion
US20210073955A1 (en) Learning from estimated high-dynamic range all weather lighting parameters
US11024060B1 (en) Generating neutral-pose transformations of self-portrait images
WO2021077140A2 (en) Systems and methods for prior knowledge transfer for image inpainting
US20230082715A1 (en) Method for training image processing model, image processing method, apparatus, electronic device, and computer program product
WO2023066120A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
US11582519B1 (en) Person replacement utilizing deferred neural rendering
US11531837B2 (en) Query image synthesis for image retrieval systems based on data augmentation techniques
WO2023069085A1 (en) Systems and methods for hand image synthesis
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
WO2023069086A1 (en) System and method for dynamic portrait relighting
WO2022103877A1 (en) Realistic audio driven 3d avatar generation
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
WO2023027712A1 (en) Methods and systems for simultaneously reconstructing pose and parametric 3d human models in mobile devices
US11734888B2 (en) Real-time 3D facial animation from binocular video
WO2023023160A1 (en) Depth information reconstruction from multi-view stereo (mvs) images