WO2023133285A1 - Anti-aliasing of object borders with alpha blending of multiple segmented 3D surfaces

Anti-aliasing of object borders with alpha blending of multiple segmented 3D surfaces

Info

Publication number
WO2023133285A1
Authority
WO
WIPO (PCT)
Prior art keywords
input image
image
edge
map
depth
Application number
PCT/US2023/010333
Other languages
French (fr)
Inventor
Kim C. NG
Jinglin SHEN
Chiu Man HO
Original Assignee
Innopeak Technology, Inc.
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2023133285A1 publication Critical patent/WO2023133285A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds

Definitions

  • This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable storage media for improving quality of an image, e.g., by applying color blending to border pixels of foreground objects to enhance at least image quality around object borders.
  • Images are oftentimes reconstructed from imaging details of objects and meshes stored on a pixel level. These imaging details are processed using Open Graphics Library (OpenGL) to render the objects and meshes and generate visual content appreciated by spectators.
  • Anti-aliasing techniques are used to remove jaggy or staircase effects from borders of objects within the images. Examples of the anti-aliasing techniques include supersample anti-aliasing, multisample anti-aliasing, fast approximate anti-aliasing, morphological anti-aliasing, subpixel morphological anti-aliasing, and temporal anti-aliasing. These anti-aliasing techniques oftentimes demand large computation and storage resources and involve complicated operations on fragments, blurriness, motion vectors, and/or transparent textures. It would be beneficial to develop systems and methods for enhancing quality of images or video clips effectively and efficiently to conserve storage and communication resources allocated for rendering visual information.
  • Some implementations of this application are directed to anti-aliasing techniques applied in an image rendering system particularly to suppress geometric aliasing on object borders (e.g., remove jaggy or staircase effects associated with borders of objects in images).
  • Such anti-aliasing techniques use three-dimensional (3D) spatial information and color information, which is optionally collected by one or multiple cameras or sensors disposed in a 3D scene or artificially synthesized by a graphical system. Based on 3D spatial and color information of a scene, a ray is radiated from a given view point and intersects with multiple surfaces in the 3D scene. A depth map of the surfaces is projected based on the given view point and used to detect object borders or silhouettes. The borders or silhouettes of the surfaces are further processed to derive an alpha map.
  • the alpha map is used to blend the borders or silhouettes of the surfaces (i.e., blend pixels on a foreground surface and pixels on a background) to derive new colors for anti-aliasing.
  • These anti-aliasing techniques overcome large aliasing (e.g., greater than a threshold dimension) and handle object transparency efficiently, thereby providing a simple and effective solution against geometric aliasing distributed along the object borders without demanding a large amount of computation and memory.
  • an image processing method is implemented by an electronic device.
  • the method includes obtaining an input image and a depth map corresponding to the input image.
  • the method further includes identifying a foreground object in the input image based on the depth map and identifying a plurality of edge pixels associated with an edge of the foreground object on the depth map.
  • Each edge pixel corresponds to a respective background pixel that is occluded by the foreground object.
  • the method further includes determining a respective alpha value for each of the plurality of edge pixels and rendering the input image by blending colors of each of the plurality of edge pixels and the respective background pixel based on the respective alpha value.
  • identifying the plurality of edge pixels on the depth map further includes generating an edge map identifying the plurality of edge pixels.
  • the edge map includes a binary map.
  • the binary map has (1) a first set of pixels corresponding to the plurality of edge pixels and having a first binary value and (2) a second set of pixels corresponding to a plurality of remaining pixels and having a second binary value. Each remaining pixel is distinct from the plurality of edge pixels in the input image.
  • the method further includes applying a Gaussian blur filter on the edge map to generate a blurred edge map and converting the blurred edge map to an alpha map based on the depth map.
  • the alpha map has a plurality of alpha values corresponding to a plurality of pixels of the input image, and the plurality of alpha values include the respective alpha value of each edge pixel.
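  • As an illustration of the pipeline summarized above, the following Python sketch (a minimal example assuming OpenCV and NumPy; the function name, depth-gradient threshold, and kernel size are hypothetical choices rather than values from this disclosure) derives a binary edge map from depth discontinuities, blurs it with a Gaussian filter to obtain per-pixel alpha values, and blends edge pixels with the corresponding background pixels:

        import cv2
        import numpy as np

        def antialias_borders(foreground, background, depth, depth_jump=0.05, ksize=5):
            # foreground, background: HxWx3 uint8 images; depth: HxW float map in [0, 1].
            # 1. Binary edge map: mark pixels where the depth changes abruptly.
            gx = cv2.Sobel(depth.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3)
            gy = cv2.Sobel(depth.astype(np.float32), cv2.CV_32F, 0, 1, ksize=3)
            edge_map = (np.hypot(gx, gy) > depth_jump).astype(np.float32)

            # 2. Gaussian blur so that alpha falls off smoothly away from the border.
            alpha = np.clip(cv2.GaussianBlur(edge_map, (ksize, ksize), 0), 0.0, 1.0)[..., None]

            # 3. Alpha-blend edge pixels with the occluded background pixels.
            blended = (1.0 - alpha) * foreground.astype(np.float32) + alpha * background.astype(np.float32)
            return blended.astype(np.uint8)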
  • some implementations of this application are directed to methods, systems, devices, and non-transitory computer-readable media for organizing 3D visual information in layered depth images (LDIs) and a data serialization format (e.g., a YAML or XML file).
  • One or more cameras capture one or more images containing color information and 3D spatial information of the scene.
  • the one or more images are structured into an LDI representation that represents a 3D scene and includes multiple pixels along each line of sight.
  • the LDI representation is constructed from a single input image and provides layered and detailed image information in an outline region of each foreground object in the input image.
  • the LDI representation is serialized for storage or transmission according to the data serialization file.
  • the LDI representation can subsequently be deserialized and rendered to visual content based on the data serialization file.
  • serialization is a process of translating a data structure or object state of the LDI representation into a format that can be stored (e.g., in a file or memory data buffer), transmitted (e.g., over a computer network), and reconstructed (e.g., in a different computer environment).
  • Deserialization is a process of extracting a data structure or object state from a series of bytes that can be used to create a semantically identical clone of the original data structure or object.
  • Deserialization is an opposite process of serialization.
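  • As a concrete, deliberately simplified illustration of this round trip, the Python sketch below assumes PyYAML and a hypothetical minimal LDI record; the field names and file names are invented for illustration and do not reproduce the format of the data serialization file described herein:

        import yaml

        ldi_record = {
            "top_layer": {"image": "input.png", "depth": "input_depth.png"},
            "layer_1": [
                {"image": "segment_0.png", "depth": "segment_0_depth.png",
                 "location": {"x": 120, "y": 84, "width": 64, "height": 48}},
            ],
        }

        # Serialization: translate the data structure into a storable/transmittable format.
        serialized = yaml.safe_dump(ldi_record)

        # Deserialization: reconstruct a semantically identical clone from that format.
        clone = yaml.safe_load(serialized)
        assert clone == ldi_record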
  • an image format is used to represent serialized data of the LDI representation to facilitate compression of necessary information of the LDI representation.
  • the serialized data of the LDI representation is saved or transmitted using a database, such as using a structured query language (SQL).
  • an image processing method is implemented at an electronic device.
  • the method includes obtaining an input image, obtaining an input depth map corresponding to the input image, and identifying a foreground object in the input image.
  • the method further includes determining a set of first images that capture respective visual views occluded by the foreground object and forming a first image layer including the set of first images of the foreground object.
  • the method further includes associating the input image and the first image layer with a data serialization file for rendering the input image in a 3D format.
  • the data serialization file identifies locations of the set of first images of the first image layer in the input image.
  • each first image corresponds to a portion of a contour of the foreground object, and provides a visual view that is occluded by an outline region including the portion of the contour of the foreground object.
  • the method further includes determining a set of first depth maps corresponding to the set of first images, and each first depth map corresponds to a respective distinct first image and has a first common location in the input image with the respective distinct first image. Further, in some embodiments, the method includes obtaining the input image, the input depth map, the first image layer including the set of first images and the set of first depth maps, and the data serialization file. The method further includes determining the locations of the set of first images of the first image layer in the input image from the data serialization file and combining the set of first images and the set of first depth maps of the first image layer with the input image according to the determined locations, thereby rendering the input image in the 3D format.
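  • A schematic sketch of this combining step is shown below (Python with NumPy; the (x, y, width, height) location format and the far-depth placeholder are assumptions made here for illustration only):

        import numpy as np

        def combine_first_layer(input_image, input_depth, first_images, first_depths, locations):
            # Paste each first image/depth pair back into a full-size hidden layer
            # at the location read from the data serialization file.
            hidden_color = np.zeros_like(input_image)
            hidden_depth = np.full_like(input_depth, input_depth.max())
            for img, dep, (x, y, w, h) in zip(first_images, first_depths, locations):
                hidden_color[y:y + h, x:x + w] = img
                hidden_depth[y:y + h, x:x + w] = dep
            # The top layer (input image and depth) plus the reconstructed hidden layer
            # form the layered representation used to render the input image in 3D.
            return [(input_image, input_depth), (hidden_color, hidden_depth)]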
  • the method further includes storing the input image, the input depth image, one or more image layers including the first image layer, and the data serialization file jointly in memory of the electronic device. In some embodiments, the method further includes transferring the input image, the input depth image, the one or more image layers including the first image layer, and the data serialization file to a computer system, such that the computer system renders the input image in the 3D format.
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • FIG. 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 illustrates a simplified process of rendering an output image from an input image, in accordance with some embodiments.
  • FIG. 6 illustrates another example process of serializing layered depth images (LDIs) into a data serialization file, in accordance with some embodiments.
  • Figure 7A illustrates example LDIs of an input image, in accordance with some embodiments
  • Figure 7B is a perspective view of objects in an input image, in accordance with some embodiments
  • Figure 7C is an example serialization file, in accordance with some embodiments.
  • Figure 8 is a flow diagram of an example process of serializing and deserializing LDIs of an input image, in accordance with some embodiments.
  • Figure 9A illustrates LDIs including two image layers, in accordance with some embodiments
  • Figure 9B illustrates information of an image segment occluded by a foreground object of an input image, in accordance with some embodiments
  • Figure 9C is an example serialization file associated with an image layer shown in Figure 9A, in accordance with some embodiments.
  • Figure 10 is a flow diagram of an example image processing method, in accordance with some embodiments.
  • Figure 11 is a flow diagram explaining a method for updating edge pixels of a foreground object in an input image, in accordance with some embodiments.
  • Figure 12 is a flow diagram of an image processing method implemented based on color blending, in accordance with some embodiments.
  • Figures 13A and 13B are an input image and a depth image, in accordance with some embodiments.
  • Figures 14A and 14B are an input image and an updated image rendered based on alpha blending, in accordance with some embodiments.
  • Figure 15 is an example color blending file associated with an input image, in accordance with some embodiments.
  • Figure 16 is a flow diagram of another example image processing method implemented based on color blending, in accordance with some embodiments.
  • Some embodiments of this application are directed to organizing 3D visual information (e.g., depth information and color information) of LDIs in a data serialization format (e.g., a YAML or XML file) for rendering the 3D visual information in a user application (e.g., a photo album application).
  • the 3D visual information is captured directly by camera(s) or estimated from one or more images using computer vision techniques (e.g., information of occluded pixels behind a foreground object is interpolated by an inpainting algorithm).
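  • For example, a classical inpainting routine can serve as a stand-in for such interpolation. The sketch below uses OpenCV's cv2.inpaint purely for illustration; it is not necessarily the inpainting algorithm of this disclosure, and the construction of the foreground mask is application-specific:

        import cv2

        def estimate_occluded_background(image, foreground_mask, radius=3):
            # image: HxWx3 uint8; foreground_mask: HxW uint8, non-zero where the
            # foreground object occludes background content to be interpolated.
            return cv2.inpaint(image, foreground_mask, radius, cv2.INPAINT_TELEA)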
  • the color information is extracted from images captured by one or more imaging devices.
  • a scene is captured by a single image or multiple images.
  • the 3D visual information is recovered at least partially using artificial intelligence and/or computer vision algorithms, and organized in a structure of LDIs.
  • the depth information and color information are synthesized.
  • depth information is recovered from a single image using MiDaS.
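  • A sketch of single-image depth estimation with MiDaS is shown below (Python, assuming PyTorch and the publicly available intel-isl/MiDaS torch.hub entry points; the model variant, file name, and normalization step are illustrative choices):

        import cv2
        import torch

        midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
        transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
        midas.eval()

        img = cv2.cvtColor(cv2.imread("input.png"), cv2.COLOR_BGR2RGB)
        batch = transforms.small_transform(img)

        with torch.no_grad():
            prediction = midas(batch)                 # relative inverse depth
            depth = torch.nn.functional.interpolate(
                prediction.unsqueeze(1), size=img.shape[:2],
                mode="bicubic", align_corners=False).squeeze().numpy()

        # Normalize to [0, 1], consistent with the normalized depth values used herein.
        depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)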
  • Serialized LDIs are optionally stored in a storage medium or transmitted through a network. When the serialized LDIs are deserialized for 3D rendering, the depth and color information is reconstructed from the LDIs, and synthesis of the depth and color information is not repeated. By these means, organizing the 3D visual information in LDIs during data serialization conserves at least storage and communication resources allocated for rendering the 3D visual information.
  • some embodiments of this application are directed to anti-aliasing techniques applied in an image rendering system particularly to suppress geometric aliasing on object borders.
  • Such anti-aliasing techniques use 3D spatial and color information, which is optionally collected by one or multiple cameras or sensors disposed in a 3D scene or artificially synthesized by a graphical system. Based on 3D spatial and color information of a scene, a ray is radiated from a given view point and intersects with multiple surfaces in the scene. A depth map of the surfaces is projected onto the given view point and used to detect object borders or silhouettes. The borders or silhouettes of the surfaces are further processed to derive an alpha map.
  • the alpha map is used to blend the borders or silhouettes of the surfaces with background surfaces to derive new colors for anti-aliasing.
  • These anti-aliasing techniques overcome large aliasing (e.g., greater than a threshold dimension) and handle object transparency efficiently, thereby providing a simple and effective solution against geometric aliasing distributed along the object borders without demanding a large amount of computation and memory.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, i.e., a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the trained data processing models are applied to process the content data obtained by the client device 104.
  • both model training and data processing are implemented locally at each individual client device 104.
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104.
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the HMD 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D.
  • the display of the HMD 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • FIG. 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes one or more depth sensors 280 configured to determine a depth of an object in a view of a user or a camera 260.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • the user application(s) 224 include a photo album application organizing a plurality of images
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to determine depth information of an input image, determine visual content occluded by foreground objects in the input image, organize the input image, depth information, and occluded visual content in LDIs, and generate a data serialization file associated with the LDIs for creating a 3D effect in the input image;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
    o Training data 238 for training one or more data processing models 240;
    o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 include a depth map model for creating a depth map from an input image and an inpainting model for determining image segments including portions of an outline of a foreground object in the input image based on
  • the data processing module 228 is associated with one of the user applications 224 to enhance image or video quality by determining depth information of an input image, determining visual content occluded by foreground objects in the input image, identifying a plurality of edge pixels associated with an edge of the foreground object on the depth map, determining an alpha value for each of the plurality of edge pixels, blending colors of the plurality of edge pixels and corresponding background pixel occluded by the edge pixels based on alpha values, and rendering the input image with the enhanced image or video quality.
  • the content data and results 242 include an input image (e.g., 1102 in Figure 11), a depth map (e.g., 1104 in Figure 11), background information, an alpha map (e.g., 1120 in Figure 11), and one or more edge maps (e.g., 1110 in Figure 11).
  • the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • memory 206 optionally stores a subset of the modules and data structures identified above. Furthermore, memory 206 optionally stores additional modules and data structures not described above.
  • FIG. 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 238 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 238 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to the type of content data to be processed.
  • the training data 238 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 238.
  • an image pre-processing module 308A is configured to process image training data 238 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 238 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
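  • A minimal sketch of such pre-processing is given below (Python with OpenCV and NumPy; the (x, y, width, height) ROI format, the 224x224 target size, and the magnitude spectrum are assumptions for illustration):

        import cv2
        import numpy as np

        def preprocess_image(img, roi, size=(224, 224)):
            # Extract a region of interest and crop/resize to a predefined image size.
            x, y, w, h = roi
            return cv2.resize(img[y:y + h, x:x + w], size)

        def preprocess_audio(waveform, sample_rate):
            # Convert a training audio sequence to the frequency domain with a Fourier transform.
            spectrum = np.abs(np.fft.rfft(waveform))
            freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)
            return freqs, spectrum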
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s).
  • a weight w associated with each link 412 is applied to the node output.
  • the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).
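  • Interpreted in the usual way, the propagation function applies the non-linear activation to the linear weighted combination of the node inputs, as in the NumPy sketch below (the tanh activation and the example values are arbitrary illustrative choices):

        import numpy as np

        def node_output(inputs, weights, bias=0.0, activation=np.tanh):
            # Linear weighted combination of the node inputs, then a non-linear activation.
            return activation(np.dot(weights, inputs) + bias)

        # A node with four inputs combined by weights w1, w2, w3, and w4.
        y = node_output(np.array([0.2, -1.0, 0.5, 0.3]),
                        np.array([0.1, 0.4, -0.3, 0.8]), bias=0.05)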
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the layer(s) may include a single layer acting as both an input layer and an output layer.
  • the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
  • a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for hand
  • the training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • in forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • in backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
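  • The forward/backward iteration and the bias term can be sketched for a single linear layer as follows (NumPy; the synthetic data, learning rate, and loss threshold are arbitrary illustrative values, not part of this disclosure):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 4))                  # training data provided at the input layer
        y = X @ np.array([0.5, -1.0, 2.0, 0.3]) + 0.1  # labelled ground truth

        w, b = np.zeros(4), 0.0                        # weights and network bias term b
        lr, loss_threshold = 0.05, 1e-4

        for _ in range(10000):
            pred = X @ w + b                           # forward propagation
            err = pred - y
            loss = np.mean(err ** 2)                   # margin of error (loss function)
            if loss < loss_threshold:                  # predefined convergence condition
                break
            # backward propagation: adjust weights and bias to decrease the error
            w -= lr * (2.0 / len(X)) * (X.T @ err)
            b -= lr * 2.0 * err.mean()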
  • FIG. 5 illustrates a simplified process 500 of rendering an output image 502 from an input image 504, in accordance with some embodiments.
  • An electronic device (e.g., a mobile phone 104C) executes a user application 224 (e.g., a photo album application) having a 3D rendering function.
  • the input image 504 is optionally captured by a camera 260 of the electronic device or transferred to the electronic device.
  • the 3D rendering function of the user application 224 determines additional depth information and color information of the input image 504 for enabling a 3D effect of the input image 504.
  • the depth information of the input image 504 is estimated from one or more images (which optionally includes the input image 504) using a computer vision processing method and/or a machine learning method 506. Alternatively, in some embodiments, the depth information of the input image 504 is captured directly by one or more depth sensors 280. In some embodiments, the color information of the input image 504 is estimated from one or more images (which optionally includes the input image 504) captured by the camera 260.
  • the depth information and color information of the input image 504 are stored in layered depth images (LDIs) 508.
  • the LDIs 508 are optionally extracted locally and rendered into the output image 502.
  • the LDIs 508 may be transferred to a distinct electronic device for further processing, e.g., via a server 102.
  • the depth information and color information of the input image 504 are organized, stored, and transferred based on a set of meshes 510 of objects.
  • the user application 224 automatically generates the LDIs 508 for each image 502 or for an image 502 that satisfies a 3D conversion criterion.
  • the user application 224 includes a user actionable affordance item configured to receive a user action.
  • the user application 224 automatically generates the LDIs 508 for the input image 504 and stores the input image 504 with the LDIs 508.
  • the user application 224 is configured to display the output image 502 with the 3D effect on a user interface based on the depth information and color information of the input image 504.
  • FIG. 6 illustrates another example process 600 of serializing layered depth images (LDIs) 508 into a data serialization file 660, in accordance with some embodiments.
  • the process 600 is implemented at an electronic device (e.g., a mobile phone 104C, an HMD 104D) to generate the LDIs 508 and associated data serialization file 660 from the input image 504.
  • the LDIs 508 can be organized according to information stored in the data serialization file 660 to render a 3D effect for a two-dimensional (2D) input image 504.
  • the electronic device obtains the input image 504 and an input depth map 604 corresponding to the input image 504.
  • the input image 504 and input depth map 604 form a top image layer 606 of the LDIs 508.
  • the input depth map 604 is generated from the input image 504, e.g., using a machine learning method and/or a computer vision processing method.
  • MiDaS is a machine learning model established based on ResNet and applied to estimate the input depth map 604.
  • the input depth map 604 is measured by one or more depth sensors 280 concurrently while the input image 504 is captured by a camera 260.
  • the input depth map 604 includes a plurality of depth pixels, and each depth pixel includes a depth value and one or more of: a color value, a segmentation label, a silhouette pixel flag value.
  • the depth value includes a normalized depth value in a range of [0, 1], and each segmentation label has one of a first value indicating that the respective depth pixel corresponds to a foreground object 514 on the corresponding input image 504 and a second value indicating that the respective depth pixel corresponds to a pixel occluded by the foreground object 514.
  • the electronic device identifies one or more foreground objects 514 (e.g., the bird in Figure 5) in the input image 504.
  • the one or more foreground objects 514 include a single foreground object 514.
  • the one or more foreground objects 514 include a plurality of foreground objects 514-1, 514-2, ..., and 514-N, where N is equal to 2 or greater.
  • For a first foreground object 514-1, the electronic device generates a set of first images 608A (e.g., including 608A-1, ..., and 608A-M, where M is a positive integer).
  • Each first image 608A-i captures a respective visual view that is occluded by the first foreground object 514-1.
  • a first image layer 608 is formed to include the set of first images 608A of the first foreground object 514-1.
  • the input image 504 and the first image layer 608 are associated with a data serialization file 660 for rendering the input image 504 in a 3D format.
  • the data serialization file 660 identifies locations of the set of first images 608A of the first image layer 608 in the input image 504.
  • the electronic device determines a set of first depth maps 608B-1, ..., and 608B-M corresponding to the set of first images 608A-1, ..., and 608A-M, respectively.
  • Each first depth map 608B-i corresponds to a respective distinct first image 608A-i, and has a first common location in the input image 504 with the respective distinct first image 608A-i.
  • the first image layer 608 further includes the set of first depth maps 608B.
  • the corresponding first common location is stored in the data serialization file 660.
  • the input image 504, input depth map 604, and first image layer 608 are associated and stored with the data serialization file 660 for rendering the input image 504 in 3D format.
  • the data serialization file 660 stores locations of the set of first images 608A of the first image layer 608 in the input image 504.
  • the electronic device extracts the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and the set of first depth maps 608B, and data serialization file 660.
  • the electronic device determines the locations of the set of first images 608A of the first image layer 608 in the input image 504 from the data serialization file 660, combines the set of first images 608A and the set of first depth maps 608B of the first image layer 608 with the input image 504 according to the determined locations, and renders the input image 504 in the 3D format.
  • the electronic device (e.g., a first electronic device) provides the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and the set of first depth maps 608B, and data serialization file 660 to a second electronic device.
  • the second electronic device determines the locations of the set of first images 608A of the first image layer 608 in the input image 504 from the data serialization file 660, combines the set of first images 608A and the set of first depth maps 608B of the first image layer 608 with the input image 504 according to the determined locations, and renders the input image 504 in the 3D format on the second electronic device.
  • the set of first images 608A capture visual views related to the outline region of the first foreground object 514-1.
  • the outline region of the first foreground object 514-1 includes an outline, edge, or contour of the first foreground object 514-1 and portions of the first foreground object 514-1 that are immediately adjacent to the outline, edge, or contour.
  • the portions of the first foreground object 514-1 include a number of pixels (e.g., 100 pixels) immediately adjacent to the outline.
  • the electronic device further identifies one or more layer objects 610 (including 610-1, ..., 610-K, where K is a positive integer) on the first image layer 608.
  • a set of second images 612A capture respective visual views that are occluded by the first layer object 610-1, and form a second image layer 612.
  • the set of second images 612A include one or more second images 612A-1, ..., and 612A-L, where L is a positive integer.
  • the second image layer 612 is associated with the data serialization file 660 for rendering the input image in the 3D format.
  • the data serialization file further identifies locations of the set of second images 612A of the second image layer 612 in the input image 504. Further, in some embodiments, the electronic device determines a set of second depth maps 612B-1, ..., and 612B-M corresponding to the set of second images 612A-1, ..., and 612A-M, respectively.
  • Each second depth map 612B-i corresponds to a respective distinct second image 612A-i, and has a second common location in the input image 504 with the respective distinct second image 612A-i.
  • the second image layer 612 further includes the set of second depth maps 612B.
  • the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and the set of first depth maps 608B, second image layer 612 including the set of second images 612A and the set of second depth maps 612B, and data serialization file 660 are grouped together for rendering the input image 504 in the 3D format.
  • the electronic device extracts the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and/or the set of first depth maps 608B, second image layer 612 including the set of second images 612A and/or the set of second depth maps 612B, and data serialization file 660.
  • the electronic device determines the locations of the set of first images 608A and the set of second images 612A in the input image 504 from the data serialization file 660, combines the sets of first images 608A, first depth maps 608B, second images 612A, and second depth maps 612B with the input image 504 according to the determined locations, and renders the input image 504 in the 3D format.
  • the electronic device further identifies a second layer object on the second image layer, and determines a set of third images that capture respective visual views occluded by the second layer object.
  • the electronic device forms a third image layer including the set of third images of the second layer object, and associates the third image layer with the data serialization file 660 for rendering the input image 504 in the 3D format.
  • the data serialization file 660 further identifies locations of the set of third images of the third image layer in the input image 504.
  • the input image 504 further includes one or more remaining foreground objects 514-2, ..., and/or 514-N (where N is equal to 2 or above), which are distinct from the first foreground object 514-1.
  • the electronic device determines a set of remaining images that capture respective visual views occluded by each remaining foreground object 514-2, ..., and/or 514-N.
  • the first image layer 608 is formed to include the set of remaining images of each remaining foreground object 514-2, ..., and/or 514-N.
  • the data serialization file 660 further identifies locations of the set of remaining images of each remaining foreground object 514-2, ..., and/or 514-N of the first image layer 608 in the input image 504.
  • Figure 7A illustrates example LDIs 508 of an input image 504, in accordance with some embodiments
  • Figure 7B is a perspective view 750 of objects in an input image 504, in accordance with some embodiments
  • Figure 7C is an example YAML file 780 that is outputted as a serialization file 660, in accordance with some embodiments.
  • the LDIs 508 include an input image 504, an input depth map 604 optionally captured by a depth sensor 280 or obtained from the input image 504, and a first image layer 608 including a set of first images 608A and a set of corresponding first depth maps 608B.
  • the input image 504 and input depth map 604 form a top image layer 606.
  • a data serialization file 660 identifying locations of the set of first images 608A of the first image layer 608 in the input image 504.
  • the set of first depth maps 608B of the first image layer 608 have the same locations in the input image 504 as the set of first images 608A.
  • An electronic device constructs and serializes the LDIs 508 for rendering a 3D scene of interest. During this process, the electronic device obtains depth information and color information of the scene of interest.
  • a plurality of rays originate from a camera view point, extend into the 3D scene, and intersect with one or more surfaces (e.g., a surface of an object located in the 3D scene).
  • a surface is optionally divided into multiple segments having associated labels.
  • a non-occluded surface is classified as a foreground surface captured in the input image 504, and an occluded surface behind the foreground surface is classified with a different label. The occluded surface is recorded in an image layer hidden behind the input image 504.
  • the electronic device detects a surface border of a surface, and identifies a set of pixels associated with the surface border as silhouette pixels.
  • a surface in the 3D scene includes a plurality of depth pixels, and each depth pixel has one or more of: a depth value, a color value, a segmentation label, and a silhouette pixel flag value.
  • the depth value includes a normalized depth value in a range of [0, 1], and the color value is a combination of red, green, blue, and alpha values.
  • the segmentation label is one of a first value indicating that the respective depth pixel corresponds to a foreground surface of a foreground object on the corresponding input image and a second value indicating that the respective depth pixel corresponds to a pixel occluded by the foreground surface.
  • the silhouette pixel flag value includes a Boolean value (e.g., true or false) indicating whether the corresponding depth pixel belongs to a surface border.
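For illustration, the per-pixel record described above can be held in a small structure such as the following Python sketch; the class and field names are hypothetical and only mirror the four attributes listed here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DepthPixel:
    """One intersection of a viewing ray with a surface (hypothetical layout)."""
    depth: float                       # normalized depth value in [0.0, 1.0]
    color: Tuple[int, int, int, int]   # (red, green, blue, alpha) values
    segmentation_label: int            # e.g., 0: non-occluded foreground surface, 1+: occluded surface
    is_silhouette: bool                # True if the pixel lies on a surface border

# Example: a silhouette pixel on the foreground surface
pixel = DepthPixel(depth=0.25, color=(128, 64, 32, 255),
                   segmentation_label=0, is_silhouette=True)
```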
  • a set of depth pixels of a surface are gathered as first depth maps 608B and associated with an image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels).
  • the image segment is written into a data serialization file 660, which is stored in a local memory or transmitted to another electronic device.
  • the data serialization file 660 is extracted and deserialized for rendering a 3D representation of the scene.
  • each image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels) is bound by a bounding box.
  • Information of the bounding box is determined with respect to a viewing frame of the input image 504.
  • the information of the bounding box is stored with one or more of information items of the depth pixels of the image segment in the data serialization file 660.
  • the data serialization file 660 includes a Yet Another Markup Language (YAML) file (e.g., in Figure 7C) or an Extensible Markup Language (XML) file.
  • the scene includes five objects having five front planes located at five distinct depths.
  • the five objects include a foreground object 514 that is located in front of other four planes from a camera view point 702, i.e., a first depth of the foreground object 514 is smaller than each other depth of the remaining objects 704-710.
  • the foreground object 514 occludes a subset of each of the remaining objects 704, 706, 708, and 710.
  • the occluded portions of the remaining objects 704, 706, 708, and 710 include a set of first images 608A-1, 608A-2, 608A-3, and 608A-4, which correspond to a set of first depth maps 608B-1, 608B-2, 608B-3, and 608B-4, respectively.
  • the set of first images 608A-1, 608A-2, 608A-3, and 608A-4 and the set of first depth maps 608B-1, 608B-2, 608B-3, and 608B-4 form a first image layer 608.
  • the data serialization file 660 includes information 712 of the input image 504, a total number of labels 714, image labels 716, and information 718 of bounding boxes of images of the first image layer 608.
  • the total number of labels corresponds to a number of first images 608A occluded by a foreground object 514, e.g., 4 first images 608A.
  • the image labels include a serial number of each image in the set of first images 608A.
  • the 4 first images 608A are defined by bounding boxes located at designated locations 718 of the input image 504.
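As a rough illustration of such a serialization file, the sketch below writes and reads an analogous YAML document with PyYAML; the key names and bounding-box coordinates are invented placeholders, not the actual schema of the data serialization file 660.

```python
import yaml  # PyYAML

# Illustrative metadata for the first image layer; the schema is hypothetical.
serialization = {
    "input_image": {"file": "input_image.png", "width": 6000, "height": 4000},
    "num_labels": 4,                      # number of occluded first images 608A
    "labels": [1, 2, 3, 4],               # serial number of each first image
    "bounding_boxes": {                   # (x, y, width, height) in input-image coordinates
        1: [1200, 800, 256, 192],
        2: [1500, 820, 240, 200],
        3: [1480, 1400, 220, 180],
        4: [1150, 1420, 260, 210],
    },
}

with open("ldi_serialization.yaml", "w") as f:
    yaml.safe_dump(serialization, f, sort_keys=False)

# Deserialization reads the same file to recover the layer locations.
with open("ldi_serialization.yaml") as f:
    restored = yaml.safe_load(f)
```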
  • Figure 8 is a flow diagram of an example process 800 of serializing and deserializing LDIs 508 of an input image 504, in accordance with some embodiments.
  • An electronic device obtains a single input image 504 and renders an output image 502 that has a 3D effect of the input image 504.
  • LDIs 508 are constructed and serialized with a data serialization file 660.
  • the electronic device determines (802) an input depth map 604 including 3D information of the input image 504.
  • the input depth map 604 is generated from the input image 504 using an artificial intelligence or computer vision technique.
  • MiDaS is a machine learning model established based on ResNet and applied to estimate the input depth map 604.
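For context, single-image depth estimation with MiDaS can be sketched as below; the torch.hub entry points follow the public intel-isl/MiDaS repository and are assumptions that may vary by release.

```python
import cv2
import torch

# Load a MiDaS model and its matching input transform (assumed hub entry points).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

img = cv2.cvtColor(cv2.imread("input_image.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img))
    # Resize the prediction back to the input-image resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().cpu().numpy()

# Normalize to [0, 1] to match the normalized depth values used in the description.
depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
```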
  • the electronic device processes (804) the input depth map 604 to identify a foreground object 514 that is not occluded by any other object and detects (806) a surface border 814 of a surface (i.e., an outline, edge, or contour) of the foreground object 514.
  • the surface border 814 includes a plurality of silhouette pixels (e.g., 922 in Figure 9B) identified (806) by silhouette pixel flag values.
  • the foreground object 514 occludes other objects, surfaces, or view in the input image 504.
  • the electronic device identifies (808) a context region of each set of successive silhouette pixels 922 in a background immediately adjacent to the foreground object 514 of the input image 504, and applies the context region to determine (810) a corresponding first image 608A-i occluded by the respective set of successive silhouette pixels 922.
  • Each foreground object 514 includes a plurality of sets of successive silhouette pixels 922, and the set of first images 608A is determined (812) for the plurality of sets of successive silhouette pixels 922 based on corresponding context regions that are located in the background of the input image 504 and immediately adjacent to the foreground object 514.
  • an inpainting model includes a neural network, and is applied to process the input image 504 and context regions to determine the set of first images 608A.
  • one or more of the first images 608A are captured by a camera directly.
  • Each of the set of first images 608A is associated with a respective edge region of the foreground object 514.
  • each pixel on a foreground object 514 of the input image is associated with a non-occluded foreground surface, while each of the first images 608A is occluded by the foreground object 514 and assigned with different labels.
  • the electronic device further determines a depth pixel at each intersected surface of a ray radiated to the input image’s pixel location.
  • Each depth pixel has one or more of a depth value, a color value, a segmentation label, and a silhouette pixel flag value.
  • a set of depth pixels of a surface are gathered to form first depth maps 608B and associated with an image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels).
  • a location of each image segment is written (816) into a data serialization file 660, which is stored in a local memory or transmitted to another electronic device.
  • the data serialization file 660 is extracted and deserialized for rendering a 3D representation of the scene.
  • each image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels) is bound by a bounding box.
  • Information of the bounding box is determined with respect to a viewing frame of the input image 504.
  • the information of the bounding box is stored with one or more of information items of the depth pixels of the image segment in the data serialization file 660.
  • the data serialization file 660 includes an XML or YAML file.
  • the input image 504, input depth map 604, the set of first images 608A (e.g., 608A-1 to 608A-4), and the set of first depth maps 608B (e.g., 608B-1 to 608B-4) are collectively called layered depth images (LDIs) 508.
  • a top image layer 606 includes the input image 504 and input depth map 604.
  • the first image layer 608 includes the set of first images 608A and the set of first depth maps 608B, and each first image 608A-i is paired with a respective first depth map 608B-i.
  • the LDIs 508 include the top image layer 606, the first image layer 608, and one or more additional image layers (if any).
  • a serialization operation 816 is applied to the LDIs 508 to generate the data serialization file 660 and prepare the LDIs 508 for storage or transmission.
  • a deserialization operation 818 is opposite to the serialization operation 816 and extracts the LDIs 508 for rendering an output image 502 including a 3D effect on the input image 504.
  • the LDIs 508 includes the input image 504, input depth map 604, set of first images 608A, and set of first depth maps 608B, and are arranged according to the data serialization file 660.
  • a 3D mesh model is constructed (820) based on the LDIs 508.
  • the 3D mesh model is applied jointly with camera parameters 822 (e.g., a camera location, a camera orientation, a field of view) to render (824) the output image 502 including the 3D effect on the input image 504.
  • the deserialization operation 818 is implemented at a user application (e.g., a photo album application) of the same electronic device that serializes the input image 504.
  • the deserialization operation 818 is implemented at a user application of a distinct electronic device, which receives the LDIs 508 and the data serialization file 660 from the electronic device that serializes the input image 504 (e.g., by way of a server 102).
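To make the deserialization side concrete, the following sketch reloads a serialization file of the kind illustrated earlier and pastes each first image and first depth map back into full-resolution canvases at their recorded bounding boxes; the file names and keys are hypothetical, and mesh construction plus OpenGL rendering are omitted.

```python
import numpy as np
import yaml
from PIL import Image

with open("ldi_serialization.yaml") as f:
    meta = yaml.safe_load(f)

h, w = meta["input_image"]["height"], meta["input_image"]["width"]
layer_rgba = np.zeros((h, w, 4), dtype=np.uint8)   # first image layer canvas
layer_depth = np.zeros((h, w), dtype=np.float32)   # first depth map canvas

for label in meta["labels"]:
    x, y, bw, bh = meta["bounding_boxes"][label]
    # Each segment is stored as a small image/depth pair (hypothetical file names).
    rgba = np.asarray(Image.open(f"first_image_{label}.png").convert("RGBA"))
    depth = np.load(f"first_depth_{label}.npy")
    layer_rgba[y:y + bh, x:x + bw] = rgba[:bh, :bw]
    layer_depth[y:y + bh, x:x + bw] = depth[:bh, :bw]

# layer_rgba / layer_depth are now aligned with the input image and can be
# converted to meshes for rendering the 3D effect.
```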
  • Figure 9A illustrates LDIs 508 including two image layers 606 and 608, in accordance with some embodiments
  • Figure 9B illustrates information 920 of an image segment occluded by a foreground object 514 of an input image 504, in accordance with some embodiments
  • Figure 9C is an example serialization file 660 associated with an image layer 608 shown in Figure 9A, in accordance with some embodiments.
  • An input image 504 has a foreground object 514 (e.g., a person).
  • An outline 814 of the foreground object 514 is identified in the input image 504, and corresponds to a plurality of silhouette pixels 922 of the foreground object 514.
  • the outline 814 of the foreground object 514 is expanded to an outline region, e.g., by identifying a number of pixels immediately adjacent to the outline.
  • the outline region of the foreground object 514 is divided into a plurality of image segments that are occluded by the foreground object 514 in the input image 504.
  • Each image segment corresponds to a respective set of silhouette pixels 922.
  • the electronic device identifies a context region of the respective set of silhouette pixels in a background immediately adjacent to the foreground object 514 of the input image 504, and applies the context region to determine information 920 of the respective image segment.
  • the information of the plurality of image segments obtained from the outline region of the foreground object 514 of the input image 504 includes a set of first images 608A, a set of first depth maps 608B, a set of segmentation label maps 608C, and a set of silhouette pixel maps 608D.
  • the information of the plurality of image segments of the input image 504 is further stored with the data serialization file 660. Referring to Figure 9A, in an example, 7 image segments are divided from an outline region of the foreground object 514 of the input image 504.
  • Information of these 7 image segments includes 28 images or maps, i.e., 7 first images 608A, 7 first depth maps 608B, 7 segmentation label maps 608C, and 7 silhouette pixel maps 608D. More specifically, referring to Figure 9B, the fourth image segment of the outline region of the foreground object 514 has a first image 608A-4, a first depth map 608B-4, a segmentation label map 608C-4, and a silhouette pixel map 608D-4.
  • the 28 images or maps are stored jointly with the input image 504, the input depth map 604, and a YAML file 660.
  • the input image 504 has an image resolution of 4000x6000 and a file size of 3MB
  • the input depth map 604 and the 28 images or maps of the first image layer 608 have a total file size of 0.6MB.
  • a size of the data serialization file 660 is negligible compared with the total size of the input image 504, input depth map 604, and 28 images or maps (e.g., 3.6MB).
  • a 3D mesh model enables the 3D effect of the input image 504 and has a resolution of 480x640.
  • the 3D mesh model has a size of at least 15.3MB including 3MB for the input image 504 and 12.3MB for data of meshes.
  • the output image 502 is rendered based on up-sampling of the meshes with the input image 504 in a 3D rendering pipeline.
  • the LDIs 508 and data serialization file 660 reduce the required storage by approximately 20.5 times (i.e., 0.6MB for the first image layer 608 versus 12.3MB for the mesh data).
  • the serialization operation 816 requires only 0.6MB of storage for the first image layer 608, whereas the 3D mesh model stores low-resolution 3D meshes and requires additional up-sampling operations.
  • 3D visual information is managed efficiently to conserve storage and communication resources allocated for rendering the 3D effect of the input image 504.
  • the data serialization file 660 includes information 912 of the input image 504, a total number of labels 914, image labels 916, and information 918 of bounding boxes of images of the first image layer 608.
  • the total number of labels corresponds to a number of first images 608A occluded by a foreground object 514, e.g., 7 first images 608A.
  • the image labels include a serial number of each image in the set of first images 608A.
  • the 7 first images 608A are defined by bounding boxes located at designated locations 918 of the input image 504.
  • a size of the data serialization file 660 is negligible compared with the sizes of the images or maps in the LDIs 508.
  • the capability to serialize and deserialize the LDIs 508 is important, allowing the input image 504 to be preprocessed up to the step implemented immediately prior to 3D rendering. This saves preprocessing time, particularly if the preprocessing steps take a considerable amount of time before 3D rendering. Additionally, the serialized LDIs 508 are compact in storage space and transmission bandwidth, particularly when compared with 3D rendering meshes and videos using advanced video compression formats (e.g., MP4, H264). Deserialization of the LDIs 508 is fast, and the deserialized LDIs can readily be converted to meshes for OpenGL rendering in hardware implementing 3D view generation.
  • Figure 10 is a flow diagram of an example image processing method 1000, in accordance with some embodiments.
  • the method 1000 is described as being implemented by an electronic device (e.g., a mobile phone 104C).
  • Method 1000 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 10 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic device obtains (1002) an input image 504 and obtains (1004) an input depth map 604 corresponding to the input image 504.
  • the electronic device identifies (1006) a foreground object 514 in the input image 504, and determines (1008) a set of first images that capture respective visual view that is occluded by the foreground object 514.
  • a first image layer 608 is formed (1010) to include the set of first images 608A of the foreground object 514.
  • the electronic device associates (1012) the input image 504 and the first image layer 608 with a data serialization file 660 for rendering the input image 504 in a 3D format.
  • the data serialization file 660 identifies (1014) locations of the set of first images 608A of the first image layer 608 in the input image 504.
  • each first image 608A-i corresponds to a portion of a contour of the foreground object 514.
  • the input depth map 604 is generated from the input image 504, e.g., using an artificial intelligence or computer vision technique.
  • MiDaS is a machine learning model established based on ResNet and applied to estimate the input depth map 604.
  • the input depth map 604 is measured by one or more depth sensors 280 concurrently while capturing the input image 504 by a camera.
  • the electronic device optionally includes the one or more depth sensors 280 and camera 260.
  • the input depth map 604 includes a plurality of depth pixels. Each depth pixel includes a normalized depth value in a range of [0, 1], a color value, a segmentation label, and a silhouette pixel flag value.
  • the segmentation label is one of a first value indicating that the respective depth pixel corresponds to the foreground object 514 on the corresponding input image 504 and a second value indicating that the respective depth pixel corresponds to a pixel on the first image layer 608 that is occluded by the foreground object 514.
  • the electronic device determines a set of first depth maps 608B corresponding to the set of first images 608A. Each first depth map 608B-i corresponds to, and has a first common location in the input image 504 as, a distinct first image. Further, in some embodiments, the electronic device obtains the input image 504, the input depth map 604, the first image layer 608 including the set of first images 608A and the set of first depth maps 608B, and the data serialization file 660.
  • the electronic device determines the locations of the set of first images 608A of the first image layer 608 in the input image 504 from the data serialization file 660, and combines the set of first images 608A and the set of first depth maps 608B of the first image layer 608 with the input image 504 according to the locations of the set of first images 608A of the first image layer 608 in the input image 504, thereby rendering the input image 504 in the 3D format.
  • the electronic device identifies a layer object 610 on the first image layer 608 for each foreground object 514, and determines a set of second images 612A that capture respective visual view that is occluded by the layer object 610.
  • the electronic device forms a second image layer 612 including the set of second images 612A of the layer object 610, and associates the second image layer 612 with the data serialization file 660 for rendering the input image 504 in the 3D format.
  • the data serialization file 660 further identifies locations of the set of second images 612A of the second image layer 612 in the input image 504.
  • the electronic device determines a set of second depth maps 612B corresponding to the set of second images 612A.
  • Each second depth map 612B-i corresponds to, and has a second common location in the input image 504 as, a distinct second image 612A-i.
  • the second image layer 612 further includes the set of second depth maps 612B.
  • the foreground object 514 includes a first foreground object 514-1.
  • the electronic device identifies one or more remaining foreground objects 514-2, ..., 514-N distinct from the first foreground object 514-1 in the input image 504, and determines a set of remaining images that capture respective visual view that is occluded by each remaining foreground object 514.
  • the first image layer 608 is formed to include the set of remaining images of each remaining foreground object 514, and the data serialization file 660 further identifies locations of the set of remaining images of each remaining foreground object 514 of the first image layer 608 in the input image 504.
  • the electronic device determines the set of first images 608A that capture respective visual view that is occluded by the foreground object 514 by applying an inpainting model to the input image 504 to determine the set of first images 608A, the inpainting model including a neural network. Further, in some embodiments, for each first image 608A-i, the electronic device identifies one or more context regions adjacent to the respective first image 608A-i on the input image 504. The inpainting model is applied to the one or more context regions to generate the respective first image 608A-i.
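As a simplified stand-in for the learned inpainting model just described, the sketch below uses OpenCV's classical inpainting, where the pixels surrounding the mask play the role of the context region; it is only a placeholder for the neural network contemplated here.

```python
import cv2
import numpy as np

def inpaint_occluded_segment(image_bgr, occluded_mask, context_radius=15):
    """Fill pixels hidden by the foreground object from surrounding background context.

    image_bgr:     H x W x 3 uint8 input image.
    occluded_mask: H x W uint8 mask, 255 where the first image must be synthesized.
    """
    # Classical inpainting propagates colors from the pixels around the mask
    # (the context region); a neural inpainting model would be applied here instead.
    return cv2.inpaint(image_bgr, occluded_mask, context_radius, cv2.INPAINT_TELEA)
```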
  • the electronic device determines the set of first images 608A that capture respective visual view that is occluded by the foreground object 514 by capturing the set of first images 608A by a camera.
  • each of the set of first images 608A includes a bounding box
  • the data serialization file 660 includes a YAML or XML file configured to store a location of the bounding box of each first image 608A-i as the location of the respective first image 608A-i in the input image 504.
  • the electronic device stores the input image 504, the input depth image, one or more image layers including the first image layer 608, and the data serialization file 660 jointly in memory of the electronic device.
  • the electronic device transfers the input image 504, the input depth image, the one or more image layers including the first image layer 608, and the data serialization file 660 to a computer system, such that the computer system renders the input image 504 in the 3D format.
  • the input image 504 corresponds to a set of 3D meshes for rendering the input image 504 in the 3D format, the set of 3D meshes having a mesh file size.
  • the input image 504, the one or more image layers, and the data serialization file 660 have a total file size.
  • the total file size is less than 10% of the mesh file size.
  • the total file size corresponding to a high-resolution serialized LDI is approximately 0.6MB, and the mesh file size for low-resolution 3D meshes that require up-sampling is 12.3MB.
  • the total file size is less than 5% of the mesh file size.
  • Figure 11 is a flow diagram explaining a method 1100 for updating edge pixels of an object in an input image 1102, in accordance with some embodiments.
  • the input image 1102 is associated with a depth map 1104.
  • the input image 1102 is captured by a camera 260.
  • the depth map 1104 is generated from the input image 1102, e.g., using a neural network.
  • the depth map 1104 is measured by one or more depth sensors 280 jointly with the input image 1102 captured by the camera 260.
  • a foreground object 1106 is identified in the input image 1102 based on the depth map 1104.
  • the foreground object 1106 is defined in the input image 1102 and the depth map 1104 by a plurality of edge pixels 1108 associated with an edge of the foreground object 1106.
  • An edge map 1110 is generated to identify the plurality of edge pixels 1108 in the input image 1102.
  • depth values of the depth map 1104 are compared with one or more depth thresholds to binarize the depth map 1104 and generate the edge map 1110.
  • the edge map 1110 is a binary map having a first set of pixels and a second set of pixels.
  • the first set of pixels correspond to the plurality of edge pixels 1108 and have a first binary value (e.g., “1”)
  • the second set of pixels correspond to a plurality of remaining pixels 1112 and have a second binary value (e.g., “0”).
  • Each remaining pixel 1112 is distinct from the plurality of edge pixels 1108 in the input image 1102.
  • Each edge pixel 1108 corresponds to a respective background pixel that is occluded by the foreground object 1106.
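One plausible way to derive such a binary edge map from the depth map is to threshold depth discontinuities and slightly thicken the result, as in the sketch below; the threshold and border width are assumptions.

```python
import cv2
import numpy as np

def make_edge_map(depth, depth_threshold=0.05, border_width=3):
    """Binary edge map: 1 at depth discontinuities (edge pixels), 0 elsewhere.

    depth: H x W float array with normalized depth values in [0, 1].
    """
    dy = np.abs(np.diff(depth, axis=0, prepend=depth[:1, :]))
    dx = np.abs(np.diff(depth, axis=1, prepend=depth[:, :1]))
    edges = ((dx > depth_threshold) | (dy > depth_threshold)).astype(np.uint8)
    # Include a few pixels immediately adjacent to the detected edge.
    kernel = np.ones((border_width, border_width), np.uint8)
    return cv2.dilate(edges, kernel)
```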
  • a respective alpha value 1114 is determined for each of the plurality of edge pixels 1108.
  • the input image 1102 is rendered by blending colors of each of the plurality of edge pixels 1108 and the respective background pixel based on the respective alpha value.
  • each edge pixel 1108 is a foreground pixel on the foreground object, and corresponds to the background pixel 1108’ occluded by the foreground object 1106.
  • the foreground pixel 1108 has a first alpha value 1114F
  • the background pixel 1108’ has a second alpha value 1114B.
  • the foreground pixel 1108 and the background pixel 1108’ are combined in a weighted manner using the first and second alpha values 1114F and 1114B to generate an updated edge pixel 1116 for the input image 1102.
  • the plurality of edge pixels 1108 are updated jointly on an image level (e.g., using an edge map 1110).
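The joint, image-level update can be written as a per-pixel weighted blend between the foreground colors and the reconstructed background colors, as sketched below; it assumes an alpha map normalized to [0, 1] and a fully reconstructed background layer.

```python
import numpy as np

def blend_edges(foreground_rgb, background_rgb, alpha_map):
    """Blend foreground edge pixels with the background they occlude.

    foreground_rgb: H x W x 3 float array (input image colors).
    background_rgb: H x W x 3 float array (occluded colors behind the foreground).
    alpha_map:      H x W float array in [0, 1]; 1.0 keeps the foreground color.
    """
    a = alpha_map[..., None]
    return a * foreground_rgb + (1.0 - a) * background_rgb
```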
  • each edge pixel 1108 and the respective background pixel 1108’ correspond to a location of the input image 1102. In some embodiments, each edge pixel 1108 and the respective background pixel 1108’ are aligned on a line of sight originated from a view point where the input image 1102 is captured by a camera 260. In some embodiments, a set of first images (e.g., 612A in Figure 6) are obtained to capture respective visual view that is occluded by the foreground object 1106 (e.g., foreground object 514-1 in Figure 6). The respective background pixel 1108’ of each edge pixel 1108 is located on a respective one of the set of first images.
  • the set of first images (e.g., 612A in Figure 6) are captured by a camera.
  • an inpainting model is applied to process the input image 1102 and generate the set of first images.
  • the inpainting model includes a neural network.
  • for each first image, one or more context regions adjacent to the respective first image are identified on the input image, and the inpainting model is applied to process the one or more context regions to generate the respective first image. More details on identifying the respective background pixel 1108’ of each edge pixel 1108 are explained above with reference to Figures 6 and 8.
  • a Gaussian blur filter is applied to the edge map 1110 to generate a blurred edge map 1118 (also called a pre-alpha map 1118).
  • the blurred edge map 1118 is converted to an alpha map 1120 on a pixel-by-pixel basis.
  • the edge map 1110 and blurred edge map 1118 have the same resolution that is equal to that of the depth map 1104.
  • the alpha map 1120 has a plurality of alpha values corresponding to a plurality of pixels of the input image 1102 and depth map 1104, and the plurality of alpha values include the respective alpha value of each edge pixel 1108 of the input image 1102 and depth map 1104.
  • the alpha value controls a transparency or opacity level of a color, and is optionally represented as a real value, a percentage, or an integer: full transparency is 0.0, 0% or 0, whereas full opacity is 1.0, 100% or 255, respectively.
  • respective alpha values of the plurality of edge pixels 1108 are normalized in a range of [0, 1.0].
  • respective alpha values of the plurality of edge pixels 1108 are normalized in a range of [0, 255]
  • Each respective alpha value is optionally an integer.
  • respective alpha values of the plurality of edge pixels 1108 are normalized in a range of [0, 100%].
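A minimal sketch of converting the binary edge map into such an alpha map, by Gaussian blurring and normalizing to [0, 1], is shown below; the kernel size, sigma, and the mapping from blur response to opacity are assumptions.

```python
import cv2
import numpy as np

def edge_map_to_alpha(edge_map, kernel_size=7, sigma=2.0, min_alpha=0.5):
    """Edge map 1110 -> blurred edge map 1118 -> alpha map 1120 in [0, 1]."""
    blurred = cv2.GaussianBlur(edge_map.astype(np.float32),
                               (kernel_size, kernel_size), sigma)
    blurred /= blurred.max() + 1e-8   # normalize the blur response to [0, 1]
    # Fully opaque (alpha = 1.0) away from the border; edge pixels become partially
    # transparent so their colors can be mixed with the occluded background.
    return 1.0 - (1.0 - min_alpha) * blurred
```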
  • the input image 1102 has a first resolution
  • the depth map 1104 has a second resolution that is lower than the first resolution.
  • Each pixel of the depth map 1104 corresponds to a set of adjacent pixels on the input image 1102.
  • the edge map 1110, blurred edge map 1118, and alpha map 1120 have the same resolution, which is equal to the first resolution of the input image 1102.
  • the depth map 1104 is up-sampled to the first resolution and applied to generate the edge map 1110 and alpha map 1120.
  • the edge map 1110 and alpha map 1120 are further used to generate the updated edge pixels 1116 of the input image 1102.
  • the edge map 1110, blurred edge map 1118, and alpha map 1120 have the same resolution, which is equal to the second resolution of the depth map 1104.
  • the edge map 1110 and alpha map 1120 are up-sampled to the first resolution and used to generate the updated edge pixels 1116 in the input image 1102.
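Where the maps are generated at the lower depth-map resolution, they can be brought to the input-image resolution with a simple resize, as sketched below (the interpolation mode is an assumption).

```python
import cv2

def upsample_to_image(alpha_map, image_width, image_height):
    """Up-sample a low-resolution alpha (or edge) map to the input-image resolution."""
    # cv2.resize takes the destination size as (width, height).
    return cv2.resize(alpha_map, (image_width, image_height),
                      interpolation=cv2.INTER_LINEAR)
```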
  • Figure 12 is a flow diagram of an image processing method 1200 implemented based on color blending, in accordance with some embodiments.
  • An electronic device obtains (1202) 3D spatial information, color information, and alpha values of a scene of interest.
  • the scene is a real scene recorded on an input image 1102 ( Figure 11) captured by a camera 260.
  • the scene is created by an artist’s 3D graphical drawing or rendering, and the input image 1102 is a product of the artist’s work.
  • the input image 1102 is used by itself to extract the 3D spatial information, color information, and alpha values of the scene of interest based on a machine learning model, e.g., MiDaS.
  • MiDaS is established based on ResNet and applied to estimate the depth map 1104 (Figure 11).
  • the input image 1102 is optionally captured by the camera 260 or created by the artist. Alternatively, in some situations, the input image 1102 creates 3D effects jointly with one or more additional images.
  • the input image 1102 and the one or more additional images correspond to different viewing angles, and are optionally captured by the camera 260, two distinct cameras 260, or other 3D acquisition sensors.
  • a plurality of images including the input image 1102 are captured by a plurality of cameras including the camera 260, and applied to determine the depth map 1104 of the scene using machine learning, computer vision, or a combination thereof.
  • the input image 1102 includes 3D graphical drawing or rendering created by an artist.
  • No pre-alpha value Eb is provided with the input image 1102.
  • Each pixel has a pre-alpha value Eb that is set at a default value (e.g., 1.0) corresponding to full opacity.
  • the input image 1102 is captured by a camera 260, and the pre-alpha value Eb of each pixel is an alpha value captured by the camera 260 for an alpha channel of the input image 1102.
  • the electronic device identifies (1204) a view point and projects (1206) 3D scene points, from back to front, onto the view point, thereby forming a depth map 1104 of a scene.
  • Alpha values are determined (1208) for foreground surfaces in the input image 1102.
  • the electronic device detects (1210) edges associated with object borders (e.g., Figure 11, edge pixels 1108) on the depth map 1104, and applies (1212) a Gaussian blur filter onto the detected edges.
  • depth values on the edges are one of the parameters used to control alpha values. The greater the depth of an object, the more opaque its edge pixels.
  • a range of blurriness is normalized (1214) to a range specified for the alpha values, e.g., [0, 1].
  • the electronic device defines (1216) the alpha values of the detected edges with respective newly derived values.
  • the alpha values are applied (1218) to blend colors to update the input image 1102 for the given view point, and the input image 1102 is rendered with a plurality of updated edge pixels 1116 ( Figure 11).
  • An alpha value of each pixel corresponds to an opacity channel (also called an alpha channel).
  • Each image layer has the opacity channel, governing how transparent or opaque every pixel on the image layer is.
  • a ray originates from a given view point and intersects with different image layers.
  • Alpha values are used to mix colors on respective image layers with which the ray intersects to determine a new color for a pixel projected by the ray.
  • a two-dimensional (2D) array of radiated rays is projected onto surfaces having different depths to form a 2D image.
  • Each radiated ray hits different surfaces from a view point and intersects with the surfaces at different pixels.
  • Each of the different pixels associated with a radiated ray has an associated color (e.g., red, green, blue, or a combination thereof) and an alpha value.
  • the intersected surfaces of a corresponding radiated ray are classified either as foreground (non-occluded or first layer) or background (occluded or second layer onward).
  • an initial alpha value (also called a pre-alpha value Eb) of a foreground layer corresponds to opacity
  • an initial alpha value of each background layer is set by an animator or artist creator to translucency or transparency.
  • Edge pixels define borders or silhouettes of objects on a foreground layer.
  • An alpha map is determined to replace initial alpha values of the edge pixels, which are provided by an input image captured by a camera or set for an image created by an animator or artist.
  • the borders or silhouettes of the foreground layer are alpha blended to the background layers to determine new colors for image anti-aliasing.
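The mixing of colors along a ray that crosses several layers can be modeled as standard back-to-front "over" compositing, sketched below; the back-to-front ordering and the convention that an alpha of 1.0 means full opacity follow the description above, while the sample values are invented.

```python
def composite_ray(samples):
    """Back-to-front 'over' compositing of the surfaces a ray intersects.

    samples: list of (rgb, alpha) tuples ordered from the farthest surface to the
             nearest, where rgb is an (r, g, b) tuple in [0, 1] and alpha is in [0, 1].
    """
    r = g = b = 0.0
    for (sr, sg, sb), a in samples:
        r = a * sr + (1.0 - a) * r
        g = a * sg + (1.0 - a) * g
        b = a * sb + (1.0 - a) * b
    return (r, g, b)

# Example: an opaque background, a translucent middle layer, and a semi-transparent
# foreground edge pixel.
print(composite_ray([((0.1, 0.2, 0.3), 1.0),
                     ((0.8, 0.1, 0.1), 0.4),
                     ((0.9, 0.9, 0.9), 0.6)]))
```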
  • Figures 13A and 13B are an input image 1102 and a depth map 1104, in accordance with some embodiments.
  • the input image 1102 is captured by a camera 260 and corresponds to a real scene.
  • An electronic device obtains 3D spatial information, color information, and alpha values of the scene of interest based on the input image 1102.
  • the input image 1102 is used by itself to extract depth information based on a machine learning model, e.g., MiDaS. MiDaS is established based on ResNet and applied to determine the depth map 1104 from the input image 1102.
  • the input image 1102 creates a 3D effect jointly with one or more additional images (not shown).
  • the input image 1102 and the one or more additional images correspond to different viewing angles, and are optionally captured by the same camera 260, two distinct cameras 260, or other 3D acquisition sensors.
  • a plurality of images including the input image 1102 are captured by a plurality of cameras including the camera 260, and applied to determine the depth map 1104 of the scene using machine learning, computer vision, or a combination thereof.
  • the electronic device measures the depth map 1104 by one or more depth sensors 280 concurrently while capturing the input image 1102 by the camera 260.
  • the depth map 1104 includes a grayscale image in which a grayscale level of each pixel represents a corresponding depth value.
  • a foreground person 1302 is white indicating a small depth value
  • a background alley 1304 is black indicating a large depth value.
  • Silhouettes or object borders of the foreground person 1302 are determined from the depth map 1104, and processed by a Gaussian blur filter to obtain an alpha map 1120 for at least edge pixels 1108 associated with the silhouettes or object borders.
  • the input image 1102 has a first resolution
  • the depth map 1104 has a second resolution.
  • the second resolution is equal to the first resolution.
  • the second resolution is lower than the first resolution
  • each pixel of the depth map 1104 corresponds to a respective set of adjacent pixels on the input image 1102.
  • Figures 14A and 14B are an input image 1102 and an updated image 1400 rendered based on alpha blending, in accordance with some embodiments.
  • a foreground person 1302 includes a left portion 1402 and a right portion 1404. Jaggies are observed at least on edge pixels of the left portion 1402, but not on edge pixels of the right portion 1404.
  • a depth map 1104 corresponds to the input image 1102.
  • the foreground person 1302 (Figures 13A and 13B) is identified in the input image 1102 based on the depth map 1104.
  • a plurality of edge pixels 1108 ( Figure 11) associated with an edge of the foreground object 1106 are identified on the depth map 1104.
  • Each edge pixel corresponds to a respective background pixel 1108’ that is occluded by the foreground person 1302.
  • a respective alpha value is determined for each of the plurality of edge pixels 1108.
  • the input image 1102 is rendered by blending colors of each of the plurality of edge pixels 1108 and the respective background pixel 1108’ based on the respective alpha value.
  • an edge map 1110 ( Figure 11) is generated to include binary values identifying the plurality of edge pixels 1108, and an alpha map 1120 is generated from the edge map 1110 to include alpha values for the plurality of edge pixels 1108.
  • the alpha map 1120 is applied to update the input image 1102 and generate the updated image 1400.
  • the jaggies are smoothed on the left portion 1402 of the updated image 1400.
  • Figure 15 is an example color blending file 1500 associated with an input image 1102, in accordance with some embodiments.
  • the color blending file 1500 includes an OpenGL fragment shader script; the fragment shader is written for OpenGL, which is a cross-language, cross-platform application programming interface for rendering 2D and 3D vector graphics.
  • Alpha blending is applied to blend colors of edge pixels 1108 ( Figure 11) associated with object borders on a foreground layer and colors of background pixels 1108’.
  • Alpha blending does not require extra memory space or blurring of internal textures; it shortens computation time, handles large geometric aliasing, and deals with transparency in the scene.
  • a foreground surface corresponds to layer 0, which is identified by a statement 1502 (i.e., “if(v_layer < 0.5)”), and a passed-in derived alpha value v_Color.a 1504 is applied to a texture of the foreground surface.
  • Figure 16 is a flow diagram of another example image processing method 1600 implemented based on color blending, in accordance with some embodiments.
  • the method 1600 is described as being implemented by an electronic device (e.g., a mobile phone 104C).
  • Method 1600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 16 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic device obtains (1602) an input image 1102 and a depth map 1104 corresponding to the input image 1102 and identifies (1604) a foreground object 1106 in the input image 1102 based on the depth map 1104.
  • the electronic device identifies (1606) a plurality of edge pixels 1108 associated with an edge of the foreground object 1106 on the depth map 1104. Each edge pixel 1108 corresponds to a respective background pixel 1108’ that is occluded by the foreground object 1106.
  • the electronic device determines (1608) a respective alpha value for each of the plurality of edge pixels 1108, and renders (1610) the input image 1102 by blending colors of each of the plurality of edge pixels 1108 and the respective background pixel 1108’ based on the respective alpha value.
  • the plurality of edge pixels 1108 are identified on the depth map 1104 by generating (1612) an edge map 1110 identifying the plurality of edge pixels 1108.
  • the edge map 1110 includes (1614) a binary map, and the binary map has (1) a first set of pixels corresponding to the plurality of edge pixels 1108 and having a first binary value and (2) a second set of pixels corresponding to a plurality of remaining pixels and having a second binary value. Each remaining pixel is distinct from the plurality of edge pixels 1108 in the input image 1102.
  • the electronic device applies (1616) a Gaussian blur filter on the edge map 1110 to generate a blurred edge map 1118 and converts (1618) the blurred edge map 1118 to an alpha map 1120 based on the depth map 1104.
  • the alpha map 1120 has a plurality of alpha values corresponding to a plurality of pixels of the input image 1102.
  • the plurality of alpha values include the respective alpha value of each edge pixel 1108.
  • the plurality of edge pixels 1108 include (1620) the edge of the foreground object 1106 and one or more pixels that are immediately adjacent to the edge of the foreground object 1106. Further, in some embodiments, the one or more pixels include a predefined number (e.g., 3-5) of pixels.
  • respective alpha values of the plurality of edge pixels 1108 are normalized (1624) in a range of [0, 1]. Alternatively, in some embodiments, respective alpha values of the plurality of edge pixels 1108 are normalized (1626) in a range of [0, 255].
  • the electronic device obtains a set of first images 608A ( Figures 6 and 8) that capture respective visual view that is occluded by the foreground object 1106.
  • the respective background pixel 1108’ of each edge pixel 1108 is located on a respective one of the set of first images.
  • the electronic device obtains the set of first images by applying an inpainting model to the input image 1102 to generate the set of first images.
  • the inpainting model includes a neural network.
  • the electronic device, for each first image, identifies one or more context regions adjacent to the respective first image on the input image 1102.
  • the inpainting model is applied to process the one or more context regions to generate the respective first image.
  • the electronic device determines the set of first images by capturing the set of first images with a camera.
  • each edge pixel 1108 and the respective background pixel 1108’ correspond to a location of the input image 1102.
  • each edge pixel 1108 and the respective background pixel 1108’ are aligned (1622) on a line of sight originated from a view point where the input image 1102 is captured by a camera.
  • the electronic device obtains the depth map 1104 by generating (1628) the depth map 1104 from the input image 1102, e.g., using an artificial intelligence or computer vision technique.
  • the electronic device obtains the depth map 1104 by measuring (1630) the depth map 1104 by one or more depth sensors concurrently while capturing the input image 1102 by a camera.
  • the depth map 1104 includes a plurality of depth pixels.
  • Each depth pixel includes a normalized depth value in [0, 1], a color value, a segmentation label, or a silhouette pixel flag value.
  • the respective background pixel 1108’ that is occluded by the foreground object 1106 is artificially synthesized by a graphical system.
  • the respective background pixel 1108’ that is occluded by the foreground object 1106 is captured by a camera.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Abstract

This application is directed to rendering three-dimensional (3D) visual content. An electronic device obtains an input image and an associated input depth map. A foreground object is identified in the input image based on the depth image. The electronic device identifies a plurality of edge pixels associated with an edge of the foreground object on the depth map, e.g., by generating an edge map identifying the edge pixels. Each edge pixel corresponds to a background pixel that is occluded by the foreground object. A respective alpha value is determined for each edge pixel. In some embodiments, a Gaussian blur filter on the edge map is applied to generate a blurred edge map, which is converted to an alpha map based on the depth map. The electronic device renders the input image by blending colors of each edge pixel and the respective background pixel based on the respective alpha value.

Description

Anti-Aliasing of Object Borders with Alpha Blending of Multiple
Segmented 3D Surfaces
RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application No. 63/297,149, entitled “Anti-aliasing of Object Borders with Alpha Blending of Multiple Segmented 3D Surfaces,” filed January 7, 2022, and is a continuation of and claims priority to PCT Application No. PCT/US2022/081273, entitled “Serialization and Deserialization of Layered Depth Images for 3D Rendering,” filed December 9, 2022, all of which are hereby incorporated by reference in their entirety.
TECHNICAL FIELD
[0002] This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable storage media for improving quality of an image, e.g., by applying color blending to border pixels of foreground objects to enhance at least image quality around object borders.
BACKGROUND
[0003] Images are oftentimes reconstructed from imaging details of objects and meshes stored on a pixel level. These imaging details are processed using Open Graphics Library (OpenGL) to render the objects and meshes and generate visual content appreciated by spectators. Anti-aliasing techniques are used to remove jaggy or staircase effects from borders of objects within the images. Examples of the anti-aliasing techniques include supersample anti-aliasing, multisample anti-aliasing, fast approximate anti-aliasing, morphological anti-aliasing, subpixel morphological anti-aliasing, and temporal anti-aliasing. These anti-aliasing techniques oftentimes demand large computation and storage resources and involve complicated operations on fragments, blurriness, motion vectors, and/or transparent textures. It would be beneficial to develop systems and methods for enhancing quality of images or video clips effectively and efficiently to conserve storage and communication resources allocated for rendering visual information. SUMMARY
[0004] Some implementations of this application are directed to anti-aliasing techniques applied in an image rendering system particularly to suppress geometric aliasing on object borders (e.g., remove jaggy or staircase effects associated with borders of objects in images). Such anti-aliasing techniques use three-dimensional (3D) spatial information and color information, which is optionally collected by one or multiple cameras or sensors disposed in a 3D scene or artificially synthesized by a graphical system. Based on 3D spatial and color information of a scene, a ray is radiated from a given view point and intersects with multiple surfaces in the 3D scene. A depth map of the surfaces is projected based on the given view point and used to detect object borders or silhouettes. The borders or silhouettes of the surfaces are further processed to derive an alpha map. The alpha map is used to blend the borders or silhouettes of the surfaces (i.e., blend pixels on a foreground surface and pixels on a background) to derive new colors for anti-aliasing. These anti-aliasing techniques overcome large aliasing (e.g., greater than a threshold dimension) and handle object transparency efficiently, thereby providing a simple and effective solution against geometric aliasing distributed along the object borders without demanding large amount of computation and memory usage.
[0005] In an aspect, an image processing method is implemented by an electronic device. The method includes obtaining an input image and a depth map corresponding to the input image. The method further includes identifying a foreground object in the input image based on the depth map and identifying a plurality of edge pixels associated with an edge of the foreground object on the depth map. Each edge pixel corresponds to a respective background pixel that is occluded by the foreground object. The method further includes determining a respective alpha value for each of the plurality of edge pixels and rendering the input image by blending colors of each of the plurality of edge pixels and the respective background pixel based on the respective alpha value.
[0006] In some embodiments, identifying the plurality of edge pixels on the depth map further includes generating an edge map identifying the plurality of edge pixels. Further, in some embodiments, the edge map includes a binary map. The binary map has (1) a first set of pixels corresponding to the plurality of edge pixels and having a first binary value and (2) a second set of pixels corresponding to a plurality of remaining pixels and having a second binary value. Each remaining pixel is distinct from the plurality of edge pixels in the input image. Additionally, in some embodiments, the method further includes applying a Gaussian blur filter on the edge map to generate a blurred edge map and converting the blurred edge map to an alpha map based on the depth map. The alpha map has a plurality of alpha values corresponding to a plurality of pixels of the input image, and the plurality of alpha values include the respective alpha value of each edge pixel.
[0007] Additionally, some implementations of this application are directed to methods, systems, devices, non-transitory computer-readable media for organizing 3D visual information in layered depth images (LDIs) and a data serialization format (e.g., a YAML or XML file). One or more cameras capture one or more images containing color information and 3D spatial information of the scene. The one or more images are structured to a LDI representation that represents a 3D scene and includes multiple pixels along each line of sight. Particularly, in some embodiments, the LDI representation is constructed from a single input image and provides layered and detailed image information in an outline region of each foreground object in the input image.
[0008] The LDI representation is serialized for storage or transmission according to the data serialization file. The LDI representation can subsequently be deserialized and rendered to visual content based on the data serialization file. Specifically, serialization is a process of translating a data structure or object state of the LDI representation into a format that can be stored (e.g., in a file or memory data buffer), transmitted (e.g., over a computer network), and reconstructed (e.g., in a different computer environment). Deserialization is a process of extracting a data structure or object state from a series of bytes that can be used to create a semantically identical clone of the original data structure or object. Deserialization is the opposite process of serialization. In some embodiments, an image format is used to represent serialized data of the LDI representation to facilitate compression of necessary information of the LDI representation. In some embodiments, the serialized data of the LDI representation is saved or transmitted using a database, such as using a structured query language (SQL). By these means, various embodiments of this application conserve storage and communication resources allocated for rendering the 3D visual information (e.g., reducing the storage resources for storing an example 3D scene from 40Mb to less than 100KB for a two-dimensional (2D) image having a resolution of 1 megapixel).
[0009] In one aspect, an image processing method is implemented at an electronic device. The method includes obtaining an input image, obtaining an input depth map corresponding to the input image, and identifying a foreground object in the input image. The method further includes determining a set of first images that capture respective visual view that is occluded by the foreground object and forming a first image layer including the set of first images of the foreground object. The method further includes associating the input image and the first image layer with a data serialization file for rendering the input image in a 3D format. The data serialization file identifies locations of the set of first images of the first image layer in the input image. In some embodiments, each first image corresponds to a portion of a contour of the foreground object, and provides a visual view that is occluded by an outline region including the portion of the contour of the foreground object.
[0010] In some embodiments, the method further includes determining a set of first depth maps corresponding to the set of first images, and each first depth map corresponds to a respective distinct first image and having a first common location in the input image as the respective distinct first image. Further, in some embodiments, the method includes obtaining the input image, the input depth map, the first image layer including the set of first images and the set of first depth maps, and the data serialization file. The method further includes determining the locations of the set of first images of the first image layer in the input image from the data serialization file and combining the set of first images and the set of first depth maps of the first image layer with the input image according to the determined locations, thereby rendering the input image in the 3D format.
[0011] In some embodiments, the method further includes storing the input image, the input depth image, one or more image layers including the first image layer, and the data serialization file jointly in memory of the electronic device. In some embodiments, the method further includes transferring the input image, the input depth image, the one or more image layers including the first image layer, and the data serialization file to a computer system, such that the computer system renders the input image in the 3D format.
[0012] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0013] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0014] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0016] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0017] Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
[0018] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0019] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0020] Figure 5 illustrates a simplified process of rendering an output image from an input image, in accordance with some embodiments.
[0021] Figure 6 illustrates another example process of serializing layered depth images (LDIs) into a data serialization file, in accordance with some embodiments.
[0022] Figure 7A illustrates example LDIs of an input image, in accordance with some embodiments, and Figure 7B is a perspective view of objects in an input image, in accordance with some embodiments. Figure 7C is an example serialization file, in accordance with some embodiments.
[0023] Figure 8 is a flow diagram of an example process of serializing and deserializing LDIs of an input image, in accordance with some embodiments.
[0024] Figure 9A illustrates LDIs including two image layers, in accordance with some embodiments, and Figure 9B illustrates information of an image segment occluded by a foreground object of an input image, in accordance with some embodiments. Figure 9C is an example serialization file associated with an image layer shown in Figure 9A, in accordance with some embodiments.
[0025] Figure 10 is a flow diagram of an example image processing method, in accordance with some embodiments.

[0026] Figure 11 is a flow diagram explaining a method for updating edge pixels of a foreground object in an input image, in accordance with some embodiments.
[0027] Figure 12 is a flow diagram of an image processing method implemented based on color blending, in accordance with some embodiments.
[0028] Figures 13A and 13B are an input image and a depth image, in accordance with some embodiments.
[0029] Figures 14A and 14B are an input image and an updated image rendered based on alpha blending, in accordance with some embodiments.
[0030] Figure 15 is an example color blending file associated with an input image, in accordance with some embodiments.
[0031] Figure 16 is a flow diagram of another example image processing method implemented based on color blending, in accordance with some embodiments.
[0032] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0033] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0034] Some embodiments of this application are directed to organizing 3D visual information (e.g., depth information and color information) of LDIs in a data serialization format (e.g., a YAML or XML file) for rendering the 3D visual information in a user application (e.g., a photo album application). The 3D visual information is captured directly by camera(s) or estimated from one or more images using computer vision techniques (e.g., information of occluded pixels behind a foreground object is interpolated by an inpainting algorithm). Particularly, the color information is extracted from images captured by one or more imaging devices. In an example, a scene is captured by a single image or multiple images. The 3D visual information is recovered at least partially using artificial intelligence and/or computer vision algorithms, and organized in a structure of LDIs. In some embodiments associated with a virtual reality application, the depth information and color information are synthesized. In an example, depth information is recovered from a single image using MiDaS. Serialized LDIs are optionally stored in a storage medium or transmitted through a network. When the serialized LDIs are deserialized for 3D rendering, the depth and color information is reconstructed from the LDIs, and synthesis of the depth and color information is not repeated. By these means, organizing the 3D visual information in LDIs during data serialization conserves at least the storage and communication resources allocated for rendering the 3D visual information.
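By way of illustration only, the single-image depth recovery mentioned above (e.g., with MiDaS) could be sketched as follows; the torch.hub entry points of the publicly released MiDaS models and the normalization to [0, 1] are assumptions of this sketch, not part of this disclosure.

```python
# Minimal sketch: estimate a normalized depth map from a single image with MiDaS.
# Assumes the public intel-isl/MiDaS torch.hub entry points; illustrative only.
import cv2
import torch

def estimate_depth(image_path: str) -> torch.Tensor:
    midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
    midas.eval()

    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    batch = transform(img)                      # resize + normalize for the network
    with torch.no_grad():
        inverse_depth = midas(batch).squeeze()  # MiDaS predicts relative inverse depth
        inverse_depth = torch.nn.functional.interpolate(
            inverse_depth[None, None], size=img.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    # Normalize to [0, 1] so it matches the depth-pixel convention used in this document.
    d = inverse_depth
    return (d - d.min()) / (d.max() - d.min() + 1e-8)
```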
[0035] Additionally, some embodiments of this application are directed to anti-aliasing techniques applied in an image rendering system particularly to suppress geometric aliasing on object borders. Such anti-aliasing techniques use 3D spatial and color information, which is optionally collected by one or multiple cameras or sensors disposed in a 3D scene or artificially synthesized by a graphical system. Based on 3D spatial and color information of a scene, a ray is radiated from a given view point and intersects with multiple surfaces in the scene. A depth map of the surfaces is projected onto the given view point and used to detect object borders or silhouettes. The borders or silhouettes of the surfaces are further processed to derive an alpha map. The alpha map is used to blend the borders or silhouettes of the surfaces with background surfaces to derive new colors for anti-aliasing. These anti-aliasing techniques overcome large aliasing (e.g., greater than a threshold dimension) and handle object transparency efficiently, thereby providing a simple and effective solution against geometric aliasing distributed along the object borders without demanding a large amount of computation and memory.
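The blend that derives the new border colors can be summarized as a per-pixel convex combination of foreground and background colors weighted by the alpha map. A minimal NumPy sketch, assuming float images in [0, 1] and an alpha map already derived from the detected silhouettes, follows.

```python
# Minimal sketch of the border blend: out = alpha * foreground + (1 - alpha) * background.
# Assumes float images in [0, 1] and an alpha map that is non-zero only near object borders.
import numpy as np

def blend_borders(foreground: np.ndarray,   # HxWx3 colors of the front surface
                  background: np.ndarray,   # HxWx3 colors of the surface behind it
                  alpha: np.ndarray         # HxW alpha map derived from the silhouettes
                  ) -> np.ndarray:
    a = alpha[..., None]                    # broadcast alpha over the color channels
    return a * foreground + (1.0 - a) * background
```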
[0036] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0037] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.

[0038] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0039] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the client device 104 obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0040] In some embodiments, both model training and data processing are implemented locally at each individual client device 104. The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104. The server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements little or no data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
[0041] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the HMD 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses. The device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D. In some embodiments, the display of the HMD 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
[0042] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200 , typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104. Optionally, the client device 104 includes one or more depth sensors 280 configured to determine a depth of an object in a view of a user or a camera 260.
[0043] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices), where in some embodiments, the user application(s) 224 include a photo album application organizing a plurality of images;
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to determine depth information of an input image, determine visual content occluded by foreground objects in the input image, organize the input image, depth information, and occluded visual content in LDIs, and generate a data serialization file associated with the LDIs for creating a 3D effect in the input image;
• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
o Training data 238 for training one or more data processing models 240;
o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 include a depth map model for creating a depth map from an input image and an inpainting model for determining image segments including portions of an outline of a foreground object in the input image based on context regions of a background that are immediately adjacent to the portions of the outline of the foreground object; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104, and in some embodiments, the content data includes LDIs constructed for an input image and a data serialization file.
[0044] In some embodiments, the data processing module 228 is associated with one of the user applications 224 to enhance image or video quality by determining depth information of an input image, determining visual content occluded by foreground objects in the input image, identifying a plurality of edge pixels associated with an edge of the foreground object on the depth map, determining an alpha value for each of the plurality of edge pixels, blending colors of the plurality of edge pixels and corresponding background pixels occluded by the edge pixels based on alpha values, and rendering the input image with the enhanced image or video quality. In some embodiments, the content data and results 242 include an input image (e.g., 1102 in Figure 11), a depth map (e.g., 1104 in Figure 11), background information, an alpha map (e.g., 1120 in Figure 11), and one or more edge maps (e.g., 1110 in Figure 11).
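One simple way to identify edge pixels of a foreground object on the depth map, as described above, is to flag large depth discontinuities; the gradient threshold in the sketch below is an assumed, illustrative value rather than one taken from this disclosure.

```python
# Minimal sketch: flag edge pixels of a foreground object as large depth discontinuities.
# The 0.05 threshold on the normalized depth gradient is an assumed, illustrative value.
import numpy as np

def depth_edge_map(depth: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    dy, dx = np.gradient(depth.astype(np.float32))   # depth normalized to [0, 1]
    magnitude = np.hypot(dx, dy)
    return magnitude > threshold                      # boolean edge map of border pixels
```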
[0045] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.

[0046] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0047] Figure 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 238 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300. The training data source 304 providing the training data 238 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
[0048] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to the type of content data to be processed. The training data 238 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 238. For example, an image pre-processing module 308A is configured to process image training data 238 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 238 to a predefined audio format, e.g., convert each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this process, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.
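A generic, minimal sketch of this train-until-the-loss-criterion-is-satisfied loop is given below; the network, optimizer, loss function, and threshold are illustrative placeholders rather than the disclosed data processing model 240.

```python
# Minimal sketch of the loop run by the model training engine: forward pass, loss
# monitoring, and weight updates until the loss satisfies a loss criterion.
# The loss function, optimizer, and threshold below are illustrative placeholders.
import torch
from torch import nn

def train_until_converged(model: nn.Module, loader, loss_threshold: float = 0.01,
                          max_epochs: int = 100) -> nn.Module:
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for inputs, ground_truth in loader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), ground_truth)  # compare output with ground truth
            loss.backward()                                # adjust weights to reduce the loss
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / max(len(loader), 1) < loss_threshold:  # loss criterion satisfied
            break
    return model
```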
[0049] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0050] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed and converted to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
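For illustration, the two pre-processing examples above (cropping an image to a predefined size and converting an audio clip to the frequency domain with a Fourier transform) might be sketched as follows; the crop size is an assumption.

```python
# Minimal sketch of the pre-processing examples above: crop an image to a predefined
# size and convert an audio clip to the frequency domain with a Fourier transform.
import numpy as np

def crop_center(image: np.ndarray, out_h: int = 224, out_w: int = 224) -> np.ndarray:
    h, w = image.shape[:2]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return image[top:top + out_h, left:left + out_w]

def to_frequency_domain(audio: np.ndarray) -> np.ndarray:
    return np.abs(np.fft.rfft(audio))   # magnitude spectrum of the audio clip
```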
[0051] Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s). As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. For example, the propagation function applies a non-linear activation function to a linear weighted combination of the node input(s).
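A minimal numerical sketch of one node's propagation function (a non-linear activation applied to a linear weighted combination of the inputs, plus the bias term b discussed below) is shown here; the sigmoid activation is one of the options named in this description and is chosen only for illustration.

```python
# Minimal sketch of one node 420: a non-linear activation applied to a linear
# weighted combination of the node inputs (weights w1..w4 plus a bias term b).
import numpy as np

def node_output(x: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    z = float(np.dot(w, x) + b)       # linear weighted combination of the inputs
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation, chosen for illustration

# Example: four inputs combined with weights w1..w4.
print(node_output(np.array([0.2, 0.5, 0.1, 0.9]), np.array([0.4, -0.3, 0.8, 0.1])))
```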
[0052] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the layer(s) may include a single layer acting as both an input layer and an output layer. Optionally, the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0053] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, inpainting, or synthesis.

[0054] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. For example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0055] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
[0056] Figure 5 illustrates a simplified process 500 of rendering an output image 502 from an input image 504, in accordance with some embodiments. An electronic device (e.g., a mobile phone 104C) executes a user application 224 (e.g., a photo album application) to perform a 3D rendering function on the input image 504, e.g., jointly with a data processing module 228. The input image 504 is optionally captured by a camera 260 of the electronic device or transferred to the electronic device. The 3D rendering function of the user application 224 determines additional depth information and color information of the input image 504 for enabling a 3D effect of the input image 504. In some embodiments, the depth information of the input image 504 is estimated from one or more images (which optionally includes the input image 504) using a computer vision processing method and/or a machine learning method 506. Alternatively, in some embodiments, the depth information of the input image 504 is captured directly by one or more depth sensors 280. In some embodiments, the color information of the input image 504 is estimated from one or more images (which optionally includes the input image 504) captured by the camera 260.
[0057] In some embodiments, the depth information and color information of the input image 504 are stored in layered depth images (LDIs) 508. The LDIs 508 are optionally extracted locally and rendered into the output image 502. The LDIs 508 may be transferred to a distinct electronic device for further processing, e.g., via a server 102. Alternatively, in some embodiments, the depth information and color information of the input image 504 are organized, stored, and transferred based on a set of meshes 510 of objects.
[0058] In some embodiments, the user application 224 automatically generates the LDIs 508 for each image 502 or for an image 502 that satisfies a 3D conversion criterion. Alternatively, in some embodiments, the user application 224 includes a user actionable affordance item configured to receive a user action. In response to the user action, the user application 224 automatically generates the LDIs 508 for the input image 504 and stores the input image 504 with the LDIs 508. Additionally, in some embodiments, the user application 224 is configured to display the output image 502 with the 3D effect on a user interface based on the depth information and color information of the input image 504. In some situations, the user application 224 receives one or more interactive inputs 512 (e.g., an input selecting a foreground object 514, such as the bird in the input image 504) on the user interface, and enables the output image 502 (e.g., limits the 3D effect to the selected bird) based on the one or more interactive inputs 512.

[0059] Figure 6 illustrates another example process 600 of serializing layered depth images (LDIs) 508 into a data serialization file 660, in accordance with some embodiments. The process 600 is implemented at an electronic device (e.g., a mobile phone 104C, an HMD 104D) to generate the LDIs 508 and associated data serialization file 660 from the input image 504. The LDIs 508 can be organized according to information stored in the data serialization file 660 to render a 3D effect for a two-dimensional (2D) input image 504. The electronic device obtains the input image 504 and an input depth map 604 corresponding to the input image 504. The input image 504 and input depth map 604 form a top image layer 606 of the LDIs 508.
[0060] In some embodiments, the input depth map 604 is generated from the input image 504, e.g., using a machine learning method and/or a computer vision processing method. For example, MiDaS is a machine learning model established based on ResNet and applied to estimate the input depth map 604. Alternatively, in some embodiments, the input depth map 604 is measured by one or more depth sensors 280 concurrently while the input image 504 is captured by a camera 260. In some embodiments, the input depth map 604 includes a plurality of depth pixels, and each depth pixel includes a depth value and one or more of: a color value, a segmentation label, and a silhouette pixel flag value. In an example, the depth value includes a normalized depth value in a range of [0, 1], and each segmentation label has one of a first value indicating that the respective depth pixel corresponds to a foreground object 514 on the corresponding input image 504 and a second value indicating that the respective depth pixel corresponds to a pixel occluded by the foreground object 514.

[0061] The electronic device identifies one or more foreground objects 514 (e.g., the bird in Figure 5) in the input image 504. In some embodiments, the one or more foreground objects 514 include a single foreground object 514. In some embodiments, the one or more foreground objects 514 include a plurality of foreground objects 514-1, 514-2, ..., and 514-N, where N is equal to 2 or above. For a first foreground object 514-1, the electronic device generates a set of first images 608A (e.g., including 608A-1, ..., and 608A-M, where M is a positive integer). Each first image 608A-i captures a respective visual view that is occluded by the first foreground object 514-1. A first image layer 608 is formed to include the set of first images 608A of the first foreground object 514-1. The input image 504 and the first image layer 608 are associated with a data serialization file 660 for rendering the input image 504 in a 3D format. The data serialization file 660 identifies locations of the set of first images 608A of the first image layer 608 in the input image 504.

[0062] In some embodiments, the electronic device determines a set of first depth maps 608B-1, ..., and 608B-M corresponding to the set of first images 608A-1, ..., and 608A-M, respectively. Each first depth map 608B-i corresponds to a respective distinct first image 608A-i, and shares a first common location in the input image 504 with the respective distinct first image 608A-i. The first image layer 608 further includes the set of first depth maps 608B. The corresponding first common location is stored in the data serialization file 660. Further, in some embodiments, during serialization, the input image 504, input depth map 604, and first image layer 608 are associated and stored with the data serialization file 660 for rendering the input image 504 in the 3D format. For example, the data serialization file 660 stores locations of the set of first images 608A of the first image layer 608 in the input image 504. In some situations, the electronic device extracts the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and the set of first depth maps 608B, and data serialization file 660.
During deserialization, the electronic device determines the locations of the set of first images 608A of the first image layer 608 in the input image 504 from the data serialization file 660, combines the set of first images 608A and the set of first depth maps 608B of the first image layer 608 with the input image 504 according to the determined locations, and renders the input image 504 in the 3D format.
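As an illustration of this deserialization step, the sketch below reads the bounding-box locations from the data serialization file and places each first image and first depth map back into a layer aligned with the input image; the YAML key names are assumptions modeled on the example of Figure 7C.

```python
# Minimal sketch of deserialization: place each occluded segment (color + depth) back
# at the bounding-box location recorded in the data serialization file.
# The YAML keys ("labels", "x", "y", "width", "height") are illustrative assumptions.
import numpy as np
import yaml

def composite_layers(input_image: np.ndarray, input_depth: np.ndarray,
                     first_images: dict, first_depths: dict, yaml_path: str):
    with open(yaml_path, "r") as f:
        meta = yaml.safe_load(f)
    layers = [(input_image.copy(), input_depth.copy())]   # top image layer
    for label, box in meta["labels"].items():
        color_layer = np.zeros_like(input_image)
        depth_layer = np.ones_like(input_depth)           # far depth everywhere else
        y, x, h, w = box["y"], box["x"], box["height"], box["width"]
        color_layer[y:y + h, x:x + w] = first_images[label]
        depth_layer[y:y + h, x:x + w] = first_depths[label]
        layers.append((color_layer, depth_layer))
    return layers   # hidden layers behind the top image layer, ready for 3D rendering
```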
[0063] Alternatively, in some situations, the electronic device (e.g., a first electronic device) provides the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and the set of first depth maps 608B, and data serialization file 660 to a second electronic device. The second electronic device determines the locations of the set of first images 608A of the first image layer 608 in the input image 504 from the data serialization file 660, combines the set of first images 608A and the set of first depth maps 608B of the first image layer 608 with the input image 504 according to the determined locations, and renders the input image 504 in the 3D format on the second electronic device.

[0064] It is noted that, in some situations, when the 3D effect is enabled for the input image 504, visual views occluded by an outline region of the first foreground object 514-1 need to be recovered, while what is occluded by a body of the first foreground object 514-1 can never be seen and does not need to be recovered. Thus, the set of first images 608A capture visual views related to the outline region of the first foreground object 514-1. The outline region of the first foreground object 514-1 includes an outline, edge, or contour of the first foreground object 514-1 and portions of the first foreground object 514-1 that are immediately adjacent to the outline, edge, or contour. In an example, the portions of the first foreground object 514-1 include a number of pixels (e.g., 100 pixels) immediately adjacent to the outline.
[0065] Additionally, in some embodiments, for the first foreground object 514-1, the electronic device further identifies one or more layer objects 610 (including 610-1, ..., 610-K, where K is a positive integer) on the first image layer 608. For a first layer object 610-1, a set of second images 612A capture respective visual views that are occluded by the first layer object 610-1, and form a second image layer 612. The set of second images 612A include one or more second images 612A-1, ..., 612A-L, where L is a positive integer. The second image layer 612 is associated with the data serialization file 660 for rendering the input image 504 in the 3D format. The data serialization file 660 further identifies locations of the set of second images 612A of the second image layer 612 in the input image 504. Further, in some embodiments, the electronic device determines a set of second depth maps 612B-1, ..., and 612B-L corresponding to the set of second images 612A-1, ..., and 612A-L, respectively.
Each second depth map 612B-i corresponds to a respective distinct second image 612A-i, and shares a second common location in the input image 504 with the respective distinct second image 612A-i. The second image layer 612 further includes the set of second depth maps 612B.

[0066] In some embodiments, during serialization, the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and the set of first depth maps 608B, second image layer 612 including the set of second images 612A and the set of second depth maps 612B, and data serialization file 660 are grouped together for rendering the input image 504 in the 3D format. Specifically, in some situations, during deserialization, the electronic device extracts the input image 504, input depth map 604, first image layer 608 including the set of first images 608A and/or the set of first depth maps 608B, second image layer 612 including the set of second images 612A and/or the set of second depth maps 612B, and data serialization file 660. The electronic device determines the locations of the set of first images 608A and the set of second images 612A in the input image 504 from the data serialization file 660, combines the sets of first images 608A, first depth maps 608B, second images 612A, and second depth maps 612B with the input image 504 according to the determined locations, and renders the input image 504 in the 3D format.
[0067] In some embodiments not shown, the electronic device further identifies a second layer object on the second image layer, and determines a set of third images that capture respective visual views occluded by the second layer object. The electronic device forms a third image layer including the set of third images of the second layer object, and associates the third image layer with the data serialization file 660 for rendering the input image 504 in the 3D format. The data serialization file 660 further identifies locations of the set of third images of the third image layer in the input image 504.
[0068] In some embodiments, in addition to the first foreground object 514-1, the input image 504 further includes one or more remaining foreground objects 514-2, ..., and/or 514-N (where N is equal to 2 or above), which are distinct from the first foreground object 514-1. The electronic device determines a set of remaining images that capture respective visual views occluded by each remaining foreground object 514-2, ..., and/or 514-N. The first image layer 608 is formed to include the set of remaining images of each remaining foreground object 514-2, ..., and/or 514-N. The data serialization file 660 further identifies locations of the set of remaining images of each remaining foreground object 514-2, ..., and/or 514-N of the first image layer 608 in the input image 504.
[0069] Figure 7A illustrates example LDIs 508 of an input image 504, in accordance with some embodiments, and Figure 7B is a perspective view 750 of objects in an input image 504, in accordance with some embodiments. Figure 7C is an example YAML file 780 that is outputted as a serialization file 660, in accordance with some embodiments. The LDIs 508 include an input image 504, an input depth map 604 optionally captured by a depth sensor 280 or obtained from the input image 504, and a first image layer 608 including a set of first images 608A and a set of corresponding first depth maps 608B. The input image 504 and input depth map 604 form a top image layer 606. A data serialization file 660 identifies locations of the set of first images 608A of the first image layer 608 in the input image 504. The set of first depth maps 608B of the first image layer 608 have the same locations in the input image 504 as the set of first images 608A.
[0070] An electronic device constructs and serializes the LDIs 508 for rendering a 3D scene of interest. During this process, the electronic device obtains depth information and color information of the scene of interest. A plurality of rays originate from a camera view point, extend into the 3D scene, and intersect with one or more surfaces (e.g., a surface of an object located in the 3D scene). A surface is optionally divided into multiple segments having associated labels. A non-occluded surface is classified as a foreground surface captured in the input image 504, and an occluded surface behind the foreground surface is classified with a different label. The occluded surface is recorded in an image layer hidden behind the input image 504. In some embodiments, the electronic device detects a surface border of a surface, and identifies a set of pixels associated with the surface border as silhouette pixels.

[0071] A surface in the 3D scene includes a plurality of depth pixels, and each depth pixel has one or more of: a depth value, a color value, a segmentation label, and a silhouette pixel flag value. In an example, the depth value includes a normalized depth value in a range of [0, 1], and the color value is a combination of red, green, blue, and alpha values. The segmentation label is one of a first value indicating that the respective depth pixel corresponds to a foreground surface of a foreground object on the corresponding input image and a second value indicating that the respective depth pixel corresponds to a pixel occluded by the foreground surface. The silhouette pixel flag value includes a Boolean value (e.g., true or false) indicating whether the corresponding depth pixel belongs to a surface border. A set of depth pixels of a surface are gathered as first depth maps 608B and associated with an image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels). The image segment is written into a data serialization file 660, which is stored in a local memory or transmitted to another electronic device. In some embodiments, the data serialization file 660 is extracted and deserialized for rendering a 3D representation of the scene.
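The depth-pixel attributes listed in the preceding paragraph can be represented by a small record type; the following dataclass is an illustrative sketch with assumed field names, not a structure defined by this disclosure.

```python
# Minimal sketch of one LDI depth pixel as described above: a normalized depth value
# plus optional RGBA color, segmentation label, and silhouette flag. Names are illustrative.
from dataclasses import dataclass
from typing import Optional, Tuple

FOREGROUND = 0   # segmentation label: pixel belongs to a non-occluded foreground surface
OCCLUDED = 1     # segmentation label: pixel is occluded by the foreground surface

@dataclass
class DepthPixel:
    depth: float                                        # normalized to [0, 1]
    color: Optional[Tuple[int, int, int, int]] = None   # red, green, blue, alpha
    label: Optional[int] = None                         # FOREGROUND or OCCLUDED
    is_silhouette: bool = False                         # True if the pixel lies on a surface border
```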
[0072] In some embodiments, each image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels) is bounded by a bounding box. Information of the bounding box is determined with respect to a viewing frame of the input image 504. The information of the bounding box is stored with one or more information items of the depth pixels of the image segment in the data serialization file 660. In some embodiments, the data serialization file 660 includes a Yet Another Markup Language (YAML) file (e.g., in Figure 7C) or an Extensible Markup Language (XML) file.
[0073] Referring to Figures 7A and 7B, the scene includes five objects having five front planes located at five distinct depths. The five objects include a foreground object 514 that is located in front of the other four planes from a camera view point 702, i.e., a first depth of the foreground object 514 is smaller than each other depth of the remaining objects 704-710. The foreground object 514 occludes a subset of each of the remaining objects 704, 706, 708, and 710. The occluded portions of the remaining objects 704, 706, 708, and 710 include a set of first images 608A-1, 608A-2, 608A-3, and 608A-4, which correspond to a set of first depth maps 608B-1, 608B-2, 608B-3, and 608B-4, respectively. The set of first images 608A-1, 608A-2, 608A-3, and 608A-4 and the set of first depth maps 608B-1, 608B-2, 608B-3, and 608B-4 form a first image layer 608. The data serialization file 660 includes information 712 of the input image 504, a total number of labels 714, image labels 716, and information 718 of bounding boxes of images of the first image layer 608. Referring to Figure 7C, in this example, the total number of labels corresponds to a number of first images 608A occluded by the foreground object 514, e.g., 4 first images 608A. The image labels include a serial number of each image in the set of first images 608A. The 4 first images 608A are defined by bounding boxes located at designated locations 718 of the input image 504.
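For illustration, a data serialization file of the kind shown in Figure 7C could be generated as follows; the exact key names and the use of PyYAML are assumptions of this sketch, while the kinds of fields (input image information, total number of labels, image labels, and bounding-box locations) follow the description above.

```python
# Minimal sketch of writing the data serialization file: image information, the total
# number of labels, and one bounding box per occluded first image. Key names are assumed.
import yaml

def write_serialization_file(path: str, image_width: int, image_height: int,
                             boxes: dict) -> None:
    meta = {
        "input_image": {"width": image_width, "height": image_height},
        "num_labels": len(boxes),
        "labels": {label: {"x": b[0], "y": b[1], "width": b[2], "height": b[3]}
                   for label, b in boxes.items()},
    }
    with open(path, "w") as f:
        yaml.safe_dump(meta, f, sort_keys=False)

# Example with four occluded first images, as in Figures 7A-7C (box values are made up).
write_serialization_file("ldi.yaml", 1920, 1080,
                         {1: (400, 120, 160, 90), 2: (700, 300, 120, 80),
                          3: (520, 640, 200, 110), 4: (900, 500, 140, 70)})
```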
[0074] Figure 8 is a flow diagram of an example process 800 of serializing and deserializing LDIs 508 of an input image 504, in accordance with some embodiments. An electronic device obtains a single input image 504 and renders an output image 502 that has a 3D effect of the input image 504. For rendering the output image 502, LDIs 508 are constructed and serialized with a data serialization file 660. Specifically, after obtaining the input image 504, the electronic device determines (802) an input depth map 604 including 3D information of the input image 504. In some embodiments, the input depth map 604 is generated from the input image 504 using an artificial intelligence or computer vision technique. For example, MiDaS is a machine learning model established based on ResNet and applied to estimate the input depth map 604. In some embodiments, the electronic device processes (804) the input depth map 604 to identify a foreground object 514 that is not occluded by any other object and detects (806) a surface border 814 of a surface (i.e., an outline, edge, or contour) of the foreground object 514. The surface border 814 includes a plurality of silhouette pixels (e.g., 922 in Figure 9B) identified (806) by silhouette pixel flag values.
[0075] The foreground object 514 occludes other objects, surfaces, or views in the input image 504. The electronic device identifies (808) a context region of each set of successive silhouette pixels 922 in a background immediately adjacent to the foreground object 514 of the input image 504, and applies the context region to determine (810) a corresponding first image 608A-i occluded by the respective set of successive silhouette pixels 922. Each foreground object 514 includes a plurality of sets of successive silhouette pixels 922, and the set of first images 608A is determined (812) for the plurality of sets of successive silhouette pixels 922 based on corresponding context regions that are located in the background of the input image 504 and immediately adjacent to the foreground object 514. For example, an inpainting model includes a neural network, and is applied to process the input image 504 and context regions to determine the set of first images 608A. Alternatively, in another example, one or more of the first images 608A are captured by a camera directly. Each of the set of first images 608A is associated with a respective edge region of the foreground object 514. By these means, each pixel on a foreground object 514 of the input image is associated with a non-occluded foreground surface, while each of the first images 608A is occluded by the foreground object 514 and assigned a different label.
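As a simple stand-in for the neural inpainting model described above, the sketch below fills the region hidden behind the foreground object from its immediately adjacent context region using OpenCV's classical inpainting; this substitutes a conventional algorithm for the disclosed neural network model and is illustrative only.

```python
# Minimal sketch: recover an occluded patch by inpainting the area covered by the
# foreground object from its surrounding context region. OpenCV's classical inpainting
# stands in here for the neural inpainting model described in this document.
import cv2
import numpy as np

def inpaint_occluded(input_image: np.ndarray, foreground_mask: np.ndarray) -> np.ndarray:
    # input_image: 8-bit color image; foreground_mask: non-zero where the object hides the view.
    mask = (foreground_mask > 0).astype(np.uint8) * 255
    # Context pixels immediately adjacent to the mask drive the fill (radius of 5 pixels).
    return cv2.inpaint(input_image, mask, 5, cv2.INPAINT_TELEA)
```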
[0076] The electronic device further determines a depth pixel at each intersected surface of a ray radiated to the input image's pixel location. Each depth pixel has one or more of a depth value, a color value, a segmentation label, and a silhouette pixel flag value. A set of depth pixels of a surface are gathered to form first depth maps 608B and associated with an image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels). During LDI serialization, a location of each image segment is written (816) into a data serialization file 660, which is stored in a local memory or transmitted to another electronic device. In some embodiments, the data serialization file 660 is extracted and deserialized for rendering a 3D representation of the scene. In some embodiments, each image segment (e.g., first images 608A indicated as occluded by corresponding segmentation labels) is bounded by a bounding box. Information of the bounding box is determined with respect to a viewing frame of the input image 504. The information of the bounding box is stored with one or more information items of the depth pixels of the image segment in the data serialization file 660. In some embodiments, the data serialization file 660 includes an XML or YAML file.
[0077] The input image 504, input depth map 604, the set of first images 608A (e.g., 608A-1 to 608A-4), and the set of first depth maps 608B (e.g., 608B-1 to 608B-4) are collectively called layered depth images (LDIs) 508. In some embodiments, a top image layer 606 includes the input image 504 and input depth map 604. The first image layer 608 includes the set of first images 608A and the set of first depth maps 608B, and each first image 608A-i is paired with a respective first depth map 608B-i. The LDIs 508 include the top image layer 606, the first image layer 608, and one or more additional image layers (if any). A serialization operation 816 is applied to the LDIs 508 to generate the data serialization file 660 and prepare the LDIs 508 for storage or transmission.
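One possible in-memory arrangement of the LDIs 508 before serialization is sketched below; the container and field names are illustrative only and do not correspond to any structure mandated by the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class ImageSegment:
    """One first image 608A-i paired with its first depth map 608B-i."""
    color: np.ndarray                # H x W x 3 patch occluded by the foreground object
    depth: np.ndarray                # H x W depth patch, normalized to [0, 1]
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in input-image coordinates
    label: int                       # segmentation label of the segment

@dataclass
class LayeredDepthImages:
    """Top image layer 606 plus one or more occluded layers (608, 612, ...)."""
    input_image: np.ndarray
    input_depth: np.ndarray
    layers: List[List[ImageSegment]] = field(default_factory=list)
```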
[0078] A deserialization operation 818 is opposite to the serialization operation 816 and extracts the LDIs 508 for rendering an output image 502 including a 3D effect on the input image 504. The LDIs 508 include the input image 504, input depth map 604, set of first images 608A, and set of first depth maps 608B, and are arranged according to the data serialization file 660. In some embodiments, a 3D mesh model is constructed (820) based on the LDIs 508. The 3D mesh model is applied jointly with camera parameters 822 (e.g., a camera location, a camera orientation, a field of view) to render (824) the output image 502 including the 3D effect on the input image 504. In some embodiments, the deserialization operation 818 is implemented at a user application (e.g., a photo album application) of the same electronic device that serializes the input image 504. Alternatively, in some embodiments, the deserialization operation 818 is implemented at a user application of a distinct electronic device, which receives the LDIs 508 and the data serialization file 660 from the electronic device that serializes the input image 504 (e.g., by way of a server 102).

[0079] Figure 9A illustrates LDIs 508 including two image layers 606 and 608, in accordance with some embodiments, and Figure 9B illustrates information 920 of an image segment occluded by a foreground object 514 of an input image 504, in accordance with some embodiments. Figure 9C is an example serialization file 660 associated with an image layer 608 shown in Figure 9A, in accordance with some embodiments. An input image 504 has a foreground object 514 (e.g., a person). An outline 814 of the foreground object 514 is identified in the input image 504, and corresponds to a plurality of silhouette pixels 922 of the foreground object 514. The outline 814 of the foreground object 514 is expanded to an outline region, e.g., by identifying a number of pixels immediately adjacent to the outline. The outline region of the foreground object 514 is divided into a plurality of image segments that are occluded by the foreground object 514 in the input image 504. Each image segment corresponds to a respective set of silhouette pixels 922. For each image segment, the electronic device identifies a context region of the respective set of silhouette pixels in a background immediately adjacent to the foreground object 514 of the input image 504, and applies the context region to determine information 920 of the respective image segment.

[0080] In some embodiments, the information of the plurality of image segments obtained from the outline region of the foreground object 514 includes a set of first images 608A, a set of first depth maps 608B, a set of segmentation label maps 608C, and a set of silhouette pixel maps 608D. The information of the plurality of image segments of the foreground object 514 is further stored with the data serialization file 660. Referring to Figure 9A, in an example, 7 image segments are divided from an outline region of the foreground object 514 of the input image 504. Information of these 7 image segments includes 28 images or maps, i.e., 7 first images 608A, 7 first depth maps 608B, 7 segmentation label maps 608C, and 7 silhouette pixel maps 608D. More specifically, referring to Figure 9B, the fourth image segment of the outline region of the foreground object 514 has a first image 608A-4, a first depth map 608B-4, a segmentation label map 608C-4, and a silhouette pixel map 608D-4.
In this example, the 28 images or maps are stored jointly with the input image 504, the input depth map 604, and a YAML file 660.

[0081] Specifically, in this example, the input image 504 has an image resolution of 4000x6000 and a file size of 3MB, and the input depth map 604 and the 28 images or maps of the first image layer 608 have a total file size of 0.6MB. A size of the data serialization file 660 is negligible compared with the total size of the input image 504, input depth map 604, and 28 images or maps (e.g., 3.6MB). In contrast, in another example, a 3D mesh model enables the 3D effect of the input image 504 and has a resolution of 480x640. The 3D mesh model has a size of at least 15.3MB, including 3MB for the input image 504 and 12.3MB for data of meshes. The output image 502 is rendered based on up-sampling of the meshes with the input image 504 in a 3D rendering pipeline. The LDIs 508 and data serialization file 660 therefore reduce the storage required for the 3D information by approximately 20.5 times (12.3MB of mesh data versus 0.6MB for the first image layer 608). Particularly, the serialization operation 816 requires a storage of 0.6MB to store the first image layer 608, while the 3D mesh model stores low-resolution 3D meshes and requires additional up-sampling operations. Stated another way, as the LDIs 508 are stored in memory of an electronic device or transferred to another electronic device, 3D visual information is managed efficiently to conserve storage and communication resources allocated for rendering the 3D effect of the input image 504.
[0082] Referring to Figure 9C, the data serialization file 660 includes information 912 of the input image 504, a total number of labels 914, image labels 916, and information 918 of bounding boxes of images of the first image layer 608. In this example, the total number of labels corresponds to a number of first images 608A occluded by a foreground object 514, e.g., 7 first images 608A. The image labels include a serial number of each image in the set of first images 608A. The 7 first images 608A are defined by bounding boxes located at designated locations 918 of the input image 504. A size of the data serialization file 660 is negligible compared with the sizes of the images or maps in the LDIs 508.
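A hedged sketch of how the fields 912-918 could be written into a YAML serialization file with PyYAML is shown below; the key names and the ImageSegment container from the earlier sketch are assumptions, not the format actually used by the data serialization file 660.

```python
import yaml  # PyYAML

def write_serialization_file(path: str, image_shape, segments) -> None:
    """Write image info (912), label count (914), labels (916), and bounding boxes (918)."""
    doc = {
        "image_info": {"height": int(image_shape[0]), "width": int(image_shape[1])},
        "num_labels": len(segments),
        "labels": [seg.label for seg in segments],
        "bounding_boxes": {seg.label: list(seg.bbox) for seg in segments},
    }
    with open(path, "w") as f:
        yaml.safe_dump(doc, f, sort_keys=False)
```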
[0083] Under many circumstances, a capability of serializing and deserializing the LDIs 508 is important, because it allows the input image 504 to be preprocessed up to a step implemented immediately prior to 3D rendering. This can save preprocessing time, particularly when the preprocessing steps take a considerable amount of time before 3D rendering. Additionally, the storage space or transmission bandwidth required by the serialized LDIs 508 is compact, particularly when the LDIs 508 are compared with 3D rendering meshes and with videos encoded in advanced video compression formats (e.g., MP4, H.264). Deserialization of the LDIs 508 is fast, and the deserialized LDIs 508 can readily be converted to meshes for OpenGL rendering in hardware implementing 3D view generation.
[0084] Figure 10 is a flow diagram of an example image processing method 1000, in accordance with some embodiments. For convenience, the method 1000 is described as being implemented by an electronic device (e.g., a mobile phone 104C). Method 1000 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 10 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1000 may be combined and/or the order of some operations may be changed.
[0085] In some embodiments, the electronic device obtains (1002) an input image 504 and obtains (1004) an input depth map 604 corresponding to the input image 504. The electronic device identifies (1006) a foreground object 514 in the input image 504, and determines (1008) a set of first images that capture respective visual view that is occluded by the foreground object 514. A first image layer 608 is formed (1010) to include the set of first images 608A of the foreground object 514. The electronic device associates (1012) the input image 504 and the first image layer 608 with a data serialization file 660 for rendering the input image 504 in a 3D format. The data serialization file 660 identifies (1014) locations of the set of first images 608A of the first image layer 608 in the input image 504. In some embodiments, each first image 608A-i corresponds to a portion of a contour of the foreground object 514.
[0086] In some embodiments, the input depth map 604 is generated from the input image 504, e.g., using an artificial intelligence or computer vision technique. For example, MiDaS is a machine learning model established based on ResNet and applied to estimate the input depth map 604. Alternatively, in some embodiments, the input depth map 604 is measured by one or more depth sensors 280 concurrently while capturing the input image 504 by a camera. The electronic device optionally includes the one or more depth sensors 280 and the camera 260. Further, in some embodiments, the input depth map 604 includes a plurality of depth pixels. Each depth pixel includes a normalized depth value in a range of [0, 1], a color value, a segmentation label, and a silhouette pixel flag value. For example, the segmentation label is one of a first value indicating that the respective depth pixel corresponds to the foreground object 514 on the corresponding input image 504 and a second value indicating that the respective depth pixel corresponds to a pixel on the first image layer 608 that is occluded by the foreground object 514.
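As an illustration of the per-pixel record described in this paragraph, the four attributes of a depth pixel could be packed into a structured NumPy dtype as sketched below; the field names, dtypes, and example resolution are assumptions.

```python
import numpy as np

# Each depth pixel carries a normalized depth, a color, a segmentation label,
# and a silhouette pixel flag; names and dtypes here are illustrative only.
depth_pixel_dtype = np.dtype([
    ("depth", np.float32),      # normalized depth value in [0, 1]
    ("color", np.uint8, (3,)),  # RGB color value
    ("label", np.uint8),        # e.g., first value = foreground object, second value = occluded layer
    ("silhouette", np.bool_),   # silhouette pixel flag value
])

# A full map of such records for an example 480 x 640 depth map:
depth_pixels = np.zeros((480, 640), dtype=depth_pixel_dtype)
```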
[0087] In some embodiments, the electronic device determines a set of first depth maps 608B corresponding to the set of first images 608A. Each first depth map 608B-i corresponds to, and has a first common location in the input image 504 as, a distinct first image. Further, in some embodiments, the electronic device obtains the input image 504, the input depth map 604, the first image layer 608 including the set of first images 608A and the set of first depth maps 608B, and the data serialization file 660. The electronic device determines the locations of the set of first images 608A of the first image layer 608 in the input image 504 from the data serialization file 660, and combines the set of first images 608A and the set of first depth maps 608B of the first image layer 608 with the input image 504 according to the locations of the set of first images 608A of the first image layer 608 in the input image 504, thereby rendering the input image 504 in the 3D format.
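The combining step described above can be sketched as reading the stored bounding boxes back from the serialization file and placing each first image at its recorded location. This sketch only illustrates recovering locations and assembling layers, not the full 3D rendering, and it reuses the hypothetical key names and ImageSegment container from the earlier sketches.

```python
import numpy as np
import yaml  # PyYAML

def read_segment_locations(path: str) -> dict:
    """Recover the bounding box of each first image 608A-i from the serialization file."""
    with open(path) as f:
        doc = yaml.safe_load(f)
    return doc["bounding_boxes"]  # {label: [x, y, w, h]}

def place_segments(input_image: np.ndarray, segments, boxes: dict) -> np.ndarray:
    """Assemble first images with the input image at their recorded locations."""
    out = input_image.copy()
    for seg in segments:
        x, y, w, h = boxes[seg.label]
        out[y:y + h, x:x + w] = seg.color
    return out
```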
[0088] In some embodiments, the electronic device identifies a layer object 610 on the first image layer 608 for each foreground object 514, and determines a set of second images 612A that capture respective visual view that is occluded by the layer object 610. The electronic device forms a second image layer 612 including the set of second images 612A of the layer object 610, and associates the second image layer 612 with the data serialization file 660 for rendering the input image 504 in the 3D format. The data serialization file 660 further identifies locations of the set of second images 612A of the second image layer 612 in the input image 504.
[0089] Additionally, in some embodiments, the electronic device determines a set of second depth maps 612B corresponding to the set of second images 612A. Each second depth map 612B-i corresponds to, and has a second common location in the input image 504 as, a distinct second image 612A-i. The second image layer 612 further includes the set of second depth maps 612B.
[0090] In some embodiments, the foreground object 514 includes a first foreground object 514-1. The electronic device identifies one or more remaining foreground objects 514-2, ..., 514-N distinct from the first foreground object 514-1 in the input image 504, and determines a set of remaining images that capture respective visual view that is occluded by each remaining foreground object 514. The first image layer 608 is formed to include the set of remaining images of each remaining foreground object 514, and the data serialization file 660 further identifies locations of the set of remaining images of each remaining foreground object 514 of the first image layer 608 in the input image 504.

[0091] In some embodiments, the electronic device determines the set of first images 608A that capture respective visual view that is occluded by the foreground object 514 by applying an inpainting model to the input image 504 to determine the set of first images 608A, the inpainting model including a neural network. Further, in some embodiments, for each first image 608A-i, the electronic device identifies one or more context regions adjacent to the respective first image 608A-i on the input image 504. The inpainting model is applied to the one or more context regions to generate the respective first image 608A-i.
[0092] In some embodiments, the electronic device determines the set of first images 608A that capture respective visual view that is occluded by the foreground object 514 by capturing the set of first images 608A by a camera.
[0093] In some embodiments, each of the set of first images 608A is bound by a bounding box, and the data serialization file 660 includes a YAML or XML file configured to store a location of the bounding box of each first image 608A-i as the location of the respective first image 608A-i in the input image 504.
[0094] In some embodiments, the electronic device stores the input image 504, the input depth map 604, one or more image layers including the first image layer 608, and the data serialization file 660 jointly in memory of the electronic device. Alternatively, in some embodiments, the electronic device transfers the input image 504, the input depth map 604, the one or more image layers including the first image layer 608, and the data serialization file 660 to a computer system, such that the computer system renders the input image 504 in the 3D format. Further, in some embodiments, the input image 504 corresponds to a set of 3D meshes for rendering the input image 504 in the 3D format, the set of 3D meshes having a mesh file size. The input image 504, the one or more image layers, and the data serialization file 660 have a total file size. In an example, the total file size is less than 10% of the mesh file size. In another example, the total file size corresponding to a high-resolution serialized LDI is approximately 0.6MB, and the mesh file size for low-resolution 3D meshes that require up-sampling is 12.3MB. In this example, the total file size is less than 5% of the mesh file size.
[0095] It should be understood that the particular order in which the operations in Figure 10 have been described are merely exemplary and are not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process images. Additionally, it should be noted that details of other processes described above with respect to Figures 1-9 are also applicable in an analogous manner to method 1000 described above with respect to Figure 10. For brevity, these details are not repeated here.
[0096] Figure 11 is a flow diagram explaining a method 1100 for updating edge pixels of an object in an input image 1102, in accordance with some embodiments. The input image 1102 is associated with a depth map 1104. The input image 1102 is captured by a camera 260. In some embodiments, the depth map 1104 is generated from the input image 1102, e.g., using a neural network. In some embodiments, the depth map 1104 is measured by one or more depth sensors 280 jointly with the input image 1102 captured by the camera 260. A foreground object 1106 is identified in the input image 1102 based on the depth map 1104. The foreground object 1106 is defined in the input image 1102 and the depth map 1104 by a plurality of edge pixels 1108 associated with an edge of the foreground object 1106. An edge map 1110 is generated to identify the plurality of edge pixels 1108 in the input image 1102. In some embodiments, depth values of the depth map 1104 are compared with one or more depth thresholds to binarize the depth map 1104 and generate the edge map 1110. The edge map 1110 is a binary map having a first set of pixels and a second set of pixels. The first set of pixels correspond to the plurality of edge pixels 1108 and have a first binary value (e.g., "1"), and the second set of pixels correspond to a plurality of remaining pixels 1112 and have a second binary value (e.g., "0"). Each remaining pixel 1112 is distinct from the plurality of edge pixels 1108 in the input image 1102.
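A minimal sketch of generating the binary edge map 1110 from the depth map 1104 is given below: the depth map is binarized against a depth threshold and the boundary of the resulting foreground region is extracted with a morphological gradient. The threshold value and the assumption that larger depth values are closer to the camera are illustrative only.

```python
import cv2
import numpy as np

def make_edge_map(depth: np.ndarray, fg_threshold: float = 0.5) -> np.ndarray:
    """Binary edge map 1110: 1 on edge pixels 1108, 0 on remaining pixels 1112."""
    # Binarize the depth map against a threshold (assumed: larger value = closer).
    foreground = (depth > fg_threshold).astype(np.uint8)
    # The morphological gradient of the binary mask keeps only the object border.
    edge = cv2.morphologyEx(foreground, cv2.MORPH_GRADIENT, np.ones((3, 3), np.uint8))
    return edge.astype(np.float32)
```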
[0097] Each edge pixel 1108 corresponds to a respective background pixel that is occluded by the foreground object 1106. A respective alpha value 1114 is determined for each of the plurality of edge pixels 1108. The input image 1102 is rendered by blending colors of each of the plurality of edge pixels 1108 and the respective background pixel based on the respective alpha value. In an example, each edge pixel 1108 is a foreground pixel on the foreground object, and corresponds to the background pixel 1108’ occluded by the foreground object 1106. The foreground pixel 1108 has a first alpha value 1114F, and the background pixel 1108’ has a second alpha value 1114B. For each edge pixel 1108, the foreground pixel 1108 and the background pixel 1108’ are combined in a weighted manner using the first and second alpha values 1114F and 1114B to generate an updated edge pixel 1116 for the input image 1102. In some embodiments, the plurality of edge pixels 1108 are updated jointly on an image level (e.g., using an edge map 1110).
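For a single edge pixel, the weighted combination described above can be sketched as a standard alpha blend in which the background weight is the complement of the foreground weight; treating the two alpha values 1114F and 1114B as complementary is an assumption of this sketch.

```python
import numpy as np

def blend_edge_pixel(fg_color: np.ndarray, bg_color: np.ndarray, alpha_f: float) -> np.ndarray:
    """Updated edge pixel 1116: weighted mix of foreground 1108 and background 1108' colors."""
    alpha_b = 1.0 - alpha_f  # assumed complementary background weight
    return alpha_f * fg_color + alpha_b * bg_color

# Example: a mostly opaque border pixel keeps most of the foreground color.
updated = blend_edge_pixel(np.array([200.0, 50.0, 50.0]),
                           np.array([20.0, 120.0, 220.0]), alpha_f=0.8)
```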
[0098] In some embodiments, each edge pixel 1108 and the respective background pixel 1108’ correspond to a location of the input image 1102. In some embodiments, each edge pixel 1108 and the respective background pixel 1108’ are aligned on a line of sight originated from a view point where the input image 1102 is captured by a camera 260. In some embodiments, a set of first images (e.g., 612A in Figure 6) are obtained to capture respective visual view that is occluded by the foreground object 1106 (e.g., foreground object 514-1 in Figure 6). The respective background pixel 1108’ of each edge pixel 1108 is located on a respective one of the set of first images. Further, in some embodiments, the set of first images are captured by a camera. Alternatively, in some embodiments, an inpainting model is applied to process the input image 1102 and generate the set of first images. The inpainting model includes a neural network. Additionally, in some embodiments, for each first image, one or more context regions adjacent to the respective first image are identified on the input image, and the inpainting model is applied to process the one or more context regions to generate the respective first image. More details on identifying the respective background pixel 1108’ of each edge pixel 1108 are explained above with reference to Figures 6 and 8.

[0099] In some embodiments, a Gaussian blur filter is applied to the edge map 1110 to generate a blurred edge map 1118 (also called a pre-alpha map 1118). The blurred edge map 1118 is converted to an alpha map 1120 on a pixel-by-pixel basis. For example, the edge map 1110 and blurred edge map 1118 have the same resolution, which is equal to that of the depth map 1104. Each pixel of the blurred edge map 1118 is converted to a respective pixel of the alpha map 1120 as follows:

a = 1 - Eb × (1 - D)

where Eb is a pre-alpha value of the respective pixel of the blurred edge map 1118, D is a depth value of the respective pixel of the depth map 1104, and a is an alpha value of the respective pixel of the alpha map 1120. As such, the alpha map 1120 has a plurality of alpha values corresponding to a plurality of pixels of the input image 1102 and depth map 1104, and the plurality of alpha values include the respective alpha value of each edge pixel 1108 of the input image 1102 and depth map 1104.
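The blur-and-convert step above, including the conversion a = 1 - Eb × (1 - D), can be sketched as follows; the Gaussian kernel size and sigma are assumptions.

```python
import cv2
import numpy as np

def edge_map_to_alpha_map(edge_map: np.ndarray, depth: np.ndarray,
                          ksize: int = 7, sigma: float = 2.0) -> np.ndarray:
    """Blur the edge map 1110 into a pre-alpha map 1118, then convert it to the alpha map 1120."""
    blurred = cv2.GaussianBlur(edge_map, (ksize, ksize), sigma)  # Eb in [0, 1]
    alpha = 1.0 - blurred * (1.0 - depth)                        # a = 1 - Eb x (1 - D)
    return np.clip(alpha, 0.0, 1.0)
```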
[00100] The alpha value controls a transparency or opacity level of a color, and is optionally represented as a real value, a percentage, or an integer: full transparency is 0.0, 0%, or 0, whereas full opacity is 1.0, 100%, or 255, respectively. In some embodiments, respective alpha values of the plurality of edge pixels 1108 are normalized in a range of [0, 1.0]. Alternatively, in some embodiments, respective alpha values of the plurality of edge pixels 1108 are normalized in a range of [0, 255]. Each respective alpha value is optionally an integer. Alternatively and additionally, in some embodiments, respective alpha values of the plurality of edge pixels 1108 are normalized in a range of [0, 100%].

[00101] In some embodiments, the input image 1102 has a first resolution, and the depth map 1104 has a second resolution that is lower than the first resolution. Each pixel of the depth map 1104 corresponds to a set of adjacent pixels on the input image 1102. Further, in some embodiments, the edge map 1110, blurred edge map 1118, and alpha map 1120 have the same resolution, which is equal to the first resolution of the input image 1102. The depth map 1104 is up-sampled to the first resolution and applied to generate the edge map 1110 and alpha map 1120. The edge map 1110 and alpha map 1120 are further used to generate the updated edge pixels 1116 of the input image 1102. Alternatively, in some embodiments, the edge map 1110, blurred edge map 1118, and alpha map 1120 have the same resolution, which is equal to the second resolution of the depth map 1104. The edge map 1110 and alpha map 1120 are up-sampled to the first resolution and used to generate the updated edge pixels 1116 in the input image 1102.
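The two resolution-matching strategies described above can be sketched with simple bilinear resizing; the choice of interpolation is an assumption.

```python
import cv2
import numpy as np

def upsample_to_image(map_2d: np.ndarray, image_shape) -> np.ndarray:
    """Up-sample a depth map 1104 (or an edge/alpha map computed at depth resolution)
    to the first resolution of the input image 1102."""
    h, w = image_shape[:2]
    return cv2.resize(map_2d, (w, h), interpolation=cv2.INTER_LINEAR)
```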
[00102] Figure 12 is a flow diagram of an image processing method 1200 implemented based on color blending, in accordance with some embodiments. An electronic device obtains (1202) 3D spatial information, color information, and alpha values of a scene of interest. In some embodiments, the scene is a real scene recorded on an input image 1102 (Figure 11) captured by a camera 260. Alternatively, in some embodiments, the scene is created by an artist’s 3D graphical drawing or rendering, and the input image 1102 is a product of the artist’s work. In some situations, the input image 1102 is used by itself to extract the 3D spatial information, color information, and alpha values of the scene of interest based on a machine learning model, e.g., MiDaS, which is established based on ResNet and applied to estimate the depth map 1104 (Figure 11). The input image 1102 is optionally captured by the camera 260 or created by the artist. Alternatively, in some situations, the input image 1102 creates 3D effects jointly with one or more additional images. The input image 1102 and the one or more additional images correspond to different viewing angles, and are optionally captured by the same camera 260, two distinct cameras 260, or other 3D acquisition sensors. For example, a plurality of images including the input image 1102 are captured by a plurality of cameras including the camera 260, and applied to determine the depth map 1104 of the scene using machine learning, computer vision, or a combination thereof.
[00103] In some embodiments, the input image 1102 includes a 3D graphical drawing or rendering created by an artist, and no pre-alpha value Eb is provided with the input image 1102. In this case, each pixel has a pre-alpha value Eb that is set at a default value (e.g., 1.0) corresponding to full opacity. In some embodiments, the input image 1102 is captured by a camera 260, and the pre-alpha value Eb of each pixel is an alpha value captured by the camera 260 for an alpha channel of the input image 1102.
[00104] The electronic device identifies (1204) a view point and projects (1206) 3D scene points, from back to front, onto the view point, thereby forming a depth map 1104 of a scene. Alpha values are determined (1208) for foreground surfaces in the input image 1102. Specifically, the electronic device detects (1210) edges associated with object borders (e.g., edge pixels 1108 in Figure 11) on the depth map 1104, and applies (1212) a Gaussian blur filter onto the detected edges. In some embodiments, depth values on the edges are one of the parameters used to control alpha values. The further an object depth, the more opaque an edge pixel. A range of blurriness is normalized (1214) to a range specified for the alpha values, e.g., [0, 1], [0, 100%], or [0, 255]. Further, the electronic device defines (1216) the alpha values of the detected edges with respective newly derived values. The alpha values are applied (1218) to blend colors to update the input image 1102 for the given view point, and the input image 1102 is rendered with a plurality of updated edge pixels 1116 (Figure 11).

[00105] As the updated edge pixel 1116 is applied for each edge pixel 1108 of the input image 1102, jaggy or staircase effects are removed from borders of foreground objects 1106 in the input image 1102. An alpha value of each pixel corresponds to an opacity channel (also called an alpha channel). Each image layer has the opacity channel, governing how transparent or opaque every pixel on the image layer is. A ray originates from a given view point and intersects with different image layers. Alpha values are used to mix colors on respective image layers with which the ray intersects to determine a new color for a pixel projected by the ray. In some embodiments, a two-dimensional (2D) array of radiated rays is projected onto surfaces having different depths to form a 2D image. Each radiated ray hits different surfaces from a view point and intersects with the surfaces at different pixels. Each of the different pixels associated with a radiated ray has an associated color (e.g., red, green, blue, or a combination thereof) and an alpha value. For a particular view point, the intersected surfaces of a corresponding radiated ray are classified either as foreground (non-occluded or first layer) or background (occluded or second layer onward). In some embodiments, an initial alpha value (also called a pre-alpha value Eb) of a foreground layer corresponds to opacity, and an initial alpha value of each background layer is set by an animator or artist creator to translucency or transparency.
[00106] Edge pixels define borders or silhouettes of objects on a foreground layer. An alpha map is determined to replace initial alpha values of the edge pixels, which are provided by an input image captured by a camera or set for an image created by an animator or artist. As such, the borders or silhouettes of the foreground layer are alpha blended to the background layers to determine new colors for image anti-aliasing.
[00107] Figures 13A and 13B are an input image 1102 and a depth map 1104, in accordance with some embodiments. The input image 1102 is captured by a camera 260 and corresponds to a real scene. An electronic device obtains 3D spatial information, color information, and alpha values of the scene of interest based on the input image 1102. In some situations, the input image 1102 is used by itself to extract depth information based on a machine learning model, e.g., MiDaS, which is established based on ResNet and applied to determine the depth map 1104 from the input image 1102. Alternatively, in some situations, the input image 1102 creates a 3D effect jointly with one or more additional images (not shown). The input image 1102 and the one or more additional images correspond to different viewing angles, and are optionally captured by the same camera 260, two distinct cameras 260, or other 3D acquisition sensors. For example, a plurality of images including the input image 1102 are captured by a plurality of cameras including the camera 260, and applied to determine the depth map 1104 of the scene using machine learning, computer vision, or a combination thereof. Additionally or alternatively, the electronic device measures the depth map 1104 by one or more depth sensors 280 concurrently while capturing the input image 1102 by the camera 260.
[00108] The depth map 1104 includes a grayscale image in which a grayscale level of each pixel represents a corresponding depth value. For example, a foreground person 1302 is white indicating a small depth value, and a background alley 1304 is black indicating a large depth value. Silhouettes or object borders of the foreground person 1302 are determined from the depth map 1104, and processed by a Gaussian blur filter to obtain an alpha map 1120 for at least edge pixels 1108 associated with the silhouettes or object borders.
[00109] The input image 1102 has a first resolution, and the depth map 1104 has a second resolution. In some embodiments, the second resolution is equal to the first resolution. Alternatively, in some embodiments, the second resolution is lower than the first resolution, and each pixel of the depth map 1104 corresponds to a respective set of adjacent pixels on the input image 1102.
[00110] Figures 14A and 14B are an input image 1102 and an updated image 1400 rendered based on alpha blending, in accordance with some embodiments. A foreground person 1302 includes a left portion 1402 and a right portion 1404. Jaggies are observed at least on edge pixels of the left portion 1402, but not on edge pixels of the right portion 1404. A depth map 1104 corresponds to the input image 1102. The foreground person 1302 (Figures 13A and 13B) is identified in the input image 1102 based on the depth map 1104. A plurality of edge pixels 1108 (Figure 11) associated with an edge of the foreground object 1106 are identified on the depth map 1104. Each edge pixel corresponds to a respective background pixel 1108’ that is occluded by the foreground person 1302. A respective alpha value is determined for each of the plurality of edge pixels 1108. The input image 1102 is rendered by blending colors of each of the plurality of edge pixels 1108 and the respective background pixel 1108’ based on the respective alpha value. On an image level, an edge map 1110 (Figure 11) is generated to include binary values identifying the plurality of edge pixels 1108, and an alpha map 1120 is generated from the edge map 1110 to include alpha values for the plurality of edge pixels 1108. The alpha map 1120 is applied to update the input image 1102 and generate the updated image 1400. The jaggies are smoothed on the left portion 1402 of the updated image 1400.
[00111] Figure 15 is an example color blending file 1500 associated with an input image 1102, in accordance with some embodiments. The color blending file 1500 includes a fragment shader script written for OpenGL, which is a cross-language, cross-platform application programming interface for rendering 2D and 3D vector graphics. Alpha blending is applied to blend colors of edge pixels 1108 (Figure 11) associated with object borders on a foreground layer with colors of background pixels 1108’. Alpha blending does not require extra memory space, does not blur internal textures, shortens computation time, handles large geometric aliasing, and deals with transparency in the scene. Specifically, a foreground surface corresponds to layer 0, which is identified by a statement 1502 (i.e., “if(v_layer < 0.5)”), and a passed-in derived alpha value v_Color.a 1504 is applied to a texture of the foreground surface. Each of one or more non-foreground surfaces corresponds to a background surface, and keeps a corresponding original texture according to a statement 1506 (i.e., “fragColor = v_Color”).
[00112] Figure 16 is a flow diagram of another example image processing method 1600 implemented based on color blending, in accordance with some embodiments. For convenience, the method 1600 is described as being implemented by an electronic device (e.g., a mobile phone 104C). Method 1600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 16 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1600 may be combined and/or the order of some operations may be changed.
[00113] The electronic device obtains (1602) an input image 1102 and a depth map 1104 corresponding to the input image 1102 and identifies (1604) a foreground object 1106 in the input image 1102 based on the depth map 1104. The electronic device identifies (1606) a plurality of edge pixels 1108 associated with an edge of the foreground object 1106 on the depth map 1104. Each edge pixel 1108 corresponds to a respective background pixel 1108’ that is occluded by the foreground object 1106. The electronic device determines (1608) a respective alpha value for each of the plurality of edge pixels 1108, and renders (1610) the input image 1102 by blending colors of each of the plurality of edge pixels 1108 and the respective background pixel 1108’ based on the respective alpha value.
[00114] In some embodiments, the plurality of edge pixels 1108 are identified on the depth map 1104 by generating (1612) an edge map 1110 identifying the plurality of edge pixels 1108. Further, in some embodiments, the edge map 1110 includes (1614) a binary map, and the binary map has (1) a first set of pixels corresponding to the plurality of edge pixels 1108 and having a first binary value and (2) a second set of pixels corresponding to a plurality of remaining pixels and having a second binary value. Each remaining pixel is distinct from the plurality of edge pixels 1108 in the input image 1102. Additionally, in some embodiments, the electronic device applies (1616) a Gaussian blur filter on the edge map 1110 to generate a blurred edge map 1118 and converts (1618) the blurred edge map 1118 to an alpha map 1120 based on the depth map 1104. The alpha map 1120 has a plurality of alpha values corresponding to a plurality of pixels of the input image 1102. The plurality of alpha values include the respective alpha value of each edge pixel 1108.
[00115] In some embodiments, the plurality of edge pixels 1108 include (1620) the edge of the foreground object 1106 and one or more pixels that are immediately adjacent to the edge of the foreground object 1106. Further, in some embodiments, the one or more pixels include a predefined number (e.g., 3-5) of pixels.
[00116] In some embodiments, respective alpha values of the plurality of edge pixels 1108 are normalized (1624) in a range of [0, 1]. Alternatively, in some embodiments, respective alpha values of the plurality of edge pixels 1108 are normalized (1626) in a range of [0, 255].

[00117] In some embodiments, the electronic device obtains a set of first images 608A (Figures 6 and 8) that capture respective visual view that is occluded by the foreground object 1106. The respective background pixel 1108’ of each edge pixel 1108 is located on a respective one of the set of first images. Further, in some embodiments, the electronic device obtains the set of first images by applying an inpainting model to the input image 1102 to generate the set of first images. The inpainting model includes a neural network.
Additionally, in some embodiments, for each first image, the electronic device identifies one or more context regions adjacent to the respective first image on the input image 1102. The inpainting model is applied to process the one or more context regions to generate the respective first image. Alternatively, in some embodiments, determining the set of first images further comprises capturing the set of first images by a camera.
[00118] In some embodiments, each edge pixel 1108 and the respective background pixel 1108’ correspond to a location of the input image 1102. Alternatively, in some embodiments, each edge pixel 1108 and the respective background pixel 1108’ are aligned (1622) on a line of sight originated from a view point where the input image 1102 is captured by a camera.
[00119] In some embodiments, the electronic device obtains the depth map 1104 by generating (1628) the depth map 1104 from the input image 1102, e.g., using an artificial intelligence or computer vision technique. Alternatively, in some embodiments, the electronic device obtains the depth map 1104 by measuring (1630) the depth map 1104 by one or more depth sensors concurrently while capturing the input image 1102 by a camera.
[00120] In some embodiments, the depth map 1104 includes a plurality of depth pixels. Each depth pixel includes a normalized depth value in [0, 1], a color value, a segmentation label, or a silhouette pixel flag value.
[00121] In some embodiments, for each edge pixel 1108, the respective background pixel 1108’ that is occluded by the foreground object 1106 is artificially synthesized by a graphical system. Alternatively, in some embodiments, for each edge pixel 1108, the respective background pixel 1108’ that is occluded by the foreground object 1106 is captured by a camera.
[00122] It should be understood that the particular order in which the operations in Figure 16 have been described are merely exemplary and are not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process images. Additionally, it should be noted that details of other processes described above with respect to Figures 1-15 are also applicable in an analogous manner to method 1600 described above with respect to Figure 16. For brevity, these details are not repeated here.
[00123] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[00124] As used herein, the term "if" is, optionally, construed to mean "when" or "upon" or "in response to determining" or "in response to detecting" or "in accordance with a determination that," depending on the context. Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" is, optionally, construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]" or "in accordance with a determination that [a stated condition or event] is detected," depending on the context.
[00125] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[00126] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. An image processing method, comprising: obtaining an input image and a depth map corresponding to the input image; identifying a foreground object in the input image based on the depth map; identifying a plurality of edge pixels associated with an edge of the foreground object on the depth map, each edge pixel corresponding to a respective background pixel that is occluded by the foreground object; determining a respective alpha value for each of the plurality of edge pixels; and rendering the input image by blending colors of each of the plurality of edge pixels and the respective background pixel based on the respective alpha value.
2. The method of claim 1, wherein identifying the plurality of edge pixels on the depth map further comprises generating an edge map identifying the plurality of edge pixels.
3. The method of claim 2, wherein the edge map includes a binary map, the binary map having (1) a first set of pixels corresponding to the plurality of edge pixels and having a first binary value and (2) a second set of pixels corresponding to a plurality of remaining pixels and having a second binary value, each remaining pixel distinct from the plurality of edge pixels in the input image.
4. The method of claim 2 or 3, further comprising: applying a Gaussian blur filter on the edge map to generate a blurred edge map; and converting the blurred edge map to an alpha map based on the depth map, the alpha map having a plurality of alpha values corresponding to a plurality of pixels of the input image, the plurality of alpha values including the respective alpha value of each edge pixel.
5. The method of any of the preceding claims, wherein the plurality of edge pixels include the edge of the foreground object and one or more pixels that are immediately adjacent to the edge of the foreground object.
6. The method of any of the preceding claims, wherein respective alpha values of the plurality of edge pixels are normalized in a range of [0, 1].
7. The method of any of claims 1-5, wherein respective alpha values of the plurality of edge pixels are normalized in a range of [0, 255].
8. The method of any of the preceding claims, further comprising: obtaining a set of first images that capture respective visual view that is occluded by the foreground object, wherein the respective background pixel of each edge pixel is located on a respective one of the set of first images.
9. The method of claim 8, wherein obtaining the set of first images further comprises: applying an inpainting model to process the input image and generate the set of first images, the inpainting model including a neural network.
10. The method of claim 9, further comprising: for each first image, identifying one or more context regions adjacent to the respective first image on the input image, wherein the inpainting model is applied to process the one or more context regions to generate the respective first image.
11. The method of claim 8, wherein determining the set of first images further comprises capturing the set of first images by a camera.
12. The method of any of the preceding claims, wherein each edge pixel and the respective background pixel correspond to a location of the input image.
13. The method of any of claims 1-11, wherein each edge pixel and the respective background pixel are aligned on a line of sight originated from a viewpoint where the input image is captured by a camera.
14. The method of any of the preceding claims, wherein obtaining the depth map further comprises generating the depth map from the input image.
15. The method of any of claims 1-13, wherein obtaining the depth map further comprises measuring the depth map by one or more depth sensors concurrently while capturing the input image by a camera.
16. The method of any of the preceding claims, wherein the depth map includes a plurality of depth pixels, each depth pixel including a normalized depth value in [0, 1], a color value, a segmentation label, or a silhouette pixel flag value.
17. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-16.
18. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-16.
PCT/US2023/010333 2022-01-07 2023-01-06 Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces WO2023133285A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263297149P 2022-01-07 2022-01-07
US63/297,149 2022-01-07
USPCT/US2022/081273 2022-12-09
US2022081273 2022-12-09

Publications (1)

Publication Number Publication Date
WO2023133285A1 true WO2023133285A1 (en) 2023-07-13

Family

ID=87074162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/010333 WO2023133285A1 (en) 2022-01-07 2023-01-06 Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces

Country Status (1)

Country Link
WO (1) WO2023133285A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117656083A (en) * 2024-01-31 2024-03-08 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment
CN117656083B (en) * 2024-01-31 2024-04-30 厦门理工学院 Seven-degree-of-freedom grabbing gesture generation method, device, medium and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170076483A1 (en) * 2013-09-06 2017-03-16 Imatte, Inc. Method for preventing selected pixels in a background image from showing through corresponding pixels in a transparency layer
US20170188002A1 (en) * 2015-11-09 2017-06-29 The University Of Hong Kong Auxiliary data for artifacts - aware view synthesis


Similar Documents

Publication Publication Date Title
KR102319177B1 (en) Method and apparatus, equipment, and storage medium for determining object pose in an image
WO2020108082A1 (en) Video processing method and device, electronic equipment and computer readable medium
WO2020134818A1 (en) Image processing method and related product
CN111539897A (en) Method and apparatus for generating image conversion model
WO2023102223A1 (en) Cross-coupled multi-task learning for depth mapping and semantic segmentation
CN112651423A (en) Intelligent vision system
WO2022103877A1 (en) Realistic audio driven 3d avatar generation
KR20210029692A (en) Method and storage medium for applying bokeh effect to video images
WO2022148248A1 (en) Image processing model training method, image processing method and apparatus, electronic device, and computer program product
US11443477B2 (en) Methods and systems for generating a volumetric two-dimensional representation of a three-dimensional object
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
WO2023069085A1 (en) Systems and methods for hand image synthesis
WO2023027712A1 (en) Methods and systems for simultaneously reconstructing pose and parametric 3d human models in mobile devices
CN116630139A (en) Method, device, equipment and storage medium for generating data
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
US20240087344A1 (en) Real-time scene text area detection
WO2023023160A1 (en) Depth information reconstruction from multi-view stereo (mvs) images
WO2023172257A1 (en) Photometic stereo for dynamic surface with motion field
CN113902786B (en) Depth image preprocessing method, system and related device
WO2024076343A1 (en) Masked bounding-box selection for text rotation prediction
US20230274403A1 (en) Depth-based see-through prevention in image fusion
CN117813626A (en) Reconstructing depth information from multi-view stereo (MVS) images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23737638

Country of ref document: EP

Kind code of ref document: A1