WO2023023162A1 - 3d semantic plane detection and reconstruction from multi-view stereo (mvs) images - Google Patents

3d semantic plane detection and reconstruction from multi-view stereo (mvs) images

Info

Publication number
WO2023023162A1
WO2023023162A1, PCT/US2022/040610, US2022040610W
Authority
WO
WIPO (PCT)
Prior art keywords
plane
pixel
camera
image
parameters
Prior art date
Application number
PCT/US2022/040610
Other languages
French (fr)
Inventor
Pan JI
Jiachen Liu
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc.
Publication of WO2023023162A1 publication Critical patent/WO2023023162A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/204 Image signal generators using stereoscopic image cameras
    • H04N13/207 Image signal generators using stereoscopic image cameras using a single 2D image sensor
    • H04N13/221 Image signal generators using stereoscopic image cameras using a single 2D image sensor using the relative movement between cameras and objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20 Image signal generators
    • H04N13/204 Image signal generators using stereoscopic image cameras
    • H04N13/239 Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30 Image reproducers
    • H04N13/332 Displays for viewing with the aid of special glasses or head-mounted displays [HMD]
    • H04N13/344 Displays for viewing with the aid of special glasses or head-mounted displays [HMD] with head-mounted left-right displays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074 Stereoscopic image analysis
    • H04N2013/0081 Depth or disparity estimation from stereoscopic image signals

Definitions

  • This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for applying deep learning techniques to identify planes in images and determine plane parameters and depth information.
  • Three-dimensional (3D) planar structure detection and reconstruction from color images has been an important yet challenging problem in computer vision. It aims to detect piece-wise planar regions and predict corresponding 3D plane parameters from the color images.
  • the predicted 3D plane parameters can be used in various applications such as robotics, augmented reality (AR), and indoor scene understanding.
  • Some 3D plane detection or reconstruction solutions highly rely on some assumptions (e.g., Manhattan-world assumption) of the target scene and are not always robust in complicated real-world cases.
  • Current practices apply a convolutional neural network (CNN) to a single image to estimate plane parameters directly. Because only a single image is used, the depth scale is ambiguous, and the estimated plane parameters cannot accurately represent the geometric information of the corresponding planes. It would be beneficial to develop systems and methods for detecting and reconstructing 3D planes accurately using deep learning techniques.
  • Various embodiments of this application are directed to methods, systems, devices, non-transitory computer-readable media for detecting 3D semantic planes from multiple views.
  • 3D plane detection methods are applied in extended reality products, e.g., augmented reality (AR) glasses, AR applications executed by mobile phones.
  • These products generate 3D semantic planes to make a virtual object placed in a scene merge into the scene and appear realistic.
  • the multiple views applied in 3D plane detection are captured by two separate cameras having known relative positions with respect to each other or by a single camera at two distinct camera poses.
  • multi-view images are applied as input, and multi-view geometry is modeled by constructing a feature cost volume using the multi-view images.
  • Multi-view based 3D plane detection is based on 3D plane hypotheses, and preserves information of the input multi-view images by concatenating features of the multi-view images.
  • various embodiments of this application leverage multi-view geometry and constraints to identify geometrically accurate 3D semantic planes and depth information in a consistent manner.
  • an image processing method is implemented at an electronic device.
  • the method includes obtaining a source image captured by a first camera at a first camera pose and obtaining a target image captured by a second camera at a second camera pose.
  • the method further includes obtaining information of a plurality of slanted planes, generating a feature cost volume of the target image with reference to the plurality of slanted planes, and generating a plane probability volume from the feature cost volume.
  • the plane probability volume has a plurality of elements each of which indicates a probability of each pixel in the source image being located on a respective one of the plurality of slanted planes.
  • the method further includes generating a plurality of pixel-based plane parameters from the plane probability volume and pooling the plurality of pixel-based plane parameters to generate an instance plane parameter.
  • each pixel-based plane parameter includes a respective 3D plane parameter vector of a respective pixel of the target image.
  • the first camera and the second camera are the same camera, and the source and target images are two distinct image frames in a sequence of image frames captured by the same camera, and wherein the first and second camera poses are distinct from each other and correspond to two distinct time instants.
  • the first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
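  • For illustration only, the intermediate quantities of the method above can be traced at the level of tensor shapes. The following is a minimal sketch under the assumption (stated later in this disclosure) that the feature cost volume concatenates target features with source features warped once per slanted plane hypothesis along the channel dimension; the names, sizes, and the simple average pooling are hypothetical placeholders, not the claimed implementation.

```python
import torch

B, C, K_hyp, H, W = 1, 32, 8, 60, 80   # batch, feature channels, slanted plane hypotheses, feature size

target_feat = torch.randn(B, C, H, W)                 # features of the target image
warped_src  = torch.randn(B, C, K_hyp, H, W)          # source features warped once per slanted plane

# Feature cost volume: target features (broadcast over hypotheses) concatenated with
# the warped source features along the channel dimension.
cost_volume = torch.cat(
    [target_feat.unsqueeze(2).expand(-1, -1, K_hyp, -1, -1), warped_src], dim=1
)                                                      # (B, 2C, K, H, W)

# Plane probability volume: one probability per pixel per slanted plane hypothesis.
prob_volume = torch.softmax(torch.randn(B, K_hyp, H, W), dim=1)   # (B, K, H, W)

# Pixel-based plane parameters: one 3D plane parameter vector per pixel of the target image.
pixel_params = torch.randn(B, 3, H, W)

# Instance plane parameter: pooled over the pixels of one plane instance (plain average here).
instance_param = pixel_params.flatten(2).mean(dim=2)   # (B, 3)
print(cost_volume.shape, prob_volume.shape, pixel_params.shape, instance_param.shape)
```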
  • an image processing method is implemented by an electronic device.
  • the method includes obtaining a source image captured by a first camera at a first camera pose, obtaining a target image captured by a second camera at a second camera pose, and generating a plane mask from the target image using a plane mask model.
  • the method further includes generating a plurality of pixel-based plane parameters from the source image and the target image using a plane multi-view stereo model.
  • each pixel-based plane parameter includes a respective 3D plane parameter vector of a respective pixel of the target image.
  • the method further includes pooling the plurality of pixel-based plane parameters based on the plane mask to generate an instance plane parameter and generating an instance-level planar depth map corresponding to the target image based on the instance plane parameter.
  • the plane mask model includes a plane region-based convolutional neural network (PlaneRCNN) that is configured to detect and reconstruct a plurality of piecewise planar surfaces from a single input image including the target image.
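  • The pooling and depth-recovery steps described above can be illustrated with a minimal NumPy sketch, under the common assumption that each pixel-based plane parameter is the 3-vector p = n/d (plane normal divided by plane offset), so a pixel with homogeneous coordinate x on the plane has depth z = 1 / (p · K⁻¹x). The array names and the hard average pooling are illustrative only, not the disclosed implementation.

```python
import numpy as np

def instance_plane_and_depth(pixel_params, plane_mask, K):
    """Pool per-pixel plane parameters over one plane mask and render its planar depth.

    pixel_params: (H, W, 3) per-pixel plane parameter vectors (assumed p = n / d).
    plane_mask:   (H, W) binary mask of one detected plane instance.
    K:            (3, 3) camera intrinsic matrix.
    """
    H, W, _ = pixel_params.shape
    mask = plane_mask.astype(bool)

    # Instance plane parameter: average the per-pixel vectors inside the mask
    # (a soft, weighted pooling could be used instead of this hard average).
    instance_param = pixel_params[mask].mean(axis=0)                      # (3,)

    # Homogeneous pixel grid (u, v, 1) for every pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)   # (H, W, 3)

    # Back-projected rays K^{-1} x and plane-induced depth z = 1 / (p . K^{-1} x).
    rays = pix @ np.linalg.inv(K).T                                       # (H, W, 3)
    denom = rays @ instance_param                                          # (H, W)
    depth = np.where(mask, 1.0 / np.clip(denom, 1e-6, None), 0.0)
    return instance_param, depth

# Toy usage with a fronto-parallel plane at depth 2 m: n = (0, 0, 1), d = 2, so p = n / d.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
params = np.tile(np.array([0.0, 0.0, 0.5]), (480, 640, 1))
mask = np.ones((480, 640))
p_inst, depth = instance_plane_and_depth(params, mask, K)
print(p_inst, depth[240, 320])   # expected depth of about 2.0 at every pixel
```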
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • FIG. 5 is a flowchart of a process for processing inertial sensor data and image data of an electronic system using a SLAM module, in accordance with some embodiments.
  • FIG. 6A illustrates an input and output interface of an example planar multi-view stereo (MVS) system, in accordance with some embodiments.
  • Figure 6B is a flow diagram of a process for rendering a virtual object in a scene, in accordance with some embodiments.
  • Figure 7A is a block diagram of a single view plane reconstruction framework 700, in accordance with some embodiments.
  • Figure 7B is a block diagram of a depth-based MVS framework, in accordance with some embodiments.
  • Figure 7C is a block diagram of a plane MVS system, in accordance with some embodiments.
  • Figure 8A is a block diagram of another example plane MVS branch of a plane MVS system, in accordance with some embodiments.
  • Figure 8B is a flow diagram of an example process for refining a pixel-wise plane map including a plurality of pixel-based plane parameters, in accordance with some embodiments.
  • Figure 9 is a block diagram of an example plane MVS system, in accordance with some embodiments.
  • Figure 10 is a flow diagram of an example process implemented by a client device to generate plane parameters from a plurality of input images, in accordance with some embodiments.
  • Figure 11 is a flow diagram of an example process implemented by a client device 104 to form a stitched depth map, in accordance with some embodiments.
  • Figure 12 is a flow diagram of an example image processing method, in accordance with some embodiments.
  • Figure 13 is a flow diagram of an example image processing method, in accordance with some embodiments.
  • Various embodiments of this application are directed to plane detection using multi-view stereo (PlaneMVS) imaging, in which planes are detected in a target image using a plurality of images including the target image.
  • the plurality of images are optionally captured by distinct cameras from different camera poses (i.e., corresponding to different camera positions or orientations) or by a single camera moved among different camera poses.
  • a PlaneMVS system takes a pair of images (e.g., a target image and a source image) as inputs and predicts two-dimensional (2D) plane masks and three-dimensional (3D) plane geometry of the target image.
  • the input images are optionally a stereo pair of images captured by two cameras or two neighboring frames from a monocular sequence captured by a camera.
  • the relative pose corresponding to the source and target images is assumed to be known.
  • when the images are captured by two separate cameras, the relative pose between the cameras is estimated via a calibration process.
  • when the images are captured by a single moving camera, the relative pose of the camera is estimated, e.g., by simultaneous localization and mapping (SLAM).
  • PlaneMVS leverages multi-view geometry for 3D semantic plane prediction, thereby making prediction of 3D planes geometrically more accurate than a single-image based method (e.g., PlaneRCNN). Additionally, PlaneMVS makes use of geometrical constraints in a neural network, and may be generalized across different datasets.
  • a plane of the target image is associated with a plurality of plane parameters (e.g., a plane normal and a plane offset).
  • the target image corresponds to a plane mask and a depth map indicating depth values at different pixels.
  • the target image is an RGB color image for which the plane parameters, depth map, and/or plane mask are predicted.
  • a source image is an image that precedes or follows the target image in a sequence of image frames and is used to determine the plane parameters, depth map, and/or plane mask of the target image.
  • a PlaneRCNN is also used in a plane mask model for predicting the plane parameters, depth map, and/or plane mask of the target image.
  • The basic model architecture of the PlaneRCNN is derived from MaskRCNN, which is an object detection model.
  • Each of PlaneRCNN and MaskRCNN includes a convolutional neural network (CNN), which is optionally a deep neural network (DNN).
  • the plane parameters, depth map, and/or plane mask of the target image are determined based on intrinsic and extrinsic parameters of a camera.
  • the intrinsic parameters include an intrinsic parameter matrix of the camera, and are constant for a given dataset or data captured by the camera.
  • the extrinsic parameters include a relative pose matrix of the camera for a given RGB image.
  • PlaneMVS is implemented by an extended reality application that is executed by an electronic device (e.g., a head-mounted display that is configured to display extended reality content).
  • Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and virtual objects.
  • AR is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, e.g., across multiple sensory modalities including visual, auditory, haptic, somatosensory, and olfactory.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the client device 104 obtains the content data, which is processed locally or remotely using the trained data processing models.
  • both model training and data processing are implemented locally at each individual client device 104.
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102 A) associated with a client device 104.
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending it to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the HMD 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including a user's voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D.
  • the display of the HMD 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space.
  • the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices
  • the user application(s) 224 include an extended reality application 225 configured to interact with a user and provide extended reality content;
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
  • Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., HMD 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes a SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data, and a plane MVS system 234 for applying multiple images to determine pixel-based and instance plane parameters, plane masks of one or more planes, or depth maps;
  • Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating extended reality content using images captured by the camera 260 based on information of camera poses and planes provided by the pose determination and prediction module 230;
  • the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • memory 206 optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 250 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 250 is trained according to the type of content data to be processed.
  • the training data 306 is consistent with the type of the content data, and a data pre-processing module 308 consistent with the type of the content data is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 250 is provided to the data processing module 228 to process the content data.
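  • A minimal sketch of the train-until-the-loss-criterion-is-satisfied loop described above is given below, written against a generic PyTorch model; the model, data, optimizer, and loss threshold are illustrative placeholders rather than the disclosed training configuration.

```python
import torch
from torch import nn

# Placeholder data processing model and labelled training data (illustrative only).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
inputs, ground_truth = torch.randn(256, 16), torch.randn(256, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()          # loss function comparing the output with the ground truth
loss_threshold = 0.05           # loss criterion; training stops once the loss falls below it

for step in range(10_000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), ground_truth)   # forward propagation and loss computation
    loss.backward()                               # backward propagation of the error
    optimizer.step()                              # modify the model to reduce the loss
    if loss.item() < loss_threshold:              # loss criterion satisfied
        break
```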
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing modules 314 pre-process the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 250 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments
  • Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 250 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s).
  • a weight w associated with each link 412 is applied to the node output.
  • the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).
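  • As a toy illustration of the propagation function described above (a weighted combination of the node inputs, a bias term, and a non-linear activation), assuming a ReLU activation and illustrative weight values:

```python
import numpy as np

def node_output(inputs, weights, bias):
    """One node: non-linear activation applied to a weighted sum of the inputs plus a bias."""
    z = float(np.dot(weights, inputs)) + bias
    return max(0.0, z)           # ReLU activation; sigmoid or tanh could be used instead

# Four node inputs combined with weights w1..w4 and a bias b.
print(node_output([0.2, -1.0, 0.5, 0.3], [0.4, 0.1, -0.6, 0.8], bias=0.05))
```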
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the layer(s) may include a single layer acting as both an input layer and an output layer.
  • the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 250 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
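  • A small PyTorch example of the convolutional abstraction described above: each convolutional layer turns the pre-processed image tensor into a feature map, with every output value computed from a limited receptive area of the previous layer. The layer sizes are arbitrary and not taken from the disclosure.

```python
import torch
from torch import nn

image = torch.randn(1, 3, 224, 224)          # a pre-processed RGB image tensor (N, C, H, W)

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # each output value sees a 3x3 receptive area
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling/down-sampling between layers
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
)

feature_map = cnn(image)
print(feature_map.shape)   # torch.Size([1, 32, 112, 112])
```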
  • a recurrent neural network is applied in the data processing model 250 to process content data (particularly, textual and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for hand
  • the training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • In forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • In backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
  • FIG. 5 is a flowchart of a process 500 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module 232, in accordance with some embodiments.
  • the process 500 includes measurement preprocessing 502, initialization 504, local visual-inertial odometry (VIO) with relocation 506, and global pose graph optimization 508.
  • an RGB camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (510) from the image data.
  • An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (512) to provide data of a variation of device poses 540.
  • the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (514).
  • vision-only structure from motion (SfM) techniques are applied (516) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
  • a sliding window 518 and associated states from a loop closure 520 are used to optimize (522) a VIO.
  • when the VIO corresponds (524) to a keyframe of a smooth video transition and a corresponding loop is detected (526), features are retrieved (528) and used to generate the associated states from the loop closure 520.
  • a multi-degree-of-freedom (multiDOF) pose graph is optimized (530) based on the states from the loop closure 520, and a keyframe database 532 is updated with the keyframe associated with the VIO.
  • the features that are detected and tracked (510) are used to monitor (534) motion of an object in the image data and estimate image-based poses 536, e.g., according to the image frame rate.
  • the inertial sensor data that are pre-integrated (512) may be propagated (538) based on the motion of the object and used to estimate inertial-based poses 540, e.g., according to a sampling frequency of the IMU 280.
  • the image-based poses 536 and the inertial-based poses 540 are stored in the pose data database 252 and used by the module 230 to estimate and predict poses that are used by the pose-based rendering module 238.
  • the SLAM module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 536 to estimate and predict more poses 540 that are further used by the pose-based rendering module 238.
  • high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors (e.g., the RGB camera 260, a LiDAR scanner) and the IMU 280.
  • the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond).
  • Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occurs after the image is rendered.
  • ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames.
  • relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction.
  • the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use.
  • the image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
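  • The disclosure describes C++ STL containers keyed by timestamps; purely as a Python analogue of that idea (not the disclosed implementation), a timestamp-sorted buffer supporting insertion and nearest-timestamp search for image/IMU synchronization might look like:

```python
import bisect
from collections import namedtuple

Sample = namedtuple("Sample", ["timestamp", "data"])

class TimestampedBuffer:
    """Keeps sensor samples (images or IMU readings) sorted by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._samples = []

    def insert(self, timestamp, data):
        i = bisect.bisect(self._timestamps, timestamp)   # data insertion keeps time order
        self._timestamps.insert(i, timestamp)
        self._samples.insert(i, Sample(timestamp, data))

    def nearest(self, timestamp):
        """Data search: sample whose timestamp is closest to the query (for synchronization)."""
        i = bisect.bisect(self._timestamps, timestamp)
        candidates = self._samples[max(i - 1, 0): i + 1]
        return min(candidates, key=lambda s: abs(s.timestamp - timestamp))

buf = TimestampedBuffer()
buf.insert(0.010, "imu_0"); buf.insert(0.020, "imu_1"); buf.insert(0.033, "frame_0")
print(buf.nearest(0.031))   # the sample closest to t = 0.031 s
```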
  • FIG. 6A illustrates an input and output interface 600 of an example planar multi-view stereo (MVS) system 234, in accordance with some embodiments.
  • the plane MVS system 234 receives a pair of input images including a source image 604 and a target image 606, and predicts one or more plane masks 608 and plane geometry 610 (e.g., depth map) of the target image 606.
  • the source image 604 is captured by a first camera at a first camera pose
  • the target image 606 is captured by a second camera at a second camera pose.
  • the source image 604 and target image 606 are a stereo pair.
  • the first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other.
  • the source image 604 and target image 606 are captured at the same time instant or two distinct time instants.
  • the first camera and the second camera are the same camera 260, and the source and target images 604 and 606 are two distinct image frames in a sequence of image frames captured by the same camera.
  • the first and second camera poses are distinct from each other and correspond to two distinct time instants.
  • the source image 604 and target image 606 are two immediately successive frames in a sequence of images captured by the same camera 260.
  • the source image 604 and target image 606 are separated by one or more frames in a sequence of images captured by the camera 260.
  • the first and second camera poses corresponding to the source and target images 604 and 606 differ by a relative pose.
  • the relative pose is known, given the source and target images 604 and 606.
  • if the source and target images 604 and 606 are captured by two distinct cameras, the relative pose is optionally estimated via a calibration process and based on intrinsic and extrinsic parameters of the cameras. If the source and target images 604 and 606 are captured by the same camera 260, the relative pose is estimated, e.g., by SLAM.
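  • For completeness, a small NumPy sketch of how a relative pose between the two views could be composed from per-view poses, assuming 4x4 camera-to-world matrices; the convention and helper names are illustrative, and in practice the pose comes from calibration or SLAM as noted above.

```python
import numpy as np

def relative_pose(T_src_to_world, T_tgt_to_world):
    """Transform that maps points from the target camera frame into the source camera frame."""
    return np.linalg.inv(T_src_to_world) @ T_tgt_to_world

def make_pose(R, t):
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Two camera poses that differ by a 10 cm sideways translation (e.g., a stereo pair
# or two neighboring frames of a monocular sequence).
T_src = make_pose(np.eye(3), np.array([0.0, 0.0, 0.0]))
T_tgt = make_pose(np.eye(3), np.array([0.1, 0.0, 0.0]))

T_rel = relative_pose(T_src, T_tgt)
R_rel, t_rel = T_rel[:3, :3], T_rel[:3, 3]
print(R_rel, t_rel)    # identity rotation, 0.1 m baseline
```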
  • FIG. 6B is a flow diagram of a process 650 for rendering a virtual object in a scene, in accordance with some embodiments.
  • the process 650 is implemented jointly by a pose determination and prediction module 230, a pose-based rendering module 238, and an extended reality application 225 of a client device 104 (e.g., an HMD 104D).
  • the pose determination and prediction module 230 includes a SLAM module 232 and a plane MVS system 234.
  • the pose determination and prediction module 230 receives a sequence of image frames 612 including a source image 604 and a target image 606 that follows the source image 604.
  • the SLAM module 232 maps the scene where the client device 104 is located and identifies a pose of the client device 104 within the scene using image data (e.g., the sequence of image frames 612) and IMU sensor data.
  • the plane MVS system 234 identifies one or more planes in the scene based on the sequence of image frames 612, and determines plane parameters and depth information of the one or more planes.
  • the pose-based rendering module 238 renders virtual objects on top of a field of view of the client device 104 or creates extended reality content using the image frames 612 captured by the camera 260.
  • the virtual objects are overlaid on the one or more planes identified by the plane MVS system 234 to create the extended reality content.
  • the extended reality application 225 includes an AR application configured to render AR content to a user.
  • the AR application is executed jointly with the SLAM module 232, plane MVS system 234, and pose-based rendering module 238.
  • the virtual objects are overlaid on the one or more planes identified by the plane MVS system 234 in a physical world to create the AR content.
  • Neural networks are trained and deployed on the client device 104 or on a server 102. Examples of the client device 104 include hand-held devices (e.g., a mobile phone 104C) and wearable devices (e.g., the HMD 104D).
  • the plane MVS system 234 identifies a set of 3D planes and associated semantic labels, and the SLAM module 232 tracks a pose of the client device 104.
  • the virtual objects are seamlessly placed on top of 3D planes. In some situations, a virtual object moves around. In some situations, a virtual object is anchored at a location.
  • the pose-based rendering module 238 reconstructs each virtual object via 3D planes and manages occlusion and dis-occlusion between real and virtual objects for the AR application.
  • virtual objects are placed on planes based on corresponding semantic classes. For example, in an AR shopping application, floor lamps are automatically placed on floor planes (i.e., not on desk planes). Stated another way, in accordance with a determination that a semantic class of a first plane is a predefined first class associated with a first virtual object, the pose-based rendering module 238 renders the first virtual object on the first plane and causes the first virtual object to be overlaid on the first plane in the AR application.
  • Conversely, in accordance with a determination that the semantic class of a second plane is not the predefined first class associated with the first virtual object, the pose-based rendering module 238 aborts rendering the first virtual object on the second plane, and the first virtual object is not overlaid on the second plane in the AR application.
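  • The class-gated placement rule above can be summarized by a short, purely illustrative check; the class names, placement mapping, and Plane structure are hypothetical, not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Plane:
    semantic_class: str     # e.g., "floor", "desk", "wall"
    instance_param: tuple   # 3D plane parameter of the detected plane

# Which plane classes a given virtual object may be placed on (illustrative mapping).
PLACEMENT_RULES = {"floor_lamp": {"floor"}, "coffee_mug": {"desk", "table"}}

def maybe_place(virtual_object: str, plane: Plane) -> bool:
    """Render the object on the plane only if the plane's semantic class is allowed."""
    allowed = PLACEMENT_RULES.get(virtual_object, set())
    if plane.semantic_class in allowed:
        return True      # overlay the virtual object on this plane
    return False         # abort rendering on this plane

print(maybe_place("floor_lamp", Plane("floor", (0.0, 1.0, 0.0))))   # True
print(maybe_place("floor_lamp", Plane("desk", (0.0, 1.0, 0.0))))    # False
```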
  • FIG. 7A is a block diagram of a single view plane reconstruction framework 700, in accordance with some embodiments.
  • the single view plane reconstruction framework 700 includes a plane geometry branch 702 and a plane detection branch 704, and is configured to receive a single input image 706 and generate one or more plane parameters 708 and one or more plane masks 710.
  • the plane detection branch 704 detects one or more planes from the single input image 706, and the plane geometry branch 702 determines the one or more plane parameters 708 including a plane normal and a plane offset for each of the detected planes.
  • each plane is identified by a binary plane mask 710.
  • For each plane, the binary plane mask 710 includes a plurality of elements, and each element has one of two predefined values (e.g., 0, 1) and indicates whether one or more adjacent pixels belong to the respective plane. Alternatively, in some embodiments, a first number of planes are identified from the single input image 706.
  • a single plane mask 710 includes a plurality of elements, and each element has a value selected from the first number of predefined values. Each predefined value corresponds to a distinct plane. For each predefined value, corresponding elements correspond to a set of pixels that belong to the same one of the first number of planes identified in the single input image.
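  • The two mask encodings described above (one binary mask per plane versus a single mask whose values label the planes) are interchangeable; a short NumPy illustration with hypothetical labels:

```python
import numpy as np

# A single plane mask whose values 1..N label N planes (0 = non-planar pixels).
label_mask = np.array([
    [0, 1, 1, 2],
    [0, 1, 2, 2],
    [3, 3, 2, 2],
])

# One binary plane mask per detected plane instance.
binary_masks = {
    int(plane_id): (label_mask == plane_id).astype(np.uint8)
    for plane_id in np.unique(label_mask) if plane_id != 0
}

print(sorted(binary_masks))     # [1, 2, 3]
print(binary_masks[2])          # 1 where pixels belong to plane 2, 0 elsewhere
```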
  • FIG. 7B is a block diagram of a depth-based MVS framework 740, in accordance with some embodiments.
  • the depth-based MVS framework 740 includes a multi-view stereo model 742 applied to process a pair of stereo images 744 (e.g., a source image 604 and a target image 606) to generate a depth map 746 of one of the stereo images directly.
  • the multi-view stereo model 742 includes a CNN.
  • Figure 7C is a block diagram of a planar multi-view stereo (MVS) system 234, in accordance with some embodiments.
  • the plane MVS system 234 employs slanted plane hypotheses 752 for plane-sweeping to build a plane MVS branch 754, which interacts with a plane detection branch 704.
  • the plane detection branch 704 detects one or more planes from the single input image 706, and generates one or more masks 710 identifying the one or more detected planes.
  • the plane MVS system 234 receives at least two stereo images 756 (e.g., a target image 606 and a source image 604), and processes the stereo images 756 to generate one or more plane parameters 708 for each of the one or more detected planes of the target image 606.
  • a source feature of the source image 604 is warped to the target image 606 (e.g., a pose of the target image 806) with respect to a plurality of slanted planes 752 identified in reference images via a differentiable homography computed from each slanted plane 752.
  • the plane parameters 708 are combined with the one or more plane masks 710 of the one or more detected planes to generate an instance-level depth map 756.
  • the plane detection branch 704 predicts a set of 2D plane masks 710 with corresponding semantic labels of the target image 606, and the plane MVS branch 754 receives the target and source images 606 and 604 as input and applies a slanted plane sweeping strategy to learn the plane parameters 708 without ambiguity. Specifically, plane sweeping is performed with a group of slanted plane hypotheses 752 to build a feature cost volume (e.g., 808 in Figure 8A) and regress per-pixel plane parameters (e.g., 814 in Figure 8A).
  • Soft pooling is further applied to obtain piece-wise plane parameters in view of a loss objective, thereby associating the plane MVS branch 754 and the plane detection branch 704 with each other. Learned uncertainties are applied to different loss terms to train such a multi-task learning system in a balanced way.
  • the plane MVS system 234 generalizes in new environments with different data distributions. Results of the plane masks 710 and plane parameters 708 are further improved with a finetuning strategy without ground truth plane annotations. In some embodiments, the plane MVS system 234 is trained in an end-to-end manner.
  • the reconstructed depth map 756 leverages multi-view geometry to reduce scale ambiguity and, by parsing planar structures, is geometrically smoother than a depth map 746 generated by a depth-based MVS system 740.
  • Experimental results across different indoor datasets demonstrate that the plane MVS system 234 outperforms the single-view plane reconstruction framework 700 and the learning-based MVS model 742.
  • FIG. 8A is a block diagram of another example plane MVS branch 754 of a plane MVS system 234 (Figure 7C), in accordance with some embodiments.
  • the plane MVS branch 754 generates a pixel-wise plane map 802 of a plane parameter 708 from a plurality of input images (e.g., a source image 604 and a target image 606).
  • the plane MVS system 234 obtains the source image 604 captured by a first camera at a first camera pose and a target image 606 captured by a second camera at a second camera pose.
  • Each of the first and second camera poses optionally includes a camera position and a camera orientation, and the first and second camera poses are distinct and different from each other.
  • the plane MVS system 234 obtains information of a plurality of slanted planes 752 (i.e., slanted plane hypotheses), and generates a feature cost volume 808 of the target image 606 with reference to the plurality of slanted planes 752.
  • the feature cost volume 808 of the target image 606 is processed by a regularization and regression network 812 to generate the pixel-wise plane map 802 of the plane parameter 708.
  • the feature cost volume 808 is applied to generate a plurality of plane parameters 708, and each plane parameter 708 corresponds to a respective pixel-wise plane map 802 including a plurality of pixel-based plane parameters 814.
  • the plurality of pixel-based plane parameters 814 are pooled to generate an instance plane parameter 816.
  • the instance plane parameter 816 indicates information including, but not limited to, a normal direction, a plane location, and a plane shape.
  • the plane MVS system 234 generates a plane probability volume 820 from the feature cost volume 808.
  • the plane probability volume 820 has a plurality of elements each of which indicates a probability of each pixel in the target image 606 being located on a respective one of the plurality of slanted planes 752.
  • the plurality of pixel-based plane parameters 814 of each pixel-wise plane map 802 are generated from the plane probability volume 820, e.g., using the regularization and regression network 812.
  • each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606.
  • the plane MVS system 234 has the predefined slanted planes 752 and performs homography warping of the source feature map 804 into the target image 606 (e.g., a pose of the target image 806) using the slanted planes 752 and a relative pose between the first and second cameras.
  • the warped source features 804 are concatenated with the target features 806 along a channel dimension to form the feature cost volume 808.
  • a series of 3D convolution layers are applied to transform the feature cost volume 808 into the plane probability volume 820.
  • a soft-argmin operation is performed to convert the plane probability volume 820 to the pixel-wise plane map 802.
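  • A minimal sketch of this regularization and soft-argmin step is shown below, assuming a cost volume of shape (batch, channels, hypotheses, height, width) and a hypothesis set of 3-vector slanted plane parameters; the layer sizes and channel counts are illustrative assumptions rather than the patent's configuration:

```python
import torch
import torch.nn as nn

class CostVolumeRegularizer(nn.Module):
    """Turn a feature cost volume into a plane probability volume and a
    pixel-wise plane map via a soft (probability-weighted) argmin/argmax."""

    def __init__(self, in_channels: int):
        super().__init__()
        # A small stack of 3D convolutions stands in for the encoder-decoder
        # regularization network described above.
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_channels, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(16, 1, 3, padding=1),
        )

    def forward(self, cost_volume, hypotheses):
        # cost_volume: (B, C, D, H, W); hypotheses: (D, 3) slanted-plane vectors.
        logits = self.conv3d(cost_volume).squeeze(1)        # (B, D, H, W)
        prob = torch.softmax(logits, dim=1)                 # plane probability volume
        # Probability-weighted sum of the hypotheses at every pixel.
        plane_map = torch.einsum('bdhw,dc->bchw', prob, hypotheses)  # (B, 3, H, W)
        return prob, plane_map

# Usage with random stand-in tensors.
B, C, D, H, W = 1, 8, 16, 24, 32
reg = CostVolumeRegularizer(in_channels=C)
prob, plane_map = reg(torch.randn(B, C, D, H, W), torch.randn(D, 3))
```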
  • the plane MVS system 234 includes a backbone network 810 configured to generate the source feature 804 from the source image 604 and the target feature 806 from the target image 606.
  • the source feature 804 is generated from the source image 604 using a CNN.
  • a first resolution (H × W) of the source image 604 is reduced to a second resolution (H/S × W/S) of the source feature 804 according to a scaling factor S, before the source image 604 is converted to the source feature 804.
  • the plane MVS system 234 further passes a feature f₀ corresponding to each of the source feature 804 and target feature 806 into a dimension-reduction layer and an average pooling layer to get a reduced feature representation f₀′, thereby balancing memory consumption and accuracy.
  • the reduced feature representation f₀′ serves as the feature 804 or 806 in the plane MVS system 234.
  • each of the source feature 804 and target feature 806 includes respective two or more levels of features.
  • FIG. 8B is a flow diagram of an example process 850 for refining a pixel-wise plane map 802 including a plurality of pixel-based plane parameters 814, in accordance with some embodiments.
  • the plane MVS branch 754 in Figure 8A generates a pixel-wise plane map 802 of a plane parameter 708 from a source image 604 and a target image 606.
  • the plane MVS branch 754 further includes a refinement module (e.g., 904 in Figure 9) having a refinement network 852.
  • the pixel-wise plane map 802 is concatenated (854) with the target image 606 to generate a concatenated map.
  • the concatenated map is processed by the refinement network 852 to generate residual parameters 860 (δP′).
  • the residual parameters 860 are combined with the pixel-based plane parameters 814 to form a plurality of pixel-based refined parameters 814’, which form a pixel-wise refined plane map 802’.
  • the refinement network 852 includes a residual neural network (ResNet).
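  • One way such a residual refinement can be realized is sketched below (the channel counts and network depth are assumptions): the initial plane map is concatenated with the RGB target image, a small 2D CNN predicts a residual, and the residual is added back.

```python
import torch
import torch.nn as nn

class PlaneRefiner(nn.Module):
    """Predict a residual correction for an initial pixel-wise plane map."""

    def __init__(self):
        super().__init__()
        # Input: 3 plane-parameter channels + 3 RGB channels.
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, plane_map, target_image):
        # plane_map: (B, 3, H, W) initial parameters; target_image: (B, 3, H, W).
        x = torch.cat([plane_map, target_image], dim=1)  # concatenate along channels
        residual = self.net(x)                           # predicted residual (delta P')
        return plane_map + residual                      # refined plane map

refiner = PlaneRefiner()
refined = refiner(torch.randn(1, 3, 48, 64), torch.randn(1, 3, 48, 64))
```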
  • FIG. 9 is a block diagram of an example plane MVS system 234, in accordance with some embodiments.
  • the Plane MVS system 234 includes a plane stereo module 902, a plane refinement module 904, and a plane detection branch 704.
  • the plane stereo module 902 and the plane refinement module 904 form a plane MVS branch 754.
  • the plane stereo module 902 predicts a pixel-wise plane map 802 including a plurality of pixel-based plane parameters 814 via a feature cost volume 808 (Figure 8A).
  • the plane refinement module 904 refines the pixel-wise plane map 802, e.g., by a residual refinement network 852 (Figure 8B).
  • the plane detection branch 704 predicts one or more plane bounding boxes and one or more corresponding plane masks 710. Each plane bounding box identifies a respective plane in the target image 606.
  • a soft pooling operation 908 is performed to pool the pixel-wise plane map 802 (which is optionally refined) into an instance-level plane parameter 816 corresponding to a respective detected plane.
  • each plane is identified by a binary plane mask 710.
  • the binary plane mask 710 includes a plurality of elements, and each element has one of two predefined values (e.g., 0, 1) and indicates whether one or more adjacent pixels belong to a respective plane.
  • a first number of planes are identified from the single input image 706.
  • a single plane mask 710 includes a plurality of elements, and each element has a value selected from the first number of predefined values.
  • Each predefined value corresponds to a distinct plane.
  • for each predefined value, corresponding elements correspond to a set of pixels that belong to the same one of the first number of planes identified in the single input image.
  • the plane detection branch 704 includes a PlaneRCNN 906 that identifies planes for a single input image (e.g., the target image 606).
  • the PlaneRCNN 906 is formed based on Mask-RCNN, a Deep Neural Network (DNN) that detects objects in the target image 606 and generates a segmentation mask for each object.
  • the PlaneRCNN 906 is applied to estimate 2D plane masks 710 (M).
  • the PlaneRCNN 906 applies a feature pyramid network (FPN) to extract an intermediate feature map from the target image 606 and adopts a two-stage detection framework to predict the 2D plane masks 710.
  • PlaneRCNN 906 includes a plurality of separate branches that also estimate 3D plane parameters 708.
  • the PlaneRCNN 906 further includes an encoder-decoder architecture that processes the intermediate feature map of the target image 606 to get a per-pixel depth map Dᵢ.
  • Instance-level plane features from ROI-Align are passed into a plane normal branch to predict plane normals N.
  • a refinement network 852 is further applied to refine plane masks 802, and a reprojection loss between neighboring views is applied to enforce multi-view geometry consistency during training.
  • given the predicted 2D plane masks 710 (M), the per-pixel depth map Dᵢ, and the plane normals N, a piecewise planar depth map Dₚ is reconstructed.
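  • As a rough, generic illustration of how such a piecewise planar depth map can be assembled from masks, a per-pixel depth map, and plane normals (this is not the patent's exact procedure), each plane's offset can be estimated from the masked depths and the depth re-rendered from the plane equation:

```python
import numpy as np

def piecewise_planar_depth(masks, depth, normals, K_inv, eps=1e-6):
    """masks: (P, H, W) binary plane masks; depth: (H, W) per-pixel depths;
    normals: (P, 3) unit plane normals; K_inv: (3, 3) inverse intrinsics."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ K_inv.T   # (H, W, 3) viewing rays
    d_planar = depth.astype(float).copy()
    for m, n in zip(masks, normals):
        pts = rays[m > 0] * depth[m > 0][:, None]   # back-projected 3D points on the plane
        offset = np.median(pts @ n)                 # robust plane offset, n^T X = offset
        denom = rays @ n                            # n^T K^-1 x for every pixel
        denom = np.where(np.abs(denom) < eps, eps, denom)
        d_planar[m > 0] = (offset / denom)[m > 0]   # depth induced by the fitted plane
    return d_planar
```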
  • the plane detection branch 704 does not estimate plane parameters 708 (e.g., plane normals N, depths Dᵢ and Dₚ), and the plane MVS branch 754 is configured to determine the plane parameters 708 that describe 3D plane geometry (e.g., N, Dᵢ, Dₚ).
  • the plane detection branch 704 does not perform plane refinement or determine any multi-view reprojection loss used in PlaneRCNN to conserve memory.
  • the plane detection branch 704 performs semantic label prediction for each plane instance to determine a corresponding semantic class. Each plane is associated with semantic plane annotations.
  • neural networks applied in the plane stereo module 902, plane refinement module 904, and plane detection branch 704, including the feature backbone 810 (Figure 8A), the regularization and regression network 812 (Figure 8A), the refinement network 852 (Figure 8B), and the PlaneRCNN network 906 (Figure 9), are trained jointly in an end-to-end manner.
  • the plane MVS system 234 is trained using a loss function L that combines the detection loss Ldet and the other loss terms, where wₜ is a learnable weight to automatically balance the contribution of each loss term (one possible weighting scheme is sketched after the list of terms below).
  • Ldet and the other loss terms are defined in terms of the following per-box and per-pixel quantities: pᵢ is a foreground class output for box i; a four-dimensional (4D) vector is applied in predicted regression for positive box i, together with a corresponding target regression; pᵢᵘ is an output for positive box i at true class u; a 4D vector is applied in predicted regression for positive box i at true class u; and a further output is produced for a foreground pixel i at true class u,
  • Pᵢ* is a 3D vector for a targeted plane at pixel i, and Pᵢ′ is a 3D vector for a predicted refined plane at pixel i (also called pixel-based plane parameters 814, which include a 3D vector for a predicted plane),
  • dᵢ is a predicted depth at pixel i,
  • dᵢ* is a target depth at pixel i, and
  • dᵢ′ is a predicted refined depth at pixel i.
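  • For illustration, learnable per-term weights of this kind are often realized with an uncertainty-style weighting; the following minimal PyTorch sketch assumes an exp(-w)·L + w form and three example loss terms, none of which are taken from the source:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine multiple task losses with learnable balancing weights.

    Each loss term L_k is scaled by exp(-w_k), and w_k is added as a
    regularizer so the network cannot drive every weight to zero.
    """

    def __init__(self, num_terms: int):
        super().__init__()
        # One learnable log-variance style weight per loss term.
        self.log_vars = nn.Parameter(torch.zeros(num_terms))

    def forward(self, losses):
        # losses: sequence of scalar tensors, e.g. [L_det, L_plane, L_depth].
        total = 0.0
        for k, loss_k in enumerate(losses):
            total = total + torch.exp(-self.log_vars[k]) * loss_k + self.log_vars[k]
        return total

# Usage sketch: combine detection, plane-parameter, and depth losses.
criterion = UncertaintyWeightedLoss(num_terms=3)
l_det = torch.tensor(1.2)      # placeholder scalar losses
l_plane = torch.tensor(0.7)
l_depth = torch.tensor(0.4)
total_loss = criterion([l_det, l_plane, l_depth])
```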
  • FIG. 10 is a flow diagram of an example process implemented by a client device 104 to generate plane parameters from a plurality of input images (e.g., a source image 604 and a target image 606), in accordance with some embodiments.
  • the source image 604 is captured by a first camera at a first camera pose
  • the target image 606 is captured by a second camera at a second camera pose distinct from the first camera pose.
  • the client device 104 executes a plane MVS system 234 that obtains information of a plurality of slanted planes 752 and generates a feature cost volume 808 of the target image 606 with reference to the plurality of slanted planes 752.
  • a plane probability volume 820 is generated from the feature cost volume 808, and has a plurality of elements each of which indicates a probability of each pixel in the target image 606 being located on a respective one of the plurality of slanted planes.
  • the plane MVS system 234 generates a plurality of pixel-based plane parameters 814 (e.g., forming a plane map 802) from the plane probability volume 820 and the information of the plurality of slanted planes 752.
  • each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606.
  • the plurality of pixel-based plane parameters 814 are pooled to generate an instance plane parameter 816.
  • the plane MVS system 234 of the client device 104 applies slanted plane hypotheses 752 to perform plane sweeping and determine pixel-based plane parameters 814.
  • the plane MVS system 234 obtains a dataset including a plurality of reference images, and identifies the plurality of slanted planes 752 from the plurality of reference images.
  • the information of the plurality of slanted planes 752 includes a respective plane normal and a respective location of each slanted plane 752.
  • the representation of differentiable homography uses the slanted plane hypotheses 752.
  • the pixel-level plane parameter p = nᵀ/d is a non-ambiguous representation for a plane, obtained by employing slanted plane-sweeping in the plane MVS system 234.
  • the plane MVS system 234 includes a set of three-dimensional slanted plane hypotheses nᵀ/d.
  • the number of candidate planes that pass through a 3D point is infinite.
  • the appropriate hypothesis range is determined for each dimension of nᵀ/d.
  • the client device 104 randomly samples a number of (e.g., 10,000) training images and plots a distribution for every axis of the ground truth plane nᵀ/d, which reflects the general distribution of plane parameters 816 in various scenes.
  • the client device 104 selects upper and lower bounds for each axis by ensuring a predefined portion (e.g., 90%) of ground truth values lie within a range defined by the selected bounds.
  • the client device 104 samples the slanted plane hypotheses 752 uniformly between the bounds along every axis.
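  • One plausible implementation of this bound selection and uniform sampling is sketched below (the per-axis hypothesis count, the 90% coverage fraction, and the combination of per-axis samples into a full hypothesis set are assumptions):

```python
import numpy as np
from itertools import product

def slanted_plane_hypotheses(gt_planes, per_axis=8, coverage=0.9):
    """gt_planes: (N, 3) ground-truth n^T/d vectors collected from randomly
    sampled training images. Returns (per_axis**3, 3) slanted plane hypotheses."""
    lo_q = (1.0 - coverage) / 2.0                       # e.g. the 5th percentile
    hi_q = 1.0 - lo_q                                   # e.g. the 95th percentile
    lower = np.quantile(gt_planes, lo_q, axis=0)        # per-axis lower bound
    upper = np.quantile(gt_planes, hi_q, axis=0)        # per-axis upper bound
    axes = [np.linspace(lower[k], upper[k], per_axis) for k in range(3)]
    # Every combination of per-axis samples becomes one hypothesis.
    return np.array(list(product(*axes)))

# Usage sketch with random stand-in data in place of real annotations.
gt = np.random.randn(10000, 3)
hyps = slanted_plane_hypotheses(gt)
print(hyps.shape)   # (512, 3)
```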
  • the plane MVS system 234 warps the source feature map 804 into the target image 606 (e.g., a pose of the target image 806) by equation (12). For every slanted plane hypothesis 752, the plane MVS system 234 concatenates the warped source feature 804 and target feature 806. The features are stacked along a hypothesis dimension to build a feature cost volume 808. Stated another way, the plane MVS system 234 warps the source feature 804 to the target image 606 (e.g., a pose of the target image 806) with respect to the plurality of slanted planes via a differentiable homography computed from each slanted plane 752. For each slanted plane 752, the warped source feature and the target feature are combined to generate a respective comprehensive feature. In some embodiments, comprehensive features are combined to generate the feature cost volume 808 of the source and target features 804 and 806.
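  • For a plane parameterized as p = nᵀ/d in the target frame (so that pᵀX = 1 for points on the plane) and a relative pose X_src = R·X_tgt + t, a plane-induced homography from target pixels to source pixels can be written as H = K(R + t·pᵀ)K⁻¹. The sketch below warps a source feature map for one slanted plane hypothesis; shared intrinsics across views and this sign convention are assumptions, not the patent's stated conventions.

```python
import torch
import torch.nn.functional as F

def warp_source_feature(src_feat, K, R, t, plane, eps=1e-6):
    """Warp a source feature map into the target view for one slanted plane.

    src_feat: (B, C, H, W) source feature map.
    K:        (3, 3) intrinsics (assumed shared by both views, at feature scale).
    R, t:     relative pose mapping target-frame points to source-frame points.
    plane:    (3,) slanted plane hypothesis p = n/d, so p^T X_tgt = 1 on the plane.
    """
    B, C, H, W = src_feat.shape
    # Plane-induced homography from target pixels to source pixels.
    Hmat = K @ (R + t.reshape(3, 1) @ plane.reshape(1, 3)) @ torch.linalg.inv(K)
    # Pixel grid of the target view in homogeneous coordinates.
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing='ij')
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)   # (3, H*W)
    src_pix = Hmat @ pix                                                  # (3, H*W)
    src_pix = src_pix[:2] / (src_pix[2:3] + eps)   # perspective divide (no behind-camera handling)
    # Convert to grid_sample's [-1, 1] coordinate range.
    gx = 2.0 * src_pix[0] / (W - 1) - 1.0
    gy = 2.0 * src_pix[1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2).expand(B, H, W, 2)
    return F.grid_sample(src_feat, grid, align_corners=True)

# Usage sketch with an identity pose and one hypothetical slanted plane.
feat = torch.randn(1, 8, 24, 32)
K = torch.tensor([[30., 0., 16.], [0., 30., 12.], [0., 0., 1.]])
warped = warp_source_feature(feat, K, torch.eye(3), torch.zeros(3),
                             torch.tensor([0., 0., 0.5]))
```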
  • a regularization and regression network 812 is applied to regularize the feature cost volume 808 to generate a plane probability volume 820.
  • the regularization and regression network 812 includes an encoder-decoder architecture with 3D CNN layers.
  • the plane MVS system 234 applies a soft-argmax operation to get the initial pixel-based plane parameters 814.
  • for the plane hypothesis set C = {p₀, p₁, ...}, the 3D plane parameter pᵢ at pixel i is inferred as a probability-weighted sum of the hypotheses, pᵢ = Σⱼ Prob(i, j) · pⱼ, where Prob(i, j) is the element of the plane probability volume 820 for pixel i and hypothesis pⱼ.
  • the plane refinement module 904 is applied to learn the residual of the initial pixel-based plane parameters 814 with respect to ground truth.
  • the upsampled initial plane parameters 814 (P’) are concatenated with the target image 606 to preserve image details and passed into the refinement network 852, which optionally includes a plurality of 2D CNN layers, to predict a residual parameter 860 (δP’) (shown in Figure 8B).
  • the plurality of pixel-based plane parameters 814 are refined based on the target image 606, e.g., using a refinement network 852, to generate a plurality of pixel-based refined parameters 814’ that forms a refined plane map 802’.
  • Each pixel-based refined parameter 814’ includes a plane refined parameter vector of a respective pixel of the target image 606.
  • the plurality of pixel-based plane parameters 814 are refined based on the target image 606 to generate a plurality of pixel-based residual parameters 860 (δP’), which are combined with the pixel-based plane parameters 814 to generate the refined plane map 802’ including the plurality of pixel-based refined parameters 814’.
  • the plurality of pixel-based refined parameters 814’ is pooled to generate an instance refined parameter 816’, which is further applied to generate the instance-level planar depth map 756’ with refinement.
  • An example of the refinement network 852 includes a 2D CNN.
  • the instance plane parameter 816 or refined plane parameter 816’ identifies a corresponding plane accurately.
  • the client device 104 executes an extended reality application 225 (e.g., an AR application) and renders a virtual object on the 3D plane.
  • the plane detection branch 704 of the client device 104 generates a plane mask 710 (m) from the target image 606 using a plane mask model 906.
  • the plurality of pixel-based plane parameters 814 of the pixel-wise plane map 802 are generated from the source image 604 and the target image 606 using a plane MVS model 1010.
  • each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606.
  • the plane MVS model 1010 is a combination of a backbone network 810 and a regularization and regression network 812.
  • the plurality of pixel-based plane parameters 814 are pooled based on the plane mask 710 to generate an instance plane parameter 816 (P), which is converted to an instance-level planar depth map 756 (Dᵢ) corresponding to the target image 606.
  • the plane mask 710 (m) includes a plurality of elements, and each element mᵢ at a pixel i indicates a foreground probability of the pixel belonging to the detected plane.
  • the instance plane parameter 816 (P) is obtained by weighted averaging of the pixel-based plane parameters 814, using the mask elements mᵢ as weights, i.e., P = (Σᵢ mᵢ pᵢ) / (Σᵢ mᵢ).
  • the instance-level planar depth map 756 (Dᵢ) is reconstructed from the instance plane parameter as Dᵢ = 1 / (Pᵀ K⁻¹ xᵢ) for foreground pixels, where F is an indicator variable to identify foreground pixels of a corresponding plane.
  • a foreground threshold (e.g., equal to 0.5) is applied on each element mᵢ of the plane mask 710 (m) to determine whether a corresponding pixel i is identified as foreground, i.e., a pixel of a detected plane.
  • K⁻¹ is an inverse intrinsic matrix, and xᵢ is the homogeneous coordinate of pixel i.
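  • A minimal NumPy sketch of the soft pooling and the instance-level depth reconstruction, assuming the weighted-average pooling and the depth relation Dᵢ = 1/(PᵀK⁻¹xᵢ) described above and the 0.5 foreground threshold mentioned earlier, is:

```python
import numpy as np

def soft_pool_and_depth(plane_map, mask, K_inv, fg_thresh=0.5, eps=1e-6):
    """plane_map: (H, W, 3) pixel-based plane parameters p = n/d.
    mask: (H, W) predicted foreground probabilities for one detected plane.
    K_inv: (3, 3) inverse camera intrinsics.
    Returns the pooled instance plane parameter and an instance-level depth map
    that is non-zero only on foreground pixels of the plane."""
    # Soft pooling: probability-weighted average of the per-pixel parameters.
    w = mask[..., None]
    instance_param = (w * plane_map).sum(axis=(0, 1)) / (w.sum() + eps)   # (3,)

    # Instance-level planar depth: D_i = 1 / (P^T K^-1 x_i) on foreground pixels.
    H, W = mask.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = np.stack([u, v, np.ones_like(u)], axis=-1)        # homogeneous pixel coordinates
    denom = (x @ K_inv.T) @ instance_param                # P^T K^-1 x per pixel
    depth = 1.0 / np.where(np.abs(denom) < eps, eps, denom)
    depth[mask < fg_thresh] = 0.0                         # keep foreground pixels only
    return instance_param, depth
```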
  • the plurality of pixel -based plane parameters 814 of the plane map 802 is generated from the feature cost volume 808 using a 3D CNN 812.
  • the plurality of pixel-based plane parameters 814 is refined using a 2D CNN 852.
  • the plurality of pixel-based plane parameters 814 is pooled using a first pooling network.
  • the plurality of pixel-based refined parameters 814’ is pooled using a second pooling network.
  • a plane mask 710 is generated from the target image 606 using a PlaneRCNN 906 (Figure 9).
  • the 3D CNN 812, 2D CNN 852, first pooling network, second pooling network, and PlaneRCNN are trained end-to-end.
  • the backbone network 810, 3D CNN 812, 2D CNN 852, first pooling network, second pooling network, and PlaneRCNN are trained end-to-end.
  • FIG. 11 is a flow diagram of an example process 1100 implemented by a client device 104 to form a stitched depth map 756S, in accordance with some embodiments.
  • the client device 104 forms the stitched depth map 756S for the target image 606 by filling planar pixels with the instance-level planar depth map 756 (Dᵢ) determined from the instance plane parameter 816 (P).
  • the client device 104 generates a plurality of pixel-based plane parameters 814A of a plane map 802A, which is associated with a first plane of a first portion of the target image 606 and pooled to generate a first instance plane parameter 816A.
  • the client device 104 generates a plurality of pixel-based plane parameters 814B of a plane map 802B, which is associated with a second plane of a second portion of the target image 606 and pooled to generate a second instance plane parameter 816B.
  • the first and second instance plane parameters 816A and 816B are converted to a first instance-level planar depth map 756A and a second instance-level planar depth map 756B, respectively.
  • a first subset of planar pixels of a stitched instance-level planar depth map 756S of the target image 606 is filled with the first instance-level planar depth map 756A and the second instance-level planar depth map 756B.
  • the subset of planar pixels of the stitched instance-level planar depth map 756S correspond to the first and second planes of the target image 606.
  • the client device 104 fills a subset of non-planar pixels with a reconstructed pixel-wise planar depth map.
  • the client device 104 reconstructs a third pixel-level planar depth map 756C corresponding to an area of the target image 606 that does not correspond to any plane, e.g., using a deep learning model.
  • a second subset of non-planar pixels of the stitched instance-level planar depth map 756S is filled with the third pixel-level planar depth map 756C.
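  • A simple, illustrative way to assemble the stitched map is to paste each instance-level planar depth map into its plane's pixels and fall back to a pixel-wise depth prediction elsewhere (the array sizes and depth values below are hypothetical):

```python
import numpy as np

def stitch_depth(plane_depths, plane_masks, nonplanar_depth):
    """plane_depths: list of (H, W) instance-level planar depth maps.
    plane_masks:  list of (H, W) boolean masks, one per detected plane.
    nonplanar_depth: (H, W) pixel-wise depth used where no plane is detected."""
    stitched = nonplanar_depth.copy()          # start from the non-planar fallback
    for depth, mask in zip(plane_depths, plane_masks):
        stitched[mask] = depth[mask]           # fill planar pixels plane by plane
    return stitched

# Usage sketch with two hypothetical planes on a 4x4 image.
H, W = 4, 4
d1, d2 = np.full((H, W), 1.0), np.full((H, W), 2.0)
m1 = np.zeros((H, W), dtype=bool); m1[:2] = True
m2 = np.zeros((H, W), dtype=bool); m2[2:, :2] = True
print(stitch_depth([d1, d2], [m1, m2], np.full((H, W), 3.0)))
```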
  • Figure 12 is a flow diagram of an example image processing method 1200, in accordance with some embodiments.
  • the method 1200 is described as being implemented by an electronic device (e.g., a mobile phone 104C).
  • Method 1200 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 12 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1200 may be combined and/or the order of some operations may be changed.
  • the electronic device obtains a source image 604 captured by a first camera at a first camera pose (1202), a target image 606 captured by a second camera at a second camera pose (1204), and information of a plurality of slanted planes 752 (1206).
  • a feature cost volume 808 of the target image 606 is generated (1208) with reference to the plurality of slanted planes 752, and a plane probability volume 820 is generated (1210) from the feature cost volume 808.
  • the plane probability volume 820 has a plurality of elements each of which indicates a probability of each pixel in the source image 604 being located on a respective one of the plurality of slanted planes 752.
  • the electronic device generates (1212) a plurality of pixel-based plane parameters 814 from the plane probability volume 820.
  • each pixel-based plane parameter (1214) includes a respective 3D plane parameter vector of a respective pixel of the target image 606.
  • the pixel-based plane parameters 814 are generated based on the information of the plurality of slanted planes 752.
  • the electronic device pools (1216) the plurality of pixel-based plane parameters 814 to generate an instance plane parameter 816.
  • the electronic device generates a source feature 804 and a target feature 806 from the source image 604 and the target image 606, respectively.
  • the source feature 804 is warped to the target image 606 (e.g., a pose of the target image 806) with respect to the plurality of slanted planes 752 via a differentiable homography computed from each slanted plane.
  • the electronic device combines the warped source feature 804 and the target feature 806 to generate a comprehensive feature.
  • the source feature 804 is warped according to each of the plurality of slanted planes 752, e.g., using homography, and each warped source feature 804 is concatenated with the target feature 806 to form a comprehensive feature. Comprehensive features corresponding to the plurality of slanted planes 752 are organized to form the feature cost volume 808. Further, in some embodiments, the source feature 804 is generated from the source image 604 using a convolutional neural network (CNN). The source feature 804 is generated from the source image 604 by reducing a first resolution of the source image 604 to a second resolution of the source feature 804 according to a scaling factor S.
  • CNN convolutional neural network
  • the first resolution is H × W, and the second resolution is H/S × W/S.
  • based on the comprehensive feature, the electronic device generates the feature cost volume 808 of the target feature 806 with reference to the plurality of slanted planes 752.
  • the electronic device converts the feature cost volume 808 to the plane probability volume 820 using 3D CNN layers.
  • the electronic device applies an encoder-decoder architecture with 3D CNN layers to regularize the feature cost volume 808.
  • the electronic device uses a single 3D CNN layer with softmax activation to transform the feature cost volume 808 into a plane probability volume 820.
  • the electronic device refines the plurality of pixel-based plane parameters 814 based on the target image 606 to generate a plurality of residual parameters 860.
  • the plurality of pixel-based plane parameters 814 and the plurality of pixel-based residual parameters 860 are combined to generate a plurality of pixel-based refined parameters 814’.
  • Each pixel-based refined parameter 814’ includes a plane refined parameter vector of a respective pixel of the target image 606.
  • the plurality of pixel-based refined parameters 814’ is pooled to generate an instance refined parameter 816’.
  • the electronic device generates a refined planar depth map 756’ based on the instance refined parameter 816’.
  • the plurality of pixel-based plane parameters 814 is generated from the feature cost volume 808 using a 3D CNN.
  • the plurality of pixel-based plane parameters 814 is refined using a 2D CNN.
  • the plurality of pixel-based plane parameters 814 is pooled using a first pooling network.
  • the plurality of pixel-based refined parameters is pooled using a second pooling network.
  • a plane mask 710 is generated from the target image 606 using a PlaneRCNN.
  • the 3D CNN, 2D CNN, first pooling network, second pooling network, and PlaneRCNN are trained end-to-end in a server 102 or the electronic device.
  • the electronic device obtains (1218) a dataset including a plurality of reference images, identifies (1220) the plurality of slanted planes 752 from the plurality of reference images, and determines (1222) the information of the plurality of slanted planes 752.
  • the information includes a respective plane normal and a respective location of each slanted plane 752.
  • the electronic device generates a plane mask 710 from one of the source and target images 604 and 606.
  • the plurality of pixel-based plane parameters 814 is pooled based on the plane mask 710 to generate the instance plane parameter 816.
  • the electronic device identifies a 3D plane based on the instance plane parameter 816.
  • An augmented reality application is executed on the electronic device, and a virtual object is rendered on the 3D plane in the augmented reality application.
  • the first camera and the second camera are (1224) the same camera, and the source and target images 604 and 606 are two distinct image frames in a sequence of image frames captured by the same camera.
  • the first and second camera poses are distinct from each other and correspond to two distinct time instants.
  • the first camera and the second camera are (1226) distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
  • Figure 13 is a flow diagram of an example image processing method 1300, in accordance with some embodiments.
  • the method 1300 is described as being implemented by an electronic device (e.g., a mobile phone 104C).
  • Method 1300 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 13 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic device obtains a source image 604 (1302) captured by a first camera at a first camera pose and a target image 606 (1304) captured by a second camera at a second camera pose.
  • a plane mask 710 is generated (1306) from the target image 606 using a plane mask model.
  • a plurality of pixel-based plane parameters 814 is generated (1308) from the source image 604 and the target image 606 using a plane multi-view stereo model 1010 (e.g., in Figure 10).
  • each pixel-based plane parameter includes (1310) a respective 3D plane parameter vector of a respective pixel of the target image 606.
  • the plurality of pixel-based plane parameters 814 is pooled (1312) based on the plane mask 710 to generate an instance plane parameter 816.
  • the electronic device generates (1314) an instance-level planar depth map 756 corresponding to the target image 606 based on the instance plane parameter 816.
  • the plane mask model includes (1316) a PlaneRCNN that is configured to detect and reconstruct a plurality of piecewise planar surfaces from a single input image including the target image 606.
  • PlaneRCNN employs a variant of Mask R-CNN to detect planes with plane parameters and segmentation masks. PlaneRCNN then jointly refines all the segmentation masks with a loss enforcing consistency with a nearby view during training.
  • the plane mask 710 includes (1318) a plurality of elements, and each element corresponds to a predicted foreground probability of a respective pixel.
  • the plurality of pixel-based plane parameters 814 is pooled by assigning (1320) a corresponding predicted foreground probability to each of the plurality of pixel-based plane parameters 814 as a respective weight and applying (1322) a weighted average operation on the plurality of pixel-based plane parameters 814 using respective weights to generate the instance plane parameter 816, e.g., based on equation (14).
  • the instance-level planar depth map 756 is generated by determining a pixel-based foreground indicator variable F that identifies a plurality of foreground pixels and determining the instance-level planar depth map 756 based on an inverse intrinsic matrix K⁻¹ for a pixel i as follows:
  • Dᵢ = 1 / (Pᵀ K⁻¹ xᵢ),
  • where Dᵢ is a depth value of the pixel i,
  • P is the instance plane parameter 816,
  • xᵢ is a homogeneous coordinate of the pixel i, and
  • the foreground indicator variable F indicates whether the pixel i is a foreground pixel.
  • the plane mask 710 includes a plurality of elements, and each element corresponds to a predicted foreground probability of a respective pixel. For each pixel, a respective element of the plane mask 710 is compared with a predefined foreground threshold to determine whether the pixel i is a foreground pixel and the corresponding foreground indicator variable F.
  • the instance plane parameter 816 includes a first instance plane parameter 816A associated with a first plane of a first portion of the target image 606.
  • the instance-level planar depth map 756 includes a first instance-level planar depth map 756A associated with the first plane of the first portion of the target image 606.
  • For a second plane associated with a second portion of the target image 606, the electronic device generates (1324) a second instance-level planar depth map 756B corresponding to the second portion of the target image 606 based on a second instance plane parameter 816B, and fills (1326) a first subset of planar pixels of a stitched instance-level planar depth map 756S of the target image 606 with the first instance-level planar depth map 756A and the second instance-level planar depth map 756B.
  • the subset of planar pixels of the stitched instancelevel planar depth map 756S corresponds to the first and second planes of the target image 606.
  • the electronic device reconstructs a third pixel-level planar depth map 756C corresponding to an area of the target image 606 that does not correspond to any plane (e.g., using a deep learning model) and fills a second subset of non-planar pixels of the stitched instance-level planar depth map 756S using the third pixel-level planar depth map 756C.
  • the electronic device, prior to pooling the plurality of pixel-based plane parameters 814, refines the plurality of pixel-based plane parameters 814 based on the target image 606 to generate a plurality of pixel-based residual parameters 860.
  • the plurality of pixel-based plane parameters 814 and the plurality of pixel-based residual parameters 860 are combined to generate a plurality of pixel-based refined parameters 814’.
  • Each pixel-based refined parameter 814’ includes a plane refined parameter vector of a respective pixel of the target image 606.
  • the plurality of pixel-based refined parameters 814’ is pooled to generate the instance plane parameter 816 with refinement, and the instance-level planar depth map 756' corresponding to the target image 606 is generated based on the instance plane parameter 816 with refinement.
  • the plurality of pixel-based plane parameters 814 is generated using the plane multi-view stereo model by obtaining information of a plurality of slanted planes 752 and generating a feature cost volume 808 of the target image 606 with reference to the plurality of slanted planes 752. Further, in some embodiments, a plane probability volume 820 is generated from the feature cost volume 808. The plane probability volume 820 has a plurality of elements each of which indicates a probability of each pixel in the source image 604 being located on a respective one of the plurality of slanted planes 752. Additionally, in some embodiments, the plurality of pixel-based plane parameters 814 is generated from the plane probability volume 820.
  • each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606.
  • the first camera and the second camera are the same camera, and the source and target images 604 and 606 are two distinct image frames in a sequence of image frames captured by the same camera, and wherein the first and second camera poses are distinct from each other and correspond to two distinct time instants.
  • the first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Abstract

This application is directed to plane detection using multiple images. An electronic device obtains a source image, a target image, and information of a plurality of slanted planes, and generates a feature cost volume of the target image with reference to the plurality of slanted planes. The electronic device generates a plane probability volume from the feature cost volume, and the plane probability volume has a plurality of elements each of which indicates a probability of each pixel in the source image being located on a respective one of the plurality of slanted planes. A plurality of pixel-based plane parameters are generated from the plane probability volume, and are pooled to generate an instance plane parameter. Each pixel-based plane parameter optionally includes a respective 3D plane parameter vector of a respective pixel of the source image.

Description

3D Semantic Plane Detection and Reconstruction from Multi-View Stereo (MVS) Images
RELATED APPLICATION
[0001] This application claims priority to U.S. Provisional Patent Application No. 63/234,618, titled “PlaneMVS: A Method for 3D Semantic Plane Detection Using MultiView Stereo Images,” filed on August 18, 2021, and U.S. Provisional Patent Application No. 63/285,927, titled “PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo,” filed on December 3, 2021, each of which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
[0002] This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for applying deep learning techniques to identify planes in images and determine plane parameters and depth information.
BACKGROUND
[0003] Three-dimensional (3D) planar structure detection and reconstruction from color images has been an important yet challenging problem in computer vision. It aims to detect piece-wise planar regions and predict corresponding 3D plane parameters from the color images. The predicted 3D plane parameters can be used in various applications such as robotics, augmented reality (AR), and indoor scene understanding. Some 3D plane detection or reconstruction solutions highly rely on some assumptions (e.g., Manhattan-world assumption) of the target scene and are not always robust in complicated real-world cases. Current practices apply a convolutional neural network (CNN) to a single image to estimate plane parameters directly. Given that only a single image is applied, depth-scale information is ambiguous, and the plane parameters cannot represent geometric information of corresponding planes accurately. It would be beneficial to develop systems and methods for detecting and reconstructing 3D planes accurately using deep learning techniques.
SUMMARY
[0004] Various embodiments of this application are directed to methods, systems, devices, non-transitory computer-readable media for detecting 3D semantic planes from multiple views. These 3D plane detection methods are applied in extended reality products, e.g., augmented reality (AR) glasses, AR applications executed by mobile phones. These products generate 3D semantic planes to make a virtual object placed in a scene merge into the scene and appear realistic. The multiple views applied in 3D plane detection are captured by two separate cameras having known relative positions with respect to each other or by a single camera at two distinct camera poses. Particularly, multi-view images are applied as input, and multi-view geometry is modeled by constructing a feature cost volume using the multi-view images. Multi-view based 3D plane detection is based on 3D plane hypotheses, and preserves information of input multi-view images by concatenating features of the multiview images. By these means, various embodiments of this application leverage multi-view geometry and constraints to identify geometrically accurate 3D semantic planes and depth information in a consistent manner.
[0005] In one aspect, an image processing method is implemented at an electronic device. The method includes obtaining a source image captured by a first camera at a first camera pose and obtaining a target image captured by a second camera at a second camera pose. The method further includes obtaining information of a plurality of slanted planes, generating a feature cost volume of the target image with reference to the plurality of slanted planes, and generating a plane probability volume from the feature cost volume. The plane probability volume has a plurality of elements each of which indicates a probability of each pixel in the source image being located on a respective one of the plurality of slanted planes. The method further includes generating a plurality of pixel-based plane parameters from the plane probability volume and pooling the plurality of pixel-based plane parameters to generate an instance plane parameter. In an example, each pixel-based plane parameter includes a respective 3D plane parameter vector of a respective pixel of the target image. [0006] In some embodiments, the first camera and the second camera are the same camera, and the source and target images are two distinct image frames in a sequence of image frames captured by the same camera, and wherein the first and second camera poses are distinct from each other and correspond to two distinct time instants. Alternatively, in some embodiments, the first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
[0007] In yet another aspect, an image processing method is implemented by an electronic device. The method includes obtaining a source image captured by a first camera at a first camera pose, obtaining a target image captured by a second camera at a second camera pose, and generating a plane mask from the target image using a plane mask model. The method further includes generating a plurality of pixel-based plane parameters from the source image and the target image using a plane multi -view stereo model. In an example, each pixel-based plane parameter includes a respective 3D plane parameter vector of a respective pixel of the target image. The method further includes pooling the plurality of pixel-based plane parameters based on the plane mask to generate an instance plane parameter and generating an instance-level planar depth map corresponding to the target image based on the instance plane parameter. In some embodiments, the plane mask model includes a plane region-based convolutional neural network (PlaneRCNN) that is configured to detect and reconstruct a plurality of piecewise planar surfaces from a single input image including the target image.
[0009] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0010] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0011] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0013] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0014] Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments. [0015] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
[0016] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0017] Figure 5 is a flowchart of a process for processing inertial sensor data and image data of an electronic system using a SLAM module, in accordance with some embodiments.
[0018] Figure 6A illustrates an input and output interface of an example planar multiview stereo (MVS) system, in accordance with some embodiments.
[0019] Figure 6B is a flow diagram of a process for rendering a virtual object in a scene, in accordance with some embodiments.
[0020] Figure 7A is a block diagram of a single view plane reconstruction framework 700, in accordance with some embodiments.
[0021] Figure 7B is a block diagram of a depth-based MVS framework, in accordance with some embodiments.
[0022] Figure 7C is a block diagram of a plane MVS system, in accordance with some embodiments.
[0023] Figure 8A is a block diagram of another example plane MVS branch of a plane MVS system, in accordance with some embodiments.
[0024] Figure 8B is a flow diagram of an example process for refining a pixel-wise plane map including a plurality of pixel-based plane parameters, in accordance with some embodiments.
[0025] Figure 9 is a block diagram of an example plane MVS system, in accordance with some embodiments.
[0026] Figure 10 is a flow diagram of an example process implemented by a client device to generate plane parameters from a plurality of input images, in accordance with some embodiments.
[0027] Figure 11 is a flow diagram of an example process implemented by a client device 104 to form a stitched depth map, in accordance with some embodiments.
[0028] Figure 12 is a flow diagram of an example image processing method, in accordance with some embodiments. [0029] Figure 13 is a flow diagram of an example image processing method, in accordance with some embodiments.
[0030] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0031] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
[0032] Various embodiments of this application are directed to plane detection using multi-view stereo (PlaneMVS) imaging, in which planes are detected in a target image using a plurality of images including the target image. The plurality of images are optionally captured by distinct cameras from different camera poses (i.e., corresponding to different camera positions or orientations) or by a single camera moved among different camera poses. In an example, a PlaneMVS system takes a pair of images (e.g., a target image and a source image) as inputs and predicts two-dimensional (2D) plane masks and three-dimensional (3D) plane geometry of the target image. The input images are optionally a stereo pair of images captured by two cameras or two neighboring frames from a monocular sequence captured by a camera. The relative pose corresponding to the source and target images is assumed to be known. For a stereo pair of images, the relative pose of the cameras is estimated via a calibration process. For two neighboring frames from a sequence of images captured by the same camera, the relative pose of the camera is estimated, e.g., by simultaneous localization and mapping (SLAM). PlaneMVS leverages multi-view geometry for 3D semantic plane prediction, thereby making prediction of 3D planes geometrically more accurate than a single-image based method (e.g., PlaneRCNN). Additionally, PlaneMVS makes use of geometrical constraints in a neural network, and may be generalized across different datasets. [0033] A plane of the target image is associated with a plurality of plane parameters (e.g., a plane normal and a plane offset). The target image corresponds to a plane mask and a depth map indicating depth values at different pixels. In some embodiments, the target image is an RGB color image for which the plane parameters, depth map, and/or plane mask are predicted. A source image is an image that precedes or follows the target image in a sequence of image frames and is used to determine the plane parameters, depth map, and/or plane mask of the target image. In some embodiments, a PlaneRCNN is also used in a plane mask model for predicting the plane parameters, depth map, and/or plane mask of the target image. Further, in some embodiments, basic model architecture of the PlaneRCNN is derived from MaskRCNN, which is an object detection model. Each of PlaneRCNN and MaskRCNN includes a convolutional neural network (CNN), which is optionally a deep neural network (DNN). In some embodiments, the plane parameters, depth map, and/or plane mask of the target image are determined based on intrinsic and extrinsic parameters of a camera. The intrinsic parameters include an intrinsic parameter matrix of the camera, and are constant for a given dataset or data captured by the camera. The extrinsic parameters include a relative pose matrix of the camera for a given RGB image.
[0034] In some embodiments, PlaneMVS is implemented by an extended reality application that is executed by an electronic device (e.g., a head-mounted display that is configured to display extended reality content). Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and virtual objects. More specifically, AR is an interactive experience of a real -world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, e.g., across multiple sensory modalities including visual, auditory, haptic, somatosensory, and olfactory.
[0035] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0036] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely. [0037] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access
(TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0038] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the client device 104 obtains the content data
(e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
[0039] In some embodiments, both model training and data processing are implemented locally at each individual client device 104. The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104. The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
[0040] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the HMD 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including the user's voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses. The device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D. In some embodiments, the display of the HMD 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
[0041] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
[0042] Optionally, the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
[0043] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices), where in some embodiments, the user application(s) 224 include an extended reality application 225 configured to interact with a user and provide extended reality content;
• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
• Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
• Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., HMD 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes an SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data, and a Plane MVS system 234 for applying multiple images to determine pixel-based and instance plane parameters, plane masks of one or more planes, or depth maps;
• Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating extended reality content using images captured by the camera 260 based on information of camera poses and planes provided by the data processing module 230; and
• One or more databases 240 for storing at least data including one or more of:
o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 248 for training one or more data processing models 250;
o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 250 include an image retrieval model for implementing an image retrieval process 600 and a camera localization model for implementing a camera localization process 800;
o Pose data database 252 for storing pose data; and
o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104, and include the candidate images.
[0044] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
[0045] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0046] Figure 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 250 to the client device 104.
[0047] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 250 is trained according to the type of content data to be processed. The training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 250 is provided to the data processing module 228 to process the content data.
[0048] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0049] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 250 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0050] Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments, and Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 250 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s). As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. For example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).
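As an illustrative, non-limiting sketch of such a propagation function (Python with NumPy; the function and variable names are hypothetical and do not correspond to any figure element), a node output may be computed as a non-linear activation applied to a linear weighted combination of the node inputs:

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Propagation function of a single node: a non-linear activation (here a
    sigmoid) applied to a linear weighted combination of the node inputs."""
    z = np.dot(weights, inputs) + bias        # linear weighted combination
    return 1.0 / (1.0 + np.exp(-z))           # sigmoid activation

# Example: a node with four inputs combined by weights w1..w4.
x = np.array([0.5, -1.2, 0.3, 2.0])
w = np.array([0.1, 0.4, -0.3, 0.2])
print(node_output(x, w))
```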
[0051] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the layer(s) may include a single layer acting as both an input layer and an output layer. Optionally, the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0052] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 250 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.
[0053] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 250 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. For example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0054] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
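By way of a non-limiting illustration (a minimal PyTorch sketch; the two-layer network, training data, learning rate, and loss threshold are hypothetical), forward propagation, loss measurement, and backward propagation may be repeated until the loss satisfies a loss criterion:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer network; each Linear layer carries weights w and a bias b.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

inputs = torch.randn(32, 8)           # training data provided to the input layer
targets = torch.randn(32, 1)          # ground truth for each training data item

loss_threshold = 1e-3                 # predefined convergence condition
for step in range(10000):
    optimizer.zero_grad()
    outputs = model(inputs)           # forward propagation
    loss = loss_fn(outputs, targets)  # margin of error of the output
    if loss.item() < loss_threshold:
        break
    loss.backward()                   # backward propagation of the error
    optimizer.step()                  # adjust weights and biases to decrease the error
```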
[0055] Figure 5 is a flowchart of a process 500 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module 232, in accordance with some embodiments. The process 500 includes measurement preprocessing 502, initialization 504, local visual-inertial odometry (VIO) with relocation 506, and global pose graph optimization 508. In measurement preprocessing 502, an RGB camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (510) from the image data. An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (512) to provide data of a variation of device poses 540. In initialization 504, the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (514). A vision-only structure from motion (SfM) technique is applied (516) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
[0056] After initialization 504 and during relocation 506, a sliding window 518 and associated states from a loop closure 520 are used to optimize (522) a VIO. When the VIO corresponds (524) to a keyframe of a smooth video transition and a corresponding loop is detected (526), features are retrieved (528) and used to generate the associated states from the loop closure 520. In global pose graph optimization 508, a multi-degree-of-freedom (multiDOF) pose graph is optimized (530) based on the states from the loop closure 520, and a keyframe database 532 is updated with the keyframe associated with the VIO.
[0057] Additionally, the features that are detected and tracked (510) are used to monitor (534) motion of an object in the image data and estimate image-based poses 536, e.g., according to the image frame rate. In some embodiments, the inertial sensor data that are pre-integrated (512) may be propagated (538) based on the motion of the object and used to estimate inertial-based poses 540, e.g., according to a sampling frequency of the IMU 280. The image-based poses 536 and the inertial-based poses 540 are stored in the pose data database 252 and used by the module 230 to estimate and predict poses that are used by the pose-based rendering module 238. Alternatively, in some embodiments, the SLAM module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 536 to estimate and predict more poses 540 that are further used by the pose-based rendering module 238.
[0058] In SLAM, high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280. The imaging sensors (e.g., the RGB camera 260, a LiDAR scanner) provide image data desirable for pose estimation, and oftentimes operate at a lower frequency (e.g., 30 frames per second) and with a larger latency (e.g., 30 millisecond) than the IMU 280. Conversely, the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond). Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occurs after the image is rendered. ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames. In both SLAM and ATW, relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction. In some embodiments, the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use. The image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
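As a simplified, non-limiting sketch of such timestamp-based organization (Python rather than the STL containers mentioned above; the container and function names are hypothetical), buffered inertial samples can be looked up by timestamp range for synchronization with an image frame:

```python
from collections import deque
from bisect import bisect_right

imu_buffer = deque()     # (timestamp, imu_sample) tuples appended in time order
image_buffer = deque()   # (timestamp, image) tuples appended in time order

def imu_samples_between(t_start, t_end):
    """Return the buffered inertial samples with timestamps in (t_start, t_end],
    e.g., the samples captured between the previous image frame and the current one."""
    timestamps = [t for t, _ in imu_buffer]
    lo = bisect_right(timestamps, t_start)
    hi = bisect_right(timestamps, t_end)
    return list(imu_buffer)[lo:hi]
```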
[0059] Figure 6A illustrates an input and output interface 600 of an example planar multi-view stereo (MVS) system 234, in accordance with some embodiments. The plane MVS system 234 receives a pair of input images including a source image 604 and a target image 606, and predicts one or more plane masks 608 and plane geometry 610 (e.g., depth map) of the target image 606. The source image 604 is captured by a first camera at a first camera pose, and the target image 606 is captured by a second camera at a second camera pose. In some embodiments, the source image 604 and target image 606 are a stereo pair. The first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other. The source image 604 and target image 606 are captured at the same time instant or two distinct time instants. Alternatively, in some embodiments, the first camera and the second camera are the same camera 260, and the source and target images 604 and 606 are two distinct image frames in a sequence of image frames captured by the same camera. The first and second camera poses are distinct from each other and correspond to two distinct time instants. In an example, the source image 604 and target image 606 are two immediately successive frames in a sequence of images captured by the same camera 260. In an example, the source image 604 and target image 606 are separated by one or more frames in a sequence of images captured by the camera 260.
[0060] The first and second camera poses corresponding to the source and target images 604 and 606 differ by a relative pose. The relative pose is known, given the source and target images 604 and 606. For a stereo pair of cameras (i.e., the first and second cameras), the relative pose is optionally estimated via a calibration process and based on intrinsic and extrinsic parameters of the cameras. If the source and target images 604 and 606 are captured by the same camera 260, the relative pose is also estimated, e.g., by SLAM.
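For illustration only, assuming each camera pose is available as a 4 x 4 camera-to-world transform (the function name is hypothetical), the relative pose between the two views may be computed as follows:

```python
import numpy as np

def relative_pose(pose_source, pose_target):
    """Given 4x4 camera-to-world poses of the source and target cameras, return the
    4x4 transform mapping target-camera coordinates to source-camera coordinates
    (the relative pose between the two views)."""
    return np.linalg.inv(pose_source) @ pose_target
```

The rotation R and translation t used later for homography warping can then be read from the top-left 3 x 3 block and the last column of the returned transform.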
[0061] Figure 6B is a flow diagram of a process 650 for rendering a virtual object in a scene, in accordance with some embodiments. The process 650 is implemented jointly by a pose determination and prediction module 230, a pose-based rendering module 238, and an extended reality application 225 of a client device 104 (e.g., an HMD 104D). The pose determination and prediction module 230 includes a SLAM module 232 and a plane MVS system 234. The pose determination and prediction module 230 receives a sequence of image frames 612 including a source image 604 and a target image 606 that follows the source image 604. The SLAM module 232 maps the scene where the client device 104 is located and identifies a pose of the client device 104 within the scene using image data (e.g., the sequence of image frames 612) and IMU sensor data. The plane MVS system 234 identifies one or more planes in the scene based on the sequence of image frames 612, and determines plane parameters and depth information of the one or more planes. The pose-based rendering module 238 renders virtual objects on top of a field of view of the client device 104 or creates extended reality content using the image frames 612 captured by the camera 260.
Specifically, the virtual objects are overlaid on the one or more planes identified by the plane MVS system 234 to create the extended reality content.
[0062] In some embodiments, the extended reality application 225 includes an AR application configured to render AR content to a user. The AR application is executed jointly with the SLAM module 232, plane MVS system 234, and pose-based rendering module 238. In an example, the virtual objects are overlaid on the one or more planes identified by the plane MVS system 234 in a physical world to create the AR content. Neural networks are trained and deployed on the client device 104 or on a server 102. Examples of the client device 104 include hand-held devices (e.g., a mobile phone 104C) and wearable devices (e.g., the HMD 104D). As the sequence of image frames 612 are streamed from the client device 104, the plane MVS system 234 identifies a set of 3D planes and associated semantic labels, and the SLAM module 232 tracks a pose of the client device 104. The virtual objects are seamlessly placed on top of 3D planes. In some situations, a virtual object moves around. In some situations, a virtual object is anchored at a location. The pose-based rendering module 238 reconstructs each virtual object via 3D planes and manages occlusion and dis-occlusion between real and virtual objects for the AR application.
[0063] In some embodiments, virtual objects are placed on planes based on corresponding semantic classes. For example, in an AR shopping application, floor lamps are automatically placed on floor planes (i.e., not on desk planes). Stated another way, in accordance with a determination that a semantic class of a first plane is a predefined first class associated with a first virtual object, the pose-based rendering module 238 renders the first virtual object on the first plane and causes the first virtual object to be overlaid on the first plane in the AR application. Conversely, in accordance with a determination that a semantic class of a second plane is not the predefined first class associated with the first virtual object, the pose-based rendering module 238 aborts rendering the first virtual object on the second plane, and the first virtual object is not overlaid on the second plane in the AR application.
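As a non-limiting sketch of this class-conditioned placement rule (Python; the class names and the render_on_plane callback are hypothetical), a virtual object is rendered on a detected plane only when the plane's semantic class matches a class associated with the object:

```python
# Hypothetical association between virtual objects and allowed plane classes.
ALLOWED_PLANE_CLASSES = {
    "floor_lamp": {"floor"},
    "picture_frame": {"wall"},
}

def maybe_place(virtual_object, plane_mask, plane_class, render_on_plane):
    """Render the object only if the plane's semantic class is allowed for it."""
    if plane_class in ALLOWED_PLANE_CLASSES.get(virtual_object, set()):
        render_on_plane(virtual_object, plane_mask)   # overlay the object on the plane
        return True
    return False                                      # abort rendering on this plane
```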
[0064] Figure 7A is a block diagram of a single view plane reconstruction framework 700, in accordance with some embodiments. The single view plane reconstruction framework 700 includes a plane geometry branch 702 and a plane detection branch 704, and is configured to receive a single input image 706 and generate one or more plane parameters 708 and one or more plane masks 710. The plane detection branch 704 detects one or more planes from the single input image 706, and the plane geometry branch 702 determines the one or more plane parameters 708 including a plane normal and a plane offset for each of the detected planes. In some embodiments, each plane is identified by a binary plane mask 710. For each plane, the binary plane mask 710 includes a plurality of elements, and each element has one of two predefined values (e.g., 0, 1) and indicates whether one or more adjacent pixels belongs to a respective plane. Alternatively, in some embodiments, a first number of planes are identified from the single input image 706. A single plane mask 710 includes a plurality of elements, and each element has a value selected from the first number of predefined values. Each predefined value corresponds to a distinct plane. For each predefined value, corresponding elements correspond to a set of pixels that belong to the same one of the first number of planes identified in the single input image.
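For illustration (a NumPy sketch; the array and function names are hypothetical), the two mask representations described above, per-plane binary masks and a single multi-valued mask, can be converted into each other as follows:

```python
import numpy as np

def binary_masks_to_label_map(binary_masks):
    """Stack of K binary masks (K, H, W) -> single mask (H, W) whose values
    identify background (0) or one of the K detected planes (1..K)."""
    label_map = np.zeros(binary_masks.shape[1:], dtype=np.int32)
    for k, mask in enumerate(binary_masks, start=1):
        label_map[mask > 0] = k
    return label_map

def label_map_to_binary_masks(label_map, num_planes):
    """Single multi-valued mask (H, W) -> stack of binary masks (K, H, W)."""
    return np.stack([(label_map == k).astype(np.uint8)
                     for k in range(1, num_planes + 1)])
```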
[0065] Figure 7B is a block diagram of a depth-based MVS framework 740, in accordance with some embodiments. The depth-based MVS framework 740 includes a multi-view stereo model 742 applied to process a pair of stereo images 744 (e.g., a source image 604 and a target image 606) to generate a depth map 746 of one of the stereo images directly. In some embodiments, the multi-view stereo model 742 includes a CNN.
[0066] Figure 7C is a block diagram of a planar multi-view stereo (MVS) system 234, in accordance with some embodiments. The plane MVS system 234 employs slanted plane hypotheses 752 for plane-sweeping to build a plane MVS branch 754, which interacts with a plane detection branch 704. The plane detection branch 704 detects one or more planes from the single input image 706, and generates one or more masks 710 identifying the one or more detected planes. The plane MVS system 234 receives at least two stereo images 756 (e.g., a target image 606, a source image 604), and processes the stereo images 756 to generate one or more plane parameters 708 for each of the one or more detected planes of the target image 606. A source feature of the source image 604 is warped to the target image 606 (e.g., a pose of the target image 806) with respect to a plurality of slanted planes 752 identified in reference images via a differentiable homography computed from each slanted plane 752. In some embodiments, the plane parameters 708 are combined with the one or more plane masks 710 of the one or more detected planes to generate an instance-level depth map 756.
[0067] In some embodiments, the plane detection branch 704 predicts a set of 2D plane masks 710 with corresponding semantic labels of the target image 606, and the plane MVS branch 754 receives the target and source images 606 and 604 as input and applies a slanted plane sweeping strategy to learn the plane parameters 708 without ambiguity. Specifically, plane sweeping is performed with a group of slanted plane hypotheses 752 to build a feature cost volume (e.g., 808 in Figure 8A) and regress per-pixel plane parameters (e.g., 814 in Figure 8A). Soft pooling is further applied to get piece-wise plane parameters in view of a loss objective, thereby associating the plane MVS branch 754 and plane detection branch 704 with each other. Learned uncertainties are applied on different loss terms to train such a multi-task learning system in a balanced way. The plane MVS system 234 generalizes to new environments with different data distributions. Results of the plane masks 710 and plane parameters 708 are further improved with a finetuning strategy without ground truth plane annotations. In some embodiments, the plane MVS system 234 is trained in an end-to-end manner. The reconstructed depth map 756 applies multi-view geometry to reduce scale ambiguity and is geometrically smoother compared with a depth map 746 generated by a depth-based MVS system 740 by parsing planar structures. Experimental results across different indoor datasets demonstrate that the plane MVS system 234 outperforms the single-view plane reconstruction framework 700 and learning-based MVS model 742.
[0068] Figure 8A is a block diagram of another example plane MVS branch 754 of a plane MVS system 234 (Figure 7C), in accordance with some embodiments. The plane MVS branch 754 generates a pixel-wise plane map 802 of a plane parameter 708 from a plurality of input images (e.g., a source image 604 and a target image 606). The plane MVS system 234 obtains the source image 604 captured by a first camera at a first camera pose and a target image 606 captured by a second camera at a second camera pose. Each of the first and second camera poses optionally includes a camera position and a camera orientation, and the first and second camera poses are distinct and different from each other. The plane MVS system 234 obtains information of a plurality of slanted planes 752 (i.e., slanted plane hypotheses), and generates a feature cost volume 808 of the target image 606 with reference to the plurality of slanted planes 752. The feature cost volume 808 of the target image 606 is processed by a regularization and regression network 812 to generate the pixel-wise plane map 802 of the plane parameter 708. In some embodiments, the feature cost volume 808 is applied to generate a plurality of plane parameters 708, and each plane parameter 708 corresponds to a respective pixel-wise plane map 802 including a plurality of pixel-based plane parameters 814. For each pixel-wise plane map 802, the plurality of pixel-based plane parameters 814 are pooled to generate an instance plane parameter 816. Examples of the instance plane parameter 816 include, but are not limited to, a normal direction, a plane location, and a plane shape.
[0069] In some embodiments, the plane MVS system 234 generates a plane probability volume 820 from the feature cost volume 808. The plane probability volume 820 has a plurality of elements each of which indicates a probability of each pixel in the target image 606 being located on a respective one of the plurality of slanted planes 752. The plurality of pixel-based plane parameters 814 of each pixel-wise plane map 802 are generated from the plane probability volume 820, e.g., using the regularization and regression network 812. In an example, each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606. For determination of the feature cost volume 808, the plane MVS system 234 has the predefined slanted planes 752 and performs homography warping of the source feature map 804 into the target image 606 (e.g., a pose of the target image 806) using the slanted planes 752 and a relative pose between the first and second camera. In some embodiments, the warped source features 804 are concatenated with the target features 806 along a channel dimension to form the feature cost volume 808. In some embodiments, a series of 3D convolution layers are applied to transform the feature cost volume 808 into the plane probability volume 820. Further, in some situations, a soft-argmin operation is performed to convert the plane probability volume 820 to the pixel-wise plane map 802.
[0070] In some embodiments, the plane MVS system 234 includes a backbone network 810 configured to generate the source feature 804 from the source image 604 and the target feature 806 from the target image 606. In an example, the source feature 804 is generated from the source image 604 using a CNN. A first resolution (H x W) of the source image 604 is reduced to a second resolution (H/S x W/S) of the source feature 804 according to a scaling factor S, before the source image 604 is converted to the source feature 804. In another example, the backbone network 810 includes a feature pyramid network (FPN) having five levels (L=5), and each of the source feature 804 and target feature 806 includes a multi-scale 2D feature map extracted from a finest level of the FPN, i.e., f0 ∈ R^(H/4 x W/4 x C). In some embodiments, the plane MVS system 234 further passes f0 corresponding to each of the source feature 804 and target feature 806 into a dimension-reduction layer and an average pooling layer to get a reduced feature representation f0′ at a further reduced spatial resolution, thereby balancing memory consumption and accuracy. The reduced feature representation f0′ serves as the feature 804 or 806 in the plane MVS system 234. Additionally, in some embodiments, each of the source feature 804 and target feature 806 includes respective two or more levels of features.
[0071] Figure 8B is a flow diagram of an example process 850 for refining a pixel-wise plane map 802 including a plurality of pixel-based plane parameters 814, in accordance with some embodiments. The plane MVS branch 754 in Figure 8A generates a pixel-wise plane map 802 of a plane parameter 708 from a source image 604 and a target image 606. In some embodiments, the plane MVS branch 754 further includes a refinement module (e.g., 904 in Figure 9) having a refinement network 852. The pixel-wise plane map 802 is concatenated (854) with the target image 606 to generate a concatenated map. The concatenated map is processed by the refinement network 852 to generate residual parameters 860 (δP′). Each of the pixel-level refined parameters 814′ (Pr) is a combination of a pixel-based plane parameter 814 and a corresponding residual parameter, i.e., Pr = P′ + δP′. A plurality of pixel-based refined parameters 814′ forms a pixel-wise refined plane map 802′. In some embodiments, the refinement network 852 includes a residual neural network (ResNet).
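A minimal, non-limiting PyTorch sketch of this residual refinement step is shown below; the layer configuration is hypothetical, and only the concatenate, predict-residual, and add structure follows the description above:

```python
import torch
import torch.nn as nn

class PlaneRefinementNet(nn.Module):
    """Predicts residual plane parameters from the initial pixel-wise plane map
    concatenated with the target image, and adds them back (Pr = P' + dP')."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1),   # residual plane parameters dP'
        )

    def forward(self, plane_map, target_image):
        x = torch.cat([plane_map, target_image], dim=1)  # concatenate along channels
        residual = self.net(x)
        return plane_map + residual                      # refined plane map Pr
```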
[0072] Figure 9 is a block diagram of an example plane MVS system 234, in accordance with some embodiments. The plane MVS system 234 includes a plane stereo module 902, a plane refinement module 904, and a plane detection branch 704. The plane stereo module 902 and plane refinement module 904 form a plane MVS branch 754. The plane stereo module 902 predicts a pixel-wise plane map 802 including a plurality of pixel-based plane parameters 814 via a feature cost volume 808 (Figure 8A). The plane refinement module 904 refines the pixel-wise plane map 802, e.g., by a residual refinement network 852 (Figure 8B). The plane detection branch 704 predicts one or more plane bounding boxes and one or more corresponding plane masks 710. Each plane bounding box identifies a respective plane in the target image 606. A soft pooling operation 908 is performed to pool the pixel-wise plane map 802 (which is optionally refined) into an instance-level plane parameter 816 corresponding to the corresponding plane mask 710.
[0073] In some embodiments, each plane is identified by a binary plane mask 710. For each plane, the binary plane mask 710 includes a plurality of elements, and each element has one of two predefined values (e.g., 0, 1) and indicates whether one or more adjacent pixels belongs to a respective plane. Alternatively, in some embodiments, a first number of planes are identified from the single input image 706. A single plane mask 710 includes a plurality of elements, and each element has a value selected from the first number of predefined values. Each predefined value corresponds to a distinct plane. For each predefined value, corresponding elements correspond to a set of pixels that belong to the same one of the first number of planes identified in the single input image.
[0074] In some embodiments, the plane detection branch 704 includes a PlaneRCNN 906 that identifies planes for a single input image (e.g., the target image 606). The PlaneRCNN 906 is formed based on Mask-RCNN, a Deep Neural Network (DNN) that detects objects in the target image 606 and generates a segmentation mask for each object. The PlaneRCNN 906 is applied to estimate 2D plane masks 710 (M). Specifically, in some embodiments, the PlaneRCNN 906 applies an FPN to extract an intermediate feature map from the target image 606 and adopts a two-stage detection framework to predict the 2D plane masks 710. In prior art, PlaneRCNN 906 includes a plurality of separate branches that also estimate 3D plane parameters 708. The PlaneRCNN 906 further includes an encoder-decoder architecture that processes the intermediate feature map of the target image 606 to get a per-pixel depth map Di. Instance-level plane features from ROI-Align are passed into a plane normal branch to predict plane normals N. A refinement network 852 is further applied to refine plane masks 802, and a reprojection loss between neighboring views enforces multi-view geometry consistency during training. With the predicted 2D plane masks 710 (M), the per-pixel depth map Di, and the plane normals N, a piecewise planar depth map Dp is reconstructed.
[0075] Conversely, in some embodiments of this application, the plane detection branch 704 does not estimate plane parameters 708 (e.g., plane normals N, depths Di and Dp), and the plane MVS branch 754 is configured to determine the plane parameters 708 that describe 3D plane geometry (e.g., N, Di, Dp). In some embodiments, the plane detection branch 704 does not perform plane refinement or determine any multi-view reprojection loss used in PlaneRCNN to conserve memory. Further, in some embodiments, the plane detection branch 704 performs semantic label prediction for each plane instance to determine a corresponding semantic class. Each plane is associated with semantic plane annotations. As such, the plane detection branch 704 determines a set of bounding boxes B = {b1, b2, ..., bk}, confidence scores S = {s1, s2, ..., sk}, where si ∈ (0,1), and semantic labels C = {c1, c2, ..., ck} for the target image having a resolution of H x W.
[0076] In some embodiments, neural networks applied in the plane stereo module 902, plane refinement module 904, and plane detection branch 704 are trained jointly in an end-to-end manner. For example, the feature backbone 810 (Figure 8A), regularization and regression network 812 (Figure 8A), refinement network 852 (Figure 8B), and PlaneRCNN network 906 (Figure 9) are trained jointly. In an example, during training, the plane MVS system 234 is trained using a loss function L that combines a detection loss Ldet with loss terms on the predicted plane parameters and depths:
L = Σt wt · Lt,
where wt is a learnable weight to automatically balance the contribution of each loss term Lt. Such an adaptive weighting strategy is also known as loss uncertainty learning. Specifically, Ldet follows the standard two-stage detection formulation and combines a classification loss, a bounding box regression loss, and a mask loss over the detected plane instances, and the remaining loss terms penalize differences between the predicted and target values of the pixel-based plane parameters 814, the refined plane parameters 814′, and the corresponding depths, where pi is a foreground class output for box i, ti is a four-dimensional (4D) vector applied in predicted regression for positive box i, ti* is a target regression, pu is an output for positive box i at true class u, tu is a 4-dimensional vector applied in predicted regression for positive box i at true class u, mu is an output for a foreground pixel i at true class u, pi is a 3D vector for a predicted plane at pixel i (the pixel-based plane parameter 814), pi* is a 3D vector for a targeted plane at pixel i, pir is a 3D vector for a predicted refined plane at pixel i, di is a predicted depth at pixel i, di* is a target depth at pixel i, and dir is a predicted refined depth at pixel i.
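As a non-limiting sketch of balancing multiple loss terms with learnable weights (PyTorch; the exp(-w)·L + w form is one common formulation of loss uncertainty learning and is an assumption here, as are the loss-term names in the usage comment):

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combines several loss terms with learnable balancing weights
    (one common formulation of loss uncertainty learning)."""
    def __init__(self, num_terms):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_terms))  # one learnable weight per term

    def forward(self, loss_terms):
        total = torch.zeros(())
        for w, loss in zip(self.log_vars, loss_terms):
            total = total + torch.exp(-w) * loss + w  # down-weight noisier terms, regularize w
        return total

# Hypothetical usage with detection, plane, refined-plane, and depth loss terms:
# criterion = UncertaintyWeightedLoss(num_terms=5)
# total = criterion([l_det, l_plane, l_plane_refine, l_depth, l_depth_refine])
```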
[0077] Figure 10 is a flow diagram of an example process implemented by a client device 104 to generate plane parameters from a plurality of input images (e.g., a source image 604 and a target image 606), in accordance with some embodiments. The source image 604 is captured by a first camera at a first camera pose, and the target image 606 is captured by a second camera at a second camera pose distinct from the first camera pose. The client device 104 executes a plane MVS system 234 that obtains information of a plurality of slanted planes 752 and generates a feature cost volume 808 of the target image 606 with reference to the plurality of slanted planes 752. A plane probability volume 820 is generated from the feature cost volume 808, and has a plurality of elements each of which indicates a probability of each pixel in the target image 606 being located on a respective one of the plurality of slanted planes. The plane MVS system 234 generates a plurality of pixel-based plane parameters 814 (e.g., forming a plane map 802) from the plane probability volume 820 and the information of the plurality of slanted planes 752. In an example, each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606. The plurality of pixel-based plane parameters 814 are pooled to generate an instance plane parameter 816.
[0078] The plane MVS system 234 of the client device 104 applies slanted plane hypotheses 752 to perform plane sweeping and determine pixel-based plane parameters 814. In some embodiments, the plane MVS system 234 obtains a dataset including a plurality of reference images, and identifies the plurality of slanted planes 752 from the plurality of reference images. The information of the plurality of slanted planes 752 includes a respective plane normal and a respective location of each slanted plane 752. The representation of differentiable homography uses the slanted plane hypotheses 752. A homography between two views induced by the plane ni^T x + ei = 0, where ni is the plane normal and ei is the plane offset at pixel i of the target image 606, is represented as:
Hi ~ K (R − t·ni^T/ei) K^-1, (12)
where the symbol ~ means "equality up to a scale", K is an intrinsic matrix, and R and t are the relative camera rotation and translation matrices between the two views, respectively. Therefore it can be concluded that, without considering occlusion and object motion, the homography at pixel i between two views is only determined by the plane pi = ni^T/ei with known camera poses. This perfectly aligns with a goal to learn 3D plane parameters with MVS. The pixel-level plane parameter pi = ni^T/ei is a non-ambiguous representation for a plane by employing slanted plane-sweeping in the plane MVS system 234.
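For illustration, a NumPy sketch of equation (12) (the function name is hypothetical, and a single intrinsic matrix K shared by both views is assumed):

```python
import numpy as np

def plane_induced_homography(K, R, t, p):
    """Homography (up to scale) between two views induced by a slanted plane with
    pixel-level parameter p = n / e (plane n^T x + e = 0), per equation (12)."""
    p = np.asarray(p, dtype=np.float64).reshape(3, 1)
    t = np.asarray(t, dtype=np.float64).reshape(3, 1)
    return K @ (R - t @ p.T) @ np.linalg.inv(K)
```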
[0079] The plane MVS system 234 includes a set of three-dimensional slanted plane hypotheses nT/e. The number of candidate planes that pass through a 3D point is infinite. The appropriate hypothesis range is determined for each dimension of nT/e. In some embodiments, the client device 104 randomly samples a number of (e.g., 10,000) training images and plots a distribution for every axis of the ground truth plane nT/e, which reflects the general distribution for plane parameters 816 in various scenes. The client device 104 selects upper and lower bounds for each axis by ensuring a predefined portion (e.g., 90%) of ground truth values lie within a range defined by the selected bounds. The client device 104 samples the slanted plane hypotheses 752 uniformly between the bounds along every axis.
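As a non-limiting sketch of this hypothesis selection (NumPy; the 90% coverage mirrors the predefined portion mentioned above, while the number of samples per axis is a hypothetical value):

```python
import numpy as np

def build_plane_hypotheses(gt_planes, coverage=0.90, samples_per_axis=8):
    """gt_planes: (N, 3) array of ground-truth nT/e values gathered from training images.
    Selects per-axis bounds covering `coverage` of the values, then samples
    hypotheses uniformly on the resulting 3D grid."""
    lo = np.percentile(gt_planes, (1 - coverage) / 2 * 100, axis=0)
    hi = np.percentile(gt_planes, (1 + coverage) / 2 * 100, axis=0)
    axes = [np.linspace(lo[d], hi[d], samples_per_axis) for d in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
    return grid.reshape(-1, 3)   # (samples_per_axis**3, 3) slanted plane hypotheses
```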
[0080] After determining the slanted plane hypotheses 752, the plane MVS system 234 warps the source feature map 804 into the target image 606 (e.g., a pose of the target image 806) by equation (12). For every slanted plane hypothesis 752, the plane MVS system 234 concatenates the warped source feature 805 and target feature 806. The features are stacked along a hypothesis dimension to build a feature cost volume 808. Stated another way, the plane MVS system 234 warps the source feature 804 to the target image 606 (e.g., a pose of the target image 806) with respect to the plurality of slanted planes via a differentiable homography computed from each slanted plane 752. For each slanted plane 752, the warped source feature and the target feature are combined to generate a respective comprehensive feature. In some embodiments, comprehensive features are combined to generate the feature cost volume 808 of the source and target features 804 and 806.
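A simplified, non-limiting PyTorch sketch of this cost-volume construction is shown below; each slanted plane hypothesis yields one homography shared by all pixels of the view, and the helper names and tensor layout are assumptions:

```python
import torch
import torch.nn.functional as F

def warp_with_homography(src_feat, H_mat):
    """Warp a source feature map (B, C, h, w) into the target view, where H_mat is a
    3x3 homography mapping target pixel coordinates to source pixel coordinates."""
    B, C, h, w = src_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3).float()
    src = (H_mat.float() @ pix.T).T                   # corresponding source pixel coordinates
    src = src[:, :2] / src[:, 2:3].clamp(min=1e-6)    # guard against division by zero
    grid = torch.empty(h * w, 2)
    grid[:, 0] = 2.0 * src[:, 0] / (w - 1) - 1.0      # normalize x to [-1, 1] for grid_sample
    grid[:, 1] = 2.0 * src[:, 1] / (h - 1) - 1.0      # normalize y to [-1, 1]
    grid = grid.reshape(1, h, w, 2).expand(B, -1, -1, -1)
    return F.grid_sample(src_feat, grid, align_corners=True)

def build_cost_volume(src_feat, tgt_feat, homographies):
    """For every slanted plane hypothesis, concatenate the warped source feature with
    the target feature along channels, then stack along a hypothesis dimension."""
    slices = [torch.cat([warp_with_homography(src_feat, H), tgt_feat], dim=1)
              for H in homographies]
    return torch.stack(slices, dim=2)                 # (B, 2C, num_hypotheses, h, w)
```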
[0081] A regularization and regression network 812 is applied to regularize the feature cost volume 808 to generate a plane probability volume 820. In an example, the regularization and regression network 812 includes an encoder-decoder architecture with 3D CNN layers. The plane MVS system 234 applies a soft-argmax operation to get the initial pixel-based plane parameters 814. For the plane hypothesis set C = {p0, p1, ..., pM-1}, where M is the number of slanted plane hypotheses 752, the 3D plane parameter pi at pixel i is inferred as:
pi = Σ_{pj ∈ C} U(pj) · pj, (13)
where U(pj) is the probability of hypothesis pj at pixel i. Soft-argmax provides an initial pixel-level plane parameter tensor P ∈ R^(H/8 x W/8 x 3), which is upsampled to an original image resolution. In some embodiments, direct application of bilinear upsampling results in an over-smoothness issue. For each pixel of P, a convex combination is applied by predicting an 8 x 8 x 3 x 3 grid and applying a weighted combination over the learned weights of its 3 x 3 coarse neighbors to get the upsampled plane parameters P′ ∈ R^(H x W x 3). This upsampling preserves boundaries of planes and other details in the reconstructed planar depth map 756.
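As a non-limiting PyTorch sketch of the soft-argmax in equation (13) (the tensor shapes are assumptions):

```python
import torch

def soft_argmax_planes(prob_volume, hypotheses):
    """prob_volume: (B, M, h, w) plane probability volume (probabilities over M hypotheses).
    hypotheses:  (M, 3) slanted plane hypotheses (nT/e values).
    Returns the initial pixel-level plane parameters with shape (B, 3, h, w)."""
    return torch.einsum("bmhw,mc->bchw", prob_volume, hypotheses)
```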
[0082] The plane refinement module 904 is applied to learn the residual of the initial pixel-based plane parameters 814 with respect to ground truth. The upsampled initial plane parameters 814 (P′) are concatenated with the target image 606 to preserve image details and passed into the refinement network 852, which optionally includes a plurality of 2D CNN layers, to predict a residual parameter 860 (δP′) (shown in Figure 8B). Each of the pixel-level refined parameters 814′ (Pr) is a combination of a pixel-based plane parameter 814 and a corresponding residual parameter, i.e., Pr = P′ + δP′.
[0083] Specifically, in some embodiments, prior to pooling the plurality of pixel-based plane parameters 814, the plurality of pixel-based plane parameters 814 are refined based on the target image 606, e.g., using a refinement network 852, to generate a plurality of pixel-based refined parameters 814′ that forms a refined plane map 802′. Each pixel-based refined parameter 814′ includes a plane refined parameter vector of a respective pixel of the target image 606. In some embodiments, the plurality of pixel-based plane parameters 814 are refined based on the target image 606 to generate a plurality of pixel-based residual parameters 860 (δP′), which are combined with the pixel-based plane parameters 814 to generate the refined plane map 802′ including the plurality of pixel-based refined parameters 814′. The plurality of pixel-based refined parameters 814′ is pooled to generate an instance refined parameter 816′, which is further applied to generate the instance-level planar depth map 756′ with refinement. An example of the refinement network 852 includes a 2D CNN.
[0084] It is noted that the instance plane parameter 816 or refined plane parameter 816′ identifies a corresponding plane accurately. In some embodiments, the client device 104 executes an extended reality application 225 (e.g., an AR application) and renders a virtual object on the 3D plane.
[0085] In some embodiments, the plane detection branch 704 of the client device 104 generates a plane mask 710 (ms) from the target image 606 using a plane mask model 906. The plurality of pixel-based plane parameters 814 of the pixel-wise plane map 802 are generated from the source image 604 and the target image 606 using a plane MVS model 1010. In an example, each pixel-based plane parameter 802 includes a respective 3D plane parameter vector of a respective pixel of the target image 606. In some embodiments, the plane MVS model 1010 is a combination of a backbone network 810 and a regularization and regression network 812. The plurality of pixel-based plane parameters 802 (pi) are pooled based on the plane mask 710 to generate an instance plane parameter 816 (Pl), which is converted to an instance-level planar depth map 756 (Dl) corresponding to the target image 606. For a detected plane, the plane mask 710 (ms) includes a plurality of elements, and each element qi at a pixel i indicates a foreground probability of belonging to the detected plane. The instance plane parameter 816 (Pl) is represented based on weighted averaging as follows:
Pl = (Σi qi · pi) / (Σi qi), (14)
The instance-level planar depth map 756 (Dl) is reconstructed as follows:
Dl(i) = −Fi / (Pl^T K^-1 xi), (15)
where F is an indicator variable to identify foreground pixels of a corresponding plane. In some embodiments, a foreground threshold (e.g., equal to 0.5) is applied on each element qi of the plane mask 710 (ms) to determine whether a corresponding pixel i is identified as foreground, i.e., a pixel of a detected plane. K^-1 is an inverse intrinsic matrix and xi is the homogeneous coordinate of pixel i.
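For illustration, a NumPy sketch following equations (14) and (15) as written above (the plane convention n^T x + e = 0 and the function names are assumptions; the 0.5 foreground threshold is the example value mentioned above):

```python
import numpy as np

def soft_pool_plane(pixel_planes, mask_probs):
    """Equation (14): pool pixel-based plane parameters (H, W, 3) into a single
    instance plane parameter using the mask's foreground probabilities (H, W)."""
    weights = mask_probs[..., None]
    return (weights * pixel_planes).sum(axis=(0, 1)) / mask_probs.sum()

def instance_planar_depth(instance_plane, mask_probs, K, threshold=0.5):
    """Equation (15): reconstruct an instance-level planar depth map from the pooled
    plane parameter; pixels below the foreground threshold are left at zero."""
    H, W = mask_probs.shape
    ys, xs = np.mgrid[0:H, 0:W]
    homog = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x (H*W)
    rays = np.linalg.inv(K) @ homog              # K^-1 x_i for every pixel
    depth = -1.0 / (instance_plane @ rays)       # plane convention n^T x + e = 0, p = n/e
    return np.where(mask_probs > threshold, depth.reshape(H, W), 0.0)
```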
[0086] Referring to Figure 10, in some embodiments, the plurality of pixel-based plane parameters 814 of the plane map 802 is generated from the feature cost volume 808 using a 3D CNN 812. The plurality of pixel-based plane parameters 814 is refined using a 2D CNN 852. The plurality of pixel-based plane parameters 814 is pooled using a first pooling network. The plurality of pixel-based refined parameters 814′ is pooled using a second pooling network. A plane mask 710 is generated from the target image 606 using a PlaneRCNN 906 (Figure 9). In some embodiments, the 3D CNN 812, 2D CNN 852, first pooling network, second pooling network, and PlaneRCNN are trained end-to-end. In some embodiments, the backbone network 810, 3D CNN 812, 2D CNN 852, first pooling network, second pooling network, and PlaneRCNN are trained end-to-end.
[0087] Figure 11 is a flow diagram of an example process 1100 implemented by a client device 104 to form a stitched depth map 756S, in accordance with some embodiments. The client device 104 forms the stitched depth map 756S for the target image 606 by filling planar pixels with the instance-level planar depth map 756 (D_I) determined from the instance plane parameter 816 (P_I). Stated another way, in some embodiments, the client device 104 generates a plurality of pixel-based plane parameters 814A of a plane map 802A, which is associated with a first plane of a first portion of the target image 606 and pooled to generate a first instance plane parameter 816A. The client device 104 generates a plurality of pixel-based plane parameters 814B of a plane map 802B, which is associated with a second plane of a second portion of the target image 606 and pooled to generate a second instance plane parameter 816B. The first and second instance plane parameters 816A and 816B are converted to a first instance-level planar depth map 756A and a second instance-level planar depth map 756B, respectively. A first subset of planar pixels of a stitched instance-level planar depth map 756S of the target image 606 is filled with the first instance-level planar depth map 756A and the second instance-level planar depth map 756B. The first subset of planar pixels of the stitched instance-level planar depth map 756S corresponds to the first and second planes of the target image 606.
[0088] Additionally, while the pixel-wise plane parameters 802 (P_i) capture local planarity, the client device 104 fills a subset of non-planar pixels with a reconstructed pixel-wise planar depth map. In some embodiments, the client device 104 reconstructs a third pixel-level planar depth map 756C corresponding to an area of the target image 606 that does not correspond to any plane, e.g., using a deep learning model. A second subset of non-planar pixels of the stitched instance-level planar depth map 756S is filled with the third pixel-level planar depth map 756C.
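A hedged sketch of the stitching in paragraphs [0087]-[0088]: planar pixels are overwritten with their instance-level depths, and the remaining pixels keep the reconstructed pixel-wise depth. The function name and the use of a probability threshold to select planar pixels are assumptions, not the disclosed implementation.

```python
# Illustrative stitching of per-plane depth maps into one instance-level map, with
# non-planar pixels filled from a pixel-wise reconstruction.
import torch

def stitch_depth(instance_depths, instance_masks, pixelwise_depth, threshold=0.5):
    """instance_depths: list of (H, W) instance-level planar depth maps
       instance_masks:  list of (H, W) per-plane foreground probabilities
       pixelwise_depth: (H, W) depth reconstructed for non-planar regions."""
    stitched = pixelwise_depth.clone()            # start from the pixel-wise depth map
    for depth, mask in zip(instance_depths, instance_masks):
        planar = mask > threshold                 # planar pixels of this plane instance
        stitched[planar] = depth[planar]          # fill with the instance-level depth
    return stitched
```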
[0089] Figure 12 is a flow diagram of an example image processing method 1200, in accordance with some embodiments. For convenience, the method 1200 is described as being implemented by an electronic device (e.g., a mobile phone 104C). Method 1200 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 12 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in
Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1200 may be combined and/or the order of some operations may be changed.
[0090] The electronic device obtains a source image 604 captured by a first camera at a first camera pose (1202), a target image 606 captured by a second camera at a second camera pose (1204), and information of a plurality of slanted planes 752 (1206). A feature cost volume 808 of the target image 606 is generated (1208) with reference to the plurality of slanted planes 752, and a plane probability volume 820 is generated (1210) from the feature cost volume 808. The plane probability volume 820 has a plurality of elements each of which indicates a probability of each pixel in the target image 606 being located on a respective one of the plurality of slanted planes 752. The electronic device generates (1212) a plurality of pixel-based plane parameters 814 from the plane probability volume 820. In an example, each pixel-based plane parameter (1214) includes a respective 3D plane parameter vector of a respective pixel of the target image 606. In some embodiments, the pixel-based plane parameters 814 are generated based on the information of the plurality of slanted planes 752. The electronic device pools (1216) the plurality of pixel-based plane parameters 814 to generate an instance plane parameter 816.
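One plausible way to realize operation 1212, i.e., turning the plane probability volume 820 into pixel-based plane parameters using the slanted-plane information 752, is a probability-weighted sum over the plane hypotheses (a soft-argmax-style regression). The disclosure does not fix the exact regression scheme, so the sketch below is an assumption.

```python
# Assumed soft regression of per-pixel plane parameters from the probability volume.
import torch

def regress_plane_map(prob_volume: torch.Tensor, plane_hyps: torch.Tensor) -> torch.Tensor:
    """prob_volume: (B, K, H, W) probability of each pixel lying on each of K slanted planes
       plane_hyps:  (K, 3) 3D plane parameter vector of each slanted-plane hypothesis."""
    # Expected plane parameter per pixel: sum_k prob_k * P_k.
    return torch.einsum("bkhw,kc->bchw", prob_volume, plane_hyps)   # (B, 3, H, W)
```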
[0091] In some embodiments, the electronic device generates a source feature 804 and a target feature 806 from the source image 604 and the target image 606, respectively. The source feature 804 is warped to the target image 606 (e.g., a pose of the target image 606) with respect to the plurality of slanted planes 752 via a differentiable homography computed from each slanted plane. The electronic device combines the warped source feature 804 and the target feature 806 to generate a comprehensive feature. Specifically, in some embodiments, the source feature 804 is warped according to each of the plurality of slanted planes 752, e.g., using homography, and each warped source feature 804 is concatenated with the target feature 806 to form a comprehensive feature. Comprehensive features corresponding to the plurality of slanted planes 752 are organized to form the feature cost volume 808. Further, in some embodiments, the source feature 804 is generated from the source image 604 using a convolutional neural network (CNN). The source feature 804 is generated from the source image 604 by reducing a first resolution of the source image 604 to a second resolution of the source feature 804 according to a scaling factor S. For example, the first resolution is H × W, and the second resolution is H/S × W/S. Additionally, in some embodiments, based on the comprehensive feature, the electronic device generates the feature cost volume 808 of the target feature 806 with reference to the plurality of slanted planes 752. The electronic device converts the feature cost volume 808 to the plane probability volume 820 using 3D CNN layers. In some embodiments, the electronic device applies an encoder-decoder architecture with 3D CNN layers to regularize the feature cost volume 808. In some embodiments, the electronic device uses a single 3D CNN layer with softmax activation to transform the feature cost volume 808 into a plane probability volume 820.
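The warping and cost-volume construction of paragraph [0091] can be sketched as follows in PyTorch. The plane-induced homography H = K (R - t n^T / d) K^-1 is the textbook formula for a plane n^T X = d (sign conventions depend on how the relative pose and the plane are parameterized); the channel counts, the hypothesis list, and the single-convolution probability head are illustrative assumptions rather than the disclosed architecture.

```python
# Sketch of plane-induced warping, cost-volume construction, and conversion to a
# plane probability volume; shapes and the single-layer probability head are assumptions.
import torch
import torch.nn.functional as F

def warp_by_plane(src_feat, K, K_inv, R, t, n, d):
    """Warp the source feature to the target view for one slanted-plane hypothesis
       (n, d): unit normal and offset; R, t: relative pose between the two views."""
    B, C, H, W = src_feat.shape
    H_mat = K @ (R - torch.outer(t, n) / d) @ K_inv                 # plane-induced homography
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones(H, W)], dim=0).reshape(3, -1)
    src = H_mat @ pix                                               # target pixels mapped to source
    src = src[:2] / src[2:].clamp(min=1e-6)
    gx = 2.0 * src[0] / (W - 1) - 1.0                               # normalize for grid_sample
    gy = 2.0 * src[1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2).expand(B, -1, -1, -1)
    return F.grid_sample(src_feat, grid, align_corners=True)

def build_cost_volume(src_feat, tgt_feat, K, K_inv, R, t, plane_hyps):
    """Concatenate each warped source feature with the target feature and stack over hypotheses."""
    slices = [torch.cat([warp_by_plane(src_feat, K, K_inv, R, t, n, d), tgt_feat], dim=1)
              for n, d in plane_hyps]
    return torch.stack(slices, dim=2)          # (B, 2C, num_planes, H, W) feature cost volume

def to_probability_volume(cost_volume, head: torch.nn.Conv3d):
    """head: e.g. nn.Conv3d(2 * C, 1, 3, padding=1); softmax over the plane-hypothesis dimension."""
    logits = head(cost_volume).squeeze(1)      # (B, num_planes, H, W)
    return torch.softmax(logits, dim=1)        # plane probability volume
```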
[0092] In some embodiments, the electronic device refines the plurality of pixel-based plane parameters 814 based on the target image 606 to generate a plurality of residual parameters 860. The plurality of pixel-based plane parameters 814 and the plurality of pixel-based residual parameters 860 are combined to generate a plurality of pixel-based refined parameters 814’. Each pixel-based refined parameter 814’ includes a plane refined parameter vector of a respective pixel of the target image 606. Further, in some embodiments, the plurality of pixel-based refined parameters 814’ is pooled to generate an instance refined parameter 816’. Additionally, in some embodiments, the electronic device generates a refined planar depth map 756’ based on the instance refined parameter 816’. In some embodiments, the plurality of pixel-based plane parameters 814 is generated from the feature cost volume 808 using a 3D CNN. The plurality of pixel-based plane parameters 814 is refined using a 2D CNN. The plurality of pixel-based plane parameters 814 is pooled using a first pooling network. The plurality of pixel-based refined parameters is pooled using a second pooling network. A plane mask 710 is generated from the target image 606 using a PlaneRCNN. The 3D CNN, 2D CNN, first pooling network, second pooling network, and PlaneRCNN are trained end-to-end in a server 102 or the electronic device.
[0093] In some embodiments, the electronic device obtains (1218) a dataset including a plurality of reference images, identifies (1220) the plurality of slanted planes 752 from the plurality of reference images, and determines (1222) the information of the plurality of slanted planes 752. The information includes a respective plane normal and a respective location of each slanted plane 752.
[0094] In some embodiments, the electronic device generates a plane mask 710 from one of the source and target images 604 and 606. The plurality of pixel-based plane parameters 814 is pooled based on the plane mask 710 to generate the instance plane parameter 816.

[0095] In some embodiments, the electronic device identifies a 3D plane based on the instance plane parameter 816. An augmented reality application is executed on the electronic device, and a virtual object is rendered on the 3D plane in the augmented reality application.

[0096] In some embodiments, the first camera and the second camera are (1224) the same camera, and the source and target images 604 and 606 are two distinct image frames in a sequence of image frames captured by the same camera. The first and second camera poses are distinct from each other and correspond to two distinct time instants. Alternatively, in some embodiments, the first camera and the second camera are (1226) distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
[0097] It should be understood that the particular order in which the operations in Figure 12 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process images. Additionally, it should be noted that details of other processes described above with respect to Figures 5-11 and 13 are also applicable in an analogous manner to method 1200 described above with respect to Figure 12. For brevity, these details are not repeated here.
[0098] Figure 13 is a flow diagram of an example image processing method 1300, in accordance with some embodiments. For convenience, the method 1300 is described as being implemented by an electronic device (e.g., a mobile phone 104C). Method 1300 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 13 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1300 may be combined and/or the order of some operations may be changed.
[0099] The electronic device obtains a source image 604 (1302) captured by a first camera at a first camera pose and a target image 606 (1304) captured by a second camera at a second camera pose. A plane mask 710 is generated (1306) from the target image 606 using a plane mask model. A plurality of pixel-based plane parameters 814 is generated (1308) from the source image 604 and the target image 606 using a plane multi-view stereo model 1010 (e.g., in Figure 10). In an example, each pixel-based plane parameter includes (1310) a respective 3D plane parameter vector of a respective pixel of the target image 606. The plurality of pixel-based plane parameters 814 is pooled (1312) based on the plane mask 710 to generate an instance plane parameter 816. The electronic device generates (1314) an instance-level planar depth map 756 corresponding to the target image 606 based on the instance plane parameter 816. In some embodiments, the plane mask model includes (1316) a PlaneRCNN that is configured to detect and reconstruct a plurality of piecewise planar surfaces from a single input image including the target image 606. PlaneRCNN employs a variant of Mask R-CNN to detect planes with plane parameters and segmentation masks. PlaneRCNN then jointly refines all the segmentation masks with a loss enforcing consistency with a nearby view during training.
[00100] In some embodiments, the plane mask 710 includes (1318) a plurality of elements, and each element corresponds to a predicted foreground probability of a respective pixel. The plurality of pixel-based plane parameters 814 is pooled by assigning (1320) a corresponding predicted foreground probability to each of the plurality of pixel-based plane parameters 814 as a respective weight and applying (1322) a weighted average operation on the plurality of pixel-based plane parameters 814 using respective weights to generate the instance plane parameter 816, e.g., based on equation (14).
[00101] In some embodiments, the instance-level planar depth map 756 is generated by determining a pixel-based foreground indicator variable F that identifies a plurality of foreground pixels and determining the instance-level planar depth map 756 based on an inverse intrinsic matrix K^-1 for a pixel i as follows:
$$D_i^{-1} = P_I^{\top} K^{-1} x_i$$
where D_i is a depth value of the pixel i, P_I is the instance plane parameter 816, x_i is a homogeneous coordinate of the pixel i, and the foreground indicator variable F indicates whether the pixel i is a foreground pixel. Further, in some embodiments, the plane mask 710 includes a plurality of elements, and each element corresponds to a predicted foreground probability of a respective pixel. For each pixel, a respective element of the plane mask 710 is compared with a predefined foreground threshold to determine whether the pixel i is a foreground pixel and to set the corresponding foreground indicator variable F.
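As a quick sanity check of the depth equation, assume the common parameterization P_I = n/d (an assumption; the text does not spell out the parameterization): for a fronto-parallel plane with unit normal n = (0, 0, 1)^T at distance d = 2 in front of the camera, a pixel whose viewing ray is K^-1 x_i = (0, 0, 1)^T gives

$$D_i^{-1} = P_I^{\top} K^{-1} x_i = \tfrac{1}{2}\,(0,\,0,\,1)\,(0,\,0,\,1)^{\top} = \tfrac{1}{2}, \qquad D_i = 2,$$

which correctly recovers the plane two units in front of the camera.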
[00102] In some embodiments, the instance plane parameter 816 includes a first instance plane parameter 816A associated with a first plane of a first portion of the target image 606. The instance-level planar depth map 756 includes a first instance-level planar depth map 756A associated with the first plane of the first portion of the target image 606. For a second plane associated with a second portion of the target image 606, the electronic device generates (1324) a second instance-level planar depth map 756B corresponding to the second portion of the target image 606 based on a second instance plane parameter 816B, and fills (1326) a first subset of planar pixels of a stitched instance-level planar depth map 756S of the target image 606 with the first instance-level planar depth map 756A and the second instance-level planar depth map 756B. The first subset of planar pixels of the stitched instance-level planar depth map 756S corresponds to the first and second planes of the target image 606. Further, in some embodiments, the electronic device reconstructs a third pixel-level planar depth map 756C corresponding to an area of the target image 606 that does not correspond to any plane (e.g., using a deep learning model) and fills a second subset of non-planar pixels of the stitched instance-level planar depth map 756S using the third pixel-level planar depth map 756C.
[00103] In some embodiments, prior to pooling the plurality of pixel-based plane parameters 814, the electronic device refines the plurality of pixel-based plane parameters 814 based on the target image 606 to generate a plurality of pixel-based residual parameters 860. The plurality of pixel-based plane parameters 814 and the plurality of pixel-based residual parameters 860 are combined to generate a plurality of pixel-based refined parameters 814’. Each pixel-based refined parameter 814’ includes a plane refined parameter vector of a respective pixel of the target image 606. The plurality of pixel-based refined parameters 814’ is pooled to generate the instance plane parameter 816 with refinement, and the instance-level planar depth map 756' corresponding to the target image 606 is generated based on the instance plane parameter 816 with refinement.
[00104] In some embodiments, the plurality of pixel-based plane parameters 814 is generated using the plane multi-view stereo model by obtaining information of a plurality of slanted planes 752 and generating a feature cost volume 808 of the target image 606 with reference to the plurality of slanted planes 752. Further, in some embodiments, a plane probability volume 820 is generated from the feature cost volume 808. The plane probability volume 820 has a plurality of elements each of which indicates a probability of each pixel in the target image 606 being located on a respective one of the plurality of slanted planes 752. Additionally, in some embodiments, the plurality of pixel-based plane parameters 814 is generated from the plane probability volume 820. Optionally, the information of the plurality of slanted planes 752 is applied jointly with the plane probability volume 820 to generate the plurality of pixel-based plane parameters 814. In an example, each pixel-based plane parameter 814 includes a respective 3D plane parameter vector of a respective pixel of the target image 606.
[00105] In some embodiments, the first camera and the second camera are the same camera, and the source and target images 604 and 606 are two distinct image frames in a sequence of image frames captured by the same camera, and wherein the first and second camera poses are distinct from each other and correspond to two distinct time instants.
[00106] In some embodiments, the first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
[00107] It should be understood that the particular order in which the operations in Figure 13 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to process images. Additionally, it should be noted that details of other processes described above with respect to Figures 5-12 are also applicable in an analogous manner to method 1300 described above with respect to Figure 13. For brevity, these details are not repeated here.
[00108] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[00109] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[00110] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[00111] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. An image processing method, implemented by an electronic device, comprising: obtaining a source image captured by a first camera at a first camera pose; obtaining a target image captured by a second camera at a second camera pose; obtaining information of a plurality of slanted planes; generating a feature cost volume of the target image with reference to the plurality of slanted planes; generating a plane probability volume from the feature cost volume, the plane probability volume having a plurality of elements each of which indicates a probability of each pixel in the target image being located on a respective one of the plurality of slanted planes; generating a plurality of pixel-based plane parameters from the plane probability volume; and pooling the plurality of pixel-based plane parameters to generate an instance plane parameter.
2. The method of claim 1, further comprising: generating a source feature from the source image; generating a target feature from the target image; warping the source feature to a pose of the target image with respect to the plurality of slanted planes via a differentiable homography computed from each slanted plane; and combining the warped source feature and the target feature to generate a comprehensive feature.
3. The method of claim 2, wherein the source feature is generated from the source image using a convolutional neural network (CNN), and generating the source feature from the source image further comprises: reducing a first resolution of the source image to a second resolution of the source feature according to a scaling factor S, the first resolution being H × W, the second resolution being H/S × W/S.
4. The method of claim 2, wherein: based on the comprehensive feature, the feature cost volume of the target feature is generated with reference to the plurality of slanted planes; and
the feature cost volume is converted to the plane probability volume using 3D CNN layers.
5. The method of any of claims 1-4, further comprising: refining the plurality of pixel-based plane parameters based on the target image to generate a plurality of pixel-based residual parameters; and combining the plurality of pixel-based plane parameters and the plurality of pixel-based residual parameters to generate a plurality of pixel-based refined parameters, each pixel-based refined parameter including a plane refined parameter vector of a respective pixel of the target image.
6. The method of claim 5, further comprising pooling the plurality of pixel-based refined parameters to generate an instance refined parameter.
7. The method of claim 6, further comprising generating a refined planar depth map from the instance refined parameter.
8. The method of claim 5, wherein: the plurality of pixel-based plane parameters is generated from the feature cost volume using a 3D CNN; the plurality of pixel-based plane parameters is refined using a 2D CNN; the plurality of pixel-based plane parameters is pooled using a first pooling network; the plurality of pixel-based refined parameters is pooled using a second pooling network; a plane mask is generated from the target image using a PlaneRCNN; and the method further comprises training the 3D CNN, 2D CNN, first pooling network, second pooling network, and PlaneRCNN end-to-end.
9. The method of any of claims 1-8, further comprising: obtaining a dataset including a plurality of reference images; identifying the plurality of slanted planes from the plurality of reference images; and determining the information of the plurality of slanted planes, wherein the information includes a respective plane normal and a respective location of each slanted plane.
10. The method of any of claims 1-9, further comprising:
generating a plane mask from one of the source and target images, wherein the plurality of pixel-based plane parameters is pooled based on the plane mask to generate the instance plane parameter.
11. The method of any of claims 1-10, further comprising: identifying a 3D plane based on the instance plane parameter; executing an augmented reality application on the electronic device; and rendering a virtual object on the 3D plane in the augmented reality application.
12. The method of any of claims 1-11, wherein the first camera and the second camera are the same camera, and the source and target images are two distinct image frames in a sequence of image frames captured by the same camera, and wherein the first and second camera poses are distinct from each other and correspond to two distinct time instants.
13. The method of any of claims 1-11, wherein the first camera and the second camera are distinct from each other, and the first and second camera poses are distinct from each other and correspond to the same time instant or two distinct time instants.
14. The method of any of claims 1-13, wherein each pixel-based plane parameter includes a respective 3D plane parameter vector of a respective pixel of the target image.
15. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.
16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-14.
PCT/US2022/040610 2021-08-18 2022-08-17 3d semantic plane detection and reconstruction from multi-view stereo (mvs) images WO2023023162A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163234618P 2021-08-18 2021-08-18
US63/234,618 2021-08-18
US202163285927P 2021-12-03 2021-12-03
US63/285,927 2021-12-03

Publications (1)

Publication Number Publication Date
WO2023023162A1 true WO2023023162A1 (en) 2023-02-23

Family

ID=85239728

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2022/040608 WO2023023160A1 (en) 2021-08-18 2022-08-17 Depth information reconstruction from multi-view stereo (mvs) images
PCT/US2022/040610 WO2023023162A1 (en) 2021-08-18 2022-08-17 3d semantic plane detection and reconstruction from multi-view stereo (mvs) images

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/US2022/040608 WO2023023160A1 (en) 2021-08-18 2022-08-17 Depth information reconstruction from multi-view stereo (mvs) images

Country Status (1)

Country Link
WO (2) WO2023023160A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200134857A1 (en) * 2017-06-21 2020-04-30 Vancouver Computer Vision Ltd. Determining positions and orientations of objects

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11107230B2 (en) * 2018-09-14 2021-08-31 Toyota Research Institute, Inc. Systems and methods for depth estimation using monocular images
CN113711276A (en) * 2019-04-30 2021-11-26 华为技术有限公司 Scale-aware monocular positioning and mapping

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200134857A1 (en) * 2017-06-21 2020-04-30 Vancouver Computer Vision Ltd. Determining positions and orientations of objects

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIACHEN LIU; PAN JI; NITIN BANSAL; CHANGJIANG CAI; QINGAN YAN; XIAOLEI HUANG; YI XU: "PlaneMVS: 3D Plane Reconstruction from Multi-View Stereo", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 March 2022 (2022-03-22), 201 Olin Library Cornell University Ithaca, NY 14853, XP091182702 *
LIU CHEN; KIM KIHWAN; GU JINWEI; FURUKAWA YASUTAKA; KAUTZ JAN: "PlaneRCNN: 3D Plane Detection and Reconstruction From a Single Image", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 4445 - 4454, XP033686351, DOI: 10.1109/CVPR.2019.00458 *
SHIVAM DUGGAL; SHENLONG WANG; WEI-CHIU MA; RUI HU; RAQUEL URTASUN: "DeepPruner: Learning Efficient Stereo Matching via Differentiable PatchMatch", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 September 2019 (2019-09-12), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081482431 *

Also Published As

Publication number Publication date
WO2023023160A1 (en) 2023-02-23

Similar Documents

Publication Publication Date Title
Dosovitskiy et al. Flownet: Learning optical flow with convolutional networks
US11481869B2 (en) Cross-domain image translation
WO2021077140A2 (en) Systems and methods for prior knowledge transfer for image inpainting
CN113066017A (en) Image enhancement method, model training method and equipment
CN116745813A (en) Self-supervision type depth estimation framework for indoor environment
WO2022052782A1 (en) Image processing method and related device
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
WO2023102223A1 (en) Cross-coupled multi-task learning for depth mapping and semantic segmentation
US20220198731A1 (en) Pixel-aligned volumetric avatars
WO2022103877A1 (en) Realistic audio driven 3d avatar generation
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
WO2023069085A1 (en) Systems and methods for hand image synthesis
KR102299902B1 (en) Apparatus for providing augmented reality and method therefor
WO2023023162A1 (en) 3d semantic plane detection and reconstruction from multi-view stereo (mvs) images
WO2023027712A1 (en) Methods and systems for simultaneously reconstructing pose and parametric 3d human models in mobile devices
WO2023069086A1 (en) System and method for dynamic portrait relighting
CN117813626A (en) Reconstructing depth information from multi-view stereo (MVS) images
WO2023063937A1 (en) Methods and systems for detecting planar regions using predicted depth
WO2023091129A1 (en) Plane-based camera localization
CN117593702B (en) Remote monitoring method, device, equipment and storage medium
WO2023091131A1 (en) Methods and systems for retrieving images based on semantic plane features
WO2023172257A1 (en) Photometic stereo for dynamic surface with motion field
US20240153184A1 (en) Real-time hand-held markerless human motion recording and avatar rendering in a mobile platform

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22859112

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE