WO2023102223A1 - Cross-coupled multi-task learning for depth mapping and semantic segmentation - Google Patents

Cross-coupled multi-task learning for depth mapping and semantic segmentation

Info

Publication number
WO2023102223A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
network
generate
feature
map
Prior art date
Application number
PCT/US2022/051711
Other languages
French (fr)
Inventor
Nitin Bansal
Pan JI
Yi Xu
Junsong Yuan
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc.
Publication of WO2023102223A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for applying deep learning techniques to determine depth information and semantic information based on cross-channel coupling.
  • Various embodiments of this application are directed to methods, systems, devices, and non-transitory computer-readable media for multi-task learning of semantic segmentation and depth estimation based on cross-channel coupling, e.g., using a cross-channel attention module (CCAM).
  • the CCAM facilitates effective feature sharing between two channels of semantic segmentation and depth estimation, leading to mutual performance gains with a negligible increase in trainable parameters.
  • a data augmentation method is formulated for the semantic segmentation task using the predicted depth.
  • the CCAM and data augmentation enable performance gains for semantic segmentation and depth estimation and provide deep learning solutions based on a semi-supervised joint model.
  • image processing is implemented in an electronic system.
  • the method includes obtaining an input image and applying an encoder network to process the input image and generate an encoded feature map.
  • the method includes applying a first decoder network to generate a first decoded feature map based on the encoded feature map, applying a second decoder network to generate a second decoded feature map based on the encoded feature map, and combining the first and second decoded feature maps to generate a cross-channel attention modulated (CCAM) cross-feature score map.
  • the method further includes modifying the first decoded feature map in the first decoder network based on the CCAM cross-feature score map to generate a depth feature of the input image and modifying the second decoded feature map in the second decoder network based on the CCAM cross-feature score map to generate a semantic feature of the input image.
  • combining the first and second decoded feature maps to generate the CCAM cross-feature score map further includes applying at least a product operation to combine the first and second decoded feature maps and generate a cross-task feature map, applying a channel attention network to process the cross-task feature map and generate a cross-task affinity matrix, and combining the first and second decoded feature maps based on the cross-task affinity matrix to generate the CCAM cross-feature score map.
  • the cross-task affinity matrix includes a plurality of affinity scores, and each affinity score indicates an affinity level of a respective channel of the first decoded feature map with respect to a respective channel of the second decoded feature map.
  • combining the first and second decoded feature maps further includes applying a first spatial attention network on the first decoded feature map to generate a first self-attended feature map, applying a second spatial attention network on the second decoded feature map to generate a second self-attended feature map, transposing the first self-attended feature map to generate a transposed first self-attended feature map, and combining the transposed first self-attended feature map and the second self-attended feature map by the product operation.
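The three preceding bullets describe how the two decoded feature maps are fused. The PyTorch-style code below is a non-authoritative sketch of that flow under assumed layer shapes (the names `CCAMSketch`, `depth_feat`, and `seg_feat`, and the 1×1-convolution form of the spatial attention, are illustrative and not taken from the application): spatial attention is applied to each decoded map, the depth branch is transposed and multiplied with the segmentation branch to form a cross-task feature map, and a channel attention network turns it into a cross-task affinity matrix from which the cross-feature score maps are produced.

```python
import torch
import torch.nn as nn

class CCAMSketch(nn.Module):
    """Minimal sketch of the cross-channel attention combination (layer sizes assumed)."""
    def __init__(self, channels: int):
        super().__init__()
        # Spatial attention: per-pixel gating of each decoded feature map.
        self.spatial_attn_depth = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.spatial_attn_seg = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        # Channel attention producing a C x C cross-task affinity matrix.
        self.channel_attn = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, depth_feat, seg_feat):
        b, c, h, w = depth_feat.shape
        # Self-attended feature maps (spatial attention applied to each branch).
        d = depth_feat * self.spatial_attn_depth(depth_feat)       # B x C x H x W
        s = seg_feat * self.spatial_attn_seg(seg_feat)             # B x C x H x W
        # Flatten spatial dims, transpose the depth branch, and take the product
        # to obtain a cross-task feature map of shape B x C x C.
        d_flat = d.view(b, c, h * w)                               # B x C x HW
        s_flat = s.view(b, c, h * w)                               # B x C x HW
        cross_task = torch.bmm(d_flat, s_flat.transpose(1, 2))     # B x C x C
        # Channel attention over the cross-task map yields the affinity matrix.
        affinity = self.channel_attn(cross_task)                   # B x C x C, scores in (0, 1)
        # Cross-feature score maps: re-weight each branch by the affinity with the other.
        score_for_seg = torch.bmm(affinity, d_flat).view(b, c, h, w)
        score_for_depth = torch.bmm(affinity.transpose(1, 2), s_flat).view(b, c, h, w)
        return score_for_depth, score_for_seg
```

In use, the two decoded feature maps 512 and 514 from matching decoding stages would be passed in, and the returned score maps added back onto the corresponding branches before the next decoding stage.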
  • a data augmentation method is implemented at an electronic system. The method includes obtaining a first image captured by a camera and applying an encoder-decoder network to extract depth information and semantic information of the first image.
  • the method includes identifying an object (e.g., a moveable object) and an associated region of interest (ROI) in the first image based on the semantic information, adjusting the ROI in the first image to generate an adjusted ROI, and combining the first image and the adjusted ROI to generate a second image.
  • the method includes applying the second image to train the encoder-decoder network.
  • the ROI includes the object.
  • adjusting the ROI further includes determining a first depth of the object in a field of view of the camera based on the depth information and the semantic information of the first image, determining a scale factor based on the first depth and a target depth of the object, and scaling the ROI including the object by the scale factor to generate the adjusted ROI.
  • a target location of the adjusted ROI on the second image corresponds to the target depth in the field of view of the camera.
  • the semantic information includes a semantic mask identifying the object.
  • the method further includes, based on the semantic mask, identifying a set of object pixels corresponding to the object in the first image.
  • Adjusting the ROI includes adjusting at least one of a contrast level, a brightness level, a saturation level, and a hue level of the set of object pixels to generate the adjusted ROI.
  • the ROI has a first location in the first image, and the adjusted ROI has a second location in the second image. The first location is consistent with the second location.
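As a concrete illustration of the augmentation bullets above, the sketch below scales an object's ROI according to its predicted depth and pastes the adjusted ROI back into the image. It is a hedged sketch rather than the claimed procedure: the function name, the use of the median depth over the object pixels, the scale factor `first_depth / target_depth`, and the choice to anchor the pasted ROI at the original corner are all assumptions.

```python
import numpy as np
import cv2  # assumed available for resizing

def augment_with_scaled_object(image, depth, semantic_mask, object_id, target_depth):
    """Sketch: scale an object's ROI according to predicted depth and paste it back.

    image:          H x W x 3 uint8 first image
    depth:          H x W predicted depth map
    semantic_mask:  H x W integer mask from the segmentation branch
    object_id:      label of the (movable) object to adjust
    target_depth:   depth at which the object should appear in the second image
    """
    obj = semantic_mask == object_id
    if not obj.any():
        return image
    ys, xs = np.where(obj)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1  # ROI around the object

    # First depth of the object and a scale factor toward the target depth:
    # objects appear larger when closer, so scale ~ first_depth / target_depth (assumed).
    first_depth = float(np.median(depth[obj]))
    scale = first_depth / float(target_depth)

    roi = image[y0:y1, x0:x1]
    roi_mask = obj[y0:y1, x0:x1].astype(np.uint8)
    new_h = max(1, int(round((y1 - y0) * scale)))
    new_w = max(1, int(round((x1 - x0) * scale)))
    roi_scaled = cv2.resize(roi, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    mask_scaled = cv2.resize(roi_mask, (new_w, new_h), interpolation=cv2.INTER_NEAREST)

    # Paste the adjusted ROI; here it stays anchored at the original top-left corner and
    # is clipped to the image bounds (the target location rule is an assumption).
    out = image.copy()
    ph = min(new_h, out.shape[0] - y0)
    pw = min(new_w, out.shape[1] - x0)
    patch = roi_scaled[:ph, :pw]
    pmask = mask_scaled[:ph, :pw].astype(bool)
    region = out[y0:y0 + ph, x0:x0 + pw]
    region[pmask] = patch[pmask]
    return out
```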
  • some implementations include an electronic system or an electronic device, which includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • FIG. 2 is a block diagram illustrating an electronic system configured to process content data (e.g., speech data), in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based machine learning model for processing visual and/or speech data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based machine learning model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 is a block diagram of an example machine learning model applied to determine depth information and semantic information of an input image, in accordance with some embodiments.
  • Figure 6 is a block diagram of an example encoder-decoder network (e.g., a U-net) applied to process an input image based on a CCAM cross-feature score map, in accordance with some embodiments.
  • Figures 7A and 7B are block diagrams of two portions of a cross-channel attention module for determining cross-task affinity of two distinct decoded feature maps generated from an input image, in accordance with some embodiments.
  • Figure 8 is a block diagram of an example CCAM network for generating a cross-task affinity matrix associated with an input image, in accordance with some embodiments.
  • Figure 9 is a diagram of an example orthogonal loss generated by an encoder-decoder network applied to process an input image based on a CCAM cross-feature score map, in accordance with some embodiments.
  • Figure 10 is a flow diagram of an example data augmentation process for augmenting training data for a machine learning model for semantic segmentation, in accordance with some embodiments.
  • Figure 11 is a comparison of original images and augmented images, which are applied to train a machine learning model, in accordance with some embodiments.
  • Figure 12 is a flow diagram of an example data processing method, in accordance with some embodiments.
  • Figure 13 is a flow diagram of an example data augmentation method, in accordance with some embodiments.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C.
  • the networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or speech events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, speech data) obtained by an application executed at a client device 104 to enhance quality of the content data, identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • In some embodiments, machine learning models (e.g., models 240 in Figure 2) are trained with training data before they are applied to process the content data.
  • the client device 104 obtains the content data (e.g., captures audio data via a microphone) and processes the content data using the machine learning models locally.
  • machine learning models are trained and used to process one or more images captured by a camera to extract depth information and/or semantic information.
  • the machine learning models are optionally trained in a server 102, a client device 104, or a combination thereof.
  • the machine learning models are optionally applied in the server 102, client device 104, or a combination thereof to process the one or more images.
  • Examples of the client device 104 include, but are not limited to, a digital camera device 104E, mobile devices 104A-104C, and an HMD 104D.
  • both model training and data processing are implemented locally at each individual client device 104.
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the machine learning models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104.
  • the server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the machine learning models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained machine learning models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the machine learning models.
  • the trained machine learning models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained machine learning models from the server 102B or storage 106, processes the content data using the machine learning models and generates data processing results to be presented on a user interface locally.
  • a pair of AR glasses 104D are communicatively coupled in the data processing environment 100.
  • the HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., a gyroscope, accelerometer), and a display.
  • the camera and microphone are configured to capture video and speech data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the HMD 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model.
  • the microphone records ambient sound, including the user’s voice commands.
  • both video or static visual data captured by the camera and the inertial sensor data measured by one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, speech, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D.
  • FIG. 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture-capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable the presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer-readable storage medium. In some embodiments, memory 206, or the non-transitory computer-readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware-dependent tasks
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, speech and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web-based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a machine learning model 240 for processing content data (e.g., video, image, speech, or textual data) to be collected or obtained by a client device 104, where the model training module 226 includes a data augmentation module 227 for generating a second image from a first image by adjusting one or more ROIs in the first image;
  • Data processing module 228 for processing content data using machine learning models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 applies an encoder-decoder network (e.g., a network 600 in Figure 6) to generate feature maps in two distinct channels that are cross-modulated and applied to generate depth and semantic features of an input image;
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
    o Training data 238 for training one or more machine learning models 240; and
    o Machine learning model(s) 240 for processing content data (e.g., video, image, speech, or textual data) using deep learning techniques, where the machine learning models 240 include an encoder-decoder network having two parallel decoder networks 506 and 508 that are coupled with a CCAM network 522 (Figure 5) and configured to generate depth and semantic information of an input image.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the machine learning models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally stores a subset of the modules and data structures identified above.
  • memory 206 optionally stores additional modules and data structures not described above.
  • FIG. 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) machine learning model 240 for processing content data (e.g., video, image, speech, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the machine learning model 240 and a data processing module 228 for processing the content data using the machine learning model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides at least part of training data 238 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300.
  • the training data source 304 providing at least part of the training data 238 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained machine learning model 240 to the client device 104.
  • a first subset of training data is augmented to generate a second subset of training data.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the machine learning model 240 is trained according to the type of content data to be processed.
  • the training data 238 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 238 consistent with the type of the content data.
  • an image pre-processing module 308A is configured to process image training data 238 to a predefined image format, e.g., extract an ROI in each training image, and crop each training image to a predefined image size.
  • a speech pre-processing module 308B is configured to process speech training data 238 to a predefined speech format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing machine learning model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the machine learning model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified machine learning model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308, and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, speech, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • a speech clip is pre-processed and converted to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained machine learning model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based machine learning model 240, in accordance with some embodiments
  • Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the machine learning model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the machine learning model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s). As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. For example, the propagation function is a product of a nonlinear activation function and a linear weighted combination of the node input(s).
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the layer(s) may include a single layer acting as both an input layer and an output layer.
  • the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406.
  • each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
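A small numeric sketch of the node behavior described above (the weights, bias, activation choice, and pooling values here are arbitrary examples, not values from the application):

```python
import numpy as np

def node_output(inputs, weights, bias):
    """Propagation function of a single node 420: a nonlinear activation applied to
    a linear weighted combination of the node inputs (one possible form, as a sketch)."""
    z = np.dot(weights, inputs) + bias          # linear combination w1*x1 + ... + w4*x4 + b
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid activation as one illustrative choice

# Example: four node inputs combined with weights w1..w4 and a bias term b.
x = np.array([0.2, -0.5, 0.1, 0.7])
w = np.array([0.4, 0.3, -0.2, 0.6])
y = node_output(x, w, bias=0.05)

# Max pooling between two layers: the downstream node keeps the maximum value of the
# two or more upstream nodes connected to it.
pooled = max(y, node_output(x, np.array([0.1, 0.2, 0.3, 0.4]), bias=0.0))
```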
  • a convolutional neural network is applied in a machine learning model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
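As a minimal illustration of the CNN bullets above (the layer sizes are assumptions), a single convolutional layer maps a pre-processed image to a feature map while each output location only sees a small receptive area of the previous layer:

```python
import torch
import torch.nn as nn

# One CNN hidden layer as a sketch: each output node only sees a small receptive
# area of the previous layer (here an assumed 5x5 window) rather than the entire
# layer, and the layer abstracts the input image to a feature map.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)
image = torch.rand(1, 3, 224, 224)    # pre-processed RGB image in a predefined format
feature_map = conv(image)             # shape: (1, 16, 224, 224)
```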
  • a recurrent neural network is applied in the machine learning model 240 to process content data (particularly, textual and speech data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for handwriting or speech recognition.
  • the training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • forward propagation the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • backward propagation a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid over fitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
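The training loop below is a generic sketch of the forward/backward propagation procedure just described; the toy model, data, learning rate, and loss threshold are illustrative assumptions rather than details from the application.

```python
import torch
import torch.nn as nn

# Minimal sketch of the training process: forward propagation through the layers,
# a loss measuring the margin of error against the ground truth, and backward
# propagation that adjusts the weights (and bias terms b) to decrease the error.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # weights w and biases b
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

inputs = torch.rand(32, 8)          # training data provided to the input layer
targets = torch.rand(32, 1)         # desired outputs (ground truth)

for _ in range(100):                # repeated until a predefined convergence condition
    pred = model(inputs)            # forward propagation
    loss = loss_fn(pred, targets)   # margin of error of the output
    optimizer.zero_grad()
    loss.backward()                 # backward propagation
    optimizer.step()                # weights adjusted to decrease the error
    if loss.item() < 1e-3:          # loss criterion (e.g., reduced below a loss threshold)
        break
```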
  • FIG. 5 is a block diagram of an example machine learning model 240 applied (e.g., by a data processing module 228 of an electronic device) to determine depth information and semantic information of an input image 502, in accordance with some embodiments.
  • the machine learning model 240 includes an encoder network 504, a first decoder network 506, and a second decoder network 508.
  • the encoder network 504 obtains the input image 502 and processes the input image 502 to generate an encoded feature map 510.
  • the input image 502 is captured by an electronic device, and is optionally one of a sequence of image frames. Both the first and second decoder networks 506 and 508 are coupled to the encoder network 504.
  • the first decoder network 506 is applied to generate a first decoded feature map 512 based on the encoded feature map 510
  • the second decoder network 508 is applied to generate a second decoded feature map 514 based on the encoded feature map 510.
  • the first decoded feature map 514 optionally matches (e.g., has the same resolution as) the first decoded feature map 512.
  • the first and second decoded feature maps 512 and 514 are combined to generate a CCAM cross-feature score map 516.
  • the first decoded feature map 512 in the first decoder network 506 is modified based on the CCAM cross-feature score map 516 to generate a depth feature 518 of the input image 502.
  • the second decoded feature map 514 in the second decoder network 508 is modified based on the CCAM cross-feature score map 516 to generate a semantic feature 520 of the input image 502.
  • each of the first and second decoder networks 506 and 508 includes a plurality of successive stages configured to generate a plurality of intermediate feature maps.
  • the first decoded feature map 512 is one of the plurality of intermediate feature maps generated by the first decoder network 506, and the second decoded feature map 514 is one of the plurality of intermediate feature maps generated by the second decoder network 508.
  • the first and second decoded feature maps 512 and 514 are combined using a CCAM network 522 to generate the CCAM cross-feature score map 516.
  • the CCAM network 522 includes convolutional, global average pooling, and fully connected layers, and is configured to compute spatial and cross-channel attention.
  • the CCAM network 522 includes a first spatial attention network 702, a second spatial attention network 706, and a channel attention network 716 (Figure 7A).
  • the input image 502 includes one of a plurality of training images.
  • the encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using the plurality of training images in an end-to-end manner. Additionally, in some embodiments, the encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using a comprehensive loss, and the comprehensive loss is a combination of a depth loss 524, a semantics loss 526, and an orthogonal loss 528. The orthogonal loss 528 is determined based on parameters of the encoder network 504, first decoder network 506, and second decoder network 508.
  • the depth loss 524 is determined based on the depth feature 518 and a respective pose 530 of an electronic device that captures the training images.
  • the respective pose 530 optionally includes a position and an orientation of the electronic device when the respective training image is captured.
  • a pose network 534 is applied to determine the respective pose 530 from the input image 502 or a sequence of image frames 532 including the input image 502.
  • each training image (e.g., the input image 502) is provided with a ground truth semantic label 536, and the encoder network 504 and second decoder network 508 are trained in a supervised manner for semantic segmentation. Specifically, during training, weights of the encoder network 504 and second decoder network 508 are adjusted to control the semantic loss 526 determined based on the semantic feature 520 and ground truth semantic label 536.
  • the machine learning model 240 is applied to determine how exactly intermediate features (e.g., decoded feature maps 512 and 514) are associated with different tasks (depth estimation and semantic segmentation) and interact with each other.
  • the same encoder network 504 is applied with two separate decoder networks 506 and 508 associated with two different tasks, and the CCAM cross-feature score map 516 is shared by decoder layers of the two separate decoder networks 506 and 508.
  • hard parameter sharing and soft parameter sharing are applied to facilitate both flexibility and inter-feature learnability.
  • the CCAM network 522 enforces dual attention on intermediate depth and segmentation features over both spatial and channel dimensions to emphasize inter-channel interaction between the two different tasks of depth estimation and semantic segmentation.
  • the CCAM network 522 linearly weighs a contribution of features from each task before sharing, and thus encourages a more informed and reliable feature transfer between the two tasks. Specifically, the CCAM network 522 estimates cross-channel affinity scores between the task feature maps 512 and 514, which enables better inter-task feature transfer and results in a mutual performance gain.
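Putting the Figure 5 pieces together, the sketch below shows one possible arrangement of the shared encoder 504, the two decoder networks 506 and 508, the CCAM 522, and a comprehensive loss combining the depth loss 524, semantics loss 526, and orthogonal loss 528. The module interfaces (`decode_partial`, `finish`), the additive use of the score maps, and the loss weights are assumptions for illustration only, not the claimed implementation.

```python
import torch
import torch.nn as nn

class DualTaskModelSketch(nn.Module):
    """Sketch of the Figure 5 arrangement: one encoder 504, two decoders 506/508,
    and a CCAM 522 exchanging features between the two branches (names assumed)."""
    def __init__(self, encoder, depth_decoder, seg_decoder, ccam):
        super().__init__()
        self.encoder = encoder
        self.depth_decoder = depth_decoder   # first decoder network 506
        self.seg_decoder = seg_decoder       # second decoder network 508
        self.ccam = ccam                     # cross-channel attention module 522

    def forward(self, image):
        feats = self.encoder(image)                             # encoded feature map 510
        depth_feat = self.depth_decoder.decode_partial(feats)   # first decoded feature map 512
        seg_feat = self.seg_decoder.decode_partial(feats)       # second decoded feature map 514
        score_depth, score_seg = self.ccam(depth_feat, seg_feat)
        depth = self.depth_decoder.finish(depth_feat + score_depth)  # depth feature 518
        seg = self.seg_decoder.finish(seg_feat + score_seg)          # semantic feature 520
        return depth, seg

def comprehensive_loss(depth_loss, semantics_loss, orthogonal_loss,
                       w_depth=1.0, w_sem=1.0, w_orth=0.1):
    # Comprehensive loss as a weighted combination of the depth loss 524,
    # semantics loss 526, and orthogonal loss 528 (the weights are illustrative).
    return w_depth * depth_loss + w_sem * semantics_loss + w_orth * orthogonal_loss
```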
  • FIG. 6 is a block diagram of an example encoder-decoder network (e.g., a U-net) 600 applied to process an input image 502 based on a CCAM cross-feature score map 516, in accordance with some embodiments.
  • An electronic device employs the encoder-decoder network 600 to generate an output feature 604 (e.g., a depth feature 518, a semantic feature 520).
  • the input image is processed successively by a set of downsampling stages (i.e., encoding stages) 606 to extract a series of encoded feature maps, as well as to reduce spatial resolutions of these feature maps successively.
  • An encoded feature map 510 outputted by the encoding stages 606 is then processed by a bottleneck network 608 followed by a set of upscaling stages (i.e., decoding stages) 610.
  • the series of decoding stages 610 include the same number of stages as the series of encoding stages 606.
  • an input feature map 616 is upscaled and concatenated with a pooled feature map (i.e., a skip connection) 614 of the same resolution from the encoding stage 606 to effectively preserve the details in the input image 602.
  • the encoder-decoder network 600 has four encoding stages 606A-606D and four decoding stages 610A-610D.
  • the bottleneck network 608 is coupled between the encoding stages 606 and decoding stages 610.
  • the input image 502 is successively processed by the series of encoding stages 606A-606D, the bottleneck network 608, and the series of decoding stages 610A-610D to generate the output feature 604.
  • the input image 502 is divided into a plurality of image tiles 602, and each of the plurality of image tiles 602 is processed using the encoder-decoder network 600. After all of the image tiles 602 in the input image 502 are successively processed using the encoder-decoder network 600, output features 604 corresponding to the plurality of image tiles 602 are collected and combined to one another to reconstruct a comprehensive output feature 604 corresponding to the input image 502.
  • the series of encoding stages 606 include an ordered sequence of encoding stages 606, e.g., stages 606A, 606B, 606C, and 606D, and have an encoding scale factor.
  • Each encoding stage 606 generates an encoded feature map 612 having a feature resolution and a number of encoding channels.
  • the feature resolution is scaled down and the number of encoding channels is scaled up according to the encoding scale factor.
  • the encoding scale factor is 2.
  • a first encoded feature map 612A of a first encoding stage 606A has a first feature resolution (e.g., H×W) related to the image resolution and a first number of (e.g., NCH) encoding channels.
  • a second encoded feature map 612B of a second encoding stage 606B has a second feature resolution (e.g., ½H×½W) and a second number of (e.g., 2NCH) encoding channels.
  • a third encoded feature map 612C of a third encoding stage 606C has a third feature resolution (e.g., ¼H×¼W) and a third number of (e.g., 4NCH) encoding channels.
  • a fourth encoded feature map 612D of a fourth encoding stage 606D has a fourth feature resolution (e.g., ⅛H×⅛W) and a fourth number of (e.g., 8NCH) encoding channels.
  • the encoded feature map 612 is processed and provided as an input to a next encoding stage 606, except that the encoded feature map 612 of a last encoding stage 606 (e.g., stage 606D in Figure 6) is processed and provided as an input to the bottleneck network 608. Additionally, for each encoding stage 606, the encoded feature map 612 is converted to generate a pooled feature map 614, e.g., using a max pooling layer. The pooled feature map 614 is temporarily stored in memory and extracted for further processing by a corresponding decoding stage 610. Stated another way, the pooled feature maps 614A-614D are stored in the memory as skip connections that skip part of the encoder-decoder network 600.
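A hedged sketch of the encoding path just described, with assumed layer choices (two 3×3 convolutions per stage, NCH = 16): each stage produces an encoded feature map 612, and max pooling converts it into a pooled feature map 614 that is kept as a skip connection and passed on, with resolution halving and channels doubling per stage.

```python
import torch
import torch.nn as nn

class EncodingStageSketch(nn.Module):
    """One encoding stage 606 as a sketch (layer sizes are assumptions): convolutions
    produce the encoded feature map 612, and max pooling converts it into the pooled
    feature map 614 that is stored as a skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(2)   # encoding scale factor of 2

    def forward(self, x):
        encoded = self.block(x)       # encoded feature map 612 (channels scaled up)
        pooled = self.pool(encoded)   # pooled feature map 614 (resolution scaled down)
        return encoded, pooled

# Four stages 606A-606D: resolutions H x W, 1/2 H x 1/2 W, 1/4 H x 1/4 W, 1/8 H x 1/8 W
# with NCH, 2NCH, 4NCH, and 8NCH channels (NCH = 16 is an assumed example).
n_ch = 16
stages = nn.ModuleList(
    EncodingStageSketch(3 if i == 0 else n_ch * 2 ** (i - 1), n_ch * 2 ** i) for i in range(4))

x = torch.rand(1, 3, 256, 256)
skips = []
for stage in stages:
    encoded, x = stage(x)   # the processed (pooled) map is provided to the next stage
    skips.append(x)         # pooled feature maps 614A-614D kept as skip connections
# x, the output of the last encoding stage, is then fed to the bottleneck network 608.
```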
  • the bottleneck network 608 is coupled to the last stage of the encoding stages 606 (e.g., stage 606D in Figure 6), and continues to process the total number of encoding channels of the encoded feature map 612D of the last encoding stage 606D and generate an intermediate feature map 616A (i.e., a first input feature map 616A to be used by a first decoding stage 610A).
  • the bottleneck network 608 includes a first set of 3×3 CNN and Rectified Linear Unit (ReLU), a second set of 3×3 CNN and ReLU, a global pooling network, a bilinear upsampling network, and a set of 1×1 CNN and ReLU.
  • the encoded feature map 612D of the last encoding stage 606D is normalized (e.g., using a pooling layer), and fed to the first set of 3×3 CNN and ReLU of the bottleneck network 608.
  • a bottleneck feature map 616A is outputted by the set of 1×1 CNN and ReLU of the bottleneck network 608 and provided to the decoding stages 610.
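The bottleneck can be sketched as follows, under the assumption that the globally pooled context is upsampled bilinearly and added back before the final 1×1 convolution (the merge step and channel counts are assumptions; only the listed layer types come from the bullet above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckSketch(nn.Module):
    """Sketch of the bottleneck network 608: two 3x3 CNN + ReLU sets, a global
    pooling network, bilinear upsampling back to the feature resolution, and a
    final 1x1 CNN + ReLU producing the bottleneck feature map 616A."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.out = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        y = self.conv2(self.conv1(x))
        g = self.global_pool(y)                                  # global pooling network
        g = F.interpolate(g, size=(h, w), mode='bilinear',
                          align_corners=False)                   # bilinear upsampling network
        return self.out(y + g)                                   # 1x1 CNN + ReLU -> feature map 616A
```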
  • the series of decoding stages 610 include an ordered sequence of decoding stages 610, e.g., stages 610A, 610B, 610C, and 610D, and have a decoding upsampling factor.
  • Each decoding stage 610 generates a decoded feature map 618 having a feature resolution and a number of decoding channels.
  • the feature resolution is scaled up and the number of decoding channels is scaled down according to the decoding upsampling factor.
  • the decoding upsampling factor is 2.
  • a first decoded feature map 618A of a first decoding stage 610A has a first feature resolution (e.g., ⅛H'×⅛W') and a first number of (e.g., 8NCH') decoding channels.
  • a second decoded feature map 618B of a second decoding stage 610B has a second feature resolution (e.g., ¼H'×¼W') and a second number of (e.g., 4NCH') decoding channels.
  • a third decoded feature map 618C of a third decoding stage 610C has a third feature resolution (e.g., ½H'×½W') and a third number of (e.g., 2NCH') decoding channels.
  • a fourth decoded feature map 618D of a fourth decoding stage 610D has a fourth feature resolution (e.g., H'×W') related to a resolution of the output image 604 and a fourth number of (e.g., NCH') decoding channels.
  • For each decoding stage 610, the decoded feature map 618 is processed and provided as an input feature map 616 to a next decoding stage 610, except that the decoded feature map 618 of a last decoding stage 610 (e.g., stage 610D in Figure 6) is processed to generate the output feature 604.
  • the decoded feature map 618D of the last decoding stage 610D is processed by a 1×1 CNN 622 to generate the output feature 604.
  • each respective decoding stage 610 combines the pooled feature map 614 with an input feature map 616 of the respective decoding stage 610 using a set of neural networks 624.
  • Each respective decoding stage 610 and the corresponding encoding stage 606 are symmetric with respect to the bottleneck network 608, i.e., separated from the bottleneck network 608 by the same number of decoding or encoding stages 610 or 606.
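One decoding stage 610 might look like the following sketch (the transposed-convolution upscaling, the two 3×3 convolutions standing in for the set of neural networks 624, and the channel counts are assumptions):

```python
import torch
import torch.nn as nn

class DecodingStageSketch(nn.Module):
    """Sketch of one decoding stage 610: the input feature map 616 is upscaled by
    the decoding upsampling factor of 2, concatenated with the pooled feature map
    614 (skip connection) of the same resolution, and processed by a small set of
    neural networks 624 to form the decoded feature map 618."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.block = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                      # resolution scaled up, channels scaled down
        x = torch.cat([x, skip], dim=1)     # concatenate the skip connection 614
        return self.block(x)                # decoded feature map 618

# The decoded feature map 618D of the last decoding stage 610D is processed by a
# 1x1 CNN 622 to generate the output feature 604 (channel counts assumed).
head = nn.Conv2d(16, 1, kernel_size=1)
```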
  • One of the series of decoding stages 610 is coupled to a CCAM 522.
  • the CCAM 522 receives from the one of the series of decoding stages 610, and modifies, one of the decoded feature maps 618A, 618B, and 618C.
  • the modified one of the decoded feature maps 618A, 618B, and 618C is provided as an input feature map 616 to a next decoding stage 610 that immediately follows the one of the series of decoding stages 610.
  • the CCAM 522 is coupled to the second decoding stage 610B.
  • the CCAM 522 receives from the second decoding stage 610B, and modifies, the decoded feature map 618B, and the modified feature map 618B is used as an input feature map 616C provided to the third decoding stage 610C.
  • the CCAM 522 is coupled to the first decoding stage 610A, which leads a plurality of successive decoding stages 610B-610D, and modifies the decoded feature map 618A provided by the first decoding stage 610A.
  • the CCAM 522 is coupled to the third decoding stage 610C, and modifies the decoded feature map 618C provided by the third decoding stage 610C.
  • the CCAM 522 is coupled to two or more of the series of decoding stages 610, and modifies each of two or more of the decoded feature maps 618 outputted by the two or more of the series of decoding stages 610.
  • the encoder-decoder network 600 has a second series of decoding stages 630.
  • the series of encoding stages 606 is coupled to another series of decoding stages 630 in addition to the series of decoding stages 610.
  • the decoding stages 610 and 630 are configured to output the depth feature 518 and the semantic feature 520, respectively.
  • the CCAM 522 modifies one of the decoded feature maps 618A, 618B, and 618C based on a corresponding decoded feature map 626 provided by the series of decoding stages 630.
  • the CCAM 522 modifies the decoded feature map 618B received from the second decoding stage 610B based on a decoded feature map received from a second decoding stage of the series of decoding stages 630, and the modified decoded feature map 618B is used as an input feature map 616C provided to the third decoding stage 610C. More specifically, in some situations, the CCAM 522 combines the decoded feature map 618B received from the second decoding stage 610B and the decoded feature map received from the second decoding stage of the series of decoding stages 630 to generate a CCAM cross-feature score map 516 (Figure 5). The decoded feature map 618B is modified based on the CCAM cross-feature score map 516 before being applied as the input feature map 616C to the third decoding stage 610C.
  • the first decoder network 506 has a first number of successive decoding stages 610
  • the second decoder network 508 has a second number of successive decoding stages 630.
  • the second number is equal to the first number.
  • Each of the first number of successive decoding stages 610 of the first decoder network 506 corresponds to a respective second decoding stage 630 of the second decoder network 508 and is configured to generate a respective first intermediate feature map 618 having the same resolution as a respective second intermediate feature map generated by the respective second decoding stage 630 of the second decoder network 508.
  • the first decoded feature map 512 corresponds to one of the respective first intermediate feature maps 618 generated by the first decoder network 506, and the second decoded feature map 514 corresponds to one of the respective decoded feature maps 626 generated by the second decoder network 508.
  • the first decoded feature map 512 matches, and has the same resolution as, the second decoded feature map 514.
  • the encoder network 504 and the first decoder network 506 form a first U-Net, and each encoding stage 606 of the encoder network 504 is configured to provide a first skip connection 614 to a respective decoding stage 610 of the first decoder network 506.
  • the encoder network 504 and the second decoder network 508 form a second U-Net, and each encoding stage 606 of the encoder network 504 is configured to provide a second skip connection 614 to a respective decoding stage of the second decoder network 508.
  • the second skip connection is the same as the first skip connection (e.g., 614A, 614B, 614C, and 614D).
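By way of illustration only (and not as a definition of the disclosed embodiments), the shared-encoder, dual-decoder U-Net arrangement described above can be sketched in PyTorch as follows. The class names, channel widths, and the additive style of the skip connections are assumptions of this sketch; the reference numerals (504, 506, 508, 606, 610, 614, 630) identify the actual stages in the figures.

    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        """Encoding stages (cf. 606); each stage halves the resolution and exposes a skip feature."""
        def __init__(self, chs=(3, 32, 64, 128, 256)):
            super().__init__()
            self.stages = nn.ModuleList([
                nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1), nn.ReLU())
                for i in range(len(chs) - 1)])

        def forward(self, x):
            skips = []
            for stage in self.stages:
                x = stage(x)
                skips.append(x)                   # the same skip features feed both decoders
            return x, skips

    class Decoder(nn.Module):
        """Successive decoding stages (cf. 610 or 630); each stage upsamples and fuses an encoder skip."""
        def __init__(self, chs=(256, 128, 64, 32), out_ch=1):
            super().__init__()
            self.stages = nn.ModuleList([
                nn.Sequential(nn.ConvTranspose2d(chs[i], chs[i + 1], 4, stride=2, padding=1), nn.ReLU())
                for i in range(len(chs) - 1)])
            self.head = nn.Sequential(nn.ConvTranspose2d(chs[-1], chs[-1], 4, stride=2, padding=1),
                                      nn.ReLU(), nn.Conv2d(chs[-1], out_ch, 3, padding=1))

        def forward(self, x, skips):
            for i, stage in enumerate(self.stages):
                x = stage(x) + skips[-(i + 2)]    # additive skip connection at matching resolution
                # an intermediate decoded feature map could be tapped here and handed to the CCAM
            return self.head(x)

    class DualHeadUNet(nn.Module):
        """Shared encoder with two parallel decoders for depth and semantics."""
        def __init__(self, num_classes=19):
            super().__init__()
            self.encoder = SharedEncoder()
            self.depth_decoder = Decoder(out_ch=1)
            self.seg_decoder = Decoder(out_ch=num_classes)

        def forward(self, image):
            bottleneck, skips = self.encoder(image)
            return self.depth_decoder(bottleneck, skips), self.seg_decoder(bottleneck, skips)

    # depth, seg = DualHeadUNet()(torch.randn(1, 3, 128, 256))

In this sketch both decoders consume the same bottleneck and the same skip features, mirroring the statement that the second skip connection is the same as the first.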
  • Figures 7A and 7B are block diagrams of two portions 522A and 522B of a cross-channel attention module (CCAM) 522 for determining cross-task affinity of two distinct decoded feature maps 512 and 514 generated from an input image 502, in accordance with some embodiments.
  • a first spatial attention network 702 is applied on a first decoded feature map 512 (e.g., associated with a depth feature 518) to generate a first self-attended feature map 704.
  • a second spatial attention network 706 is applied on a second decoded feature map 514 (e.g., associated with a semantic feature 520) to generate a second self-attended feature map 708.
  • the first self-attended feature map 704 is transposed to generate a transposed first self-attended feature map 710.
  • the transposed first self-attended feature map 710 and the second self-attended feature map 708 are combined by a product operation 712 to generate a cross-task feature map 714.
  • a channel attention network 716 is applied to process the cross-task feature map 714 and generate a cross-task affinity matrix 718.
  • the cross-task affinity matrix 718 includes a plurality of affinity scores 720, and each affinity score 720 indicates an affinity level of a respective channel of the first decoded feature map 512 with respect to a respective channel of the second decoded feature map 514.
  • the first decoded feature map 512, second decoded feature map 514, and cross-task affinity matrix 718 are combined to generate the CCAM cross-feature score map 516 and modify the first and second decoded feature maps 512 and 514.
  • the first decoded feature map 512 is combined with the cross-task affinity matrix 718 to generate a first CCAM cross-feature score map 516.
  • the second decoded feature map 514 is combined with the first CCAM cross-feature score map 516 to modify the second decoded feature map 514 and generate a modified second decoded feature map 514’.
  • the second decoded feature map 514 is combined with a transposed cross-task affinity matrix 718’ to generate a second CCAM cross-feature score map 516’.
  • the first decoded feature map 512 is combined with the second CCAM cross-feature score map 516’ to modify the first decoded feature map 512 and generate a modified first decoded feature map 512’.
  • the modified first and second decoded feature maps 512’ and 514’ are provided to their corresponding first and second decoder networks 506 and 508, and continue to be processed by respective decoding stage(s) 610 therein.
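The CCAM data flow of Figures 7A and 7B (spatial attention, transpose-and-product, channel attention, and affinity-weighted modification) can be illustrated with the following hedged sketch. The convolution and fully connected layer sizes, the normalization by the number of pixels, and the additive fusion of the weighted features are assumptions of the sketch, not the definition of blocks 702, 706, and 716 or of the score maps 516 and 516'.

    import torch
    import torch.nn as nn

    class CCAMSketch(nn.Module):
        """Hedged sketch of a cross-channel attention module coupling two C-channel decoded feature maps."""
        def __init__(self, channels: int):
            super().__init__()
            # spatial attention networks (cf. 702/706): a small stack of 2D convolutions per task
            self.spatial_a = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                                           nn.Conv2d(channels, channels, 3, padding=1))
            self.spatial_b = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                                           nn.Conv2d(channels, channels, 3, padding=1))
            # channel attention network (cf. 716): fully connected layers with a sigmoid at the end
            self.channel_fc = nn.Sequential(nn.Linear(channels, channels // 4), nn.ReLU(),
                                            nn.Linear(channels // 4, channels), nn.Sigmoid())

        def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
            n, c, h, w = feat_a.shape
            a_sf = self.spatial_a(feat_a).flatten(2)                 # (N, C, H*W), self-attended map A_SF (704)
            b_sf = self.spatial_b(feat_b).flatten(2)                 # (N, C, H*W), self-attended map B_SF (708)
            # transposed product -> cross-task feature map (714); dividing by H*W averages over pixels
            cross = torch.bmm(a_sf, b_sf.transpose(1, 2)) / (h * w)  # (N, C, C)
            affinity = self.channel_fc(cross)                        # (N, C, C) cross-task affinity matrix (718)
            # each map is modified with the affinity-weighted channels of the other map (cf. 516 and 516')
            a_mod = feat_a + torch.einsum('nij,njhw->nihw', affinity, feat_b)
            b_mod = feat_b + torch.einsum('nij,nihw->njhw', affinity, feat_a)
            return a_mod, b_mod, affinity

    # a2, b2, aff = CCAMSketch(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))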
  • Cross-task feature transfer is divided into three sub-categories: (i) sharing initial layers to facilitate learning common features for complementary tasks, (ii) using adversarial networks to learn a common feature representation, and (iii) learning different but related feature representations.
  • a multimodal distillation block (e.g., a CCAM network 522) is applied to share cross-task features through message passing and simulate a gating mechanism, as shown in equations (1) and (2), by leveraging spatial attention maps of the individual features of all tasks. This helps decide what features of a given task are shared with other tasks.
  • a total number of T tasks are trained, and F_i^k denotes the i-th feature of the k-th task before message passing, and F_i^{o,k} after message passing.
  • the message transfer is defined as
        F_i^{o,k} = F_i^k + Σ_{l≠k} G_i^l ⊙ (W_{l,k} ⊛ F_i^l),    (1)
    where ⊙ means element-wise product, ⊛ represents the convolution operation, W_{l,k} represents the convolution block, and G_i^k denotes the gating matrix for the i-th feature of the k-th task:
        G_i^k = σ(W_k ⊛ F_i^k),    (2)
    where W_k is a convolution block and σ denotes the sigmoid operator. According to equation (1), cross-task features are only shared naively across the channel dimension. Suppose we are training simultaneously for two tasks, namely F^k and F^l.
  • Equation (1) indirectly implies that the i-th channel-feature of F^k is only important to the i-th channel-feature of F^l, which is not necessarily true in all scenarios.
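For comparison, a minimal sketch of the naive gated message passing of equations (1) and (2), which couples only same-index channels across the two tasks, might look as follows; the 3x3 convolutions and the restriction to two tasks are assumptions of the sketch.

    import torch
    import torch.nn as nn

    class GatedMessagePassing(nn.Module):
        """Sketch of the per-channel message passing of equations (1)-(2) for two tasks."""
        def __init__(self, channels: int):
            super().__init__()
            self.w_gate_a = nn.Conv2d(channels, channels, 3, padding=1)   # W_k for the gate of task A
            self.w_gate_b = nn.Conv2d(channels, channels, 3, padding=1)   # W_l for the gate of task B
            self.w_b_to_a = nn.Conv2d(channels, channels, 3, padding=1)   # W_{l,k}: message from B to A
            self.w_a_to_b = nn.Conv2d(channels, channels, 3, padding=1)   # W_{k,l}: message from A to B

        def forward(self, f_a, f_b):
            g_a = torch.sigmoid(self.w_gate_a(f_a))       # G^k = sigma(W_k ⊛ F^k)
            g_b = torch.sigmoid(self.w_gate_b(f_b))       # G^l = sigma(W_l ⊛ F^l)
            # element-wise gated exchange: channel i of one task only talks to channel i of the other
            f_a_out = f_a + g_b * self.w_b_to_a(f_b)      # F^{o,k} = F^k + G^l ⊙ (W_{l,k} ⊛ F^l)
            f_b_out = f_b + g_a * self.w_a_to_b(f_a)
            return f_a_out, f_b_out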
  • a module that calculates an affinity vector that gives an estimate about how the i-th channel of task F^k is related to any j-th channel of task F^l.
  • the entire process of building scores of inter-task channels is subdivided into four sub-blocks, and processes two tasks, Task A and Task B.
  • Intermediate decoded output features 512 and 514 are extracted from the respective decoder modules, represented by A_F and B_F respectively, where A_F, B_F ∈ ℝ^(N×C×H×W).
  • the intermediate features 512 and 514 are passed through a spatial attention network 702 (e.g., including a sequence of convolutional blocks) to compute self-attended feature maps 704 and 708 (i.e., A_SF and B_SF), as shown in equation (3), by applying the spatial attention network to A_F and B_F, respectively.
  • Refined features are obtained from both tasks before estimating their cross-correlation.
  • the output of this layer preserves the spatial resolution of the input features and gives output features represented by A_SF and B_SF, respectively.
  • a cross-task relation matrix CM (i.e., the cross-task feature map 714) is determined for each channel i of A_SF, where CM_i ∈ ℝ^(N×C×H×W).
  • the resultant matrix CM is passed to a channel attention network 716, which estimates the affinity vector z between an i-th channel of A_SF and all the channels of B_SF. In some embodiments, a combination of a global average pooling layer followed by fully connected layers, with a sigmoid layer at the end, serves as the channel attention network 716. This operation is repeated for all the channels of A_SF to get the corresponding affinity vectors.
  • affinity scores are accumulated for all channels of A_SF to achieve a cross-task affinity matrix 718.
  • the cross-task affinity matrix 718 serves as a score accumulator, which helps obtain linearly weighted features A_F’ and B_F’ (i.e., the modified first and second decoded feature maps 512’ and 514’ in Figure 7B).
  • Figure 8 is a block diagram of an example CCAM network 522 for generating a cross-task affinity matrix 718 associated with an input image 502, in accordance with some embodiments.
  • the CCAM network 522 is divided into three blocks 802, 804, and 806 to serve distinct purposes.
  • a first block 802 includes a spatial attention network 702 (e.g., two successive two-dimensional convolutional layers).
  • a second block 804 is configured to estimate cross-feature correlation and follow it up with a channel attention network 716 to provide a plurality of feature-channel affinity scores 812.
  • a third block 806 applies a simple channel-wise accumulation of the affinity scores 812 to generate the cross-task affinity matrix 718. Additional operations are implemented to determine the cross-task affinity matrix 718, depth feature 518, and semantic feature 520 as follows:
  •     # Score accumulation into the cross-task affinity matrix CT:
        i ← 0
        while i < C do
            CT ← CT + a_i        # concatenate across the channel dimension
        end while
        # Mutual feature sharing between tasks:
        i ← 0
        while i < C do
            X_seg ← X_seg + X_depth,i * CT_i        # along the row dimension
        end while
        j ← 0
        while j < C do
            X_depth ← X_depth + X_seg,j * CT_j        # along the column dimension
        end while
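The accumulation and mutual-sharing loops above can be written, under the assumption that the per-channel affinity vectors a_i have already been computed by the channel attention network, roughly as follows; the tensor shapes and the cloning of the inputs are choices of the sketch, not requirements of the embodiments.

    import torch

    def mutual_feature_sharing(x_depth, x_seg, affinity_rows):
        """Sketch of the channel-wise accumulation and mutual feature sharing.

        x_depth, x_seg: (N, C, H, W) decoded feature maps.
        affinity_rows: list of C per-channel affinity vectors a_i, each of shape (N, C).
        """
        # accumulate the per-channel affinity vectors a_i into the C x C cross-task matrix CT
        ct = torch.stack(affinity_rows, dim=1)        # (N, C, C); CT[i, j] relates depth ch. i to seg ch. j
        n, c, h, w = x_depth.shape
        x_seg_out, x_depth_out = x_seg.clone(), x_depth.clone()
        for i in range(c):                            # share along the row dimension
            x_seg_out = x_seg_out + x_depth[:, i:i + 1] * ct[:, i, :].reshape(n, c, 1, 1)
        for j in range(c):                            # share along the column dimension
            x_depth_out = x_depth_out + x_seg[:, j:j + 1] * ct[:, :, j].reshape(n, c, 1, 1)
        return x_depth_out, x_seg_out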
  • Figure 9 is a diagram of an example orthogonal loss 528 generated by an encoder-decoder network (e.g., in Figure 5) applied to process an input image 502 based on a CCAM cross-feature score map 516, in accordance with some embodiments.
  • the orthogonal loss 528 is applied to improve accuracy levels of a depth feature 518 and a semantic feature 520.
  • An average inter-channel correlation is estimated for decoder layers of decoding stages of the first and second decoder networks 506 and 508 associated with depth estimation and semantic segmentation.
  • the inter-channel correlation is reduced in each of the depth estimation and semantic segmentation tasks after orthogonal regularization is applied.
  • An orthogonality constraint on a model's parameters is applied to tasks such as image classification, image retrieval, and 3D classification. Enforcing orthogonality has also helped improve a model's convergence and training stability, and promotes learning of independent parameters. In a multi-task setup (e.g., involving depth estimation and semantic segmentation), feature independence within a given task is important. An effect of applying a variation of the orthogonal scheme is investigated on different submodules.
  • a loss function of the model (e.g., a comprehensive loss) is determined based on an initial loss L_I and an orthogonal regularization term.
  • L_I = L_SSL + L_D    (7)
  • W, ‖·‖, I, L_F, L_I, L_D, and L_SSL represent weights (for each layer), spectral norm, identity matrix, final model loss, initial loss, self-supervised depth loss 524, and semi-supervised semantic loss 526, respectively.
  • orthogonality (e.g., corresponding to an orthogonal regularization term in the loss) is enforced on the model weights.
  • the average inter-channel correlation is determined for all decoder layers for both the depth estimation and semantic segmentation tasks, with and without orthogonality regularization.
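One common way to impose such an orthogonality constraint is a soft penalty of the form ||W W^T − I||_F^2 on each layer's weight matrix; the following sketch applies it to decoder convolution kernels. Restricting the penalty to parameters whose names contain 'decoder', and the weighting factor, are assumptions of the sketch; the loss of equation (7) may instead use a spectral-norm-based variant.

    import torch

    def orthogonal_regularization(model, weight=1e-4):
        """Soft orthogonality penalty sum ||W W^T - I||_F^2 over decoder convolution kernels (a sketch)."""
        device = next(model.parameters()).device
        penalty = torch.zeros((), device=device)
        for name, p in model.named_parameters():
            if p.dim() == 4 and 'decoder' in name:    # restricting to decoder conv weights is an assumption
                w = p.flatten(1)                      # (out_channels, in_channels * k * k)
                gram = w @ w.t()                      # inter-channel correlation of the layer
                eye = torch.eye(gram.size(0), device=device)
                penalty = penalty + ((gram - eye) ** 2).sum()
        return weight * penalty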
  • data augmentation mechanisms are applied for both semantic segmentation and depth estimation, thereby dealing with data sparsity for multi-task learning. This approach enhances not only diversity and class balance for semantic segmentation, but also region discrimination for depth estimation. Orthogonal regularization is applied to determine depth and semantics with diminishing weighting to facilitate feature generalization and independence. More details on data augmentations are discussed below with reference to at least Figures 10, 11, and 13.
  • Figure 10 is a flow diagram of an example data augmentation process 1000 for augmenting training data 238 (e.g., a first image 1002) for a machine learning model for semantic segmentation, in accordance with some embodiments.
  • the data augmentation process 1000 is implemented by a data augmentation module 227 of an electronic device.
  • the electronic device obtains the first image 1002 captured by a camera 260 and applies an encoder-decoder network 600 (e.g., including encoder 504 and decoders 506 and 508) to extract depth information 518 (e.g., a depth map) and semantic information 520 (e.g., a semantic segmentation mask) of the first image 1002.
  • the semantic information 520 includes a first semantic mask 1005 (also called a foreground mask M) having an array of semantic elements.
  • a first subset of semantic elements of the first semantic mask 1005 have a first value (e.g., “1”) corresponding to each pixel of the object 1006 in the first image 1002, and a second subset of remaining semantic elements of the first semantic mask 1005 have a second value (e.g., “0”) corresponding to each background pixel distinct from any object in the first image 1002.
  • the object 1006 is adjusted with the ROI 1008 (e.g., in the first semantic mask 1005) to generate an adjusted ROI 1010.
  • the first image 1002 and the adjusted ROI 1010 are combined to generate a second image 1012 in which the adjusted ROI 1010 is applied to replace the ROI 1008.
  • the electronic device applies the second image 1012 to train the encoder-decoder network 600.
  • the semantic information 520 is updated to reflect the adjusted ROI 1010 and generate the updated semantic information 1014, which corresponds to the second image 1012.
  • the object 1006 includes a moveable object (e.g., a pedestrian, a parked or moving vehicle).
  • the ROI 1008 includes the object 1006 and a portion of a background of the first image 1002 immediately adjacent to the object 1006.
  • the ROI 1008 is defined by a bounding box that closely encloses the object 1006, and an edge of the ROI 1008 matches a contour of the object 1006 in the first image 1002.
  • Each edge of the bounding box of the ROI 1008 overlaps with at least one pixel of the contour of the object 1006.
  • each edge of the bounding box of the ROI 1008 is separated from a closest pixel of the object 1006 by at least a predefined number of pixels (e.g., 10 pixels).
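As a small illustration of the two bounding-box conventions just described, a helper that returns either a tight ROI (margin = 0, edges touching the object contour) or a padded ROI (e.g., a 10-pixel separation) could look like this; the function name and signature are hypothetical.

    import torch

    def object_roi(mask: torch.Tensor, margin: int = 0):
        """Bounding box (y0, y1, x0, x1) around the True pixels of a binary object mask,
        optionally padded by `margin` pixels on every side and clamped to the image bounds."""
        ys, xs = torch.where(mask)
        if ys.numel() == 0:
            return None                      # no object pixels in the mask
        h, w = mask.shape
        y0 = max(int(ys.min()) - margin, 0)
        y1 = min(int(ys.max()) + 1 + margin, h)
        x0 = max(int(xs.min()) - margin, 0)
        x1 = min(int(xs.max()) + 1 + margin, w)
        return y0, y1, x0, x1

    # tight_roi = object_roi(seg_mask == obj_id)        # edges touch the object contour
    # padded_roi = object_roi(seg_mask == obj_id, 10)   # separated by at least 10 pixels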
  • the encoder-decoder network includes an encoder network 504 and two parallel decoder networks 506 and 508.
  • the two parallel decoder networks 506 and 508 are coupled to the encoder network 504, and configured for generating the depth information 518 and the semantic information 520 of the first image 1002, respectively.
  • prior to applying the encoder-decoder network, the encoder-decoder network is trained using a set of supervised training data including a set of training images and associated ground truth depth and semantic information.
  • the encoder-decoder network is further trained using the second image 1012. Further, in some embodiments, training is implemented in a server 102.
  • the encoder-decoder network is applied to extract the depth information 518 and semantic information 520 of the first image 1002 by applying an encoder network 504 to process the first image 1002 and generate an encoded feature map 510.
  • a first decoder network 506 is applied to generate a depth feature 518 based on the encoded feature map 510, and a second decoder network 508 to generate a semantic feature 520 based on the encoded feature map 510.
  • the first decoder network 506 and second decoder network 508 are coupled to each other via a CCAM network 522.
  • the electronic device obtains a first decoded feature map 512 and a second decoded feature map 514 from the first and second decoder networks 506 and 508, respectively.
  • the CCAM network 522 combines the first and second decoded feature maps 512 and 514 to generate a CCAM cross-feature score map 516.
  • the first decoded feature map 512 in the first decoder network 506 is modified based on the CCAM cross-feature score map 516 to generate the depth feature 518 of the first image 1002.
  • the second decoded feature map 514 in the second decoder network 508 is modified based on the CCAM cross-feature score map 516 to generate the semantic feature 520 of the first image 1002.
  • the first image 1002 is applied to generate the second image 1012 by re-scaling a randomly selected movable object 1006.
  • no additional image is involved in augmenting the first image 1002.
  • the scale factor for depth is in a range of [0.5, 1.5]
  • an affine transform is applied on the first image to generate an output that is further processed with a softmax function.
  • a foreground mask M is applied to determine the second image 1012, pseudo labels, and softmax function.
  • the intra data augmentation scheme is implemented as follows:
  • the ROI 1008 is adjusted based on a depth of the object 1006 included therein.
  • the electronic device determines a first depth of the object 1006 in a field of view of the camera 260 based on the depth information 518 and the semantic information 520 of the first image 1002.
  • a scale factor s is determined based on the first depth and a target depth of the object 1006. In an example, the scale factor s is greater than 1.
  • the ROI 1008 is scaled by the scale factor s to generate the adjusted ROI 1010.
  • a target location of the adjusted ROI 1010 on the second image 1012 corresponds to the target depth in the field of view of the camera 260.
  • the electronic device determines a translational shift (t_x, t_y) from a first location (o_x, o_y) of the ROI 1008 in the first image 1002 to the target location of the adjusted ROI 1010 in the second image 1012.
  • the adjusted ROI 1010 is placed on the target location of the adjusted ROI 1010 in the second image 1012 based on the translational shift (t_x, t_y).
  • a depth map is modified based on the target depth of the object 1006 to generate a modified depth map.
  • a semantic segmentation mask 1005 is modified based on the adjusted ROI 1010 to generate a modified semantic mask.
  • the modified depth map and modified semantic mask are associated with the second image 1012 as ground truth.
  • large-scale video data is leveraged using a semi-supervised multi-task learning paradigm in which semantic segmentation follows a semi-supervised setting and depth estimation is trained in a self-supervised manner.
  • an AffineMix data augmentation strategy is applied to improve semi-supervised semantic training.
  • The AffineMix data augmentation strategy aims to create new labeled images (e.g., a second image 1012) under a varied range of depth scales. Under this scheme, randomly selected movable objects are projected over the same image (e.g., a first image 1002), for a randomly selected depth scale.
  • a data augmentation scheme called ColorAug is applied to establish a contrast between movable objects and adjacent regions, using intermediate semantic information.
  • orthogonal regularization ( Figure 9) is applied to improve machine learning training efficacy. Orthogonality is applied to specific task modules and helps learn more independent features across depth and semantics feature spaces. This eventually has a positive impact on both semantics and depth evaluation.
  • Data augmentation plays a pivotal role in machine learning tasks, as it helps gather varied data samples from a similar distribution.
  • data augmentation is applied to both segmentation and depth estimation tasks using predicted depth and semantics respectively.
  • data augmentation for segmentation is applied in a semi-supervised manner. Models leverage consistency training by mixing image masks across two different images to generate a new image and its semantic labels. Further, in some embodiments, to generate a diverse mixed label space while maintaining the integrity of the scene structure, an AffineMix data augmentation strategy considers mixing labels within the same image 1002 under a varied range of random depth values, thus producing a new set of affine-transformed images (e.g., the second image 1012 in Figure 10).
  • a mixed image I’ is generated based on an image I and a corresponding predicted depth map D by scaling a depth of a selected movable object 1006 by a scale factor s as follows:
  • D’ = s * D,    (9) such that its spatial location in the image is changed in a geometrically realistic way.
  • Changing the depth by a factor of s results in an inverse scaling in the image domain and a translational shift (t_x, t_y), where o_x and o_y are normalized offsets along the x and y directions.
  • Using the translational shift (t_x, t_y) and the inverse scaling 1/s, we can perform an affine transformation on the image and label space to generate I_a and L_a.
  • the foreground mask M is estimated by comparing the new and old depths and masking it with the region which has the movable object in I_a, and is named M_m. The final image and label are then obtained by blending the original and affine-transformed image and label using M_m.
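A minimal AffineMix-style sketch, assuming a single selected object, a scale factor s ≥ 1, and a simplified placement that keeps the original top-left corner (the geometrically realistic shift (t_x, t_y) described above is omitted), is shown below; the function and variable names are hypothetical.

    import torch
    import torch.nn.functional as F

    def affine_mix(image, depth, seg_mask, obj_id, s):
        """Rescale one movable object according to a depth scale s >= 1 and paste it back into
        the same image, updating the depth map and semantic mask accordingly.

        image: float (3, H, W); depth: float (H, W); seg_mask: integer (H, W).
        """
        _, H, W = image.shape
        ys, xs = torch.where(seg_mask == obj_id)
        if ys.numel() == 0:
            return image, depth, seg_mask
        y0, y1 = int(ys.min()), int(ys.max()) + 1
        x0, x1 = int(xs.min()), int(xs.max()) + 1
        # inverse scaling in the image domain: the object shrinks by 1/s when its depth grows by s
        nh, nw = max(1, round((y1 - y0) / s)), max(1, round((x1 - x0) / s))
        roi_img = F.interpolate(image[None, :, y0:y1, x0:x1], (nh, nw), mode='bilinear', align_corners=False)[0]
        roi_depth = F.interpolate(depth[None, None, y0:y1, x0:x1], (nh, nw), mode='nearest')[0, 0]
        roi_obj = F.interpolate((seg_mask[None, None, y0:y1, x0:x1] == obj_id).float(), (nh, nw), mode='nearest')[0, 0].bool()
        ty, tx = y0, x0        # simplified placement; the realistic shift toward the target depth is omitted
        out_img, out_depth, out_seg = image.clone(), depth.clone(), seg_mask.clone()
        out_img[:, ty:ty + nh, tx:tx + nw][:, roi_obj] = roi_img[:, roi_obj]      # paste only the object pixels
        out_depth[ty:ty + nh, tx:tx + nw][roi_obj] = s * roi_depth[roi_obj]       # D' = s * D for the pasted pixels
        out_seg[ty:ty + nh, tx:tx + nw][roi_obj] = obj_id                         # update the semantic mask
        return out_img, out_depth, out_seg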
  • a contrast is adjusted between adjacent regions (e.g., the ROI 1008 and corresponding background).
  • bright and dark regions within an image are adjusted.
  • a brightness level of the ROI 1008 is adjusted.
  • ColorAug is an effective data augmentation technique.
  • the movable and non-movable objects are identified based on semantic segmentation (e.g., intermediate semantic labels and semantic map 1015 predicted by an encoder-decoder network 600).
  • Figure 11 is a comparison 1100 of original images 1102A, 1104A, and 1106A and augmented images 1102B, 1104B, and 1106B, which are applied to train a machine learning model 240, in accordance with some embodiments.
  • using the intermediate semantics output 520, regions of different brightness, contrast, and saturation are purposefully created in movable objects 1114 enclosed in bounding boxes of the regions 1108-1112.
  • An electronic device obtains an original first image 1102A, 1104A, or 1106A captured by a camera 260 and applies an encoder-decoder network 600 (e.g., including encoder 504 and decoders 506 and 508) to extract depth information 518 and semantic information 520 of the first image.
  • One or more objects 1114 and associated regions of interest (ROI) 1108-1112 are identified in each first image 1102A, 1104A, or 1106A based on the semantic information 520.
  • Each ROI in the first image 1102A, 1104A, or 1106A is adjusted to generate an adjusted ROI 1108, 1110, or 1112.
  • Each first image 1102A, 1104A, or 1106A and the corresponding adjusted ROI 1108, 1110, or 1112 are combined to generate a second image 1102B, 1104B, or 1106B in which the adjusted ROI 1108, 1110, or 1112 is applied to replace the original ROI.
  • the electronic device applies the second image 1102B, 1104B, or 1106B to train the encoder-decoder network 600.
  • an image 1102B or 1104B includes more than one ROI 1108 or 1110.
  • an image 1106B includes only one ROI 1112.
  • each ROI includes a respective object.
  • each ROI includes more than one object 1114.
  • each ROI 1108-1112 includes one type of object (e.g., vehicles only, pedestrians only).
  • an ROI includes two or more types of objects.
  • each object 1114 includes a moveable object (e.g., a pedestrian, a parked or moving vehicle).
  • the semantic information 520 includes a semantic mask 1005 identifying the object 1114.
  • a set of object pixels corresponding to the object 1114 in the first image 1102A, 1104A, or 1106A are identified.
  • An original ROI is adjusted by adjusting at least one of a contrast level, a brightness level, a saturation level, and a hue level of the set of object pixels to generate a corresponding adjusted ROI 1108, 1110, or 1112.
  • a contrast level of the ROI 1110 is adjusted in the image 1104B.
  • a brightness level of the ROI 1108 (including ROIs 1108A and 1108B) is adjusted in the image 1102B, and a saturation level of the ROI 1112 is adjusted in the image 1106B.
  • the original ROI has a first location in the first image 1102A, 1104A, or 1106A
  • the adjusted ROI 1108, 1110, or 1112 has a second location in the second image 1102B, 1104B, or 1106B, respectively.
  • Each first location is consistent with a respective second location.
  • the object includes a first object 1114A corresponding to a first set of object pixels.
  • Based on the semantic mask, the electronic device identifies a second set of object pixels corresponding to a second object 1114B in the first image 1102A, and adjusts at least one of the contrast level, brightness level, saturation level, and hue levels of the second set of object pixels to generate a second adjusted ROI 1108B.
  • the first image 1102A, adjusted ROI 1108A, and second adjusted ROI 1108B are combined to generate the second image 1102B.
  • the at least one of the contrast level, brightness level, saturation level, and hue levels of the first set of object pixels and the at least one of the contrast level, brightness level, saturation level, and hue levels of the second set of object pixels are adjusted jointly (e.g., by the same or related changes) or independently (e.g., by different changes).
  • a brightness level of the ROI 1108A is adjusted based on a first scale factor
  • a brightness level of the ROI 1108B is adjusted based on a second scale factor.
  • the first and second scale factors are optionally identical or different.
  • a brightness level of the ROI 1108A is adjusted based on a first scale factor
  • a contrast level of the ROI 1108B is adjusted based on a second scale factor.
  • the first and second scale factors are optionally identical or different.
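A ColorAug-style, per-object photometric jitter can be sketched as follows. Treating brightness, contrast, and saturation as simple per-class random factors applied only to movable-object pixels is an assumption of the sketch, and the class ids passed as movable_ids are hypothetical.

    import torch

    def color_aug(image, seg_mask, movable_ids, jitter=0.3, seed=None):
        """Jitter brightness/contrast/saturation of movable-object pixels only,
        using the predicted semantic mask to locate them.

        image: float (3, H, W) in [0, 1]; seg_mask: integer (H, W) labels.
        movable_ids: class ids treated as movable (e.g., vehicles, pedestrians).
        """
        g = torch.Generator()
        if seed is not None:
            g.manual_seed(seed)
        out = image.clone()
        for cls in movable_ids:
            m = seg_mask == cls
            if not m.any():
                continue
            # independent random factors per class (an assumption; they could also be shared)
            bright, contr, sat = (1.0 + jitter * (2 * torch.rand(3, generator=g) - 1)).tolist()
            pixels = out[:, m]                                       # (3, num_object_pixels)
            pixels = pixels * bright                                 # brightness
            pixels = (pixels - pixels.mean()) * contr + pixels.mean()  # contrast
            gray = pixels.mean(dim=0, keepdim=True)
            pixels = (pixels - gray) * sat + gray                    # saturation
            out[:, m] = pixels.clamp(0.0, 1.0)
        return out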
  • Figure 12 is a flow diagram of an example data processing method 1200, in accordance with some embodiments.
  • the method 1200 is described as being implemented by a data processing module 228 of an electronic system 200 (e.g., a server 102, a mobile phone 104C, or a combination thereof).
  • Method 1200 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 12 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1200 may be combined and/or the order of some operations may be changed.
  • An electronic device obtains (1202) an input image 502 and applies (1204) an encoder network 504 to process the input image 502 and generate an encoded feature map 510.
  • the electronic device applies a first decoder network 506 to generate (1206) a first decoded feature map 512 based on the encoded feature map 510 and a second decoder network 508 to generate (1208) a second decoded feature map 514 based on the encoded feature map 510.
  • the first and second decoded feature maps 512 and 514 are combined (1210) to generate a CCAM cross-feature score map 516.
  • the first decoded feature map 512 in the first decoder network 506 is modified (1212) based on the CCAM cross-feature score map 516 to generate a depth feature 518 of the input image 502, and the second decoded feature map 514 in the second decoder network 508 is modified (1214) based on the CCAM cross-feature score map 516 to generate a semantic feature 520 of the input image 502.
  • the first decoder network 506 has a first number of successive decoding stages 610 ( Figure 6), and the second decoder network 508 has a second number of successive decoding stages 630 ( Figure 6).
  • the second number is equal to the first number.
  • Each of the first number of successive decoding stages 610 corresponds to a respective second decoding stage 630 and is configured to generate a respective first intermediate feature map having the same resolution as a respective second intermediate feature map generated by the respective second decoding stage 630.
  • the first decoded feature map 512 matches, and has the same resolution as, the second decoded feature map 514.
  • the first decoder network 506 has a first decoding stage (e.g., stage 610B) immediately followed by a first next stage (e.g., stage 610C).
  • the first decoding stage is configured to output the first decoded feature map 512.
  • the first decoded feature map 512 is modified based on the CCAM crossfeature score map 516.
  • the modified first decoded feature map 512 is applied at an input of the first next stage (e.g., stage 610C).
  • the encoded feature map 510 is received by the first decoding stage (e.g., stage 610A) of the first decoder network 506, and the first decoding stage (e.g., stage 610A) leads a plurality of successive decoding stages (e.g., stages 610B-610D) of the first decoder network 506.
  • the encoded feature map 510 is received by one or more alternative decoding stages (e.g., stages 610A-610B).
  • One of the one or more alternative decoding stages (e.g., stage 610A) leads a plurality of successive decoding stages (e.g., stages 610A-610D) of the first decoder network 506.
  • the first decoding stage (e.g., stage 610C) follows the one or more alternative decoding stages (e.g., stages 610A-610B) in the first decoder network 506.
  • the modified first decoded feature map 512 is applied at the input of the first next stage (e.g., stage 610D).
  • the second decoder network 508 has a second decoding stage immediately followed by a second next stage.
  • the second decoding stage is configured to output the second decoded feature map 514.
  • Modifying the second decoded feature map 514 in the second decoder network 508 further includes modifying the second decoded feature map 514 based on the CCAM cross-feature score map 516 and applying the modified second decoded feature map at an input of the second next stage.
  • the encoder network 504 and the first decoder network 506 form a first U-Net.
  • Each encoding stage of the encoder network 504 is configured to provide a first skip connection 614 to a respective decoding stage 610 of the first decoder network 506.
  • the encoder network 504 and the second decoder network 508 form a second U-Net.
  • Each encoding stage of the encoder network 504 is configured to provide a second skip connection to a respective decoding stage 630 of the second decoder network 508.
  • the second skip connection is identical to the first skip connection 614.
  • the first and second decoded feature maps 512 and 514 are combined to generate the CCAM cross-feature score map 516 by applying (1216) at least a product operation 712 to combine the first and second decoded feature maps 512 and 514 and generate a cross-task feature map 714, applying (1218) a channel attention network to process the cross-task feature map 714 and generate a cross-task affinity matrix 718, and combining (1220) the first and second decoded feature maps 512 and 514 based on the cross-task affinity matrix 718 to generate the CCAM crossfeature score map 516.
  • the cross-task affinity matrix 718 includes (1222) a plurality of affinity scores. Each affinity score indicates an affinity level of a respective channel of the first decoded feature map 512 with respect to a respective channel of the second decoded feature map 514.
  • the electronic device applies (1224) a first spatial attention network 702 on the first decoded feature map 512 to generate a first self-attended feature map 704, and applies (1226) a second spatial attention network 706 on the second decoded feature map 514 to generate a second self-attended feature map 708.
  • the first self-attended feature map 704 is transposed (1228) to generate a transposed first self-attended feature map 710.
  • the transposed first self-attended feature map 710 and the second self-attended feature map 708 are combined (1230) by the product operation 712.
  • the first and second decoded feature maps 512 and 514 are combined using a CCAM network 522 to generate the CCAM cross-feature score map 516, and the input image 502 includes one of a plurality of training images.
  • the encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using the plurality of training images in an end-to-end manner.
  • the encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using a comprehensive loss.
  • the comprehensive loss is a combination of a depth loss 524, a semantics loss 526, and an orthogonal loss 528.
  • the electronic device determines the orthogonal loss 528 based on parameters of the encoder network 504, first decoder network 506, and second decoder network 508.
  • the depth loss 524 is determined based on the depth feature 518 and a pose 530 of an electronic device that captures the training images.
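Putting the pieces together, one training step over the comprehensive loss might be sketched as below. Here depth_loss_fn and seg_loss_fn stand in for the self-supervised depth loss 524 (e.g., a photometric reprojection loss using the pose 530) and the semi-supervised semantic loss 526, orthogonal_regularization refers to the earlier sketch, and the additive, unit-weighted combination is an assumption of the sketch rather than the exact formulation.

    import torch

    def training_step(model, images, seg_labels, optimizer,
                      depth_loss_fn, seg_loss_fn, orth_weight=1e-4):
        """One end-to-end training step over the combined depth, semantic, and orthogonal losses (a sketch)."""
        optimizer.zero_grad()
        depth_pred, seg_pred = model(images)             # shared encoder + two CCAM-coupled decoders
        loss_depth = depth_loss_fn(depth_pred, images)   # self-supervised depth loss (e.g., photometric)
        loss_seg = seg_loss_fn(seg_pred, seg_labels)     # semi-supervised semantic loss (e.g., cross-entropy)
        loss_orth = orthogonal_regularization(model, weight=orth_weight)
        loss = loss_depth + loss_seg + loss_orth         # comprehensive loss
        loss.backward()
        optimizer.step()
        return loss.detach()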
  • Some embodiments of this application tackle a multi-task learning problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a CCAM, which facilitates effective feature sharing along each channel between the two tasks, leading to mutual performance gain with a negligible increase in trainable parameters.
  • A multi-task learning paradigm focuses on jointly learning two or more tasks, aiming for significant improvements in a model's generalizability, performance, and training/inference memory footprint.
  • the aforementioned benefits become ever so indispensable in the case of joint training for vision-related dense prediction tasks.
  • Such multi-task learning relies on an inherent symbiotic relation among multiple tasks (e.g., semantic segmentation and depth estimation), where one task benefits from the other task. Parameters are shared among different tasks to overcome a data sparsity problem and enforce task generalization by leveraging task losses to regularize each other.
  • Figure 13 is a flow diagram of an example data augmentation method 1300, in accordance with some embodiments.
  • the method 1300 is described as being implemented by a data augmentation module 227 of an electronic system 200 (e.g., a server 102, a mobile phone 104C, or a combination thereof).
  • Method 1300 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system 200.
  • Each of the operations shown in Figure 13 may correspond to instructions stored in a computer memory or non- transitory computer readable storage medium (e.g., memory 206 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1300 may be combined and/or the order of some operations may be changed.
  • the electronic system 200 obtains (1302) a first image 1002 captured by a camera and applies (1304) an encoder-decoder network 600 to extract depth information 518 and semantic information 520 of the first image.
  • An object 1006 and an associated ROI 1008 are identified (1306) in the first image 1002 based on the semantic information.
  • the ROI includes the object 1006.
  • the object 1006 includes (1307) a moveable object 1006.
  • the electronic system 200 adjusts (1308) the ROI in the first image 1002 to generate an adjusted ROI 1010 and combines (1310) the first image 1002 and the adjusted ROI 1010 to generate a second image 1012.
  • the second image 1012 is applied (1312) to train the encoder-decoder network.
  • the ROI is adjusted by determining (1314) a first depth of the object 1006 in a field of view of the camera based on the depth information 518 and the semantic information 520 of the first image 1002, determining (1316) a scale factor s based on the first depth and a target depth of the object 1006, and scaling (1318) the ROI 1008 including the object 1006 by the scale factor s to generate the adjusted ROI 1010.
  • a target location of the adjusted ROI 1010 on the second image 1012 corresponds to the target depth in the field of view of the camera 260.
  • the second image 1012 is generated by determining (1320) a shift (t_x, t_y) from a first location of the ROI to the target location of the adjusted ROI 1010 based on the first depth and the target depth and placing (1322) the adjusted ROI 1010 on the target location of the adjusted ROI 1010 based on the shift (t_x, t_y).
  • the depth information 518 includes a depth map
  • the semantic information 520 includes a semantic segmentation mask 1005.
  • the electronic system 200 modifies the depth map based on the target depth of the object 1006 to generate a modified depth map, modifies the semantic segmentation mask 1005 based on the adjusted ROI 1010 to generate a modified semantic mask, and associates the modified depth map and modified semantic mask with the second image 1012 as ground truth.
  • the encoder-decoder network 600 includes an encoder network 504 and two parallel decoder networks 506 and 508, the two parallel decoder networks 506 and 508 coupled to the encoder network 504 and configured for generating the depth information 518 and the semantic information 520 of the first image 1002.
  • prior to applying the encoder-decoder network 600, the encoder-decoder network 600 is trained using a set of supervised training data including a set of training images and associated ground truth depth and semantic information.
  • the encoder-decoder network 600 is further trained using the second image 1012.
  • the ROI 1008 closely encloses the object 1006, and an edge of the ROI matches a contour of the object 1006 in the first image 1002.
  • the semantic information 520 includes a semantic mask identifying the object 1006. Based on the semantic mask, a set of object pixels corresponding to the object 1006 are identified in the first image 1002. At least one of a contrast level, a brightness level, a saturation level, and a hue level of the set of object pixels is modified to generate the adjusted ROI 1010.
  • the ROI 1008 has a first location in the first image 1002 and the adjusted ROI 1010 has a second location in the second image 1012. The first location is consistent with the second location.
  • the object 1006 includes a first object 1114A corresponding to a first set of object pixels in the first image 1102A.
  • the electronic system 200 identifies a second set of object pixels corresponding to a second object 1114B in the first image 1102A and adjusts at least one of the contrast level, brightness level, saturation level, and hue levels of the second set of object pixels to generate a second adjusted ROI 1108B.
  • the first image 1102A, adjusted ROI 1108A, and second adjusted ROI 1108B are combined to generate the second image 1102B.
  • the at least one of the contrast level, brightness level, saturation level, and hue levels of the first set of object pixels and the at least one of the contrast level, brightness level, saturation level, and hue levels of the second set of object pixels are adjusted jointly.
  • the encoder-decoder network 600 is applied to extract the depth information 518 and semantic information 520 of the first image 1002 by applying an encoder network 504 to process the first image 1002 and generate an encoded feature map 510, applying a first decoder network 506 to generate a depth feature 518 based on the encoded feature map 510, and applying a second decoder network 508 to generate a semantic feature 520 based on the encoded feature map 510.
  • the first decoder network 506 and second decoder network 508 are coupled to each other via a CCAM network 522.
  • the electronic system 200 obtains a first decoded feature map 512 and a second decoded feature map 514 from the first and second decoder networks 506 and 508, respectively.
  • the CCAM network 522 combines the first and second decoded feature maps 512 and 514 to generate a CCAM cross-feature score map 516.
  • the first decoded feature map 512 in the first decoder network 506 is modified based on the CCAM cross-feature score map 516 to generate the depth feature 518 of the first image 1002.
  • the second decoded feature map 514 in the second decoder network 508 is modified based on the CCAM cross-feature score map 516 to generate the semantic feature 520 of the first image 1002.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Abstract

This application is directed to depth mapping and semantic segmentation in a cross-coupled manner. A computer obtains an input image and applies an encoder network to process the input image and generate an encoded feature map. A first decoder network is applied to generate a first decoded feature map based on the encoded feature map. A second decoder network is applied to generate a second decoded feature map based on the encoded feature map. The first and second decoded feature maps are combined to generate a cross-channel attention modulated (CCAM) cross-feature score map. The first decoded feature map is modified in the first decoder network based on the CCAM cross-feature score map to generate a depth feature of the input image. The second decoded feature map is modified in the second decoder network based on the CCAM cross-feature score map to generate a semantic feature of the input image.

Description

CROSS-COUPLED MULTI-TASK LEARNING FOR DEPTH MAPPING
AND SEMANTIC SEGMENTATION
RELATED APPLICATIONS
[0001] This application claims the benefit of and priority to U.S. Provisional Application No. 63/285,931, entitled “Semantics-Depth-Symbiosis: Deeply Coupled Semi- Supervised Learning of Semantics and Depth,” filed December 3, 2021, and U.S. Provisional Application No. 63/352,940, entitled “Semantics-Depth-Symbiosis: Deeply Coupled Semi- Supervised Learning of Semantics and Depth,” filed June 16, 2022, all of which are hereby incorporated by reference in their entirety.
TECHNICAL FIELD
[0002] This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for applying deep learning techniques to determine depth information and semantic information based on cross-channel coupling.
BACKGROUND
[0003] Convolutional Neural Networks (CNNs) have been applied to implement a range of computer vision tasks including image classification, semantic segmentation, and depth estimation. Features associated with each of these computer vision tasks are largely independent, and therefore, each computer vision task is often trained in isolation. However, in many situations, there is not sufficient labeled data available for training each computer vision task. This data sparsity problem is more prominent for dense tasks such as semantic segmentation and depth estimation, where perfect per-pixel annotation is expensive and untenable, making fully supervised learning infeasible under most circumstances. It would be beneficial to use deep learning techniques that can be efficiently trained to determine depth information and semantic information accurately.
SUMMARY
[0004] Various embodiments of this application are directed to methods, systems, devices, and non-transitory computer-readable media for multi-task learning of semantic segmentation and depth estimation based on cross-channel coupling, e.g., using a cross- channel attention module (CCAM). The CCAM facilitates effective feature sharing between two channels of semantic segmentation and depth estimation, leading to mutual performance gains with a negligible increase in trainable parameters. Additionally, a data augmentation method is formed for the semantic segmentation task using the predicted depth. As such, the CCAM and data augmentation enable performance gains for semantic segmentation and depth estimation and provide deep learning solutions based on a semi-supervised joint model. [0005] In one aspect, image processing is implemented in an electronic system. The method includes obtaining an input image and applying an encoder network to process the input image and generate an encoded feature map. The method includes applying a first decoder network to generate a first decoded feature map based on the encoded feature map, applying a second decoder network to generate a second decoded feature map based on the encoded feature map, and combining the first and second decoded feature maps to generate a cross-channel attention modulated (CCAM) cross-feature score map. The method further includes modifying the first decoded feature map in the first decoder network based on the CCAM cross-feature score map to generate a depth feature of the input image and modifying the second decoded feature map in the second decoder network based on the CCAM crossfeature score map to generate a semantic feature of the input image.
[0006] In some embodiments, combining the first and second decoded feature maps to generate the CCAM cross-feature score map further includes applying at least a product operation to combine the first and second decoded feature maps and generate a cross-task feature map, applying a channel attention network to process the cross-task feature map and generate a cross-task affinity matrix, and combining the first and second decoded feature maps based on the cross-task affinity matrix to generate the CCAM cross-feature score map. Further, in some embodiments, the cross-task affinity matrix includes a plurality of affinity scores, and each affinity score indicates an affinity level of a respective channel of the first decoded feature map with respect to a respective channel of the second decoded feature map. Additionally, in some embodiments, combining the first and second decoded feature maps further includes applying a first spatial attention network on the first decoded feature map to generate a first self-attended feature map, applying a second spatial attention network on the second decoded feature map to generate a second self-attended feature map, transposing the first self-attended feature map to generate a transposed first self-attended feature map, and combining the transposed first self-attended feature map and the second self-attended feature map by the product operation. [0007] In one aspect, a data augmentation method is implemented at an electronic system. The method includes obtaining a first image captured by a camera and applying an encoder-decoder network to extract depth information and semantic information of the first image. The method includes identifying an object (e.g., a moveable object) and an associated region of interest (ROI) in the first image based on the semantic information, adjusting the ROI in the first image to generate an adjusted ROI, and combining the first image and the adjusted ROI to generate a second image. The method includes applying the second image to train the encoder-decoder network. The ROI includes the object.
[0008] In some embodiments, adjusting the ROI further includes determining a first depth of the object in a field of view of the camera based on the depth information and the semantic information of the first image, determining a scale factor based on the first depth and a target depth of the object, and scaling the ROI including the object by the scale factor to generate the adjusted ROI. A target location of the adjusted ROI on the second image corresponds to the target depth in the field of view of the camera.
[0009] In some embodiments, the semantic information includes a semantic mask identifying the object. The method further includes, based on the semantic mask, identifying a set of object pixels corresponding to the object in the first image. Adjusting the ROI includes adjusting at least one contrast level, a brightness level, a saturation level, and a hue level of the set of object pixels to generate the adjusted ROI. The ROI has a first location in the first image, and the adjusted ROI has a second location in the second image. The first location is consistent with the second location.
[0010] In another aspect, some implementations include an electronic system or an electronic device, which includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0011] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0012] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there. BRIEF DESCRIPTION OF THE DRAWINGS
[0013] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0014] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0015] Figure 2 is a block diagram illustrating an electronic system configured to process content data (e.g., speech data), in accordance with some embodiments.
[0016] Figure 3 is an example data processing environment for training and applying a neural network-based machine learning model for processing visual and/or speech data, in accordance with some embodiments.
[0017] Figure 4A is an example neural network applied to process content data in an NN-based machine learning model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.
[0018] Figure 5 is a block diagram of an example machine learning model applied to determine depth information and semantic information of an input image, in accordance with some embodiments.
[0019] Figure 6 is a block diagram of an example encoder-decoder network (e.g., a U- net) applied to process an input image based on a CCAM cross-feature score map, in accordance with some embodiments.
[0020] Figures 7A and 7B are block diagrams of two portions of a cross-channel attention module for determining cross-task affinity of two distinct decoded feature maps generated from an input image, in accordance with some embodiments.
[0021] Figure 8 is a block diagram of an example CCAM network for generating a cross-task affinity matrix associated with an input image, in accordance with some embodiments.
[0022] Figure 9 is a diagram of an example orthogonal loss generated by an encoderdecoder network applied to process an input image based on a CCAM cross-feature score map, in accordance with some embodiments. [0023] Figure 10 is a flow diagram of an example data augmentation process for augmenting training data for a machine learning model for semantic segmentation, in accordance with some embodiments.
[0024] Figure 11 is a comparison of original images and augmented images, which are applied to train a machine learning model , in accordance with some embodiments.
[0025] Figure 12 is a flow diagram of an example data processing method, in accordance with some embodiments.
[0026] Figure 13 is a flow diagram of an example data augmentation method, in accordance with some embodiments.
[0027] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0028] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic systems with image processing capabilities.
[0029] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network- connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, executes user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, processes the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
[0030] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and providing the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or speech events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104 to monitor the events occurring near the networked surveillance camera 104E in the real time and remotely. [0031] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
[0032] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, speech data) obtained by an application executed at a client device 104 to enhance quality of the content data, identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, machine learning models (e.g., models 240 in Figure 2) are created based on one or more neural networks to process the content data. These machine learning models are trained with training data before they are applied to process the content data. Subsequently to model training, the client device 104 obtains the content data (e.g., captures audio data via a microphone) and processes the content data using the machine learning models locally.
[0033] In an example, machine learning models are trained and used to process one or more images captured by a camera to extract depth information and/or semantic information. The machine learning models are optionally trained in a server 102, a client device 104, or a combination thereof. Also, the machine learning models are optionally applied in the server 102, client device 104, or a combination thereof to process the one or more images. Examples of the client device 104 include, but are not limited to, a digital camera device 104E, mobile devices 104A-104C, and an HMD 104D.
[0034] In some embodiments, both model training and data processing are implemented locally at each individual client device 104. The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the machine learning models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104. The server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the machine learning models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained machine learning models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the machine learning models. The trained machine learning models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained machine learning models from the server 102B or storage 106, processes the content data using the machine learning models, and generates data processing results to be presented on a user interface locally.
[0035] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., a gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and speech data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the HMD 104D, and the hand gestures are recognized locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including the user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, speech, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses. The device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D. In some embodiments, the display of the HMD 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user-selectable display items (e.g., an avatar) on the user interface.
[0036] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture-capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable the presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
[0037] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer-readable storage medium. In some embodiments, memory 206, or the non-transitory computer-readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
• Operating system 214 including procedures for handling various basic system services and for performing hardware-dependent tasks;
• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
• The user interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, speech and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
• The web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web-based applications for controlling another electronic device and reviewing data captured by such devices);
• Model training module 226 for receiving training data and establishing a machine learning model 240 for processing content data (e.g., video, image, speech, or textual data) to be collected or obtained by a client device 104, where the model training module 226 includes a data augmentation module 227 for generating a second image from a first image by adjusting one or more ROIs in the first image;
• Data processing module 228 for processing content data using machine learning models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 applies an encoder-decoder network (e.g., a network 600 in Figure 6) to generate encoded feature maps in two distinct channels that are cross-modulated and applied to generate depth and semantic features of an input image;
• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 238 for training one or more machine learning models 240;
o Machine learning model(s) 240 for processing content data (e.g., video, image, speech, or textual data) using deep learning techniques, where the machine learning models 240 include an encoder-decoder network having two parallel decoder networks 506 and 508 that are coupled with a CCAM network 522 (Figure 5) and configured to generate depth and semantic information of an input image based on cross-channel coupling; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the machine learning models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results (e.g., depth information 518 and semantic information 520 of an input image 502 in Figure 5).
[0038] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the machine learning models 240 are stored at the server 102 and storage 106, respectively.
[0039] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0040] Figure 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) machine learning model 240 for processing content data (e.g., video, image, speech, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the machine learning model 240 and a data processing module 228 for processing the content data using the machine learning model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides at least part of training data 238 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300. The training data source 304 providing at least part of the training data 238 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained machine learning model 240 to the client device 104. In some embodiments, a first subset of training data is augmented to generate a second subset of training data.
[0041] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The machine learning model 240 is trained according to the type of content data to be processed. The training data 238 is consistent with the type of the content data, and a data pre-processing module 308 consistent with the type of the content data is applied to process the training data 238. For example, an image pre-processing module 308A is configured to process image training data 238 to a predefined image format, e.g., extract an ROI in each training image, and crop each training image to a predefined image size. Alternatively, a speech pre-processing module 308B is configured to process speech training data 238 to a predefined speech format, e.g., convert each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing machine learning model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the machine learning model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified machine learning model 240 is provided to the data processing module 228 to process the content data.
[0042] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
[0043] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, speech, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and a speech clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained machine learning model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
[0044] Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based machine learning model 240, in accordance with some embodiments, and Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments. The machine learning model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the machine learning model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s). As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. For example, the propagation function is a product of a nonlinear activation function and a linear weighted combination of the node input(s).
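For illustration only, the propagation function described above can be sketched in a few lines of Python/NumPy; the ReLU activation and the specific weight values are assumptions chosen for the example, not elements recited in the figures:

import numpy as np

def node_output(inputs, weights, bias=0.0):
    # Linear weighted combination of the node inputs (w1*x1 + w2*x2 + ...),
    # followed by a nonlinear activation (ReLU here; sigmoid or tanh also work).
    z = float(np.dot(weights, inputs)) + bias
    return max(z, 0.0)

# Example node 420 with four inputs and weights w1-w4.
x = np.array([0.5, -1.2, 0.3, 2.0])
w = np.array([0.1, 0.4, -0.2, 0.05])
print(node_output(x, w, bias=0.01))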
[0045] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the layer(s) may include a single layer acting as both an input layer and an output layer. Optionally, the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
[0046] In some embodiments, a convolutional neural network (CNN) is applied in a machine learning model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.
[0047] Alternatively or additionally, in some embodiments, a recurrent neural network (RNN) is applied in the machine learning model 240 to process content data (particularly, textual and speech data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. For example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
[0048] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.
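A minimal PyTorch sketch of the forward/backward propagation loop described above; the toy model, learning rate, and convergence threshold are hypothetical placeholders rather than parts of the disclosed system:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # toy stand-in for NN 400
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(32, 8), torch.randn(32, 1)  # training data and ground truth

for step in range(200):
    optimizer.zero_grad()
    outputs = model(inputs)            # forward propagation through all layers
    loss = loss_fn(outputs, targets)   # margin of error of the output
    loss.backward()                    # backward propagation of the error
    optimizer.step()                   # weights (and bias terms b) adjusted to decrease the error
    if loss.item() < 1e-3:             # predefined convergence condition
        break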
[0049] Figure 5 is a block diagram of an example machine learning model 240 applied (e.g., by a data processing module 228 of an electronic device) to determine depth information and semantic information of an input image 502, in accordance with some embodiments. The machine learning model 240 includes an encoder network 504, a first decoder network 506, and a second decoder network 508. The encoder network 504 obtains the input image 502 and processes the input image 502 to generate an encoded feature map 510. The input image 502 is captured by an electronic device, and is optionally one of a sequence of image frames. Both the first and second decoder networks 506 and 508 are coupled to the encoder network 504. The first decoder network 506 is applied to generate a first decoded feature map 512 based on the encoded feature map 510, and the second decoder network 508 is applied to generate a second decoded feature map 514 based on the encoded feature map 510. The second decoded feature map 514 optionally matches (e.g., has the same resolution as) the first decoded feature map 512. The first and second decoded feature maps 512 and 514 are combined to generate a CCAM cross-feature score map 516. The first decoded feature map 512 in the first decoder network 506 is modified based on the CCAM cross-feature score map 516 to generate a depth feature 518 of the input image 502. The second decoded feature map 514 in the second decoder network 508 is modified based on the CCAM cross-feature score map 516 to generate a semantic feature 520 of the input image 502.
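The data flow of Figure 5 can be summarized with the following hedged PyTorch-style sketch; the single-convolution stand-ins for the encoder network 504 and decoder networks 506 and 508, and the sigmoid-based score map, are simplifications for illustration only (the actual CCAM computation is detailed with Figures 7A-8):

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Placeholder for a full encoder/decoder network; a single 3x3 convolution keeps the sketch runnable.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
    def forward(self, x):
        return torch.relu(self.conv(x))

encoder = ConvBlock(3, 64)          # encoder network 504
depth_decoder = ConvBlock(64, 32)   # first decoder network 506
seg_decoder = ConvBlock(64, 32)     # second decoder network 508

def ccam_modulate(f_depth, f_seg):
    # Simplified stand-in for the CCAM network 522: a cross-feature score map 516
    # derived from both decoded feature maps modulates each of them.
    score = torch.sigmoid(f_depth * f_seg)
    return f_depth + score * f_seg, f_seg + score * f_depth

image = torch.randn(1, 3, 128, 416)              # input image 502
encoded = encoder(image)                         # encoded feature map 510
f_depth = depth_decoder(encoded)                 # first decoded feature map 512
f_seg = seg_decoder(encoded)                     # second decoded feature map 514
f_depth, f_seg = ccam_modulate(f_depth, f_seg)   # modified maps continue toward depth 518 and semantics 520
print(f_depth.shape, f_seg.shape)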
[0050] In some embodiments, each of the first and second decoder networks 506 and 508 includes a plurality of successive stages configured to generate a plurality of intermediate feature maps. The first decoded feature map 512 is one of the plurality of intermediate feature maps generated by the first decoder network 506, and the second decoded feature map 514 is one of the plurality of intermediate feature maps generated by the second decoder network 508.
[0051] In some embodiments, the first and second decoded feature maps 512 and 514 are combined using a CCAM network 522 to generate the CCAM cross-feature score map 516. Further, in some embodiments, the CCAM network 522 includes convolutional, global average pooling, and fully connected layers, and is configured to compute spatial and cross-channel attention. For example, the CCAM network 522 includes a first spatial attention network 702, a second spatial attention network 706, and a channel attention network 716 (Figure 7A).
[0052] In some embodiments, the input image 502 includes one of a plurality of training images. The encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using the plurality of training images in an end-to-end manner. Additionally, in some embodiments, the encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using a comprehensive loss, and the comprehensive loss is a combination of a depth loss 524, a semantics loss 526, and an orthogonal loss 528. The orthogonal loss 528 is determined based on parameters of the encoder network 504, first decoder network 506, and second decoder network 508. More details on the orthogonal loss 528 are explained below with reference to Figure 9. [0053] In some embodiments, for each of the plurality of training images, the depth loss 524 is determined based on the depth feature 518 and a respective pose 530 of an electronic device that captures the training images. The respective pose 530 optionally includes a position and an orientation of the electronic device when the respective training image is captured. Further, in some embodiments, a pose network 534 is applied to determine the respective pose 530 from the input image 502 or a sequence of image frames 532 including the input image 502. In some embodiments, each training image (e.g., the input image 502) is provided with a ground truth semantic label 536, and the encoder network 504 and second decoder network 508 are trained in a supervised manner for semantic segmentation. Specifically, during training, weights of the encoder network 504 and second decoder network 508 are adjusted to control the semantic loss 526 determined based on the semantic feature 520 and ground truth semantic label 536.
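A one-line sketch of how the comprehensive loss could be assembled from the three terms above; the weighting coefficients are hypothetical hyperparameters, not values disclosed in the embodiments:

def comprehensive_loss(depth_loss, semantic_loss, orthogonal_loss,
                       w_depth=1.0, w_sem=1.0, w_ortho=0.1):
    # Combination of the depth loss 524, semantic loss 526, and orthogonal loss 528.
    return w_depth * depth_loss + w_sem * semantic_loss + w_ortho * orthogonal_loss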
[0054] The machine learning model 240 is applied to determine how exactly intermediate features (e.g., decoded feature maps 512 and 514) are associated with different tasks (depth estimation and semantic segmentation) and interact with each other. The same encoder network 504 is applied with two separate decoder networks 506 and 508 associated with two different tasks, and the CCAM cross-feature score map 516 is shared by decoder layers of the two separate decoder networks 506 and 508. By these means, a hard parameter and a soft parameter are applied to facilitate both flexibility and inter-feature learnability. The CCAM network 522 enforces dual attention on intermediate depth and segmentation features over both spatial and channel dimensions to emphasize inter-channel interaction between the two different tasks of depth estimation and semantic segmentation. This enables estimation of a degree-of-affinity between inter-task channel features as an intermediary score in an end-to-end framework, which is fully differentiable. The CCAM network 522 linearly weighs a contribution of features from each task before sharing, and thus encourages a more informed and reliable feature transfer between two tasks. Specifically, the CCAM network 522 estimates cross-channel affinity scores between task feature maps 512 and 514, and this enables better inter-task feature transfer, resulting in a mutual performance gain.
[0055] Figure 6 is a block diagram of an example encoder-decoder network (e.g., a U-net) 600 applied to process an input image 502 based on a CCAM cross-feature score map 516, in accordance with some embodiments. An electronic device employs the encoder-decoder network 600 to generate an output feature 604 (e.g., a depth feature 518, a semantic feature 520). In the U-net, the input image is processed successively by a set of downsampling stages (i.e., encoding stages) 606 to extract a series of encoded feature maps, as well as to reduce spatial resolutions of these feature maps successively. An encoded feature map 510 outputted by the encoding stages 606 is then processed by a bottleneck network 608 followed by a set of upscaling stages (i.e., decoding stages) 610. The series of decoding stages 610 includes the same number of stages as the series of encoding stages 606. In some embodiments, in each decoding stage 610, an input feature map 616 is upscaled and concatenated with a pooled feature map (i.e., a skip connection) 614 of the same resolution from the encoding stage 606 to effectively preserve the details in the input image 502.
[0056] In an example, the encoder-decoder network 600 has four encoding stages 606A-606D and four decoding stages 610A-610D. The bottleneck network 608 is coupled between the encoding stages 606 and decoding stages 610. In some embodiments, the input image 502 is successively processed by the series of encoding stages 606A-606D, the bottleneck network 608, and the series of decoding stages 610A-610D to generate the output feature 604. In some embodiments, the input image 502 is divided into a plurality of image tiles 602, and each of the plurality of image tiles 602 is processed using the encoder-decoder network 600. After all of the image tiles 602 in the input image 502 are successively processed using the encoder-decoder network 600, output features 604 corresponding to the plurality of image tiles 602 are collected and combined with one another to reconstruct a comprehensive output feature 604 corresponding to the input image 502.
[0057] The series of encoding stages 606 includes an ordered sequence of encoding stages 606, e.g., stages 606A, 606B, 606C, and 606D, and has an encoding scale factor. Each encoding stage 606 generates an encoded feature map 612 having a feature resolution and a number of encoding channels. Among the encoding stages 606A-606D, the feature resolution is scaled down and the number of encoding channels is scaled up according to the encoding scale factor. In an example, the encoding scale factor is 2. A first encoded feature map 612A of a first encoding stage 606A has a first feature resolution (e.g., H×W) related to the image resolution and a first number of (e.g., NCH) encoding channels, and a second encoded feature map 612B of a second encoding stage 606B has a second feature resolution (e.g., ½H×½W) and a second number of (e.g., 2NCH) encoding channels. A third encoded feature map 612C of a third encoding stage 606C has a third feature resolution (e.g., ¼H×¼W) and a third number of (e.g., 4NCH) encoding channels, and a fourth encoded feature map 612D of a fourth encoding stage 606D has a fourth feature resolution (e.g., ⅛H×⅛W) and a fourth number of (e.g., 8NCH) encoding channels.
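The per-stage resolutions and channel counts follow directly from the encoding scale factor; a short illustrative computation (H = W = 256 and NCH = 64 are assumptions chosen only for the example):

H, W, NCH = 256, 256, 64   # illustrative input resolution and base channel count
scale = 2                  # encoding scale factor
for stage in range(4):     # encoding stages 606A-606D
    h, w = H // scale ** stage, W // scale ** stage
    channels = NCH * scale ** stage
    print(f"stage 606{'ABCD'[stage]}: {h}x{w}, {channels} channels")
# Prints 256x256/64, 128x128/128, 64x64/256, 32x32/512.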
[0058] For each encoding stage 606, the encoded feature map 612 is processed and provided as an input to a next encoding stage 606, except that the encoded feature map 612 of a last encoding stage 606 (e.g., stage 606D in Figure 6) is processed and provided as an input to the bottleneck network 608. Additionally, for each encoding stage 606, the encoded feature map 612 is converted to generate a pooled feature map 614, e.g., using a max pooling layer. The pooled feature map 614 is temporarily stored in memory and extracted for further processing by a corresponding decoding stage 610. Stated another way, the pooled feature maps 614A-614D are stored in the memory as skip connections that skip part of the encoder-decoder network 600.
[0059] The bottleneck network 608 is coupled to the last stage of the encoding stages 606 (e.g., stage 606D in Figure 6), and continues to process the total number of encoding channels of the encoded feature map 612D of the last encoding stage 606D and generate an intermediate feature map 616A (i.e., a first input feature map 616A to be used by a first decoding stage 610A). In an example, the bottleneck network 608 includes a first set of 3×3 CNN and Rectified Linear Unit (ReLU), a second set of 3×3 CNN and ReLU, a global pooling network, a bilinear upsampling network, and a set of 1×1 CNN and ReLU. The encoded feature map 612D of the last encoding stage 606D is normalized (e.g., using a pooling layer), and fed to the first set of 3×3 CNN and ReLU of the bottleneck network 608. A bottleneck feature map 616A is outputted by the set of 1×1 CNN and ReLU of the bottleneck network 608 and provided to the decoding stages 610.
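A hedged PyTorch rendering of the example bottleneck network 608; the exact wiring between the pooling branch and the convolutions is an assumption, since the embodiment only lists the constituent sub-blocks:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # first 3x3 CNN + ReLU
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # second 3x3 CNN + ReLU
        self.conv3 = nn.Conv2d(channels, channels, 1)             # 1x1 CNN + ReLU

    def forward(self, x):
        y = F.relu(self.conv1(x))
        y = F.relu(self.conv2(y))
        g = F.adaptive_avg_pool2d(y, 1)                           # global pooling network
        g = F.interpolate(g, size=y.shape[-2:], mode="bilinear",  # bilinear upsampling network
                          align_corners=False)
        return F.relu(self.conv3(y + g))                          # 1x1 CNN + ReLU -> feature map 616A

x = torch.randn(1, 512, 32, 32)    # encoded feature map 612D of the last encoding stage 606D
print(Bottleneck(512)(x).shape)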
[0060] The series of decoding stages 610 includes an ordered sequence of decoding stages 610, e.g., stages 610A, 610B, 610C, and 610D, and has a decoding upsampling factor. Each decoding stage 610 generates a decoded feature map 618 having a feature resolution and a number of decoding channels. Among the decoding stages 610A-610D, the feature resolution is scaled up and the number of decoding channels is scaled down according to the decoding upsampling factor. In an example, the decoding upsampling factor is 2. A first decoded feature map 618A of a first decoding stage 610A has a first feature resolution (e.g., ⅛H’×⅛W’) and a first number of (e.g., 8NCH’) decoding channels, and a second decoded feature map 618B of a second decoding stage 610B has a second feature resolution (e.g., ¼H’×¼W’) and a second number of (e.g., 4NCH’) decoding channels. A third decoded feature map 618C of a third decoding stage 610C has a third feature resolution (e.g., ½H’×½W’) and a third number of (e.g., 2NCH’) decoding channels, and a fourth decoded feature map 618D of a fourth decoding stage 610D has a fourth feature resolution (e.g., H’×W’) related to a resolution of the output feature 604 and a fourth number of (e.g., NCH’) decoding channels.
[0061] For each decoding stage 610, the decoded feature map 618 is processed and provided as an input feature map 616 to a next decoding stage 610, except that the decoded feature map 618 of a last decoding stage 610 (e.g., stage 610D in Figure 6) is processed to generate the output feature 604. For example, the decoded feature map 618D of the last decoding stage 610D is processed by a 1×1 CNN 622 to generate the output feature 604. Additionally, each respective decoding stage 610 combines the pooled feature map 614 with an input feature map 616 of the respective decoding stage 610 using a set of neural networks 624. Each respective decoding stage 610 and the corresponding encoding stage 606 are symmetric with respect to the bottleneck network 608, i.e., separated from the bottleneck network 608 by the same number of decoding or encoding stages 610 or 606.
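One decoding stage 610 can be sketched as follows (upscale the input feature map 616, concatenate the skip connection 614, and convolve); the bilinear upsampling and the single convolution are assumptions standing in for the set of neural networks 624:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodingStage(nn.Module):
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in + c_skip, c_out, 3, padding=1)

    def forward(self, x, skip):
        # Upscale the input feature map 616 by the decoding upsampling factor of 2,
        # concatenate the pooled feature map 614 (skip connection), and convolve.
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)
        return F.relu(self.conv(x))

stage = DecodingStage(c_in=512, c_skip=256, c_out=256)
out = stage(torch.randn(1, 512, 32, 32), torch.randn(1, 256, 64, 64))
print(out.shape)   # torch.Size([1, 256, 64, 64])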
[0062] One of the series of decoding stages 610 is coupled to a CCAM 522. The CCAM 522 receives from the one of the series of decoding stages 610, and modifies, one of the decoded feature maps 618A, 618B, and 618C. The modified one of the decoded feature maps 618A, 618B, and 618C is provided as an input feature map 616 to a next decoding stage 610 that immediately follows the one of the series of decoding stages 610. For example, the CCAM 522 is coupled to the second decoding stage 610B. The CCAM 522 receives from the second decoding stage 610B, and modifies, the decoded feature map 618B, and the modified feature map 618B is used as an input feature map 616C provided to the third decoding stage 610C. In some embodiments, the CCAM 522 is coupled to the first decoding stage 610A, which leads a plurality of successive decoding stages 610B-610D, and modifies the decoded feature map 618A provided by the first decoding stage 610A. In some embodiments, the CCAM 522 is coupled to the third decoding stage 610C, and modifies the decoded feature map 618C provided by the third decoding stage 610C. In some embodiments, the CCAM 522 is coupled to two or more of the series of decoding stages 610, and modifies each of two or more of the decoded feature maps 618 outputted by the two or more of the series of decoding stages 610.
[0063] In some embodiments, the encoder-decoder network 600 has a second series of decoding stages 630. The series of encoding stages 606 is coupled to another series of decoding stages 630 in addition to the series of decoding stages 610. The decoding stages 610 and 630 are configured to output the depth feature 518 and semantic feature 520, respectively. Further, in some embodiments, the CCAM 522 modifies one of the decoded feature maps 618A, 618B, and 618C based on a corresponding decoded feature map 626 provided by the series of decoding stages 630. For example, the CCAM 522 modifies the decoded feature map 618B received from the second decoding stage 610B based on a decoded feature map received from a second decoding stage of the series of decoding stages 630, and the modified decoded feature map 618B is used as an input feature map 616C provided to the third decoding stage 610C. More specifically, in some situations, the CCAM 522 combines the decoded feature map 618B received from the second decoding stage 610B and the decoded feature map received from the second decoding stage of the series of decoding stages 630 to generate a CCAM cross-feature score map 516 (Figure 5). The decoded feature map 618B is modified based on the CCAM cross-feature score map 516 before being applied as the input feature map 616C to the third decoding stage 610C.
[0064] Referring to Figures 5 and 6, in some embodiments, the first decoder network 506 has a first number of successive decoding stages 610, and the second decoder network 508 has a second number of successive decoding stages 630. The second number is equal to the first number. Each of the first number of successive decoding stages 610 of the first decoder network 506 corresponds to a respective second decoding stage 630 of the second decoder network 508 and is configured to generate a respective first intermediate feature map 618 having the same resolution as a respective second intermediate feature map generated by the respective second decoding stage 630 of the second decoder network 508. The first decoded feature map 512 corresponds to one of the respective first intermediate feature maps 618 generated by the first decoder network 506, and the second decoded feature map 514 corresponds to one of the respective decoded feature maps 626 generated by the second decoder network 508. The first decoded feature map 512 matches, and has the same resolution as, the second decoded feature map 514.
[0065] In some embodiments, the encoder network 504 and the first decoder network 506 form a first U-Net, and each encoding stage 606 of the encoder network 504 is configured to provide a first skip connection 614 to a respective decoding stage 610 of the first decoder network 506. The encoder network 504 and the second decoder network 508 form a second U-Net, and each encoding stage 606 of the encoder network 504 is configured to provide a second skip connection 614 to a respective decoding stage of the second decoder network 508. For each encoding stage 606, the first skip connection (e.g., 614A, 614B, 614C, and 614D) is the same as the second skip connection.
[0066] Figures 7A and 7B are block diagrams of two portions 522A and 522B of a cross-channel attention module (CCAM) 522 for determining cross-task affinity of two distinct decoded feature maps 512 and 514 generated from an input image 502, in accordance with some embodiments. A first spatial attention network 702 is applied on a first decoded feature map 512 (e.g., associated with a depth feature 518) to generate a first self-attended feature map 704. A second spatial attention network 706 is applied on a second decoded feature map 514 (e.g., associated with a semantic feature 520) to generate a second self-attended feature map 708. The first self-attended feature map 704 is transposed to generate a transposed first self-attended feature map 710. The transposed first self-attended feature map 710 and the second self-attended feature map 708 are combined by a product operation 712 to generate a cross-task feature map 714. A channel attention network 716 is applied to process the cross-task feature map 714 and generate a cross-task affinity matrix 718. Specifically, in some embodiments, the cross-task affinity matrix 718 includes a plurality of affinity scores 720, and each affinity score 720 indicates an affinity level of a respective channel of the first decoded feature map 512 with respect to a respective channel of the second decoded feature map 514.
[0067] Referring to Figure 7B, the first decoded feature map 512, second decoded feature map 514, and cross-task affinity matrix 718 are combined to generate the CCAM cross-feature score map 516 and modify the first and second decoded feature maps 512 and 514. In some embodiments, the first decoded feature map 512 is combined with the cross-task affinity matrix 718 to generate a first CCAM cross-feature score map 516. The second decoded feature map 514 is combined with the first CCAM cross-feature score map 516 to modify the second decoded feature map 514 and generate a modified second decoded feature map 514’. Alternatively, in some embodiments, the second decoded feature map 514 is combined with a transposed cross-task affinity matrix 718’ to generate a second CCAM cross-feature score map 516’. The first decoded feature map 512 is combined with the second CCAM cross-feature score map 516’ to modify the first decoded feature map 512 and generate a modified first decoded feature map 512’. As such, the modified first and second decoded feature maps 512’ and 514’ are provided to their corresponding first and second decoder networks 506 and 508, and continue to be processed by respective decoding stage(s) 610 therein.
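Putting the two portions of Figures 7A and 7B together, a hedged PyTorch sketch of the CCAM 522 is given below; the single-convolution spatial attention, the linear channel attention, and the direction of the final re-weighting are simplifying assumptions for illustration, not the recited implementation:

import torch
import torch.nn as nn

class CCAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial_a = nn.Conv2d(channels, channels, 3, padding=1)   # spatial attention network 702
        self.spatial_b = nn.Conv2d(channels, channels, 3, padding=1)   # spatial attention network 706
        self.channel_attn = nn.Sequential(                             # channel attention network 716
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, a_f, b_f):      # a_f: decoded map 512 (depth), b_f: decoded map 514 (semantics)
        n, c, h, w = a_f.shape
        a_sf = torch.relu(self.spatial_a(a_f))     # self-attended feature map 704
        b_sf = torch.relu(self.spatial_b(b_f))     # self-attended feature map 708
        rows = []
        for i in range(c):                         # affinity of channel i of a_sf vs. every channel of b_sf
            cm_i = a_sf[:, i:i + 1] * b_sf         # cross-task relation, N x C x H x W
            rows.append(self.channel_attn(cm_i))   # affinity vector a_i, N x C
        ct = torch.stack(rows, dim=1)              # cross-task affinity matrix 718, N x C x C
        # Linearly weighted mutual feature sharing (application of the score map 516).
        a_out = a_f + torch.einsum("nij,njhw->nihw", ct, b_f)
        b_out = b_f + torch.einsum("nij,nihw->njhw", ct, a_f)
        return a_out, b_out, ct

ccam = CCAM(channels=8)
a_mod, b_mod, affinity = ccam(torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16))
print(a_mod.shape, b_mod.shape, affinity.shape)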
[0068] Cross-task feature transfer is divided into three sub-categories: (i) sharing initial layers to facilitate learning common features for complementary tasks, (ii) using adversarial networks to learn common feature representation, and (iii) learning different but related feature representations. In some embodiments, a multimodal distillation block (e.g., a CCAM network 522) is applied to share cross-task features through message passing and simulate a gating mechanism as shown in equations (1) and (2), by leveraging spatial attention maps of each individual feature of all tasks. This helps decide what features of a given task are shared with other tasks. In some embodiments, a total number of T tasks are trained, and F_i^k denotes the i-th feature of the k-th task before message passing and F_i^{o,k} after message passing. The message transfer is defined as follows:
F_i^{o,k} = F_i^k + Σ_{l≠k} G_i^k ⊙ (W_{l,k} ⊛ F_i^l),    (1)
where ⊙ means element-wise product, ⊛ represents convolution operation, W_{l,k} represents the convolution block, and G_i^k denotes the gating matrix for the i-th feature of the k-th task:
G_i^k = σ(W_{g,k} ⊛ F_i^k),    (2)
where W_{g,k} is a convolution block and σ denotes the sigmoid operator. According to equation (1), it only shares cross-task features naively across the channel dimension. Suppose we are training simultaneously for two tasks, namely F^k and F^l. Equation (1) indirectly implies that the i-th channel-feature of F^k is only important to the i-th channel-feature of F^l, which is not necessarily true in all scenarios. We overcome this major limitation by designing a module that calculates an affinity vector a_i, which gives an estimate about how the i-th channel of task F^k is related to any j-th channel of task F^l. Referring to Figures 7A and 7B, in some embodiments, the entire process of building scores of inter-task channels is subdivided into four sub-blocks, and processes two tasks, Task A and Task B. Intermediate encoded output features 512 and 514 are extracted from respective decoder modules represented by A_F and B_F respectively, where A_F, B_F ∈ R^{N×C×H×W}.
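For reference, equations (1) and (2) can be exercised for a single pair of features (the two-task case) with a few lines of PyTorch; the 3x3 convolutions standing in for the blocks W_{l,k} and W_{g,k} are illustrative assumptions:

import torch
import torch.nn as nn

c = 16
w_lk = nn.Conv2d(c, c, 3, padding=1)   # convolution block W_{l,k} in equation (1)
w_gk = nn.Conv2d(c, c, 3, padding=1)   # convolution block W_{g,k} in equation (2)

f_k = torch.randn(1, c, 32, 32)        # i-th feature of task k before message passing
f_l = torch.randn(1, c, 32, 32)        # i-th feature of the other task l

g_k = torch.sigmoid(w_gk(f_k))         # gating matrix G_i^k, equation (2)
f_ok = f_k + g_k * w_lk(f_l)           # message passing F_i^{o,k}, equation (1)
print(f_ok.shape)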
[0069] In some embodiments, in a first CCAM Sub-block, the intermediate features 512 and 514 (i.e., A_F and B_F) are passed through a spatial attention network 702 (e.g., including a sequence of convolutional blocks) to compute self-attended feature maps 704 and 708 (i.e., A_SF and B_SF) as follows:
A_SF = SpatialAttn_A(A_F),    (3)
B_SF = SpatialAttn_B(B_F),    (4)
Refined features are obtained from both tasks before estimating their cross-correlation. The output of this layer preserves the spatial resolution of the input features and gives output features represented by A_SF and B_SF, respectively.
[0070] In some embodiments, in a second CCAM Sub-block, a cross-task relation matrix CM (i.e., the cross-task feature map 714) is determined for each channel i of A_SF, where CM_i ∈ R^{N×C×H×W}. The resultant matrix CM_i is passed to a channel attention network 716, which estimates the affinity vector a_i between the i-th channel of A_SF and all the channels of B_SF as follows:
a_i = F_CA(CM_i),    (5)
where, in some embodiments, F_CA denotes a combination of a global average pooling layer followed by fully connected layers, with a sigmoid layer at the end, which serves as the channel attention network 716. This operation is repeated for all the channels of A_SF to get the corresponding affinity vector.
[0071] In some embodiments, in a third CCAM Sub-block, affinity scores are accumulated for all channels of A_SF to achieve a cross-task affinity matrix 718 as follows:
CT = a_0 ⊕ a_1 ⊕ … ⊕ a_{C−1},  ∀ i, j ∈ [0, C),    (6)
where ⊕ denotes concatenation across a row dimension.
[0072] In some embodiments, in a fourth CCAM Sub-block, the cross-task affinity matrix 718 serves as a score accumulator, which helps get linearly weighted features A_F’ and B_F’ (i.e., modified first and second decoded feature maps 512’ and 514’ in Figure 7B).
[0073] Figure 8 is a block diagram of an example CCAM network 522 for generating a cross-task affinity matrix 718 associated with an input image 502, in accordance with some embodiments. The CCAM network 522 is divided into three blocks 802, 804, and 806 to serve distinct purposes. A first block 802 includes a spatial attention network 702 (e.g., two successive two-dimensional convolutional layers). A second block 804 is configured to estimate cross-feature correlation and follow it up with a channel attention network 716 to provide a plurality of feature-channel affinity scores 812. A third block 806 applies a simple channel-wise accumulation of the affinity scores 812 to generate the cross-task affinity matrix 718. Additional operations are implemented to determine the cross-task affinity matrix 718, depth feature 518, and semantic feature 520 as follows:
# Estimate cross-channel affinity:
Y_seg_att = SpatialAttn(X_seg)
Y_depth_att = SpatialAttn(X_depth)
Y_depth_att_T = Transpose(Y_depth_att)
i ← 0
while i < C do    # C represents the number of channels
    a_i = ChannelAffinity(Y_seg_att * Y_depth_att_T)
    CT = CT + a_i    # concatenate across a dimension
end while
# Mutual feature sharing between tasks:
i ← 0
while i < C do
    X_seg = X_seg + X_depth_i * CT_i    # along row dimension
end while
j ← 0
while j < C do
    X_depth = X_depth + X_seg_j * CT_j    # along column dimension
end while
[0074] Figure 9 is a diagram of an example orthogonal loss 528 generated by an encoder-decoder network (e.g., in Figure 5) applied to process an input image 502 based on a CCAM cross-feature score map 516, in accordance with some embodiments. The orthogonal loss 528 is applied to improve accuracy levels of a depth feature 518 and a semantic feature 520. An average inter-channel correlation is estimated for decoder layers of decoding stages of the first and second decoder networks 506 and 508 associated with depth estimation and semantic segmentation as follows:
Require: layer    # feature layer
B, C, H, W = layer.dim
c ← 0
while c < C do
    norm = 0.0
    i ← 0
    while i < C do
        if i = c then
            continue
        else
            corr = layer[:, i, :] * layer[:, c, :]^T
            norm += l1_norm(corr)
        end if
    end while
    norm = norm / C    # average norm per layer
end while
In some embodiments, the inter-channel correlation is reduced in each of the depth estimation and semantic segmentation tasks after orthogonal regularization is applied.
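A vectorized PyTorch sketch of the average inter-channel correlation measurement outlined above; the entry-wise (L1-style) norm and the per-batch averaging are assumptions where the listing leaves the exact normalization open:

import torch

def avg_inter_channel_correlation(layer):
    # layer: decoder feature of shape B x C x H x W.
    b, c, h, w = layer.shape
    feats = layer.reshape(b, c, h * w)
    corr = torch.bmm(feats, feats.transpose(1, 2))             # B x C x C channel correlations
    diag = torch.diag_embed(torch.diagonal(corr, dim1=1, dim2=2))
    corr = corr - diag                                         # skip the i == c self-correlation
    return (corr.abs().sum(dim=(1, 2)) / c).mean()             # average norm per layer

print(avg_inter_channel_correlation(torch.randn(2, 8, 16, 16)))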
[0075] An orthogonality constraint on a model’s parameters has been applied to tasks such as image classification, image retrieval, and 3D classification. Enforcing orthogonality has also helped improve a model’s convergence and training stability, and promotes learning independent parameters. In a multi-task setup (e.g., involving depth estimation and semantic segmentation), feature independence within a given task is important. An effect of applying a variation of the orthogonal scheme is investigated on different submodules. A loss function of the model (e.g., a comprehensive loss) includes an orthogonal loss 528 and is represented as follows:
L_I = L_SSL + L_D,    (7)
L_F = L_I + Σ_W ||W^T W − I||,    (8)
where W, ||·||, I, L_F, L_I, L_D, and L_SSL represent the weights (for each layer), the spectral norm, the identity matrix, the final model loss, the initial loss, the self-supervised depth loss 524, and the semi-supervised semantic loss 526, respectively. It is noted that enforcing orthogonality (e.g., corresponding to the Σ_W ||W^T W − I|| term in equation (8)) particularly on the parameters of the shared encoder 504 (Figure 5), depth decoder 506, and segmentation decoder 508 has a positive impact on the model’s performance. In some embodiments, the average inter-channel correlation is determined for all decoder layers for both of the depth estimation and semantic segmentation tasks, with and without orthogonality regularization. Independent features within the semantics and depth modules would make feature transfer between the tasks more effective.
[0076] Additionally, in some embodiments, data augmentation mechanisms are applied for both semantic segmentation and depth estimation, thereby dealing with data sparsity for multi-task learning. This approach enhances not only diversity and class balance for semantic segmentation, but also region discrimination for depth estimation. Orthogonal regularization is applied to determine depth and semantics with diminishing weighting to facilitate feature generalization and independence. More details on data augmentations are discussed below with reference to at least Figures 10, 11, and 13.
[0077] Figure 10 is a flow diagram of an example data augmentation process 1000 for augmenting training data 238 (e.g., a first image 1002) for a machine learning model for semantic segmentation, in accordance with some embodiments. The data augmentation process 1000 is implemented by a data augmentation module 227 of an electronic device. The electronic device obtains the first image 1002 captured by a camera 260 and applies an encoder-decoder network 600 (e.g., including encoder 504 and decoders 506 and 508) to extract depth information 518 (e.g., a depth map) and semantic information 520 (e.g., a semantic segmentation mask) of the first image 1002. An object 1006 and an associated ROI 1008 are identified in the first image 1002 based on the semantic information 520. The ROI 1008 includes the object 1006. The ROI 1008 in the first image 1002 is adjusted to generate an adjusted ROI 1010. In some embodiments, the semantic information 520 includes a first semantic mask 1005 (also called a foreground mask M) having an array of semantic elements. A first subset of semantic elements of the first semantic mask 1005 has a first value (e.g., “1”) corresponding to each pixel of the object 1006 in the first image 1002, and a second subset of remaining semantic elements of the first semantic mask 1005 has a second value (e.g., “0”) corresponding to each background pixel distinct from any object in the first image 1002. As the ROI 1008 is adjusted (e.g., scaled), the object 1006 is adjusted with the ROI 1008, e.g., in the first semantic mask 1005. The first image 1002 and the adjusted ROI 1010 are combined to generate a second image 1012 in which the adjusted ROI 1010 is applied to replace the ROI 1008. The electronic device applies the second image 1012 to train the encoder-decoder network 600. In some embodiments, the semantic information 520 is updated to reflect the adjusted ROI 1010 and generate the updated semantic information 1014, which corresponds to the second image 1012.
[0078] In some embodiments, the object 1006 includes a moveable object (e.g., a pedestrian, a parked or moving vehicle). In some embodiments, the ROI 1008 includes the object 1006 and a portion of a background of the first image 1002 immediately adjacent to the object 1006. In an example, the ROI 1008 is defined by a bounding box that closely encloses the object 1006, and an edge of the ROI 1008 matches a contour of the object 1006 in the first image 1002. Each edge of the bounding box of the ROI 1008 overlaps with at least one pixel of the contour of the object 1006. In another example, each edge of the bounding box of the ROI 1008 is separated from a closest pixel of the object 1006 by at least a predefined number of pixels (e.g., 10 pixels).
[0079] In some embodiments, referring to Figure 5, the encoder-decoder network includes an encoder network 504 and two parallel decoder networks 506 and 508. The two parallel decoder networks 506 and 508 are coupled to the encoder network 504, and configured for generating the depth information 518 and the semantic information 520 of the first image 1002, respectively. In some embodiments, prior to applying the encoder-decoder network, the encoder-decoder network is trained using a set of supervised training data including a set of training images and associated ground truth depth and semantic information. The encoder-decoder network is further trained using the second image 1012. Further, in some embodiments, training is implemented in a server 102.
[0080] Specifically, in some embodiments, referring to Figure 5, the encoder-decoder network is applied to extract the depth information 518 and semantic information 520 of the first image 1002 by applying an encoder network 504 to process the first image 1002 and generate an encoded feature map 510. A first decoder network 506 is applied to generate a depth feature 518 based on the encoded feature map 510, and a second decoder network 508 to generate a semantic feature 520 based on the encoded feature map 510. The first decoder network 506 and second decoder network 508 are coupled to each other via a CCAM network 522. Further, in some embodiments, the electronic device obtains a first decoded feature map 512 and a second decoded feature map 514 from the first and second decoder networks 506 and 508, respectively. The CCAM network 522 combines the first and second decoded feature maps 512 and 514 to generate a CCAM cross-feature score map 516. The first decoded feature map 512 in the first decoder network 506 is modified based on the CCAM cross-feature score map 516 to generate the depth feature 518 of the first image 1002. The second decoded feature map 514 in the second decoder network 508 is modified based on the CCAM cross-feature score map 516 to generate the semantic feature 520 of the first image 1002.
[0081] In some embodiments, the first image 1002 is applied to generate the second image 1012 by re-scaling a randomly selected movable object 1006. In an intra data augmentation scheme, no additional image is involved in augmenting the first image 1002. In an example, the scale factor for depth is in a range of [0.5, 1.5], and an affine transform is applied on the first image to generate an output that is further processed with a softmax function. In some embodiments, a foreground mask M is applied to determine the second image 1012, pseudo labels, and softmax function. Specifically, the intra data augmentation scheme is implemented as follows:
Require: I, L, S, D ▷ Image and (predicted) GT/pseudo label, softmax output, and depth
# Intra data augmentation step:
scale ← random(0.5, 1.5)
tx = (1.0 - 1/scale) * ox
ty = (1.0 - 1/scale) * oy
# Affine transform according to scale:
Ia = aff_transform(I, 1/scale, [tx, ty])
La = aff_transform(L, 1/scale, [tx, ty])
Sa = aff_transform(S, 1/scale, [tx, ty])
Da = aff_transform(D, 1/scale, [tx, ty])
Da = scale * Da
# Generate new image using the foreground mask:
[The remaining steps of this listing appear as an equation image in the original filing; they estimate the foreground mask Mm and compose the final image and label as in equations (12) and (13) of paragraph [0086] below.]
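As an illustration of the listing above, the following Python sketch implements a comparable intra (AffineMix-style) augmentation with torchvision affine warps. The helper name affine_mix, the pixel-space handling of the offsets ox and oy, and the depth-comparison rule used to estimate the mixing mask are assumptions for this example rather than the claimed implementation.

import random
import torch
import torchvision.transforms.functional as TF

def affine_mix(image, label, depth, fg_mask, ox=0.5, oy=0.5):
    """Re-project movable objects of one image at a new random depth scale.

    image: (3, H, W) float tensor; label: (1, H, W) integer tensor; depth: (1, H, W) float tensor;
    fg_mask: (1, H, W) binary tensor marking movable-object pixels (from the predicted semantics).
    """
    s = random.uniform(0.5, 1.5)                       # random depth scale factor
    h, w = image.shape[-2:]
    tx = (1.0 - 1.0 / s) * ox * w                      # translational shift, here converted to pixels
    ty = (1.0 - 1.0 / s) * oy * h

    def warp(x, interp):
        # Affine transform with inverse image-domain scaling 1/s and shift (tx, ty).
        return TF.affine(x, angle=0.0, translate=[int(tx), int(ty)], scale=1.0 / s,
                         shear=[0.0], interpolation=interp)

    image_a = warp(image, TF.InterpolationMode.BILINEAR)
    label_a = warp(label.float(), TF.InterpolationMode.NEAREST)
    depth_a = warp(depth, TF.InterpolationMode.BILINEAR) * s     # warped depths are scaled by s
    mask_a = warp(fg_mask.float(), TF.InterpolationMode.NEAREST)

    # Keep warped object pixels that end up in front of the original scene (assumed mask rule).
    m = ((depth_a < depth) & (mask_a > 0)).float()
    mixed_image = m * image_a + (1.0 - m) * image
    mixed_label = (m * label_a + (1.0 - m) * label.float()).long()
    return mixed_image, mixed_label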
[0082] More specifically, in some embodiments, the ROI 1008 is adjusted based on a depth of the object 1006 included therein. The electronic device determines a first depth of the object 1006 in a field of view of the camera 260 based on the depth information 518 and the semantic information 520 of the first image 1002. A scale factor s is determined based on the first depth and a target depth of the object 1006. In an example, the scale factor s is greater than 1. The ROI 1008 is scaled by the scale factor s to generate the adjusted ROI 1010. A target location of the adjusted ROI 1010 on the second image 1012 corresponds to the target depth in the field of view of the camera 260. Further, in some embodiments, based on the first depth and the target depth, the electronic device determines a translational shift (tx, ty) from a first location (ox, oy) of the ROI 1008 in the first image 1002 to the target location of the adjusted ROI 1010 in the second image 1012. The adjusted ROI 1010 is placed on the target location of the adjusted ROI 1010 in the second image 1012 based on the translational shift (tx, ty).
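A short worked example of this computation follows, using the depth-scale convention of equations (9)-(11) below, in which s scales the depth of the object and the ROI is correspondingly rescaled by 1/s in the image. The function name and the numeric values are illustrative assumptions only.

def augmentation_params(first_depth, target_depth, ox, oy):
    # Depth scale factor of equation (9): D' = s * D.
    s = target_depth / first_depth
    roi_scale = 1.0 / s                     # image-domain (ROI) rescaling
    tx = (1.0 - 1.0 / s) * ox               # translational shift, equations (10) and (11)
    ty = (1.0 - 1.0 / s) * oy
    return roi_scale, tx, ty

# Example: an object estimated at 10 m re-projected to a target depth of 20 m.
roi_scale, tx, ty = augmentation_params(10.0, 20.0, ox=0.4, oy=0.6)
# roi_scale == 0.5 (the ROI shrinks), tx == 0.2, ty == 0.3 (normalized offsets)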
[0083] Additionally, in some embodiments not shown, a depth map is modified based on the target depth of the object 1006 to generate a modified depth map. A semantic segmentation mask 1005 is modified based on the adjusted ROI 1010 to generate a modified semantic mask. The modified depth map and modified semantic mask are associated with the second image 1012 as ground truth.
[0084] In some embodiments, large-scale video data is leveraged using a semi-supervised multi-task learning training paradigm where semantic segmentation follows a semi-supervised setting and depth estimation follows a self-supervised manner. For example, an AffineMix data augmentation strategy is applied to improve semi-supervised semantic training. The AffineMix data augmentation strategy aims to create new labeled images (e.g., a second image 1012) under a varied range of depth scales. Under this scheme, randomly selected movable objects are projected over the same image (e.g., a first image 1002), for a randomly selected depth scale. This unlocks another degree of freedom in the data augmentation scheme, generating images (e.g., the second image 1012) which are not only close to the original data distribution but also more diverse and class balanced. In some embodiments, for depth estimation, a data augmentation scheme called ColorAug is applied to establish a contrast between movable objects and adjacent regions, using intermediate semantic information. In some embodiments, orthogonal regularization (Figure 9) is applied to improve machine learning training efficacy. Orthogonality is applied to specific task modules and helps learn more independent features across the depth and semantics feature spaces. This eventually has a positive impact on both semantics and depth evaluation. [0085] Data augmentation plays a pivotal role in machine learning tasks, as it helps gather varied data samples from a similar distribution. In the spirit of cooperative multitasking, data augmentation is applied to both segmentation and depth estimation tasks using predicted depth and semantics, respectively. In some embodiments, data augmentation for segmentation is applied in a semi-supervised manner. Models leverage consistency training by mixing image masks across two different images to generate a new image and its semantic labels. Further, in some embodiments, to generate a diverse mixed label space while maintaining the integrity of the scene structure, the AffineMix data augmentation strategy mixes labels within the same image 1002 under a varied range of random depth values, thus producing a new set of affine-transformed images (e.g., the second image 1012 in Figure 10).
[0086] Additionally, in some embodiments, masks associated with only movable objects are mixed to counter class imbalance, which is stark in some known datasets. In an example, a mixed image I’ is generated based on an image I and corresponding predicted depth map D by scaling a depth of a selected movable object 1006 by a scale factor s as follows:
D’ = s * D, (9) such that its spatial location in the image is changed in a geometrically realistic way. Changing the depth by a factor of s results in an inverse scaling in the image domain and a translational shift (tx, ty), which is given by:
tx = (1 - 1/s) * ox, (10)
ty = (1 - 1/s) * oy, (11)
where ox and oy are normalized offsets along the x and y directions. Using tx, ty and the inverse scaling 1/s, we can perform an affine transformation on the image and label space to generate Ia and La. The foreground mask M is estimated by comparing the new and old depths and masking it with the region that contains the movable object in Ia; the resulting mask is named Mm. The final image and label are then given by:
I’ = Mm ⊙ Ia + (1 - Mm) ⊙ I, (12)
L’ = Mm ⊙ La + (1 - Mm) ⊙ L. (13) [0087] In data augmentation for depth estimation, factors such as position in the image, texture density, shading, and illumination are some of the pictorial cues about distance in a given image. In some embodiments, a contrast is adjusted between adjacent regions (e.g., the ROI 1008 and corresponding background). In some embodiments, bright and dark regions within an image are adjusted. For example, a brightness level of the ROI 1008 is adjusted. This is associated with an effective data augmentation technique called ColorAug, which uses different appearance-based augmentation on movable and non-movable objects. Particularly, in some embodiments, the movable and non-movable objects are identified based on semantic segmentation (e.g., intermediate semantic labels and semantic map 1015 predicted by an encoder-decoder network 600).
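A possible sketch of such a ColorAug-style augmentation is shown below. The function name color_aug, the jitter ranges, and the use of torchvision photometric operations are illustrative assumptions; the only element taken from the description is that movable and non-movable regions, identified from the predicted semantics, receive different appearance changes.

import random
import torch
import torchvision.transforms.functional as TF

def color_aug(image, movable_mask, jitter=0.3):
    """Apply independent photometric jitter to movable and non-movable regions.

    image: (3, H, W) float tensor in [0, 1]; movable_mask: (1, H, W) binary tensor derived
    from the intermediate semantic prediction (1 for movable-object pixels, 0 elsewhere).
    """
    def jittered(x):
        x = TF.adjust_brightness(x, 1.0 + random.uniform(-jitter, jitter))
        x = TF.adjust_contrast(x, 1.0 + random.uniform(-jitter, jitter))
        x = TF.adjust_saturation(x, 1.0 + random.uniform(-jitter, jitter))
        return x

    fg = jittered(image)                  # appearance change for movable objects
    bg = jittered(image)                  # independent appearance change for the rest of the scene
    m = movable_mask.float()
    return m * fg + (1.0 - m) * bg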
[0088] Figure 11 is a comparison 1100 of original images 1102A, 1104A, and 1106A and augmented images 1102B, 1104B, and 1106B, which are applied to train a machine learning model 240, in accordance with some embodiments. In some embodiments, the intermediate semantics output 520 is purposefully used to create regions of different brightness, contrast, and saturation in movable objects 1114 enclosed in bounding boxes of the regions 1108-1112. An electronic device obtains an original first image 1102A, 1104A, or 1106A captured by a camera 260 and applies an encoder-decoder network 600 (e.g., including encoder 504 and decoders 506 and 508) to extract depth information 518 and semantic information 520 of each first image. One or more objects 1114 and associated regions of interest (ROI) 1108-1112 are identified in each first image 1102A, 1104A, or 1106A based on the semantic information 520. Each ROI in the first image 1102A, 1104A, or 1106A is adjusted to generate an adjusted ROI 1108, 1110, or 1112. Each first image 1102A, 1104A, or 1106A and the corresponding adjusted ROI 1108, 1110, or 1112 are combined to generate a second image 1102B, 1104B, or 1106B in which the adjusted ROI 1108, 1110, or 1112 is applied to replace the original ROI. The electronic device applies the second image 1102B, 1104B, or 1106B to train the encoder-decoder network 600.
[0089] In some embodiments, an image 1102B or 1104B includes more than one ROI 1108 or 1110. Alternatively, in some embodiments, an image 1106B includes only one ROI 1112. In some embodiments, each ROI includes a respective object. Alternatively, in some embodiments, each ROI includes more than one object 1114. In some embodiments, each ROI 1108-1112 includes one type of object (e.g., vehicles only, pedestrians only). In some embodiments not shown, an ROI includes two or more types of objects. In some embodiments, each object 1114 includes a moveable object (e.g., a pedestrian, a parked or moving vehicle). [0090] Specifically, in some embodiments, the semantic information 520 includes a semantic mask 1005 identifying the object 1114. Based on the semantic mask 1005, a set of object pixels corresponding to the object 1114 in the first image 1102A, 1104A, or 1106A is identified. An original ROI is adjusted by adjusting at least one of a contrast level, a brightness level, a saturation level, and a hue level of the set of object pixels to generate a corresponding adjusted ROI 1108, 1110, or 1112. For example, a contrast level of the ROI 1110 is adjusted in the image 1104B. A brightness level of the ROI 1108 (including ROIs 1108A and 1108B) is adjusted in the image 1102B, and a saturation level of the ROI 1112 is adjusted in the image 1106B. The original ROI has a first location in the first image 1102A, 1104A, or 1106A, and the adjusted ROI 1108, 1110, or 1112 has a second location in the second image 1102B, 1104B, or 1106B, respectively. Each first location is consistent with a respective second location.
[0091] Further, in some embodiments, the object includes a first object 1114A corresponding to a first set of object pixels. Based on the semantic mask, the electronic device identifies a second set of object pixels corresponding to a second object 1114B in the first image 1102A, and adjusts at least one of the contrast level, brightness level, saturation level, and hue level of the second set of object pixels to generate a second adjusted ROI 1108B. The first image 1102A, adjusted ROI 1108A, and second adjusted ROI 1108B are combined to generate the second image 1102B. Additionally, in some embodiments, the at least one of the contrast level, brightness level, saturation level, and hue level of the first set of object pixels and the at least one of the contrast level, brightness level, saturation level, and hue level of the second set of object pixels are adjusted jointly (e.g., by the same or related changes) or independently (e.g., by different changes). In an example, a brightness level of the ROI 1108A is adjusted based on a first scale factor, and a brightness level of the ROI 1108B is adjusted based on a second scale factor. The first and second scale factors are optionally identical or different. In another example, a brightness level of the ROI 1108A is adjusted based on a first scale factor, and a contrast level of the ROI 1108B is adjusted based on a second scale factor. The first and second scale factors are optionally identical or different.
[0092] Figure 12 is a flow diagram of an example data processing method 1200, in accordance with some embodiments. For convenience, the method 1200 is described as being implemented by a data processing module 228 of an electronic system 200 (e.g., a server 102, a mobile phone 104C, or a combination thereof). Method 1200 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 12 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1200 may be combined and/or the order of some operations may be changed.
[0093] An electronic device obtains (1202) an input image 502 and applies (1204) an encoder network 504 to process the input image 502 and generate an encoded feature map 510. The electronic device applies a first decoder network 506 to generate (1206) a first decoded feature map 512 based on the encoded feature map 510 and a second decoder network 508 to generate (1208) a second decoded feature map 514 based on the encoded feature map 510. The first and second decoded feature maps 512 and 514 are combined (1210) to generate a CCAM cross-feature score map 516. The first decoded feature map 512 in the first decoder network 506 is modified (1212) based on the CCAM cross-feature score map 516 to generate a depth feature 518 of the input image 502, and the second decoded feature map 514 in the second decoder network 508 is modified (1214) based on the CCAM cross-feature score map 516 to generate a semantic feature 520 of the input image 502.
[0094] In some embodiments, the first decoder network 506 has a first number of successive decoding stages 610 (Figure 6), and the second decoder network 508 has a second number of successive decoding stages 630 (Figure 6). The second number is equal to the first number. Each of the first number of successive decoding stages 610 corresponds to a respective second decoding stage 630 and is configured to generate a respective first intermediate feature map having the same resolution as a respective second intermediate feature map generated by the respective second decoding stage 630. The first decoded feature map 512 matches, and has the same resolution as, the second decoded feature map 514.
[0095] In some embodiments, referring to Figure 6, the first decoder network 506 has a first decoding stage (e.g., stage 610B) immediately followed by a first next stage (e.g., stage 610C). The first decoding stage (e.g., stage 610B) is configured to output the first decoded feature map 512. The first decoded feature map 512 is modified based on the CCAM cross-feature score map 516. The modified first decoded feature map 512 is applied at an input of the first next stage (e.g., stage 610C). [0096] Further, in some embodiments, the encoded feature map 510 is received by the first decoding stage (e.g., stage 610A) of the first decoder network 506, and the first decoding stage (e.g., stage 610A) leads a plurality of successive decoding stages (e.g., stages 610B-610D) of the first decoder network 506. Alternatively, in some embodiments, the encoded feature map 510 is received by one or more alternative decoding stages (e.g., stages 610A-610B). One of the one or more alternative decoding stages (e.g., stage 610A) leads a plurality of successive decoding stages (e.g., stages 610A-610D) of the first decoder network 506. The first decoding stage (e.g., stage 610C) follows the one or more alternative decoding stages (e.g., stages 610A-610B) in the first decoder network 506. The modified first decoded feature map 512 is applied at the input of the first next stage (e.g., stage 610D).
[0097] In some embodiments, the second decoder network 508 has a second decoding stage immediately followed by a second next stage. The second decoding stage is configured to output the second decoded feature map 514. Modifying the second decoded feature map 514 in the second decoder network 508 further includes modifying the second decoded feature map 514 based on the CCAM cross-feature score map 516 and applying the modified second decoded feature map at an input of the second next stage.
[0098] In some embodiments, the encoder network 504 and the first decoder network 506 form a first U-Net. Each encoding stage of the encoder network 504 is configured to provide a first skip connection 614 to a respective decoding stage 610 of the first decoder network 506. The encoder network 504 and the second decoder network 508 form a second U-Net. Each encoding stage of the encoder network 504 is configured to provide a second skip connection to a respective decoding stage 630 of the second decoder network 508. In some embodiments, for each encoding stage, the second skip connection is identical to the first skip connection 614.
[0099] In some embodiments, referring to Figures 7A and 7B, the first and second decoded feature maps 512 and 514 are combined to generate the CCAM cross-feature score map 516 by applying (1216) at least a product operation 712 to combine the first and second decoded feature maps 512 and 514 and generate a cross-task feature map 714, applying (1218) a channel attention network to process the cross-task feature map 714 and generate a cross-task affinity matrix 718, and combining (1220) the first and second decoded feature maps 512 and 514 based on the cross-task affinity matrix 718 to generate the CCAM cross-feature score map 516.
[00100] Further, in some embodiments, the cross-task affinity matrix 718 includes (1222) a plurality of affinity scores. Each affinity score indicates an affinity level of a respective channel of the first decoded feature map 512 with respect to a respective channel of the second decoded feature map 514. In some embodiments, referring to Figure 7A, the electronic device applies (1224) a first spatial attention network 702 on the first decoded feature map 512 to generate a first self-attended feature map 704, and applies (1226) a second spatial attention network 706 on the second decoded feature map 514 to generate a second self-attended feature map 708. The first self-attended feature map 704 is transposed (1228) to generate a transposed first self-attended feature map 710. The transposed first self-attended feature map 710 and the second self-attended feature map 708 are combined (1230) by the product operation 712.
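A sketch of a CCAM block consistent with this sequence of operations is given below. The specific layer choices (1x1-convolution spatial gates, a linear-plus-softmax channel attention, the normalization by H*W, and the final sigmoid fusion) are assumptions made for illustration, since the description specifies only the order of operations.

import torch
import torch.nn as nn

class CCAM(nn.Module):
    """Combine two decoded feature maps into a cross-feature score map via channel attention."""
    def __init__(self, channels):
        super().__init__()
        # Spatial attention networks (702 and 706), here 1x1 convolutions with sigmoid gates.
        self.spatial_depth = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.spatial_seg = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        # Channel attention network producing the cross-task affinity matrix (718).
        self.channel_attn = nn.Sequential(nn.Linear(channels, channels), nn.Softmax(dim=-1))

    def forward(self, f_depth, f_seg):
        b, c, h, w = f_depth.shape
        # Self-attended feature maps (704 and 708), flattened to B x C x HW.
        a_depth = (f_depth * self.spatial_depth(f_depth)).flatten(2)
        a_seg = (f_seg * self.spatial_seg(f_seg)).flatten(2)
        # Transpose the first self-attended map (710) and combine by a product (712).
        a_depth_t = a_depth.transpose(1, 2)                      # B x HW x C
        cross = torch.bmm(a_seg, a_depth_t) / (h * w)            # cross-task feature map (714): B x C x C
        affinity = self.channel_attn(cross)                      # cross-task affinity matrix (718)
        # Combine both decoded feature maps based on the affinity matrix (score map 516).
        fused = torch.bmm(affinity, f_depth.flatten(2)) + torch.bmm(affinity.transpose(1, 2), f_seg.flatten(2))
        return torch.sigmoid(fused.view(b, c, h, w))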
[00101] In some embodiments, the first and second decoded feature maps 512 and 514 are combined using a CCAM network 522 to generate the CCAM cross-feature score map 516, and the input image 502 includes one of a plurality of training images. The encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using the plurality of training images in an end-to-end manner.
[00102] Further, in some embodiments, the encoder network 504, first decoder network 506, second decoder network 508, and CCAM network 522 are trained using a comprehensive loss. Referring to Figure 5, the comprehensive loss is a combination of a depth loss 524, a semantics loss 526, and an orthogonal loss 528. The electronic device determines the orthogonal loss 528 based on parameters of the encoder network 504, first decoder network 506, and second decoder network 508. Additionally, in some embodiments, for each of the plurality of training images, the depth loss 524 is determined based on the depth feature 518 and a pose 530 of an electronic device that captures the training images. [00103] It should be understood that the particular order in which the operations in Figure 12 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to reorder or combine the operations described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 1-11 and 13 are also applicable in an analogous manner to method 1200 described above with respect to Figure 12. For brevity, these details are not repeated here.
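Referring back to the comprehensive loss of paragraph [00102], the following sketch shows one way the three terms could be combined. The soft orthogonality penalty over weight matrices and the loss weight are assumptions for illustration, as the description does not reproduce the exact form of the orthogonal loss 528.

import torch

def orthogonal_loss(networks, weight=1e-4):
    # Soft orthogonality penalty over the convolution/linear weights of the given networks.
    penalty = 0.0
    for net in networks:
        for p in net.parameters():
            if p.dim() < 2:
                continue                              # skip biases and normalization parameters
            w = p.flatten(1)                          # rows correspond to output channels
            gram = w @ w.t()
            eye = torch.eye(gram.size(0), device=gram.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return weight * penalty

def comprehensive_loss(depth_loss, semantics_loss, networks, ortho_weight=1e-4):
    # Combination of the depth loss 524, the semantics loss 526, and the orthogonal loss 528.
    return depth_loss + semantics_loss + orthogonal_loss(networks, ortho_weight)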
[00104] Some embodiments of this application tackle a multi-task learning problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a CCAM, which facilitates effective feature sharing along each channel between the two tasks, leading to mutual performance gain with a negligible increase in trainable parameters. The multi-task learning paradigm focuses on jointly learning two or more tasks, aiming for significant improvement with respect to the model’s generalizability, performance, and training/inference memory footprint. The aforementioned benefits become all the more indispensable in the case of joint training for vision-related dense prediction tasks. Such multi-task learning relies on an inherent symbiotic relation among multiple tasks (e.g., semantic segmentation and depth estimation), where one task benefits from the other task. Parameters are shared among different tasks to overcome a data sparsity problem and enforce task generalization by leveraging task losses to regularize each other.
[00105] Figure 13 is a flow diagram of an example data augmentation method, in accordance with some embodiments. For convenience, the method 1300 is described as being implemented by a data augmentation module 227 of an electronic system 200 (e.g., a server 102, a mobile phone 104C, or a combination thereof). Method 1300 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system 200. Each of the operations shown in Figure 13 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1300 may be combined and/or the order of some operations may be changed.
[00106] In accordance with the data augmentation method 1300, the electronic system 200 obtains (1302) a first image 1002 captured by a camera and applies (1304) an encoder-decoder network 600 to extract depth information 518 and semantic information 520 of the first image. An object 1006 and an associated ROI 1008 are identified (1306) in the first image 1002 based on the semantic information. The ROI includes the object 1006. In some embodiments, the object 1006 includes (1307) a moveable object 1006. The electronic system 200 adjusts (1308) the ROI in the first image 1002 to generate an adjusted ROI 1010 and combines (1310) the first image 1002 and the adjusted ROI 1010 to generate a second image 1012. The second image 1012 is applied (1312) to train the encoder-decoder network. In some embodiments, the ROI is adjusted by determining (1314) a first depth of the object 1006 in a field of view of the camera based on the depth information 518 and the semantic information 520 of the first image 1002, determining (1316) a scale factor s based on the first depth and a target depth of the object 1006, and scaling (1318) the ROI 1008 including the object 1006 by the scale factor s to generate the adjusted ROI 1010. A target location of the adjusted ROI 1010 on the second image 1012 corresponds to the target depth in the field of view of the camera 260.
[00107] Further, in some embodiments, the second image 1012 is generated by determining (1320) a shift (tx, ty) from a first location of the ROI to the target location of the adjusted ROI 1010 based on the first depth and the target depth and placing (1322) the adjusted ROI 1010 on the target location of the adjusted ROI 1010 based on the shift (tx, ty). In some embodiments, the depth information 518 includes a depth map, and the semantic information 520 includes a semantic segmentation mask 1005. Additionally, in some embodiments, the electronic system 200 modifies the depth map based on the target depth of the object 1006 to generate a modified depth map, modifies the semantic segmentation mask 1005 based on the adjusted ROI 1010 to generate a modified semantic mask, and associates the modified depth map and modified semantic mask with the second image 1012 as ground truth.
[00108] In some embodiments, the encoder-decoder network 600 includes an encoder network 504 and two parallel decoder networks 506 and 508, the two parallel decoder networks 506 and 508 coupled to the encoder network 504 and configured for generating the depth information 518 and the semantic information 520 of the first image 1002.
[00109] In some embodiments, prior to applying the encoder-decoder network 600, the encoder-decoder network 600 is trained using a set of supervised training data including a set of training images and associated ground truth depth and semantic information. The encoder-decoder network 600 is further trained using the second image 1012.
[00110] In some embodiments, the ROI 1008 closely encloses the object 1006, and an edge of the ROI matches a contour of the object 1006 in the first image 1002.
[00111] In some embodiments, the semantic information 520 includes a semantic mask identifying the object 1006. Based on the semantic mask, a set of object pixels corresponding to the object 1006 is identified in the first image 1002. At least one of a contrast level, a brightness level, a saturation level, and a hue level of the set of object pixels is modified to generate the adjusted ROI 1010. The ROI 1008 has a first location in the first image 1002 and the adjusted ROI 1010 has a second location in the second image 1012. The first location is consistent with the second location. Further, in some embodiments, referring to Figure 11, the object 1006 includes a first object 1114A corresponding to a first set of object pixels in the first image 1102A. Based on the semantic mask, the electronic system 200 identifies a second set of object pixels corresponding to a second object 1114B in the first image 1102A and adjusts at least one of the contrast level, brightness level, saturation level, and hue level of the second set of object pixels to generate a second adjusted ROI 1108B. The first image 1102A, adjusted ROI 1108A, and second adjusted ROI 1108B are combined to generate the second image 1102B. Additionally, in some embodiments, the at least one of the contrast level, brightness level, saturation level, and hue level of the first set of object pixels and the at least one of the contrast level, brightness level, saturation level, and hue level of the second set of object pixels are adjusted jointly.
[00112] In some embodiments, the encoder-decoder network 600 is applied to extract the depth information 518 and semantic information 520 of the first image 1002 by applying an encoder network 504 to process the first image 1002 and generate an encoded feature map 510, applying a first decoder network 506 to generate a depth feature 518 based on the encoded feature map 510, and applying a second decoder network 508 to generate a semantic feature 520 based on the encoded feature map 510. The first decoder network 506 and second decoder network 508 are coupled to each other via a CCAM network 522.
[00113] Further, in some embodiments, the electronic system 200 obtains a first decoded feature map 512 and a second decoded feature map 514 from the first and second decoder networks 506 and 508, respectively. The CCAM network 522 combines the first and second decoded feature maps 512 and 514 to generate a CCAM cross-feature score map 516. The first decoded feature map 512 in the first decoder network 506 is modified based on the CCAM cross-feature score map 516 to generate the depth feature 518 of the first image 1002. The second decoded feature map 514 in the second decoder network 508 is modified based on the CCAM cross-feature score map 516 to generate the semantic feature 520 of the first image 1002.
[00114] It should be understood that the particular order in which the operations in Figure 13 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to reorder or combine the operations described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 1-12 are also applicable in an analogous manner to method 1300 described above with respect to Figure 13. For brevity, these details are not repeated here.
[00115] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[00116] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[00117] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[00118] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Claims

What is claimed is:
1. An image processing method, comprising: obtaining an input image; applying an encoder network to process the input image and generate an encoded feature map; applying a first decoder network to generate a first decoded feature map based on the encoded feature map; applying a second decoder network to generate a second decoded feature map based on the encoded feature map; combining the first and second decoded feature maps to generate a cross-channel attention modulated (CCAM) cross-feature score map; modifying the first decoded feature map in the first decoder network based on the CCAM cross-feature score map to generate a depth feature of the input image; and modifying the second decoded feature map in the second decoder network based on the CCAM cross-feature score map to generate a semantic feature of the input image.
2. The method of claim 1, wherein: the first decoder network has a first number of successive decoding stages, and the second decoder network has a second number of successive decoding stages, the second number equal to the first number; each of the first number of successive decoding stages corresponds to a respective second decoding stage and is configured to generate a respective first intermediate feature map having the same resolution as a respective second intermediate feature map generated by the respective second decoding stage; and the first decoded feature map matches, and has the same resolution as, the second decoded feature map.
3. The method of claim 1 or 2, wherein: the first decoder network has a first decoding stage immediately followed by a first next stage, the first decoding stage configured to output the first decoded feature map; and modifying the first decoded feature map in the first decoder network further includes modifying the first decoded feature map based on the CCAM cross-feature score map and applying the modified first decoded feature map at an input of the first next stage.
4. The method of claim 3, further comprising: receiving the encoded feature map by the first decoding stage of the first decoder network, the first decoding stage leading a plurality of successive decoding stages of the first decoder network.
5. The method of claim 3, further comprising: receiving the encoded feature map by one or more alternative decoding stages, one of the one or more alternative decoding stages leading a plurality of successive decoding stages of the first decoder network, the first decoding stage following the one or more alternative decoding stages in the first decoder network.
6. The method of any of claims 1-5, wherein: the second decoder network has a second decoding stage immediately followed by a second next stage, the second decoding stage configured to output the second decoded feature map; and modifying the second decoded feature map in the second decoder network further includes modifying the second decoded feature map based on the CCAM cross-feature score map and applying the modified second decoded feature map at an input of the second next stage.
7. The method of any of claims 1-6, wherein: the encoder network and the first decoder network form a first U-Net, each encoding stage of the encoder network configured to provide a first skip connection to a respective decoding stage of the first decoder network; and the encoder network and the second decoder network form a second U-Net, each encoding stage of the encoder network configured to provide a second skip connection to a respective decoding stage of the second decoder network.
8. The method of any of claims 1-7, wherein combining the first and second decoded feature maps to generate the CCAM cross-feature score map further comprises: applying at least a product operation to combine the first and second decoded feature maps and generate a cross-task feature map; applying a channel attention network to process the cross-task feature map and generate a cross-task affinity matrix; and combining the first and second decoded feature maps based on the cross-task affinity matrix to generate the CCAM cross-feature score map.
9. The method of claim 8, wherein the cross-task affinity matrix includes a plurality of affinity scores, each affinity score indicating an affinity level of a respective channel of the first decoded feature map with respect to a respective channel of the second decoded feature map.
10. The method of claim 8 or 9, wherein applying at least the product operation to combine the first and second decoded feature maps and generate the cross-task feature map further comprises: applying a first spatial attention network on the first decoded feature map to generate a first self-attended feature map; applying a second spatial attention network on the second decoded feature map to generate a second self-attended feature map; transposing the first self-attended feature map to generate a transposed first self-attended feature map; and combining the transposed first self-attended feature map and the second self-attended feature map by the product operation.
11. The method of any of claims 1-10, wherein the first and second decoded feature maps are combined using a CCAM network to generate the CCAM cross-feature score map, and the input image includes one of a plurality of training images, the method further comprising: training the encoder network, first decoder network, second decoder network, and CCAM network using the plurality of training images in an end-to-end manner.
12. The method of claim 11, wherein the encoder network, first decoder network, second decoder network, and CCAM network are trained using a comprehensive loss, and the comprehensive loss is a combination of a depth loss, a semantics loss, and an orthogonal loss, the method further comprising: determining the orthogonal loss based on parameters of the encoder network, first decoder network, and second decoder network.
13. The method of claim 12, wherein for each of the plurality of training images, the depth loss is determined based on the depth feature and a pose of an electronic device that captures the training images.
14. An electronic system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-13.
15. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-13.
PCT/US2022/051711 2021-12-03 2022-12-02 Cross-coupled multi-task learning for depth mapping and semantic segmentation WO2023102223A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163285931P 2021-12-03 2021-12-03
US63/285,931 2021-12-03
US202263352940P 2022-06-16 2022-06-16
US63/352,940 2022-06-16

Publications (1)

Publication Number Publication Date
WO2023102223A1 true WO2023102223A1 (en) 2023-06-08

Family

ID=86613066

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2022/051711 WO2023102223A1 (en) 2021-12-03 2022-12-02 Cross-coupled multi-task learning for depth mapping and semantic segmentation
PCT/US2022/051712 WO2023102224A1 (en) 2021-12-03 2022-12-02 Data augmentation for multi-task learning for depth mapping and semantic segmentation

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/US2022/051712 WO2023102224A1 (en) 2021-12-03 2022-12-02 Data augmentation for multi-task learning for depth mapping and semantic segmentation

Country Status (1)

Country Link
WO (2) WO2023102223A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955675B (en) * 2023-09-21 2023-12-12 中国海洋大学 Hash image retrieval method and network based on fine-grained similarity relation contrast learning


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10521902B2 (en) * 2015-10-14 2019-12-31 The Regents Of The University Of California Automated segmentation of organ chambers using deep learning methods from medical imaging
GB2562037A (en) * 2017-04-25 2018-11-07 Nokia Technologies Oy Three-dimensional scene reconstruction
US11893750B2 (en) * 2019-11-15 2024-02-06 Zoox, Inc. Multi-task learning for real-time semantic and/or depth aware instance segmentation and/or three-dimensional object bounding

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160055237A1 (en) * 2014-08-20 2016-02-25 Mitsubishi Electric Research Laboratories, Inc. Method for Semantically Labeling an Image of a Scene using Recursive Context Propagation
US20160259994A1 (en) * 2015-03-04 2016-09-08 Accenture Global Service Limited Digital image processing using convolutional neural networks
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630386A (en) * 2023-06-12 2023-08-22 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN116630386B (en) * 2023-06-12 2024-02-20 新疆生产建设兵团医院 CTA scanning image processing method and system thereof
CN116664845A (en) * 2023-07-28 2023-08-29 山东建筑大学 Intelligent engineering image segmentation method and system based on inter-block contrast attention mechanism
CN116664845B (en) * 2023-07-28 2023-10-13 山东建筑大学 Intelligent engineering image segmentation method and system based on inter-block contrast attention mechanism
CN117152445A (en) * 2023-10-31 2023-12-01 暨南大学 Real-time image semantic segmentation method and system based on multi-connection coding wavelet pooling
CN117152445B (en) * 2023-10-31 2024-01-12 暨南大学 Real-time image semantic segmentation method and system based on multi-connection coding wavelet pooling

Also Published As

Publication number Publication date
WO2023102224A1 (en) 2023-06-08

Similar Documents

Publication Publication Date Title
WO2023102223A1 (en) Cross-coupled multi-task learning for depth mapping and semantic segmentation
Liao et al. DR-GAN: Automatic radial distortion rectification using conditional GAN in real-time
WO2021184026A1 (en) Audio-visual fusion with cross-modal attention for video action recognition
CN111476133B (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN109657538B (en) Scene segmentation method and system based on context information guidance
CN112651423A (en) Intelligent vision system
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
WO2023101679A1 (en) Text-image cross-modal retrieval based on virtual word expansion
WO2021092600A2 (en) Pose-over-parts network for multi-person pose estimation
CN111222515B (en) Image translation method based on context-aware attention
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
WO2023133285A1 (en) Anti-aliasing of object borders with alpha blending of multiple segmented 3d surfaces
CN113269808B (en) Video small target tracking method and device
WO2023277877A1 (en) 3d semantic plane detection and reconstruction
CN116391209A (en) Realistic audio-driven 3D avatar generation
WO2023027712A1 (en) Methods and systems for simultaneously reconstructing pose and parametric 3d human models in mobile devices
WO2023277888A1 (en) Multiple perspective hand tracking
US20240087344A1 (en) Real-time scene text area detection
WO2023172257A1 (en) Photometic stereo for dynamic surface with motion field
WO2024076343A1 (en) Masked bounding-box selection for text rotation prediction
WO2023167682A1 (en) Image processing with encoder-decoder networks having skip connections
WO2023211443A1 (en) Transformer-encoded speech extraction and enhancement
WO2023023162A1 (en) 3d semantic plane detection and reconstruction from multi-view stereo (mvs) images
WO2023063944A1 (en) Two-stage hand gesture recognition
WO2023091131A1 (en) Methods and systems for retrieving images based on semantic plane features

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22902262

Country of ref document: EP

Kind code of ref document: A1